Parallelizing Brent's Method with CUDA

Headline Results

35.31×

Kernel speedup over the single-threaded CPU baseline

8.79×

End-to-end speedup — median of 10 trials on an RTX 3080 at a 2²² batch

Built With

CUDAC++PythonCMakeGPU ComputingNumerical Computing

What It Is

Brent's method is a robust scalar root-finder, but a single root solve is inherently sequential. This project parallelizes it the other way — across problems rather than within one — using batch/ensemble parallelism: each GPU thread runs its own independent Brent's instance. Batch sizes sweep from 2¹⁰ up to 2²²(~4.19M independent problems at the top end), saturating the device with millions of concurrent solves. It is, to my knowledge, the first CUDA implementation of Brent's method.

Correctness & Rigor

Performance only counts if the answers are exact. The CPU and GPU paths produce bit-identical fp64 results, validated to <1e-10 against a Python ground-truth reference across all three languages — Python, C++, and CUDA. The test battery covers 22 hand-crafted cases, 16,384 randomly generated monotonic cubics, and edge cases, exercised on both the CPU and GPU paths.

Benchmarking Methodology

Each batch size runs 3 warmup trials followed by 10 measured trials, reported as the median, with an automatic retry whenever the coefficient of variation exceeds 5% to keep timings stable. Benchmarks were collected on an NVIDIA RTX 3080 (Ampere, SM 8.6) with CUDA Toolkit 13.1. Headline figures: 35.31× at the kernel level and 8.79× end-to-end at the 2²² batch.

Source Code & Report

The complete source is on GitHub (GPL-3.0). The repository also includes the full written report, presentation slides, and raw benchmarks under docs/.