Parallelizing Brent's Method with CUDA
M.S. Graduate Project · GPU-Accelerated Root-Finding
A first CUDA implementation of Brent's root-finding method — batch/ensemble parallelism running one independent solver per GPU thread.
Headline Results
Kernel speedup over the single-threaded CPU baseline
End-to-end speedup — median of 10 trials on an RTX 3080 at a 222 batch
Built With
What It Is
Brent's method is a robust scalar root-finder, but a single root solve is inherently sequential. This project parallelizes it the other way — across problems rather than within one — using batch/ensemble parallelism: each GPU thread runs its own independent Brent's instance. Batch sizes sweep from 210 up to 222(~4.19M independent problems at the top end), saturating the device with millions of concurrent solves. It is, to my knowledge, the first CUDA implementation of Brent's method.
Correctness & Rigor
Performance only counts if the answers are exact. The CPU and GPU paths produce bit-identical fp64 results, validated to <1e-10 against a Python ground-truth reference across all three languages — Python, C++, and CUDA. The test battery covers 22 hand-crafted cases, 16,384 randomly generated monotonic cubics, and edge cases, exercised on both the CPU and GPU paths.
Benchmarking Methodology
Each batch size runs 3 warmup trials followed by 10 measured trials, reported as the median, with an automatic retry whenever the coefficient of variation exceeds 5% to keep timings stable. Benchmarks were collected on an NVIDIA RTX 3080 (Ampere, SM 8.6) with CUDA Toolkit 13.1. Headline figures: 35.31× at the kernel level and 8.79× end-to-end at the 222 batch.
Source Code & Report
The complete source is on GitHub (GPL-3.0). The repository also includes the full written report, presentation slides, and raw benchmarks under docs/.