This project implements and optimizes a 1D convolution CUDA kernel, ultimately achieving the top rank on LeetGPU for the NVIDIA T4 leaderboard.
You can find the blog post detailing the optimizations here.

The kernel computes

$$\mathrm{output}[i] = \sum_{j=0}^{K-1} \mathrm{input}[i + j] \cdot \mathrm{kernel}[j]$$

where $0 \le i \le N - K$, with $N$ the input length and $K$ the kernel length. The implementation targets large inputs (e.g. 1 million elements) and wide kernels (e.g. 2047 elements), and is tuned for the NVIDIA T4 GPU.
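As a sketch of the starting point (names and launch configuration are illustrative, not the exact submitted code), the naive variant maps one thread to one output element:

```cuda
// Naive 1D convolution: one thread per output element.
// Computes output[i] = sum_{j=0}^{K-1} input[i + j] * kernel[j]
// for 0 <= i <= N - K, where N is the input length and K the filter length.
__global__ void conv1d_naive(const float* input, const float* kernel,
                             float* output, int N, int K) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i <= N - K) {
        float acc = 0.0f;
        for (int j = 0; j < K; ++j)
            acc += input[i + j] * kernel[j];
        output[i] = acc;
    }
}
```

Every thread re-reads the full filter and an overlapping input window from global memory, which is why this version leaves so much performance on the table for wide kernels.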
- Naive
- Shared Memory
- Register Blocking
- Micro-Optimizations
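Of these, register blocking is the key idea: each thread computes several consecutive outputs, so every filter tap it loads feeds multiple accumulators held in registers. A minimal sketch, assuming an illustrative tile width `OUT_PER_THREAD` (the real value comes from the hyperparameter search):

```cuda
#define OUT_PER_THREAD 4  // illustrative tile width, not the tuned value

// Register-blocked 1D convolution: each thread produces OUT_PER_THREAD
// consecutive outputs; each filter tap is loaded once per thread and
// applied to all OUT_PER_THREAD register accumulators.
__global__ void conv1d_reg_block(const float* __restrict__ input,
                                 const float* __restrict__ kernel,
                                 float* __restrict__ output, int N, int K) {
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * OUT_PER_THREAD;
    float acc[OUT_PER_THREAD] = {0.0f};
    for (int j = 0; j < K; ++j) {
        float w = __ldg(&kernel[j]);  // filter tap via the read-only cache
        #pragma unroll
        for (int t = 0; t < OUT_PER_THREAD; ++t)
            if (base + t <= N - K)
                acc[t] += input[base + t + j] * w;
    }
    #pragma unroll
    for (int t = 0; t < OUT_PER_THREAD; ++t)
        if (base + t <= N - K)
            output[base + t] = acc[t];
}
```

Amortizing each filter-tap load over `OUT_PER_THREAD` multiply-adds raises arithmetic intensity, which is what moves the kernel off the memory-bandwidth wall.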
To reproduce these results, run the script below. It performs a hyperparameter search and prints performance metrics for all kernels.
```
./hyperparams.sh
```

Benchmark (on NVIDIA T4):

```
naive     : 4.60465 ms,  0.889101 GFLOPS
smem      : 4.67256 ms,  0.876179 GFLOPS
reg_block : 0.894524 ms, 4.57673 GFLOPS
micro_opt : 0.868636 ms, 4.71314 GFLOPS
```
Overall, this yields a 5.3x speedup over the naive implementation (from 4.60 ms down to 0.87 ms).