youssef62/conv1d-cuda


🚀 Fast 1D Convolution in CUDA – #1 on LeetGPU (T4)

This project implements and optimizes a 1D convolution CUDA kernel, ultimately achieving the top rank on the LeetGPU leaderboard for the NVIDIA T4. A blog post detailing the optimizations, and the leaderboard itself, are linked from the repository page.

The kernel computes: $$\text{output}[i] = \sum_{j=0}^{\text{mask-size}-1} \text{input}[i+j] \cdot \text{mask}[j]$$

where $i$ ranges from $0$ to $\text{input-size} - \text{mask-size}$.
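The formula maps directly onto one GPU thread per output element, which is the baseline the optimizations below improve on. A minimal sketch (kernel name and launch configuration are illustrative, not the repo's exact code):

```cuda
// Naive 1D convolution: one thread per output element.
// Illustrative sketch of the formula above, not the repo's tuned kernel.
__global__ void conv1d_naive(const float* input, const float* mask,
                             float* output, int input_size, int mask_size) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int output_size = input_size - mask_size + 1;  // i in [0, input_size - mask_size]
    if (i < output_size) {
        float acc = 0.0f;
        for (int j = 0; j < mask_size; ++j)
            acc += input[i + j] * mask[j];
        output[i] = acc;
    }
}
```

Every thread re-reads the same overlapping input window from global memory, which is what the shared-memory and register-blocking variants address.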

The kernel is optimized for large inputs (e.g. 1 million elements) and wide masks (e.g. 2047 elements) on an NVIDIA T4 GPU.

Approaches

  • Naive
  • Shared Memory
  • Register Blocking
  • Micro-Optimizations
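The shared-memory and register-blocking ideas above can be combined in one kernel: each block stages its input window in shared memory once, and each thread accumulates several adjacent outputs in registers so that every loaded mask value is reused multiple times. A sketch under assumed tile sizes (`BLOCK_DIM` and `OUT_PER_THREAD` are hypothetical placeholders, not the tuned values found by the hyperparameter search):

```cuda
#define BLOCK_DIM 128       // threads per block (assumed, not the tuned value)
#define OUT_PER_THREAD 4    // outputs per thread (assumed, not the tuned value)
#define TILE_OUT (BLOCK_DIM * OUT_PER_THREAD)

// Shared-memory tile + register blocking. Illustrative sketch only.
__global__ void conv1d_reg_block(const float* __restrict__ input,
                                 const float* __restrict__ mask,
                                 float* __restrict__ output,
                                 int input_size, int mask_size) {
    extern __shared__ float tile[];          // TILE_OUT + mask_size - 1 floats
    int tile_start = blockIdx.x * TILE_OUT;  // first output this block produces
    int tile_in = TILE_OUT + mask_size - 1;  // input elements the tile needs

    // Cooperative, coalesced load of the block's input window.
    for (int k = threadIdx.x; k < tile_in; k += BLOCK_DIM) {
        int g = tile_start + k;
        tile[k] = (g < input_size) ? input[g] : 0.0f;
    }
    __syncthreads();

    // Each thread keeps OUT_PER_THREAD partial sums in registers,
    // so each mask element is loaded once and reused OUT_PER_THREAD times.
    float acc[OUT_PER_THREAD] = {0.0f};
    int base = threadIdx.x * OUT_PER_THREAD;
    for (int j = 0; j < mask_size; ++j) {
        float m = mask[j];  // same address for all threads in the warp
        #pragma unroll
        for (int r = 0; r < OUT_PER_THREAD; ++r)
            acc[r] += tile[base + r + j] * m;
    }

    int output_size = input_size - mask_size + 1;
    #pragma unroll
    for (int r = 0; r < OUT_PER_THREAD; ++r) {
        int i = tile_start + base + r;
        if (i < output_size) output[i] = acc[r];
    }
}
// Launch sketch: conv1d_reg_block<<<grid, BLOCK_DIM,
//     (TILE_OUT + mask_size - 1) * sizeof(float)>>>(in, mask, out, n, m);
```

With a 2047-element mask, the tile is roughly `(512 + 2046) * 4` bytes ≈ 10 KB of shared memory per block, well within the T4's limit; the best tile and block sizes are what the hyperparameter search below explores.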

📊 Results

To reproduce these results, run the script below. It performs a hyperparameter search and prints performance metrics for all kernels.

./hyperparams.sh

Benchmark (on NVIDIA T4):

naive     : 4.60465 ms, 0.889101 TFLOPS
smem      : 4.67256 ms, 0.876179 TFLOPS
reg_block : 0.894524 ms, 4.57673 TFLOPS
micro_opt : 0.868636 ms, 4.71314 TFLOPS

The fastest kernel (micro_opt) is thus 5.3× faster than the naive implementation (4.60465 ms / 0.868636 ms ≈ 5.3).
