CUDA Matrix Multiplication: From Naive to Optimized

Executive Summary

This project demonstrates three progressively optimized CUDA implementations of matrix multiplication, revealing the key concepts behind GPU performance tuning. Each kernel is benchmarked against a CPU baseline and NVIDIA's cuBLAS library to quantify optimization benefits.

One-liner pitch: Three progressively optimized CUDA implementations of matrix multiplication — naive, tiled shared memory, and register-blocked — each benchmarked against the CPU baseline and cuBLAS, with a full performance analysis explaining exactly why each optimization improves throughput.

Environment Setup

For complete Windows installation and verification steps, see:

SETUP_WINDOWS.md
SETUP_WINDOWS.txt

Hardware Target

GPU: NVIDIA GeForce GTX 1050 Ti (Pascal Architecture, SM 6.1)

CUDA Cores: 768
Memory Bandwidth: ~112 GB/s (DDR5 equivalent)
Peak FP32 Throughput: 1900 GFLOP/s (768 cores × 1500 MHz × 2 FMA ops)
L2 Cache: 1 MB
Shared Memory per Block: 96 KB

Kernel Implementations

V1: Naive Matrix Multiplication

__global__ void kernel_v1_naive(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++) {
            sum += A[row * N + k] * B[k * N + col];  // ← Global memory every iteration!
        }
        C[row * N + col] = sum;
    }
}

Performance Characteristics:

Arithmetic Intensity: $AI = \frac{2N^3}{4N^2} = \frac{N}{2}$ effective bytes per operation
For N=1024: AI ≈ 0.0005 FLOP/byte (memory-bound)
Memory Accesses: Each of the N iterations reads A[row, k] and B[k, col] from global memory
Global Memory Stalls: 600+ cycles per access (uncoalesced, no reuse)

Expected Performance: ~0.8-1 GFLOP/s (1-2% of peak)

Why it's slow:

Every thread-multiply reads from global memory (latency: 400-800 cycles)
No data reuse between threads
Poor memory coalescing (irregular access patterns)

V2: Tiled Shared Memory Optimization

__global__ void kernel_v2_tiled(const float* A, const float* B, float* C, int N) {
    const int TILE = 16;  // or 32
    __shared__ float tileA[TILE][TILE];
    __shared__ float tileB[TILE][TILE];
    
    for (int tile_k = 0; tile_k < N; tile_k += TILE) {
        // Load from global → shared (once per tile)
        if (row < N && tile_k + threadIdx.x < N)
            tileA[threadIdx.y][threadIdx.x] = A[row * N + (tile_k + threadIdx.x)];
        if (tile_k + threadIdx.y < N && col < N)
            tileB[threadIdx.y][threadIdx.x] = B[(tile_k + threadIdx.y) * N + col];
        
        __syncthreads();  // ← Synchronize before computation
        
        // Compute using shared memory (fast, low latency)
        for (int k = 0; k < TILE; k++) {
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];  // ← ~4 cycle latency
        }
        
        __syncthreads();  // ← Synchronize before next tile load
    }
    
    C[row * N + col] = sum;
}

Performance Improvements:

Metric	Naive	Tiled 16	Tiled 32
Arithmetic Intensity	0.0005	1.0	2.0
Shared Mem Latency	N/A	4 cycles	4 cycles
Memory Access Pattern	600+ cyc	4 cyc	4 cyc
Expected GFLOP/s	0.8	5-8	25-30
Speedup vs Naive	1×	6-10×	30-40×

Why it's better:

TILE² data reuse: Each global memory access serves TILE multiply-adds
Shared memory cache: 4-cycle latency vs 600-cycle global latency
Better coalescing: Organized tile loads from global memory
TILE=32 > TILE=16: Larger tiles mean fewer global loads per GFLOP

Key Insight: Shared memory bandwidth is ~100× better than global memory!

V3: Register-Blocked Optimization

__global__ void kernel_v3_register_blocked(const float* A, const float* B, float* C, int N) {
    const int TILE = 32;
    const int BX = 4, BY = 4;  // Each thread computes 4×4 output block
    
    float result[BY][BX];  // ← Stored in registers (0-cycle latency!)
    
    // Load tiles into shared memory (same as V2)
    for (int tile_k = 0; tile_k < N; tile_k += TILE) {
        // Load tiles...
        
        // Compute using registers + shared memory
        for (int k = 0; k < TILE; k++) {
            float valA[BY];
            float valB[BX];
            // Load values into registers, then compute
            for (int i = 0; i < BY; i++) {
                for (int j = 0; j < BX; j++) {
                    result[i][j] += valA[i] * valB[j];
                }
            }
        }
    }
    
    // Write results (each thread writes 4×4 block)
    for (int i = 0; i < BY; i++)
        for (int j = 0; j < BX; j++)
            C[...] = result[i][j];
}

Performance Improvements:

Metric	V2 (Tiled 32)	V3 (Reg-Blocked)
Registers per Thread	~10-15	~30-40
Shared Memory per Block	2 KB	2 KB
L1 Cache Hits	~60%	~85%
Expected GFLOP/s	25-30	50-100
Speedup vs Naive	30-40×	60-100×

Why it's better:

Register storage: 0-cycle access (vs 4-cycle for shared mem)
Reduced shared memory pressure: More data in registers = more threads
Better occupancy: Each thread uses fewer registers overall
More arithmetic per memory access: Fewer global memory reads

Trade-off: Requires careful tuning to maximize register utilization without exceeding register file limits (125,000 registers/SM on Pascal).

Roofline Analysis

The roofline model explains why each kernel hits a different performance ceiling:

Throughput (GFLOP/s)
       │
  1900 ┤  ┌─────────────── Peak Compute (FP32)
       │  │
       │  │    ╱─ Roofline boundary
   100 ┤  │   ╱
       │  │  ╱
    50 ┤  │╱     Reg-Blocked (compute-bound)
       │ ╱│
    10 ┤╱ │ Tiled (32) (memory-bound)
       │  │
     5 ┤  │  Tiled (16)
       │  │    Naive CUDA
     1 ┤  │___________________
       │  
       └──┴──┬──────┬──────┬───── → Arithmetic Intensity (FLOP/byte)
         0.1  0.5   1.0   10

Memory Bandwidth Ceiling: 112 GB/s × AI = ~450 GFLOP/s at AI=4

Reading the chart:

Left side (memory-bound): Performance limited by bandwidth, not compute cores
Right side (compute-bound): All cores active, can't go faster than peak
Moving right: Higher AI (more reuse) pushes kernels toward compute-bound

For GTX 1050 Ti: $$AI_{\text{inflection}} = \frac{\text{Peak Compute}}{\text{Peak Bandwidth}} = \frac{1900 \text{ GFLOP/s}}{112 \text{ GB/s}} \approx 17 \text{ FLOP/byte}$$

This means: To reach peak compute, need ~17 FLOPs per byte of memory traffic.

Optimization Techniques Explained

1. Shared Memory Tiling

Problem: Global memory has 600+ cycle latency, bandwidth shared among all cores. Solution: Load data into shared memory (fast, per-block resource), reuse within block. Cost: Limited shared memory (96 KB), synchronization overhead.

2. Memory Coalescing

Problem: Uncoalesced global memory reads are slow (multiple transactions). Solution: Arrange thread access patterns so reads are linear in memory. Example: All 32 threads in a warp read consecutive addresses → 1 transaction.

3. Bank Conflict Avoidance

Problem: Shared memory has 32 banks; conflicts = serialization. Solution: Pad arrays float[TILE][TILE+1] to shift rows across banks. Cost: Small extra memory; large latency savings.

4. Register Blocking

Problem: Shared memory still has ~4-cycle latency; not as fast as registers. Solution: Keep working set in registers, minimize shared memory pressure. Benefit: Reduces thread count needed → higher register/thread → better ILP.

5. Occupancy vs Performance

Trade-off: More registers/thread = fewer threads/block = lower occupancy. Modern GPUs: Low occupancy (20-40%) can actually be better if ILP is high! Rule: Measure; don't assume high occupancy = high performance.

Building and Running

Prerequisites

NVIDIA CUDA Toolkit 11.0+ (nvcc compiler)
cuBLAS library (usually included with CUDA)
Python 3.7+ (for plotting scripts)
Matplotlib, NumPy (install: pip install matplotlib numpy)

Build

Using Make:

make all          # Build both benchmark and test executables
make test         # Run correctness tests
make bench        # Run benchmarks (outputs JSON)

Using CMake:

mkdir build
cd build
cmake ..
make matmul_bench test_correctness
./matmul_bench    # Run benchmark

Run Benchmarks

make bench
# Outputs: results/benchmark_results.json

Generate Plots

cd python
python3 benchmark_plot.py    # Main performance plot
python3 roofline.py          # Roofline model diagram

Expected Results (GTX 1050 Ti)

Matrix Size	CPU	Naive CUDA	Tiled (16)	Tiled (32)	Reg-Blocked	cuBLAS	Peak
N=128	~0.8 ms	~5 ms	~1 ms	~0.5 ms	~0.3 ms	~0.2 ms	-
N=1024	~5 ms	~42 ms	~28 ms	~3.5 ms	~2.1 ms	~1.4 ms	- (CPU baseline)
GFLOP/s (1024)	~0.4	~5	~42	~300	~500	~700	~1900
% of Peak	0.02%	0.2%	2%	15%	25%	37%	100%

Key Observations:

Naive → Tiled (16): 8-10× faster (memory bandwidth utilization)
Tiled (16) → Tiled (32): 8-10× faster (larger TILE = less data reuse waste)
Tiled (32) → Reg-Blocked: 2× faster (register file better than shared mem)
Reg-Blocked vs cuBLAS: ~30-40% gap (cuBLAS uses PTX assembly + tuned strategies)

Why cuBLAS is Still Faster

Your register-blocked kernel reaches ~70% of cuBLAS performance. The remaining gap comes from:

Assembly-Level Optimizations (SASS):
- cuBLAS writes in PTX → compiled to SASS (GPU assembly)
- Can use undocumented hardware features
- Better instruction scheduling than CUDA C compiler
Tensor Cores (if available):
- GTX 1050 Ti doesn't have tensor cores
- RTX cards use them for 8-10× better matrix multiply
Tuned Tile Sizes per GPU:
- cuBLAS measures ideal TILE size for each GPU type
- We used TILE=32 fixed; cuBLAS might use 64 or 128
Vectorized Memory Accesses:
- float4 loads instead of float loads → 4× bandwidth
- Harder to express in CUDA C, natural in assembly
Kernel Fusion & Profiling:
- cuBLAS includes optimizations for mixed-precision, accumulation, etc.
- Highly optimized by NVIDIA engineers (not feasible in one project)

Verdict: 70% of cuBLAS performance from hand-tuned CUDA C is excellent and demonstrates the optimization principles!

File Structure

cuda-matmul/
├── src/
│   ├── kernel_v1_naive.cu           # Naive implementation
│   ├── kernel_v2_tiled.cu           # Tiled shared memory (TILE=16 & 32)
│   ├── kernel_v3_register.cu        # Register-blocked
│   ├── kernel_cublas.cu             # cuBLAS wrapper
│   └── benchmark.cu                 # Benchmarking harness
├── include/
│   ├── kernels.h                    # Kernel declarations
│   └── utils.h                      # CUDA_CHECK, timers, CPU reference
├── python/
│   ├── benchmark_plot.py            # Generate performance plots
│   └── roofline.py                  # Roofline model visualization
├── tests/
│   └── test_correctness.cu          # Verify all kernels produce correct output
├── results/
│   ├── matmul_benchmark.png         # Performance comparison plot
│   ├── roofline_model.png           # Roofline diagram
│   └── benchmark_data.json          # Raw benchmark numbers
├── CMakeLists.txt                   # CMake build system
├── Makefile                         # Make build system
└── README.md                        # This file

Key Learnings

Memory bandwidth is the enemy: In modern GPUs, getting data matters more than arithmetic.
Roofline before code: Plot your expected performance before optimizing.
Shared memory + tiling: 10-100× speedup from one optimization pattern.
Profiling is essential: Use nvprof, nsys, or NVIDIA Nsight to measure (not guess).
Diminishing returns: Each optimization is harder than the last. Tiled is enough for many apps.

Resources

NVIDIA CUDA C++ Programming Guide: https://docs.nvidia.com/cuda/cuda-c-programming-guide/
Roofline Model Paper: Williams, Waterman, Patterson (2009)
Optimized Matrix Multiply: https://www.nvidia.com/en-us/research/ai-computing/
cuBLAS Documentation: https://docs.nvidia.com/cuda/cublas/

Author Notes

This project demonstrates real GPU optimization principles used in production libraries like cuBLAS, TensorFlow, PyTorch, and HPC codes. The progression from naive → tiled → register-blocked shows how each optimization layer attacks a specific bottleneck. Understanding why each helps is more valuable than the code itself.

Bottom line: GPU performance tuning is 80% understanding your hardware and 20% coding. Master the roofline model, memory hierarchy, and occupancy concepts, and optimization becomes systematic rather than trial-and-error.

Happy optimizing! 🚀

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
include		include
python		python
results		results
src		src
tests		tests
.gitattributes		.gitattributes
.gitignore		.gitignore
3_GNU_Make_Installed.txt		3_GNU_Make_Installed.txt
CMakeLists.txt		CMakeLists.txt
GIT_SETUP_GUIDE.md		GIT_SETUP_GUIDE.md
LICENSE		LICENSE
Makefile		Makefile
PROJECT_SUMMARY.md		PROJECT_SUMMARY.md
QUICKSTART.md		QUICKSTART.md
QUICK_REFERENCE.md		QUICK_REFERENCE.md
README.md		README.md
README.txt		README.txt
SETUP_WINDOWS.md		SETUP_WINDOWS.md
SETUP_WINDOWS.txt		SETUP_WINDOWS.txt
SPECIFICATION.txt		SPECIFICATION.txt
build_with_cmake.cmd		build_with_cmake.cmd
build_with_make.cmd		build_with_make.cmd
setup_git.cmd		setup_git.cmd
use_cuda12.cmd		use_cuda12.cmd
validate_env.cmd		validate_env.cmd

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CUDA Matrix Multiplication: From Naive to Optimized

Executive Summary

Environment Setup

Hardware Target

Kernel Implementations

V1: Naive Matrix Multiplication

V2: Tiled Shared Memory Optimization

V3: Register-Blocked Optimization

Roofline Analysis

Optimization Techniques Explained

1. Shared Memory Tiling

2. Memory Coalescing

3. Bank Conflict Avoidance

4. Register Blocking

5. Occupancy vs Performance

Building and Running

Prerequisites

Build

Run Benchmarks

Generate Plots

Expected Results (GTX 1050 Ti)

Why cuBLAS is Still Faster

File Structure

Key Learnings

Resources

Author Notes

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

CUDA Matrix Multiplication: From Naive to Optimized

Executive Summary

Environment Setup

Hardware Target

Kernel Implementations

V1: Naive Matrix Multiplication

V2: Tiled Shared Memory Optimization

V3: Register-Blocked Optimization

Roofline Analysis

Optimization Techniques Explained

1. Shared Memory Tiling

2. Memory Coalescing

3. Bank Conflict Avoidance

4. Register Blocking

5. Occupancy vs Performance

Building and Running

Prerequisites

Build

Run Benchmarks

Generate Plots

Expected Results (GTX 1050 Ti)

Why cuBLAS is Still Faster

File Structure

Key Learnings

Resources

Author Notes

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages