- CPU vs GPU (low latency vs high throughput)
- Latency: how long it takes to complete a single operation (e.g., the time to cook one meal).
- Throughput: how many operations can be completed per unit time (e.g., how many meals the restaurant can serve per hour).
- Memory Bandwidth: the maximum rate at which data can be read from or written to memory by the processor.
- CPU: optimized for low-latency execution of single-threaded tasks.
- Characteristics:
- Complex arithmetic and operand delivery logic.
- Large caches to minimize effective latency.
- Branch prediction and execution control logic.
- Significant chip area used for latency reduction hardware.
- Goal: Reduce the delay of individual instructions.
- GPU: optimized for high-throughput execution of massively parallel tasks.
- Characteristics:
- Fewer resources spent on latency-hiding logic.
- More chip area dedicated to arithmetic execution units.
- Many memory access channels.
- Designed to tolerate latency instead of eliminating it.
- Goal: Maximize the number of operations per second.
- CPU:
- Few cores with heavy cache + control logic.
- Good for sequential workloads and OS tasks.
- Limited parallelism but very low per-instruction latency.
- GPU:
- Thousands of simpler cores.
- Relies on having a large number of threads to hide latency.
- Best for workloads like graphics rendering, simulations, or ML training.
- CPUs: Better for branching, irregular memory access, and OS-level tasks.
- GPUs: Better for data-parallel workloads with predictable, repetitive operations.
- CPUs = latency-optimized, general-purpose, complex logic.
- GPUs = throughput-optimized, massive parallelism, latency hiding.
- GPUs depend on many threads to tolerate memory latency.
- Memory bandwidth is a critical bottleneck in GPU performance.
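To make the data-parallel workloads above concrete, a minimal CUDA vector-add sketch (kernel name, block size, and launch configuration are illustrative):

```cuda
// Minimal data-parallel kernel: each thread produces one output element.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Host-side launch (error handling omitted): one thread per element.
// vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```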
A Streaming Multiprocessor (SM) consists of:
- Control unit – manages scheduling and instruction dispatch.
- Cores – execute threads.
- On-chip memory – shared memory, registers, and caches.
- Global memory – large memory outside the SM (off-chip).
+----------------+
| SM |
|----------------|
| Control |
| Cores |
| On-chip memory |
+----------------+
↑
+----------------+
| Global memory |
+----------------+
- Threads of the same Block → assigned to the same SM.
- Multiple Blocks → can be assigned to the same SM.
- Only a limited number of Blocks can execute simultaneously per SM.
- Extra blocks are executed one after another.
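These per-SM resources and limits can be queried at runtime; a minimal sketch using the CUDA runtime (device 0 assumed):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                 // query device 0
    printf("SMs:                 %d\n", prop.multiProcessorCount);
    printf("Max threads per SM:  %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Warp size:           %d\n", prop.warpSize);
    printf("Registers per SM:    %d\n", prop.regsPerMultiprocessor);
    printf("Shared memory per SM: %zu bytes\n", prop.sharedMemPerMultiprocessor);
    return 0;
}
```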
- A Block assigned to an SM is divided into warps (each warp = 32 threads).
- If block size is not a multiple of 32, the last warp is padded with inactive threads.
- Each SM executes threads in warps using the SIMD (Single Instruction, Multiple Data) model.
- Every 8 cores form a group that shares fetch & dispatch for one warp.
- Warps are the fundamental execution unit.
- Threads in the same warp → same processing block.
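A small sketch of how a thread can identify its warp and lane under this 32-thread warp model (the output encoding is purely illustrative):

```cuda
// Threads of a warp execute one common instruction at a time (SIMD/SIMT).
__global__ void warpInfo(int* out) {
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int warpId = threadIdx.x / warpSize;   // warp index within the block (warpSize = 32)
    int lane   = threadIdx.x % warpSize;   // position within the warp
    out[tid]   = warpId * 100 + lane;      // illustrative encoding
}
// Warps per block: (blockSize + 31) / 32, so a block of 100 threads forms
// 4 warps and the last warp has 28 inactive (padded) threads.
```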
- Each SM executes multiple warps concurrently.
- With enough warps, hardware can always find one that is ready to execute.
- No waiting needed → zero-overhead thread scheduling.
- To tolerate memory latency, an SM must have many threads assigned.
- Unlike a CPU context switch, which must save and restore thread state, switching between resident warps on an SM costs nothing.

SM
├── Warp 1 → not ready
├── Warp 2 → ready
├── Warp 3 → ready
└── Warp 4 → ready
- Scheduler switches warps instantly → hides latency.
- Occupancy = (# warps assigned to the SM) / (maximum # warps the SM supports).
- High occupancy → better latency hiding.
- But 100% occupancy is not always optimal (register/shared memory trade-offs matter).
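The runtime can estimate occupancy directly; a minimal sketch using cudaOccupancyMaxActiveBlocksPerMultiprocessor (the kernel and the block size of 256 are illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel() { /* illustrative kernel */ }

int main() {
    int blockSize = 256;                      // illustrative block size
    int maxBlocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, myKernel,
                                                  blockSize, 0 /* dynamic smem */);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int activeWarps = maxBlocksPerSM * blockSize / prop.warpSize;
    int maxWarps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("occupancy: %.0f%%\n", 100.0 * activeWarps / maxWarps);
    return 0;
}
```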
- Dynamic partitioning:
- Threads are flexibly distributed among blocks.
- SM can execute many blocks with few threads or few blocks with many threads.
- This improves utilization.
- Fixed partitioning:
- Leads to wasted thread slots if resources don’t match block/thread requirements.
- Use block sizes in multiples of 32 to avoid wasted threads.
- High occupancy is useful for hiding memory latency.
- Dynamic partitioning allows flexible use of SM resources.
- Warps are the true execution unit in CUDA, not individual threads.
- Latency hiding is achieved by warp switching, not by stalling cores.
- Bottleneck: Accessing data in global memory is slow due to limited bandwidth.
- CUDA apps rely heavily on data parallelism → require efficient memory usage.
- Memory Coalescing: Combining multiple memory accesses into fewer transactions to maximize bandwidth.
- DRAM operations:
- Each access reads a whole DRAM row (many words), but only the needed word is used => inefficiency.
- Access patterns matter; scattered accesses waste bandwidth.
- Goal: Ensure threads in a warp access consecutive memory addresses to exploit DRAM efficiency.
- Recent CUDA devices: Offer cache support for global memory → reduces penalties for uncoalesced accesses.
- DRAM access involves row activation + column access.
- Latency sources:
- Activating rows, charge sensing, and restoring.
- Burst mode: Consecutive addresses are faster (row stays open).
- Accessing strided memory patterns → inefficient (extra row activations).
- Optimization: Arrange data to exploit row locality.
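A sketch contrasting a coalesced with a strided access pattern (kernel names are illustrative; the strided kernel stands in for the burst-unfriendly patterns described above):

```cuda
// Coalesced: consecutive threads in a warp read consecutive addresses,
// so a warp's requests fall into few DRAM bursts/transactions.
__global__ void copyCoalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads read addresses `stride` elements apart,
// breaking burst locality and wasting bandwidth.
__global__ void copyStrided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(i * stride) % n];   // scattered, non-consecutive reads
}
```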
- Default: each thread computes one output element (the smallest unit of work).
- Thread Coarsening: Assigns each thread to compute multiple outputs.
- Advantages:
- Reduces redundant work (e.g., repeated loads).
- Better cache utilization.
- Increases instruction-level parallelism.
- Trade-offs:
- May increase register/shared memory usage → lower occupancy.
- Too much coarsening can serialize execution and reduce efficiency.
- Example: Matrix multiplication
- Instead of one thread per output, a thread computes multiple tiles.
- Improves performance if hardware resources allow.
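A minimal coarsening sketch on a simple elementwise kernel (a stand-in for the matrix-multiplication case above; COARSE_FACTOR, the kernel, and the launch are illustrative):

```cuda
#define COARSE_FACTOR 4   // illustrative: target number of outputs per thread

// Each thread produces several outputs instead of one; the grid-stride
// spacing keeps each warp's accesses consecutive (coalesced) per iteration.
__global__ void scaleCoarsened(const float* in, float* out, int n, float s) {
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;          // total threads in the grid
    for (int i = tid; i < n; i += stride) {
        out[i] = s * in[i];
    }
}

// Launch with about n / COARSE_FACTOR threads so each thread covers
// roughly COARSE_FACTOR elements, e.g.:
// int threads = 256, blocks = (n / COARSE_FACTOR + threads - 1) / threads;
// scaleCoarsened<<<blocks, threads>>>(d_in, d_out, n, 2.0f);
```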
- Reduce kernel launch and scheduling overhead: launching millions of threads has overhead; coarsening reduces the total thread count.
- Increase instruction-level parallelism (ILP): a single thread now performs more independent operations, so the compiler can better reorder instructions and hide memory latency.
- Better register reuse: if a thread loads data that will be reused for several nearby computations, coarsening avoids redundant loads across different threads.
- Avoid underutilization: sometimes there is not enough parallel work to saturate the GPU (e.g., small problem sizes); coarsening ensures the GPU is still used efficiently.
- Memory bandwidth is a critical bottleneck in GPU performance.
- Memory coalescing: Ensure warp threads access consecutive addresses.
- DRAM optimization: Use burst-friendly access patterns.
- Thread coarsening: Trade-off between reduced redundancy and increased resource usage.
- DRAM bursting: Consecutive memory locations are accessed in the DRAM core array in parallel.
- Alone, bursting is not enough to meet CPU/GPU bandwidth needs.
- Modern DRAM systems add two levels of parallel organization:
- Channels
- Banks
- Definition: each channel is a memory controller with a bus connecting multiple DRAM banks to the processor.
- A processor can have multiple channels (e.g., 4 or 8).
- Data transfer bandwidth of a bus = width × clock frequency.
- Example:
- DDR (Double Data Rate) transfers data twice per clock cycle.
- A 64-bit DDR bus @ 1 GHz: 64 bits = 8 B → 8 B × 2 × 1 GHz = 16 GB/s.
- CPUs/GPUs require 32–256 GB/s bandwidth, meaning many channels are needed.
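The same arithmetic as a tiny sketch (the 128 GB/s target is an assumed example within the 32–256 GB/s range above):

```cuda
#include <cstdio>

int main() {
    double bytes_per_transfer = 64 / 8.0;               // 64-bit bus = 8 B per transfer
    double channel_bw = bytes_per_transfer * 2 * 1e9;   // DDR @ 1 GHz -> 16 GB/s
    double target_bw  = 128e9;                          // assumed aggregate requirement
    int channels = (int)((target_bw + channel_bw - 1) / channel_bw);  // -> 8 channels
    printf("per-channel: %.0f GB/s, channels needed: %d\n", channel_bw / 1e9, channels);
    return 0;
}
```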
- A bank = array of DRAM cells + sensing amplifiers for accessing them.
- Why multiple banks?
- Each access has high latency (activating cells, moving data).
- If only one bank per channel → bus mostly idle (low utilization).
- Multiple banks hide this latency by overlapping accesses.
- Single-bank channel (A):
- Access latency is much longer (>>) than the data transfer time.
- Example: latency:transfer = 20:1 → utilization = 1/21 = 4.8%.
- A 16 GB/s channel → only 0.76 GB/s effective.
- Two-bank channel (B):
- Bank 1 starts access while Bank 0 transfers data.
- Overlap hides latency → doubles utilization.
- If latency:transfer = R, then at least R + 1 banks are needed per channel to fully utilize the bus.
- Example: ratio = 20 → need ≥ 21 banks.
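The utilization math from this example as a small sketch (R = 20 follows the 20:1 ratio above; the 16 GB/s channel matches the earlier example):

```cuda
#include <cstdio>

int main() {
    int R = 20;                               // latency : transfer-time ratio
    double single_bank_util = 1.0 / (R + 1);  // one bank: ~4.8% of peak
    int min_banks = R + 1;                    // banks needed to keep the bus busy
    printf("utilization with 1 bank: %.1f%%\n", 100 * single_bank_util);
    printf("effective bandwidth of a 16 GB/s channel: %.2f GB/s\n", 16 * single_bank_util);
    printf("minimum banks per channel: %d\n", min_banks);
    return 0;
}
```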
- Benefits of many banks:
- Reduces bank conflicts (multiple requests to same bank).
- Provides enough cell capacity.
- Idea: Spread array elements across banks & channels.
- Prevents small arrays from using only one channel/bank.
- Example distribution:
- M[0], M[1] → bank 0, channel 0
- M[2], M[3] → bank 0, channel 1
- M[4], M[5] → bank 0, channel 2
- M[6], M[7] → bank 0, channel 3
- M[8], M[9] → bank 1, channel 0
- … and so on (wrap around channels, then increment bank).
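A sketch of the index → (channel, bank) mapping implied by this distribution (bursts of 2 elements and 4 channels follow the example; the bank count is an assumption):

```cuda
// Interleaving: bursts of 2 elements spread across 4 channels, then banks.
const int BURST_ELEMS  = 2;
const int NUM_CHANNELS = 4;
const int NUM_BANKS    = 8;   // assumed value; the example only fixes bursts and channels

int channelOf(int i) { return (i / BURST_ELEMS) % NUM_CHANNELS; }
int bankOf(int i)    { return (i / (BURST_ELEMS * NUM_CHANNELS)) % NUM_BANKS; }
// channelOf(0..1) == 0, channelOf(2..3) == 1, ..., channelOf(8..9) == 0 with bankOf(8..9) == 1
```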
- To achieve full memory bandwidth:
- Many threads must issue requests simultaneously.
- Requests should be evenly distributed across channels & banks.
- Accesses must be coalesced (aligned & grouped efficiently).
- Poor distribution → multiple threads hit the same channel → bottleneck.
- Each thread block loads a tile of the input matrix.
- Interleaved distribution ensures:
- Accesses spread across banks/channels.
- Coalesced accesses happen in parallel.
- Example (Phase 0, Block 0,0):
- Loads M[0], M[1], M[4], M[5].
- Spread across different banks → parallel transfers.
- Good mapping = high utilization of bus bandwidth.
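A sketch of such a tiled load (TILE = 2 and a width-4 matrix match the M[0], M[1], M[4], M[5] example; the kernel and its placeholder output write are illustrative):

```cuda
#define TILE 2   // tile width matching the example above (illustrative)

// Phase-0 load of an M tile into shared memory: consecutive threads of a
// warp read consecutive elements of a row, so the requests coalesce and,
// with interleaved placement, land on different banks/channels.
__global__ void loadTile(const float* M, int width, float* out) {
    __shared__ float Ms[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    Ms[threadIdx.y][threadIdx.x] = M[row * width + col];  // block (0,0): M[0], M[1], M[4], M[5]
    __syncthreads();
    out[row * width + col] = Ms[threadIdx.y][threadIdx.x]; // placeholder use of the tile
}
```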
- Channels: Independent buses, increase overall width of memory system.
- Banks: Parallel units within a channel, hide latency via overlap.
- Bank conflicts: Reduce efficiency; many banks minimize chance.
- Interleaved distribution: Spreads data evenly → avoids hotspots.
- Threads & coalescing: GPU performance depends on matching parallel threads with DRAM’s parallel structure.






