- CPU vs GPU (low latency vs high throughput)
- Latency: how long it takes to complete a single operation (e.g., the time to cook one meal).
- Throughput: how many operations can be completed per unit time (e.g., how many meals the restaurant can serve per hour).
- Memory Bandwidth: the maximum rate at which data can be read from or written to memory by the processor.
- CPU: optimized for low-latency execution of single-threaded tasks.
- Characteristics:
- Complex arithmetic and operand delivery logic.
- Large caches to minimize effective latency.
- Branch prediction and execution control logic.
- Significant chip area used for latency reduction hardware.
- Goal: Reduce the delay of individual instructions.
- GPU: optimized for high-throughput execution of massively parallel tasks.
- Characteristics:
- Fewer resources spent on latency-hiding logic.
- More chip area dedicated to arithmetic execution units.
- Many memory access channels.
- Designed to tolerate latency instead of eliminating it.
- Goal: Maximize the number of operations per second.
- CPU:
- Few cores with heavy cache + control logic.
- Good for sequential workloads and OS tasks.
- Limited parallelism but very low per-instruction latency.
- GPU:
- Thousands of simpler cores.
- Relies on having a large number of threads to hide latency.
- Best for workloads like graphics rendering, simulations, or ML training.
- CPUs: Better for branching, irregular memory access, and OS-level tasks.
- GPUs: Better for data-parallel workloads with predictable, repetitive operations.
- CPUs = latency-optimized, general-purpose, complex logic.
- GPUs = throughput-optimized, massive parallelism, latency hiding.
- GPUs depend on many threads to tolerate memory latency.
- Memory bandwidth is a critical bottleneck in GPU performance.
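To make the data-parallel workloads above concrete, a minimal CUDA vector-add sketch (kernel name, block size, and launch configuration are illustrative):

```cuda
// Minimal data-parallel kernel: each thread produces one output element.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Host-side launch (error handling omitted): one thread per element.
// vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```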
A Streaming Multiprocessor (SM) consists of:
- Control unit – manages scheduling and instruction dispatch.
- Cores – execute threads.
- On-chip memory – shared memory, registers, and caches.
- Global memory – large memory outside the SM (off-chip).
+----------------+
| SM |
|----------------|
| Control |
| Cores |
| On-chip memory |
+----------------+
↑
+----------------+
| Global memory |
+----------------+
- Threads of the same Block → assigned to the same SM.
- Multiple Blocks → can be assigned to the same SM.
- Only a limited number of Blocks can execute simultaneously per SM.
- Extra blocks are executed one after another.
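These per-SM resources and limits can be queried at runtime; a minimal sketch using the CUDA runtime (device 0 assumed):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                 // query device 0
    printf("SMs:                 %d\n", prop.multiProcessorCount);
    printf("Max threads per SM:  %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Warp size:           %d\n", prop.warpSize);
    printf("Registers per SM:    %d\n", prop.regsPerMultiprocessor);
    printf("Shared memory per SM: %zu bytes\n", prop.sharedMemPerMultiprocessor);
    return 0;
}
```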
- A Block assigned to an SM is divided into warps (each warp = 32 threads).
- If block size is not a multiple of 32, the last warp is padded with inactive threads.
- Each SM executes threads in warps using the SIMD (Single Instruction, Multiple Data) model.
- Every 8 cores form a group that shares fetch & dispatch for one warp.
- Warps are the fundamental execution unit.
- Threads in the same warp → same processing block.
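A small sketch of how a thread can identify its warp and lane under this 32-thread warp model (the output encoding is purely illustrative):

```cuda
// Threads of a warp execute one common instruction at a time (SIMD/SIMT).
__global__ void warpInfo(int* out) {
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int warpId = threadIdx.x / warpSize;   // warp index within the block (warpSize = 32)
    int lane   = threadIdx.x % warpSize;   // position within the warp
    out[tid]   = warpId * 100 + lane;      // illustrative encoding
}
// Warps per block: (blockSize + 31) / 32, so a block of 100 threads forms
// 4 warps and the last warp has 28 inactive (padded) threads.
```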
- Each SM executes multiple warps concurrently.
- With enough warps, hardware can always find one that is ready to execute.
- No waiting needed → zero-overhead thread scheduling.
- To tolerate memory latency, an SM must have many threads assigned.
- Unlike a CPU context switch, which must save and restore thread state, switching between resident warps on an SM costs nothing.

SM
├── Warp 1 → not ready
├── Warp 2 → ready
├── Warp 3 → ready
└── Warp 4 → ready
- Scheduler switches warps instantly → hides latency.
- Occupancy = (# warps assigned to the SM) / (maximum # warps the SM supports).
- High occupancy → better latency hiding.
- But 100% occupancy is not always optimal (register/shared memory trade-offs matter).
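The runtime can estimate occupancy directly; a minimal sketch using cudaOccupancyMaxActiveBlocksPerMultiprocessor (the kernel and the block size of 256 are illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel() { /* illustrative kernel */ }

int main() {
    int blockSize = 256;                      // illustrative block size
    int maxBlocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, myKernel,
                                                  blockSize, 0 /* dynamic smem */);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int activeWarps = maxBlocksPerSM * blockSize / prop.warpSize;
    int maxWarps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("occupancy: %.0f%%\n", 100.0 * activeWarps / maxWarps);
    return 0;
}
```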
- Dynamic partitioning:
- Threads are flexibly distributed among blocks.
- SM can execute many blocks with few threads or few blocks with many threads.
- This improves utilization.
- Fixed partitioning:
- Leads to wasted thread slots if resources don’t match block/thread requirements.
- Use block sizes in multiples of 32 to avoid wasted threads.
- High occupancy is useful for hiding memory latency.
- Dynamic partitioning allows flexible use of SM resources.
- Warps are the true execution unit in CUDA, not individual threads.
- Latency hiding is achieved by warp switching, not by stalling cores.
- Bottleneck: Accessing data in global memory is slow due to limited bandwidth.
- CUDA apps rely heavily on data parallelism → require efficient memory usage.
- Memory Coalescing: Combining multiple memory accesses into fewer transactions to maximize bandwidth.
- DRAM operations:
- Each access reads a whole DRAM row (many words), but only the needed word is used => inefficiency.
- Access patterns matter; scattered accesses waste bandwidth.
- Goal: Ensure threads in a warp access consecutive memory addresses to exploit DRAM efficiency.
- Recent CUDA devices: Offer cache support for global memory → reduces penalties for uncoalesced accesses.
- DRAM access involves row activation + column access.
- Latency sources:
- Activating rows, charge sensing, and restoring.
- Burst mode: Consecutive addresses are faster (row stays open).
- Accessing strided memory patterns → inefficient (extra row activations).
- Optimization: Arrange data to exploit row locality.
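A sketch contrasting a coalesced with a strided access pattern (kernel names are illustrative; the strided kernel stands in for the burst-unfriendly patterns described above):

```cuda
// Coalesced: consecutive threads in a warp read consecutive addresses,
// so a warp's requests fall into few DRAM bursts/transactions.
__global__ void copyCoalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads read addresses `stride` elements apart,
// breaking burst locality and wasting bandwidth.
__global__ void copyStrided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(i * stride) % n];   // scattered, non-consecutive reads
}
```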
- Default: each thread computes one output element (the smallest unit of work).
- Thread Coarsening: Assigns each thread to compute multiple outputs.
- Advantages:
- Reduces redundant work (e.g., repeated loads).
- Better cache utilization.
- Increases instruction-level parallelism.
- Trade-offs:
- May increase register/shared memory usage → lower occupancy.
- Too much coarsening can serialize execution and reduce efficiency.
- Example: Matrix multiplication
- Instead of one thread per output, a thread computes multiple tiles.
- Improves performance if hardware resources allow.
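A minimal coarsening sketch on a simple elementwise kernel (a stand-in for the matrix-multiplication case above; COARSE_FACTOR, the kernel, and the launch are illustrative):

```cuda
#define COARSE_FACTOR 4   // illustrative: target number of outputs per thread

// Each thread produces several outputs instead of one; the grid-stride
// spacing keeps each warp's accesses consecutive (coalesced) per iteration.
__global__ void scaleCoarsened(const float* in, float* out, int n, float s) {
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;          // total threads in the grid
    for (int i = tid; i < n; i += stride) {
        out[i] = s * in[i];
    }
}

// Launch with about n / COARSE_FACTOR threads so each thread covers
// roughly COARSE_FACTOR elements, e.g.:
// int threads = 256, blocks = (n / COARSE_FACTOR + threads - 1) / threads;
// scaleCoarsened<<<blocks, threads>>>(d_in, d_out, n, 2.0f);
```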
- Reduce kernel launch and scheduling overhead: launching millions of threads has overhead; coarsening reduces the total thread count.
- Increase instruction-level parallelism (ILP): a single thread now performs more independent operations, so the compiler can better reorder instructions and hide memory latency.
- Better register reuse: if a thread loads data that will be reused for several nearby computations, coarsening avoids redundant loads across different threads.
- Avoid underutilization: sometimes there is not enough parallel work to saturate the GPU (e.g., small problem sizes); coarsening ensures the GPU is still used efficiently.
- Memory bandwidth is a critical bottleneck in GPU performance.
- Memory coalescing: Ensure warp threads access consecutive addresses.
- DRAM optimization: Use burst-friendly access patterns.
- Thread coarsening: Trade-off between reduced redundancy and increased resource usage.
- DRAM bursting: Consecutive memory locations are accessed in the DRAM core array in parallel.
- Alone, bursting is not enough to meet CPU/GPU bandwidth needs.
- Modern DRAM systems add two levels of parallel organization:
- Channels
- Banks
- Definition: each channel is a memory controller with a bus connecting multiple DRAM banks to the processor.
- A processor can have multiple channels (e.g., 4 or 8).
- Data transfer bandwidth of a bus = width × clock frequency.
- Example:
- DDR (Double Data Rate) transfers data twice per clock cycle.
- A 64-bit DDR bus @ 1 GHz: 64 bits = 8 B → 8 B × 2 × 1 GHz = 16 GB/s.
- CPUs/GPUs require 32–256 GB/s bandwidth, meaning many channels are needed.
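The same arithmetic as a tiny sketch (the 128 GB/s target is an assumed example within the 32–256 GB/s range above):

```cuda
#include <cstdio>

int main() {
    double bytes_per_transfer = 64 / 8.0;               // 64-bit bus = 8 B per transfer
    double channel_bw = bytes_per_transfer * 2 * 1e9;   // DDR @ 1 GHz -> 16 GB/s
    double target_bw  = 128e9;                          // assumed aggregate requirement
    int channels = (int)((target_bw + channel_bw - 1) / channel_bw);  // -> 8 channels
    printf("per-channel: %.0f GB/s, channels needed: %d\n", channel_bw / 1e9, channels);
    return 0;
}
```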
- A bank = array of DRAM cells + sensing amplifiers for accessing them.
- Why multiple banks?
- Each access has high latency (activating cells, moving data).
- If only one bank per channel → bus mostly idle (low utilization).
- Multiple banks hide this latency by overlapping accesses.
- Single-bank channel (A):
- Access latency is much longer (>>) than the data transfer time.
- Example: latency:transfer = 20:1 → utilization = 1/21 = 4.8%.
- A 16 GB/s channel → only 0.76 GB/s effective.
- Two-bank channel (B):
- Bank 1 starts access while Bank 0 transfers data.
- Overlap hides latency → doubles utilization.
- If latency:transfer = R, then at least R + 1 banks are needed per channel to fully utilize the bus.
- Example: ratio = 20 → need ≥ 21 banks.
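The utilization math from this example as a small sketch (R = 20 follows the 20:1 ratio above; the 16 GB/s channel matches the earlier example):

```cuda
#include <cstdio>

int main() {
    int R = 20;                               // latency : transfer-time ratio
    double single_bank_util = 1.0 / (R + 1);  // one bank: ~4.8% of peak
    int min_banks = R + 1;                    // banks needed to keep the bus busy
    printf("utilization with 1 bank: %.1f%%\n", 100 * single_bank_util);
    printf("effective bandwidth of a 16 GB/s channel: %.2f GB/s\n", 16 * single_bank_util);
    printf("minimum banks per channel: %d\n", min_banks);
    return 0;
}
```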
- Benefits of many banks:
- Reduces bank conflicts (multiple requests to same bank).
- Provides enough cell capacity.
- Idea: Spread array elements across banks & channels.
- Prevents small arrays from using only one channel/bank.
- Example distribution:
- M[0], M[1] → bank 0, channel 0
- M[2], M[3] → bank 0, channel 1
- M[4], M[5] → bank 0, channel 2
- M[6], M[7] → bank 0, channel 3
- M[8], M[9] → bank 1, channel 0
- … and so on (wrap around channels, then increment bank).
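A sketch of the index → (channel, bank) mapping implied by this distribution (bursts of 2 elements and 4 channels follow the example; the bank count is an assumption):

```cuda
// Interleaving: bursts of 2 elements spread across 4 channels, then banks.
const int BURST_ELEMS  = 2;
const int NUM_CHANNELS = 4;
const int NUM_BANKS    = 8;   // assumed value; the example only fixes bursts and channels

int channelOf(int i) { return (i / BURST_ELEMS) % NUM_CHANNELS; }
int bankOf(int i)    { return (i / (BURST_ELEMS * NUM_CHANNELS)) % NUM_BANKS; }
// channelOf(0..1) == 0, channelOf(2..3) == 1, ..., channelOf(8..9) == 0 with bankOf(8..9) == 1
```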
- To achieve full memory bandwidth:
- Many threads must issue requests simultaneously.
- Requests should be evenly distributed across channels & banks.
- Accesses must be coalesced (aligned & grouped efficiently).
- Poor distribution → multiple threads hit the same channel → bottleneck.
- Each thread block loads a tile of the input matrix.
- Interleaved distribution ensures:
- Accesses spread across banks/channels.
- Coalesced accesses happen in parallel.
- Example (Phase 0, Block 0,0):
- Loads M[0], M[1], M[4], M[5].
- Spread across different banks → parallel transfers.
- Good mapping = high utilization of bus bandwidth.
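A sketch of such a tiled load (TILE = 2 and a width-4 matrix match the M[0], M[1], M[4], M[5] example; the kernel and its placeholder output write are illustrative):

```cuda
#define TILE 2   // tile width matching the example above (illustrative)

// Phase-0 load of an M tile into shared memory: consecutive threads of a
// warp read consecutive elements of a row, so the requests coalesce and,
// with interleaved placement, land on different banks/channels.
__global__ void loadTile(const float* M, int width, float* out) {
    __shared__ float Ms[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    Ms[threadIdx.y][threadIdx.x] = M[row * width + col];  // block (0,0): M[0], M[1], M[4], M[5]
    __syncthreads();
    out[row * width + col] = Ms[threadIdx.y][threadIdx.x]; // placeholder use of the tile
}
```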
- Channels: Independent buses, increase overall width of memory system.
- Banks: Parallel units within a channel, hide latency via overlap.
- Bank conflicts: Reduce efficiency; many banks minimize chance.
- Interleaved distribution: Spreads data evenly → avoids hotspots.
- Threads & coalescing: GPU performance depends on matching parallel threads with DRAM’s parallel structure.






