
Fix AMD inter-CPU bandwidth and LL128 protocol for multi-socket EPYC #2036

Open
voipmonitor wants to merge 1 commit into NVIDIA:master from voipmonitor:fix/amd-intercpu-bw-and-ll128

Conversation

@voipmonitor

Summary

  • AMD inter-socket bandwidth is hardcoded to 16 GB/s for all generations, while Intel has four model-specific tiers (Broadwell 6 GB/s up to Emerald Rapids 40 GB/s). Modern AMD EPYC (Milan, Genoa, Turin) Infinity Fabric delivers 32-48+ GB/s, causing NCCL to severely underestimate available bandwidth on AMD dual-socket systems.

  • LL128 protocol is unconditionally disabled for PATH_SYS connections. On AMD multi-socket systems, inter-GPU paths always traverse the SMP interconnect (PATH_SYS), forcing fallback to SIMPLE protocol. The only workaround is NCCL_GRAPH_FILE with manually crafted XML overriding path types and speeds.

Changes

1. AMD CPU generation detection (graph.h, topo.cc)

Added CPUID Family-based model detection for AMD, mirroring existing Intel logic:

| Generation | CPUID Family | NCCL familyId | Model constant |
|---|---|---|---|
| Zen 1/2 (Naples/Rome) | 17h | 0x8F | AMD_ZEN12 |
| Zen 3/4 (Milan/Genoa) | 19h | 0xAF | AMD_ZEN34 |
| Zen 5 (Turin) | 1Ah | 0xBF | AMD_ZEN5 |

2. Generation-specific inter-socket bandwidth (topo.h, topo.cc)

Replaced flat AMD_BW = 16.0 with per-generation values:

| Generation | Bandwidth | Rationale |
|---|---|---|
| Zen 1/2 | 16.0 GB/s | Unchanged (2x xGMI 1.0/2.0) |
| Zen 3/4 | 32.0 GB/s | 3-4x xGMI 3.0/4.0 links |
| Zen 5 | 48.0 GB/s | Enhanced Infinity Fabric |
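Taken together, sections 1 and 2 amount to a two-step mapping: decoded CPUID family → generation → bandwidth. A minimal sketch, using the constant names from the tables above (the helper functions are illustrative, not NCCL's actual code, and the bandwidth values are the ones proposed at this point in the PR):

```cpp
#include <cassert>

// Illustrative two-step mapping: decoded CPUID family -> generation -> GB/s.
// Constant names follow the tables above; helpers are sketches, not NCCL code.
enum AmdCpuModel { AMD_ZEN12, AMD_ZEN34, AMD_ZEN5, AMD_UNKNOWN };

// Step 1: map the decoded CPUID family (17h=23, 19h=25, 1Ah=26) to a generation.
static AmdCpuModel amdModelFromFamily(int family) {
  switch (family) {
    case 23: return AMD_ZEN12;   // Naples/Rome
    case 25: return AMD_ZEN34;   // Milan/Genoa
    case 26: return AMD_ZEN5;    // Turin
    default: return AMD_UNKNOWN;
  }
}

// Step 2: per-generation inter-socket bandwidth in GB/s, per the table above.
static double amdInterSocketBw(AmdCpuModel m) {
  switch (m) {
    case AMD_ZEN12: return 16.0;  // 2x xGMI 1.0/2.0
    case AMD_ZEN34: return 32.0;  // 3-4x xGMI 3.0/4.0 links
    case AMD_ZEN5:  return 48.0;  // enhanced Infinity Fabric
    default:        return 16.0;  // conservative fallback for unknown parts
  }
}
```

An unknown family deliberately falls back to the old flat 16 GB/s value, so unrecognized future parts are never overestimated.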

3. LL128 over PATH_SYS on AMD (tuning.cc)

For Hopper/Blackwell GPUs on AMD systems: enable LL128 when typeInter == PATH_SYS and bwInter >= 32 GB/s (Zen 3+). This allows low-latency protocols over Infinity Fabric instead of forcing SIMPLE protocol fallback.
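As a sketch, the proposed gate reduces to a small predicate. Only `typeInter`, `bwInter`, and `PATH_SYS` come from the description above; the enum values and helper name are placeholders, not NCCL's actual definitions:

```cpp
#include <cassert>

// Sketch of the proposed LL128-over-SYS gate. Placeholder types/names:
// only typeInter, bwInter, and PATH_SYS come from the PR description.
enum PathType { PATH_NVL, PATH_PIX, PATH_SYS };

static bool allowLL128(bool cpuIsAmd, PathType typeInter, double bwInter) {
  // Current NCCL behavior: LL128 is unconditionally off for PATH_SYS;
  // other path types are not affected by this change.
  if (typeInter != PATH_SYS) return true;
  // Proposed exception: AMD with Zen 3+ class Infinity Fabric (>= 32 GB/s).
  return cpuIsAmd && bwInter >= 32.0;
}
```

The 32 GB/s threshold doubles as the generation check: Zen 1/2 systems report 16 GB/s and keep the SIMPLE fallback.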

Motivation

On AMD EPYC dual-socket systems with Hopper GPUs, NCCL:

  1. Reports only 16 GB/s SYS bandwidth (vs actual 32-48+ GB/s)
  2. Disables LL128 protocol entirely for inter-socket communication
  3. Falls back to SIMPLE protocol with degraded small/medium message performance

The only current workaround is:

```
NCCL_GRAPH_FILE=/path/to/custom_graph.xml
```

with a hand-crafted XML that overrides `typeinter` and `speedinter` values.

Test plan

  • Verify CPUID family detection on AMD Naples (Family 17h) → model ZEN12, bw=16.0
  • Verify CPUID family detection on AMD Genoa (Family 19h) → model ZEN34, bw=32.0
  • Verify CPUID family detection on AMD Turin (Family 1Ah) → model ZEN5, bw=48.0
  • Run nccl-tests AllReduce on dual-socket AMD EPYC + Hopper GPUs, verify LL128 is selected for small/medium messages
  • Compare throughput with/without patch vs NCCL_GRAPH_FILE workaround
  • Verify no regression on Intel systems (unchanged code path)
  • Verify no regression on single-socket AMD systems (PATH_SYS not used)

@xiaofanl-nvidia
Collaborator

++ @thomasgillis , @marksantesson

@sjeaugey
Member

sjeaugey commented Mar 30, 2026

A few comments...

Family ID

Regarding Family ID, it looks like the NCCL code is wrong.

From Wikipedia (if you have a better source, please share):

> The actual processor family is derived from the Family ID and Extended Family ID fields. If the Family ID field is equal to 15, the family is equal to the sum of the Extended Family ID and the Family ID fields. Otherwise, the family is equal to the value of the Family ID field.

I probably didn't read that far and simply assumed that the extended family ID was a simple bit extension, like for the model ID. We should fix the code to add the extended family ID only when familyId == 15:

```diff
-    int familyId = cpuid1.familyId + (cpuid1.extFamilyId << 4);
+    int familyId = cpuid1.familyId;
+    if (familyId == 15) familyId += cpuid1.extFamilyId;
```

Beyond that change, we may also want to properly handle this:

> The actual processor model is derived from the Model, Extended Model ID and Family ID fields. If the Family ID field is either 6 or 15, the model is equal to the sum of the Extended Model ID field shifted left by 4 bits and the Model field. Otherwise, the model is equal to the value of the Model field.

Checking familyId before we add modelId and extModelId << 4:

```diff
-     int familyId = cpuid1.familyId + (cpuid1.extFamilyId << 4);
-     int modelId = cpuid1.modelId + (cpuid1.extModelId << 4);
+     int familyId = cpuid1.familyId;
+     int modelId = cpuid1.modelId;
+     if (familyId == 15 || familyId == 6) modelId += cpuid1.extModelId << 4;
+     if (familyId == 15) familyId += cpuid1.extFamilyId;
```
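Put together, the corrected derivation can be written as one self-contained helper (EAX field layout per the standard CPUID leaf 1 encoding: model bits 7:4, family bits 11:8, extended model bits 19:16, extended family bits 27:20; the function name is illustrative):

```cpp
#include <cassert>
#include <cstdint>

// Decode family/model from CPUID leaf 1 EAX per the x86 spec quoted above.
// EAX layout: model = bits 7:4, family = bits 11:8,
// extModel = bits 19:16, extFamily = bits 27:20.
static void decodeCpuid1(uint32_t eax, int* family, int* model) {
  int familyId    = (eax >> 8)  & 0xF;
  int modelId     = (eax >> 4)  & 0xF;
  int extModelId  = (eax >> 16) & 0xF;
  int extFamilyId = (eax >> 20) & 0xFF;
  // Extended model applies only when the base family is 6 or 15.
  if (familyId == 6 || familyId == 15) modelId += extModelId << 4;
  // Extended family is *added* (not a bit extension), only when family == 15.
  if (familyId == 15) familyId += extFamilyId;
  *family = familyId;
  *model  = modelId;
}
```

With this, an EPYC part reporting base family 0xF and extended family 0xA decodes to family 25 (19h, Zen 3/4), while Intel family-6 parts keep the shifted extended model.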

LL128

It is not safe to enable LL128 when data goes through the CPU. CPUs do break 128B stores into 2x64B cache lines, and may reorder them if relaxed ordering is enabled. There is no way for NCCL to know whether RO is enabled, hence we cannot enable LL128 by default.

AMD CPU Bandwidth

It would be good to run a simple 2-GPU benchmark on AMD systems of different generations to check whether we can actually reach something close to the bandwidth we add here (32 GB/s and 48 GB/s). For example, if the actual bandwidth is only 32 GB/s when it's 48 GB/s on paper, setting the value to 48 can lead to suboptimal decisions from NCCL, using many SMs when it could have used fewer.

@voipmonitor
Author

Thank you for the detailed review @sjeaugey!

Addressed all three points in the latest push:

1. Family ID fix — Fixed per your suggestion in xml.cc. The corrected formula now follows the x86 CPUID spec:

  • familyId: add extFamilyId only when base familyId == 15
  • modelId: add extModelId << 4 only when familyId is 6 or 15

Updated the AMD family ID comparisons to use the corrected values (23/25/26 decimal).

2. LL128 — Reverted completely. Understood — CPUs breaking 128B stores into 2×64B cache lines with potential reordering under relaxed ordering makes LL128 unsafe over PATH_SYS. Thanks for the explanation.

3. AMD CPU Bandwidth — Ran benchmarks on our dual-socket AMD EPYC 9575F (Turin, Zen 5) with 8× RTX PRO 6000 Blackwell:

| Test | Avg busBw | Peak busBw |
|---|---|---|
| 2-GPU cross-socket (GPU0↔GPU4, SYS) | 32.18 GB/s | 35.66 GB/s |
| 2-GPU same-socket (GPU0↔GPU1, NODE) | 37.90 GB/s | 43.92 GB/s |
| 8-GPU all (ring, 2× SYS crossings) | 21.12 GB/s | 23.86 GB/s |

Based on measured data:

  • Zen 5 (Turin): 32.0 GB/s — matches the cross-socket average
  • Zen 3/4 (Milan/Genoa): 24.0 GB/s — conservative estimate (we don't have this hardware to test)
  • Zen 1/2 (Naples/Rome): 16.0 GB/s — unchanged

The 8-GPU results are consistent with this: the ring crosses SYS twice, and the measured ~21-24 GB/s busBw aligns with a 32 GB/s per-link limit.

We don't have access to Milan/Genoa systems to validate the 24.0 value — happy to adjust if someone can run the same 2-GPU cross-socket benchmark on those platforms.

@sjeaugey
Member

sjeaugey commented Mar 31, 2026

Great, thanks. Could you fuse the two commits into one (no need to see the back and forth in the git history) [1] and use `git commit -s` to include your DCO [2]?

[1] You can do that with `git reset b10a320ad3c15c108373fdcc0efe113dfe3bfec0 && git commit -s --amend`.
[2] This is needed for all contributions, per our contributing guide.

NCCL uses a single flat bandwidth value (16 GB/s) for all AMD CPUs,
while Intel has 4 model-specific tiers. This leads to suboptimal
topology decisions on modern AMD platforms where inter-socket bandwidth
is significantly higher.

This patch:

1. Adds per-generation AMD CPU model detection using CPUID family IDs:
   - Zen 1/2 (Naples/Rome, family 23): 16 GB/s (unchanged)
   - Zen 3/4 (Milan/Genoa, family 25): 24 GB/s
   - Zen 5 (Turin, family 26): 32 GB/s

2. Fixes the CPUID familyId/modelId computation in xml.cc per x86 spec:
   - familyId: add extFamilyId only when base familyId == 15
   - modelId: add extModelId << 4 only when familyId is 6 or 15

Bandwidth values were validated on dual-socket AMD EPYC 9575F (Turin)
with cross-socket GPU P2P measurements averaging 32.18 GB/s. Zen 3/4
value (24 GB/s) is a conservative estimate pending hardware validation.

Signed-off-by: Martin Vit <martin@voipmonitor.org>
@voipmonitor force-pushed the fix/amd-intercpu-bw-and-ll128 branch from b10a320 to e735a21 on March 31, 2026 at 14:04
@voipmonitor
Author

Done — squashed both commits into one with DCO sign-off:

e735a21 Fix AMD inter-CPU bandwidth detection: add per-generation values

Signed-off-by: Martin Vit martin@voipmonitor.org

@voipmonitor
Author

LL128 safety validation on AMD EPYC 9575F (Turin)

Following up on the LL128 safety concern — we ran extensive validation tests to check for data corruption when LL128 crosses the inter-CPU socket link (Infinity Fabric) on our Turin system.

Methodology

LL128 was force-enabled via `NCCL_PROTO=LL128` on 8 GPUs (4 per socket, cross-socket ring). All tests use nccl-tests with built-in data validation (the `#wrong` column compares the AllReduce result against a CPU reference). NCCL 2.29.7+cuda13.2.

PCIe Relaxed Ordering is enabled on all GPUs (`RlxdOrd+` in `lspci -vvs`), which is the condition under which CPU cache-line reordering could theoretically occur.

Test results

| Test | Operations | Sizes | Result |
|---|---|---|---|
| AllReduce Ring, 10K iters × 6 sizes | 60,000 | 512K–3M | 0 errors |
| AllReduce Ring, 100K iters × 5 sizes | 500,000 | 256K–4M | 0 errors |
| AllReduce Tree, 100K iters × 5 sizes | 500,000 | 256K–4M | 0 errors |
| AllReduce half/float/double, 20K iters | 180,000 | 512K–2M | 0 errors |
| ReduceScatter, 50K iters × 4 sizes | 200,000 | 512K–4M | 0 errors |
| AllReduce under Infinity Fabric saturation | 100,000 | 256K–4M | 0 errors |

The saturation test ran 32 threads (16 per direction) performing continuous cross-socket memcpy of 256 MB buffers via `numactl --cpunodebind=1 --membind=0` (and the reverse), fully saturating the Infinity Fabric in both directions while the LL128 AllReduce ran simultaneously. Latency increased ~100× due to contention, but zero data corruption was observed.

Total: ~1.5 million LL128 operations across the inter-socket Infinity Fabric link, including under full memory saturation — zero corruptions.

This does not prove LL128 is safe on all AMD platforms (we can only speak for Turin/Zen 5), but it suggests that the 128B write atomicity concern may not manifest on current AMD Infinity Fabric implementations. We are using LL128 in production via a tuner plugin with an `NCCL_PROTO=LL,LL128,Simple` override.

@marksantesson
Collaborator

/mirror

@marksantesson
Collaborator

@voipmonitor , thank you for this submission! I wanted to give a quick status update... I am working on getting this brought into NCCL. I am expecting it to be part of the v2.30 Update 1 release.

@voipmonitor
Author

> @voipmonitor , thank you for this submission! I wanted to give a quick status update... I am working on getting this brought into NCCL. I am expecting it to be part of the v2.30 Update 1 release.

Hello, would you please also check #2080 ?
