
Fix AMD inter-CPU bandwidth and LL128 protocol for multi-socket EPYC #2036

Open
voipmonitor wants to merge 1 commit into NVIDIA:master from voipmonitor:fix/amd-intercpu-bw-and-ll128

Conversation

@voipmonitor

Summary

  • AMD inter-socket bandwidth is hardcoded to 16 GB/s for all generations, while Intel has four model-specific tiers (Broadwell 6 GB/s up to Emerald Rapids 40 GB/s). Modern AMD EPYC (Milan, Genoa, Turin) Infinity Fabric delivers 32-48+ GB/s, causing NCCL to severely underestimate available bandwidth on AMD dual-socket systems.

  • LL128 protocol is unconditionally disabled for PATH_SYS connections. On AMD multi-socket systems, inter-GPU paths always traverse the SMP interconnect (PATH_SYS), forcing fallback to SIMPLE protocol. The only workaround is NCCL_GRAPH_FILE with manually crafted XML overriding path types and speeds.

Changes

1. AMD CPU generation detection (graph.h, topo.cc)

Added CPUID Family-based model detection for AMD, mirroring existing Intel logic:

| Generation | CPUID Family | NCCL familyId | Model constant |
|---|---|---|---|
| Zen 1/2 (Naples/Rome) | 17h | 0x8F | AMD_ZEN12 |
| Zen 3/4 (Milan/Genoa) | 19h | 0xAF | AMD_ZEN34 |
| Zen 5 (Turin) | 1Ah | 0xBF | AMD_ZEN5 |

2. Generation-specific inter-socket bandwidth (topo.h, topo.cc)

Replaced flat AMD_BW = 16.0 with per-generation values:

| Generation | Bandwidth | Rationale |
|---|---|---|
| Zen 1/2 | 16.0 GB/s | Unchanged (2x xGMI 1.0/2.0) |
| Zen 3/4 | 32.0 GB/s | 3-4x xGMI 3.0/4.0 links |
| Zen 5 | 48.0 GB/s | Enhanced Infinity Fabric |
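Taken together, sections 1 and 2 amount to a two-step mapping: decoded CPUID family → generation → bandwidth. A minimal sketch, using the constant names from the tables above (the helper functions are illustrative, not NCCL's actual code, and the bandwidth values are the ones proposed at this point in the PR):

```cpp
#include <cassert>

// Illustrative two-step mapping: decoded CPUID family -> generation -> GB/s.
// Constant names follow the tables above; helpers are sketches, not NCCL code.
enum AmdCpuModel { AMD_ZEN12, AMD_ZEN34, AMD_ZEN5, AMD_UNKNOWN };

// Step 1: map the decoded CPUID family (17h=23, 19h=25, 1Ah=26) to a generation.
static AmdCpuModel amdModelFromFamily(int family) {
  switch (family) {
    case 23: return AMD_ZEN12;   // Naples/Rome
    case 25: return AMD_ZEN34;   // Milan/Genoa
    case 26: return AMD_ZEN5;    // Turin
    default: return AMD_UNKNOWN;
  }
}

// Step 2: per-generation inter-socket bandwidth in GB/s, per the table above.
static double amdInterSocketBw(AmdCpuModel m) {
  switch (m) {
    case AMD_ZEN12: return 16.0;  // 2x xGMI 1.0/2.0
    case AMD_ZEN34: return 32.0;  // 3-4x xGMI 3.0/4.0 links
    case AMD_ZEN5:  return 48.0;  // enhanced Infinity Fabric
    default:        return 16.0;  // conservative fallback for unknown parts
  }
}
```

An unknown family deliberately falls back to the old flat 16 GB/s value, so unrecognized future parts are never overestimated.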

3. LL128 over PATH_SYS on AMD (tuning.cc)

For Hopper/Blackwell GPUs on AMD systems: enable LL128 when typeInter == PATH_SYS and bwInter >= 32 GB/s (Zen 3+). This allows low-latency protocols over Infinity Fabric instead of forcing SIMPLE protocol fallback.
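As a sketch, the proposed gate reduces to a small predicate. Only `typeInter`, `bwInter`, and `PATH_SYS` come from the description above; the enum values and helper name are placeholders, not NCCL's actual definitions:

```cpp
#include <cassert>

// Sketch of the proposed LL128-over-SYS gate. Placeholder types/names:
// only typeInter, bwInter, and PATH_SYS come from the PR description.
enum PathType { PATH_NVL, PATH_PIX, PATH_SYS };

static bool allowLL128(bool cpuIsAmd, PathType typeInter, double bwInter) {
  // Current NCCL behavior: LL128 is unconditionally off for PATH_SYS;
  // other path types are not affected by this change.
  if (typeInter != PATH_SYS) return true;
  // Proposed exception: AMD with Zen 3+ class Infinity Fabric (>= 32 GB/s).
  return cpuIsAmd && bwInter >= 32.0;
}
```

The 32 GB/s threshold doubles as the generation check: Zen 1/2 systems report 16 GB/s and keep the SIMPLE fallback.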

Motivation

On AMD EPYC dual-socket systems with Hopper GPUs, NCCL:

  1. Reports only 16 GB/s SYS bandwidth (vs actual 32-48+ GB/s)
  2. Disables LL128 protocol entirely for inter-socket communication
  3. Falls back to SIMPLE protocol with degraded small/medium message performance

The only current workaround is:

```
NCCL_GRAPH_FILE=/path/to/custom_graph.xml
```

with a hand-crafted XML that overrides `typeinter` and `speedinter` values.

Test plan

  • Verify CPUID family detection on AMD Naples (Family 17h) → model ZEN12, bw=16.0
  • Verify CPUID family detection on AMD Genoa (Family 19h) → model ZEN34, bw=32.0
  • Verify CPUID family detection on AMD Turin (Family 1Ah) → model ZEN5, bw=48.0
  • Run nccl-tests AllReduce on dual-socket AMD EPYC + Hopper GPUs, verify LL128 is selected for small/medium messages
  • Compare throughput with/without patch vs NCCL_GRAPH_FILE workaround
  • Verify no regression on Intel systems (unchanged code path)
  • Verify no regression on single-socket AMD systems (PATH_SYS not used)

@xiaofanl-nvidia
Collaborator

++ @thomasgillis , @marksantesson

@sjeaugey
Member

sjeaugey commented Mar 30, 2026

A few comments...

Family ID

Regarding Family ID, it looks like the NCCL code is wrong.

From Wikipedia (if you have a better source, please share):

> The actual processor family is derived from the Family ID and Extended Family ID fields. If the Family ID field is equal to 15, the family is equal to the sum of the Extended Family ID and the Family ID fields. Otherwise, the family is equal to the value of the Family ID field.

I probably didn't read that far and simply assumed that the extended family ID was a simple bit extension, like for the model ID. We should fix the code to add the extended family ID only when familyId == 15:

```diff
-    int familyId = cpuid1.familyId + (cpuid1.extFamilyId << 4);
+    int familyId = cpuid1.familyId;
+    if (familyId == 15) familyId += cpuid1.extFamilyId;
```

Beyond that change, we may also want to properly handle this:

> The actual processor model is derived from the Model, Extended Model ID and Family ID fields. If the Family ID field is either 6 or 15, the model is equal to the sum of the Extended Model ID field shifted left by 4 bits and the Model field. Otherwise, the model is equal to the value of the Model field.

Checking familyId before we add modelId and extModelId << 4:

```diff
-     int familyId = cpuid1.familyId + (cpuid1.extFamilyId << 4);
-     int modelId = cpuid1.modelId + (cpuid1.extModelId << 4);
+     int familyId = cpuid1.familyId;
+     int modelId = cpuid1.modelId;
+     if (familyId == 15 || familyId == 6) modelId += cpuid1.extModelId << 4;
+     if (familyId == 15) familyId += cpuid1.extFamilyId;
```
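Put together, the corrected derivation can be written as one self-contained helper (EAX field layout per the standard CPUID leaf 1 encoding: model bits 7:4, family bits 11:8, extended model bits 19:16, extended family bits 27:20; the function name is illustrative):

```cpp
#include <cassert>
#include <cstdint>

// Decode family/model from CPUID leaf 1 EAX per the x86 spec quoted above.
// EAX layout: model = bits 7:4, family = bits 11:8,
// extModel = bits 19:16, extFamily = bits 27:20.
static void decodeCpuid1(uint32_t eax, int* family, int* model) {
  int familyId    = (eax >> 8)  & 0xF;
  int modelId     = (eax >> 4)  & 0xF;
  int extModelId  = (eax >> 16) & 0xF;
  int extFamilyId = (eax >> 20) & 0xFF;
  // Extended model applies only when the base family is 6 or 15.
  if (familyId == 6 || familyId == 15) modelId += extModelId << 4;
  // Extended family is *added* (not a bit extension), only when family == 15.
  if (familyId == 15) familyId += extFamilyId;
  *family = familyId;
  *model  = modelId;
}
```

With this, an EPYC part reporting base family 0xF and extended family 0xA decodes to family 25 (19h, Zen 3/4), while Intel family-6 parts keep the shifted extended model.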

LL128

It is not safe to enable LL128 when data goes through the CPU. CPUs do break 128B stores into 2x64B cache lines, and may reorder them if relaxed ordering is enabled. There is no way for NCCL to know whether RO is enabled, hence we cannot enable LL128 by default.

AMD CPU Bandwidth

It would be good to run a simple 2-GPU benchmark on AMD systems of different generations to check whether we can actually reach something close to the bandwidth we add here (32 GB/s and 48 GB/s). For example, if the actual bandwidth is only 32 GB/s when it's 48 GB/s on paper, setting the value to 48 can lead to suboptimal decisions from NCCL, using many SMs when it could have used fewer.

@voipmonitor
Author

Thank you for the detailed review @sjeaugey!

Addressed all three points in the latest push:

1. Family ID fix — Fixed per your suggestion in xml.cc. The corrected formula now follows the x86 CPUID spec:

  • familyId: add extFamilyId only when base familyId == 15
  • modelId: add extModelId << 4 only when familyId is 6 or 15

Updated the AMD family ID comparisons to use the corrected values (23/25/26 decimal).

2. LL128 — Reverted completely. Understood — CPUs breaking 128B stores into 2×64B cache lines with potential reordering under relaxed ordering makes LL128 unsafe over PATH_SYS. Thanks for the explanation.

3. AMD CPU Bandwidth — Ran benchmarks on our dual-socket AMD EPYC 9575F (Turin, Zen 5) with 8× RTX PRO 6000 Blackwell:

| Test | Avg busBw | Peak busBw |
|---|---|---|
| 2-GPU cross-socket (GPU0↔GPU4, SYS) | 32.18 GB/s | 35.66 GB/s |
| 2-GPU same-socket (GPU0↔GPU1, NODE) | 37.90 GB/s | 43.92 GB/s |
| 8-GPU all (ring, 2× SYS crossings) | 21.12 GB/s | 23.86 GB/s |

Based on measured data:

  • Zen 5 (Turin): 32.0 GB/s — matches the cross-socket average
  • Zen 3/4 (Milan/Genoa): 24.0 GB/s — conservative estimate (we don't have this hardware to test)
  • Zen 1/2 (Naples/Rome): 16.0 GB/s — unchanged

The 8-GPU results are consistent with this: the ring crosses SYS twice, and the measured ~21-24 GB/s busBw aligns with a 32 GB/s per-link limit.

We don't have access to Milan/Genoa systems to validate the 24.0 value — happy to adjust if someone can run the same 2-GPU cross-socket benchmark on those platforms.

@sjeaugey
Member

sjeaugey commented Mar 31, 2026

Great, thanks. Could you fuse the two commits into one (no need to see the back and forth in the git history) [1] and use `git commit -s` to include your DCO [2]?

[1] You can do that with `git reset b10a320ad3c15c108373fdcc0efe113dfe3bfec0 && git commit -s --amend`.
[2] This is needed for all contributions, per our contributing guide.

NCCL uses a single flat bandwidth value (16 GB/s) for all AMD CPUs,
while Intel has 4 model-specific tiers. This leads to suboptimal
topology decisions on modern AMD platforms where inter-socket bandwidth
is significantly higher.

This patch:

1. Adds per-generation AMD CPU model detection using CPUID family IDs:
   - Zen 1/2 (Naples/Rome, family 23): 16 GB/s (unchanged)
   - Zen 3/4 (Milan/Genoa, family 25): 24 GB/s
   - Zen 5 (Turin, family 26): 32 GB/s

2. Fixes the CPUID familyId/modelId computation in xml.cc per x86 spec:
   - familyId: add extFamilyId only when base familyId == 15
   - modelId: add extModelId << 4 only when familyId is 6 or 15

Bandwidth values were validated on dual-socket AMD EPYC 9575F (Turin)
with cross-socket GPU P2P measurements averaging 32.18 GB/s. Zen 3/4
value (24 GB/s) is a conservative estimate pending hardware validation.

Signed-off-by: Martin Vit <martin@voipmonitor.org>
@voipmonitor force-pushed the fix/amd-intercpu-bw-and-ll128 branch from b10a320 to e735a21 on March 31, 2026 at 14:04
@voipmonitor
Author

Done — squashed both commits into one with DCO sign-off:

e735a21 Fix AMD inter-CPU bandwidth detection: add per-generation values

Signed-off-by: Martin Vit martin@voipmonitor.org

@voipmonitor
Author

LL128 safety validation on AMD EPYC 9575F (Turin)

Following up on the LL128 safety concern — we ran extensive validation tests to check for data corruption when LL128 crosses the inter-CPU socket link (Infinity Fabric) on our Turin system.

Methodology

LL128 was force-enabled via `NCCL_PROTO=LL128` on 8 GPUs (4 per socket, cross-socket ring). All tests use nccl-tests with built-in data validation (the `#wrong` column compares the AllReduce result against a CPU reference). NCCL 2.29.7+cuda13.2.

PCIe Relaxed Ordering is enabled on all GPUs (`RlxdOrd+` in `lspci -vvs`), which is the condition under which CPU cache-line reordering could theoretically occur.

Test results

| Test | Operations | Sizes | Result |
|---|---|---|---|
| AllReduce Ring, 10K iters × 6 sizes | 60,000 | 512K–3M | 0 errors |
| AllReduce Ring, 100K iters × 5 sizes | 500,000 | 256K–4M | 0 errors |
| AllReduce Tree, 100K iters × 5 sizes | 500,000 | 256K–4M | 0 errors |
| AllReduce half/float/double, 20K iters | 180,000 | 512K–2M | 0 errors |
| ReduceScatter, 50K iters × 4 sizes | 200,000 | 512K–4M | 0 errors |
| AllReduce under Infinity Fabric saturation | 100,000 | 256K–4M | 0 errors |

The saturation test ran 32 threads (16 per direction) performing continuous cross-socket memcpy of 256 MB buffers via `numactl --cpunodebind=1 --membind=0` (and the reverse), fully saturating the Infinity Fabric in both directions while the LL128 AllReduce ran simultaneously. Latency increased ~100× due to contention, but zero data corruption was observed.

Total: ~1.5 million LL128 operations across the inter-socket Infinity Fabric link, including under full memory saturation — zero corruptions.

This does not prove LL128 is safe on all AMD platforms (we can only speak for Turin/Zen 5), but it suggests that the 128B write atomicity concern may not manifest on current AMD Infinity Fabric implementations. We are using LL128 in production via a tuner plugin with an `NCCL_PROTO=LL,LL128,Simple` override.

@marksantesson
Collaborator

/mirror

@marksantesson
Collaborator

@voipmonitor , thank you for this submission! I wanted to give a quick status update... I am working on getting this brought into NCCL. I am expecting it to be part of the v2.30 Update 1 release.

@voipmonitor
Author

> @voipmonitor , thank you for this submission! I wanted to give a quick status update... I am working on getting this brought into NCCL. I am expecting it to be part of the v2.30 Update 1 release.

Hello, would you please also check #2080 ?
