Fix AMD inter-CPU bandwidth and LL128 protocol for multi-socket EPYC #2036
voipmonitor wants to merge 1 commit into NVIDIA:master from
Conversation
A few comments...

**Family ID**

Regarding Family ID, it looks like the NCCL code is wrong. From Wikipedia (if you have a better source, please share):

I probably didn't read that far and simply assumed that the extended family ID was a simple bit extension like for the model ID. We should fix the code to just add the extended family ID if familyId == 15: Beyond that change, we may also want to properly handle this:

**Checking familyId before we add LL128**

It is not safe to enable LL128 when data goes through the CPU. CPUs do break 128B stores into 2x64B cache lines, and may reorder them if relaxed ordering is enabled. There is no way for NCCL to know whether RO is enabled, hence we cannot enable LL128 by default.

**AMD CPU Bandwidth**

It would be good to run a simple 2-GPU benchmark on AMD systems of different generations to check whether we can actually reach something close to the bandwidths we add here (32 GB/s and 48 GB/s). For example, if the actual bandwidth is only 32 GB/s when it's 48 GB/s on paper, setting the value to 48 can lead to suboptimal decisions from NCCL, using many SMs when it could have used fewer.
Thank you for the detailed review @sjeaugey! Addressed all three points in the latest push:

1. **Family ID fix** — Fixed per your suggestion in
   Updated the AMD family ID comparisons to use the corrected values (23/25/26 decimal).
2. **LL128** — Reverted completely. Understood — CPUs breaking 128B stores into 2×64B cache lines with potential reordering under relaxed ordering makes LL128 unsafe over PATH_SYS. Thanks for the explanation.
3. **AMD CPU Bandwidth** — Ran benchmarks on our dual-socket AMD EPYC 9575F (Turin, Zen 5) with 8× RTX PRO 6000 Blackwell:

Based on measured data:

The 8-GPU results are consistent: the ring crosses SYS twice, so the ~24 GB/s busBw aligns with a 32 GB/s per-link limit. We don't have access to Milan/Genoa systems to validate the 24.0 value — happy to adjust if someone can run the same 2-GPU cross-socket benchmark on those platforms.
Great, thanks. Could you fuse the two commits into one (no need to see the back and forth in the git history) [1] and use

[1] You can do that with
NCCL uses a single flat bandwidth value (16 GB/s) for all AMD CPUs, while Intel has 4 model-specific tiers. This leads to suboptimal topology decisions on modern AMD platforms where inter-socket bandwidth is significantly higher.

This patch:

1. Adds per-generation AMD CPU model detection using CPUID family IDs:
   - Zen 1/2 (Naples/Rome, family 23): 16 GB/s (unchanged)
   - Zen 3/4 (Milan/Genoa, family 25): 24 GB/s
   - Zen 5 (Turin, family 26): 32 GB/s
2. Fixes the CPUID familyId/modelId computation in xml.cc per the x86 spec:
   - familyId: add extFamilyId only when base familyId == 15
   - modelId: add extModelId << 4 only when familyId is 6 or 15

Bandwidth values were validated on a dual-socket AMD EPYC 9575F (Turin) with cross-socket GPU P2P measurements averaging 32.18 GB/s. The Zen 3/4 value (24 GB/s) is a conservative estimate pending hardware validation.

Signed-off-by: Martin Vit <martin@voipmonitor.org>
Force-pushed from b10a320 to e735a21
Done — squashed both commits into one with DCO sign-off: Signed-off-by: Martin Vit <martin@voipmonitor.org>
**LL128 safety validation on AMD EPYC 9575F (Turin)**

Following up on the LL128 safety concern — we ran extensive validation tests to check for data corruption when LL128 crosses the inter-CPU socket link (Infinity Fabric) on our Turin system.

**Methodology**

LL128 was force-enabled via
PCIe Relaxed Ordering is enabled on all GPUs (

**Test results**
The saturation test ran 32 threads (16 per direction) performing continuous cross-socket

Total: ~1.5 million LL128 operations across the inter-socket Infinity Fabric link, including under full memory saturation — zero corruptions.

This does not prove LL128 is safe on all AMD platforms (we can only speak for Turin/Zen 5), but it suggests that the 128B write atomicity concern may not manifest on current AMD Infinity Fabric implementations. We are using LL128 in production via a tuner plugin with
/mirror
@voipmonitor , thank you for this submission! I wanted to give a quick status update... I am working on getting this brought into NCCL. I am expecting it to be part of the v2.30 Update 1 release. |
Hello, would you please also check #2080 ? |
Summary
AMD inter-socket bandwidth is hardcoded to 16 GB/s for all generations, while Intel has 4 model-specific tiers (BDW 6 GB/s → ERP 40 GB/s). Modern AMD EPYC (Milan, Genoa, Turin) Infinity Fabric delivers 32-48+ GB/s, causing NCCL to severely underestimate available bandwidth on AMD dual-socket systems.
LL128 protocol is unconditionally disabled for PATH_SYS connections. On AMD multi-socket systems, inter-GPU paths always traverse the SMP interconnect (PATH_SYS), forcing fallback to the SIMPLE protocol. The only workaround is `NCCL_GRAPH_FILE` with manually crafted XML overriding path types and speeds.

Changes
1. AMD CPU generation detection (`graph.h`, `topo.cc`)

Added CPUID family-based model detection for AMD, mirroring the existing Intel logic: `AMD_ZEN12`, `AMD_ZEN34`, `AMD_ZEN5`.

2. Generation-specific inter-socket bandwidth (`topo.h`, `topo.cc`)

Replaced the flat `AMD_BW = 16.0` with per-generation values:

3. LL128 over PATH_SYS on AMD (`tuning.cc`)

For Hopper/Blackwell GPUs on AMD systems: enable LL128 when `typeInter == PATH_SYS` and `bwInter >= 32 GB/s` (Zen 3+). This allows low-latency protocols over Infinity Fabric instead of forcing SIMPLE protocol fallback.

Motivation
On AMD EPYC dual-socket systems with Hopper GPUs, NCCL:
The only current workaround is:

with a hand-crafted XML that overrides `typeinter` and `speedinter` values.

Test plan

- `NCCL_GRAPH_FILE` workaround