How is this issue impacting you?
Lower performance than expected
Share Your Debug Logs
Hello
The current version of the ncclTopoGetInterCpuBw function does not support the GNR (Granite Rapids) family of Intel Xeon CPUs:
https://github.com/NVIDIA/nccl/blob/v2.30.3-1/src/graph/topo.cc#L73
  if (cpu->cpu.arch == NCCL_TOPO_CPU_ARCH_X86 && cpu->cpu.vendor == NCCL_TOPO_CPU_VENDOR_INTEL) {
    *bw =
      cpu->cpu.model == NCCL_TOPO_CPU_MODEL_INTEL_ERP ? ERP_QPI_BW :
      cpu->cpu.model == NCCL_TOPO_CPU_MODEL_INTEL_SRP ? SRP_QPI_BW :
      cpu->cpu.model == NCCL_TOPO_CPU_MODEL_INTEL_SKL ? SKL_QPI_BW :
      BDW_QPI_BW;
  }
I think that familyId == 6 && modelId == 0xAD will detect GNR Xeon chips. They have a UPI speed of 24 GT/s per channel, with multiple UPI links between sockets:
https://www.intel.com/content/www/us/en/products/sku/242668/intel-xeon-6507p-processor-48m-cache-3-50-ghz/specifications.html
For the NCCL graph, I think this corresponds to a GNR_QPI_BW of 48.0.
Some sources also list modelId 0xAE as GRANITERAPIDS D, but those parts are probably single-socket only.
The current version may allocate fewer channels on 2-NUMA GNR machines with multiple PCIe-only GPUs (no NVLink). With the current code I saw 'SYS[22.0]' in the NCCL_DEBUG output, and 'SYS[48.0]' after the fix. The busbw of all_reduce_perf also improved after the fix.
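To make the proposal concrete, here is a minimal standalone sketch of the detection and bandwidth override described above. It is not the actual NCCL patch: GNR_QPI_BW, isGraniteRapids and interCpuBw are hypothetical names, the 0xAD model ID is the assumption stated above, and the only numeric values used are the 22.0 and 48.0 quoted from the logs.

```cpp
#include <cassert>

// Values taken from this report: 22.0 is what the current code logged
// (SYS[22.0]); 48.0 is the value proposed for Granite Rapids.
constexpr float kObservedBw = 22.0f;
constexpr float GNR_QPI_BW = 48.0f;  // hypothetical constant, not in NCCL v2.30.3

// Assumption from this report: GNR Xeon parts report CPUID family 6,
// model 0xAD. Model 0xAE (GRANITERAPIDS D) is deliberately not matched.
constexpr bool isGraniteRapids(int familyId, int modelId) {
  return familyId == 6 && modelId == 0xAD;
}

// Returns the proposed GNR value for GNR CPUs; otherwise keeps whatever
// bandwidth the existing model selection already produced.
constexpr float interCpuBw(int familyId, int modelId, float existingBw) {
  return isGraniteRapids(familyId, modelId) ? GNR_QPI_BW : existingBw;
}

// Small usage check mirroring the before/after values from the logs.
inline void checkExample() {
  assert(interCpuBw(6, 0xAD, kObservedBw) == GNR_QPI_BW);   // GNR: overridden
  assert(interCpuBw(6, 0x8F, kObservedBw) == kObservedBw);  // other model: unchanged
}
```

In NCCL itself this would presumably become a new CPU model constant plus one more branch in the ternary chain shown above, rather than a separate helper.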
Steps to Reproduce the Issue
No response
NCCL Version
2.30.3
Your platform details
No response
Error Message & Behavior
No response