Conversation
/mirror-test

/mirror-test master

/mirror-test v2.30

/mirror-test v2.30

/mirror-test master

/mirror-test master

/mirror-test master

/mirror-test master
Mirroring to the internal repository failed. The automated mirror did not complete. This is likely due to a conflict. Please ensure the PR is targeting the proper branch, and is rebased to include recent changes.

/mirror-test master

Mirroring to the internal repository failed. The automated mirror did not complete. This is likely due to a conflict. Please ensure the PR is targeting the proper branch, and is rebased to include recent changes.
Force-pushed: 168f5fb to bc881da
/mirror-test v2.30

/mirror

Force-pushed: bc881da to 33965e1

/mirror

Mirroring to the internal repository failed. The automated mirror did not complete. This is likely due to a conflict. Please ensure the PR is targeting the proper branch, and is rebased to include recent changes.
`cudaMemPoolCreate` with `maxSize=0` (the current NCCL default) causes the CUDA driver to reserve virtual address space equal to 2x the GPU's physical memory per pool. This was empirically verified on multiple internal systems (the latest one was 8x H200, 139.80 GiB each, CUDA 13.0, driver 580.126.09):

```
Process VA before pool creation:    0.04 TiB
Process VA after 20 pool creations: 5.51 TiB
VA increase: 5.47 TiB (280.00 GiB per pool)
VA per pool / device memory: 2.00x
```

The CUDA documentation only states that `maxSize=0` "defaults to a system dependent value", but the 2x ratio above, though undocumented, is consistent.

Since NCCL creates a shadow pool (`ncclShadowPool`) per communicator via `ncclShadowPoolAlloc`, the VA cost scales linearly with communicator count. On systems with a 48-bit virtual address space (256 TiB), this leads to VA exhaustion with complex multi-communicator training regimens:

- GB300 (~279 GiB HBM): ~558 GiB VA per pool
- 200 communicators x 1 pool x 558 GiB ≈ 109 TiB
- 200 communicators x 2 pools x 558 GiB ≈ 217 TiB

Affected systems include:

- ARM64 Grace (GB300/GB200/NVL72) with `CONFIG_ARM64_VA_BITS=48` and 64K pages (3-level page tables), which is the Ubuntu `linux-nvidia-64k` config
- x86_64 with 4-level page tables (48-bit / 256 TiB)
- potentially NVIDIA DGX OS 7 on Grace, which also appears to ship with 48-bit VA

Systems with 5-level page tables (57-bit / 128 PiB on x86) are not affected due to the much larger VA space.

The VA reservation is invisible to `cudaMemPoolAttrReservedMemCurrent` (which only tracks physical reservation) and can only be observed via `/proc/self/maps`. This makes the problem non-obvious to diagnose.

Note on the kernel side: the upstream Linux kernel (including NVIDIA's NV-Kernels tree) already defaults to `CONFIG_ARM64_VA_BITS_52` for ARM64, which would give 4 PiB of VA with no additional page-walk cost (still 3-level with 64K pages). However, Ubuntu's `linux-nvidia-64k` package overrides this to 48-bit.
Even with a 52-bit VA kernel, the ARM64 mmap subsystem returns addresses from the 48-bit range by default for backward compatibility; applications must pass an mmap hint above 2^48 to use the 52-bit range. Additionally, the CUDA driver currently passes 0 as the hint in `cuMemAddressReserve`, so a driver change may also be required to benefit from a 52-bit kernel. Regardless of kernel/driver changes, capping `maxSize` in NCCL is the right fix because it eliminates the wasteful VA reservation at the source.

Fix

Set `props.maxSize` before calling `cudaMemPoolCreate` in `ncclShadowPoolAlloc`. The shadow pool is used for small device-side metadata objects (plan descriptors and channel info structs allocated in 64K pages), not bulk data-transfer buffers, so a 1 GiB default cap is generous for this use case while reducing per-pool VA from ~2x device memory (280 GiB on H200, 558 GiB on GB300) to a fixed 1 GiB bound.

The cap is configurable via the `NCCL_SHADOW_MEMPOOL_MAX_SIZE` environment variable (in bytes), following NCCL's existing `NCCL_PARAM` convention:

```
# Use the default 1 GiB cap (no env var needed)

# Override to 4 GiB:
export NCCL_SHADOW_MEMPOOL_MAX_SIZE=4294967296

# Disable the cap (revert to the CUDA driver default of 2x device memory):
export NCCL_SHADOW_MEMPOOL_MAX_SIZE=0
```

Signed-off-by: Jay Skrobola <jskrobola@nvidia.com>
Force-pushed: 33965e1 to 66e833f

/mirror