allocator: cap shadow pool VA reservation via props.maxSize #2064
sdimitro wants to merge 1 commit into NVIDIA:v2.30
Conversation
/mirror
Hi @AddyLaddy 👋 ! Thanks for queueing this for a test run - any updates? |
@sdimitro thank you for the contribution. CI looks good. We are waiting for internal review at this point. |
Force-pushed: 3318c43 to 34d39ef
Hi @xiaofanl-nvidia - I just rebased over the v2.30 branch, but now DCO fails because of a signature from a commit that's not mine (mine has …)
I think that's because you didn't change the target branch from master to v2.30. I changed it and now I only see one commit (not 50). Still, I'm struggling to get DCO to re-run on only that commit and show success.
Force-pushed: 34d39ef to 6234009
@sjeaugey thanks! I just amended and force-pushed, and that re-ran the DCO check.
/mirror
Hi @xiaofanl-nvidia 👋 any updates on this?
Update - there was a question during code review on whether the 1GB default is sufficient for all use cases, especially for large NVLink domain sizes where buffers could use more than 1GB. I'll look into it now that the 2.30 release work is done.
@sdimitro this was approved, so I'm ready to merge the commit. However, could you please help shorten your commit message? It hits a limit that cuts off your DCO sign-off at the end. The limit seems to be ~2000 characters and ~350 words.
Creating CUDA memory pools with the default maxSize=0 causes the driver to reserve virtual address space equal to twice the GPU's physical memory, leading to severe VA exhaustion on 48-bit systems during multi-communicator training. This commit caps the shadow pool's maxSize to a default of 1 GiB. Users can easily customize or disable this limit with a new environment variable.

Signed-off-by: Serapheim Dimitropoulos <sdimitropoulos@coreweave.com>
Force-pushed: 6234009 to d90b422
@xiaofanl-nvidia Great to hear! I shortened the message per your request! |
/mirror
Problem
cudaMemPoolCreate with maxSize=0 (the current NCCL default) causes the CUDA driver to reserve virtual address space equal to 2x the GPU's physical memory per pool. This was empirically verified on multiple internal systems (the latest one was 8xH200s, 139.80 GiB each, CUDA 13.0, driver 580.126.09):
Process VA before pool creation: 0.04 TiB
Process VA after 20 pool creations: 5.51 TiB
VA increase: 5.47 TiB (280.00 GiB per pool)
VA per pool / device memory: 2.00x
The CUDA documentation only states that maxSize=0 "defaults to a system dependent value", but the 2x ratio above, though undocumented, has been consistent across the systems we tested.
Since NCCL creates a shadow pool (ncclShadowPool) per communicator via ncclShadowPoolAlloc, the VA cost scales linearly with communicator count. On systems with a 48-bit virtual address space (256 TiB), this leads to VA exhaustion with complex multi-communicator training regimens:
GB300 (~279 GiB HBM): ~558 GiB VA per pool
200 communicators x 1 pool x 558 GiB ≈ 109 TiB
200 communicators x 2 pools x 558 GiB ≈ 217 TiB
Affected systems are those with a 48-bit virtual address space. Systems with 5-level page tables (57-bit / 128 PiB on x86) are not affected due to the much larger VA space.
The VA reservation is invisible to cudaMemPoolAttrReservedMemCurrent (which only tracks physical reservation) and can only be observed via /proc/self/maps. This makes the problem non-obvious to diagnose.
Note on the kernel side: the upstream Linux kernel (including NVIDIA's NV-Kernels tree) already defaults to CONFIG_ARM64_VA_BITS_52 for ARM64, which would give 4 PiB of VA with no additional page walk cost (still 3-level with 64K pages). However, Ubuntu's linux-nvidia-64k package overrides this to 48-bit. Even with a 52-bit VA kernel, the ARM64 mmap subsystem returns addresses from the 48-bit range by default for backward compatibility. Relevant applications must pass an mmap hint above 2^48 to use the 52-bit range.
Additionally, the CUDA driver currently passes 0 as the hint to cuMemAddressReserve, so a driver change may also be required to benefit from a 52-bit kernel. Regardless of kernel/driver changes, capping maxSize in NCCL is the right fix because it eliminates the wasteful VA reservation at the source.
Fix
Set props.maxSize before calling cudaMemPoolCreate in ncclShadowPoolAlloc. The shadow pool is used for small device-side metadata objects (plan descriptors, channel info structs allocated in 64K pages), not bulk data transfer buffers, so a 1 GiB default cap is generous for this use case while reducing per-pool VA from ~2x device memory (280 GiB on H200, 558 GiB on GB300) to a fixed 1 GiB bound.
The cap is configurable via the NCCL_SHADOW_MEMPOOL_MAX_SIZE environment variable (in bytes), following NCCL's existing NCCL_PARAM convention: