
allocator: cap shadow pool VA reservation via props.maxSize #2064

Open

sdimitro wants to merge 1 commit into NVIDIA:v2.30 from sdimitro:sd/cap-shadow-mempool-va

Conversation


@sdimitro sdimitro commented Mar 23, 2026

Problem

cudaMemPoolCreate with maxSize=0 (the current NCCL default) causes the CUDA driver to reserve virtual address space equal to 2x the GPU's physical memory per pool. This was empirically verified on multiple internal systems (most recently on 8x H200s, 139.80 GiB each, CUDA 13.0, driver 580.126.09):

Process VA before pool creation: 0.04 TiB
Process VA after 20 pool creations: 5.51 TiB
VA increase: 5.47 TiB (280.00 GiB per pool)
VA per pool / device memory: 2.00x

The CUDA documentation only states that maxSize=0 "defaults to a system dependent value", but the 2x ratio above, though undocumented, is consistent.

Since NCCL creates a shadow pool (ncclShadowPool) per communicator via ncclShadowPoolAlloc, the VA cost scales linearly with communicator count. On systems with a 48-bit virtual address space (256 TiB), this leads to VA exhaustion under complex multi-communicator training workloads:

GB300 (~279 GiB HBM): ~558 GiB VA per pool
200 communicators x 1 pool x 558 GiB ≈ 109 TiB
200 communicators x 2 pools x 558 GiB ≈ 217 TiB

Affected systems include:

  • ARM64 Grace (GB300/GB200/NVL72) with CONFIG_ARM64_VA_BITS=48 and 64K pages (3-level page tables), which is the Ubuntu linux-nvidia-64k config
  • x86_64 with 4-level page tables, 48-bit / 256 TiB
  • Potentially NVIDIA DGX OS 7 on Grace which also seems to ship with 48-bit VA

Systems with 5-level page tables (57-bit / 128 PiB on x86) are not affected due to the much larger VA space.

The VA reservation is invisible to cudaMemPoolAttrReservedMemCurrent (which only tracks physical reservation) and can only be observed via /proc/self/maps. This makes the problem non-obvious to diagnose.

Note on the kernel side: the upstream Linux kernel (including NVIDIA's NV-Kernels tree) already defaults to CONFIG_ARM64_VA_BITS_52 for ARM64, which would give 4 PiB of VA with no additional page walk cost (still 3-level with 64K pages). However, Ubuntu's linux-nvidia-64k package overrides this to 48-bit. Even with a 52-bit VA kernel, the ARM64 mmap subsystem returns addresses from the 48-bit range by default for backward compatibility. Relevant applications must pass an mmap hint above 2^48 to use the 52-bit range.

Additionally, the CUDA driver currently passes 0 as the hint in cuMemAddressReserve, so a driver change may also be required to benefit from a 52-bit kernel. Regardless of kernel/driver changes, capping maxSize in NCCL is the right fix because it eliminates the wasteful VA reservation at the source.

Fix

Set props.maxSize before calling cudaMemPoolCreate in ncclShadowPoolAlloc. The shadow pool is used for small device-side metadata objects (plan descriptors and channel info structs allocated in 64K pages), not bulk data-transfer buffers, so a 1 GiB default cap is generous for this use case while reducing per-pool VA from ~2x device memory (280 GiB on H200, 558 GiB on GB300) to a fixed 1 GiB bound.

The cap is configurable via the NCCL_SHADOW_MEMPOOL_MAX_SIZE environment variable (in bytes), following NCCL's existing NCCL_PARAM convention:

  # Use default 1 GiB cap (no env var needed)

  # Override to 4 GiB:
  export NCCL_SHADOW_MEMPOOL_MAX_SIZE=4294967296

  # Disable cap (revert to CUDA driver default of 2x device memory):
  export NCCL_SHADOW_MEMPOOL_MAX_SIZE=0

@sdimitro sdimitro closed this Mar 23, 2026
@sdimitro sdimitro reopened this Mar 23, 2026
@AddyLaddy (Collaborator)

/mirror

@sdimitro (Author)

Hi @AddyLaddy 👋 ! Thanks for queueing this for a test run - any updates?

@xiaofanl-nvidia (Collaborator)

@sdimitro thank you for the contribution. CI looks good. We are waiting for internal review at this point.
To prepare for merge, could you please add DCO signoff to your commit and rebase to v2.30 branch? I can re-mirror afterwards. Thanks!

@sdimitro sdimitro force-pushed the sd/cap-shadow-mempool-va branch from 3318c43 to 34d39ef on April 1, 2026 at 12:42
@sdimitro (Author) commented Apr 1, 2026

Hi @xiaofanl-nvidia - I just rebased onto the v2.30 branch, but now DCO fails because of a signature from a commit that isn't mine (mine has Signed-off-by: Serapheim Dimitropoulos <sdimitropoulos@coreweave.com>). Let me know if that's a problem, and feel free to re-mirror.

@sjeaugey sjeaugey changed the base branch from master to v2.30 April 1, 2026 14:53
@sjeaugey (Member) commented Apr 1, 2026

I think that's because you didn't change the target branch from master to v2.30. I changed it, and now I only see one commit (not 50).

Still, I'm struggling to get DCO to re-run on only that commit and show success.

@sdimitro sdimitro force-pushed the sd/cap-shadow-mempool-va branch from 34d39ef to 6234009 on April 1, 2026 at 15:23
@sdimitro (Author) commented Apr 1, 2026

@sjeaugey thanks! I just amended and force-pushed, and that re-ran the DCO.

@xiaofanl-nvidia (Collaborator)

/mirror

@sdimitro (Author) commented Apr 7, 2026

hi @xiaofanl-nvidia 👋 any updates on this?

@xiaofanl-nvidia (Collaborator)

Update: there was a question during code review on whether the 1 GiB default is sufficient for all use cases, especially for large NVLink domain sizes with buffers that could use more than 1 GiB. I'll check more on it now that the 2.30 release work is done.

@xiaofanl-nvidia (Collaborator) commented Apr 15, 2026

@sdimitro this was approved so I'm ready to merge the commit. However, could you please help shorten your commit message? It hits a limit and cuts off your DCO at the end.

The limit seems to be ~2000 characters and ~350 words.

Creating CUDA memory pools with the default maxSize=0 causes the driver
to reserve virtual address space equal to twice the GPU's physical memory,
leading to severe VA exhaustion on 48-bit systems during multi-communicator
training. This commit caps the shadow pool's maxSize to a default of 1 GiB.
Users can easily customize or disable this limit with a new environment
variable.

Signed-off-by: Serapheim Dimitropoulos <sdimitropoulos@coreweave.com>
@sdimitro sdimitro force-pushed the sd/cap-shadow-mempool-va branch from 6234009 to d90b422 on April 16, 2026 at 11:55
@sdimitro (Author)

@xiaofanl-nvidia Great to hear! I shortened the message per your request!

@xiaofanl-nvidia (Collaborator)

/mirror
