allocator: cap shadow pool VA reservation via props.maxSize #2064
sdimitro wants to merge 1 commit into NVIDIA:v2.30
Conversation
/mirror
Hi @AddyLaddy 👋 ! Thanks for queueing this for a test run - any updates? |
@sdimitro thank you for the contribution. CI looks good. We are waiting for internal review at this point. |
Force-pushed: 3318c43 to 34d39ef
Hi @xiaofanl-nvidia - I just rebased over the v2.30 branch, but now DCO fails because of a signature from a commit that's not mine (mine has …)
I think that's because you didn't change the target branch from master to v2.30. I changed it and now I only see one commit (not 50). Still, I'm struggling to get DCO to re-run on only that commit and show success.
Force-pushed: 34d39ef to 6234009
@sjeaugey thanks! I just amended and force-pushed, and that re-ran the DCO check.
/mirror
Hi @xiaofanl-nvidia 👋 any updates on this?
Update - there was a question during code review on whether the 1GB default is sufficient for all use cases, especially for large NVLink domain sizes where buffers could use more than 1GB. I'll look into it now that the 2.30 release work is done.
@sdimitro this was approved, so I'm ready to merge the commit. However, could you please help shorten your commit message? It hits a limit that cuts off your DCO sign-off at the end. The limit seems to be ~2000 characters and ~350 words.
Creating CUDA memory pools with the default maxSize=0 causes the driver to reserve virtual address space equal to twice the GPU's physical memory, leading to severe VA exhaustion on 48-bit systems during multi-communicator training. This commit caps the shadow pool's maxSize to a default of 1 GiB. Users can easily customize or disable this limit with a new environment variable.

Signed-off-by: Serapheim Dimitropoulos <sdimitropoulos@coreweave.com>
Force-pushed: 6234009 to d90b422
@xiaofanl-nvidia Great to hear! I shortened the message per your request! |
/mirror
Problem
cudaMemPoolCreate with maxSize=0 (the current NCCL default) causes the CUDA driver to reserve virtual address space equal to 2x the GPU's physical memory per pool. This was empirically verified on multiple internal systems (the latest one was 8xH200s, 139.80 GiB each, CUDA 13.0, driver 580.126.09):
Process VA before pool creation: 0.04 TiB
Process VA after 20 pool creations: 5.51 TiB
VA increase: 5.47 TiB (280.00 GiB per pool)
VA per pool / device memory: 2.00x
The CUDA documentation only states that maxSize=0 "defaults to a system dependent value", but the 2x ratio above, though undocumented, has been consistent across the systems we tested.
Since NCCL creates a shadow pool (ncclShadowPool) per communicator via ncclShadowPoolAlloc, the VA cost scales linearly with communicator count. On systems with a 48-bit virtual address space (256 TiB), this leads to VA exhaustion with complex multi-communicator training regimens:
GB300 (~279 GiB HBM): ~558 GiB VA per pool
200 communicators x 1 pool x 558 GiB ≈ 109 TiB
200 communicators x 2 pools x 558 GiB ≈ 217 TiB
Affected systems are those with a 48-bit virtual address space. Systems with 5-level page tables (57-bit / 128 PiB on x86) are not affected due to the much larger VA space.
The VA reservation is invisible to cudaMemPoolAttrReservedMemCurrent (which only tracks physical reservation) and can only be observed via /proc/self/maps. This makes the problem non-obvious to diagnose.
Note on the kernel side: the upstream Linux kernel (including NVIDIA's NV-Kernels tree) already defaults to CONFIG_ARM64_VA_BITS_52 for ARM64, which would give 4 PiB of VA with no additional page walk cost (still 3-level with 64K pages). However, Ubuntu's linux-nvidia-64k package overrides this to 48-bit. Even with a 52-bit VA kernel, the ARM64 mmap subsystem returns addresses from the 48-bit range by default for backward compatibility. Relevant applications must pass an mmap hint above 2^48 to use the 52-bit range.
Additionally, the CUDA driver currently passes 0 as the hint to cuMemAddressReserve, so a driver change may also be required to benefit from a 52-bit kernel. Regardless of kernel/driver changes, capping maxSize in NCCL is the right fix because it eliminates the wasteful VA reservation at the source.
Fix
Set props.maxSize before calling cudaMemPoolCreate in ncclShadowPoolAlloc. The shadow pool is used for small device-side metadata objects (plan descriptors, channel info structs allocated in 64K pages), not bulk data transfer buffers, so a 1 GiB default cap is generous for this use case while reducing per-pool VA from ~2x device memory (280 GiB on H200, 558 GiB on GB300) to a fixed 1 GiB bound.
The cap is configurable via the NCCL_SHADOW_MEMPOOL_MAX_SIZE environment variable (in bytes), following NCCL's existing NCCL_PARAM convention: