
Test cherry-pick #2100

Open
jskrobola wants to merge 1 commit into NVIDIA:master from jskrobola:js-cherry-pick-test-2

Conversation

@jskrobola
Collaborator

Description

Related Issues

Changes & Impact

Performance Impact

@jskrobola
Collaborator Author

/mirror-test

@jskrobola
Collaborator Author

/mirror-test master

@jskrobola
Collaborator Author

/mirror-test v2.30

1 similar comment
@jskrobola
Collaborator Author

/mirror-test v2.30

@jskrobola
Collaborator Author

/mirror-test master

3 similar comments
@jskrobola
Collaborator Author

/mirror-test master

@jskrobola
Collaborator Author

/mirror-test master

@jskrobola
Collaborator Author

/mirror-test master

@jskrobola
Collaborator Author

Mirroring to the internal repository failed.

The automated mirror did not complete. This is likely due to a conflict. Please ensure the PR is targeting the proper branch, and is rebased to include recent changes.

@jskrobola
Collaborator Author

/mirror-test master

@jskrobola
Collaborator Author

Mirroring to the internal repository failed.

The automated mirror did not complete. This is likely due to a conflict. Please ensure the PR is targeting the proper branch, and is rebased to include recent changes.

@jskrobola jskrobola force-pushed the js-cherry-pick-test-2 branch from 168f5fb to bc881da on April 14, 2026 17:15
@jskrobola
Collaborator Author

/mirror-test v2.30

@jskrobola
Collaborator Author

/mirror

@jskrobola jskrobola force-pushed the js-cherry-pick-test-2 branch from bc881da to 33965e1 on April 16, 2026 14:07
@jskrobola
Collaborator Author

/mirror

@jskrobola
Collaborator Author

Mirroring to the internal repository failed.

The automated mirror did not complete. This is likely due to a conflict. Please ensure the PR is targeting the proper branch, and is rebased to include recent changes.

cudaMemPoolCreate with maxSize=0 (the current NCCL default) causes the CUDA driver to reserve virtual address space equal to 2x the GPU's physical memory per pool. This was empirically verified on multiple internal systems (the latest one was 8xH200s, 139.80 GiB each, CUDA 13.0, driver 580.126.09):

Process VA before pool creation: 0.04 TiB
Process VA after 20 pool creations: 5.51 TiB
VA increase: 5.47 TiB (280.00 GiB per pool)
VA per pool / device memory: 2.00x

The CUDA documentation states only that maxSize=0 "defaults to a system dependent value"; the 2x ratio above, though undocumented, has been consistent across the systems tested.

Since NCCL creates a shadow pool (ncclShadowPool) per communicator via ncclShadowPoolAlloc, the VA cost scales linearly with communicator count. On systems with a 48-bit virtual address space (256 TiB), this leads to VA exhaustion under complex multi-communicator training workloads:

GB300 (~279 GiB HBM): ~558 GiB VA per pool
200 communicators x 1 pool x 558 GiB ≈ 109 TiB
200 communicators x 2 pools x 558 GiB ≈ 217 TiB
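The figures above follow directly from the empirical 2x ratio; a quick arithmetic check (Python, with the device-memory sizes taken from the measurements quoted in this description):

```python
GiB = 1 << 30
TiB = 1 << 40

# Empirical ratio observed above: VA reserved per pool ~= 2x device memory.
h200_hbm = 139.80 * GiB
per_pool_h200 = 2 * h200_hbm          # ~279.6 GiB, matching the ~280 GiB measured
print(f"H200 per-pool VA: {per_pool_h200 / GiB:.1f} GiB")

gb300_hbm = 279 * GiB                 # ~279 GiB HBM (approximate)
per_pool_gb300 = 2 * gb300_hbm        # ~558 GiB per pool
va_48bit = 256 * TiB                  # 48-bit VA space

one_pool = 200 * 1 * per_pool_gb300   # 200 communicators, 1 pool each
two_pools = 200 * 2 * per_pool_gb300  # 200 communicators, 2 pools each
print(f"200 comms x 1 pool:  {one_pool / TiB:.0f} TiB")
print(f"200 comms x 2 pools: {two_pools / TiB:.0f} TiB of {va_48bit / TiB:.0f} TiB total")
```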

Affected systems include:

ARM64 Grace (GB300/GB200/NVL72) with CONFIG_ARM64_VA_BITS=48 and 64K pages (3-level page tables), which is the Ubuntu linux-nvidia-64k configuration
x86_64 with 4-level page tables (48-bit / 256 TiB)
Potentially NVIDIA DGX OS 7 on Grace, which also appears to ship with 48-bit VA

Systems with 5-level page tables (57-bit / 128 PiB on x86) are not affected due to the much larger VA space.

The VA reservation is invisible to cudaMemPoolAttrReservedMemCurrent (which only tracks physical reservation) and can only be observed via /proc/self/maps. This makes the problem non-obvious to diagnose.
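Since no CUDA counter exposes the reservation, the only user-space view is summing the ranges in /proc/self/maps. A minimal Linux-only sketch (the helper name is illustrative, not part of NCCL):

```python
def total_mapped_va():
    """Sum the sizes of all mapped VA ranges in /proc/self/maps (Linux only).

    Each line starts with "start-end" in hex; the total includes reservations
    (e.g. cuMemAddressReserve) that never consume physical memory.
    Returns 0 on systems without procfs.
    """
    total = 0
    try:
        with open("/proc/self/maps") as maps:
            for line in maps:
                start, _, end = line.split(None, 1)[0].partition("-")
                total += int(end, 16) - int(start, 16)
    except OSError:
        pass
    return total

print(f"total mapped VA: {total_mapped_va() / (1 << 30):.2f} GiB")
```

Sampling this before and after cudaMemPoolCreate is how the per-pool numbers above were obtained.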

Note on the kernel side: the upstream Linux kernel (including NVIDIA's NV-Kernels tree) already defaults to CONFIG_ARM64_VA_BITS_52 for ARM64, which would give 4 PiB of VA with no additional page walk cost (still 3-level with 64K pages). However, Ubuntu's linux-nvidia-64k package overrides this to 48-bit. Even with a 52-bit VA kernel, the ARM64 mmap subsystem returns addresses from the 48-bit range by default for backward compatibility. Relevant applications must pass an mmap hint above 2^48 to use the 52-bit range.
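The advisory-hint behavior can be demonstrated with a plain anonymous mapping; a hedged ctypes sketch (Linux-only, illustrative; on a 48-bit kernel the high hint is simply ignored and a lower address comes back, while a 52-bit ARM64 kernel would honor it):

```python
import ctypes
import sys

PROT_READ, PROT_WRITE = 0x1, 0x2
MAP_PRIVATE, MAP_ANONYMOUS = 0x02, 0x20  # Linux constants
MAP_FAILED = (1 << 64) - 1               # (void *)-1 as an unsigned value

def mmap_with_hint(hint, length=4096):
    """Request an anonymous mapping at `hint` without MAP_FIXED.

    The kernel treats the address purely as a hint: if it lies above
    TASK_SIZE (e.g. above 2**48 on a 48-bit-VA kernel), the hint is
    ignored and an address from the default range is returned instead.
    Returns the mapped address, or None on failure / non-Linux hosts.
    """
    if sys.platform != "linux":
        return None
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mmap.restype = ctypes.c_void_p
    libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                          ctypes.c_int, ctypes.c_int, ctypes.c_long]
    addr = libc.mmap(hint, length, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0)
    return None if addr in (None, MAP_FAILED) else addr

addr = mmap_with_hint(1 << 49)
if addr is not None:
    print(f"mapped at {addr:#x} (above 2**48: {addr >= 1 << 48})")
```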

Additionally, the CUDA driver currently passes 0 as the hint to cuMemAddressReserve, so a driver change may also be required to benefit from a 52-bit kernel. Regardless of kernel or driver changes, capping maxSize in NCCL is the right fix because it eliminates the wasteful VA reservation at the source.

Fix
Set props.maxSize before calling cudaMemPoolCreate in ncclShadowPoolAlloc. The shadow pool holds small device-side metadata objects (plan descriptors and channel info structs allocated in 64K pages), not bulk data-transfer buffers, so a 1 GiB default cap is generous for this use case. It reduces per-pool VA from ~2x device memory (280 GiB on H200, 558 GiB on GB300) to a fixed 1 GiB bound.

The cap is configurable via the NCCL_SHADOW_MEMPOOL_MAX_SIZE environment variable (in bytes), following NCCL's existing NCCL_PARAM convention:

  # Use default 1 GiB cap (no env var needed)

  # Override to 4 GiB:
  export NCCL_SHADOW_MEMPOOL_MAX_SIZE=4294967296

  # Disable cap (revert to CUDA driver default of 2x device memory):
  export NCCL_SHADOW_MEMPOOL_MAX_SIZE=0
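The resolution order (env var overrides a 1 GiB default; 0 disables the cap) can be sketched as follows. Python is used here only for illustration; the actual change would go through NCCL's NCCL_PARAM machinery in C++, and the function and constant names below are hypothetical:

```python
import os

DEFAULT_SHADOW_POOL_MAX = 1 << 30  # 1 GiB default cap (illustrative constant)

def shadow_pool_max_size(env=os.environ):
    """Resolve the shadow-pool cap, in bytes, from the environment.

    Returns the value to place in cudaMemPoolProps::maxSize before
    cudaMemPoolCreate. A result of 0 means "no cap", i.e. fall back to
    the CUDA driver default of ~2x device memory of VA per pool.
    """
    raw = env.get("NCCL_SHADOW_MEMPOOL_MAX_SIZE")
    if raw is None:
        return DEFAULT_SHADOW_POOL_MAX
    return max(0, int(raw))

print(shadow_pool_max_size({}))                                            # default 1 GiB
print(shadow_pool_max_size({"NCCL_SHADOW_MEMPOOL_MAX_SIZE": "4294967296"}))  # 4 GiB override
print(shadow_pool_max_size({"NCCL_SHADOW_MEMPOOL_MAX_SIZE": "0"}))         # cap disabled
```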

Signed-off-by: Jay Skrobola <jskrobola@nvidia.com>
@jskrobola jskrobola force-pushed the js-cherry-pick-test-2 branch from 33965e1 to 66e833f on April 16, 2026 14:13
@jskrobola
Collaborator Author

/mirror

