Goal: Expand NCCL EP Low-Latency kernel support to include hidden=1024 and top-k up to 24, enabling LatentMoE models like NVIDIA Nemotron Super 120B.
Who benefits: Anyone serving LatentMoE models with NCCL EP over EFA or InfiniBand. LatentMoE projects tokens to a smaller dimension before dispatch, reducing RDMA traffic ~4x vs standard architectures.
Architecture/infrastructure: Validated on 2× p5en.48xlarge (16× H200), EFA 3.2 Tbps, GIN Proxy mode. Should apply to any NCCL EP LL deployment.
How it improves workflows: Currently requires rebuilding NCCL from patched source. A 3-line change would enable out-of-the-box support. We would be happy to submit a PR if preferred.
Priority: Medium — workaround exists (rebuild from source), but it adds friction for adoption.
Summary
NCCL EP Low-Latency mode currently supports hidden dimensions {2048, 2560, 4096, 5120, 6144, 7168, 8192} and top-k routing up to 9. This prevents serving LatentMoE models like NVIDIA Nemotron 3 Super 120B-A12B, which uses:
- hidden=1024 (tokens projected from 4096 to 1024 before expert dispatch)
- top-k=22 (512 routed experts, top-22 routing)
We have successfully patched NCCL EP LL to support this model and validated it end-to-end with vLLM serving on 16× H200 GPUs over EFA, processing 6,000 requests with 100% success. The patches are minimal (3 lines changed across 2 files) and we believe they could be upstreamed to benefit the broader community.
Proposed Changes
1. Add `case 1024` to `SWITCH_HIDDEN` in `contrib/nccl_ep/device/macros.cuh`
```diff
 #define SWITCH_HIDDEN(case_macro) \
   switch (hidden) { \
+    case 1024: case_macro(1024); \
     case 2048: case_macro(2048); \
     case 2560: case_macro(2560); \
```
Validation: `1024 % (32 * 8) = 1024 % 256 = 0` — passes the `EP_STATIC_ASSERT(kHidden % (32 * kNumElemsPerRead) == 0)` in the dispatch kernel.
2. Raise `kNumMaxTopK` from 9 to 24 in `contrib/nccl_ep/device/low_latency.cu` (line 638)
```diff
- constexpr int kNumMaxTopK = 9;
+ constexpr int kNumMaxTopK = 24;
```
Validation: The static assert `kNumMaxTopK + 1 <= numWarpGroups * numWarpsPerGroup` requires `25 <= numWarpGroups * numWarpsPerGroup`. For 512 experts on H200 (132 SMs): `numWarpGroups = ceil_div(512, 132) = 4`, `numWarpsPerGroup = 32 / 4 = 8`, so `4 * 8 = 32 >= 25`. Passes on any GPU with >= 22 SMs.
3. Raise `kCombineMaxTopk` from 9 to 24 in `contrib/nccl_ep/device/low_latency.cu` (line 1466)
```diff
- constexpr int kCombineMaxTopk = 9;
+ constexpr int kCombineMaxTopk = 24;
```
Validation: The combine kernel's static assert requires `kCombineMaxTopk <= 32`, which `24 <= 32` satisfies.
Motivation: LatentMoE Models
LatentMoE architectures (used by Nemotron Super and potentially other future models) project hidden states to a smaller latent dimension before expert dispatch. This has a significant advantage for Expert Parallelism: each token dispatched over the network is 4× smaller (1024 vs 4096), reducing RDMA traffic proportionally.
In our benchmarks, Nemotron Super's TPOT p50 (144–184 ms) fell between that of a 30B model with hidden=2048 (103–120 ms) and that of larger models with hidden=4096–7168 (197–245 ms), consistent with the latency benefit of the smaller dispatch payload.
Validation Results
We validated these patches on 2× p5en.48xlarge (16× H200 GPUs) with EFA + NCCL EP LL + GIN Proxy mode, using vLLM 0.18 with `enforce_eager` and `max_model_len=4096`.
Kernel-level (`ep_bench`)
All static asserts pass. Dispatch and combine complete successfully with hidden=1024, top-k=22, 512 experts.
End-to-end serving (vLLM + ShareGPT)
| Rate (req/s) | Output tok/s | TPOT p50 (ms) | ITL p50 (ms) | TTFT p50 (ms) | Success |
| --- | --- | --- | --- | --- | --- |
| 0.5 | 94.4 | 144 | 139 | 404 | 1000/1000 |
| 1.0 | 183.1 | 151 | 143 | 420 | 1000/1000 |
| 2.0 | 336.4 | 161 | 149 | 447 | 1000/1000 |
| 4.0 | 569.6 | 172 | 155 | 473 | 1000/1000 |
| 8.0 | 839.0 | 184 | 171 | 502 | 1000/1000 |
| inf | 545.7 | 312 | 149 | 2,952 | 1000/1000 |
6,000/6,000 requests succeeded (100%) across all 6 rate levels.
Note: Throughput degradation at rate=inf (545 vs 839 tok/s at rate=8) is due to Mamba-2 state cache memory pressure at 1K concurrent requests — this is a model-level characteristic, not an NCCL EP issue.
Additional Context
- The 3 patches can be applied as simple `sed` one-liners and rebuilt in ~5 minutes with `make -C contrib/nccl_ep`
- No changes to NCCL core (`libnccl.so`) are required — only `libnccl_ep.so`
- We would be happy to submit this as a PR if preferred
Other Hidden Dimensions Likely Needed
As MoE architectures diversify, these dimensions will likely be requested:
| Model Family | Hidden (dispatch) | Top-K | Status |
| --- | --- | --- | --- |
| DeepSeek-V2/R1 | 7168 | 8 | Supported |
| Qwen3 MoE | 2048/4096 | 8 | Supported |
| Mixtral | 4096 | 2 | Supported |
| Nemotron (LatentMoE) | 1024 | 22 | Needs patch |
| OLMoE | 2048 | 8 | Supported |
| Future LatentMoE | 512–1024 | variable | May need patch |
A compile-time or build-time option for additional hidden dimensions and max top-K would avoid accumulating one-off patches as new architectures emerge.
Environment
- Hardware: 2× AWS p5en.48xlarge (16× NVIDIA H200 150.1 GB HBM3e)
- Interconnect: AWS EFA 3.2 Tbps, GDRCopy 2.5.1
- NCCL: Master branch (post-v2.29.3-1) with EP + GIN Device API
- CUDA: 12.9.1
- vLLM: 0.18.0
- Model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8