[RFE]: NCCL EP LL: support hidden=1024 and top-k > 9 for LatentMoE models (Nemotron Super) #2103

@nkumaraws

Description

Goal: Expand NCCL EP Low-Latency kernel support to include hidden=1024 and top-k up to 24, enabling LatentMoE models like NVIDIA Nemotron Super 120B.

Who benefits: Anyone serving LatentMoE models with NCCL EP over EFA or InfiniBand. LatentMoE projects tokens to a smaller dimension before dispatch, reducing RDMA traffic ~4x vs standard architectures.

Architecture/infrastructure: Validated on 2× p5en.48xlarge (16× H200), EFA 3.2 Tbps, GIN Proxy mode. Should apply to any NCCL EP LL deployment.

How it improves workflows: Currently requires rebuilding NCCL from patched source. A 3-line change would enable out-of-the-box support. We would be happy to submit a PR if preferred.

Priority: Medium — workaround exists (rebuild from source), but it adds friction for adoption.


Summary

NCCL EP Low-Latency mode currently supports hidden dimensions {2048, 2560, 4096, 5120, 6144, 7168, 8192} and top-k routing up to 9. This prevents serving LatentMoE models like NVIDIA Nemotron 3 Super 120B-A12B, which uses:

  • hidden=1024 (tokens projected from 4096 to 1024 before expert dispatch)
  • top-k=22 (512 routed experts, top-22 routing)

We have successfully patched NCCL EP LL to support this model and validated it end-to-end with vLLM serving on 16× H200 GPUs over EFA, processing 6,000 requests with 100% success. The patches are minimal (3 lines changed across 2 files) and we believe they could be upstreamed to benefit the broader community.

Proposed Changes

1. Add case 1024 to SWITCH_HIDDEN in contrib/nccl_ep/device/macros.cuh

 #define SWITCH_HIDDEN(case_macro) \
     switch (hidden) { \
+        case 1024: case_macro(1024); \
         case 2048: case_macro(2048); \
         case 2560: case_macro(2560); \

Validation: 1024 % (32 * 8) = 1024 % 256 = 0 — passes the EP_STATIC_ASSERT(kHidden % (32 * kNumElemsPerRead) == 0) in the dispatch kernel.

2. Raise kNumMaxTopK from 9 to 24 in contrib/nccl_ep/device/low_latency.cu (line 638)

-    constexpr int kNumMaxTopK = 9;
+    constexpr int kNumMaxTopK = 24;

Validation: The static assert kNumMaxTopK + 1 <= numWarpGroups * numWarpsPerGroup requires 25 <= numWarpGroups * numWarpsPerGroup. For 512 experts on H200 (132 SMs): numWarpGroups = ceil_div(512, 132) = 4, numWarpsPerGroup = 32 / 4 = 8, so 4 * 8 = 32 >= 25, with headroom. Note that the product shrinks on GPUs with far fewer SMs (more warp groups splitting the same 32-warp budget), so the assert should be re-checked on small parts, but it holds comfortably on H100/H200-class hardware.

3. Raise kCombineMaxTopk from 9 to 24 in contrib/nccl_ep/device/low_latency.cu (line 1466)

-constexpr int kCombineMaxTopk = 9;
+constexpr int kCombineMaxTopk = 24;

Validation: The combine kernel's static assert requires kCombineMaxTopk <= 32, which 24 <= 32 satisfies.

Motivation: LatentMoE Models

LatentMoE architectures (used by Nemotron Super and potentially other future models) project hidden states to a smaller latent dimension before expert dispatch. This has a significant advantage for Expert Parallelism: each token dispatched over the network is 4× smaller (1024 vs 4096), reducing RDMA traffic proportionally.

In our benchmarks, Nemotron Super's TPOT p50 (144–184 ms) was between the 30B model with hidden=2048 (103–120 ms) and the larger models with hidden=4096–7168 (197–245 ms), confirming the latency benefit of the smaller dispatch payload.

Validation Results

We validated these patches on 2× p5en.48xlarge (16× H200 GPUs) with EFA + NCCL EP LL + GIN Proxy mode, using vLLM 0.18 with enforce_eager and max_model_len=4096.

Kernel-level (ep_bench)

All static asserts pass. Dispatch and combine complete successfully with hidden=1024, top-k=22, 512 experts.

End-to-end serving (vLLM + ShareGPT)

| Rate (req/s) | Output tok/s | TPOT p50 (ms) | ITL p50 (ms) | TTFT p50 (ms) | Success |
|---|---|---|---|---|---|
| 0.5 | 94.4 | 144 | 139 | 404 | 1000/1000 |
| 1.0 | 183.1 | 151 | 143 | 420 | 1000/1000 |
| 2.0 | 336.4 | 161 | 149 | 447 | 1000/1000 |
| 4.0 | 569.6 | 172 | 155 | 473 | 1000/1000 |
| 8.0 | 839.0 | 184 | 171 | 502 | 1000/1000 |
| inf | 545.7 | 312 | 149 | 2,952 | 1000/1000 |

6,000/6,000 requests succeeded (100%) across all 6 rate levels.

Note: Throughput degradation at rate=inf (545 vs 839 tok/s at rate=8) is due to Mamba-2 state cache memory pressure at 1K concurrent requests — this is a model-level characteristic, not an NCCL EP issue.

Additional Context

  • The 3 patches can be applied as simple sed one-liners and rebuilt in ~5 minutes with make -C contrib/nccl_ep
  • No changes to NCCL core (libnccl.so) are required — only libnccl_ep.so
  • We would be happy to submit this as a PR if preferred

Other Hidden Dimensions Likely Needed

As MoE architectures diversify, these dimensions will likely be requested:

| Model Family | Hidden (dispatch) | Top-K | Status |
|---|---|---|---|
| DeepSeek-V2/R1 | 7168 | 8 | Supported |
| Qwen3 MoE | 2048/4096 | 8 | Supported |
| Mixtral | 4096 | 2 | Supported |
| Nemotron (LatentMoE) | 1024 | 22 | Needs patch |
| OLMoE | 2048 | 8 | Supported |
| Future LatentMoE | 512–1024 | variable | May need patch |

A compile-time or build-time option for additional hidden dimensions and max top-K would avoid accumulating one-off patches as new architectures emerge.

Environment

  • Hardware: 2× AWS p5en.48xlarge (16× NVIDIA H200 150.1 GB HBM3e)
  • Interconnect: AWS EFA 3.2 Tbps, GDRCopy 2.5.1
  • NCCL: Master branch (post-v2.29.3-1) with EP + GIN Device API
  • CUDA: 12.9.1
  • vLLM: 0.18.0
  • Model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8
