Goal: Expand NCCL EP Low-Latency kernel support to include hidden=1024 and top-k up to 24, enabling LatentMoE models like NVIDIA Nemotron Super 120B.
Who benefits: Anyone serving LatentMoE models with NCCL EP over EFA or InfiniBand. LatentMoE projects tokens to a smaller dimension before dispatch, reducing RDMA traffic ~4x vs standard architectures.
Architecture/infrastructure: Validated on 2× p5en.48xlarge (16× H200), EFA 3.2 Tbps, GIN Proxy mode. Should apply to any NCCL EP LL deployment.
How it improves workflows: Currently requires rebuilding NCCL from patched source. A 3-line change would enable out-of-the-box support. We would be happy to submit a PR if preferred.
Priority: Medium — workaround exists (rebuild from source), but it adds friction for adoption.
Summary
NCCL EP Low-Latency mode currently supports hidden dimensions {2048, 2560, 4096, 5120, 6144, 7168, 8192} and top-k routing up to 9. This prevents serving LatentMoE models like NVIDIA Nemotron 3 Super 120B-A12B, which uses:
- hidden=1024 (tokens projected from 4096 to 1024 before expert dispatch)
- top-k=22 (512 routed experts, top-22 routing)
We have successfully patched NCCL EP LL to support this model and validated it end-to-end with vLLM serving on 16× H200 GPUs over EFA, processing 6,000 requests with 100% success. The patches are minimal (3 lines changed across 2 files) and we believe they could be upstreamed to benefit the broader community.
Proposed Changes
1. Add `case 1024` to `SWITCH_HIDDEN` in `contrib/nccl_ep/device/macros.cuh`
```diff
 #define SWITCH_HIDDEN(case_macro) \
   switch (hidden) { \
+    case 1024: case_macro(1024); \
     case 2048: case_macro(2048); \
     case 2560: case_macro(2560); \
```
Validation: `1024 % (32 * 8) = 1024 % 256 = 0` — passes the `EP_STATIC_ASSERT(kHidden % (32 * kNumElemsPerRead) == 0)` in the dispatch kernel.
2. Raise `kNumMaxTopK` from 9 to 24 in `contrib/nccl_ep/device/low_latency.cu` (line 638)
```diff
- constexpr int kNumMaxTopK = 9;
+ constexpr int kNumMaxTopK = 24;
```
Validation: The static assert `kNumMaxTopK + 1 <= numWarpGroups * numWarpsPerGroup` requires `25 <= numWarpGroups * numWarpsPerGroup`. For 512 experts on H200 (132 SMs): `numWarpGroups = ceil_div(512, 132) = 4`, `numWarpsPerGroup = 32 / 4 = 8`, so `4 * 8 = 32 >= 25`. Passes on any GPU with >= 22 SMs.
3. Raise `kCombineMaxTopk` from 9 to 24 in `contrib/nccl_ep/device/low_latency.cu` (line 1466)
```diff
- constexpr int kCombineMaxTopk = 9;
+ constexpr int kCombineMaxTopk = 24;
```
Validation: The combine kernel's static assert requires `kCombineMaxTopk <= 32`, which `24 <= 32` satisfies.
Motivation: LatentMoE Models
LatentMoE architectures (used by Nemotron Super and potentially other future models) project hidden states to a smaller latent dimension before expert dispatch. This has a significant advantage for Expert Parallelism: each token dispatched over the network is 4× smaller (1024 vs 4096), reducing RDMA traffic proportionally.
In our benchmarks, Nemotron Super's TPOT p50 (144–184 ms) fell between that of a 30B model with hidden=2048 (103–120 ms) and that of larger models with hidden=4096–7168 (197–245 ms), consistent with the latency benefit of the smaller dispatch payload.
Validation Results
We validated these patches on 2× p5en.48xlarge (16× H200 GPUs) with EFA + NCCL EP LL + GIN Proxy mode, using vLLM 0.18 with `enforce_eager` and `max_model_len=4096`.
Kernel-level (`ep_bench`)
All static asserts pass. Dispatch and combine complete successfully with hidden=1024, top-k=22, 512 experts.
End-to-end serving (vLLM + ShareGPT)
| Rate (req/s) | Output tok/s | TPOT p50 (ms) | ITL p50 (ms) | TTFT p50 (ms) | Success |
| --- | --- | --- | --- | --- | --- |
| 0.5 | 94.4 | 144 | 139 | 404 | 1000/1000 |
| 1.0 | 183.1 | 151 | 143 | 420 | 1000/1000 |
| 2.0 | 336.4 | 161 | 149 | 447 | 1000/1000 |
| 4.0 | 569.6 | 172 | 155 | 473 | 1000/1000 |
| 8.0 | 839.0 | 184 | 171 | 502 | 1000/1000 |
| inf | 545.7 | 312 | 149 | 2,952 | 1000/1000 |
6,000/6,000 requests succeeded (100%) across all 6 rate levels.
Note: Throughput degradation at rate=inf (545 vs 839 tok/s at rate=8) is due to Mamba-2 state cache memory pressure at 1K concurrent requests — this is a model-level characteristic, not an NCCL EP issue.
Additional Context
- The 3 patches can be applied as simple `sed` one-liners and rebuilt in ~5 minutes with `make -C contrib/nccl_ep`
- No changes to NCCL core (`libnccl.so`) are required — only `libnccl_ep.so`
- We would be happy to submit this as a PR if preferred
Other Hidden Dimensions Likely Needed
As MoE architectures diversify, these dimensions will likely be requested:
| Model Family | Hidden (dispatch) | Top-K | Status |
| --- | --- | --- | --- |
| DeepSeek-V2/R1 | 7168 | 8 | Supported |
| Qwen3 MoE | 2048/4096 | 8 | Supported |
| Mixtral | 4096 | 2 | Supported |
| Nemotron (LatentMoE) | 1024 | 22 | Needs patch |
| OLMoE | 2048 | 8 | Supported |
| Future LatentMoE | 512–1024 | variable | May need patch |
A compile-time or build-time option for additional hidden dimensions and max top-K would avoid accumulating one-off patches as new architectures emerge.
Environment
- Hardware: 2× AWS p5en.48xlarge (16× NVIDIA H200 150.1 GB HBM3e)
- Interconnect: AWS EFA 3.2 Tbps, GDRCopy 2.5.1
- NCCL: Master branch (post-v2.29.3-1) with EP + GIN Device API
- CUDA: 12.9.1
- vLLM: 0.18.0
- Model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8