Question
Hello! Thank you for your great work on the nccl-ep extension. I have a few questions about the low-latency dispatch buffer optimization described in your paper, which I couldn't quite figure out from the current codebase. I am looking forward to your insights!
In Section IV.D of your paper, NCCL EP: Towards a Unified Expert Parallel Communication API for NCCL, your team describes a scheme that optimizes the buffer layout to reduce low-latency dispatch buffer usage. I quote:
To address the issue in the dispatch, we extend the message header with routing information $R(r, t)$, avoiding layout-based routing. In the Send Tokens step (Section IV-C0a), a token is sent once per destination rank, as opposed to once per expert. The sub-region indexing is unified for dispatch $idx_D(e, r) = r$, not depending on the calling side. The dispatch buffer is reduced by a factor of number of local experts $L$, scaling as $O(N\cdot B\cdot P)$.
Based on my understanding of the original DeepEP implementation, the dispatch receive buffer is organized in a 3D layout: [expert_id][src_rank_id][slot_id]. Thus, the dispatch buffer size scales as O(Experts * numRanks * maxTokensPerRank * MsgSize) (which is also multiplied by 2 for double buffering).
However, when I looked into your open-source implementation, the code still appears to use the previous DeepEP-style expert-major layout. I quote from `nccl/contrib/nccl_ep/device/low_latency.cu` (lines 483 to 489 at `49839df`):

```cpp
const auto recvPtr = reinterpret_cast<uint64_t>(recvBuf) +
                     dstExpertLocalIdx * numRanks * maxTokensPerRank * numBytesPerMsg +
                     currRank * maxTokensPerRank * numBytesPerMsg +
                     slotIdx * numBytesPerMsg;

size_t expectedDstOffset = recvOff +
                           dstExpertLocalIdx * numRanks * maxTokensPerRank * numBytesPerMsg +
                           currRank * maxTokensPerRank * numBytesPerMsg + slotIdx * numBytesPerMsg;
```
Here, the destination pointer is still computed with a leading `dstExpertLocalIdx * numRanks * maxTokensPerRank * ...` term, which matches the $O(E \cdot B \cdot P)$ scaling rather than the optimized $O(N \cdot B \cdot P)$ scaling described in the paper.
Given this context, I have two questions:
- Is the $O(N \cdot B \cdot P)$ optimization with the $R(r,t)$ header routing planned for a future PR, or is it currently maintained in an internal branch?
- From a technical perspective, how exactly will the scheme described in the paper be implemented using the $R(r, t)$ header? And how will the memory layout on the destination rank be organized so that the experts can fetch their tokens efficiently, without extra overhead for shuffling or re-layout? (The original DeepEP combine send used complex TMA instructions to fetch the data efficiently.)
Thanks again for your time and excellent work!