
[Question]: nccl-ep: Low-Latency Dispatch communication buffer optimization #2088

@mumupika


Hello! Thanks for your great contribution in providing the nccl-ep extension. I have a few questions about the low-latency dispatch buffer optimization described in your paper, which I couldn't quite work out from the current codebase. I am looking forward to your insights!

In Section IV.D of your paper NCCL EP: Towards a Unified Expert Parallel Communication API for NCCL, your team presents a scheme that optimizes the layout to reduce low-latency dispatch buffer usage. I quote:

> To address the issue in the dispatch, we extend the message header with routing information $R(r, t)$, avoiding layout-based routing. In the Send Tokens step (Section IV-C0a), a token is sent once per destination rank, as opposed to once per expert. The sub-region indexing is unified for dispatch $idx_D(e, r) = r$, not depending on the calling side. The dispatch buffer is reduced by a factor of number of local experts $L$, scaling as $O(N\cdot B\cdot P)$.

Based on my understanding of the original DeepEP implementation, the dispatch receive buffer is organized in a 3D layout: [expert_id][src_rank_id][slot_id]. Thus, the dispatch buffer size scales as O(Experts * numRanks * maxTokensPerRank * MsgSize) (which is also multiplied by 2 for double buffering).

However, looking into your open-source implementation, the code still seems to apply the previous DeepEP-like expert-major layout. I quote:

```cpp
const auto recvPtr = reinterpret_cast<uint64_t>(recvBuf) +
    dstExpertLocalIdx * numRanks * maxTokensPerRank * numBytesPerMsg +
    currRank * maxTokensPerRank * numBytesPerMsg +
    slotIdx * numBytesPerMsg;
size_t expectedDstOffset = recvOff +
    dstExpertLocalIdx * numRanks * maxTokensPerRank * numBytesPerMsg +
    currRank * maxTokensPerRank * numBytesPerMsg +
    slotIdx * numBytesPerMsg;
```

Here, the destination offset is still computed as `dstExpertLocalIdx * numRanks * maxTokensPerRank * ...`, which reflects the $O(E \cdot B \cdot P)$ scaling rather than the optimized $O(N \cdot B \cdot P)$ described in the paper.

Given this context, I have two questions:

  1. Is the $O(N \cdot B \cdot P)$ optimization with $R(r, t)$ header routing planned for a future PR, or is it currently maintained in an internal branch?
  2. From a technical perspective, how exactly will the scheme described in the paper be implemented using the $R(r, t)$ header? In particular, how is the memory layout on the destination rank organized so that the experts can fetch the data efficiently, without extra overhead for shuffling or re-layout? (The original DeepEP combine send uses complex TMA instructions to fetch the data efficiently.)

Thanks again for your time and excellent work!
