Question
Hello! Thank you for your great work on the nccl-ep extension. I have a few questions about the low-latency dispatch buffer optimization described in your paper, which I couldn't quite figure out from the current codebase. I am looking forward to your insights!
In Section IV.D of your paper, NCCL EP: Towards a Unified Expert Parallel Communication API for NCCL, your team describes a scheme that optimizes the buffer layout to reduce low-latency dispatch buffer usage. I quote:
To address the issue in the dispatch, we extend the message header with routing information $R(r, t)$, avoiding layout-based routing. In the Send Tokens step (Section IV-C0a), a token is sent once per destination rank, as opposed to once per expert. The sub-region indexing is unified for dispatch $idx_D(e, r) = r$, not depending on the calling side. The dispatch buffer is reduced by a factor of number of local experts $L$, scaling as $O(N\cdot B\cdot P)$.
Based on my understanding of the original DeepEP implementation, the dispatch receive buffer is organized in a 3D layout: [expert_id][src_rank_id][slot_id]. Thus, the dispatch buffer size scales as O(Experts * numRanks * maxTokensPerRank * MsgSize) (which is also multiplied by 2 for double buffering).
However, when I looked into your open-source implementation, the code still appears to use the previous DeepEP-style expert-major layout. I quote from `nccl/contrib/nccl_ep/device/low_latency.cu` (lines 483 to 489 at `49839df`):

```cpp
const auto recvPtr = reinterpret_cast<uint64_t>(recvBuf) +
                     dstExpertLocalIdx * numRanks * maxTokensPerRank * numBytesPerMsg +
                     currRank * maxTokensPerRank * numBytesPerMsg +
                     slotIdx * numBytesPerMsg;

size_t expectedDstOffset = recvOff +
                           dstExpertLocalIdx * numRanks * maxTokensPerRank * numBytesPerMsg +
                           currRank * maxTokensPerRank * numBytesPerMsg + slotIdx * numBytesPerMsg;
```
Here, the destination pointer is still computed with a leading `dstExpertLocalIdx * numRanks * maxTokensPerRank * ...` term, which matches the $O(E \cdot B \cdot P)$ scaling rather than the optimized $O(N \cdot B \cdot P)$ scaling described in the paper.
Given this context, I have two questions:
- Is the $O(N \cdot B \cdot P)$ optimization with the $R(r,t)$ header routing planned for a future PR, or is it currently maintained in an internal branch?
- From a technical perspective, how exactly will the scheme described in the paper be implemented using the $R(r, t)$ header? And how will the memory layout on the destination rank be organized so that the experts can fetch their tokens efficiently, without extra overhead for shuffling or re-layout? (The original DeepEP combine send used complex TMA instructions to fetch the data efficiently.)
Thanks again for your time and excellent work!