[nccl-ep] Add ncclEpUpdateHandle to rebind topk_idx without reallocating by kwen2501 · Pull Request #2085 · NVIDIA/nccl

kwen2501 · 2026-04-02T01:43:58Z

Summary

Add ncclEpUpdateHandle API that rebinds topk_idx on an existing handle without reallocating GPU buffers. This avoids unnecessary cudaMalloc/cudaFree traffic (both device-synchronizing) in MoE hot loops where only the routing changes between iterations.
Refactor ncclEpCreateHandle to delegate the computation phase to ncclEpUpdateHandle, eliminating code duplication.
Fix dense_prob_buffer sizing to use max_tokens_per_rank instead of num_tokens so the buffer is reusable across different token counts.
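The sizing fix in the last item can be illustrated with a small host-side sketch. The names `EpGroupConfig`, `denseProbElems`, `denseProbCapacity`, and `fitsInBuffer` are hypothetical and do not come from nccl_ep.cc, and the assumed layout (one probability per expert per token) is a guess made for illustration. Only the idea of sizing by max_tokens_per_rank instead of num_tokens is taken from this PR.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical group-level constants; the real fields in nccl_ep.cc may differ.
struct EpGroupConfig {
    std::size_t max_tokens_per_rank;  // upper bound fixed when the group is created
    std::size_t num_experts;          // assumed: one probability per expert per token
};

// Elements needed to hold `tokens` tokens (assumed layout: tokens x num_experts).
inline std::size_t denseProbElems(std::size_t tokens, const EpGroupConfig& cfg) {
    return tokens * cfg.num_experts;
}

// Capacity when the buffer is sized once from the group-level bound.
inline std::size_t denseProbCapacity(const EpGroupConfig& cfg) {
    return denseProbElems(cfg.max_tokens_per_rank, cfg);
}

// True if a buffer of `capacity_elems` can serve an iteration with `num_tokens` tokens.
inline bool fitsInBuffer(std::size_t capacity_elems, std::size_t num_tokens,
                         const EpGroupConfig& cfg) {
    return num_tokens <= cfg.max_tokens_per_rank &&
           denseProbElems(num_tokens, cfg) <= capacity_elems;
}
```

Under these assumptions, a buffer sized from the first iteration's num_tokens is too small as soon as a later iteration carries more tokens, while a buffer sized from max_tokens_per_rank fits every token count the group allows.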

RFC: #2084

API

ncclResult_t ncclEpUpdateHandle(
    ncclEpHandle_t handle,
    const ncclNDTensor_t* topk_idx,
    ncclNDTensor_t* const* local_tensors,
    unsigned int num_local_tensors,
    cudaStream_t stream
);

Usage

ncclEpCreateHandle(&handle, group, topk_idx_0, ...);

// Each iteration: rebind routing without realloc
ncclEpUpdateHandle(handle, topk_idx_1, ...);
ncclEpDispatch(handle, ...);
ncclEpCombine(handle, ...);

ncclEpHandleDestroy(handle);
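The create-once, update-per-iteration pattern above can be modeled on the host without NCCL or CUDA. The sketch below is not the nccl_ep implementation: `MockHandle`, `mockCreate`, and `mockUpdate` are hypothetical names, a `std::vector` stands in for a GPU buffer, and treating out-of-range expert ids as "no expert" is an assumption. It only shows the split the PR describes, where creation allocates and then delegates to the update step, and the update step resets and recomputes in place.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-only stand-in for an EP handle; every name here is hypothetical.
struct MockHandle {
    std::vector<std::uint8_t> routing_map;  // stands in for a preallocated GPU buffer
    std::size_t max_tokens = 0;
    std::size_t num_experts = 0;
    std::size_t num_tokens = 0;  // tokens bound by the most recent update
    int alloc_count = 0;         // number of times the buffer was allocated
};

// Recompute phase: reset the buffer and rebuild the routing map from topk_idx.
// topk_idx holds num_tokens * top_k expert ids in row-major order.
inline bool mockUpdate(MockHandle& h, const std::vector<int>& topk_idx, std::size_t top_k) {
    if (top_k == 0 || topk_idx.size() % top_k != 0) return false;
    const std::size_t tokens = topk_idx.size() / top_k;
    if (tokens > h.max_tokens) return false;  // does not fit the preallocated buffer

    std::fill(h.routing_map.begin(), h.routing_map.end(), 0);  // reset, no realloc
    for (std::size_t t = 0; t < tokens; ++t) {
        for (std::size_t k = 0; k < top_k; ++k) {
            const int e = topk_idx[t * top_k + k];
            // Assumption: ids outside [0, num_experts) mean "no expert" and are skipped.
            if (e >= 0 && static_cast<std::size_t>(e) < h.num_experts)
                h.routing_map[t * h.num_experts + static_cast<std::size_t>(e)] = 1;
        }
    }
    h.num_tokens = tokens;
    return true;
}

// Allocation phase: size the buffer from group-level constants, then delegate.
inline bool mockCreate(MockHandle& h, std::size_t max_tokens, std::size_t num_experts,
                       const std::vector<int>& topk_idx, std::size_t top_k) {
    h.max_tokens = max_tokens;
    h.num_experts = num_experts;
    h.routing_map.assign(max_tokens * num_experts, 0);  // the only allocation
    ++h.alloc_count;
    return mockUpdate(h, topk_idx, top_k);
}
```

In this model, calling `mockUpdate` any number of times leaves `alloc_count` at 1 and the buffer address unchanged, which is the property the new API is meant to provide for the real GPU buffers.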

Changes

nccl_ep.h: Add ncclEpUpdateHandle declaration
nccl_ep.cc: Add ncclEpUpdateHandle implementation (resets buffers + re-runs convert/allgather/preprocess). Refactor ncclEpCreateHandle to be allocation-only, delegating computation to ncclEpUpdateHandle.
nccl_wrapper.py: Add ctypes binding for ncclEpUpdateHandle

Co-authored-by: Claude <noreply@anthropic.com>

Made with Cursor

…ndle`

The `ncclAllGather` and `call_metadata_preprocessing` calls are issued on the same CUDA stream, so intra-stream ordering already guarantees that the allgather completes before the preprocessing kernel launches. In addition, nothing else appears to use the allgather result, `global_routing_map`.

Co-authored-by: Claude <noreply@anthropic.com>
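The ordering argument in this commit message can be pictured with a host-only model of a single stream. This is not CUDA code: `MockStream` and `runGatherThenPreprocess` are hypothetical names, and the FIFO queue only mimics the rule that work submitted to one stream executes in submission order. The lambdas stand in for the allgather and the preprocessing kernel.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Host-only model of one stream: work items run strictly in enqueue order.
class MockStream {
public:
    void enqueue(std::function<void()> work) { pending_.push(std::move(work)); }

    // Run everything that was enqueued, first in, first out.
    void drain() {
        while (!pending_.empty()) {
            pending_.front()();
            pending_.pop();
        }
    }

private:
    std::queue<std::function<void()>> pending_;
};

// Enqueue a stand-in "allgather" and a stand-in "preprocess" on the same stream
// with no synchronization between them, then wait once at the end.
inline int runGatherThenPreprocess(const std::vector<int>& local_map, int num_ranks) {
    MockStream stream;
    std::vector<int> global_map;  // stands in for global_routing_map
    int total = -1;               // result of the stand-in preprocessing step

    // "allgather": in this toy model every rank contributes the same local map.
    stream.enqueue([&] {
        for (int r = 0; r < num_ranks; ++r)
            global_map.insert(global_map.end(), local_map.begin(), local_map.end());
    });

    // "preprocess": consumes global_map; ordering comes from the queue alone.
    stream.enqueue([&] {
        total = 0;
        for (int v : global_map) total += v;
    });

    stream.drain();  // single wait, placed where the host needs the result
    return total;
}
```

Because the second item is enqueued after the first on the same queue, it always observes the fully gathered map, so a synchronization point between the two steps adds latency without adding safety in this model.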

Signed-off-by: Ke Wen <kwen@nvidia.com>

ncclEpCreateHandle allocates ~8 GPU buffers whose sizes depend only on group-level constants, yet the current API forces a full destroy+create cycle whenever topk_idx changes between iterations. This causes unnecessary cudaMalloc/cudaFree traffic (both device-synchronizing).

Add ncclEpUpdateHandle that resets and re-runs only the topk_idx-dependent computation (convert_topk_to_routing_map, ncclAllGather, and call_metadata_preprocessing) on an existing handle. Refactor ncclEpCreateHandle to delegate to ncclEpUpdateHandle for the computation phase, eliminating code duplication.

Also fix dense_prob_buffer sizing to use max_tokens_per_rank instead of num_tokens so the buffer is large enough for reuse across different token counts.

Co-authored-by: Claude <noreply@anthropic.com>
RFC: NVIDIA#2084
Signed-off-by: Ke Wen <kwen@nvidia.com>
Made-with: Cursor

kwen2501 · 2026-04-13T22:06:09Z

/mirror

jskrobola · 2026-04-13T22:10:50Z

Mirroring to GitLab failed.

Error (summary)

git cherry-pick failed (conflicts or unsupported history). Resolve on a dev clone or use --legacy-patch if appropriate.
Caused by: Cmd('git') failed due to: exit code(1)
  cmdline: git cherry-pick 49839dfde34086ee9aaf9a1c69ae3404f13773e7..refs/mirror-pr/2085/head
  stdout: 'Auto-merging contrib/nccl_ep/nccl_ep.cc
CONFLICT (content): Merge conflict in contrib/nccl_ep/nccl_ep.cc'
  stderr: 'error: could not apply 198ef5fd9... [nccl-ep] Remove redundant `cudaStreamSynchronize` in `ncclEpCreateHandle`
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git cherry-pick --continue".
hint: You can instead skip this commit with "git cherry-pick --skip".
hint: To abort and get back to the state before "git cherry-pick",
hint: run "git cherry-pick --abort".
hint: Disable this message with "git config set advice.mergeConflict false"'

Manual fix: from a clone whose internal remote is origin, run:

cd <your-local-clone>
BASE=master
git fetch origin
git checkout -B github-pr-2085 origin/$BASE
git fetch https://github.com/NVIDIA/nccl.git pull/2085/head:refs/tmp/github-pr-2085-head
git fetch https://github.com/NVIDIA/nccl.git refs/heads/$BASE:refs/tmp/github-pr-2085-base
MERGE_BASE=$(git merge-base refs/tmp/github-pr-2085-base refs/tmp/github-pr-2085-head)
git cherry-pick "$MERGE_BASE"..refs/tmp/github-pr-2085-head

# When git stops for conflicts: edit files, then:
git add <path/to/resolved/files>
git cherry-pick --continue
# Repeat until the cherry-pick finishes (or: git cherry-pick --abort to start over).

Then push branch github-pr-2085 to your internal Git host (same ref the automation uses).

kwen2501 and others added 3 commits April 2, 2026 02:15

[nccl-ep] Remove dead code of ht struct

2e602a7

Signed-off-by: Ke Wen <kwen@nvidia.com>

kwen2501 force-pushed the add-update-handle branch from 71a21ec to df56902 Compare April 2, 2026 02:16

kwen2501 mentioned this pull request Apr 15, 2026

[RFE]: Graph-capturable NCCL EP API for CUDA Graph / Piecewise CUDA Graph integration #2104

Open

kwen2501 wants to merge 3 commits into NVIDIA:master from kwen2501:add-update-handle
