[nccl-ep] Add ncclEpUpdateHandle to rebind topk_idx without reallocating #2085

Open
kwen2501 wants to merge 3 commits into NVIDIA:master from kwen2501:add-update-handle

kwen2501 commented Apr 2, 2026

Summary

  • Add ncclEpUpdateHandle API that rebinds topk_idx on an existing handle without reallocating GPU buffers. This avoids unnecessary cudaMalloc/cudaFree traffic (both device-synchronizing) in MoE hot loops where only the routing changes between iterations.
  • Refactor ncclEpCreateHandle to delegate the computation phase to ncclEpUpdateHandle, eliminating code duplication.
  • Fix dense_prob_buffer sizing to use max_tokens_per_rank instead of num_tokens so the buffer is reusable across different token counts.
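The sizing fix in the last bullet can be illustrated with a host-only sketch. Everything in it is hypothetical: `ToyProbBuffer`, `make_prob_buffer`, `fits`, and `row_offset` are not NCCL names, a `std::vector` stands in for the device buffer, and the tokens-by-experts layout is an assumption rather than something the PR states.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side stand-in for a device buffer; not an NCCL type.
struct ToyProbBuffer {
    std::vector<float> data;      // assumed layout: capacity_tokens x num_experts
    std::size_t capacity_tokens;  // number of token rows the buffer can hold
    std::size_t num_experts;
};

// Size the buffer by the group-level upper bound, not by the token count
// of whichever call happens to create it.
ToyProbBuffer make_prob_buffer(std::size_t max_tokens_per_rank,
                               std::size_t num_experts) {
    return ToyProbBuffer{
        std::vector<float>(max_tokens_per_rank * num_experts, 0.0f),
        max_tokens_per_rank, num_experts};
}

// True when a call carrying num_tokens rows fits in the existing buffer.
bool fits(const ToyProbBuffer& buf, std::size_t num_tokens) {
    return num_tokens <= buf.capacity_tokens;
}

// Offset of a token's row; asserts that the row is inside the buffer.
std::size_t row_offset(const ToyProbBuffer& buf, std::size_t token) {
    assert(token < buf.capacity_tokens);
    return token * buf.num_experts;
}
```

In this model a buffer created with `make_prob_buffer(128, 8)` accepts any later call of up to 128 tokens, while one sized from a first call of 16 tokens would reject a later call of 64.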

RFC: #2084

API

ncclResult_t ncclEpUpdateHandle(
    ncclEpHandle_t handle,
    const ncclNDTensor_t* topk_idx,
    ncclNDTensor_t* const* local_tensors,
    unsigned int num_local_tensors,
    cudaStream_t stream
);

Usage

ncclEpCreateHandle(&handle, group, topk_idx_0, ...);

// Each iteration: rebind routing without realloc
ncclEpUpdateHandle(handle, topk_idx_1, ...);
ncclEpDispatch(handle, ...);
ncclEpCombine(handle, ...);

ncclEpHandleDestroy(handle);
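The allocate-once, update-many pattern in the usage above can be modeled on the host as a rough sketch. All names here (`ToyHandle`, `toy_create`, `toy_update`) are hypothetical and are not the NCCL API. The sketch uses `std::vector` in place of GPU buffers, omits the allgather step, and assumes that `topk_idx` is row-major `[num_tokens x top_k]` with negative entries marking unused slots.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-only model of a handle; the real handle is opaque.
struct ToyHandle {
    std::size_t max_tokens_per_rank = 0;
    std::size_t num_experts = 0;
    std::vector<std::uint8_t> routing_map;  // max_tokens_per_rank x num_experts
    std::vector<int> tokens_per_expert;     // one counter per expert
    int allocations = 0;                    // number of allocation phases run
};

// Allocation phase: buffer sizes depend only on group-level constants.
ToyHandle toy_create(std::size_t max_tokens_per_rank, std::size_t num_experts) {
    ToyHandle h;
    h.max_tokens_per_rank = max_tokens_per_rank;
    h.num_experts = num_experts;
    h.routing_map.assign(max_tokens_per_rank * num_experts, 0);
    h.tokens_per_expert.assign(num_experts, 0);
    h.allocations = 1;
    return h;
}

// Computation phase: validate, reset, then rebuild routing-dependent state.
// Never resizes a buffer, so it can run every iteration.
bool toy_update(ToyHandle& h, const std::vector<int>& topk_idx, std::size_t top_k) {
    assert(h.routing_map.size() == h.max_tokens_per_rank * h.num_experts);
    if (top_k == 0 || topk_idx.size() % top_k != 0) return false;
    const std::size_t num_tokens = topk_idx.size() / top_k;
    if (num_tokens > h.max_tokens_per_rank) return false;
    for (int e : topk_idx) {
        if (e >= 0 && static_cast<std::size_t>(e) >= h.num_experts) return false;
    }
    std::fill(h.routing_map.begin(), h.routing_map.end(), 0);
    std::fill(h.tokens_per_expert.begin(), h.tokens_per_expert.end(), 0);
    for (std::size_t t = 0; t < num_tokens; ++t) {
        for (std::size_t k = 0; k < top_k; ++k) {
            const int e = topk_idx[t * top_k + k];
            if (e < 0) continue;  // unused slot (assumption)
            const std::size_t expert = static_cast<std::size_t>(e);
            std::uint8_t& cell = h.routing_map[t * h.num_experts + expert];
            if (cell == 0) {  // count each (token, expert) pair once
                cell = 1;
                ++h.tokens_per_expert[expert];
            }
        }
    }
    return true;
}
```

Calling `toy_update` repeatedly with different routings leaves `allocations` at 1, which is the property the new API is meant to provide. The real implementation operates on device memory and also re-runs the allgather and metadata preprocessing.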

Changes

  • nccl_ep.h: Add ncclEpUpdateHandle declaration
  • nccl_ep.cc: Add ncclEpUpdateHandle implementation (resets buffers + re-runs convert/allgather/preprocess). Refactor ncclEpCreateHandle to be allocation-only, delegating computation to ncclEpUpdateHandle.
  • nccl_wrapper.py: Add ctypes binding for ncclEpUpdateHandle

Co-authored-by: Claude <noreply@anthropic.com>

Made with Cursor

kwen2501 and others added 3 commits April 2, 2026 02:15
…ndle`

`ncclAllGather` and `call_metadata_preprocessing` are enqueued on the same
CUDA stream, so in-stream ordering already guarantees that the allgather
completes before the preprocessing kernel runs. Additionally, nothing
else appears to consume the allgather result `global_routing_map`.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Ke Wen <kwen@nvidia.com>
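
The ordering argument in the commit message above can be pictured with a toy model in which a single stream is a FIFO work queue. This is only an illustration of the in-order guarantee: `ToyStream` and `demo_order` are hypothetical names, and no CUDA or NCCL calls are involved.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy model: one stream executes its work strictly in
// submission order.
class ToyStream {
public:
    void enqueue(std::function<void()> op) { ops_.push(std::move(op)); }

    // Drain the queue in FIFO order, as a single stream would.
    void run() {
        while (!ops_.empty()) {
            std::function<void()> op = std::move(ops_.front());
            ops_.pop();
            op();
        }
    }

private:
    std::queue<std::function<void()>> ops_;
};

// Enqueue "allgather" then "preprocess" with no synchronization in between
// and return the order in which they actually ran.
std::vector<std::string> demo_order() {
    ToyStream stream;
    std::vector<std::string> log;
    bool gathered = false;
    stream.enqueue([&] { gathered = true; log.push_back("allgather"); });
    stream.enqueue([&] {
        assert(gathered);  // the producer on the same stream has already run
        log.push_back("preprocess");
    });
    stream.run();
    return log;
}
```

Because the consumer is queued behind the producer on the same stream, a host-side synchronization between the two adds latency without adding any ordering guarantee.
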
ncclEpCreateHandle allocates ~8 GPU buffers whose sizes depend only on
group-level constants, yet the current API forces a full destroy+create
cycle whenever topk_idx changes between iterations. This causes
unnecessary cudaMalloc/cudaFree traffic (both device-synchronizing).

Add ncclEpUpdateHandle that resets and re-runs only the topk_idx-dependent
computation (convert_topk_to_routing_map, ncclAllGather, and
call_metadata_preprocessing) on an existing handle. Refactor
ncclEpCreateHandle to delegate to ncclEpUpdateHandle for the computation
phase, eliminating code duplication.

Also fix dense_prob_buffer sizing to use max_tokens_per_rank instead of
num_tokens so the buffer is large enough for reuse across different
token counts.

Co-authored-by: Claude <noreply@anthropic.com>
RFC: NVIDIA#2084
Signed-off-by: Ke Wen <kwen@nvidia.com>
Made-with: Cursor
kwen2501 force-pushed the add-update-handle branch from 71a21ec to df56902 April 2, 2026 02:16
@kwen2501

/mirror

@jskrobola

Mirroring to GitLab failed.

Error (summary)

git cherry-pick failed (conflicts or unsupported history). Resolve on a dev clone or use --legacy-patch if appropriate.
Caused by: Cmd('git') failed due to: exit code(1)
  cmdline: git cherry-pick 49839dfde34086ee9aaf9a1c69ae3404f13773e7..refs/mirror-pr/2085/head
  stdout: 'Auto-merging contrib/nccl_ep/nccl_ep.cc
CONFLICT (content): Merge conflict in contrib/nccl_ep/nccl_ep.cc'
  stderr: 'error: could not apply 198ef5fd9... [nccl-ep] Remove redundant `cudaStreamSynchronize` in `ncclEpCreateHandle`
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git cherry-pick --continue".
hint: You can instead skip this commit with "git cherry-pick --skip".
hint: To abort and get back to the state before "git cherry-pick",
hint: run "git cherry-pick --abort".
hint: Disable this message with "git config set advice.mergeConflict false"'

Manual fix: from a clone whose internal remote is origin, run:

cd <your-local-clone>
BASE=master
git fetch origin
git checkout -B github-pr-2085 origin/$BASE
git fetch https://github.com/NVIDIA/nccl.git pull/2085/head:refs/tmp/github-pr-2085-head
git fetch https://github.com/NVIDIA/nccl.git refs/heads/$BASE:refs/tmp/github-pr-2085-base
MERGE_BASE=$(git merge-base refs/tmp/github-pr-2085-base refs/tmp/github-pr-2085-head)
git cherry-pick "$MERGE_BASE"..refs/tmp/github-pr-2085-head
# When git stops for conflicts: edit files, then:
git add <path/to/resolved/files>
git cherry-pick --continue
# Repeat until the cherry-pick finishes (or: git cherry-pick --abort to start over).

Then push branch github-pr-2085 to your internal Git host (same ref the automation uses).
