Add ncclNotifyTag API for user-defined profiler-v7 #2045
Open
dmonakhov wants to merge 2 commits into NVIDIA:master
Introduce ncclNotifyTag(tag, comm, stream) public API that delivers
user-defined string tags to profiler plugins via start/stop event pairs.
Profiler API changes:
- Add ncclProfileUserTag event type (bit 15)
- Add profiler_v7.h with userTag descriptor (const char* tag pointer)
- Add NCCL_TAG_MAX_LEN (32) constant for plugin-side fixed buffers
- v7 plugin loader with ncclProfiler_v7 symbol lookup
- Default typedefs updated from v6 to v7
NCCL core:
- ncclNotifyTag() in notify.cc validates args, calls profiler dispatch
- ncclProfilerUserTagEvent() checks activation mask, fires start+stop
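The dispatch step described in the list above can be sketched as follows. This is a reduced stand-alone mock, not the code from this commit: `mock_profiler`, `mock_user_tag_event`, and the counting callbacks are hypothetical names, and the real plugin interface carries a full event descriptor rather than a bare tag pointer. Only the bit position (15) and the "check mask, then fire start and stop" order come from the commit message.

```c
#include <stddef.h>

/* Bit position taken from the commit message. */
#define PROFILE_USER_TAG (1 << 15)

/* Hypothetical, heavily reduced plugin interface: just two callbacks. */
struct mock_profiler {
  int  (*start_event)(const char *tag, void **handle);
  void (*stop_event)(void *handle);
};

/* Fire a start/stop pair only when the plugin enabled the UserTag bit.
 * Returns 1 if the pair was delivered, 0 if the event was skipped. */
static int mock_user_tag_event(const struct mock_profiler *p,
                               int activation_mask, const char *tag) {
  void *handle = NULL;
  if (p == NULL || !(activation_mask & PROFILE_USER_TAG)) return 0;
  if (p->start_event(tag, &handle) != 0) return 0;
  p->stop_event(handle);
  return 1;
}

/* Example callbacks that only count how often they were invoked. */
static int g_starts, g_stops;

static int count_start(const char *tag, void **handle) {
  (void)tag;
  *handle = &g_starts; /* any non-NULL token works as a handle here */
  g_starts++;
  return 0;
}

static void count_stop(void *handle) {
  (void)handle;
  g_stops++;
}
```

With `struct mock_profiler p = { count_start, count_stop };`, a call with a mask that lacks bit 15 leaves both counters untouched, while a call with `PROFILE_USER_TAG` set increments each counter once.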
Example plugin:
- Pool-based userTag events with strncpy (no heap allocation)
- Chrome trace instant events (cat=USER_TAG, ph=i) in JSON output
- v7 entry points delegate non-UserTag events to v6
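A fixed pool without heap allocation, as mentioned in the list above, might look like the sketch below. It is an illustration under assumptions, not the example plugin's code: the slot layout, `pool_acquire`, and `pool_release` are hypothetical, the pool size is hard-coded here rather than read from the environment, and returning NULL on exhaustion is an assumed policy.

```c
#include <stddef.h>

#define NCCL_TAG_MAX_LEN 32   /* constant named in the commit message */
#define USER_TAG_POOL_SIZE 8  /* fixed at compile time in this sketch */

/* Hypothetical pool slot holding one UserTag event. */
struct user_tag_slot {
  int  in_use;
  char tag[NCCL_TAG_MAX_LEN];
};

/* Statically sized pool: no malloc/free on the event path. */
static struct user_tag_slot pool[USER_TAG_POOL_SIZE];

/* Return the first free slot, or NULL when the pool is exhausted
 * (assumed policy: the caller drops the event in that case). */
static struct user_tag_slot *pool_acquire(void) {
  for (size_t i = 0; i < USER_TAG_POOL_SIZE; i++) {
    if (!pool[i].in_use) {
      pool[i].in_use = 1;
      return &pool[i];
    }
  }
  return NULL;
}

/* Hand a slot back to the pool so a later event can reuse it. */
static void pool_release(struct user_tag_slot *slot) {
  slot->in_use = 0;
}
```

This sketch is not thread-safe; a plugin that receives events from several threads would need its own synchronization.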
Usage example:
// Tag a training epoch -- profiler plugins receive this as a UserTag event
ncclNotifyTag("epoch_1", comm, stream);
ncclAllReduce(sendbuf, recvbuf, N, ncclFloat, ncclSum, comm, stream);
// Tag a different phase
ncclNotifyTag("optimizer", comm, stream);
ncclAllReduce(sendbuf, recvbuf, N, ncclFloat, ncclSum, comm, stream);
- Update inspector profiler.h from v5 to v7 (add v6/v7 headers)
- Add notifyTag() support
- v7 startEvent: UserTag sets activeTag on comm, no lock needed (writer and reader on same NCCL submission thread per communicator)
- Tag captured in inspectorPluginCollInfoInit at collective enqueue time
- inspectorUpdateCollPerf copies userTag to completedCollInfo
- JSON output: "user_tag" field in coll_perf when non-empty
- Prometheus output: user_tag label when present
- Export ncclProfiler_v7 symbol with v7 function pointers
Collaborator
Thanks @dmonakhov. It seems to me like the notify API only exists to let the application talk to the profiler through NCCL. The profiler API is designed as a plugin interface to avoid interface proliferation in NCCL. The original idea was that any new interface needed to interact with the profiler could be exposed through the profiler plugin API rather than the NCCL API. Can't your plugin expose a tag interface directly to the application without going through NCCL? For example, a way for the application to talk to the plugin using an annotation API could be to use NVTX. I think PyTorch Kineto already uses NVTX annotations in NCCL.
Add ncclNotifyTag API for user-defined profiler annotations (v7)
Motivation
Training frameworks (PyTorch, Megatron, etc.) need a way to correlate NCCL
collective performance with application-level phases -- forward pass, backward
pass, optimizer step, data loading, etc. Today, profiler plugins see a flat
stream of collectives with no application context.
ncclNotifyTag lets users inject lightweight string annotations into the
profiler event stream. Plugins can then attribute bandwidth/latency metrics
to specific training phases, enabling:
Original feature request
API
- tag: null-terminated string, max 32 bytes including terminator (NCCL_TAG_MAX_LEN)
- comm: communicator to annotate
- stream: unused (reserved for future use)
- Returns ncclInvalidArgument if tag or comm is NULL
The call is synchronous and lightweight -- no CUDA operations, no allocations.
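The argument contract above can be illustrated with a small stand-alone mock. It is not the code from notify.cc: `mock_result_t`, `mock_comm_t`, `mock_stream_t`, and `mock_notify_tag` are hypothetical stand-ins so the snippet compiles without NCCL or CUDA headers. Only the NULL checks and the unused stream argument are taken from the description.

```c
#include <stddef.h>

/* Hypothetical stand-ins for ncclResult_t, ncclComm_t and cudaStream_t. */
typedef enum { MOCK_SUCCESS = 0, MOCK_INVALID_ARGUMENT = 1 } mock_result_t;
typedef struct mock_comm { int rank; } *mock_comm_t;
typedef void *mock_stream_t;

/* Mirrors the documented contract: reject a NULL tag or comm and ignore
 * the stream. The real function would go on to dispatch the tag to the
 * profiler; that step is omitted here. */
static mock_result_t mock_notify_tag(const char *tag, mock_comm_t comm,
                                     mock_stream_t stream) {
  (void)stream; /* reserved for future use */
  if (tag == NULL || comm == NULL) return MOCK_INVALID_ARGUMENT;
  return MOCK_SUCCESS;
}
```

A caller would check the result the same way as for any other call that returns a status code, for example by treating anything other than the success value as a usage error in the tagging code.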
Usage example
Design
Profiler plugin API v7
Extends v6 with a new event type ncclProfileUserTag = (1 << 15) and a new
userTag descriptor field that carries a const char* tag pointer.
NCCL core delivers UserTag as a start/stop event pair. The tag pointer is valid
only for the duration of the call -- plugins must copy the string if they need
to retain it. NCCL core does zero string copying itself.
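Because the pointer is only valid during the callback, a plugin that wants to keep the tag has to copy it before returning. A minimal sketch of such a copy into a fixed buffer is shown below; `user_tag_event` and `user_tag_event_set` are hypothetical names, and truncating over-long tags is an assumed plugin-side policy.

```c
#include <string.h>

#define NCCL_TAG_MAX_LEN 32 /* value taken from the PR description */

/* Hypothetical plugin-side event record with a fixed-size tag buffer. */
struct user_tag_event {
  char tag[NCCL_TAG_MAX_LEN];
};

/* Copy the tag while the pointer is still valid. Long tags are truncated,
 * and the buffer is always NUL-terminated because strncpy alone does not
 * guarantee termination when the source fills the buffer. */
static void user_tag_event_set(struct user_tag_event *ev, const char *tag) {
  strncpy(ev->tag, tag, NCCL_TAG_MAX_LEN - 1);
  ev->tag[NCCL_TAG_MAX_LEN - 1] = '\0';
}
```

After the copy, the plugin's record is independent of the caller's string, so the application may reuse or free its buffer as soon as the call returns.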
Backward compatibility: v6 and earlier plugins receive a v7 shim that filters
out UserTag events and clears the UserTag activation bit. No changes needed
in existing plugins.
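The two jobs of the shim, clearing the activation bit and filtering events, can be sketched as plain bit operations. The helper names below are hypothetical and the types are simplified to int; the actual shim in this PR wraps the full plugin function table.

```c
/* Bit position taken from the PR description. */
#define PROFILE_USER_TAG (1 << 15)

/* Clear the UserTag bit from the activation mask an older plugin returned,
 * so UserTag events are never generated on its behalf. */
static int shim_filter_mask(int plugin_mask) {
  return plugin_mask & ~PROFILE_USER_TAG;
}

/* Return 1 if an event of this type should be forwarded to the old plugin,
 * 0 if the shim should swallow it. */
static int shim_should_forward(int event_type) {
  return event_type != PROFILE_USER_TAG;
}
```

Doing both is a defensive choice: clearing the bit avoids the cost of generating events, and the type check protects an older plugin from a descriptor layout it does not understand.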
Plugin version chain
Example plugin (plugins/profiler/example/)
Records UserTag as Chrome trace instant events:
{"name": "UserTag", "cat": "USER_TAG", "ph": "i", "s": "g", "args": {"tag": "forward", "rank": 0}}
Uses pool-based allocation (default pool size: 8, configurable via
NCCL_PROFILE_USER_TAG_POOL_SIZE).
Inspector plugin (plugins/profiler/inspector/)
Captures the active tag per communicator at collective insertion time (not
completion time) for deterministic attribution. The tag appears in:
- JSON output: "user_tag": "forward" in coll_perf records
- Prometheus output: user_tag="forward" label on metrics
No lock needed for tag capture -- both UserTag and collective startEvent run
on the same NCCL submission thread, serialized per communicator.
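Capturing the tag at insertion time amounts to snapshotting the communicator's active tag into each collective record when the collective starts. The sketch below shows the idea with hypothetical structs and handler names; the real plugin stores this state in its own communicator and collective-info structures.

```c
#include <string.h>

#define NCCL_TAG_MAX_LEN 32 /* value taken from the PR description */

/* Hypothetical per-communicator and per-collective state. */
struct comm_state { char active_tag[NCCL_TAG_MAX_LEN]; };
struct coll_info  { char user_tag[NCCL_TAG_MAX_LEN]; };

/* UserTag start event: remember the tag as the communicator's active tag. */
static void on_user_tag(struct comm_state *comm, const char *tag) {
  strncpy(comm->active_tag, tag, NCCL_TAG_MAX_LEN - 1);
  comm->active_tag[NCCL_TAG_MAX_LEN - 1] = '\0';
}

/* Collective start event: snapshot the active tag at enqueue time, so a
 * later tag change cannot alter the attribution of this collective. */
static void on_coll_start(const struct comm_state *comm,
                          struct coll_info *coll) {
  memcpy(coll->user_tag, comm->active_tag, NCCL_TAG_MAX_LEN);
}
```

If the tag were read at completion time instead, a collective enqueued under "forward" could be reported under whatever tag happened to be active when it finished.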
Testing
Dockerized integration test with 13 test cases (not included in this PR):
- long tag truncation, empty tag, NULL comm, NULL tag
- user_tag in JSON
- user_tag label in .prom output
All tests pass on H200 with CUDA 13.0.