Add ncclNotifyTag API for user-defined profiler-v7 #2045

Open
dmonakhov wants to merge 2 commits into NVIDIA:master from dmonakhov:notify-tag

@dmonakhov

Add ncclNotifyTag API for user-defined profiler annotations (v7)

Motivation

Training frameworks (PyTorch, Megatron, etc.) need a way to correlate NCCL
collective performance with application-level phases -- forward pass, backward
pass, optimizer step, data loading, etc. Today, profiler plugins see a flat
stream of collectives with no application context.

ncclNotifyTag lets users inject lightweight string annotations into the
profiler event stream. Plugins can then attribute bandwidth/latency metrics
to specific training phases, enabling:

  • Per-phase collective performance dashboards (Prometheus/Grafana)
  • Training step boundaries in Chrome traces
  • Straggler detection scoped to specific communication patterns

Original feature request

API

ncclResult_t ncclNotifyTag(const char* tag, ncclComm_t comm, cudaStream_t stream);
  • tag: null-terminated string, max 32 bytes including terminator (NCCL_TAG_MAX_LEN)
  • comm: communicator to annotate
  • stream: unused (reserved for future use)
  • Returns ncclInvalidArgument if tag or comm is NULL

The call is synchronous and lightweight -- no CUDA operations, no allocations.
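
As a hedged illustration of the tag length limit above, an application could check a tag before passing it to ncclNotifyTag. The helper name tag_fits is hypothetical and not part of NCCL; only NCCL_TAG_MAX_LEN and its value of 32 come from this PR.

```c
#include <stddef.h>

#define NCCL_TAG_MAX_LEN 32 /* value stated in this PR */

/* Hypothetical application-side helper (not part of NCCL).
 * Returns 1 if `tag` is non-NULL and fits in NCCL_TAG_MAX_LEN bytes
 * including the terminator, 0 otherwise. */
static int tag_fits(const char *tag) {
  if (tag == NULL) return 0;
  for (size_t i = 0; i < NCCL_TAG_MAX_LEN; i++) {
    if (tag[i] == '\0') return 1; /* terminator found within the limit */
  }
  return 0; /* 32 or more characters before the terminator */
}
```

With this check, a 31-character tag is accepted and a 32-character tag is rejected, which matches the "including terminator" wording above.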

Usage example

// Annotate training phases for profiler plugins
ncclNotifyTag("forward", comm, stream);
ncclAllReduce(sbuf, rbuf, N, ncclFloat, ncclSum, comm, stream);

ncclNotifyTag("backward", comm, stream);
ncclAllReduce(sbuf, rbuf, N, ncclFloat, ncclSum, comm, stream);

Design

Profiler plugin API v7

Extends v6 with a new event type ncclProfileUserTag = (1 << 15) and a new
descriptor field:

struct {
  const char* tag;  // pointer valid only during startEvent/stopEvent
} userTag;

NCCL core delivers UserTag as a start/stop event pair. The tag pointer is valid
only for the duration of the call -- plugins must copy the string if they need
to retain it. NCCL core does zero string copying itself.
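
A minimal sketch of the plugin-side copy this implies, assuming a fixed NCCL_TAG_MAX_LEN-byte buffer. The type user_tag_record and the function user_tag_record_set are hypothetical names used only for illustration, and the NULL handling is a defensive assumption rather than something the profiler API requires.

```c
#include <stddef.h>
#include <string.h>

#define NCCL_TAG_MAX_LEN 32 /* value stated in this PR */

/* Hypothetical plugin-owned record that outlives the startEvent call. */
typedef struct {
  char tag[NCCL_TAG_MAX_LEN];
} user_tag_record;

/* Copy the tag out of the descriptor while the pointer is still valid.
 * Tags longer than the buffer are truncated; the result is always terminated. */
static void user_tag_record_set(user_tag_record *rec, const char *tag) {
  if (tag == NULL) tag = ""; /* defensive assumption, see lead-in */
  strncpy(rec->tag, tag, NCCL_TAG_MAX_LEN - 1);
  rec->tag[NCCL_TAG_MAX_LEN - 1] = '\0';
}
```

A plugin would call something like this from its startEvent handler and must not keep the original pointer after the handler returns.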

Backward compatibility: v6 and earlier plugins receive a v7 shim that filters
out UserTag events and clears the UserTag activation bit. No changes needed
in existing plugins.
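
The filtering described above can be sketched roughly as follows. The two helper functions are hypothetical and only model the two behaviors named in this description (clearing the activation bit and dropping UserTag events); the actual shim code in this PR may be structured differently.

```c
/* Bit value from this PR; declared as an enum constant here for illustration. */
enum { ncclProfileUserTag = (1 << 15) };

/* Hypothetical helper: remove the UserTag bit from an activation mask
 * so that a pre-v7 plugin never appears to request UserTag events. */
static int shim_clear_user_tag(int activationMask) {
  return activationMask & ~ncclProfileUserTag;
}

/* Hypothetical helper: returns 1 if an event of this type should be
 * forwarded to the older plugin, 0 if the shim should drop it. */
static int shim_should_forward(int eventType) {
  return eventType != ncclProfileUserTag;
}
```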

Plugin version chain

ncclProfiler_v7 (native)
  -> ncclProfiler_v6 (shim: filters UserTag events)
    -> ncclProfiler_v5 (shim: filters UserTag + CE events)
      -> v4 -> v3 -> v2 -> v1

Example plugin (plugins/profiler/example/)

Records UserTag as Chrome trace instant events:

{"name": "UserTag", "cat": "USER_TAG", "ph": "i", "s": "g",
 "args": {"tag": "forward", "rank": 0}}

Uses pool-based allocation (default pool size: 8, configurable via
NCCL_PROFILE_USER_TAG_POOL_SIZE).
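
A rough sketch of how a fixed-size pool and the instant-event output could fit together. All type and function names below are hypothetical. Dropping the event when the pool is exhausted is an assumption, the pool size is fixed at compile time here rather than read from the environment, and the formatter does no JSON escaping.

```c
#include <stdio.h>
#include <string.h>

#define NCCL_TAG_MAX_LEN 32   /* value stated in this PR */
#define USER_TAG_POOL_SIZE 8  /* default pool size stated in this PR */

/* Hypothetical slot and pool types; the example plugin's layout may differ. */
typedef struct {
  int inUse;
  int rank;
  char tag[NCCL_TAG_MAX_LEN];
} user_tag_slot;

typedef struct {
  user_tag_slot slots[USER_TAG_POOL_SIZE];
} user_tag_pool;

/* Claim a free slot and copy the tag into it without heap allocation.
 * Assumption: when every slot is busy the event is dropped (NULL). */
static user_tag_slot *pool_acquire(user_tag_pool *pool, const char *tag, int rank) {
  for (int i = 0; i < USER_TAG_POOL_SIZE; i++) {
    user_tag_slot *s = &pool->slots[i];
    if (!s->inUse) {
      s->inUse = 1;
      s->rank = rank;
      strncpy(s->tag, tag, NCCL_TAG_MAX_LEN - 1);
      s->tag[NCCL_TAG_MAX_LEN - 1] = '\0';
      return s;
    }
  }
  return NULL;
}

/* Return a slot to the pool once its event has been written out. */
static void pool_release(user_tag_slot *s) {
  if (s != NULL) s->inUse = 0;
}

/* Write one instant event with the fields shown in the JSON above.
 * The tag is inserted verbatim, so quotes or backslashes are not escaped. */
static int format_user_tag_event(char *out, size_t cap, const user_tag_slot *s) {
  return snprintf(out, cap,
                  "{\"name\": \"UserTag\", \"cat\": \"USER_TAG\", \"ph\": \"i\", "
                  "\"s\": \"g\", \"args\": {\"tag\": \"%s\", \"rank\": %d}}",
                  s->tag, s->rank);
}
```

A linear scan is enough at this pool size; a larger pool would likely want a free list instead.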

Inspector plugin (plugins/profiler/inspector/)

Captures the active tag per communicator at collective insertion time (not
completion time) for deterministic attribution. The tag appears in:

  • JSON output: "user_tag": "forward" in coll_perf records
  • Prometheus output: user_tag="forward" label on metrics

No lock needed for tag capture -- both UserTag and collective startEvent run
on the same NCCL submission thread, serialized per communicator.
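
The capture-at-insertion behavior can be illustrated with a small model. The types comm_state and coll_info and both handlers are hypothetical stand-ins for the inspector's real structures. The sketch relies on the single-thread assumption stated above and takes no lock.

```c
#include <string.h>

#define NCCL_TAG_MAX_LEN 32 /* value stated in this PR */

/* Hypothetical per-communicator state holding the most recent tag. */
typedef struct {
  char activeTag[NCCL_TAG_MAX_LEN];
} comm_state;

/* Hypothetical per-collective record. */
typedef struct {
  char userTag[NCCL_TAG_MAX_LEN];
} coll_info;

/* UserTag startEvent: remember the tag on the communicator. */
static void on_user_tag(comm_state *comm, const char *tag) {
  strncpy(comm->activeTag, tag, NCCL_TAG_MAX_LEN - 1);
  comm->activeTag[NCCL_TAG_MAX_LEN - 1] = '\0';
}

/* Collective startEvent: snapshot the tag that is active right now,
 * so a later tag change cannot alter this collective's attribution. */
static void on_coll_start(const comm_state *comm, coll_info *coll) {
  memcpy(coll->userTag, comm->activeTag, NCCL_TAG_MAX_LEN);
}
```

Because the snapshot is taken when the collective is enqueued, a collective issued under "forward" keeps that tag even if "backward" is set before it completes.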

Testing

Dockerized integration test with 13 test cases (not included in this PR):

  • Example plugin (6 tests): epoch tags, rapid fire (pool exhaustion),
    long tag truncation, empty tag, NULL comm, NULL tag
  • Inspector JSON (6 tests): same scenarios, validates user_tag in JSON
  • Inspector Prometheus (1 test): sustained collectives for 35s, validates
    user_tag label in .prom output

All tests pass on H200 with CUDA 13.0.

Introduce ncclNotifyTag(tag, comm, stream) public API that delivers
user-defined string tags to profiler plugins via start/stop event pairs.

Profiler API changes:
- Add ncclProfileUserTag event type (bit 15)
- Add profiler_v7.h with userTag descriptor (const char* tag pointer)
- Add NCCL_TAG_MAX_LEN (32) constant for plugin-side fixed buffers
- v7 plugin loader with ncclProfiler_v7 symbol lookup
- Default typedefs updated from v6 to v7

NCCL core:
- ncclNotifyTag() in notify.cc validates args, calls profiler dispatch
- ncclProfilerUserTagEvent() checks activation mask, fires start+stop

Example plugin:
- Pool-based userTag events with strncpy (no heap allocation)
- Chrome trace instant events (cat=USER_TAG, ph=i) in JSON output
- v7 entry points delegate non-UserTag events to v6


Usage example:
  // Tag a training epoch -- profiler plugins receive this as a UserTag event
  ncclNotifyTag("epoch_1", comm, stream);
  ncclAllReduce(sendbuf, recvbuf, N, ncclFloat, ncclSum, comm, stream);
  
  // Tag a different phase
  ncclNotifyTag("optimizer", comm, stream);
  ncclAllReduce(sendbuf, recvbuf, N, ncclFloat, ncclSum, comm, stream);

Inspector plugin:
- Update inspector profiler.h from v5 to v7 (add v6/v7 headers)
- Add notifyTag() support
- v7 startEvent: UserTag sets activeTag on comm, no lock needed
  (writer and reader on same NCCL submission thread per communicator)
- Tag captured in inspectorPluginCollInfoInit at collective enqueue time
- inspectorUpdateCollPerf copies userTag to completedCollInfo
- JSON output: "user_tag" field in coll_perf when non-empty
- Prometheus output: user_tag label when present
- Export ncclProfiler_v7 symbol with v7 function pointers
@gcongiu
Collaborator

gcongiu commented Mar 21, 2026

Thanks, @dmonakhov.

It seems to me that the notify API exists only to let the application talk to the profiler through NCCL. The profiler API takes the form of a plugin precisely to avoid interface proliferation in NCCL. The original idea was that any new interface needed to interact with the profiler could be exposed through the profiler plugin API rather than the NCCL API.

Can't your plugin expose a tag interface directly to the application without going through NCCL? For example, the application could talk to the plugin through an annotation API such as NVTX. I think PyTorch Kineto already uses NVTX annotations in NCCL.
