Skip to content

fix(policy): Fix DCGM policy listener fanout and preserve policy thresholds#119

Merged
rvatkar merged 2 commits into
mainfrom
rvatkar/bugfix/DCGM-6957-policy-listener-dispatcher
Apr 29, 2026
Merged

fix(policy): Fix DCGM policy listener fanout and preserve policy thresholds#119
rvatkar merged 2 commits into
mainfrom
rvatkar/bugfix/DCGM-6957-policy-listener-dispatcher

Conversation

@rvatkar

@rvatkar rvatkar commented Apr 28, 2026

Copy link
Copy Markdown
Collaborator

Summary

Fix DCGM policy listener callback handling by replacing shared global callback channels with a per-subscriber dispatcher. This prevents listener fanout races, callback blocking, premature cleanup, and policy threshold
clobbering while keeping public listener API signatures unchanged.

Covers:

Changes

  • Add an internal policy dispatcher with per-subscriber channels, group-aware routing, and DCGM userData registration IDs.
  • Fan out matching policy violations to all interested subscribers instead of competing on shared condition channels.
  • Use non-blocking delivery so slow consumers cannot block DCGM callback threads.
  • Preserve existing policy thresholds for Listen* APIs by merging missing conditions instead of resetting the whole policy to defaults.
  • Narrow default policy setup fallback to DCGM_ST_INSUFFICIENT_SIZE and DCGM_ST_NOT_CONFIGURED; propagate other policy-read errors.
  • Keep unregister bookkeeping until dcgmPolicyUnregister succeeds or DCGM reports no live registration can remain.
  • Expose PolicyViolationDropCount() for slow-consumer drop visibility.
  • Update C callback bridge to pass DCGM userData.
  • Add dispatcher, lifecycle, rollback, race, threshold-preservation, and error-contract tests.

Validation

  • go test ./pkg/dcgm -count=1
  • go test -race ./pkg/dcgm -run "TestPolicyDispatcherDeliverDuringUnsubscribeRace|TestPolicyDispatcherRollbackSubscription|TestPolicyDispatcherClearRegistrationKeepsStateOnUnregisterFailure| TestPolicyDispatcherSlowConsumerDrop|TestConcurrentPolicySubscribers|TestPolicyListenerLifecycle" -count=1 -v
  • go test ./pkg/dcgm -run "TestPolicyErrors|TestSetPolicyAndWatchViolations|TestSetAndGetPolicy|TestSetAndGetMultiplePolicies" -count=1 -v
  • go build ./...
  • go vet ./...

dcgm-exporter L4 Smoke Test

Built dcgm-exporter from /home/rvatkar/code/dcgm-exporter-dev against the local fixed go-dcgm branch using a temporary go.work replace.

  1. Confirm exporter uses local go-dcgm:

    cd /home/rvatkar/code/dcgm-exporter-dev
    GOWORK=/tmp/dcgm-exporter-gowork/go.work go list -m -f "{{.Path}} {{.Version}} {{.Dir}}" github.com/NVIDIA/go-dcgm
    

Expected: module resolves to /home/rvatkar/code/community/go-dcgm.

  1. Build exporter:

    GOWORK=/tmp/dcgm-exporter-gowork/go.work make binary

  2. Run exporter on L4 host with reduced L4 counters:

    /tmp/dcgm-exporter-after
    -f /tmp/dcgm-exporter-l4-counters.csv
    -a :9401
    --enable-dcgm-log
    --dcgm-log-level DEBUG

  3. Scrape metrics:

    curl -sf http://127.0.0.1:9401/metrics >/tmp/dcgm-exporter-after.metrics

    grep -E "DCGM_FI_DEV_GPU_UTIL|DCGM_FI_DEV_SM_CLOCK|DCGM_FI_DEV_FB_USED|DCGM_FI_DEV_XID_ERRORS"
    /tmp/dcgm-exporter-after.metrics

  4. Repeated scrape soak:

    for i in $(seq 1 30); do
    curl -sf http://127.0.0.1:9401/metrics >/tmp/dcgm-exporter-after-$i.metrics || exit 1
    sleep 10
    done

Result:

  • Exporter started successfully on L4.
  • Repeated scrapes succeeded.
  • L4 metrics such as DCGM_FI_DEV_SM_CLOCK, DCGM_FI_DEV_GPU_UTIL, and DCGM_FI_DEV_FB_USED were present.
  • No callback panic, deadlock, or exporter crash observed.
  • XID injection was visible when exporter connected to the same hostengine using --remote-hostengine-info localhost:5555.

Notes

No dcgm-exporter code change is required, but downstream consumers need to update github.com/NVIDIA/go-dcgm dependency to pick up this fix.

Comment thread pkg/dcgm/policy.go

@nccurry nccurry left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one small thing, otherwise LGTM

@rvatkar rvatkar merged commit 9838af5 into main Apr 29, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants