chore(deps): update nvidia-dcgm (patch)#8659

Merged
djsly merged 1 commit into main from renovate/patch-nvidia-dcgm on Jun 26, 2026

@renovate renovate Bot commented Jun 8, 2026

Contributor

This PR contains the following updates:

| Package | Update | Change |
|---|---|---|
| dcgm-exporter | patch | 4.8.2-1.azl3 -> 4.8.2-4.azl3 |
| dcgm-exporter | patch | 4.8.2-ubuntu24.04u1 -> 4.8.2-ubuntu24.04u4 |
| dcgm-exporter | patch | 4.8.2-ubuntu22.04u1 -> 4.8.2-ubuntu22.04u4 |

Configuration

📅 Schedule: (UTC)

  • Branch creation
    • At any time (no schedule defined)
  • Automerge
    • At any time (no schedule defined)

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about these updates again.


  • If you want to rebase/retry this PR, check this box

This PR was generated by Mend Renovate. View the repository job log.

Copilot AI review requested due to automatic review settings June 8, 2026 14:54
@renovate renovate Bot added the renovate This pull request was created by renovate label Jun 8, 2026
@renovate renovate Bot assigned djsly Jun 8, 2026

Copilot AI left a comment

Contributor

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@renovate renovate Bot requested review from djsly, ganeshkumarashok and surajssd June 8, 2026 14:54
@github-actions github-actions Bot added the components This pull request updates cached components on Linux or Windows VHDs label Jun 8, 2026
@renovate renovate Bot changed the title from "chore(deps): update nvidia-dcgm to v4.8.2-ubuntu22.04u2" to "chore(deps): update nvidia-dcgm (patch)" Jun 8, 2026
@renovate renovate Bot force-pushed the renovate/patch-nvidia-dcgm branch from 1c34d5c to 51e8bdd Compare June 8, 2026 15:14
djsly commented Jun 8, 2026

Collaborator

AgentBaker Linux PR gate — E2E failure (mixed: 3 leaves shared infra; 1 ACL leaf likely on main)

  • Run: 167166471 (failed)
  • Failed: Run AgentBaker E2E → AzureCLI exit 1 (DONE 457 tests, 95 skipped, 5 failures in 1646.77s)

Group A — shared infra/test-fixture issue, NOT this PR (3 leaves):

  • Test_Ubuntu2204Gen2_ImagePullIdentityBinding_NetworkIsolated/{default (6.57s), scriptless_nbc (0.00s)}: test_helpers.go:227 🔴 empty error, plus the parent container.
  • Same sub-7s empty-error shape has now hit 5 unrelated PRs in 48h (this PR, #8600, #8330, #8654, #8653). Confirmed systemic — needs NodeSIG-dev / E2E-infra triage of the ImagePullIdentityBinding_NetworkIsolated private-cluster/ACR-private-endpoint precondition.

Group B — ACL FIPS TL leaf, very likely existing main regression (2 leaves):

  • Test_ACLGen2FIPSTL/scriptless_nbc (265.83s) — validation.go:345 🔴: wireserver check "wireserver port 80 goalstate": unexpected curl exit code "0" (want 28 timeout or 7 refused) (plus root container).
  • The test expects WireServer port 80 to be blocked (curl exit 28=timeout or 7=refused) but got 0 (HTTP 200 reachable). That's an ACL FIPS TL firewall/network policy assertion. This PR (nvidia-dcgm patch bump in parts/common/components.json) touches GPU package versions only and has no path to ACL networking. Strongly suggests an existing ACL FIPS TL regression on main, not caused by this PR.
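
The assertion described above can be sketched as a small classifier over curl exit codes. This is an illustration only, not the code in validation.go: the function and constant names are hypothetical, and the only facts taken from the log are that exit 28 (timeout) or 7 (connection refused) count as "blocked" while 0 means WireServer was reachable.

```python
# Sketch only: hypothetical names, not AgentBaker's validator code.
CURL_COULDNT_CONNECT = 7      # curl could not connect to the host
CURL_OPERATION_TIMEDOUT = 28  # curl hit its timeout

BLOCKED_EXIT_CODES = {CURL_COULDNT_CONNECT, CURL_OPERATION_TIMEDOUT}


def wireserver_is_blocked(curl_exit_code: int) -> bool:
    """True when the exit code shows the probe could not reach WireServer."""
    return curl_exit_code in BLOCKED_EXIT_CODES


def describe(curl_exit_code: int) -> str:
    """Render a result message in the same shape as the failure quoted above."""
    if wireserver_is_blocked(curl_exit_code):
        return f"ok: curl exit code {curl_exit_code} (blocked)"
    return (
        f'unexpected curl exit code "{curl_exit_code}" '
        "(want 28 timeout or 7 refused)"
    )
```

With this shape, `describe(0)` reproduces the Group B message, which is why a successful HTTP response counts as a test failure here.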

Confidence: HIGH that this PR is not the cause of either failure group.

Recommended next action:

  1. Rerun the failing job; do not block this PR on Group A.
  2. NodeSIG-dev: file a tracker on the ACL FIPS TL wireserver-block regression in validation.go:345 (the test expectation flipped, or the ACL network policy unit shipped in the latest VHD no longer blocks WireServer); investigate against main head independently of any specific PR.
  3. NodeSIG-dev / E2E-infra: triage the ImagePullIdentityBinding_NetworkIsolated fixture (sub-7s empty failures across multiple unrelated PRs).

Strongest alternative (less likely): transient ACR-private-endpoint outage for Group A + intermittent ACL firewall rule timing for Group B — refuted because each pattern is now reproducing deterministically on every recent PR build.

Posted by Clawpilot AgentBaker gate detective.

@surajssd surajssd left a comment

Member

Do not merge until Renovate also adds support for Azure Linux. Once #8660 is merged, I won't have to say manually that this should not be merged.

@renovate renovate Bot force-pushed the renovate/patch-nvidia-dcgm branch from 51e8bdd to f4fe0e0 Compare June 9, 2026 07:03
Copilot AI review requested due to automatic review settings June 9, 2026 07:03

Copilot AI left a comment

Contributor

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@renovate renovate Bot force-pushed the renovate/patch-nvidia-dcgm branch from f4fe0e0 to 6310c4d Compare June 10, 2026 03:01
Copilot AI review requested due to automatic review settings June 10, 2026 06:02
@renovate renovate Bot force-pushed the renovate/patch-nvidia-dcgm branch from 6310c4d to d445cbf Compare June 10, 2026 06:02

Copilot AI left a comment

Contributor

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@aks-node-assistant

Contributor

AgentBaker Linux PR gate — 236-failure run: shared cluster fleet outage continues (test-infra, NOT this PR)

  • Run: 167422694 (failed)
  • Failed task: Run AgentBaker E2E (full 60-minute timeout consumed)
  • Test summary: DONE 402 tests, 95 skipped, 236 failures in ~3616s (~59% failure rate; 0 fwupd hits)
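
The ~59% figure follows from dividing failures by total tests (236 / 402). A minimal sketch of that arithmetic, assuming the summary line always has the `DONE N tests, N skipped, N failures` shape quoted above; the function name is hypothetical:

```python
import re

# Matches the summary shape quoted in this thread; other shapes are not handled.
SUMMARY_RE = re.compile(
    r"DONE (?P<tests>\d+) tests, (?P<skipped>\d+) skipped, (?P<failures>\d+) failures"
)


def failure_rate(summary: str) -> float:
    """Return failures / tests for a test summary line."""
    match = SUMMARY_RE.search(summary)
    if match is None:
        raise ValueError(f"unrecognized summary line: {summary!r}")
    tests = int(match.group("tests"))
    failures = int(match.group("failures"))
    return failures / tests if tests else 0.0


rate = failure_rate("DONE 402 tests, 95 skipped, 236 failures in ~3616s")
print(f"{rate:.1%}")
```

Skipped tests are left in the denominator, which matches the ~59% quoted in the comment.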

Same shared cluster fleet outage affecting every concurrent PR in this window: 123× get or create cluster: failed to wait for cluster abe2e-kubenet-v5-150ee to be ready: context deadline exceeded plus 36× ResourceGroupDeletionBlocked on shared MC RGs. Earlier overnight runs hit ~11 min; current runs consume the full 60-min E2E timeout, indicating the fleet is worse, not recovering.

Cross-PR pattern this morning: PR #8652 build 167419663, PR #8679 build 167421198, PR #8294 build 167422687, and concurrent PRs all hit identical 236-fail / cluster-not-ready signature.

Build-vs-test: test-infra (shared cluster fleet outage), NOT product, NOT PR-caused.
This PR's exposure check: nvidia-dcgm renovate patch bump (GPU monitoring). No path to shared test cluster lifecycle.
Confidence: HIGH that PR #8659 is not the cause.

Recommended next action / owner: ⚠️ E2E infra / NodeSIG-dev — urgent shared cluster fleet restoration required (abe2e-kubenet-v5-*, abe2e-latest-kubernetes-version-v2-*, abe2e-azure-networkisolated-v2-*, abe2e-azure-v4-*, abe2e-azure-bootstrapprofile-cache-v2-*); clear ResourceGroupDeletionBlocked locks. PR gate is effectively offline until restored. PR author: rerun once fleet recovers.

Posted by Clawpilot AgentBaker gate detective.

@renovate renovate Bot force-pushed the renovate/patch-nvidia-dcgm branch from d445cbf to 313d630 Compare June 10, 2026 15:51
@aks-node-assistant

Contributor

AgentBaker Linux PR gate — 3 distinct E2E failures, all test-infra / shared-cluster (NOT this PR)

  • Run: 167493131
  • Failed job: Run AgentBaker E2E (all VHD builds passed)
  • Failed scenarios: Test_AzureLinux_Skip_Binary_Cleanup/{default,scriptless_nbc}, Test_AzureLinuxV3_CustomSysctls/default, Test_Ubuntu2204_HTTPSProxy_PrivateDNS/{default,scriptless_nbc} (5 subtests across 3 scenarios)

Detective summary — two independent signatures

(1) wireserver-blocking-validator-assertion — both AzureLinux scenarios on shared cluster abe2e-kubenet-v5-150ee:

🔴 FAIL: wireserver check "wireserver port 80 goalstate":
        unexpected curl exit code "0" (want 28 timeout or 7 refused)

The validator asserts that the node's iptables rule blocks egress to 168.63.129.16:80. iptables shows the DROP rule is present (DROP ... 168.63.129.16 tcp dpt:80), but the test's curl still returns exit 0 — the rule isn't taking effect for the test's connection (likely matched against a stale conntrack/TIME_WAIT entry; logs show pre-existing 168.63.129.16 flows in conntrack). This is the same signature as build 167348372; second occurrence. Wiki: wireserver-blocking-validator-assertion.

(2) httpsproxy-fixture-proxy-unreachable — both HTTPSProxy_PrivateDNS subtests on shared cluster abe2e-azure-network-v4-ce2ad:

VMExtensionProvisioningError ... vmssCSE exit 99
W: Failed to fetch https://packages.microsoft.com/ubuntu/22.04/prod/dists/jammy/InRelease
   Could not connect to 10.14.0.193:8888 (10.14.0.193). - connect (111: Connection refused)

The CSE retries apt-get update 10 times against the scenario's HTTP proxy at 10.14.0.193:8888; the proxy endpoint refuses every attempt and CSE exits 99. The proxy is part of the test fixture (private DNS / HTTPS proxy scenario infra), not anything this PR touches. New signature.
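
To tell a refused connection apart from a timeout when checking the fixture, a plain TCP probe is enough. The sketch below uses only the Python standard library; the function name is hypothetical, and it would have to run from a host that can route to the proxy address.

```python
import socket


def probe_tcp(host: str, port: int, timeout_s: float = 3.0) -> str:
    """Classify a TCP connect attempt as open, refused, timeout, or another error."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return "open"
    except ConnectionRefusedError:
        # Same condition as "connect (111: Connection refused)" in the apt log.
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"
```

A "refused" result for `probe_tcp("10.14.0.193", 8888)` would point at the proxy process not listening rather than at a network path problem.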

Classification: Test infrastructure / shared-cluster issues. Neither failure is reachable from changes in PR #8659 (renovate nvidia-dcgm patch — does not touch wireserver/iptables, CSE, apt sources, or the proxy fixture).

Confidence: High for both (multiple subtests, identical signatures, no PR-relevant changed files, all VHD builds passed).

Strongest alternative theory: Recent change to aks-node CSE / iptables wiring that lets the wireserver block "miss" — less likely because the rule is present and counters match expected DROP behavior in the chain dump; the leak is at the conntrack/TIME_WAIT layer pre-dating the validator's curl. For the proxy: a transient ARM/MMS issue affecting 10.14.0.193. Less likely than fixture-side because the proxy is the only target refusing connections; the cluster, the VM, the AKS extension framework, and the AKS managed runtime all succeeded.

Recommended next action / owner:

  • Wireserver-blocking validator: SIG Node Lifecycle test-code owner — make the validator tolerate pre-existing conntrack entries (curl with --local-port / fresh tuple), or flush conntrack for 168.63.129.16 before the curl probe. Pattern is now seen on two distinct PR runs and one cluster fixture.
  • HTTPSProxy fixture: AgentBaker E2E test-infra — the proxy mirror behind 10.14.0.193:8888 on shared cluster abe2e-azure-network-v4-ce2ad is unreachable; check the proxy pod/daemon health on that fixture.
  • No PR change required. Recommend rerun of the failed leg only.
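
The two validator options in the first bullet can be sketched as command builders. This is an assumption-laden illustration, not the validator's code: the function names, the port range, and the timeout are hypothetical, and the exact URL path the validator requests is not shown in the log, so the root path is used. `conntrack -D -d` and curl's `--local-port` and `--max-time` are existing options of those tools.

```python
from typing import List

WIRESERVER_IP = "168.63.129.16"


def conntrack_flush_cmd(dest_ip: str = WIRESERVER_IP) -> List[str]:
    """Delete conntrack entries whose original destination is dest_ip."""
    # Needs root and conntrack-tools; may exit non-zero when nothing matches.
    return ["conntrack", "-D", "-d", dest_ip]


def curl_probe_cmd(
    dest_ip: str = WIRESERVER_IP,
    local_port_range: str = "40000-40100",  # hypothetical range
    timeout_s: int = 5,                      # hypothetical timeout
) -> List[str]:
    """Probe port 80 from a chosen source-port range to force a fresh tuple."""
    return [
        "curl",
        "--silent",
        "--output", "/dev/null",
        "--max-time", str(timeout_s),
        "--local-port", local_port_range,
        f"http://{dest_ip}:80/",  # path is an assumption
    ]
```

A caller would run the flush command first, ignore a "nothing deleted" exit status, then run the probe and check its exit code against 28 or 7.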

Evidence used: failed task log (5 === FAIL markers across 3 distinct scenarios on 2 distinct shared clusters), all VHD builds succeeded, no changed file in PR #8659 touches wireserver/CSE/proxy code.

@aks-node-assistant

Contributor

AgentBaker Linux gate detective

Run: https://msazure.visualstudio.com/CloudNativeCompute/_build/results?buildId=169119572
Failed job/stage/task: e2e / Run AgentBaker E2E
Summary: This run has a deterministic PR-caused metadata failure before the later shared-cluster noise. Test_Version_Consistency_GPU_Managed_Components failed with dcgm-exporter partial OS update: expected rebuild revision 3, actual 1 for ubuntu.r2204.

Likely cause/signature: incomplete dcgm-exporter OS-variant alignment in parts/common/components.json (dcgm-exporter-partial-os-update). Confidence: High.
Alternative considered: the later NetworkIsolated shared-cluster cleanup failures; less likely as primary because the first failure is the GPU managed component consistency validator and directly matches this PR's nvidia-dcgm component update.
Recommended owner/action: PR owner/Renovate maintainer: align all dcgm-exporter OS entries in components.json, or revert the partial bump.
Wiki: https://dev.azure.com/msazure/09706533-03bf-4b43-9a9b-b49c75429646/_wiki/wikis/ed4a85e9-1085-4151-a39b-2753523eba2b?pagePath=%2FAKS%2FSIGs%20and%20Teams%2FAKS%20Components%2FSIG%3A%20Node%20Lifecycle%2FAI%20Agent%20Knowledge%2FAgentBaker%20Gate%20PR%20Pipeline%20Flakiness (dcgm-exporter-partial-os-update)

Copilot AI review requested due to automatic review settings June 23, 2026 12:43
@renovate renovate Bot force-pushed the renovate/patch-nvidia-dcgm branch from fc6c9d5 to df80fc4 Compare June 23, 2026 12:43

Copilot AI left a comment

Contributor

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@aks-node-assistant

Contributor

AgentBaker Linux gate detective

Run: https://msazure.visualstudio.com/CloudNativeCompute/_build/results?buildId=169154224
Failed job/stage/task: e2e / Run AgentBaker E2E
Summary: E2E failed on the known stale NetworkIsolated shared-cluster issue. abe2e-azure-networkisolated-v3-d6cc9 was already Deleting / Failed; NetworkIsolated scenarios failed in get or create cluster with cluster ... is in state Failed, won't retry. Missing SIG-image skips and later dual-stack stale-cluster failures are secondary.

Likely cause/signature: shared NetworkIsolated cluster cleanup corruption (networkisolated-shared-cluster-delete-blocked-inuse-nsg, repair #38506740). Confidence: High.
Alternative considered: PR #8659's nvidia-dcgm bump; less likely because the first/fanout failure is shared cluster lifecycle before PR-specific validation and PR only changes components.json.
Recommended owner/action: No PR action; E2E infra owner should clean/quarantine stale NetworkIsolated resources and continue repair #38506740.
Wiki: https://dev.azure.com/msazure/09706533-03bf-4b43-9a9b-b49c75429646/_wiki/wikis/ed4a85e9-1085-4151-a39b-2753523eba2b?pagePath=%2FAKS%2FSIGs%20and%20Teams%2FAKS%20Components%2FSIG%3A%20Node%20Lifecycle%2FAI%20Agent%20Knowledge%2FAgentBaker%20Gate%20PR%20Pipeline%20Flakiness (networkisolated-shared-cluster-delete-blocked-inuse-nsg)

@renovate renovate Bot force-pushed the renovate/patch-nvidia-dcgm branch from df80fc4 to ffe4cf0 Compare June 23, 2026 17:38
Copilot AI review requested due to automatic review settings June 24, 2026 04:36
@renovate renovate Bot force-pushed the renovate/patch-nvidia-dcgm branch from ffe4cf0 to 2ba0d14 Compare June 24, 2026 04:36

Copilot AI left a comment

Contributor

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@renovate renovate Bot force-pushed the renovate/patch-nvidia-dcgm branch from 2ba0d14 to 0067b3b Compare June 24, 2026 08:05
Copilot AI review requested due to automatic review settings June 24, 2026 11:29
@renovate renovate Bot force-pushed the renovate/patch-nvidia-dcgm branch from 0067b3b to 1d9ef2e Compare June 24, 2026 11:29

Copilot AI left a comment

Contributor

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@renovate renovate Bot force-pushed the renovate/patch-nvidia-dcgm branch from 1d9ef2e to 3eeee47 Compare June 25, 2026 08:44
Copilot AI review requested due to automatic review settings June 25, 2026 13:50
@renovate renovate Bot force-pushed the renovate/patch-nvidia-dcgm branch from 3eeee47 to 29c7369 Compare June 25, 2026 13:50

Copilot AI left a comment

Contributor

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@renovate renovate Bot force-pushed the renovate/patch-nvidia-dcgm branch from 29c7369 to 26d1c30 Compare June 25, 2026 18:11
@aks-node-assistant

Contributor

Failed gate run

Run: https://msazure.visualstudio.com/CloudNativeCompute/_build/results?buildId=169474377
Failed job/stage/task: e2e / Run AgentBaker E2E / Run AgentBaker E2E; plus build / build2404gen2containerd / Test, Scan, and Cleanup

Detective summary

E2E reached validation in Test_Ubuntu2204_ScriptlessCSECmd_Hotfix/default; LocalDNS exporter metrics passed, then validators.go:1010 reported localdns-exporter@...service unexpectedly entered a failed state. Timeline/build status corroborate the E2E task exit code 1. The VHD scan also reported known CIS rule 6.1.4.1|pass->fail in build2404gen2containerd.

Likely cause and signature

E2E: known LocalDNS exporter failed-state flake, signature localdns-exporter-systemd-failed-state, confidence high. Strongest alternative is a product regression in LocalDNS exporter, but this exact signature already has repair Bug #38581800 across unrelated PRs. CIS: known Ubuntu 24.04 Gen2 containerd CIS signal, signature linux-vhd-prgate-cis-ubuntu2404-gen2-containerd-6141-pass-fail, confidence high.

Recommended owner/action

Node Lifecycle E2E/LocalDNS owners should continue Bug #38581800. Linux VHD/CIS owners should continue Bug #38529622. No PR-specific action is indicated from this evidence alone.

Evidence

@aks-node-assistant

Contributor

Failed gate run

Run: https://msazure.visualstudio.com/CloudNativeCompute/_build/results?buildId=169539788
Failed job/stage/task: e2e / Run AgentBaker E2E / Run AgentBaker E2E

Detective summary

Test_Ubuntu2204_HTTPSProxy_PrivateDNS/default failed creating the VMSS while processing the CustomScript extension: CSE ExitCode=50, KubeletStartTime=n/a, and the known mandb/oldlocal cache pattern. Timeline and build status corroborate the E2E task exit code 1.

Likely cause and signature

Known HTTPSProxy PrivateDNS CSE/provisioning flake, signature ubuntu2204-httpsproxy-privatedns-cse-exit50-outbound-kubelet-na, confidence high. Strongest alternative is a deterministic CSE regression, but this stable kubelet-n/a signature is already tracked under repair Bug #38559852 across unrelated PRs.

Recommended owner/action

Continue the existing repair investigation in Bug #38559852; no PR author action is indicated from this evidence alone.

Evidence

Copilot AI review requested due to automatic review settings June 25, 2026 21:57
@renovate renovate Bot force-pushed the renovate/patch-nvidia-dcgm branch from 26d1c30 to 4873a37 Compare June 25, 2026 21:57

Copilot AI left a comment

Contributor

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@surajssd surajssd enabled auto-merge (squash) June 26, 2026 20:51
@djsly djsly disabled auto-merge June 26, 2026 20:58
@djsly djsly merged commit 2808536 into main Jun 26, 2026
46 of 48 checks passed
@djsly djsly deleted the renovate/patch-nvidia-dcgm branch June 26, 2026 20:58
@aks-node-assistant

Contributor

Failed gate

Run: https://msazure.visualstudio.com/CloudNativeCompute/_build/results?buildId=169649265

Failed job/stage/task: e2e / Run AgentBaker E2E (logId 538) and build2404gen2containerd / Test, Scan, and Cleanup (logId 481).

Detective summary

The first E2E failure is deterministic: Test_Version_Consistency_GPU_Managed_Components found dcgm-exporter rebuild revision mismatch, expected 4 but ubuntu.r2204 has 1. The same run also hit the known Ubuntu 24.04 CIS 6.1.4.1 pass->fail gate signature.

Likely cause / signature

Primary likely cause is a PR-caused partial OS metadata update in parts/common/components.json. Signature: dcgm-exporter-partial-os-update. Confidence: High.

Secondary known gate issue: linux-vhd-prgate-cis-ubuntu2404-gen2-containerd-6141-pass-fail, tracked by repair item #38529622.

Strongest alternative: flaky/infrastructure E2E behavior, but less likely because the failing test is a static consistency check that names the exact component, OS variant, and expected/actual rebuild revisions.

Recommended action

Align all dcgm-exporter OS entries in parts/common/components.json to rebuild revision 4, or revert the partial bump. No PR author action is recommended for the separate known CIS signature.
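
A local pre-check for this kind of mismatch could look like the sketch below. It is not the logic of Test_Version_Consistency_GPU_Managed_Components, which is not shown in this thread. The regexes are inferred from the version strings in the PR description, the OS keys other than `ubuntu.r2204` are hypothetical, and treating the highest revision as the expected one is an assumption.

```python
import re
from typing import Dict, Optional

# Inferred from strings such as 4.8.2-ubuntu22.04u4 and 4.8.2-4.azl3.
UBUNTU_RE = re.compile(r"-ubuntu\d+\.\d+u(?P<rev>\d+)$")
AZL_RE = re.compile(r"-(?P<rev>\d+)\.azl\d+$")


def rebuild_revision(version: str) -> Optional[int]:
    """Extract the rebuild revision from a dcgm-exporter version string."""
    for pattern in (UBUNTU_RE, AZL_RE):
        match = pattern.search(version)
        if match:
            return int(match.group("rev"))
    return None


def find_mismatches(versions: Dict[str, str]) -> Dict[str, int]:
    """Return OS keys whose rebuild revision differs from the highest one seen."""
    revisions = {key: rebuild_revision(v) for key, v in versions.items()}
    known = [rev for rev in revisions.values() if rev is not None]
    if not known:
        return {}
    expected = max(known)  # assumption: the newest revision is the target
    return {k: r for k, r in revisions.items() if r is not None and r != expected}


versions = {
    "ubuntu.r2204": "4.8.2-ubuntu22.04u1",  # the stale entry the gate reported
    "ubuntu.r2404": "4.8.2-ubuntu24.04u4",  # hypothetical key
    "azurelinux.v3.0": "4.8.2-4.azl3",      # hypothetical key
}
print(find_mismatches(versions))
```

With the partial bump described above, only `ubuntu.r2204` is reported, at revision 1 against an expected 4.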

Evidence

Labels

components (This pull request updates cached components on Linux or Windows VHDs), renovate (This pull request was created by renovate)

4 participants