feat(rcv1p): unify cert bootstrap flow and add Windows CA refresh task #8096

Open
rchincha wants to merge 126 commits into main from origin/rchinchani/rcv1p-2

@rchincha rchincha commented Mar 16, 2026

Contributor

What this PR does / why we need it

This PR implements RCV1P (Robust Certificate Validation for 1P) — the next-generation mechanism for distributing Azure root CA certificates to AKS nodes. Instead of hardcoding certificate bundles, RCV1P queries the Azure wireserver at provisioning time to download and install the latest root certificates into the OS trust store.

Reference: https://eng.ms/docs/products/onecert-certificates-key-vault-and-dsms/onecert-customer-guide/autorotationandecr/rcv1ptsg


Summary of Changes

1. Linux: Unified cert bootstrap flow (init-aks-custom-cloud.sh)

  • Consolidated 3 scripts into 1: Removed init-aks-custom-cloud-mariner.sh, init-aks-custom-cloud-operation-requests-mariner.sh, and init-aks-custom-cloud-operation-requests.sh. All cert logic now flows through a single init-aks-custom-cloud.sh that detects the distro (Ubuntu, Mariner, AzureLinux, Flatcar, ACL) at runtime.
  • Extracted repo depot logic: Moved Ubuntu repo depot initialization and chrony configuration into a new init-aks-custom-cloud-repos.sh to keep base customData size small for non-custom-cloud scenarios (critical for Flatcar/ACL which have tight size limits).
  • Two cert endpoint modes: legacy (ussec/usnat regions) and rcv1p (all other regions), selected by cloud location at runtime.
  • Fatal failure handling: Wireserver cert retrieval failures are fatal on both Linux and Windows: once retries are exhausted, provisioning fails rather than continuing without certificates. A cert mode (legacy or rcv1p) is always selected, and retrieval in that mode must succeed.

2. Windows: CA cert refresh task and rcv1p support (kubernetesfunc.ps1)

  • New Get-CACertificates with -Location parameter: Determines cert endpoint mode from location, uses legacy endpoint for ussec/usnat, rcv1p for all others.
  • New Register-CACertificatesRefreshTask: Registers a daily scheduled task to refresh CA certificates, with backward compatibility for older VHDs that don't accept -Location.
  • New Should-InstallCACertificatesRefreshTask: Gates refresh task registration on wireserver opt-in status.
  • Pester tests: 260 lines of tests covering URI construction, endpoint mode selection, and refresh task gating.

3. E2E tests (e2e/scenario_rcv1p_test.go, e2e/scenario_rcv1p_win_test.go)

  • Linux tests: Ubuntu 2204, Ubuntu 2404, AzureLinux V3, Flatcar, ACL — each validates cert download, trust store installation, and refresh schedule.
  • Windows tests: Windows 2022, 23H2, 2025 — validates cert download to C:\ca, Windows certificate store import, and scheduled task registration.
  • Negative test: Test_RCV1P_NotOptedIn verifies that omitting the VM opt-in tag correctly prevents cert installation.
  • Dedicated pipeline: .pipelines/e2e-rcv1p.yaml runs daily at 3am PST with tag filter rcv1pcertmode=true (not yet enabled).

4. E2E infrastructure: multi-subscription and VM instance tagging

  • Multi-subscription support: RCV1P tests run in a dedicated subscription (RCV1P_SUBSCRIPTION_ID) with the Microsoft.Compute/PlatformSettingsOverride feature flag. Added SubscriptionID field to scenarios and GetAzure()/GetSubscriptionID() helpers.
  • VMSS opt-in tagging: Sets the wireserver opt-in tag (platformsettings.host_environment.service.platform_optedin_for_rootcerts=true) on the VMSS at creation time via a VMConfigMutator. VM instances inherit VMSS-level tags automatically.
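The multi-subscription behavior described above can be pictured with a small sketch. It assumes a scenario falls back to the default E2E subscription when its own SubscriptionID is empty; the `scenario` type, the `subscriptionFor` helper, and the `"default-sub"` value are hypothetical names for illustration, not the PR's actual `GetSubscriptionID()` implementation.

```go
package main

import (
	"fmt"
	"os"
)

// scenario is a hypothetical stand-in for the E2E scenario type; only the
// SubscriptionID field name comes from the PR description.
type scenario struct {
	SubscriptionID string
}

// subscriptionFor returns the scenario's own subscription when it is set and
// falls back to the default E2E subscription otherwise (assumed behavior).
func subscriptionFor(s scenario, defaultID string) string {
	if s.SubscriptionID != "" {
		return s.SubscriptionID
	}
	return defaultID
}

func main() {
	// RCV1P scenarios would take their subscription from the pipeline variable.
	rcv1p := scenario{SubscriptionID: os.Getenv("RCV1P_SUBSCRIPTION_ID")}
	fmt.Println(subscriptionFor(rcv1p, "default-sub"))
	fmt.Println(subscriptionFor(scenario{}, "default-sub"))
}
```

Keeping the fallback in one helper means existing scenarios need no changes, and only RCV1P scenarios opt into the dedicated subscription.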

⚠️ Critical Design Decisions

1. Cert endpoint mode is determined by cloud location, not a flag

Decision: ussec*/usnat* → legacy mode, everything else → rcv1p mode. This is determined at runtime from the node's Azure location.
Why: Avoids requiring a new API contract field. The location-based approach lets us roll out rcv1p incrementally — ussec/usnat stay on the legacy endpoint that works today, while all other regions use the new rcv1p endpoint with opt-in gating.
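As a rough illustration of the location-based selection, the sketch below maps a location string to a mode by prefix. The function name `certEndpointMode`, the lowercasing and trimming of the input, and the sample region names are assumptions made for this example; the PR's shell and PowerShell implementations may normalize differently.

```go
package main

import (
	"fmt"
	"strings"
)

// certEndpointMode returns "legacy" for ussec*/usnat* locations and "rcv1p"
// for everything else. Lowercasing and trimming are assumptions of this sketch.
func certEndpointMode(location string) string {
	loc := strings.ToLower(strings.TrimSpace(location))
	if strings.HasPrefix(loc, "ussec") || strings.HasPrefix(loc, "usnat") {
		return "legacy"
	}
	return "rcv1p"
}

func main() {
	for _, loc := range []string{"ussecwest", "usnateast", "eastus2"} {
		fmt.Printf("%s -> %s\n", loc, certEndpointMode(loc))
	}
}
```

Because the mode is a pure function of the location, no new API contract field is needed to carry it.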

2. Two-layer access control for rcv1p

Decision: Both conditions must be met for cert installation:

  1. Subscription feature flag (Microsoft.Compute/PlatformSettingsOverride) enables the wireserver endpoint
  2. VM instance tag (platformsettings.host_environment.service.platform_optedin_for_rootcerts=true) grants per-VM access

Why: Defense in depth — the subscription flag is a coarse gate, the VM tag provides per-node opt-in control. Without the tag, wireserver returns IsOptedInForRootCerts=false.
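The two gates combine as a simple conjunction. The sketch below models that decision table only; in practice the node does not evaluate these inputs itself but reads the resulting IsOptedInForRootCerts value from wireserver. The function name `shouldInstallCerts` and the use of a plain string map for tags are assumptions for illustration.

```go
package main

import "fmt"

// optInTag is the VM tag name quoted in the PR description.
const optInTag = "platformsettings.host_environment.service.platform_optedin_for_rootcerts"

// shouldInstallCerts models the two-layer gate: the subscription feature flag
// must be registered AND the VM must carry the opt-in tag set to "true".
func shouldInstallCerts(featureFlagRegistered bool, vmTags map[string]string) bool {
	return featureFlagRegistered && vmTags[optInTag] == "true"
}

func main() {
	tagged := map[string]string{optInTag: "true"}
	fmt.Println(shouldInstallCerts(true, tagged))  // both gates open
	fmt.Println(shouldInstallCerts(true, nil))     // flag only, no tag
	fmt.Println(shouldInstallCerts(false, tagged)) // tag only, no flag
}
```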

3. VM opt-in tag is set at VMSS creation time

Decision: The opt-in tag (platformsettings.host_environment.service.platform_optedin_for_rootcerts=true) is set on the VMSS at creation time, and all VM instances inherit it automatically.
Why: VMSS-level tags propagate to VM instances, and wireserver reads the tag from the VM instance to determine opt-in status. In E2E tests, the positive tests set the tag via a VMConfigMutator at VMSS creation, while the negative test (Test_RCV1P_NotOptedIn) simply omits the tag to verify wireserver returns IsOptedInForRootCerts=false.
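A mutator of this kind could look roughly like the following. The `vmssModel` type is a hypothetical stand-in for the Azure SDK's VMSS type, and `withRootCertOptIn` is an illustrative name; neither is taken from the PR's e2e code. Only the tag name and value come from the description above.

```go
package main

import "fmt"

// optInTag is the VM tag name quoted in the PR description.
const optInTag = "platformsettings.host_environment.service.platform_optedin_for_rootcerts"

// vmssModel is a hypothetical, minimal stand-in for the SDK's VMSS model.
type vmssModel struct {
	Tags map[string]*string
}

// withRootCertOptIn sets the opt-in tag before the VMSS is created. The
// negative test would simply not apply this mutator.
func withRootCertOptIn(vmss *vmssModel) {
	if vmss.Tags == nil {
		vmss.Tags = map[string]*string{}
	}
	value := "true"
	vmss.Tags[optInTag] = &value
}

func main() {
	vmss := &vmssModel{}
	withRootCertOptIn(vmss)
	fmt.Println(*vmss.Tags[optInTag])
}
```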

4. Get-CACertificates moved outside IsAKSCustomCloud guard (Windows)

Decision: Get-CACertificates -Location $Location -FailOnError now runs for all clouds, not just custom clouds.
Why: RCV1P applies to all clouds. The function itself handles the location-based mode selection internally and gracefully skips cert installation when wireserver returns IsOptedInForRootCerts=false (which is the case on public cloud without the feature flag).

5. Wireserver failures are fatal after retries

Decision: If wireserver cert endpoints fail after exhausting retries, provisioning fails (exit 1 on Linux, throw on Windows with -FailOnError).
Why: Cert installation is required for the selected mode. Silently continuing without certificates would leave the node in an inconsistent state. Retries with backoff handle transient wireserver issues (rate limiting, temporary unavailability).

6. Backward compatibility for Windows VHD/CSE version skew

Decision: kuberneteswindowssetup.ps1 guards Register-CACertificatesRefreshTask with Get-Command checks before calling it.
Why: Windows VHD and CSE release independently. Newer CSE must not crash on older VHDs that don't have these functions. The guard falls back gracefully.


Testing Evidence

MSFT tenant (default E2E subscription)

Linux (Build 158446017):

  • All distros (Ubuntu, ACL, AzureLinux) correctly detect rcv1p mode
  • IsOptedInForRootCerts check works (skips on public cloud as expected)
  • Chrony configured per-OS, CSE completes successfully across 113 tests

Windows (Build 158446024):

  • Get-CACertificates -Location correctly selects rcv1p mode
  • Should-InstallCACertificatesRefreshTask returns $false on public cloud (correct)
  • Backward-compat guard works for older VHDs

TME tenant (RCV1P_SUBSCRIPTION_ID set in pipeline, with PlatformSettingsOverride feature flag)

Linux — Validated end-to-end: wireserver returns IsOptedInForRootCerts=true, certificates downloaded and installed into OS trust store, refresh schedule registered. Passed across Ubuntu 2204, Ubuntu 2404, AzureLinux V3, Flatcar, ACL.

Windows (Build 161633049):

  • 5 of 6 jobs passed — all RCV1P tests passed across Windows 2022, 23H2, 2025 (both gen1 and gen2 variants)
  • Wireserver returns IsOptedInForRootCerts=true, certificates downloaded to C:\ca, scheduled task aks-ca-certs-refresh-task registered
  • The only failure (windows-2022-containerd job) was a pre-existing Test_Windows2022_VHDCaching issue unrelated to RCV1P
  • Bug fix validated: The original Windows implementation used the wrong JSON field name (OperationRequests instead of OperationsInfo) when parsing wireserver responses — this was the root cause of empty cert downloads. Fixed in commit b6cd4e4f68.
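The field-name bug above belongs to a class of failure that is easy to miss, because a lenient JSON parser reports no error when a field is simply absent. The actual fix was in the Windows PowerShell code; the Go sketch below only reproduces the same effect. The payload shape and the `Name` field are hypothetical, and only the two top-level field names come from the PR description.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// payload is a hypothetical wireserver-style response used for illustration.
const payload = `{"OperationsInfo":[{"Name":"example-root-ca"}]}`

type entry struct {
	Name string `json:"Name"`
}

// wrongShape binds to the field name the original implementation used.
type wrongShape struct {
	Items []entry `json:"OperationRequests"`
}

// rightShape binds to the field name used after the fix.
type rightShape struct {
	Items []entry `json:"OperationsInfo"`
}

// countWrong decodes with the wrong field name. Unknown fields are ignored by
// encoding/json, so this returns zero entries and no error.
func countWrong(data []byte) (int, error) {
	var r wrongShape
	if err := json.Unmarshal(data, &r); err != nil {
		return 0, err
	}
	return len(r.Items), nil
}

// countRight decodes with the corrected field name.
func countRight(data []byte) (int, error) {
	var r rightShape
	if err := json.Unmarshal(data, &r); err != nil {
		return 0, err
	}
	return len(r.Items), nil
}

func main() {
	wrong, errWrong := countWrong([]byte(payload))
	right, errRight := countRight([]byte(payload))
	fmt.Println(wrong, errWrong)
	fmt.Println(right, errRight)
}
```

With the wrong name the decode succeeds but yields no entries, which matches the symptom of empty cert downloads. Treating an empty result as an error is one way to surface this kind of mismatch earlier.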

Note on Windows E2E infrastructure: The published CSE scripts package (v0.0.52 from packages.aks.azure.com) predates the RCV1P code. To test the branch code end-to-end, the tests build a CSE zip from staging/cse/windows/ at test time, upload it to blob storage, and override CseScriptsPackageURL in the bootstrap config. This override is temporary — once a new CSE package is published with RCV1P support, the override can be removed.


Files Changed (31 files, +1979 / -1218)

| Area | Files | Description |
|---|---|---|
| Linux provisioning | init-aks-custom-cloud.sh, init-aks-custom-cloud-repos.sh (new), 3 removed | Unified cert flow, repo depot extraction |
| Windows provisioning | kubernetesfunc.ps1, kuberneteswindowssetup.ps1 | rcv1p support, refresh task, backward compat |
| Windows tests | kubernetesfunc.tests.ps1 (new) | 260 lines of Pester tests |
| E2E tests | scenario_rcv1p_test.go (new), scenario_rcv1p_win_test.go (new) | Linux + Windows + negative tests |
| E2E infra | vmss.go, types.go, validators.go, cluster.go, config/ | Multi-sub, VM instance tags, validators |
| Pipeline | e2e-rcv1p.yaml (new) | Daily RCV1P test pipeline |
| AgentBaker service | baker.go, const.go, variables.go | Wire up new scripts |

PR File Breakdown: Functionality vs Tests

Functionality (1,859 lines — 51%)

| Lines | File | Purpose |
|---|---|---|
| 455 | parts/linux/cloud-init/artifacts/init-aks-custom-cloud.sh | Linux RCV1P cert provisioning |
| 358 | parts/linux/cloud-init/artifacts/init-aks-custom-cloud-repos.sh | Split out repo init logic |
| 346 | parts/linux/cloud-init/artifacts/init-aks-custom-cloud-operation-requests.sh | Deleted (consolidated) |
| 236 | parts/linux/cloud-init/artifacts/init-aks-custom-cloud-operation-requests-mariner.sh | Deleted (consolidated) |
| 211 | staging/cse/windows/kubernetesfunc.ps1 | Windows RCV1P cert functions |
| 186 | parts/linux/cloud-init/artifacts/init-aks-custom-cloud-mariner.sh | Deleted (consolidated) |
| 18 | pkg/agent/variables.go | Template variable plumbing |
| 13 | parts/windows/kuberneteswindowssetup.ps1 | Windows CSE template |
| 12 | pkg/agent/const.go | Embedded script references |
| 7 | parts/linux/cloud-init/nodecustomdata.yml | Cloud-init custom data |
| 7 | aks-node-controller/parser/helper.go | ANC parser helper |
| 4 | parts/linux/cloud-init/artifacts/cse_cmd.sh | CSE command script |
| 3 | aks-node-controller/parser/templates/cse_cmd.sh.gtpl | CSE command template |
| 3 | pkg/agent/baker.go | Baker template plumbing |

Tests / E2E Infra (1,795 lines — 49%)

| Lines | File | Purpose |
|---|---|---|
| 507 | e2e/scenario_rcv1p_test.go | Linux RCV1P E2E tests + CSE zip helper |
| 260 | staging/cse/windows/kubernetesfunc.tests.ps1 | Windows PowerShell unit tests |
| 195 | e2e/scenario_rcv1p_win_test.go | Windows RCV1P E2E tests |
| 146 | e2e/validators.go | RCV1P validation functions |
| 135 | e2e/cluster.go | Multi-subscription cluster support |
| 111 | e2e/vmss.go | VMSS tagging and log collection |
| 96 | e2e/config/azure.go | Azure client config for multi-sub |
| 80 | e2e/test_helpers.go | Test helper improvements |
| 68 | e2e/types.go | Scenario type extensions |
| 61 | spec/parts/linux/cloud-init/artifacts/init_aks_custom_cloud_spec.sh | ShellSpec unit tests |
| 47 | e2e/cache.go | Cluster cache additions |
| 29 | e2e/aks_model.go | AKS model extensions |
| 24 | e2e/config/config.go | E2E config additions |
| 19 | .pipelines/e2e-rcv1p.yaml | Dedicated RCV1P pipeline |
| 14 | e2e/kube.go | Kube helper improvements |
| 2 | .pipelines/scripts/e2e_run.sh | E2E run script changes |
| 1 | .pipelines/templates/e2e-template.yaml | E2E template changes |

Summary

| Category | Lines Changed | Percentage |
|---|---|---|
| Functionality | 1,859 | 51% |
| Tests / E2E | 1,795 | 49% |
| Total | 3,654 | 100% |

Copilot AI review requested due to automatic review settings March 16, 2026 05:14

Copilot AI left a comment

Pull request overview

This PR aims to unify the custom-cloud CA certificate bootstrap path (removing the separate “operation-requests” init scripts) and adds a Windows scheduled task to periodically refresh custom-cloud CA certificates.

Changes:

  • Windows: add a scheduled task to refresh custom-cloud CA certificates; update Get-CACertificates to support legacy vs “rcv1p” modes keyed off location.
  • Linux: consolidate custom-cloud init to a single init script and update CSE command generation to set a cert-endpoint mode variable.
  • Regenerate multiple custom data / generated command snapshots to reflect the new templates.

Reviewed changes

Copilot reviewed 74 out of 176 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
staging/cse/windows/kubernetesfunc.ps1 Adds CA refresh scheduled task + updates CA retrieval logic and error behavior
parts/windows/kuberneteswindowssetup.ps1 Wires Get-CACertificates -Location and registers refresh task for custom clouds
pkg/agent/variables.go Always injects initAKSCustomCloud payload into cloud-init data
pkg/agent/const.go Removes separate custom-cloud init script constants; keeps single init script
pkg/agent/baker.go Simplifies GetTargetEnvironment; notes IsAKSCustomCloud as deprecated
parts/linux/cloud-init/artifacts/cse_cmd.sh Updates CSE command to set cert endpoint mode + run custom-cloud init script
parts/linux/cloud-init/artifacts/init-aks-custom-cloud-operation-requests.sh Deleted (custom-cloud init consolidation)
parts/linux/cloud-init/artifacts/init-aks-custom-cloud-operation-requests-mariner.sh Deleted (custom-cloud init consolidation)
parts/linux/cloud-init/artifacts/init-aks-custom-cloud-mariner.sh Deleted (custom-cloud init consolidation)
aks-node-controller/parser/templates/cse_cmd.sh.gtpl Mirrors CSE command template updates for aks-node-controller parser
aks-node-controller/parser/testdata/Compatibility+EmptyConfig/generatedCSECommand Regenerated snapshot for new CSE cmd template
aks-node-controller/parser/testdata/AzureLinuxv2+Kata+DisableUnattendedUpgrades=false/generatedCSECommand Regenerated snapshot for new CSE cmd template
aks-node-controller/parser/testdata/AKSUbuntu2204+SSHStatusOn/generatedCSECommand Regenerated snapshot for new CSE cmd template
aks-node-controller/parser/testdata/AKSUbuntu2204+EnablePubkeyAuth/generatedCSECommand New snapshot for new template output
aks-node-controller/parser/testdata/AKSUbuntu2204+DisablePubkeyAuth/generatedCSECommand New snapshot for new template output
aks-node-controller/parser/testdata/AKSUbuntu2204+DefaultPubkeyAuth/generatedCSECommand New snapshot for new template output
aks-node-controller/parser/testdata/AKSUbuntu2204+CustomOSConfig/generatedCSECommand Regenerated snapshot for new CSE cmd template
aks-node-controller/parser/testdata/AKSUbuntu2204+CustomCloud/generatedCSECommand Regenerated snapshot for new CSE cmd template
aks-node-controller/parser/testdata/AKSUbuntu2204+Containerd+MIG/generatedCSECommand Regenerated snapshot for new CSE cmd template
aks-node-controller/parser/testdata/AKSUbuntu2204+CloudProviderOverrides/generatedCSECommand New snapshot for new template output
aks-node-controller/parser/testdata/AKSUbuntu2204+China/generatedCSECommand Regenerated snapshot for new CSE cmd template
pkg/agent/testdata/MarinerV2+Kata/CustomData Regenerated snapshot (custom data gzip payload changed)
pkg/agent/testdata/AzureLinuxV2+Kata/CustomData Regenerated snapshot (custom data gzip payload changed)
pkg/agent/testdata/AzureLinuxV3+Kata/CustomData Regenerated snapshot (custom data gzip payload changed)
pkg/agent/testdata/AKSUbuntu2204+China/CustomData Regenerated snapshot (custom data gzip payload changed)
pkg/agent/testdata/AKSUbuntu2204+cgroupv2/CustomData Regenerated snapshot (custom data gzip payload changed)
pkg/agent/testdata/AKSUbuntu2204+ootcredentialprovider/CustomData Regenerated snapshot (custom data gzip payload changed)
pkg/agent/testdata/AKSUbuntu2204+SecurityProfile/CustomData Regenerated snapshot (custom data gzip payload changed)
pkg/agent/testdata/AKSUbuntu2204+SSHStatusOn/CustomData Regenerated snapshot (custom data gzip payload changed)
pkg/agent/testdata/AKSUbuntu2204+SSHStatusOff/CustomData Regenerated snapshot (custom data gzip payload changed)
pkg/agent/testdata/AKSUbuntu2404+NetworkPolicy/CustomData Regenerated snapshot (custom data gzip payload changed)
pkg/agent/testdata/AKSUbuntu2404+Teleport/CustomData Regenerated snapshot (custom data gzip payload changed)
pkg/agent/testdata/CustomizedImage/CustomData Regenerated snapshot (custom data gzip payload changed)
pkg/agent/testdata/CustomizedImageKata/CustomData Regenerated snapshot (custom data gzip payload changed)
pkg/agent/testdata/CustomizedImageLinuxGuard/CustomData Regenerated snapshot (custom data gzip payload changed)
pkg/agent/testdata/Flatcar/CustomData.inner Regenerated snapshot (embedded gzip payload changed)
pkg/agent/testdata/ACL/CustomData.inner Regenerated snapshot (embedded gzip payload changed)

Comment thread staging/cse/windows/kubernetesfunc.ps1
Comment thread parts/windows/kuberneteswindowssetup.ps1 Outdated
Comment thread parts/linux/cloud-init/artifacts/cse_cmd.sh Outdated
Comment thread aks-node-controller/parser/templates/cse_cmd.sh.gtpl Outdated
@rchincha rchincha force-pushed the origin/rchinchani/rcv1p-2 branch from 44ff9ee to a0a1307 on March 18, 2026 21:11
Copilot AI review requested due to automatic review settings March 18, 2026 22:12

Copilot AI left a comment

Pull request overview

This PR aims to unify AKS custom-cloud CA certificate bootstrap behavior (legacy vs “rcv1p/operation-requests” style flows) and adds a Windows scheduled task to periodically refresh custom-cloud CA certificates.

Changes:

  • Adds Windows CA refresh scheduled task registration and introduces location-based endpoint-mode selection (legacy vs rcv1p).
  • Refactors Windows CA certificate retrieval to support both endpoint modes and opt-in gating for rcv1p.
  • Simplifies Linux custom-cloud init script selection by consolidating onto init-aks-custom-cloud.sh and removing older variants; updates generated testdata accordingly.

Reviewed changes

Copilot reviewed 93 out of 99 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
staging/cse/windows/kubernetesfunc.ps1 Adds CA refresh scheduled task and endpoint-mode-aware Get-CACertificates implementation.
pkg/agent/variables.go Simplifies how initAKSCustomCloud is added to Linux cloud-init variables.
pkg/agent/testdata/MarinerV2+Kata/CustomData Updates expected CustomData snapshot (generated content changed).
pkg/agent/testdata/Flatcar/CustomData.inner Updates expected Flatcar CustomData snapshot (generated content changed).
pkg/agent/testdata/CustomizedImageLinuxGuard/CustomData Updates expected CustomData snapshot (generated content changed).
pkg/agent/testdata/CustomizedImageKata/CustomData Updates expected CustomData snapshot (generated content changed).
pkg/agent/testdata/CustomizedImage/CustomData Updates expected CustomData snapshot (generated content changed).
pkg/agent/testdata/AzureLinuxV3+Kata/CustomData Updates expected CustomData snapshot (generated content changed).
pkg/agent/testdata/AzureLinuxV2+Kata/CustomData Updates expected CustomData snapshot (generated content changed).
pkg/agent/testdata/AKSWindows23H2Gen2+NextGenNetworkingNoConfig/CustomData Updates expected Windows CustomData snapshot (calls/refresh task additions).
pkg/agent/testdata/AKSWindows23H2Gen2+NextGenNetworkingDisabled/CustomData Updates expected Windows CustomData snapshot (calls/refresh task additions).
pkg/agent/testdata/AKSWindows23H2Gen2+NextGenNetworking/CustomData Updates expected Windows CustomData snapshot (calls/refresh task additions).
pkg/agent/testdata/AKSWindows2019+ootcredentialprovider/CustomData Updates expected Windows CustomData snapshot (calls/refresh task additions).
pkg/agent/testdata/AKSWindows2019+SecurityProfile/CustomData Updates expected Windows CustomData snapshot (calls/refresh task additions).
pkg/agent/testdata/AKSWindows2019+ManagedIdentity/CustomData Updates expected Windows CustomData snapshot (calls/refresh task additions).
pkg/agent/testdata/AKSWindows2019+KubeletServingCertificateRotation/CustomData Updates expected Windows CustomData snapshot (calls/refresh task additions).
pkg/agent/testdata/AKSWindows2019+KubeletClientTLSBootstrapping/CustomData Updates expected Windows CustomData snapshot (calls/refresh task additions).
pkg/agent/testdata/AKSWindows2019+K8S119/CustomData Updates expected Windows CustomData snapshot (calls/refresh task additions).
pkg/agent/testdata/AKSWindows2019+K8S119+FIPS/CustomData Updates expected Windows CustomData snapshot (calls/refresh task additions).
pkg/agent/testdata/AKSWindows2019+K8S119+CSI/CustomData Updates expected Windows CustomData snapshot (calls/refresh task additions).
pkg/agent/testdata/AKSWindows2019+K8S118/CustomData Updates expected Windows CustomData snapshot (calls/refresh task additions).
pkg/agent/testdata/AKSWindows2019+K8S117/CustomData Updates expected Windows CustomData snapshot (calls/refresh task additions).
pkg/agent/testdata/AKSWindows2019+K8S116/CustomData Updates expected Windows CustomData snapshot (calls/refresh task additions).
pkg/agent/testdata/AKSWindows2019+EnablePrivateClusterHostsConfigAgent/CustomData Updates expected Windows CustomData snapshot (calls/refresh task additions).
pkg/agent/testdata/AKSWindows2019+CustomVnet/CustomData Updates expected Windows CustomData snapshot (calls/refresh task additions).
pkg/agent/testdata/AKSWindows2019+CustomCloud/CustomData Updates expected Windows CustomData snapshot (new Get-CACertificates call form + refresh task).
pkg/agent/testdata/AKSWindows2019+CustomCloud+ootcredentialprovider/CustomData Updates expected Windows CustomData snapshot (new Get-CACertificates call form + refresh task).
pkg/agent/testdata/AKSUbuntu2404+Teleport/CustomData Updates expected Ubuntu CustomData snapshot (generated content changed).
pkg/agent/testdata/AKSUbuntu2404+NetworkPolicy/CustomData Updates expected Ubuntu CustomData snapshot (generated content changed).
pkg/agent/testdata/AKSUbuntu2204+ootcredentialprovider/CustomData Updates expected Ubuntu CustomData snapshot (generated content changed).
pkg/agent/testdata/AKSUbuntu2204+cgroupv2/CustomData Updates expected Ubuntu CustomData snapshot (generated content changed).
pkg/agent/testdata/AKSUbuntu2204+SecurityProfile/CustomData Updates expected Ubuntu CustomData snapshot (generated content changed).
pkg/agent/testdata/AKSUbuntu2204+SSHStatusOn/CustomData Updates expected Ubuntu CustomData snapshot (generated content changed).
pkg/agent/testdata/AKSUbuntu2204+China/CustomData Updates expected Ubuntu CustomData snapshot (generated content changed).
pkg/agent/testdata/ACL/CustomData.inner Updates expected ACL CustomData snapshot (generated content changed).
pkg/agent/const.go Consolidates custom-cloud init script constants to a single script.
parts/windows/kuberneteswindowssetup.ps1 Updates Windows setup flow to call Get-CACertificates with location and registers CA refresh scheduled task.
parts/linux/cloud-init/artifacts/init-aks-custom-cloud-operation-requests.sh Removes operation-requests-specific Linux init script (consolidation).
parts/linux/cloud-init/artifacts/init-aks-custom-cloud-operation-requests-mariner.sh Removes Mariner/AzureLinux operation-requests init script (consolidation).
parts/linux/cloud-init/artifacts/init-aks-custom-cloud-mariner.sh Removes Mariner/AzureLinux legacy init script variant (consolidation).
aks-node-controller/parser/templates/cse_cmd.sh.gtpl Adds a LOCATION shell variable in the generated CSE command template.
aks-node-controller/parser/helper.go Factors out a shared getCloudLocation helper and reuses it in getCloudTargetEnv.

Comment thread parts/windows/kuberneteswindowssetup.ps1 Outdated
Comment thread parts/windows/kuberneteswindowssetup.ps1 Outdated
Comment thread staging/cse/windows/kubernetesfunc.ps1
Comment thread aks-node-controller/parser/templates/cse_cmd.sh.gtpl Outdated
@rchincha rchincha changed the title from "feat(custom-cloud): unify cert bootstrap flow and add Windows CA refresh task" to "feat(rcv1p): unify cert bootstrap flow and add Windows CA refresh task" on Mar 19, 2026
@rchincha rchincha force-pushed the origin/rchinchani/rcv1p-2 branch from 2b3c1d6 to e19a19b on March 19, 2026 00:51
Copilot AI review requested due to automatic review settings March 19, 2026 01:00
@rchincha rchincha force-pushed the origin/rchinchani/rcv1p-2 branch from e19a19b to d41856f on March 19, 2026 01:00

Copilot AI left a comment

Pull request overview

This PR unifies the AKS custom cloud CA certificate bootstrap logic to a single flow and adds a Windows scheduled task to periodically refresh custom cloud CA certificates. It also updates Linux/customdata generation and test snapshots to reflect the new wiring.

Changes:

  • Add Windows scheduled task registration for daily CA certificate refresh and introduce a location-based cert endpoint mode selector.
  • Simplify Linux custom cloud init script selection by standardizing on init-aks-custom-cloud.sh, plus add wiring/tests for refresh-mode arguments.
  • Update aks-node-controller template to export LOCATION, and regenerate CustomData snapshot test artifacts.

Reviewed changes

Copilot reviewed 95 out of 101 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
staging/cse/windows/kubernetesfunc.tests.ps1 Adds Pester coverage for cert endpoint mode selection, scheduled task registration, and CA retrieval behavior.
staging/cse/windows/kubernetesfunc.ps1 Implements unified Windows CA retrieval logic with legacy/rcv1p modes and registers a daily refresh scheduled task.
spec/parts/linux/cloud-init/artifacts/init_aks_custom_cloud_spec.sh Adds ShellSpec assertions to validate refresh-mode argument parsing/wiring in the Linux init script.
pkg/agent/variables.go Changes how initAKSCustomCloud is injected into Linux cloud-init data.
pkg/agent/const.go Removes per-cloud custom init script constants and standardizes on init-aks-custom-cloud.sh.
parts/windows/kuberneteswindowssetup.ps1 Wires CA retrieval call and registers the Windows CA refresh scheduled task during BasePrep.
parts/linux/cloud-init/artifacts/init-aks-custom-cloud-operation-requests.sh Removed (operation-requests variant no longer used).
parts/linux/cloud-init/artifacts/init-aks-custom-cloud-operation-requests-mariner.sh Removed (operation-requests Mariner/AzureLinux variant no longer used).
parts/linux/cloud-init/artifacts/init-aks-custom-cloud-mariner.sh Removed (Mariner/AzureLinux legacy variant no longer used).
aks-node-controller/parser/templates/cse_cmd.sh.gtpl Exports LOCATION into the CSE environment for downstream scripts.
aks-node-controller/parser/helper.go Adds a helper to normalize location and reuses it in cloud target env detection.
pkg/agent/testdata/MarinerV2+Kata/CustomData Regenerated CustomData snapshot due to init/custom cloud wiring changes.
pkg/agent/testdata/Flatcar/CustomData.inner Regenerated CustomData snapshot due to init/custom cloud wiring changes.
pkg/agent/testdata/CustomizedImageLinuxGuard/CustomData Regenerated CustomData snapshot due to init/custom cloud wiring changes.
pkg/agent/testdata/CustomizedImageKata/CustomData Regenerated CustomData snapshot due to init/custom cloud wiring changes.
pkg/agent/testdata/CustomizedImage/CustomData Regenerated CustomData snapshot due to init/custom cloud wiring changes.
pkg/agent/testdata/AzureLinuxV3+Kata/CustomData Regenerated CustomData snapshot due to init/custom cloud wiring changes.
pkg/agent/testdata/AzureLinuxV2+Kata/CustomData Regenerated CustomData snapshot due to init/custom cloud wiring changes.
pkg/agent/testdata/AKSWindows23H2Gen2+NextGenNetworkingNoConfig/CustomData Regenerated CustomData snapshot due to Windows CA refresh task wiring.
pkg/agent/testdata/AKSWindows23H2Gen2+NextGenNetworkingDisabled/CustomData Regenerated CustomData snapshot due to Windows CA refresh task wiring.
pkg/agent/testdata/AKSWindows23H2Gen2+NextGenNetworking/CustomData Regenerated CustomData snapshot due to Windows CA refresh task wiring.
pkg/agent/testdata/AKSWindows2019+ootcredentialprovider/CustomData Regenerated CustomData snapshot due to Windows CA refresh task wiring.
pkg/agent/testdata/AKSWindows2019+SecurityProfile/CustomData Regenerated CustomData snapshot due to Windows CA refresh task wiring.
pkg/agent/testdata/AKSWindows2019+ManagedIdentity/CustomData Regenerated CustomData snapshot due to Windows CA refresh task wiring.
pkg/agent/testdata/AKSWindows2019+KubeletServingCertificateRotation/CustomData Regenerated CustomData snapshot due to Windows CA refresh task wiring.
pkg/agent/testdata/AKSWindows2019+KubeletClientTLSBootstrapping/CustomData Regenerated CustomData snapshot due to Windows CA refresh task wiring.
pkg/agent/testdata/AKSWindows2019+K8S119/CustomData Regenerated CustomData snapshot due to Windows CA refresh task wiring.
pkg/agent/testdata/AKSWindows2019+K8S119+FIPS/CustomData Regenerated CustomData snapshot due to Windows CA refresh task wiring.
pkg/agent/testdata/AKSWindows2019+K8S119+CSI/CustomData Regenerated CustomData snapshot due to Windows CA refresh task wiring.
pkg/agent/testdata/AKSWindows2019+K8S118/CustomData Regenerated CustomData snapshot due to Windows CA refresh task wiring.
pkg/agent/testdata/AKSWindows2019+K8S117/CustomData Regenerated CustomData snapshot due to Windows CA refresh task wiring.
pkg/agent/testdata/AKSWindows2019+K8S116/CustomData Regenerated CustomData snapshot due to Windows CA refresh task wiring.
pkg/agent/testdata/AKSWindows2019+EnablePrivateClusterHostsConfigAgent/CustomData Regenerated CustomData snapshot due to Windows CA refresh task wiring.
pkg/agent/testdata/AKSWindows2019+CustomVnet/CustomData Regenerated CustomData snapshot due to Windows CA refresh task wiring.
pkg/agent/testdata/AKSWindows2019+CustomCloud/CustomData Regenerated CustomData snapshot due to Windows CA refresh task wiring.
pkg/agent/testdata/AKSWindows2019+CustomCloud+ootcredentialprovider/CustomData Regenerated CustomData snapshot due to Windows CA refresh task wiring.
pkg/agent/testdata/AKSUbuntu2404+Teleport/CustomData Regenerated CustomData snapshot due to init/custom cloud wiring changes.
pkg/agent/testdata/AKSUbuntu2404+NetworkPolicy/CustomData Regenerated CustomData snapshot due to init/custom cloud wiring changes.
pkg/agent/testdata/AKSUbuntu2204+ootcredentialprovider/CustomData Regenerated CustomData snapshot due to init/custom cloud wiring changes.
pkg/agent/testdata/AKSUbuntu2204+cgroupv2/CustomData Regenerated CustomData snapshot due to init/custom cloud wiring changes.
pkg/agent/testdata/AKSUbuntu2204+SecurityProfile/CustomData Regenerated CustomData snapshot due to init/custom cloud wiring changes.
pkg/agent/testdata/AKSUbuntu2204+SSHStatusOn/CustomData Regenerated CustomData snapshot due to init/custom cloud wiring changes.
pkg/agent/testdata/AKSUbuntu2204+China/CustomData Regenerated CustomData snapshot due to init/custom cloud wiring changes.
pkg/agent/testdata/ACL/CustomData.inner Regenerated CustomData snapshot due to init/custom cloud wiring changes.

Comment thread parts/windows/kuberneteswindowssetup.ps1 Outdated
Comment thread pkg/agent/variables.go
@rchincha rchincha force-pushed the origin/rchinchani/rcv1p-2 branch from d41856f to 18ba549 on March 19, 2026 20:13
Copilot AI review requested due to automatic review settings March 19, 2026 22:07
@rchincha rchincha force-pushed the origin/rchinchani/rcv1p-2 branch from 18ba549 to e94c465 on March 19, 2026 22:07

Copilot AI left a comment

Pull request overview

This PR updates AKS custom cloud certificate bootstrapping to use a single unified flow and adds a Windows scheduled task for periodic custom cloud CA refresh.

Changes:

  • Added Windows CA refresh task registration plus new logic to select cert retrieval mode and opt-in gating.
  • Simplified Linux custom cloud init script wiring by removing legacy “operation-requests” variants and normalizing location for refresh mode.
  • Added/updated tests and refreshed golden testdata outputs to reflect new custom data content.

Reviewed changes

Copilot reviewed 95 out of 101 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
staging/cse/windows/kubernetesfunc.tests.ps1 Adds Pester coverage for endpoint-mode selection, task registration behavior, and CA retrieval failure handling.
staging/cse/windows/kubernetesfunc.ps1 Implements endpoint-mode derivation, opt-in gating, CA retrieval paths, and a Windows scheduled task for refresh.
spec/parts/linux/cloud-init/artifacts/init_aks_custom_cloud_spec.sh Adds ShellSpec checks to ensure init script wiring for ca-refresh mode and LOCATION usage.
pkg/agent/variables.go Simplifies init script selection and updates how custom cloud init script is injected into cloud-init data.
pkg/agent/const.go Removes now-unused custom-cloud init script constants; keeps unified init script constant.
parts/windows/kuberneteswindowssetup.ps1 Updates Windows setup to call Get-CACertificates with Location and conditionally register refresh task.
aks-node-controller/parser/templates/cse_cmd.sh.gtpl Adds LOCATION variable for downstream scripts during custom cloud provisioning.
aks-node-controller/parser/helper.go Adds getCloudLocation helper and reuses it for cloud target env detection.
parts/linux/cloud-init/artifacts/init-aks-custom-cloud-operation-requests.sh Removes legacy operation-requests init script (superseded by unified script).
parts/linux/cloud-init/artifacts/init-aks-custom-cloud-operation-requests-mariner.sh Removes legacy Mariner operation-requests init script (superseded by unified script).
parts/linux/cloud-init/artifacts/init-aks-custom-cloud-mariner.sh Removes legacy Mariner init script variant (superseded by unified script).
pkg/agent/testdata/MarinerV2+Kata/CustomData Updates golden customData to match unified custom cloud init content.
pkg/agent/testdata/Flatcar/CustomData.inner Updates golden ignition/customData payload for unified custom cloud init content.
pkg/agent/testdata/CustomizedImageLinuxGuard/CustomData Updates golden customData to match unified custom cloud init content.
pkg/agent/testdata/CustomizedImageKata/CustomData Updates golden customData to match unified custom cloud init content.
pkg/agent/testdata/CustomizedImage/CustomData Updates golden customData to match unified custom cloud init content.
pkg/agent/testdata/AzureLinuxV3+Kata/CustomData Updates golden customData to match unified custom cloud init content.
pkg/agent/testdata/AzureLinuxV2+Kata/CustomData Updates golden customData to match unified custom cloud init content.
pkg/agent/testdata/AKSWindows23H2Gen2+NextGenNetworkingNoConfig/CustomData Updates golden Windows customData to pass Location to CA cert retrieval + refresh task gating.
pkg/agent/testdata/AKSWindows23H2Gen2+NextGenNetworkingDisabled/CustomData Updates golden Windows customData to pass Location to CA cert retrieval + refresh task gating.
pkg/agent/testdata/AKSWindows23H2Gen2+NextGenNetworking/CustomData Updates golden Windows customData to pass Location to CA cert retrieval + refresh task gating.
pkg/agent/testdata/AKSWindows2019+ootcredentialprovider/CustomData Updates golden Windows customData to pass Location to CA cert retrieval + refresh task gating.
pkg/agent/testdata/AKSWindows2019+SecurityProfile/CustomData Updates golden Windows customData to pass Location to CA cert retrieval + refresh task gating.
pkg/agent/testdata/AKSWindows2019+ManagedIdentity/CustomData Updates golden Windows customData to pass Location to CA cert retrieval + refresh task gating.
pkg/agent/testdata/AKSWindows2019+KubeletServingCertificateRotation/CustomData Updates golden Windows customData to pass Location to CA cert retrieval + refresh task gating.
pkg/agent/testdata/AKSWindows2019+KubeletClientTLSBootstrapping/CustomData Updates golden Windows customData to pass Location to CA cert retrieval + refresh task gating.
pkg/agent/testdata/AKSWindows2019+K8S119/CustomData Updates golden Windows customData to pass Location to CA cert retrieval + refresh task gating.
pkg/agent/testdata/AKSWindows2019+K8S119+FIPS/CustomData Updates golden Windows customData to pass Location to CA cert retrieval + refresh task gating.
pkg/agent/testdata/AKSWindows2019+K8S119+CSI/CustomData Updates golden Windows customData to pass Location to CA cert retrieval + refresh task gating.
pkg/agent/testdata/AKSWindows2019+K8S118/CustomData Updates golden Windows customData to pass Location to CA cert retrieval + refresh task gating.
pkg/agent/testdata/AKSWindows2019+K8S117/CustomData Updates golden Windows customData to pass Location to CA cert retrieval + refresh task gating.
pkg/agent/testdata/AKSWindows2019+K8S116/CustomData Updates golden Windows customData to pass Location to CA cert retrieval + refresh task gating.
pkg/agent/testdata/AKSWindows2019+EnablePrivateClusterHostsConfigAgent/CustomData Updates golden Windows customData to pass Location to CA cert retrieval + refresh task gating.
pkg/agent/testdata/AKSWindows2019+CustomVnet/CustomData Updates golden Windows customData to pass Location to CA cert retrieval + refresh task gating.
pkg/agent/testdata/AKSWindows2019+CustomCloud/CustomData Updates golden Windows customData to pass Location to CA cert retrieval + refresh task gating.
pkg/agent/testdata/AKSWindows2019+CustomCloud+ootcredentialprovider/CustomData Updates golden Windows customData to pass Location to CA cert retrieval + refresh task gating.
pkg/agent/testdata/AKSUbuntu2404+Teleport/CustomData Updates golden customData to match unified custom cloud init content.
pkg/agent/testdata/AKSUbuntu2404+NetworkPolicy/CustomData Updates golden customData to match unified custom cloud init content.
pkg/agent/testdata/AKSUbuntu2204+ootcredentialprovider/CustomData Updates golden customData to match unified custom cloud init content.
pkg/agent/testdata/AKSUbuntu2204+cgroupv2/CustomData Updates golden customData to match unified custom cloud init content.
pkg/agent/testdata/AKSUbuntu2204+SecurityProfile/CustomData Updates golden customData to match unified custom cloud init content.
pkg/agent/testdata/AKSUbuntu2204+SSHStatusOn/CustomData Updates golden customData to match unified custom cloud init content.
pkg/agent/testdata/AKSUbuntu2204+SSHStatusOff/CustomData Updates golden customData to match unified custom cloud init content.
pkg/agent/testdata/AKSUbuntu2204+China/CustomData Updates golden customData to match unified custom cloud init content.
pkg/agent/testdata/ACL/CustomData.inner Updates golden ignition/customData payload for unified custom cloud init content.
Comments suppressed due to low confidence (7)

staging/cse/windows/kubernetesfunc.ps1:1

  • Get-CACertificates used to fail fast via Set-ExitCode on retrieval/parse errors, but now returns $false (and logs warnings) for a wide range of failure cases. Because call sites in the generated setup scripts invoke Get-CACertificates -Location $Location without checking the return value, this can silently proceed without required CA material and lead to harder-to-diagnose TLS failures later in provisioning. Consider restoring fatal behavior for “expected-to-install” scenarios (e.g., legacy mode, or rcv1p when opted-in), or have callers check the return value and invoke Set-ExitCode when it’s $false in those modes.
staging/cse/windows/kubernetesfunc.ps1:1
  • Get-CACertificates used to fail fast via Set-ExitCode on retrieval/parse errors, but now returns $false (and logs warnings) for a wide range of failure cases. Because call sites in the generated setup scripts invoke Get-CACertificates -Location $Location without checking the return value, this can silently proceed without required CA material and lead to harder-to-diagnose TLS failures later in provisioning. Consider restoring fatal behavior for “expected-to-install” scenarios (e.g., legacy mode, or rcv1p when opted-in), or have callers check the return value and invoke Set-ExitCode when it’s $false in those modes.
pkg/agent/variables.go:1
  • This change removes the previous cs.IsAKSCustomCloud() guard and injects the custom cloud init script into cloudInitData unconditionally. That can increase customData size for all clusters (risking platform limits) and may introduce unintended side effects if any downstream template writes/executes this script outside custom cloud. Recommend reinstating the custom cloud guard (and only setting initAKSCustomCloud when IsAKSCustomCloud() is true), while still using the unified initAKSCustomCloudScript for all custom clouds.
staging/cse/windows/kubernetesfunc.ps1:1
  • $resourceFileName is used directly to build a path under C:\ca. If the upstream response ever contains path separators (e.g., ..\foo or nested paths), this can write outside the intended directory. Prefer sanitizing to a basename (e.g., using Split-Path -Leaf or [IO.Path]::GetFileName($resourceFileName)) before Join-Path, and consider rejecting names containing directory traversal characters.
staging/cse/windows/kubernetesfunc.ps1:1
  • $resourceFileName is used directly to build a path under C:\ca. If the upstream response ever contains path separators (e.g., ..\foo or nested paths), this can write outside the intended directory. Prefer sanitizing to a basename (e.g., using Split-Path -Leaf or [IO.Path]::GetFileName($resourceFileName)) before Join-Path, and consider rejecting names containing directory traversal characters.
staging/cse/windows/kubernetesfunc.ps1:1
  • The new rcv1p operation-requests flow is non-trivial (multiple requests, JSON shape assumptions, per-item content downloads, and $downloadedAny aggregation), but the added Pester tests only cover legacy mode and the “throws returns false” path. Add tests that (1) exercise the rcv1p path end-to-end with mocked Retry-Command returning operation requests and cert bodies, and (2) verify behavior when operation requests are empty/invalid (ensuring the function returns $false and logs expected warnings).
pkg/agent/variables.go:1
  • The PR description still contains placeholder text (Fixes # with no linked issue and no explanation of “what/why”). Please update the PR description to summarize the behavior change (unified bootstrap + Windows refresh task) and link the relevant issue or remove the placeholder.

Comment thread aks-node-controller/parser/templates/cse_cmd.sh.gtpl
@rchincha rchincha force-pushed the origin/rchinchani/rcv1p-2 branch from e94c465 to f20d5b8 Compare March 19, 2026 23:28
Copilot AI review requested due to automatic review settings March 20, 2026 06:42
@rchincha rchincha force-pushed the origin/rchinchani/rcv1p-2 branch from f20d5b8 to b53f240 Compare March 20, 2026 06:42
rchincha and others added 2 commits June 26, 2026 22:13
…o surface preamble failures (REVERT ME)

Without this guard, any failure between line ~256 and the main try-block at line ~632 (download, expand-archive, or top-level error in any dot-sourced .ps1 file) leaves the VM with no provision.complete file, and CSE returns the opaque WINDOWS_CSE_ERROR_NO_CSE_RESULT_LOG (exit 50) with no information about the actual cause. This wrapper writes a structured provision.complete with exit code 76 and the underlying error message, converting the silent failure into a diagnosable one.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Pull request overview

Copilot reviewed 30 out of 30 changed files in this pull request and generated 6 comments.

Comment thread staging/cse/windows/kubernetesfunc.ps1
Comment thread parts/linux/cloud-init/artifacts/init-aks-custom-cloud-repos.sh
Comment thread parts/linux/cloud-init/artifacts/init-aks-custom-cloud-repos.sh
Comment thread parts/linux/cloud-init/artifacts/init-aks-custom-cloud-repos.sh
Comment thread parts/linux/cloud-init/artifacts/init-aks-custom-cloud.sh
rchincha and others added 2 commits June 27, 2026 00:13
…ain baseline

Diagnostic: if RCV1P Windows e2e passes with this commit, the failure is in our kubernetesfunc.ps1 edits. If it still fails, the failure is elsewhere (other staging/cse/windows/ files or e2e infra). MUST be reverted before merge.
… account

The per-subscription storage account naming scheme (e.g. abe2etmewestus338d771 for sub 38d77129...) creates a fresh storage account whenever a new E2E_SUBSCRIPTION_ID is used. The e2e bootstrap was granting Storage Blob Data Contributor only to the VM managed identity, not to the test runner principal (ADO service-connection SP in pipelines, or developer's user identity locally). The runner has management-plane Contributor (sufficient to CREATE the storage account) but no data-plane RBAC, producing 'HTTP 403 AuthorizationPermissionMismatch' when uploading the branch CSE zip for RCV1P Windows tests.

Resolve the runner's object ID by decoding the JWT 'oid' claim from an ARM access token. Distinguish ServicePrincipal vs User via the 'idtyp' claim so ARM accepts the role assignment in both pipeline and local-dev contexts. Idempotent: a 409 Conflict on re-run is swallowed (matches the existing assignRolesToVMIdentity behavior).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Pull request overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 4 comments.

Comment thread parts/windows/kuberneteswindowssetup.ps1
Comment thread parts/windows/kuberneteswindowssetup.ps1
Comment thread parts/linux/cloud-init/artifacts/init-aks-custom-cloud.sh Outdated
Comment thread parts/linux/cloud-init/artifacts/init-aks-custom-cloud.sh
rchincha and others added 4 commits June 27, 2026 09:32
The Azure Firewall app rule in aks_model.go:219 whitelists the storage account FQDN at cluster-creation time. After 3f1c640 made the storage account name subscription-unique (abe2ewestus3 -> abe2ewestus38ecadf in MSFT sub), pre-existing cached RCV1P clusters still embed the old FQDN in their firewall, so Windows VMs cannot reach the new storage to download the branch CSE zip, causing exit 50 (provision.complete not generated).

Bumping the cluster name forces fresh clusters with a firewall rule matching the new storage account name. Follows the established vN versioning pattern used on main (abe2e-kubenet-v5, abe2e-azure-network-v4, etc.).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…/catch to surface preamble failures (REVERT ME)"

This reverts commit 9f963d5.
…drifts

The shared Azure Firewall `abe2e-fw` was created once with the legacy storage FQDN `abe2ewestus3.blob.core.windows.net`. After the per-sub storage rename (`BlobStorageAccount()` now returns `abe2ewestus3<sub-suffix>`), the firewall still blocked the new FQDN. Windows CSE failed with NO_CSE_RESULT_LOG because the zip download timed out before any log file could be written; Linux CSE hung on artifact downloads.

`ensureSharedFirewall` previously returned early on any existing firewall, never reconciling its rules. This change detects FQDN drift on the dynamic blob-storage-fqdn rule and re-issues CreateOrUpdate with the current target FQDN, reusing the existing public IP to preserve external bindings.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Pull request overview

Copilot reviewed 31 out of 31 changed files in this pull request and generated 3 comments.

Comment thread parts/linux/cloud-init/artifacts/init-aks-custom-cloud.sh
Comment thread parts/linux/cloud-init/artifacts/init-aks-custom-cloud.sh
Comment thread staging/cse/windows/kubernetesfunc.ps1
rchincha and others added 3 commits June 28, 2026 09:45
… upload

Windows RCV1P scenarios construct rcv1pWindowsCSEMutator at scenario-struct
init time, which calls getOrBuildBranchCSEPackageURL -> buildAndUploadCSEZip.
This runs BEFORE RunScenario -> CachedCreateVMManagedIdentity, which is what
actually creates the per-sub blob storage account on first use.

On westus3 the account had been created by prior runs so the upload worked.
The first ever run in southcentralus had no pre-existing account, so the
blob client's first PUT failed with NXDOMAIN:
  dial tcp: lookup abe2etmesouthcentr38d771.blob.core.windows.net on
  127.0.0.53:53: no such host

Piggyback on the same CachedCreateVMManagedIdentity that Linux scenarios use
to guarantee the storage account exists before attempting the upload. Bump
the ctx timeout from 2m to 5m to cover a cold-start storage account create
(~30-90s) on top of the zip build/upload.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…t failures

Addresses PR #8096 review threads r3485350754, r3485350764, r3485350769:
the sourced init-aks-custom-cloud-repos.sh uses `exit 1` to fail-fast on
repo init errors, which bypasses parent-side telemetry. Add emit_event
calls (defined in the sourcing parent init-aks-custom-cloud.sh) before
each exit so failures are visible off-node in Geneva/Kusto via the Guest
Agent. Behavior is unchanged: still fail-fast at the failing site.

Sites covered:
- check_url: URL unreachable
- aptget_update: 404 from apt source
- dnf_makecache: failure after retries (Mariner + Azure Linux)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
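
The emit-then-exit pattern described in this commit could look roughly like the sketch below. It is an illustration only: the real `emit_event` is defined in init-aks-custom-cloud.sh and its signature is not shown in this thread, so the stub, the `fail_with_event` helper, the `EVENTS_LOG` variable, and the injectable probe argument are all assumptions.

```shell
#!/bin/bash
# Hypothetical stand-in for the parent's emit_event. The real helper reports
# through the Guest Agent; this stub only appends a line to EVENTS_LOG.
EVENTS_LOG="${EVENTS_LOG:-/tmp/rcv1p-events.log}"

emit_event() {
    # $1 = task name, $2 = message (argument names are assumptions)
    printf '%s\t%s\n' "$1" "$2" >> "$EVENTS_LOG"
}

# Emit telemetry first, then fail fast at the failing site.
fail_with_event() {
    emit_event "$1" "$2"
    exit 1
}

check_url() {
    # $1 = URL, $2 = probe command (injectable so the sketch can run offline)
    url="$1"
    probe="${2:-curl -fsS -o /dev/null --max-time 10}"
    if ! $probe "$url"; then
        fail_with_event "check_url" "URL unreachable: $url"
    fi
}
```

Because the script is sourced, `exit 1` still terminates the parent, so the event has to be written before the exit rather than from a handler in the parent.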
… code

Addresses PR #8096 review r3485665993. The comment claimed wireserver
may return either JSON or key=value, but the code only parses JSON via
jq. Empirically wireserver returns JSON, and the Windows side
(kubernetesfunc.ps1 Get-CACertificates) also assumes JSON only.
Update the comment to match.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Pull request overview

Copilot reviewed 31 out of 31 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

parts/linux/cloud-init/artifacts/init-aks-custom-cloud.sh:80

  • This distro-detection block uses [[ ... ]] in several places, but the repo’s make validate-shell runs ShellCheck POSIX checks (SC3010) on this script. Only the first [[ is disabled right now, so CI will still flag the remaining [[ usages. Please rewrite this block using POSIX [ ]/case (or add per-line disables, though POSIX syntax is preferred).
# shellcheck disable=SC3010
if [[ -f /etc/os-release ]]; then
    . /etc/os-release
    # shellcheck disable=SC3010
    if [[ $NAME == *"Ubuntu"* ]]; then
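
One way the POSIX rewrite suggested above might look is sketched here. It is not the script's actual code: `detect_distro`, its path parameter, and the returned labels are hypothetical, and the `NAME` patterns are assumptions about what each distro writes to /etc/os-release.

```shell
#!/bin/sh
# detect_distro is a hypothetical helper. The os-release path is a parameter
# only so the sketch can be exercised against a fixture file.
detect_distro() {
    os_release="${1:-/etc/os-release}"
    if [ ! -f "$os_release" ]; then
        echo "unknown"
        return 0
    fi
    # Source the file inside a command substitution so its variables do not
    # leak into the caller's environment.
    name="$(. "$os_release" && printf '%s' "$NAME")"
    # case with glob patterns replaces the [[ $NAME == *"Ubuntu"* ]] tests
    # and needs no SC3010 disables.
    case "$name" in
        *Ubuntu*)        echo "ubuntu" ;;
        *Mariner*)       echo "mariner" ;;
        *"Azure Linux"*) echo "azurelinux" ;;
        *Flatcar*)       echo "flatcar" ;;
        *)               echo "unknown" ;;
    esac
}
```

A single `case` keeps every distro branch in POSIX syntax, which avoids both the remaining SC3010 findings and the per-line disable comments.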

Comment thread e2e/validators.go
Comment thread parts/linux/cloud-init/artifacts/cse_cmd.sh
Previously cp and update-ca-* failures were silently swallowed because the final debug_print_trust_store call ends with `ls ... || true` and returns 0, masking all prior errors. An empty *.crt glob also expanded to the literal '*.crt', producing a cp failure that callers never saw.

Track rc through each step, guard the empty-glob case with compgen, and return rc so the existing call-site checks (which exit 1 on failure) can actually fire.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
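
The error propagation described in this commit can be sketched as follows. This is a simplified illustration, not the body of install_certs_to_trust_store: the function name `install_certs`, its three arguments, and the injectable update command are assumptions made so the example runs on its own.

```shell
#!/bin/bash
# Hypothetical simplification of the trust-store install step.
# $1 = directory holding downloaded certs
# $2 = trust store directory
# $3 = command that rebuilds the trust store (e.g. update-ca-certificates)
install_certs() {
    local src="$1" dest="$2" update_cmd="$3" rc=0

    # compgen -G succeeds only when the glob matches at least one file, so an
    # empty directory never reaches cp as the literal string '*.crt'.
    if ! compgen -G "${src}/*.crt" > /dev/null; then
        echo "no .crt files found in ${src}" >&2
        return 1
    fi

    cp "${src}"/*.crt "${dest}/" || rc=$?
    if [ "$rc" -eq 0 ]; then
        $update_cmd || rc=$?
    fi

    # The debug listing is best-effort and must not overwrite rc.
    ls -l "${dest}" >&2 || true
    return "$rc"
}
```

Returning the saved `rc` instead of the status of the final listing is what lets a caller's `|| exit 1` check fire.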

Copilot AI left a comment

Pull request overview

Copilot reviewed 31 out of 31 changed files in this pull request and generated 3 comments.

Comment thread staging/cse/windows/kubernetesfunc.ps1
Comment thread parts/linux/cloud-init/artifacts/cse_cmd.sh
Comment thread staging/cse/windows/kubernetesfunc.tests.ps1 Outdated

Copilot AI left a comment

Pull request overview

Copilot reviewed 31 out of 31 changed files in this pull request and generated 2 comments.

Comment thread parts/linux/cloud-init/artifacts/cse_cmd.sh
Comment thread parts/linux/cloud-init/artifacts/init-aks-custom-cloud.sh
rchincha and others added 3 commits June 28, 2026 21:34
Mirror the defense-in-depth check already present on Linux (parts/linux/cloud-init/artifacts/init-aks-custom-cloud.sh:239-247): if wireserver ever returns a ResouceFileName that includes a path separator or an absolute path, Join-Path would write outside $caFolder. Sanitize via [IO.Path]::GetFileName() and skip the entry when the sanitized value differs from the original.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
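
The same idea can be illustrated in shell. This sketch is not the Linux check referenced at init-aks-custom-cloud.sh:239-247, which is not shown in this thread; `is_safe_cert_filename` is a hypothetical helper that rejects any name containing a path separator.

```shell
#!/bin/sh
# is_safe_cert_filename is a hypothetical helper: accept a name only if it is
# non-empty, is not "." or "..", and contains no "/" or "\" separator.
is_safe_cert_filename() {
    name="$1"
    case "$name" in
        ""|"."|"..") return 1 ;;
        */*|*\\*)    return 1 ;;
    esac
    return 0
}
```

Skipping a rejected entry, rather than silently rewriting it to its basename, matches the commit's choice to skip when the sanitized value differs from the original.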
The pre-dot-source Set-ExitCode at lines 16-19 was dead code -- it gets overwritten when windowscsehelper.ps1 is dot-sourced, then re-overridden by the identical definition at line 62 (now line 57). Keep only the post-dot-source override that actually shadows the real Set-ExitCode.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
On Ubuntu 24.04, /usr/lib/ssl/cert.pem is a symlink to /etc/ssl/certs/ca-certificates.crt. The cp invocation fails with 'are the same file' (exit 1), which after commit 9923f06 (propagate errors from install_certs_to_trust_store) caused install_certs_to_trust_store to return non-zero. That aborted the rcv1p init flow before the ca-refresh cron entry was installed, breaking Test_RCV1P_Ubuntu2404/{default,scriptless_nbc}.

Use test -ef (same device + inode) to detect when the dest already refers to the source and skip the copy. Applies same defensive check in init-aks-custom-cloud-repos.sh to avoid a misleading stderr error on Ubuntu 24.04 custom-cloud nodes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
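
A minimal sketch of the `-ef` guard described in this commit, assuming a hypothetical helper name `copy_unless_same_file`; the actual change in the init scripts is not shown in this thread.

```shell
#!/bin/bash
# copy_unless_same_file is a hypothetical helper name.
copy_unless_same_file() {
    local src="$1" dest="$2"
    # -ef is true when both paths resolve to the same device and inode,
    # which covers the case where dest is a symlink to src.
    if [ "$src" -ef "$dest" ]; then
        echo "skip: ${dest} already refers to ${src}"
        return 0
    fi
    cp "$src" "$dest"
}
```

When the destination does not exist yet, `-ef` is false and the copy proceeds as before, so only the symlink case changes behavior.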

Copilot AI left a comment

Pull request overview

Copilot reviewed 31 out of 31 changed files in this pull request and generated 2 comments.

Comment thread parts/linux/cloud-init/artifacts/init-aks-custom-cloud.sh
Comment thread parts/linux/cloud-init/artifacts/init-aks-custom-cloud.sh
@aks-node-assistant

Failed gate

Run: https://msazure.visualstudio.com/CloudNativeCompute/_build/results?buildId=169691293

Failed job/stage/task: e2e / Run AgentBaker E2E (logId 538) and build2204arm64gen2containerd / Test, Scan, and Cleanup (logId 381).

Detective summary

The E2E failure is the known LocalDNS exporter failed-state signature: Test_Ubuntu2204_ANCHotfix_BinarySelection/default reached validation, localdns metrics and Inspektor Gadget checks passed, then validators.go reported localdns-exporter@...service unexpectedly failed. The processed CIS pair is the known Ubuntu 22.04 6.1.3.1 pass->fail signature.

Likely cause / signature

Primary likely cause is existing localdns-exporter runtime/test flakiness, not an RCV1P bootstrap regression. Signature: localdns-exporter-systemd-failed-state. Confidence: Medium-high.

Secondary processed signature: linux-vhd-prgate-cis-ubuntu2204-gen2-containerd-6131-logfiles, tracked by repair item #38501652. The localdns signature is tracked by repair item #38581800.

Strongest alternative: PR #8096 touches RCV1P/CSE/cloud-init and could have introduced bootstrap side effects, but node readiness, localdns metrics, Inspektor Gadget, and wireserver validation all passed before the unit-state check failed.

Recommended action

No immediate PR author action recommended for these two known signatures. Continue tracking under repair items #38581800 and #38501652.

Evidence

// Test_RCV1P_Flatcar validates RCV1P on Flatcar Container Linux, which has a read-only root
// filesystem and requires certificates to be placed in /etc/ssl/certs/ as .pem files.
// This is the most constrained environment for cert installation.
func Test_RCV1P_Flatcar(t *testing.T) {

can probably omit this test since we're in the middle of removing flatcar support

Comment thread e2e/types.go

// ClusterInfra captures the Azure infrastructure scope for cluster operations.
// It allows cluster creation and management to target different subscriptions.
type ClusterInfra struct {

seems like this type is no longer used - can we get rid of it?

Comment thread e2e/validators.go
// Validate trust store was updated (distro-specific path)
trustStoreDir := rcv1pTrustStoreDir(s)
execScriptOnVMForScenarioValidateExitCode(ctx, s,
fmt.Sprintf("sudo ls -1 %s/*.crt 2>/dev/null || sudo ls -1 %s/*.pem 2>/dev/null", trustStoreDir, trustStoreDir),

nit: may be able to simplify to a single sudo ls -1 %s/*.{crt,pem} 2>/dev/null
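
One caveat with the brace form, assuming the command runs under bash with default glob options: `*.{crt,pem}` expands to two separate globs, and an unmatched one is passed to `ls` literally, so `ls` exits non-zero even when the other extension matched. A sketch of an extension-agnostic check, using a hypothetical helper `has_cert_files`:

```shell
#!/bin/bash
# has_cert_files is a hypothetical helper: succeeds if the directory holds at
# least one *.crt or *.pem file, regardless of which extension is present.
has_cert_files() {
    local dir="$1"
    compgen -G "${dir}/*.crt" > /dev/null || compgen -G "${dir}/*.pem" > /dev/null
}
```

The existing `ls ... || ls ...` form has the same either-extension semantics, so the brace version would only be a drop-in replacement where both extensions are always present.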

Comment thread e2e/validators.go
} else {
// Ubuntu, Mariner, AzureLinux use cron
execScriptOnVMForScenarioValidateExitCode(ctx, s,
"sudo crontab -l 2>/dev/null | grep -q ca-refresh",

why not use a systemd timer for all cases?

Comment thread e2e/validators.go

// Validate no refresh schedule was created
execScriptOnVMForScenarioValidateExitCode(ctx, s,
"sudo crontab -l 2>/dev/null | grep -q ca-refresh",

we're missing a negative check on the systemd timer for ACL

{{GetVariableProperty "cloudInitData" "azureNetworkUdevRule"}}

{{if IsAKSCustomCloud}}
- path: {{GetInitAKSCustomCloudReposFilepath}}

this needs to be baked onto the VHD in all cases to support scriptless provisioning

Comment thread pkg/agent/baker.go
"GetInitAKSCustomCloudFilepath": func() string {
return initAKSCustomCloudFilepath
},
"GetInitAKSCustomCloudReposFilepath": func() string {
