Skip to content

fix(prow-job-executor): retry Key Vault prow-token lookup on transient failures (AROSLSRE-1228)#252

Merged
geoberle merged 5 commits into
Azure:mainfrom
raelga:fix/prowtoken-keyvault-retry-1228
Jun 23, 2026
Merged

fix(prow-job-executor): retry Key Vault prow-token lookup on transient failures (AROSLSRE-1228)#252
geoberle merged 5 commits into
Azure:mainfrom
raelga:fix/prowtoken-keyvault-retry-1228

Conversation

@raelga

@raelga raelga commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Problem

The prowjobexecutor execute command used in EV2 gating looks up the Prow token in Key Vault before it ever talks to Prow, and that lookup had no retry logic. When the managed-identity token acquisition hits a transient failure — most commonly an EOF / connection reset reading the instance metadata endpoint (169.254.169.254) on the EV2 runner — the whole EV2 gating step fails immediately and the entire deployment job must be restarted from scratch.

This is a separate failure point from #251 / AROSLSRE-1168, which added retries to the Gangway SubmitJob() 429 path. This failure happens earlier, during Key Vault credential acquisition.

Observed in PROD (branch-ci-Azure-ARO-HCP-main-e2e-prod-e2e-parallel, eastus2euap):

failed to lookup prow token in Key Vault: failed to get secret "prow-token" from
Key Vault "https://arohcpprod-global.vault.azure.net": DefaultAzureCredential:
failed to acquire a token. ... ManagedIdentityCredential: Get
"http://169.254.169.254/metadata/identity/oauth2/token...": EOF

az login --identity succeeded seconds earlier in the same step, so this is a momentary IMDS/networking blip, not a misconfiguration.

What changes

  • Wrap lookupProwTokenInKeyVault in an exponential-backoff retry (retryProwTokenLookup), mirroring the existing SubmitJob pattern in prowjob/client.go.
  • isRetryableKeyVaultError retries transient failures — credential/IMDS and network errors, plus Key Vault HTTP 429/5xx — and fails fast on permanent Key Vault 4xx (401/403/404/400/409).
  • The single-shot lookup is factored into lookupProwTokenInKeyVaultOnce; the fetch is injectable so retry behavior is unit-tested without a live Key Vault.
  • Retries are logged and bounded by the parent context. Backoff: 30s, 1, 2, 4, 8, 16 min over up to 6 attempts (~31 min worst-case).

Testing

  • go test -race ./... — pass (new prowtoken_test.go: classifier 429/5xx/4xx/IMDS-EOF, retry-then-succeed, fail-fast on 403/404, exhaust on persistent transient).
  • go tool golangci-lint run ./... — 0 issues.
  • go vet ./... — clean. go mod tidy + go work sync clean (promotes azcore to a direct require).

Follow-ups

  • After merge + tag, bump the dependency in Azure/ARO-HCP (tooling/templatize, test) so the EV2-shipped prowjobexecutor binary picks up the retry.

Fixes AROSLSRE-1228.

…t failures

The prow-token Key Vault lookup used by EV2 gating had no retry. A transient
managed-identity/IMDS failure (EOF reading 169.254.169.254 while minting the
token) or a Key Vault 429/5xx failed the whole EV2 gating step, forcing the
entire deployment job to restart.

Wrap lookupProwTokenInKeyVault in an exponential-backoff retry mirroring the
existing SubmitJob pattern. Transient errors (credential/IMDS, network, Key
Vault 429/5xx) are retried; permanent Key Vault 4xx (401/403/404/400/409) fail
fast. Retries are logged and bounded by the parent context.

AROSLSRE-1228
Copilot AI review requested due to automatic review settings June 19, 2026 11:49

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves the robustness of prowjobexecutor execute in EV2 gating by adding exponential-backoff retries around the Key Vault prow-token lookup, addressing transient managed-identity/IMDS and Key Vault service failures before the tool contacts Prow.

Changes:

  • Add a retry wrapper (retryProwTokenLookup) around the Key Vault secret fetch, with a retryability classifier (isRetryableKeyVaultError).
  • Factor the single-attempt Key Vault lookup into lookupProwTokenInKeyVaultOnce to enable unit testing of retry behavior.
  • Add unit tests for retry classification and retry loop behavior, and promote azcore to a direct module dependency.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
tools/prow-job-executor/prowtoken.go Adds retry/backoff logic and retryability classification for Key Vault prow-token lookup.
tools/prow-job-executor/prowtoken_test.go Adds unit tests covering retry classification and retry/exhaust/fail-fast scenarios.
tools/prow-job-executor/go.mod Promotes github.com/Azure/azure-sdk-for-go/sdk/azcore to a direct requirement.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tools/prow-job-executor/prowtoken.go Outdated
Comment thread tools/prow-job-executor/prowtoken.go Outdated
Comment thread tools/prow-job-executor/prowtoken.go
- bump backoff Steps 6->7 so the schedule matches the documented ~31.5m
  worst-case (six sleeps: 30s,1,2,4,8,16m), fixing a Steps/comment mismatch
- in retryProwTokenLookup, return a cancelled/expired parent context error
  as-is and only wrap with "after retries" when retries were genuinely
  exhausted, so deadline/cancel errors are no longer masked
- treat context.Canceled/DeadlineExceeded as non-retryable in
  isRetryableKeyVaultError to fail fast instead of looping
- add tests for context-error classification and fail-fast-on-cancel

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Comment thread tools/prow-job-executor/prowtoken.go
Comment thread tools/prow-job-executor/prowtoken_test.go
Comment thread tools/prow-job-executor/prowtoken.go
…on permanent KV errors

Three call sites (Gangway SubmitJob, GetJobStatus, Key Vault prow-token
lookup) each reimplemented the same wait.Backoff retry loop. Extract a
single generic helper retry.WithValue[T] and route all three through it.

Also narrow the Key Vault classifier (review follow-up): deterministic
local failures (unparseable Key Vault URI, empty secret value) are now
marked errPermanentKeyVaultLookup and fail fast instead of consuming the
full ~31-minute transient backoff budget. IMDS/credential and transport
errors remain retryable.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Comment thread tools/prow-job-executor/internal/retry/retry.go
Comment thread tools/prow-job-executor/prowtoken_test.go
Comment thread tools/prow-job-executor/internal/retry/retry_test.go
…lper

WithValue now checks ctx.Err() at the top of the error branch so a context
cancelled mid-call surfaces as-is, without a misleading "will retry" log or
recording it as the last transient error (notably the GetJobStatus path, whose
classifier treats context errors as retryable).

Strengthen the context-cancellation tests to assert call counts: a context
cancelled before the first attempt never invokes fn (0 calls), and a new test
deterministically exercises mid-call cancellation (1 call, surfaces
context.Canceled, not the exhausted-budget wrapper).

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

tools/prow-job-executor/prowtoken.go:175

  • cmdutils.GetAzureTokenCredentials() failures are local/deterministic (e.g., Azure CLI not available, invalid credential chain options) and are unlikely to be fixed by waiting. Right now they’ll be treated as retryable (non-HTTP, non-sentinel) and can burn the full ~31.5 minute backoff before failing. Consider marking this error as permanent (wrap with errPermanentKeyVaultLookup) so misconfiguration fails fast while still retrying genuine IMDS/token-acquisition blips during GetSecret().
	// Get Azure credentials using ARO-Tools
	cred, err := cmdutils.GetAzureTokenCredentials()
	if err != nil {
		return "", fmt.Errorf("failed to get Azure credentials: %w", err)
	}

Comment thread tools/prow-job-executor/prowjob/client.go
…um attempts"

wait.Backoff.Steps is the total number of attempts, not retries. Align this
comment with the other two backoffs in the codebase.
@geoberle geoberle merged commit 4b47beb into Azure:main Jun 23, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants