fix(prow-job-executor): retry Key Vault prow-token lookup on transient failures (AROSLSRE-1228)#252
Merged
Conversation
…t failures The prow-token Key Vault lookup used by EV2 gating had no retry. A transient managed-identity/IMDS failure (EOF reading 169.254.169.254 while minting the token) or a Key Vault 429/5xx failed the whole EV2 gating step, forcing the entire deployment job to restart. Wrap lookupProwTokenInKeyVault in an exponential-backoff retry mirroring the existing SubmitJob pattern. Transient errors (credential/IMDS, network, Key Vault 429/5xx) are retried; permanent Key Vault 4xx (401/403/404/400/409) fail fast. Retries are logged and bounded by the parent context. AROSLSRE-1228
There was a problem hiding this comment.
Pull request overview
This PR improves the robustness of prowjobexecutor execute in EV2 gating by adding exponential-backoff retries around the Key Vault prow-token lookup, addressing transient managed-identity/IMDS and Key Vault service failures before the tool contacts Prow.
Changes:
- Add a retry wrapper (
retryProwTokenLookup) around the Key Vault secret fetch, with a retryability classifier (isRetryableKeyVaultError). - Factor the single-attempt Key Vault lookup into
lookupProwTokenInKeyVaultOnceto enable unit testing of retry behavior. - Add unit tests for retry classification and retry loop behavior, and promote
azcoreto a direct module dependency.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tools/prow-job-executor/prowtoken.go | Adds retry/backoff logic and retryability classification for Key Vault prow-token lookup. |
| tools/prow-job-executor/prowtoken_test.go | Adds unit tests covering retry classification and retry/exhaust/fail-fast scenarios. |
| tools/prow-job-executor/go.mod | Promotes github.com/Azure/azure-sdk-for-go/sdk/azcore to a direct requirement. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- bump backoff Steps 6->7 so the schedule matches the documented ~31.5m worst-case (six sleeps: 30s,1,2,4,8,16m), fixing a Steps/comment mismatch - in retryProwTokenLookup, return a cancelled/expired parent context error as-is and only wrap with "after retries" when retries were genuinely exhausted, so deadline/cancel errors are no longer masked - treat context.Canceled/DeadlineExceeded as non-retryable in isRetryableKeyVaultError to fail fast instead of looping - add tests for context-error classification and fail-fast-on-cancel
…on permanent KV errors Three call sites (Gangway SubmitJob, GetJobStatus, Key Vault prow-token lookup) each reimplemented the same wait.Backoff retry loop. Extract a single generic helper retry.WithValue[T] and route all three through it. Also narrow the Key Vault classifier (review follow-up): deterministic local failures (unparseable Key Vault URI, empty secret value) are now marked errPermanentKeyVaultLookup and fail fast instead of consuming the full ~31-minute transient backoff budget. IMDS/credential and transport errors remain retryable.
…lper WithValue now checks ctx.Err() at the top of the error branch so a context cancelled mid-call surfaces as-is, without a misleading "will retry" log or recording it as the last transient error (notably the GetJobStatus path, whose classifier treats context errors as retryable). Strengthen the context-cancellation tests to assert call counts: a context cancelled before the first attempt never invokes fn (0 calls), and a new test deterministically exercises mid-call cancellation (1 call, surfaces context.Canceled, not the exhausted-budget wrapper).
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
tools/prow-job-executor/prowtoken.go:175
cmdutils.GetAzureTokenCredentials()failures are local/deterministic (e.g., Azure CLI not available, invalid credential chain options) and are unlikely to be fixed by waiting. Right now they’ll be treated as retryable (non-HTTP, non-sentinel) and can burn the full ~31.5 minute backoff before failing. Consider marking this error as permanent (wrap witherrPermanentKeyVaultLookup) so misconfiguration fails fast while still retrying genuine IMDS/token-acquisition blips duringGetSecret().
// Get Azure credentials using ARO-Tools
cred, err := cmdutils.GetAzureTokenCredentials()
if err != nil {
return "", fmt.Errorf("failed to get Azure credentials: %w", err)
}
…um attempts" wait.Backoff.Steps is the total number of attempts, not retries. Align this comment with the other two backoffs in the codebase.
geoberle
approved these changes
Jun 23, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The
prowjobexecutor executecommand used in EV2 gating looks up the Prow token in Key Vault before it ever talks to Prow, and that lookup had no retry logic. When the managed-identity token acquisition hits a transient failure — most commonly anEOF/ connection reset reading the instance metadata endpoint (169.254.169.254) on the EV2 runner — the whole EV2 gating step fails immediately and the entire deployment job must be restarted from scratch.This is a separate failure point from #251 / AROSLSRE-1168, which added retries to the Gangway
SubmitJob()429 path. This failure happens earlier, during Key Vault credential acquisition.Observed in PROD (
branch-ci-Azure-ARO-HCP-main-e2e-prod-e2e-parallel,eastus2euap):az login --identitysucceeded seconds earlier in the same step, so this is a momentary IMDS/networking blip, not a misconfiguration.What changes
lookupProwTokenInKeyVaultin an exponential-backoff retry (retryProwTokenLookup), mirroring the existingSubmitJobpattern inprowjob/client.go.isRetryableKeyVaultErrorretries transient failures — credential/IMDS and network errors, plus Key Vault HTTP 429/5xx — and fails fast on permanent Key Vault 4xx (401/403/404/400/409).lookupProwTokenInKeyVaultOnce; the fetch is injectable so retry behavior is unit-tested without a live Key Vault.Testing
go test -race ./...— pass (newprowtoken_test.go: classifier 429/5xx/4xx/IMDS-EOF, retry-then-succeed, fail-fast on 403/404, exhaust on persistent transient).go tool golangci-lint run ./...— 0 issues.go vet ./...— clean.go mod tidy+go work syncclean (promotesazcoreto a direct require).Follow-ups
Azure/ARO-HCP(tooling/templatize,test) so the EV2-shippedprowjobexecutorbinary picks up the retry.Fixes AROSLSRE-1228.