feat(helm): fail fast on stale Helm release locks (AROSLSRE-997) #253
Merged
geoberle merged 7 commits on Jun 23, 2026
helmdeploy previously surfaced the opaque Helm error "another operation (install/upgrade/rollback) is in progress" when a prior deployment crashed or timed out and left the latest release revision stuck in a pending state. Add a pre-deploy stale-lock check that inspects the latest release revision before the dry-run/upgrade. If the latest revision is pending-install, pending-upgrade, or pending-rollback and older than a configurable threshold (--stale-lock-threshold, default 15m), the deploy fails fast with a diagnostic error reporting the release name, namespace, pending revision, status, age, and the backing Helm secret (sh.helm.release.v1.<name>.v<rev>), plus operator remediation guidance. Genuinely in-flight operations younger than the threshold are left untouched; the check can be disabled with --stale-lock-threshold=0. Relates to AROSLSRE-997.
Pull request overview
Adds a pre-deploy safety check to tools/helm that detects Helm releases stuck in a stale pending state and fails early with an actionable error message (including the specific sh.helm.release.v1.* secret and remediation commands), instead of letting Helm fail later with “another operation … is in progress”.
Changes:
- Introduces stale pending-release detection (`pending-*` older than a configurable threshold) and a `StaleReleaseLockError` with operator-focused diagnostics.
- Adds `--stale-lock-threshold` (default `15m`, `0` disables) and wires the check into the deploy flow before the pre-deploy dry-run.
- Adds table-driven tests for stale vs. fresh pending states and validates the error message content.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tools/helm/stale_lock.go | Implements stale pending revision detection and an actionable error type/message. |
| tools/helm/stale_lock_test.go | Adds coverage for stale-lock detection and error message content. |
| tools/helm/options.go | Adds CLI flag/default for the threshold and calls the stale-lock check during deploy. |
Address review feedback: cobra's DurationVar accepts negative durations, which the `> 0` guard silently treated as "disabled", masking misconfiguration. Validate() now fails fast when --stale-lock-threshold is negative (use 0 to disable explicitly). Also align the stale-lock tests on DefaultStaleLockThreshold instead of a hard-coded value, and add validation coverage for negative/zero/positive thresholds. Relates to AROSLSRE-997.
geoberle reviewed on Jun 22, 2026
Address review nit: keep checkForStaleReleaseLock focused by extracting the release age calculation into releaseAge() and the backing-secret name into releaseSecretName(), each with a single responsibility. Relates to AROSLSRE-997.
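For illustration, the secret-name helper might look like the sketch below. The naming pattern `sh.helm.release.v1.<name>.v<rev>` comes from the PR description; the signature shown here is an assumption and may differ from the real `releaseSecretName()`.

```go
package main

import "fmt"

// releaseSecretName builds the name of the Kubernetes secret that backs a
// given release revision, following the sh.helm.release.v1.<name>.v<rev>
// pattern. The parameter list is assumed for this sketch.
func releaseSecretName(name string, revision int) string {
	return fmt.Sprintf("sh.helm.release.v1.%s.v%d", name, revision)
}
```

For a release named `my-app` at revision 7, this yields `sh.helm.release.v1.my-app.v7`.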
geoberle approved these changes on Jun 23, 2026
Add an explicit pending-rollback case to the stale-lock table tests so rollback handling is guarded against regressions alongside pending-install and pending-upgrade. Relates to AROSLSRE-997.
Reword checkForStaleReleaseLock doc to reference Info.LastDeployed explicitly instead of "last deployed", avoiding any implication that a pending revision was successfully deployed. Relates to AROSLSRE-997.
geoberle approved these changes on Jun 23, 2026
A pending revision that was never successfully deployed (notably pending-install) can carry a zero Info.LastDeployed timestamp. time.Since on that yields a near-infinite age, which would falsely flag a genuinely in-flight operation as a stale lock and fail fast. Guard releaseAge against a zero LastDeployed (return 0, i.e. fresh) and add a table case asserting a pending revision with an unset timestamp does not trigger. Relates to AROSLSRE-997.
Relates to AROSLSRE-997.
Problem
When a previous `helmdeploy` run crashes or times out, it can leave the latest Helm release revision stuck in a pending state. The next deploy then fails with Helm's opaque error and no actionable context: "another operation (install/upgrade/rollback) is in progress". Operators have to manually figure out which release is locked and which `sh.helm.release.v1.*` secret to clean up.
Goal
Detect a stale release lock before deploying and fail fast with diagnostics that name the stuck release and tell the operator exactly how to recover — without breaking genuinely in-flight operations.
What changes
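- Before the dry-run/upgrade, inspect the latest release revision. If it is `pending-install`, `pending-upgrade`, or `pending-rollback` and it is older than a configurable threshold, fail fast with a `StaleReleaseLockError`.
- The error reports the release name, namespace, pending revision, status, age, and the backing Helm secret (`sh.helm.release.v1.<name>.v<rev>`), plus copy-paste remediation (`kubectl get/delete secret`).
- New `--stale-lock-threshold` flag (default `15m`). A pending op younger than the threshold is treated as active and left alone. Set `--stale-lock-threshold=0` to disable the check.
Example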
Validation
- `go build ./...`, `go vet ./...`, `gofmt -l` clean in `tools/helm`.
- `go test ./...` passes, including new table-driven coverage for: stale `pending-upgrade`/`pending-install` (fails fast), normal `deployed` history (no false positive), fresh pending op within threshold (no false positive), empty history, and "only latest revision considered".
Follow-ups
None.