feat(helm): fail fast on stale Helm release locks (AROSLSRE-997) #253

Merged
geoberle merged 7 commits into Azure:main from raelga:raelga/aroslsre-997-helmdeploy-stale-lock
Jun 23, 2026
@raelga raelga commented Jun 19, 2026

Relates to AROSLSRE-997.

Problem

When a previous helmdeploy run crashes or times out, it can leave the latest Helm release revision stuck in a pending state. The next deploy then fails with Helm's opaque error and no actionable context:

another operation (install/upgrade/rollback) is in progress

Operators have to manually figure out which release is locked and which sh.helm.release.v1.* secret to clean up.

Goal

Detect a stale release lock before deploying and fail fast with diagnostics that name the stuck release and tell the operator exactly how to recover — without breaking genuinely in-flight operations.

What changes

  • Before the pre-deploy dry-run, inspect the latest release revision. If its status is pending-install, pending-upgrade, or pending-rollback and it is older than a configurable threshold, fail fast with a StaleReleaseLockError.
  • The error reports the release name, namespace, pending revision, status, age, and the backing Helm secret (sh.helm.release.v1.<name>.v<rev>), plus copy-paste remediation (kubectl get/delete secret).
  • New --stale-lock-threshold flag (default 15m). A pending op younger than the threshold is treated as active and left alone. Set --stale-lock-threshold=0 to disable the check.
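The detection rule in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the merged code: the `revision` type, `checkStaleLock`, `sampleHistory`, `demoNow`, and `errStaleLock` are hypothetical stand-ins for the Helm release types and the PR's `StaleReleaseLockError`, and a fixed clock is used only to keep the example deterministic. It assumes the timestamp on the latest revision is set.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// DefaultStaleLockThreshold mirrors the 15m default described in the PR.
const DefaultStaleLockThreshold = 15 * time.Minute

// revision is a hypothetical stand-in for one entry of a Helm release history.
type revision struct {
	Version      int
	Status       string    // e.g. "deployed", "pending-upgrade"
	LastDeployed time.Time // assumed to be set in this sketch
}

// errStaleLock is a placeholder for the richer error type named in the PR.
var errStaleLock = errors.New("stale release lock")

// demoNow is a fixed clock so the example output does not depend on wall time.
var demoNow = time.Date(2026, 6, 23, 12, 0, 0, 0, time.UTC)

// isPending reports whether status is one of the three pending states.
func isPending(status string) bool {
	switch status {
	case "pending-install", "pending-upgrade", "pending-rollback":
		return true
	}
	return false
}

// checkStaleLock looks only at the highest-numbered revision and returns an
// error when it is pending and older than threshold. A threshold of 0
// disables the check; an empty history is never stale.
func checkStaleLock(history []revision, threshold time.Duration, now time.Time) error {
	if threshold <= 0 || len(history) == 0 {
		return nil
	}
	latest := history[0]
	for _, r := range history[1:] {
		if r.Version > latest.Version {
			latest = r
		}
	}
	if !isPending(latest.Status) {
		return nil
	}
	if age := now.Sub(latest.LastDeployed); age > threshold {
		return fmt.Errorf("%w: revision %d is %q, age %s exceeds %s",
			errStaleLock, latest.Version, latest.Status, age, threshold)
	}
	return nil // pending but younger than the threshold: treated as in-flight
}

// sampleHistory builds a history whose latest revision has been pending for 42m.
func sampleHistory(now time.Time) []revision {
	return []revision{
		{Version: 6, Status: "deployed", LastDeployed: now.Add(-2 * time.Hour)},
		{Version: 7, Status: "pending-upgrade", LastDeployed: now.Add(-42 * time.Minute)},
	}
}

func main() {
	h := sampleHistory(demoNow)
	fmt.Println(checkStaleLock(h, DefaultStaleLockThreshold, demoNow)) // stale: 42m > 15m
	fmt.Println(checkStaleLock(h, time.Hour, demoNow))                 // within threshold
	fmt.Println(checkStaleLock(h, 0, demoNow))                         // check disabled
}
```

Passing `now` explicitly is a choice made for this sketch so the threshold comparison can be exercised without sleeping; the actual implementation may read the clock directly.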

Example

helm release "backend" in namespace "aro-hcp" is stuck in pending state
"pending-upgrade" at revision 7 (age 42m0s exceeds staleness threshold 15m0s);
a previous Helm operation most likely crashed or timed out and left a stale
release lock.
To recover, back up and delete the stale Helm release secret, then retry:
  kubectl --namespace aro-hcp get secret sh.helm.release.v1.backend.v7 -o yaml > sh.helm.release.v1.backend.v7.backup.yaml
  kubectl --namespace aro-hcp delete secret sh.helm.release.v1.backend.v7
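For illustration, a message like the one above could be produced by an error type along these lines. `StaleReleaseLockError` is the name used in this PR, but the field names, the wording assembly, and the `exampleError` helper are assumptions inferred from the example output rather than the merged definition.

```go
package main

import (
	"fmt"
	"time"
)

// StaleReleaseLockError is named in the PR; the fields below are assumptions
// inferred from the example message, not the merged struct.
type StaleReleaseLockError struct {
	ReleaseName string
	Namespace   string
	Revision    int
	Status      string
	Age         time.Duration
	Threshold   time.Duration
}

// releaseSecretName builds the name of the Secret backing a release revision,
// following the sh.helm.release.v1.<name>.v<rev> pattern quoted in the PR.
func releaseSecretName(name string, revision int) string {
	return fmt.Sprintf("sh.helm.release.v1.%s.v%d", name, revision)
}

// Error renders the diagnostics followed by copy-paste remediation commands.
// Indexed verbs (%[2]s, %[7]s) reuse the namespace and secret name.
func (e *StaleReleaseLockError) Error() string {
	secret := releaseSecretName(e.ReleaseName, e.Revision)
	return fmt.Sprintf(
		"helm release %q in namespace %q is stuck in pending state %q at revision %d "+
			"(age %s exceeds staleness threshold %s); a previous Helm operation most likely "+
			"crashed or timed out and left a stale release lock.\n"+
			"To recover, back up and delete the stale Helm release secret, then retry:\n"+
			"  kubectl --namespace %[2]s get secret %[7]s -o yaml > %[7]s.backup.yaml\n"+
			"  kubectl --namespace %[2]s delete secret %[7]s",
		e.ReleaseName, e.Namespace, e.Status, e.Revision, e.Age, e.Threshold, secret)
}

// exampleError reproduces the values shown in the example message.
func exampleError() *StaleReleaseLockError {
	return &StaleReleaseLockError{
		ReleaseName: "backend",
		Namespace:   "aro-hcp",
		Revision:    7,
		Status:      "pending-upgrade",
		Age:         42 * time.Minute,
		Threshold:   15 * time.Minute,
	}
}

func main() {
	fmt.Println(exampleError())
}
```

Keeping the secret name in one helper means the diagnostic text and both remediation commands cannot drift apart.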

Validation

  • go build ./..., go vet ./..., gofmt -l clean in tools/helm.
  • go test ./... passes, including new table-driven coverage for: stale pending-upgrade/pending-install (fails fast), normal deployed history (no false positive), fresh pending op within threshold (no false positive), empty history, and "only latest revision considered".

Follow-ups

None.

helmdeploy previously surfaced the opaque Helm error "another operation
(install/upgrade/rollback) is in progress" when a prior deployment crashed
or timed out and left the latest release revision stuck in a pending state.

Add a pre-deploy stale-lock check that inspects the latest release revision
before the dry-run/upgrade. If the latest revision is pending-install,
pending-upgrade, or pending-rollback and older than a configurable
threshold (--stale-lock-threshold, default 15m), the deploy fails fast with
a diagnostic error reporting the release name, namespace, pending revision,
status, age, and the backing Helm secret (sh.helm.release.v1.<name>.v<rev>),
plus operator remediation guidance. Genuinely in-flight operations younger
than the threshold are left untouched; the check can be disabled with
--stale-lock-threshold=0.

Relates to AROSLSRE-997.
Copilot AI review requested due to automatic review settings June 19, 2026 13:14

Copilot AI left a comment

Pull request overview

Adds a pre-deploy safety check to tools/helm that detects Helm releases stuck in a stale pending state and fails early with an actionable error message (including the specific sh.helm.release.v1.* secret and remediation commands), instead of letting Helm fail later with “another operation … is in progress”.

Changes:

  • Introduces stale pending-release detection (pending-* older than a configurable threshold) and a StaleReleaseLockError with operator-focused diagnostics.
  • Adds --stale-lock-threshold (default 15m, 0 disables) and wires the check into the deploy flow before the pre-deploy dry-run.
  • Adds table-driven tests for stale vs. fresh pending states and validates the error message content.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tools/helm/stale_lock.go | Implements stale pending revision detection and an actionable error type/message. |
| tools/helm/stale_lock_test.go | Adds coverage for stale-lock detection and error message content. |
| tools/helm/options.go | Adds CLI flag/default for the threshold and calls the stale-lock check during deploy. |

Comment thread tools/helm/options.go
Comment thread tools/helm/stale_lock_test.go Outdated
Address review feedback: cobra's DurationVar accepts negative durations,
which the `> 0` guard silently treated as "disabled", masking
misconfiguration. Validate() now fails fast when --stale-lock-threshold is
negative (use 0 to disable explicitly). Also align the stale-lock tests on
DefaultStaleLockThreshold instead of a hard-coded value, and add validation
coverage for negative/zero/positive thresholds.

Relates to AROSLSRE-997.
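The validation described in this commit might look roughly like the sketch below. It uses the standard library `flag` package as a stand-in for cobra's flag binding so that the example stays self-contained; the `options` struct, the `parse` helper, and the error wording are hypothetical.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"time"
)

// DefaultStaleLockThreshold mirrors the 15m default described in the PR.
const DefaultStaleLockThreshold = 15 * time.Minute

// options is a hypothetical subset of the tool's option struct.
type options struct {
	StaleLockThreshold time.Duration
}

// Validate rejects negative thresholds instead of silently treating them as
// "disabled"; 0 remains the explicit way to turn the check off.
func (o *options) Validate() error {
	if o.StaleLockThreshold < 0 {
		return errors.New("--stale-lock-threshold must not be negative (use 0 to disable the check)")
	}
	return nil
}

// parse binds the flag with the stdlib flag package (an assumption made for
// this sketch; the real tool uses cobra) and then validates the result.
func parse(args []string) (*options, error) {
	o := &options{}
	fs := flag.NewFlagSet("helmdeploy", flag.ContinueOnError)
	fs.DurationVar(&o.StaleLockThreshold, "stale-lock-threshold", DefaultStaleLockThreshold,
		"age after which a pending release is considered stale (0 disables the check)")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	if err := o.Validate(); err != nil {
		return nil, err
	}
	return o, nil
}

func main() {
	for _, arg := range []string{"--stale-lock-threshold=30m", "--stale-lock-threshold=0", "--stale-lock-threshold=-5m"} {
		o, err := parse([]string{arg})
		if err != nil {
			fmt.Println(arg, "->", err)
			continue
		}
		fmt.Println(arg, "->", o.StaleLockThreshold)
	}
}
```

Validating separately from the `> 0` guard keeps the two meanings apart: 0 is an explicit opt-out, while a negative value is reported as a configuration error.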

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

Comment thread tools/helm/stale_lock.go
Address review nit: keep checkForStaleReleaseLock focused by extracting the
release age calculation into releaseAge() and the backing-secret name into
releaseSecretName(), each with a single responsibility.

Relates to AROSLSRE-997.
Copilot AI review requested due to automatic review settings June 23, 2026 11:54

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comment thread tools/helm/stale_lock_test.go
Add an explicit pending-rollback case to the stale-lock table tests so
rollback handling is guarded against regressions alongside pending-install
and pending-upgrade.

Relates to AROSLSRE-997.

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comment thread tools/helm/stale_lock.go Outdated
Reword checkForStaleReleaseLock doc to reference Info.LastDeployed explicitly
instead of "last deployed", avoiding any implication that a pending revision
was successfully deployed.

Relates to AROSLSRE-997.

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment thread tools/helm/stale_lock.go
Comment thread tools/helm/stale_lock_test.go
A pending revision that was never successfully deployed (notably
pending-install) can carry a zero Info.LastDeployed timestamp. time.Since on
a zero timestamp saturates at Go's maximum Duration (roughly 290 years), which
would falsely flag a genuinely in-flight operation as a stale lock and fail
fast.

Guard releaseAge against a zero LastDeployed (return 0, i.e. fresh) and add a
table case asserting a pending revision with an unset timestamp does not
trigger.

Relates to AROSLSRE-997.
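A rough sketch of the guard described in this commit follows. `releaseAge` is the helper name used in the PR, but its signature here, the `releaseInfo` stand-in for Helm's release info, and the fixed `demoNow` clock are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// DefaultStaleLockThreshold mirrors the 15m default described in the PR.
const DefaultStaleLockThreshold = 15 * time.Minute

// releaseInfo is a hypothetical stand-in for the Helm release info that
// carries the LastDeployed timestamp.
type releaseInfo struct {
	LastDeployed time.Time
}

// demoNow is a fixed clock so the example is deterministic.
var demoNow = time.Date(2026, 6, 23, 12, 0, 0, 0, time.UTC)

// releaseAge returns how long ago the revision was last written. An unset
// (zero) timestamp is reported as age 0, i.e. fresh, so that a revision with
// no recorded time is never mistaken for a stale lock.
func releaseAge(info releaseInfo, now time.Time) time.Duration {
	if info.LastDeployed.IsZero() {
		return 0
	}
	return now.Sub(info.LastDeployed)
}

func main() {
	unset := releaseInfo{}
	stuck := releaseInfo{LastDeployed: demoNow.Add(-42 * time.Minute)}

	// Without the guard, the age of an unset timestamp exceeds any threshold.
	fmt.Println(demoNow.Sub(unset.LastDeployed) > DefaultStaleLockThreshold)

	// With the guard, the unset timestamp counts as fresh.
	fmt.Println(releaseAge(unset, demoNow) > DefaultStaleLockThreshold)
	fmt.Println(releaseAge(stuck, demoNow) > DefaultStaleLockThreshold)
}
```

Treating an unknown timestamp as fresh errs on the side of not interrupting an operation that may still be running; the trade-off is that a genuinely stuck revision with no timestamp falls through to Helm's own error.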

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

@geoberle geoberle merged commit 607a228 into Azure:main Jun 23, 2026
2 checks passed
3 participants