Skip to content

fix(core): make migration Job hooks stable across Helm + ArgoCD#358

Open
Dav-14 wants to merge 1 commit into
mainfrom
fix/job-hooks-and-migration-sa
Open

fix(core): make migration Job hooks stable across Helm + ArgoCD#358
Dav-14 wants to merge 1 commit into
mainfrom
fix/job-hooks-and-migration-sa

Conversation

@Dav-14
Copy link
Copy Markdown
Contributor

@Dav-14 Dav-14 commented May 6, 2026

Summary

core.job.annotations has hit three real-world failures across Formance projects (membership, portal, console-v3 — and any chart that consumes it):

  1. Identity flip on first upgrade. Conditional {{- if and (not .Release.IsInstall) .Values.feature.migrationHooks }} meant the Job was a plain release resource on helm install and a pre-upgrade hook on the next helm upgrade. Helm tracks hooks outside the release manifest, so the resource flips identity. First upgrade after enabling the feature fails with Job already exists; helm uninstall later leaves orphans.
  2. hook-succeeded swallowed logs, leaving kubectl logs empty post-migration when operators try to investigate failures.
  3. ArgoCD's helm template mode ignored Helm hook annotations entirely, so under Argo the Job re-applied on every sync as a plain resource and tripped Job-spec immutability on the second sync.

Changes

  • Rewrite core.job.annotations to always emit hook annotations covering Helm install + Helm upgrade + ArgoCD sync:
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-delete-policy: before-hook-creation
    helm.sh/hook-weight: "10"
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
    argocd.argoproj.io/sync-wave: "10"
  • Restore core.postgres.job.sa.annotations from a deprecated alias to a real helper. SA now uses hook-weight: "0" / sync-wave: "0" so it's created BEFORE the Job that references it (previously both were weight 10 → SA could be missing when the Job's pod tried to schedule).
  • feature.migrationHooks is no longer consulted. Hooks are always emitted because every other mode is broken in some way. Consumers can drop the value with no behaviour change. The old alias core.postgres.job.annotations is kept (still deprecated) for backward compat.
  • Bump core to 1.6.0 (minor: helper semantics change, no breaking API).

What this lets consumer charts do

  • Stop using feature.migrationHooks.
  • Get a clean pre-install,pre-upgrade lifecycle on plain Helm AND a clean PreSync lifecycle on ArgoCD.
  • Use a dedicated migration ServiceAccount via core.postgres.job.serviceAccountName + core.postgres.job.sa.annotations without the SA-not-yet-existing race.

Test plan

  • helm install + helm upgrade cycle: Job is a hook on both, identity stable, no orphans.
  • ArgoCD sync: Job runs as PreSync, recreated each sync via BeforeHookCreation.
  • Failed migration: Job pod stays around for kubectl logs.
  • Successful migration: Job cleaned up by ttlSecondsAfterFinished, not by Helm hook policy.
  • Consumer chart (terraform-hcp-proxy) updated to use the new helpers — see follow-up PR.

🤖 Generated with Claude Code

`core.job.annotations` had two real-world failure modes operators have
hit across Formance projects (membership, portal, console-v3, and
inherited by every chart that uses the helper):

1. **Identity flip on first upgrade.** The conditional
   `{{- if and (not .Release.IsInstall) .Values.feature.migrationHooks }}`
   meant the Job was rendered as a plain release resource on
   `helm install`, but as a `pre-upgrade` hook on the next
   `helm upgrade`. Helm tracks hooks outside the release manifest, so
   the same `Job/<release>-migrate` flips identity:
     - Install: tracked in release, owned by Helm release.
     - Upgrade: emitted as a hook, NOT in release manifest, orphaned.
   First post-feature-enable upgrade fails with `Job already exists`
   (install-time Job is still there, hook can't claim the same name);
   `helm uninstall` later leaves an orphan because Helm only deletes
   resources it tracks in the latest release.

2. **`hook-succeeded` swallows logs.** The previous delete policy
   `before-hook-creation,hook-succeeded,hook-failed` deleted the
   Job (and its pod) immediately on success, leaving `kubectl logs`
   empty when an operator goes to investigate a slow migration.

3. **ArgoCD's `helm template` ignored hook annotations entirely**, so
   under Argo the Job re-applied on every sync as a plain resource and
   tripped Job-spec immutability on the second sync.

This commit replaces the helper with always-emit hook annotations
covering all three orchestration paths:

- `helm.sh/hook: pre-install,pre-upgrade` — stable identity on both
  install and upgrade; no flip.
- `helm.sh/hook-delete-policy: before-hook-creation` — handles spec
  changes (image bumps) by deleting the previous Job before
  recreating. `hook-succeeded` removed; cleanup is bounded by the
  Job's `ttlSecondsAfterFinished` instead.
- `argocd.argoproj.io/hook: PreSync` +
  `argocd.argoproj.io/hook-delete-policy: BeforeHookCreation` +
  `argocd.argoproj.io/sync-wave: "10"` — same intent under Argo.

Also fixes `core.postgres.job.sa.annotations` (previously
deprecated and aliased to `core.job.annotations`, so the SA inherited
the same `hook-weight: "10"` as the Job that referenced it — meaning
the SA had no guarantee of existing when the Job's pod tried to
schedule). Now emits the same hook block but at weight `0` /
sync-wave `0` so the SA is always created first.

`feature.migrationHooks` is no longer consulted; consumers can
drop it from their values without behaviour change. Kept in
deprecated state with the old alias `core.postgres.job.annotations`
for backward-compat.

Bump core to 1.6.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Dav-14 Dav-14 requested a review from a team as a code owner May 6, 2026 15:12
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 6, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 281fa185-f9a0-437c-abbc-b16f8ecc11f1

📥 Commits

Reviewing files that changed from the base of the PR and between a3f9e1e and 0f394bb.

⛔ Files ignored due to path filters (1)
  • charts/core/Chart.yaml is excluded by !**/*.yaml
📒 Files selected for processing (2)
  • charts/core/templates/_job.tpl
  • charts/core/templates/_postgres.tpl

Walkthrough

The Helm Job template annotation strategy is consolidated to unconditionally apply combined Helm and ArgoCD hook annotations (pre-install, pre-upgrade, PreSync) with synchronized policies. The Postgres Job template is updated to delegate to the refactored base template and add explicit service-account-specific hook metadata.

Changes

Hook Annotation Consolidation

Layer / File(s) Summary
Base Template & Documentation
charts/core/templates/_job.tpl
Job annotation docstring and core.job.annotations block replaced; conditional pre-upgrade gating removed in favor of unconditional, combined Helm (pre-install, pre-upgrade) and ArgoCD (PreSync) hooks with unified hook-delete-policy and sync-wave.
Dependent Template Updates
charts/core/templates/_postgres.tpl
core.postgres.job.annotations redefined with deprecation note delegating to core.job.annotations; new core.postgres.job.sa.annotations added with explicit Helm/ArgoCD hook metadata (pre-install, pre-upgrade, hook-weight, PreSync, BeforeHookCreation) for service-account migrations.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

A rabbit hops through Helm charts with glee, 🐰
No more conditions to gate what should be,
Hooks sync together, Argo and Helm as one,
Migrations run smooth—the refactoring's done!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the primary change: making migration Job hooks stable across Helm and ArgoCD deployment systems.
Description check ✅ Passed The description comprehensively documents the three real-world failures being fixed, the specific changes made, and the test plan for verification.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/job-hooks-and-migration-sa

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions Bot added bug Something isn't working release labels May 6, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working release

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant