feat(vault): 3-node raft HA on single-replica storage (JDWLABS-25) [DRAFT — window-gated] by jdwillmsen · Pull Request #10 · jdwlabs/platform

jdwillmsen · 2026-06-07T02:36:18Z

⚠️ DRAFT — DO NOT MERGE until a maintenance window

Merging triggers ArgoCD to redeploy Vault as 3-node raft, which re-inits Vault and wipes current data. Execute only via the JDWLABS-25 cutover runbook (pre-flight seed-value check → phase 3 init → phase 4 re-seed). Design + runbook live on JDWLABS-25.

Summary

Removes Vault as a single point of failure. On 2026-06-06 a Longhorn instance-manager fault sealed the single standalone Vault, downing the cluster ClusterSecretStore. Part of epic JDWLABS-22.

Closes JDWLABS-25, JDWLABS-30 (Vault PDB is auto-managed by the chart in HA mode).

Changes

New longhorn-single StorageClass — 1 replica, best-effort, Retain. Raft replicates across the 3 pods, so one Longhorn fault takes at most one Vault node; quorum (2/3) survives. This is the SPOF removal.
vault/values.yaml — ha.raft enabled, 3 replicas, host anti-affinity, retry_join (follower auto-join), data on longhorn-single.
vault-unseal-cronjob.yaml — unseal all 3 pods by DNS name (the VIP only reaches one, leaving followers sealed).
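
For reviewers who want a picture without opening the diff, the first two changes can be sketched roughly as below. This is an illustration, not a copy of the files in this PR. It assumes that "best-effort" refers to Longhorn's dataLocality parameter, that the chart values sit under the top-level server key (they may be nested deeper if Vault is deployed through an umbrella chart), and that the pods are named platform-vault-0..2 behind a platform-vault-internal headless service (inferred from the PDB name). The listener settings are placeholders.

```yaml
# Sketch of the longhorn-single StorageClass (assumed shape, not the PR's file).
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-single
provisioner: driver.longhorn.io
reclaimPolicy: Retain            # keep the volume if the PVC is deleted
parameters:
  numberOfReplicas: "1"          # raft already replicates at the app layer
  dataLocality: "best-effort"    # assumption: what "best-effort" refers to
```

```yaml
# Sketch of the relevant vault/values.yaml keys (assumed nesting and names).
server:
  dataStorage:
    enabled: true
    storageClass: longhorn-single
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/name: vault   # assumption: chart default label
          topologyKey: kubernetes.io/hostname # one Vault pod per host
  ha:
    enabled: true
    replicas: 3
    raft:
      enabled: true
      config: |
        listener "tcp" {
          address         = "[::]:8200"
          cluster_address = "[::]:8201"
          tls_disable     = 1   # placeholder; the real listener config may differ
        }
        storage "raft" {
          path = "/vault/data"
          # Hypothetical pod DNS names; followers try each peer until one answers.
          retry_join { leader_api_addr = "http://platform-vault-0.platform-vault-internal:8200" }
          retry_join { leader_api_addr = "http://platform-vault-1.platform-vault-internal:8200" }
          retry_join { leader_api_addr = "http://platform-vault-2.platform-vault-internal:8200" }
        }
        service_registrar "kubernetes" {}
```

With one Longhorn replica per volume, losing a single volume costs one raft member out of three, so the remaining two still form a quorum.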

Verification (pre-merge, non-destructive)

✅ helm template (chart 0.30.1): renders 3-replica StatefulSet, raft config w/ 3 retry_join, longhorn-single PVCs, host anti-affinity, auto PDB platform-vault.
✅ kubectl apply --dry-run=server on SC + CronJob.
✅ All files valid YAML.
⬜ Post-cutover (in window): raft list-peers=3; kill-one-pod keeps secrets served; ExternalSecrets re-sync.

Cutover runbook

On JDWLABS-25. Pre-flight: confirm all seed values on hand (porkbun, grafana, longhorn, alertmanager, usersrole, argocd-dex, per-tenant github-app/ai-keys/discord) — re-init wipes current secrets.

🤖 Generated with Claude Code

Removes Vault as a single point of failure. On 2026-06-06 a Longhorn instance-manager fault sealed the single standalone Vault, taking down the cluster ClusterSecretStore.

- New longhorn-single StorageClass (1 replica, best-effort, Retain): raft replicates across pods at the app layer, so one Longhorn fault takes at most one Vault node and quorum (2/3) survives.
- vault values: ha.raft enabled, 3 replicas, host anti-affinity, retry_join for follower auto-join, data on longhorn-single.
- unseal cron: unseal all three pods by name (the service VIP only reaches one, leaving followers sealed).

Verified: helm template (chart 0.30.1) renders a 3-replica StatefulSet with the raft config, longhorn-single PVCs, host anti-affinity, and an auto-managed PodDisruptionBudget; SC and CronJob pass --dry-run=server; all files parse as valid YAML.

DO NOT MERGE until the maintenance window: applying this re-inits Vault (wipes current data). Follow the JDWLABS-25 cutover runbook (pre-flight seed-value check, phase 3 init, phase 4 re-seed).

Closes JDWLABS-25.
Closes JDWLABS-30 (Vault PDB auto-managed by chart).
Refs JDWLABS-22.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

jdwillmsen wants to merge 1 commit into main from feat/JDWLABS-25-vault-ha
