fix(longhorn): node-local CI work volumes + raise manager memory (JDWLABS-22) #8

Open
jdwillmsen wants to merge 2 commits into main from fix/JDWLABS-22-ci-stability

Summary

Stability fixes for the Longhorn instance-manager cascade (JDWLABS-22) that faulted ~11 volumes on 2026-06-06, including Vault (sealed) and the CI runner _work volume (EIO/SIGBUS, crashed apps PR #7).

Closes JDWLABS-23 and JDWLABS-24. Part of epic JDWLABS-22.

Changes

  • New StorageClass longhorn-ephemeral-local: numberOfReplicas: 1, dataLocality: strict-local, reclaimPolicy: Delete. Node-local, single-replica scratch for throwaway CI work volumes.
  • Repoint both ARC runner sets (jdwlabs, dotablaze-tech) work-volume claims from longhorn-ephemeral to longhorn-ephemeral-local.
  • longhorn-manager memory request 512Mi → 1Gi, limit 1Gi → 2Gi.
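
As a rough sketch of the first and third changes above (not the actual diff): the StorageClass name and its three settings come from this PR, and driver.longhorn.io is Longhorn's CSI provisioner. Whether the repo defines this as a raw manifest or renders it from Helm values, and where the manager's resources block sits in the values file, are assumptions.

```yaml
# Sketch only: the repo may generate this from Helm values rather than a raw manifest.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-ephemeral-local
provisioner: driver.longhorn.io
reclaimPolicy: Delete
parameters:
  # StorageClass parameters are strings, so the replica count is quoted.
  numberOfReplicas: "1"
  dataLocality: "strict-local"
---
# longhorn-manager container memory after this PR.
# Assumption: the enclosing key path in the values file is not shown here.
# Only memory is listed because the PR does not mention CPU.
resources:
  requests:
    memory: 1Gi   # was 512Mi
  limits:
    memory: 2Gi   # was 1Gi
```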

Why

CI does heavy node_modules small-file I/O. The old 3-replica longhorn-ephemeral SC replicated every write across the network, which slowed builds (12→30 min) and made CI a casualty of any instance-manager disruption. A strict-local single replica removes the replication load and confines a build's blast radius to its own node. kubernetes containerMode requires a PVC, so emptyDir is not viable; this is the node-local equivalent. Separately, longhorn-manager was OOM-killed (exitCode 137) at its 1Gi limit under volume churn.
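
For context on the PVC requirement, here is a minimal sketch of how a runner set's work volume is typically declared, assuming the runner sets are deployed with the gha-runner-scale-set Helm chart. Only the storageClassName comes from this PR; the access mode and size are illustrative assumptions, not values from the diff.

```yaml
# Hypothetical runner-set values excerpt.
containerMode:
  type: "kubernetes"
  # In kubernetes mode each runner pod gets its _work directory from this claim.
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]   # assumption
    storageClassName: "longhorn-ephemeral-local"
    resources:
      requests:
        storage: 10Gi                # illustrative size, not from the PR
```

With dataLocality set to strict-local, the single replica lives on the node where the runner pod is scheduled, so the work volume's reads and writes stay off the network.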

Verification

  • ✅ kubectl apply --dry-run=server on the new StorageClass: accepted by Longhorn.
  • ✅ All edited values files parse as valid YAML.
  • ⬜ Post-merge: ArgoCD platform-longhorn + jdwlabs-arc-runner-set-jdwlabs Synced/Healthy; a CI build runs green on the new SC with improved wall-time.

Scope / follow-ups (separate issues under JDWLABS-22)

  • Vault HA (JDWLABS-25), cluster memory pressure (JDWLABS-26), CI timeout-minutes (JDWLABS-27), runner node isolation (JDWLABS-28), alerting (JDWLABS-29), PDBs (JDWLABS-30).

🤖 Generated with Claude Code

jdwillmsen and others added 2 commits June 6, 2026 15:13
…LABS-22)

Two related stability fixes for the Longhorn instance-manager cascade
that faulted ~11 volumes (Vault, CI runner _work) on 2026-06-06.

JDWLABS-23: add longhorn-ephemeral-local StorageClass (numberOfReplicas 1,
dataLocality strict-local, reclaim Delete) and point both ARC runner-set
work-volume claims at it. CI does heavy node_modules small-file I/O; the
previous 3-replica longhorn-ephemeral SC replicated every write across the
network, which both slowed builds (12->30 min) and made CI a casualty of
any instance-manager disruption. A strict-local single replica removes the
replication load and confines a CI build's blast radius to its own node.
kubernetes containerMode requires a PVC, so emptyDir is not an option; this
is the node-local equivalent.

JDWLABS-24: raise longhorn-manager memory request 512Mi->1Gi and limit
1Gi->2Gi. It was OOM-killed (exitCode 137) under volume churn, disrupting
instance-managers.

Verified: new StorageClass passes kubectl apply --dry-run=server; all
edited values files parse as valid YAML.

Refs JDWLABS-22, JDWLABS-23, JDWLABS-24

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Ticket traceability belongs in commit messages and PR descriptions, not
in code/config comments where it rots and clutters the codebase.