fix(longhorn): node-local CI work volumes + raise manager memory (JDWLABS-22) #8
Open
jdwillmsen wants to merge 2 commits into
…LABS-22)

Two related stability fixes for the Longhorn instance-manager cascade that faulted ~11 volumes (Vault, CI runner _work) on 2026-06-06.

JDWLABS-23: add longhorn-ephemeral-local StorageClass (numberOfReplicas 1, dataLocality strict-local, reclaim Delete) and point both ARC runner-set work-volume claims at it. CI does heavy node_modules small-file I/O; the previous 3-replica longhorn-ephemeral SC replicated every write across the network, which both slowed builds (12->30 min) and made CI a casualty of any instance-manager disruption. A strict-local single replica removes the replication load and confines a CI build's blast radius to its own node. kubernetes containerMode requires a PVC, so emptyDir is not an option; this is the node-local equivalent.

JDWLABS-24: raise longhorn-manager memory request 512Mi->1Gi and limit 1Gi->2Gi. It was OOM-killed (exitCode 137) under volume churn, disrupting instance-managers.

Verified: new StorageClass passes kubectl apply --dry-run=server; all edited values files parse as valid YAML.

Refs JDWLABS-22, JDWLABS-23, JDWLABS-24

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
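For reference, a minimal sketch of what the longhorn-ephemeral-local StorageClass described above might look like. The name and the three settings come from the commit message, and driver.longhorn.io is Longhorn's CSI provisioner name. The overall layout and the absence of any other parameters are assumptions, not the manifest from this PR.

```yaml
# Sketch only: reconstructed from the commit message, not copied from the PR diff.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-ephemeral-local
provisioner: driver.longhorn.io     # Longhorn CSI driver
reclaimPolicy: Delete               # volume is deleted together with its PVC
parameters:
  numberOfReplicas: "1"             # StorageClass parameter values are strings
  dataLocality: "strict-local"      # the single replica stays on the node that mounts the volume
```

Longhorn only accepts strict-local data locality for volumes with exactly one replica, which is why the two parameters are set together.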
Ticket traceability belongs in commit messages and PR descriptions, not in code/config comments where it rots and clutters the codebase.
Summary
Stability fixes for the Longhorn instance-manager cascade (JDWLABS-22) that faulted ~11 volumes on 2026-06-06, including Vault (sealed) and the CI runner _work volume (EIO/SIGBUS, crashed apps PR #7).

Closes JDWLABS-23 and JDWLABS-24. Part of epic JDWLABS-22.
Changes
- New StorageClass longhorn-ephemeral-local: numberOfReplicas: 1, dataLocality: strict-local, reclaimPolicy: Delete. Node-local, single-replica scratch for throwaway CI work volumes.
- Both ARC runner sets (jdwlabs, dotablaze-tech): work-volume claims repointed from longhorn-ephemeral → longhorn-ephemeral-local.
- longhorn-manager memory: request 512Mi → 1Gi, limit 1Gi → 2Gi.

Why
CI does heavy node_modules small-file I/O. The old 3-replica longhorn-ephemeral SC replicated every write across the network, which slowed builds (12→30 min) and made CI a casualty of any instance-manager disruption. A strict-local single replica removes the replication load and confines a build's blast radius to its own node. The kubernetes containerMode requires a PVC, so emptyDir is not viable; this is the node-local equivalent. Separately, longhorn-manager was OOM-killed (exitCode 137) at its 1Gi limit under volume churn.

Verification
- kubectl apply --dry-run=server on the new StorageClass: accepted by Longhorn.
- platform-longhorn + jdwlabs-arc-runner-set-jdwlabs Synced/Healthy; a CI build runs green on the new SC with improved wall-time.

Scope / follow-ups (separate issues under JDWLABS-22)
timeout-minutes (JDWLABS-27), runner node isolation (JDWLABS-28), alerting (JDWLABS-29), PDBs (JDWLABS-30).

🤖 Generated with Claude Code
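As an illustration of the runner-set change listed under Changes, the fragment below sketches how a work-volume claim can be pointed at the new StorageClass. It assumes the runner sets are deployed with the upstream gha-runner-scale-set Helm chart and its containerMode.kubernetesModeWorkVolumeClaim value; the PR's actual values files are not shown here, and the storage size is a hypothetical placeholder.

```yaml
# Sketch only: assumes the upstream gha-runner-scale-set chart's values layout.
containerMode:
  type: "kubernetes"                  # this mode requires a PVC for the _work directory
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    storageClassName: "longhorn-ephemeral-local"   # previously longhorn-ephemeral
    resources:
      requests:
        storage: 10Gi                 # hypothetical size, not taken from the PR
```

Under that assumption, the same storageClassName change would apply to both runner sets (jdwlabs and dotablaze-tech).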