feat(monitoring): add platform stability alerts (JDWLABS-29) by jdwillmsen · Pull Request #9 · jdwlabs/platform

jdwillmsen · 2026-06-07T00:44:47Z

Summary

Proactive alerts for the Longhorn instance-manager cascade failure mode (epic JDWLABS-22) — which self-healed and stayed invisible until a CI build happened to fail and was investigated by hand.

Closes JDWLABS-29.

New PrometheusRule `platform-stability-rules` (ns: monitoring)

Alert	Expr	Severity
LonghornInstanceManagerRecreated	instance-manager pod age < 10m	warning
LonghornManagerOOMKilled	last terminated reason = OOMKilled	warning
VaultNotReady	platform-vault StatefulSet ready replicas < 1	critical
NodeMemoryHigh	node mem > 85% for 10m	warning

LonghornVolumeFaulted already exists in longhorn-rules, so it's not duplicated here. ruleSelectorNilUsesHelmValues: false means the rule is auto-selected.

Verification

✅ kubectl apply --dry-run=server: accepted.
✅ All 4 expressions evaluated against live Prometheus: status=success, 0 firing (cluster healthy), and each underlying metric returns series.
✅ Valid YAML.

Known gap / follow-up

vault_core_unsealed returned 0 series — Vault's vault-metrics ServiceMonitor exists but isn't producing data, so a true seal-while-running alert can't fire yet. VaultNotReady (kube-state) covers pod loss, which is how the incident actually manifested. Fixing Vault metrics scraping → add VaultSealed is noted as a follow-up.

🤖 Generated with Claude Code

Add PrometheusRule platform-stability-rules with alerts for the failure modes behind the instance-manager cascade incident, which previously self-healed and stayed invisible until a build happened to fail: - LonghornInstanceManagerRecreated: instance-manager pod recreated in the last 10m (faults the replicas it served). - LonghornManagerOOMKilled: longhorn-manager hit its memory limit. - VaultNotReady: platform-vault StatefulSet has no ready replicas. - NodeMemoryHigh: node memory above 85%. Vault's own telemetry (vault_core_unsealed) is not currently scraped, so a true seal-while-running alert is left as a follow-up; VaultNotReady covers pod loss, which is how the incident manifested. Verified: all four expressions evaluate against live Prometheus (status=success, 0 firing) and the metrics return series; manifest passes kubectl apply --dry-run=server. Closes JDWLABS-29. Refs JDWLABS-22. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(monitoring): add platform stability alerts (JDWLABS-29)#9

feat(monitoring): add platform stability alerts (JDWLABS-29)#9
jdwillmsen wants to merge 1 commit into
mainfrom
fix/JDWLABS-29-stability-alerts

jdwillmsen commented Jun 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jdwillmsen commented Jun 7, 2026

Summary

New PrometheusRule platform-stability-rules (ns: monitoring)

Verification

Known gap / follow-up

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

New PrometheusRule `platform-stability-rules` (ns: monitoring)