Skip to content

feat(monitoring): add platform stability alerts (JDWLABS-29)#9

Open
jdwillmsen wants to merge 1 commit into
mainfrom
fix/JDWLABS-29-stability-alerts
Open

feat(monitoring): add platform stability alerts (JDWLABS-29)#9
jdwillmsen wants to merge 1 commit into
mainfrom
fix/JDWLABS-29-stability-alerts

Conversation

@jdwillmsen
Copy link
Copy Markdown
Member

Summary

Proactive alerts for the Longhorn instance-manager cascade failure mode (epic JDWLABS-22) — which self-healed and stayed invisible until a CI build happened to fail and was investigated by hand.

Closes JDWLABS-29.

New PrometheusRule platform-stability-rules (ns: monitoring)

Alert Expr Severity
LonghornInstanceManagerRecreated instance-manager pod age < 10m warning
LonghornManagerOOMKilled last terminated reason = OOMKilled warning
VaultNotReady platform-vault StatefulSet ready replicas < 1 critical
NodeMemoryHigh node mem > 85% for 10m warning

LonghornVolumeFaulted already exists in longhorn-rules, so it's not duplicated here. ruleSelectorNilUsesHelmValues: false means the rule is auto-selected.

Verification

  • kubectl apply --dry-run=server: accepted.
  • ✅ All 4 expressions evaluated against live Prometheus: status=success, 0 firing (cluster healthy), and each underlying metric returns series.
  • ✅ Valid YAML.

Known gap / follow-up

vault_core_unsealed returned 0 series — Vault's vault-metrics ServiceMonitor exists but isn't producing data, so a true seal-while-running alert can't fire yet. VaultNotReady (kube-state) covers pod loss, which is how the incident actually manifested. Fixing Vault metrics scraping → add VaultSealed is noted as a follow-up.

🤖 Generated with Claude Code

Add PrometheusRule platform-stability-rules with alerts for the failure
modes behind the instance-manager cascade incident, which previously
self-healed and stayed invisible until a build happened to fail:

- LonghornInstanceManagerRecreated: instance-manager pod recreated in the
  last 10m (faults the replicas it served).
- LonghornManagerOOMKilled: longhorn-manager hit its memory limit.
- VaultNotReady: platform-vault StatefulSet has no ready replicas.
- NodeMemoryHigh: node memory above 85%.

Vault's own telemetry (vault_core_unsealed) is not currently scraped, so
a true seal-while-running alert is left as a follow-up; VaultNotReady
covers pod loss, which is how the incident manifested.

Verified: all four expressions evaluate against live Prometheus
(status=success, 0 firing) and the metrics return series; manifest
passes kubectl apply --dry-run=server.

Closes JDWLABS-29. Refs JDWLABS-22.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant