
feat: node recovery agent #28

Merged
hlts2 merged 71 commits into main from feat/node-recovery on Apr 23, 2026
Conversation


hlts2 commented on Apr 13, 2026

Summary

  • Add recovery state machine (Healthy → Unhealthy → WaitingReboot → Failed) with PhaseUnknown as zero value
  • Add pkg/health: HealthChecker interface with per-checker thresholds and reason reporting
    • NodeReadyChecker (5min), GPUChecker (10min), DiskPressureChecker (30min), CiliumChecker (10min)
    • GPU checker reads expected count from nvidia.com/gpu.count label vs allocatable (no static env var)
    • Cilium checker monitors NetworkUnavailable condition with CiliumIsUp reason (auto-skips non-Cilium CNI)
    • Check() returns (bool, string); the reason is used as a low-cardinality metrics label for observability (see the interface sketch after this list)
  • Add pkg/operation: Executor interface with CivoExecutor (Functional Option Pattern) and NopExecutor as a safe default
  • Add pkg/metrics: Prometheus metrics with civo_ prefix (see Metrics section)
  • Add pkg/watcher/state: NodePhase enum, NodeState (private fields + getters), StateStore with transition methods (including MarkFailed)
  • Refactor watcher: replace API polling with Node Informer, state machine reconcile loop
  • Differentiate reboot wait: standard nodes 10min, GPU nodes 40min
  • Reboot retry limit: after CIVO_NODE_MAX_REBOOT_RETRIES (default 5) attempts, transition to Failed to prevent infinite reboot loops. rebootCount only increments in active mode so monitor-only tracking is not polluted by simulated reboots.
  • Support multiple node pool IDs (comma-separated via CIVO_NODE_POOL_IDS) or empty (all nodes)
  • Graceful shutdown: HTTP metrics server shutdown + civo_node_agent_info cleanup via defer
  • Clean up per-node Prometheus metric labels when a node is removed from the cluster
  • Update Helm chart to new env vars and move non-secret config to values.yaml
  • Remove legacy code: fake.go, inline reboot/check functions, sync.Map tracking, CIVO_NODE_DESIRED_GPU_COUNT, CIVO_NODE_REBOOT_TIME_WINDOW_MINUTES
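
A minimal sketch of the pkg/health contract described above, assuming checkers receive a *corev1.Node; the exact Check signature and the example checker body are illustrative, not the PR's actual code:

```go
// Sketch of the pkg/health contract (signatures and the example body are illustrative).
package health

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// HealthChecker reports whether a node passes one health dimension.
// Check returns (healthy, reason); Threshold is how long the check may
// keep failing before the node is considered unhealthy for this checker.
type HealthChecker interface {
	Check(node *corev1.Node) (bool, string)
	Threshold() time.Duration
}

// nodeReadyChecker is an illustrative condition-based checker.
type nodeReadyChecker struct{}

func (nodeReadyChecker) Threshold() time.Duration { return 5 * time.Minute }

func (nodeReadyChecker) Check(node *corev1.Node) (bool, string) {
	for _, cond := range node.Status.Conditions {
		if cond.Type == corev1.NodeReady {
			// Pass through the condition reason as the low-cardinality result.
			return cond.Status == corev1.ConditionTrue, cond.Reason
		}
	}
	return false, "NodeReadyConditionMissing"
}
```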

Deployment target

civo-node-agent is designed to run as a daemon process on the control plane VM (preferred deployment, installed via an external script outside this repo). A Helm chart is also provided for running it as a single-replica Deployment in kube-system if needed.

Metrics

All metrics use the civo_ prefix.

| Metric | Type | Labels | Description |
|---|---|---|---|
| civo_node_agent_info | Gauge | version, cluster_id | Build and cluster identity (always 1 while running; deleted on graceful shutdown) |
| civo_node_agent_health_check_total | Counter | node, checker, result | Health check executions (result is a low-cardinality reason) |
| civo_node_agent_recovery_actions_total | Counter | node, action, mode | Recovery actions attempted (mode: monitor or active) |
| civo_node_agent_recovery_failures_total | Counter | node, action | Recovery actions that failed (e.g. Civo API errors) |
| civo_node_agent_reconcile_errors_total | Counter | reason | Errors in the reconcile loop (e.g. list_nodes) |
| civo_node_agent_node_unhealthy_duration_seconds | Gauge | node | How long a node has been continuously unhealthy |
| civo_node_agent_recovery_phase | Gauge | node, phase | Current recovery phase, one-hot across Healthy, Unhealthy, WaitingReboot, Failed (current phase = 1, others = 0) |
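
As an illustration of the one-hot phase gauge, a minimal sketch of how such a metric could be registered and updated with prometheus/client_golang; the variable and helper names are hypothetical, not the agent's actual metrics code:

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// recoveryPhase mirrors civo_node_agent_recovery_phase: for each node, the
// gauge for the current phase is 1 and every other phase is 0.
var recoveryPhase = promauto.NewGaugeVec(prometheus.GaugeOpts{
	Name: "civo_node_agent_recovery_phase",
	Help: "Current recovery phase of the node (1 for the active phase, 0 otherwise).",
}, []string{"node", "phase"})

// setPhase is a hypothetical helper showing the one-hot encoding.
func setPhase(node, current string) {
	for _, p := range []string{"Healthy", "Unhealthy", "WaitingReboot", "Failed"} {
		v := 0.0
		if p == current {
			v = 1.0
		}
		recoveryPhase.WithLabelValues(node, p).Set(v)
	}
}
```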

Environment Variables

Agent-specific settings use the CIVO_NODE_AGENT_ prefix.

| Variable | Default | Description |
|---|---|---|
| CIVO_API_KEY | | Civo API key (read from civo-api-access secret in Helm deployment) |
| CIVO_API_URL | | Civo API URL |
| CIVO_CLUSTER_ID | | Civo cluster ID (used as label in civo_node_agent_info) |
| CIVO_REGION | | Civo region |
| CIVO_NODE_POOL_IDS | empty | Comma-separated node pool IDs to watch. Empty = all nodes |
| CIVO_NODE_AGENT_MONITOR_ONLY | true | Monitor-only mode (log but don't reboot) |
| CIVO_NODE_AGENT_METRICS_PORT | 9625 | Prometheus metrics HTTP port (must be 1024-65535) |
| CIVO_NODE_REBOOT_WAIT_MINUTES | 10 | Minutes to wait after rebooting a standard node before retrying |
| CIVO_GPU_NODE_REBOOT_WAIT_MINUTES | 40 | Minutes to wait after rebooting a GPU node before retrying |
| CIVO_NODE_MAX_REBOOT_RETRIES | 5 | Maximum reboot attempts before the node transitions to Failed |
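
Per the commits below, the boolean and numeric variables are parsed leniently: empty or unparsable values fall back to the defaults above. A rough sketch of that pattern; the helper names here are hypothetical (the agent's own helpers, e.g. parseUintOrZero, may be shaped differently):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// boolEnv returns def when the variable is unset or unparsable, matching
// the "preserve the default" behaviour described for monitor-only mode.
func boolEnv(key string, def bool) bool {
	v, err := strconv.ParseBool(os.Getenv(key))
	if err != nil {
		return def
	}
	return v
}

// uintEnv does the same for numeric settings such as the retry limit.
func uintEnv(key string, def uint64) uint64 {
	v, err := strconv.ParseUint(os.Getenv(key), 10, 64)
	if err != nil {
		return def
	}
	return v
}

func main() {
	fmt.Println(boolEnv("CIVO_NODE_AGENT_MONITOR_ONLY", true))
	fmt.Println(uintEnv("CIVO_NODE_MAX_REBOOT_RETRIES", 5))
}
```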

Command Line Flags

| Flag | Default | Description |
|---|---|---|
| --kubeconfig | /etc/rancher/k3s/k3s.yaml | Path to kubeconfig file (empty for in-cluster config) |
| --version | | Print the agent version |

Helm Chart

Non-sensitive configuration moved from secrets to values.yaml. CIVO_API_KEY, CIVO_API_URL, CIVO_CLUSTER_ID, CIVO_REGION still come from the auto-provisioned civo-api-access secret.

| Value | Default | Description |
|---|---|---|
| nodePoolIDs | "" | Comma-separated node pool IDs to watch. Empty = all nodes |
| rebootWaitMinutes | 10 | Minutes to wait after rebooting a standard node |
| gpuRebootWaitMinutes | 40 | Minutes to wait after rebooting a GPU node |
| maxRebootRetries | 5 | Max reboot attempts before transitioning to Failed |
| monitorOnly | true | If true, log recovery actions without executing them |
| metricsPort | 9625 | Port for the Prometheus metrics endpoint |

The Deployment passes --kubeconfig="" so the agent uses in-cluster config.

Design Decisions

  • Monitor-only mode (default true): logs recovery actions without executing them
  • Per-checker thresholds: each checker defines its own unhealthy threshold; reboot triggers when the shortest threshold of any failed checker is exceeded
  • GPU detection via label: nvidia.com/gpu.count (static, set by GFD) is compared to allocatable["nvidia.com/gpu"] (dynamic). health.HasGPU also uses the label so GPU nodes are correctly identified even when all GPUs are unhealthy
  • Different reboot wait for standard vs GPU nodes: 10min vs 40min
  • Bounded reboot retries: after maxRebootRetries attempts the node goes to Failed (no further reboots) to prevent infinite loops. Recovery from Failed happens automatically when all health checks pass again.
  • rebootCount tracks real reboots only: in monitor-only mode the counter is not incremented, so PhaseFailed transitions apply only to actual reboot-attempt exhaustion.
  • NopExecutor as default: prevents nil pointer dereference when no executor is configured (see the executor sketch after this list)
  • UTC timestamps: nowFunc defaults to time.Now().UTC()
  • All package-boundary types accessed via interfaces: structs are private, constructors return interfaces
  • NodeState fields are private: mutations only through StateStore transition methods
  • Test-only options are unexported: withNowFunc, withNodeLister
  • Low-cardinality result label: condition-based checkers pass through cond.Reason; GPU checker uses fixed values (GPUCountMatch, GPUCountMismatch, NonGPUNode, NoAllocatableGPU)
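
A condensed sketch of the executor seam and the Functional Option Pattern referenced above; WithAPIConfig and the validation step come from the commits, while the method signatures and remaining names are illustrative:

```go
// Sketch of pkg/operation (signatures are illustrative).
package operation

import "errors"

// Executor performs recovery actions against the infrastructure provider.
type Executor interface {
	Reboot(instanceID string) error
}

// NopExecutor is the safe default: it satisfies Executor but does nothing,
// so a watcher with no configured executor never hits a nil dereference.
type NopExecutor struct{}

func (NopExecutor) Reboot(string) error { return nil }

// civoExecutor issues hard reboots through the Civo API.
type civoExecutor struct {
	clusterID, apiKey, apiURL string
}

// Option is a functional option applied by NewCivoExecutor.
type Option func(*civoExecutor)

// WithAPIConfig sets the Civo API credentials and endpoint.
func WithAPIConfig(apiKey, apiURL string) Option {
	return func(e *civoExecutor) { e.apiKey, e.apiURL = apiKey, apiURL }
}

// NewCivoExecutor validates its inputs and returns the Executor interface,
// following the "constructors return interfaces" convention above.
func NewCivoExecutor(clusterID string, opts ...Option) (Executor, error) {
	e := &civoExecutor{clusterID: clusterID}
	for _, opt := range opts {
		opt(e)
	}
	if e.clusterID == "" || e.apiKey == "" || e.apiURL == "" {
		return nil, errors.New("clusterID, apiKey, and apiURL are required")
	}
	return e, nil
}

// Reboot would call the Civo hard-reboot API for the given instance.
func (e *civoExecutor) Reboot(instanceID string) error { return nil }
```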

Not in This PR

  • Cluster-wide blast-radius protection (concurrent reboot cap, unhealthy-rate circuit breaker, PDB awareness). Required before enabling monitorOnly=false in production.
  • Standard node Drain → Replace flow (PhaseDrain / PhaseReplace defined but not wired up). Standard nodes that exceed the retry limit currently transition to Failed along with GPU nodes.
  • Per-checker RepairAction (None/Reboot/Replace)
  • Civo API context propagation — civogo library doesn't accept a context; a hung API call still blocks the reconcile tick
  • Exponential backoff / retry on Civo API errors
  • Dockerfile hardening (distroless, non-root USER, pinned base tag)
  • SecurityContext tightening (readOnlyRootFilesystem, allowPrivilegeEscalation: false, drop: [ALL])
  • /healthz / /readyz endpoints
  • Linter configuration

Test Plan

  • go test ./... passes across all packages
  • go vet ./... clean
  • go fmt ./... applied
  • State machine transitions covered: healthy, unhealthy detection, reboot trigger (active/monitor), recovery, reboot retry, GPU mismatch, error handling, stale cleanup
  • Coverage: health ~98%, operation ~81%, watcher ~80%, total ~76%
  • Verified on staging CP VM: /metrics endpoint returns Prometheus metrics with civo_ prefix
  • Verified NotReady → Reboot → Ready cycle produces expected civo_node_agent_recovery_actions_total and civo_node_agent_recovery_phase transitions
  • Verify CIVO_NODE_AGENT_MONITOR_ONLY=true logs but does not reboot
  • Verify CIVO_NODE_AGENT_MONITOR_ONLY=false triggers actual HardRebootInstance API call
  • Verify retry limit → Failed phase after CIVO_NODE_MAX_REBOOT_RETRIES failed attempts

hlts2 changed the title from "feat: node recovery state machine MVP" to "feat: node recovery state machine" on Apr 13, 2026
hlts2 and others added 14 commits April 14, 2026 03:52
… and metrics

Introduce a proper recovery state machine (Healthy → Unhealthy → WaitingReboot)
with pluggable health checkers, Prometheus metrics, and monitor-only mode support.

- Add pkg/health: HealthChecker interface with NodeReady and GPU checkers
- Add pkg/operation: Executor interface with CivoExecutor (FOP-based)
- Add pkg/metrics: Prometheus metrics (health checks, recovery actions, phase, duration)
- Add pkg/watcher/state: NodePhase enum, NodeState (private fields), StateStore
- Refactor watcher: replace polling with Node Informer, state machine reconcile loop
- Remove legacy code: fake.go, inline reboot/check functions, sync.Map tracking
- Update main.go: new env vars (CIVO_NODE_MONITOR_ONLY, CIVO_NODE_UNHEALTHY_THRESHOLD_MINUTES, CIVO_NODE_METRICS_PORT)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ation

- Rename reboot.go to civo.go to better reflect the Civo executor scope
- Extract FOP (Option, WithAPIConfig, WithClient) into options.go
- Add validation for clusterID, apiKey, and apiURL before Civo client creation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hlts2 <hiroto.funakoshi.hiroto@gmail.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hlts2 <hiroto.funakoshi.hiroto@gmail.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hlts2 <hiroto.funakoshi.hiroto@gmail.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add Threshold() to HealthChecker interface for per-checker thresholds
- NodeReady: 10min, GPU: 10min, DiskPressure: 30min
- Remove single unhealthyThreshold from watcher; use min threshold of failed checkers
- Remove WithUnhealthyThresholdMinutes option and CIVO_NODE_UNHEALTHY_THRESHOLD_MINUTES env var
- Refactor main.go: parseUintOrZero, defaultMetricsPort as int

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ssages in NewWatcher

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…config flag

- Remove nodeDesiredGPUCount field, WithDesiredGPUCount option (GPU count is owned by checker)
- Remove NewGPUChecker public constructor (only used internally by NewDefaultCheckers)
- Add --kubeconfig flag with default /etc/rancher/k3s/k3s.yaml for CP VM

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Change NewWatcher signature: nodePoolID string → nodePoolIDs []string
- Add buildNodeSelector: empty=all nodes, single=MatchLabels, multiple=In operator
- Add parseNodePoolIDs in main.go for comma-separated CIVO_NODE_POOL_ID

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
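
A sketch of the buildNodeSelector behaviour the commit above describes, assuming metav1.LabelSelector from k8s.io/apimachinery; the node pool label key is a hypothetical placeholder, not taken from this PR:

```go
package watcher

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

const poolLabelKey = "kubernetes.civo.com/node-pool-id" // hypothetical label key

func buildNodeSelector(nodePoolIDs []string) *metav1.LabelSelector {
	switch len(nodePoolIDs) {
	case 0:
		// Empty: watch all nodes.
		return &metav1.LabelSelector{}
	case 1:
		// Single pool: a simple equality match.
		return &metav1.LabelSelector{
			MatchLabels: map[string]string{poolLabelKey: nodePoolIDs[0]},
		}
	default:
		// Multiple pools: an In expression over all IDs.
		return &metav1.LabelSelector{
			MatchExpressions: []metav1.LabelSelectorRequirement{{
				Key:      poolLabelKey,
				Operator: metav1.LabelSelectorOpIn,
				Values:   nodePoolIDs,
			}},
		}
	}
}
```
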
… var

- GPU checker now compares nvidia.com/gpu.count label (expected) vs allocatable (actual)
- Auto-skips non-GPU nodes when label is absent
- Remove CIVO_NODE_DESIRED_GPU_COUNT env var and NewDefaultCheckers parameter
- NewDefaultCheckers() always includes GPU checker (self-determines GPU node)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
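
A rough sketch of the comparison the commit above describes; the label and resource names come from the PR, the reason strings follow the fixed values listed in the description, and the function shape plus the invalid-label case are assumptions:

```go
package health

import (
	"strconv"

	corev1 "k8s.io/api/core/v1"
)

const gpuCountLabel = "nvidia.com/gpu.count"

// checkGPUCount compares the expected GPU count (GFD label) against the
// node's allocatable GPUs. Non-GPU nodes (no label) are skipped.
func checkGPUCount(node *corev1.Node) (bool, string) {
	raw, ok := node.Labels[gpuCountLabel]
	if !ok {
		return true, "NonGPUNode" // label absent: not a GPU node, skip
	}
	expected, err := strconv.ParseInt(raw, 10, 64)
	if err != nil {
		return false, "InvalidGPUCountLabel" // assumed edge case
	}
	qty, ok := node.Status.Allocatable[corev1.ResourceName("nvidia.com/gpu")]
	if !ok || qty.IsZero() {
		return false, "NoAllocatableGPU"
	}
	if qty.Value() != expected {
		return false, "GPUCountMismatch"
	}
	return true, "GPUCountMatch"
}
```
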
…tion

- NewWatcher signature: (ctx, clusterID, nodePoolIDs, opts...) → (ctx, clusterID, opts...)
- Add WithNodePoolIDs option with append semantics (empty is no-op)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
clusterID was only used by the old rebootNode() which moved to the executor.
NewWatcher signature simplified: (ctx, clusterID, opts...) → (ctx, opts...)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The label is static (set by GFD) and correctly identifies GPU nodes
even when all GPUs are unhealthy and allocatable drops to 0.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ttings

- CIVO_NODE_MONITOR_ONLY → CIVO_NODE_AGENT_MONITOR_ONLY
- CIVO_NODE_METRICS_PORT → CIVO_NODE_AGENT_METRICS_PORT

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
hlts2 force-pushed the feat/node-recovery branch from d8bc10b to 3204673 on April 13, 2026 at 18:55
hlts2 and others added 5 commits April 14, 2026 04:20
- Standard nodes: reboot wait 10min (CIVO_NODE_REBOOT_WAIT_MINUTES)
- GPU nodes: reboot wait 40min (CIVO_GPU_NODE_REBOOT_WAIT_MINUTES)
- NodeReady threshold changed from 10min to 5min per TDD
- Replace single rebootTimeWindowMinutes with rebootWaitMinutes + gpuRebootWaitMinutes
- WaitingReboot phase checks state.IsGPUNode() to select appropriate wait time

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
WithNodePoolIDs now accepts a comma-separated string and parses internally.
Remove parseNodePoolIDs from main.go.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Standard nodes should transition to PhaseDrain → PhaseReplace
instead of retrying reboot indefinitely. GPU nodes keep reboot-only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
WithMonitorOnly now accepts a string and parses internally via strconv.ParseBool.
Empty or unparsable values preserve the default (true).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
hlts2 changed the title from "feat: node recovery state machine" to "feat: node recovery state machine MVP" on Apr 13, 2026
hlts2 and others added 6 commits April 14, 2026 10:30
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Monitors CiliumAgentIsReady node condition. Auto-skips nodes where
the condition is absent (Cilium not installed). Threshold: 10min.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…um checker

Cilium sets NetworkUnavailable=False with reason CiliumIsUp, not a custom
CiliumAgentIsReady condition. Skip if reason is not CiliumIsUp (other CNI).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
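
A minimal sketch of the condition check the commit above describes; the condition type, the CiliumIsUp reason, and the skip behaviour come from the commit, while the function shape is illustrative:

```go
package health

import corev1 "k8s.io/api/core/v1"

const ciliumReadyReason = "CiliumIsUp"

// checkCilium is healthy when NetworkUnavailable=False with reason CiliumIsUp;
// nodes where another CNI owns the condition are skipped.
func checkCilium(node *corev1.Node) (bool, string) {
	for _, cond := range node.Status.Conditions {
		if cond.Type != corev1.NodeNetworkUnavailable {
			continue
		}
		if cond.Reason != ciliumReadyReason {
			return true, cond.Reason // not Cilium: skip this checker
		}
		return cond.Status == corev1.ConditionFalse, cond.Reason
	}
	return true, "NetworkUnavailableConditionMissing"
}
```
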
Check now returns (bool, string) where the string is the reason.
Condition-based checkers pass through cond.Reason directly.
GPU checker returns a descriptive reason (e.g. "expected 8 but got 7").
The reason is used as the result label in HealthCheckTotal metric.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GPU node detection belongs in the health package where gpuCountLabel
is defined. Removes strconv and corev1 imports from watcher.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
hlts2 changed the title from "feat: node recovery state machine MVP" to "feat: node recovery agent" on Apr 14, 2026
…odeSelector

- Health: Threshold tests for all checkers, HasGPU tests
- Operation: NewCivoExecutor validation tests (empty clusterID, apiKey, apiURL)
- Watcher: buildNodeSelector tests (nil, single, multiple)
- Coverage: health 84→98%, operation 50→81%, total 71→76%

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
hlts2 and others added 8 commits April 20, 2026 15:36
- civo_node_agent_health_check_total
- civo_node_agent_recovery_actions_total
- civo_node_agent_node_unhealthy_duration_seconds
- civo_node_agent_recovery_phase

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…plugin instructions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tracks failed recovery actions (e.g. Civo API errors during reboot),
separating failures from successful attempts in recovery_actions_total.
Cleaned up along with other per-node metrics on node removal.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Standard Prometheus info metric pattern for tracking deployments and
distinguishing clusters. Use with group_left to enrich other metrics.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Ensures civo_node_agent_info is removed when the agent stops, signaling
to Prometheus that this version/cluster instance is no longer active.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tracks reconcile loop errors (e.g. node list failures from the API server)
to distinguish "agent alive but cannot do its job" from "agent down".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Emit "Watcher reconcile loop started" once before the loop, not on every tick.
The per-tick log was misleading (implied startup) and added noise.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
hlts2 requested review from jarededwards and jokestax on April 21, 2026 at 05:37
hlts2 and others added 16 commits April 21, 2026 17:52
In monitor-only mode no reboot actually happens, so logging per-tick
"waiting for reboot effect" is misleading and noisy (6 logs/min per
unhealthy node). "Reboot retry" still fires every rebootWait cycle as a
liveness signal.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previous version described the pre-refactor implementation (sync.Map, single
time window, CIVO_NODE_DESIRED_GPU_COUNT etc.). Updated to reflect:
- State machine (pkg/watcher/state.go)
- Health checker package and per-checker thresholds
- Executor/NopExecutor abstraction
- Current env var names (CIVO_NODE_POOL_IDS, reboot wait, monitor-only)
- Prometheus metrics
- Known limitations

Also simplify the "Waiting for reboot effect" comment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Matches the pattern used by WithRebootWaitMinutes and WithGPURebootWaitMinutes:
log an Info message when the input cannot be parsed, so misconfigurations
are visible at startup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove sections that duplicate what's in the code (env vars, flags,
  metrics, reconcile loop detail, dependencies, etc.). Readers can find
  these by reading the source.
- Clarify that the recommended deployment is a daemon process on the
  control plane VM, not a Kubernetes Pod. The Helm chart is available
  but secondary.
- Keep: project overview, build/test commands, package-level architecture,
  design conventions, known limitations (important for AI agents to avoid
  premature enabling of active reboots), release process.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove Known Limitations and Design Conventions sections — these belong in
issue tracker / PR descriptions, not in the agent guide. Soften the
deployment description: daemon on CP VM is preferred, Helm Deployment is
also supported.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After maxRebootRetries reboots fail to recover a node, the state machine
transitions to PhaseFailed and stops retrying. The node remains in Failed
until it naturally recovers (all health checkers pass), at which point the
existing recovery path resets it to Healthy.

- Add PhaseFailed to NodePhase enum and StateStore.MarkFailed()
- MarkWaitingReboot now takes a countReboot bool; monitor-only mode passes
  false so rebootCount tracks only real reboots issued to the Civo API
- Add WithMaxRebootRetries option (default 5) and CIVO_NODE_MAX_REBOOT_RETRIES env var
- Update Helm chart (maxRebootRetries in values.yaml and deployment env)
- Update README with the new value

Applied to both standard and GPU nodes to avoid infinite reboot loops.
A future PR can wire up PhaseDrain → PhaseReplace for standard nodes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
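
For reference, a sketch of the phase enum as described across the PR; the iota ordering and the String implementation are illustrative, and PhaseDrain/PhaseReplace are defined but not wired up in this PR:

```go
package state

// NodePhase is the recovery state of a node; PhaseUnknown is the zero value.
type NodePhase int

const (
	PhaseUnknown NodePhase = iota
	PhaseHealthy
	PhaseUnhealthy
	PhaseWaitingReboot
	PhaseFailed
	PhaseDrain   // defined but not wired up in this PR
	PhaseReplace // defined but not wired up in this PR
)

func (p NodePhase) String() string {
	switch p {
	case PhaseHealthy:
		return "Healthy"
	case PhaseUnhealthy:
		return "Unhealthy"
	case PhaseWaitingReboot:
		return "WaitingReboot"
	case PhaseFailed:
		return "Failed"
	case PhaseDrain:
		return "Drain"
	case PhaseReplace:
		return "Replace"
	default:
		return "Unknown"
	}
}
```
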
- Include Failed in the state machine summary
- Drop interface / type-level details from the Packages section; they can
  be read from the code directly

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- TestNodePhaseString: add PhaseFailed
- TestStateStoreMarkFailed / _Nonexistent: verify the transition
- TestRun_RebootRetryLimitExceeded_TransitionsToFailed: verify active-mode
  flow from Unhealthy → 3 reboots → WaitingReboot → Failed (no further reboots)
- TestRun_MonitorOnlyDoesNotIncrementRebootCount: rebootCount stays 0 and
  Failed is never reached in monitor-only mode
- TestRun_RecoverFromFailed: Failed → Healthy via Reset on successful checks

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- RBAC: add roleRef.apiGroup to ClusterRoleBinding (worked via apiserver
  defaulting but should be explicit)
- time.Duration: multiply by time.Minute at option-set time so the field
  truly holds a duration (was nanoseconds)
- metrics server: use errors.Is for http.ErrServerClosed and trigger the
  signal-based stop() when ListenAndServe fails unexpectedly, so the
  agent shuts down cleanly instead of running without observability
- watcher: drop the unused named return on setupKubernetesClient
- health.HealthChecker.Threshold godoc: document zero value as
  "trigger immediately on failure"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
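
The metrics-server fix above follows the standard net/http shutdown pattern; a minimal sketch, assuming a signal-based stop via signal.NotifyContext and slog logging (both assumptions about the agent's main.go):

```go
package main

import (
	"context"
	"errors"
	"log/slog"
	"net/http"
	"os/signal"
	"syscall"
	"time"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	srv := &http.Server{Addr: ":9625", Handler: promhttp.Handler()}
	go func() {
		// ErrServerClosed is the expected result of a graceful Shutdown; any
		// other error means the agent would keep running without observability,
		// so trigger the same signal-based stop path instead.
		if err := srv.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
			slog.Error("metrics server failed", "error", err)
			stop()
		}
	}()

	<-ctx.Done() // wait for a signal or a metrics-server failure

	shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	_ = srv.Shutdown(shutdownCtx)
}
```
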
- NodeState: add sync.RWMutex; getters RLock; mutators take
  Get() then NodeState.mu.Lock() so the Store lock is released
  before the per-state lock
- StateStore.Range: iterate over a snapshot taken under the read
  lock so fn can safely call mutators without self-deadlocking
- Set RecoveryPhase{phase=Healthy}=1 every tick while healthy,
  not only on transition, so always-healthy nodes appear in the
  metric
- Call MarkWaitingReboot before the "Reboot retry" log and drop
  the stale "+1" in the rebootCount log; monitor mode now reports
  the actual (unincremented) count
- Drop leftover rebootWait*time.Minute in the
  waiting-for-reboot-effect log
- Remove unused ctx parameter from NewWatcher

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
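
A condensed sketch of the Range-over-a-snapshot pattern the commit above describes; the store internals and field names are illustrative:

```go
package state

import "sync"

type NodeState struct {
	mu sync.RWMutex
	// phase, timestamps, rebootCount, ... (private fields, getters only)
}

type StateStore struct {
	mu     sync.RWMutex
	states map[string]*NodeState
}

// Range copies the map under the read lock, then calls fn outside it so fn
// can safely call mutators on the store without self-deadlocking.
func (s *StateStore) Range(fn func(name string, st *NodeState)) {
	s.mu.RLock()
	snapshot := make(map[string]*NodeState, len(s.states))
	for k, v := range s.states {
		snapshot[k] = v
	}
	s.mu.RUnlock()

	for name, st := range snapshot {
		fn(name, st)
	}
}
```
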
The only difference from active mode is now the absence of the
executor.Reboot() call. rebootCount increments every retry,
maxRebootRetries eventually drives the node to PhaseFailed, and
the "Reboot retry" log stops instead of firing every rebootWait
cycle forever.

- state.MarkWaitingReboot drops the countReboot flag and always
  increments the counter
- watcher.go drops the !monitorOnly guard around state updates,
  logs, and metrics in the retry path
- watcher_test: invert TestRun_MonitorOnlyDoesNotIncrementRebootCount
  into TestRun_MonitorOnlySimulatesFullLifecycle

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…fo defer

- watcher.go: the HealthCheckTotal "result" label godoc promises
  pass/fail, but the code was passing the free-form reason string.
  Emit "pass"/"fail" and rely on logs for failure reason detail
  (keeps the metric's label cardinality bounded)
- main.go: remove the defer DeleteLabelValues for Info. Defers run
  LIFO after metricsServer.Shutdown, so nothing can scrape the
  delete; Prometheus already handles process exit via up{}=0

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- watcher.go: emit a structured "Health check failed" log inside the
  checker loop with the checker name and reason. The HealthCheckTotal
  metric stays bounded (pass/fail) while the reason is preserved for
  on-call debugging via logs
- state.go: remove PhaseReboot. The state machine transitions directly
  from Unhealthy to WaitingReboot; the enum value was never assigned.
  Subsequent iota values shift down by one but all comparisons use
  names and metric labels use String(), so there is no external impact
- state_test.go: drop the corresponding table entry

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The reason string is left unused in the watcher but the HealthChecker
interface still returns it for future consumers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a second give-up condition alongside maxRebootRetries: if
executor.Reboot returns errors more than maxRebootFailures times, the
node transitions to PhaseFailed from either PhaseUnhealthy (initial
reboot) or PhaseWaitingReboot (retry). Prevents infinite retries when
the Civo reboot API is unhealthy.

maxRebootFailures is intentionally kept as an internal default option
(not an env var) since failures have no wait window between attempts,
so a user-set high value would let the agent hammer the API.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

hlts2 commented Apr 23, 2026

@jarededwards - Thank you for your review. I will merge this PR!

hlts2 merged commit a77331e into main on Apr 23, 2026
1 check passed
hlts2 deleted the feat/node-recovery branch on April 23, 2026 at 06:53
