Conversation
… and metrics

Introduce a proper recovery state machine (Healthy → Unhealthy → WaitingReboot) with pluggable health checkers, Prometheus metrics, and monitor-only mode support.

- Add pkg/health: HealthChecker interface with NodeReady and GPU checkers
- Add pkg/operation: Executor interface with CivoExecutor (FOP-based)
- Add pkg/metrics: Prometheus metrics (health checks, recovery actions, phase, duration)
- Add pkg/watcher/state: NodePhase enum, NodeState (private fields), StateStore
- Refactor watcher: replace polling with Node Informer, state machine reconcile loop
- Remove legacy code: fake.go, inline reboot/check functions, sync.Map tracking
- Update main.go: new env vars (CIVO_NODE_MONITOR_ONLY, CIVO_NODE_UNHEALTHY_THRESHOLD_MINUTES, CIVO_NODE_METRICS_PORT)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ation

- Rename reboot.go to civo.go to better reflect the Civo executor scope
- Extract FOP (Option, WithAPIConfig, WithClient) into options.go
- Add validation for clusterID, apiKey, and apiURL before Civo client creation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hlts2 <hiroto.funakoshi.hiroto@gmail.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hlts2 <hiroto.funakoshi.hiroto@gmail.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hlts2 <hiroto.funakoshi.hiroto@gmail.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add Threshold() to HealthChecker interface for per-checker thresholds
- NodeReady: 10min, GPU: 10min, DiskPressure: 30min
- Remove single unhealthyThreshold from watcher; use min threshold of failed checkers
- Remove WithUnhealthyThresholdMinutes option and CIVO_NODE_UNHEALTHY_THRESHOLD_MINUTES env var
- Refactor main.go: parseUintOrZero, defaultMetricsPort as int

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ssages in NewWatcher Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…config flag

- Remove nodeDesiredGPUCount field, WithDesiredGPUCount option (GPU count is owned by checker)
- Remove NewGPUChecker public constructor (only used internally by NewDefaultCheckers)
- Add --kubeconfig flag with default /etc/rancher/k3s/k3s.yaml for CP VM

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Change NewWatcher signature: nodePoolID string → nodePoolIDs []string
- Add buildNodeSelector: empty=all nodes, single=MatchLabels, multiple=In operator
- Add parseNodePoolIDs in main.go for comma-separated CIVO_NODE_POOL_ID

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
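The buildNodeSelector behavior can be sketched as a label-selector string. This is a simplified illustration only: the real implementation presumably builds a Kubernetes LabelSelector object, and the `poolLabel` key used here is an assumption, not the agent's actual label name.

```go
package main

import (
	"fmt"
	"strings"
)

// poolLabel is an assumed label key for illustration; the real agent's
// node-pool label name may differ.
const poolLabel = "node-pool-id"

// buildNodeSelector sketches the described logic: an empty slice selects
// all nodes, a single ID uses an equality match (MatchLabels-style), and
// multiple IDs use a set-based "in" operator.
func buildNodeSelector(poolIDs []string) string {
	switch len(poolIDs) {
	case 0:
		return "" // empty selector matches every node
	case 1:
		return fmt.Sprintf("%s=%s", poolLabel, poolIDs[0])
	default:
		return fmt.Sprintf("%s in (%s)", poolLabel, strings.Join(poolIDs, ","))
	}
}

func main() {
	fmt.Println(buildNodeSelector(nil))
	fmt.Println(buildNodeSelector([]string{"pool-a"}))
	fmt.Println(buildNodeSelector([]string{"pool-a", "pool-b"}))
}
```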
… var

- GPU checker now compares nvidia.com/gpu.count label (expected) vs allocatable (actual)
- Auto-skips non-GPU nodes when label is absent
- Remove CIVO_NODE_DESIRED_GPU_COUNT env var and NewDefaultCheckers parameter
- NewDefaultCheckers() always includes GPU checker (self-determines GPU node)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
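The label-vs-allocatable comparison can be sketched like this. A hedged illustration: `checkGPU` and its reason strings are hypothetical stand-ins for the checker's internals, though the `nvidia.com/gpu.count` label and the "expected N but got M" reason shape come from this PR.

```go
package main

import (
	"fmt"
	"strconv"
)

// gpuCountLabel is set by GPU Feature Discovery and stays stable even
// when unhealthy GPUs drop out of allocatable.
const gpuCountLabel = "nvidia.com/gpu.count"

// checkGPU is a sketch of the checker logic: nodes without the label are
// auto-skipped (treated as healthy); otherwise the expected count from
// the label is compared with the node's allocatable GPU count.
func checkGPU(labels map[string]string, allocatableGPUs int64) (bool, string) {
	raw, ok := labels[gpuCountLabel]
	if !ok {
		return true, "NonGPUNode" // label absent: not a GPU node, skip
	}
	expected, err := strconv.ParseInt(raw, 10, 64)
	if err != nil {
		return false, "InvalidGPUCountLabel"
	}
	if allocatableGPUs < expected {
		return false, fmt.Sprintf("expected %d but got %d", expected, allocatableGPUs)
	}
	return true, "GPUCountMatch"
}

func main() {
	ok, reason := checkGPU(map[string]string{gpuCountLabel: "8"}, 7)
	fmt.Println(ok, reason)
}
```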
…tion

- NewWatcher signature: (ctx, clusterID, nodePoolIDs, opts...) → (ctx, clusterID, opts...)
- Add WithNodePoolIDs option with append semantics (empty is no-op)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
clusterID was only used by the old rebootNode() which moved to the executor. NewWatcher signature simplified: (ctx, clusterID, opts...) → (ctx, opts...) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The label is static (set by GFD) and correctly identifies GPU nodes even when all GPUs are unhealthy and allocatable drops to 0. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ttings

- CIVO_NODE_MONITOR_ONLY → CIVO_NODE_AGENT_MONITOR_ONLY
- CIVO_NODE_METRICS_PORT → CIVO_NODE_AGENT_METRICS_PORT

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Standard nodes: reboot wait 10min (CIVO_NODE_REBOOT_WAIT_MINUTES)
- GPU nodes: reboot wait 40min (CIVO_GPU_NODE_REBOOT_WAIT_MINUTES)
- NodeReady threshold changed from 10min to 5min per TDD
- Replace single rebootTimeWindowMinutes with rebootWaitMinutes + gpuRebootWaitMinutes
- WaitingReboot phase checks state.IsGPUNode() to select appropriate wait time

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
WithNodePoolIDs now accepts a comma-separated string and parses internally. Remove parseNodePoolIDs from main.go. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
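This parse-inside-the-option pattern might look like the sketch below (hypothetical `watcher` struct; only the described behavior is taken from the commit): split on commas, trim whitespace, append non-empty IDs, and treat an empty input as a no-op.

```go
package main

import (
	"fmt"
	"strings"
)

// watcher and Option are illustrative stand-ins for the real types.
type watcher struct{ nodePoolIDs []string }
type Option func(*watcher)

// WithNodePoolIDs accepts the raw comma-separated value (e.g. straight
// from CIVO_NODE_POOL_IDS) and parses it internally; empty or
// whitespace-only segments are dropped, so "" is a no-op.
func WithNodePoolIDs(raw string) Option {
	return func(w *watcher) {
		for _, id := range strings.Split(raw, ",") {
			if id = strings.TrimSpace(id); id != "" {
				w.nodePoolIDs = append(w.nodePoolIDs, id)
			}
		}
	}
}

func main() {
	w := &watcher{}
	WithNodePoolIDs("pool-a, pool-b,,")(w)
	fmt.Println(w.nodePoolIDs)
}
```

Moving the parsing into the option keeps main.go a thin env-to-option mapping, which is why parseNodePoolIDs could be deleted.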
Standard nodes should transition to PhaseDrain → PhaseReplace instead of retrying reboot indefinitely. GPU nodes keep reboot-only. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
WithMonitorOnly now accepts a string and parses internally via strconv.ParseBool. Empty or unparsable values preserve the default (true). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
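A minimal sketch of that option, assuming a hypothetical `watcher` struct: `strconv.ParseBool` does the parsing, and any error (including the empty string) leaves the previously-set default untouched.

```go
package main

import (
	"fmt"
	"strconv"
)

// watcher and Option are illustrative stand-ins for the real types.
type watcher struct{ monitorOnly bool }
type Option func(*watcher)

// WithMonitorOnly parses the raw env value internally; empty or
// unparsable input leaves the existing default (true) in place, so a
// misconfiguration cannot silently enable active reboots.
func WithMonitorOnly(raw string) Option {
	return func(w *watcher) {
		if v, err := strconv.ParseBool(raw); err == nil {
			w.monitorOnly = v
		}
	}
}

func main() {
	w := &watcher{monitorOnly: true} // safe default
	WithMonitorOnly("")(w)           // empty: default preserved
	fmt.Println(w.monitorOnly)
	WithMonitorOnly("false")(w)
	fmt.Println(w.monitorOnly)
}
```

Note that failing toward monitor-only (rather than active mode) is the safe direction for this agent.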
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Monitors CiliumAgentIsReady node condition. Auto-skips nodes where the condition is absent (Cilium not installed). Threshold: 10min. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…um checker Cilium sets NetworkUnavailable=False with reason CiliumIsUp, not a custom CiliumAgentIsReady condition. Skip if reason is not CiliumIsUp (other CNI). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
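The corrected check can be sketched as below. Hypothetical function and reason names; the grounding is only what the commit states: Cilium sets `NetworkUnavailable=False` with reason `CiliumIsUp`, and any other reason means a different CNI owns the condition, so the node is skipped.

```go
package main

import "fmt"

// checkCilium sketches the corrected checker: readiness is read from the
// standard NetworkUnavailable node condition. If the condition is absent
// or its reason is not CiliumIsUp, another CNI is in play and the check
// is skipped (treated as healthy).
func checkCilium(found bool, status, reason string) (bool, string) {
	if !found || reason != "CiliumIsUp" {
		return true, "SkippedNonCilium" // condition absent or other CNI
	}
	// NetworkUnavailable=False means the network is available (healthy).
	return status == "False", reason
}

func main() {
	fmt.Println(checkCilium(true, "False", "CiliumIsUp"))
	fmt.Println(checkCilium(true, "True", "NoRouteCreated"))
}
```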
Check now returns (bool, string) where the string is the reason. Condition-based checkers pass through cond.Reason directly. GPU checker returns a descriptive reason (e.g. "expected 8 but got 7"). The reason is used as the result label in HealthCheckTotal metric. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GPU node detection belongs in the health package where gpuCountLabel is defined. Removes strconv and corev1 imports from watcher. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…odeSelector

- Health: Threshold tests for all checkers, HasGPU tests
- Operation: NewCivoExecutor validation tests (empty clusterID, apiKey, apiURL)
- Watcher: buildNodeSelector tests (nil, single, multiple)
- Coverage: health 84→98%, operation 50→81%, total 71→76%

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- civo_node_agent_health_check_total
- civo_node_agent_recovery_actions_total
- civo_node_agent_node_unhealthy_duration_seconds
- civo_node_agent_recovery_phase

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…plugin instructions Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tracks failed recovery actions (e.g. Civo API errors during reboot), separating failures from successful attempts in recovery_actions_total. Cleaned up along with other per-node metrics on node removal. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Standard Prometheus info metric pattern for tracking deployments and distinguishing clusters. Use with group_left to enrich other metrics. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Ensures civo_node_agent_info is removed when the agent stops, signaling to Prometheus that this version/cluster instance is no longer active. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tracks reconcile loop errors (e.g. node list failures from the API server) to distinguish "agent alive but cannot do its job" from "agent down". Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Emit "Watcher reconcile loop started" once before the loop, not on every tick. The per-tick log was misleading (implied startup) and added noise. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
In monitor-only mode no reboot actually happens, so logging per-tick "waiting for reboot effect" is misleading and noisy (6 logs/min per unhealthy node). "Reboot retry" still fires every rebootWait cycle as a liveness signal. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previous version described the pre-refactor implementation (sync.Map, single time window, CIVO_NODE_DESIRED_GPU_COUNT etc.). Updated to reflect:

- State machine (pkg/watcher/state.go)
- Health checker package and per-checker thresholds
- Executor/NopExecutor abstraction
- Current env var names (CIVO_NODE_POOL_IDS, reboot wait, monitor-only)
- Prometheus metrics
- Known limitations

Also simplify the "Waiting for reboot effect" comment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Matches the pattern used by WithRebootWaitMinutes and WithGPURebootWaitMinutes: log an Info message when the input cannot be parsed, so misconfigurations are visible at startup. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove sections that duplicate what's in the code (env vars, flags, metrics, reconcile loop detail, dependencies, etc.). Readers can find these by reading the source.
- Clarify that the recommended deployment is a daemon process on the control plane VM, not a Kubernetes Pod. The Helm chart is available but secondary.
- Keep: project overview, build/test commands, package-level architecture, design conventions, known limitations (important for AI agents to avoid premature enabling of active reboots), release process.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove Known Limitations and Design Conventions sections — these belong in issue tracker / PR descriptions, not in the agent guide. Soften the deployment description: daemon on CP VM is preferred, Helm Deployment is also supported. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After maxRebootRetries reboots fail to recover a node, the state machine transitions to PhaseFailed and stops retrying. The node remains in Failed until it naturally recovers (all health checkers pass), at which point the existing recovery path resets it to Healthy.

- Add PhaseFailed to NodePhase enum and StateStore.MarkFailed()
- MarkWaitingReboot now takes a countReboot bool; monitor-only mode passes false so rebootCount tracks only real reboots issued to the Civo API
- Add WithMaxRebootRetries option (default 5) and CIVO_NODE_MAX_REBOOT_RETRIES env var
- Update Helm chart (maxRebootRetries in values.yaml and deployment env)
- Update README with the new value

Applied to both standard and GPU nodes to avoid infinite reboot loops. A future PR can wire up PhaseDrain → PhaseReplace for standard nodes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
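The give-up-and-recover behavior can be condensed into a small transition function. This is a sketch under assumptions: `nextPhase` is a hypothetical helper, and the real state machine lives in StateStore methods, but the ordering of the two rules (natural recovery wins, then the retry limit) reflects the commit's description.

```go
package main

import "fmt"

// NodePhase mirrors the enum described in the PR; the numeric values
// here are illustrative, only the names matter.
type NodePhase int

const (
	PhaseUnknown NodePhase = iota
	PhaseHealthy
	PhaseUnhealthy
	PhaseWaitingReboot
	PhaseFailed
)

const maxRebootRetries = 5 // CIVO_NODE_MAX_REBOOT_RETRIES default

// nextPhase sketches the retry-limit rule: a node whose checks all pass
// resets to Healthy from any phase; otherwise, once rebootCount reaches
// the limit it parks in Failed and no further reboots are issued.
func nextPhase(current NodePhase, rebootCount int, allHealthy bool) NodePhase {
	if allHealthy {
		return PhaseHealthy // natural recovery resets any phase
	}
	if rebootCount >= maxRebootRetries {
		return PhaseFailed
	}
	return PhaseWaitingReboot
}

func main() {
	fmt.Println(nextPhase(PhaseWaitingReboot, 5, false) == PhaseFailed)
	fmt.Println(nextPhase(PhaseFailed, 5, true) == PhaseHealthy)
}
```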
- Include Failed in the state machine summary
- Drop interface / type-level details from the Packages section; they can be read from the code directly

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- TestNodePhaseString: add PhaseFailed
- TestStateStoreMarkFailed / _Nonexistent: verify the transition
- TestRun_RebootRetryLimitExceeded_TransitionsToFailed: verify active-mode flow from Unhealthy → 3 reboots → WaitingReboot → Failed (no further reboots)
- TestRun_MonitorOnlyDoesNotIncrementRebootCount: rebootCount stays 0 and Failed is never reached in monitor-only mode
- TestRun_RecoverFromFailed: Failed → Healthy via Reset on successful checks

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- RBAC: add roleRef.apiGroup to ClusterRoleBinding (worked via apiserver defaulting but should be explicit)
- time.Duration: multiply by time.Minute at option-set time so the field truly holds a duration (was nanoseconds)
- metrics server: use errors.Is for http.ErrServerClosed and trigger the signal-based stop() when ListenAndServe fails unexpectedly, so the agent shuts down cleanly instead of running without observability
- watcher: drop the unused named return on setupKubernetesClient
- health.HealthChecker.Threshold godoc: document zero value as "trigger immediately on failure"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- NodeState: add sync.RWMutex; getters RLock; mutators take Get() then NodeState.mu.Lock() so the Store lock is released before the per-state lock
- StateStore.Range: iterate over a snapshot taken under the read lock so fn can safely call mutators without self-deadlocking
- Set RecoveryPhase{phase=Healthy}=1 every tick while healthy, not only on transition, so always-healthy nodes appear in the metric
- Call MarkWaitingReboot before the "Reboot retry" log and drop the stale "+1" in the rebootCount log; monitor mode now reports the actual (unincremented) count
- Drop leftover rebootWait*time.Minute in the waiting-for-reboot-effect log
- Remove unused ctx parameter from NewWatcher

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The only difference from active mode is now the absence of the executor.Reboot() call. rebootCount increments every retry, maxRebootRetries eventually drives the node to PhaseFailed, and the "Reboot retry" log stops instead of firing every rebootWait cycle forever.

- state.MarkWaitingReboot drops the countReboot flag and always increments the counter
- watcher.go drops the !monitorOnly guard around state updates, logs, and metrics in the retry path
- watcher_test: invert TestRun_MonitorOnlyDoesNotIncrementRebootCount into TestRun_MonitorOnlySimulatesFullLifecycle

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…fo defer

- watcher.go: the HealthCheckTotal "result" label godoc promises pass/fail, but the code was passing the free-form reason string. Emit "pass"/"fail" and rely on logs for failure reason detail (keeps the metric's label cardinality bounded)
- main.go: remove the defer DeleteLabelValues for Info. Defers run LIFO after metricsServer.Shutdown, so nothing can scrape the delete; Prometheus already handles process exit via up{}=0

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- watcher.go: emit a structured "Health check failed" log inside the checker loop with the checker name and reason. The HealthCheckTotal metric stays bounded (pass/fail) while the reason is preserved for on-call debugging via logs
- state.go: remove PhaseReboot. The state machine transitions directly from Unhealthy to WaitingReboot; the enum value was never assigned. Subsequent iota values shift down by one but all comparisons use names and metric labels use String(), so there is no external impact
- state_test.go: drop the corresponding table entry

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The reason string is left unused in the watcher but the HealthChecker interface still returns it for future consumers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a second give-up condition alongside maxRebootRetries: if executor.Reboot returns errors more than maxRebootFailures times, the node transitions to PhaseFailed from either PhaseUnhealthy (initial reboot) or PhaseWaitingReboot (retry). Prevents infinite retries when the Civo reboot API is unhealthy. maxRebootFailures is intentionally kept as an internal default option (not an env var) since failures have no wait window between attempts, so a user-set high value would let the agent hammer the API. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jarededwards approved these changes on Apr 23, 2026.

@jarededwards - Thank you for your review. I will merge this PR!
Summary
- Recovery state machine (Healthy → Unhealthy → WaitingReboot → Failed) with `PhaseUnknown` as zero value
- `pkg/health`: `HealthChecker` interface with per-checker thresholds and reason reporting
  - `NodeReadyChecker` (5min), `GPUChecker` (10min), `DiskPressureChecker` (30min), `CiliumChecker` (10min)
  - GPU checker compares the `nvidia.com/gpu.count` label vs `allocatable` (no static env var)
  - Cilium checker uses the `NetworkUnavailable` condition with the `CiliumIsUp` reason (auto-skips non-Cilium CNI)
  - `Check()` returns `(bool, string)`; the reason is used as a low-cardinality metrics label for observability
- `pkg/operation`: `Executor` interface with `CivoExecutor` (Functional Option Pattern) and `NopExecutor` as a safe default
- `pkg/metrics`: Prometheus metrics with `civo_` prefix (see Metrics section)
- `pkg/watcher/state`: `NodePhase` enum, `NodeState` (private fields + getters), `StateStore` with transition methods (including `MarkFailed`)
- After `CIVO_NODE_MAX_REBOOT_RETRIES` (default 5) attempts, transition to `Failed` to prevent infinite reboot loops. `rebootCount` only increments in active mode so monitor-only tracking is not polluted by simulated reboots.
- Node pool filtering: comma-separated (`CIVO_NODE_POOL_IDS`) or empty (all nodes)
- `civo_node_agent_info` cleanup via `defer`
- Helm chart: non-sensitive configuration in `values.yaml`
- Removed: `fake.go`, inline reboot/check functions, `sync.Map` tracking, `CIVO_NODE_DESIRED_GPU_COUNT`, `CIVO_NODE_REBOOT_TIME_WINDOW_MINUTES`

Deployment target
`civo-node-agent` is designed to run as a daemon process on the control plane VM (preferred deployment, installed via an external script outside this repo). A Helm chart is also provided for running it as a single-replica Deployment in `kube-system` if needed.

Metrics
All metrics use the `civo_` prefix.

| Metric | Labels |
| --- | --- |
| `civo_node_agent_info` | `version`, `cluster_id` |
| `civo_node_agent_health_check_total` | `node`, `checker`, `result` |
| `civo_node_agent_recovery_actions_total` | `node`, `action`, `mode` (`monitor` or `active`) |
| `civo_node_agent_recovery_failures_total` | `node`, `action` |
| `civo_node_agent_reconcile_errors_total` | `reason` (e.g. `list_nodes`) |
| `civo_node_agent_node_unhealthy_duration_seconds` | `node` |
| `civo_node_agent_recovery_phase` | `node`, `phase` (current phase of `Healthy`, `Unhealthy`, `WaitingReboot`, `Failed` = 1; others = 0) |

Environment Variables
Agent-specific settings use the `CIVO_NODE_AGENT_` prefix.

| Variable | Default | Notes |
| --- | --- | --- |
| `CIVO_API_KEY` | | from the `civo-api-access` secret in Helm deployment |
| `CIVO_API_URL` | | |
| `CIVO_CLUSTER_ID` | | used in `civo_node_agent_info` |
| `CIVO_REGION` | | |
| `CIVO_NODE_POOL_IDS` | | comma-separated; empty = all nodes |
| `CIVO_NODE_AGENT_MONITOR_ONLY` | `true` | |
| `CIVO_NODE_AGENT_METRICS_PORT` | `9625` | |
| `CIVO_NODE_REBOOT_WAIT_MINUTES` | `10` | |
| `CIVO_GPU_NODE_REBOOT_WAIT_MINUTES` | `40` | |
| `CIVO_NODE_MAX_REBOOT_RETRIES` | `5` | reboots before `Failed` |

Command Line Flags
| Flag | Default |
| --- | --- |
| `--kubeconfig` | `/etc/rancher/k3s/k3s.yaml` |
| `--version` | |

Helm Chart
Non-sensitive configuration moved from secrets to `values.yaml`. `CIVO_API_KEY`, `CIVO_API_URL`, `CIVO_CLUSTER_ID`, `CIVO_REGION` still come from the auto-provisioned `civo-api-access` secret.

| Value | Default | Notes |
| --- | --- | --- |
| `nodePoolIDs` | `""` | |
| `rebootWaitMinutes` | `10` | |
| `gpuRebootWaitMinutes` | `40` | |
| `maxRebootRetries` | `5` | reboots before `Failed` |
| `monitorOnly` | `true` | when `true`, log recovery actions without executing them |
| `metricsPort` | `9625` | |

The Deployment passes `--kubeconfig=""` so the agent uses in-cluster config.

Design Decisions
- Monitor-only by default (`true`): logs recovery actions without executing them
- GPU detection: `nvidia.com/gpu.count` (static, set by GFD) is compared to `allocatable["nvidia.com/gpu"]` (dynamic). `health.HasGPU` also uses the label so GPU nodes are correctly identified even when all GPUs are unhealthy
- After `maxRebootRetries` attempts the node goes to `Failed` (no further reboots) to prevent infinite loops. Recovery from `Failed` happens automatically when all health checks pass again.
- `rebootCount` tracks real reboots only: in monitor-only mode the counter is not incremented, so PhaseFailed transitions apply only to actual reboot-attempt exhaustion.
- `NopExecutor` as default: prevents nil pointer dereference when no executor is configured
- `nowFunc` defaults to `time.Now().UTC()`; state changes go through `StateStore` transition methods; test hooks: `withNowFunc`, `withNodeLister`
- `result` label: condition-based checkers pass through `cond.Reason`; GPU checker uses fixed values (`GPUCountMatch`, `GPUCountMismatch`, `NonGPUNode`, `NoAllocatableGPU`)

Not in This PR
- Enabling `monitorOnly=false` in production.
- Drain/replace flow (`PhaseDrain` / `PhaseReplace` defined but not wired up). Standard nodes that exceed the retry limit currently transition to `Failed` along with GPU nodes.
- `RepairAction` (`None`/`Reboot`/`Replace`)
- `context` propagation: the civogo library doesn't accept a context; a hung API call still blocks the reconcile tick
- Pod security hardening (`readOnlyRootFilesystem`, `allowPrivilegeEscalation: false`, `drop: [ALL]`)
- `/healthz` / `/readyz` endpoints

Test Plan
- `go test ./...` passes across all packages
- `go vet ./...` clean
- `go fmt ./...` applied
- `/metrics` endpoint returns Prometheus metrics with `civo_` prefix
- Verified `civo_node_agent_recovery_actions_total` and `civo_node_agent_recovery_phase` transitions
- `CIVO_NODE_AGENT_MONITOR_ONLY=true` logs but does not reboot
- `CIVO_NODE_AGENT_MONITOR_ONLY=false` triggers actual `HardRebootInstance` API call
- Node reaches `Failed` phase after `CIVO_NODE_MAX_REBOOT_RETRIES` failed attempts