feat(platform): RECON tracks A/C/D/F/H/I/J -- MachineConfigSync, NodeAddress, CAPI drop, node health, upgrade checkpoints#29
Merged
When a drift-k8s-version-{cluster} DriftSignal arrives (emitted by
KubernetesVersionDriftLoop on the tenant conductor), create a corrective
UpgradePolicy (type=kubernetes, targetKubernetesVersion=spec.kubernetesVersion)
in seam-tenant-{cluster}. UpgradePolicyReconciler picks it up and submits
a kube-upgrade executor Job to bring the cluster back to declared state.
Routing: InfrastructureTalosCluster signals are now distinguished by name
prefix -- drift-k8s-version-* routes to handleKubernetesVersionDrift,
all others continue to handleTalosVersionDrift.
1 unit test: TestDriftSignalReconciler_K8sVersionDrift_CreatesUpgradePolicy.
All 7 DriftSignal unit tests pass.
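The prefix-based routing described above could be sketched roughly as follows. Only the drift-k8s-version- prefix and the two handler names come from this commit; the function routeDriftSignal and the choice to return the handler name as a string are assumptions made for illustration.

```go
package main

import "strings"

// k8sVersionDriftPrefix is the name prefix mentioned in the commit message.
const k8sVersionDriftPrefix = "drift-k8s-version-"

// routeDriftSignal is a hypothetical helper: it returns the name of the
// handler that would receive a DriftSignal with the given name.
func routeDriftSignal(signalName string) string {
	if strings.HasPrefix(signalName, k8sVersionDriftPrefix) {
		return "handleKubernetesVersionDrift"
	}
	// Every other signal keeps the pre-existing Talos version path.
	return "handleTalosVersionDrift"
}
```

Matching on a name prefix keeps the existing Talos path as the default, so signals that predate this change are routed exactly as before.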
…paths
reconcileVersionUpgrade now derives UpgradePolicy type from which version
fields are set:
- talosVersion only -> UpgradeTypeTalos (existing)
- kubernetesVersion only -> UpgradeTypeKubernetes
- both -> UpgradeTypeStack
Two new unit tests: TestTalosCluster_VersionUpgrade_KubernetesOnly_CreatesKubePolicy
and TestTalosCluster_VersionUpgrade_Stack_CreatesBothVersions.
All 8 version upgrade tests pass.
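A minimal sketch of the type derivation, assuming empty strings mean "not set". The three constant names come from the commit message; their string values, the UpgradeType type, and the helper deriveUpgradeType are illustrative assumptions, not the actual reconciler code.

```go
package main

// UpgradeType mirrors the three constants named in the commit message.
// The string values are assumptions for this sketch.
type UpgradeType string

const (
	UpgradeTypeTalos      UpgradeType = "talos"
	UpgradeTypeKubernetes UpgradeType = "kubernetes"
	UpgradeTypeStack      UpgradeType = "stack"
)

// deriveUpgradeType is a hypothetical helper. The boolean is false when
// neither version field is set, in which case no policy type applies.
func deriveUpgradeType(talosVersion, kubernetesVersion string) (UpgradeType, bool) {
	switch {
	case talosVersion != "" && kubernetesVersion != "":
		return UpgradeTypeStack, true
	case kubernetesVersion != "":
		return UpgradeTypeKubernetes, true
	case talosVersion != "":
		return UpgradeTypeTalos, true
	default:
		return "", false
	}
}
```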
…uster}
The UpgradePolicy was created in tc.Namespace (seam-system for imported
clusters). Conductor's stackUpgradeHandler reads the UpgradePolicy from
tenantNamespace(clusterRef) = seam-tenant-{cluster}, so the executor Job
looked in the wrong namespace and could not find the policy.
Fix: create UpgradePolicy in seam-tenant-{tc.Name} where the
platform-executor SA, talosconfig Secret, and Conductor executor all
already live. Closes STACK-UPGRADE-UP-NAMESPACE; STACK-UPGRADE-MGMT-SA
and STACK-UPGRADE-TALOSCONFIG-SCOPE are superseded.
Tests: update all UpgradePolicy namespace lookups to seam-tenant-ccs-mgmt.
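The namespace mismatch is easier to see with the mapping written out. The sketch below assumes tenantNamespace takes the cluster name as a plain string; the real signature takes a clusterRef and is not shown in this commit.

```go
package main

// tenantNamespace sketches the seam-tenant-{cluster} convention described
// above. Taking a plain string is an assumption made for illustration.
func tenantNamespace(clusterName string) string {
	return "seam-tenant-" + clusterName
}

// upgradePolicyNamespace is a hypothetical helper showing the fix: the
// policy is created in the tenant namespace derived from the cluster name,
// not in the TalosCluster's own namespace (seam-system for imports).
func upgradePolicyNamespace(talosClusterName string) string {
	return tenantNamespace(talosClusterName)
}
```

With the writer and the reader deriving the namespace from the same cluster name, the executor Job looks where the policy was actually created.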
… under seam.ontai.dev
Defines TalosCluster under api/seam/v1alpha1 (seam.ontai.dev/v1alpha1).
Removes the dead InfrastructureTalosCluster stub from api/v1alpha1.
Adds seam.ontai.dev_talosclusters.yaml CRD manifest.
Updates main.go, reconciler, and all consumer tests to the new type.
Also adds CRD manifests for day-2 types produced during session 25.
…nder seam.ontai.dev
Replace seam-core -> seam in go.mod replace/require. Update all Go import paths from github.com/ontai-dev/seam-core/ to github.com/ontai-dev/seam/. Add seam-sdk replace + require. Update runnerconfig_cr.go type aliases to use post-MIGRATION-3.8 names (RunnerConfig, RunnerConfigSpec, RunnerConfigStatus).
Replace ../seam-core with ../seam following the seam-core -> seam filesystem rename. Module path github.com/ontai-dev/seam was already updated in Phase 4; this aligns the local path pointer.
…ontai.dev
Update rbacPolicyGVK, rbacProfileGVK and APIGroups arrays from
security.ontai.dev to guardian.ontai.dev in taloscluster_helpers.go and
associated tests.
…names
- Replace testdata/crds/infrastructure.ontai.dev_infrastructurerunnerconfigs.yaml
  with seam.ontai.dev_runnerconfigs.yaml (current seam CRD, same group/kind)
- Comments: InfrastructureTalosCluster -> TalosCluster,
  InfrastructureTalosClusterOperationResult -> ClusterLog, seam-core -> seam
  (module/repo references, not schema doc names)
All 3 platform test packages pass (unit, integration/capi, integration/day2).
Fresh documentation from current codebase. seam-core references replaced with seam. wrapper references replaced with dispatcher. TalosCluster and ClusterLog ownership under seam.ontai.dev clarified. platform.ontai.dev day-2 CRD catalog updated to match current Go types.
All short-lived day2 operation CRs (NodeOperation, NodeMaintenance, EtcdMaintenance, PKIRotation, MaintenanceBundle, MachineConfigBackup, MachineConfigRestore) now self-delete 6 hours after their completion condition transitions to True. The reconciler requeues with the exact remaining duration so no polling occurs. ClusterLog retains the permanent operational record; the CR is ephemeral. Introduces day2TTLExpired helper and day2OperationTTL constant in operational_job_base.go. MaintenanceBundle also applies the TTL to its Degraded terminal condition since that is equally a final state.
… tag always
conductorExecuteImageName was "conductor-execute" but the built image is
"conductor-exec" (INV-011). executorImageTag was overriding to "dev" in lab
builds which also conflicts with INV-011 (conductor exec tracks Talos
version). Both fixed: image name corrected, tag always uses
tc.Spec.TalosVersion.
…ig schema + MachineConfigSync CRD + health conditions
RECON-A1: Add machineconfig secret label constants and naming helpers in
internal/controller/machineconfig_labels.go. Defines all platform.ontai.dev/
label keys, sync status values, class values, and MachineConfigSecretName().
RECON-A5 (partial): Add MachineConfigSync CRD type in
api/v1alpha1/machineconfigsync_types.go. Full spec/status schema with
clusterRef, nodeClass, forceApply, reason, and lineage fields. DeepCopy
methods added to zz_generated.deepcopy.go. Reconciler and exec capability
handler remain pending.
RECON-B2: Add NodeHealthSummary, HumanInterventionRequired,
CapacitySaturation, DiskPressure condition type constants and health reason
constants to taloscluster_types.go. NodeHealthAnnotation constant for
per-node JSON summary. Written by conductor ClusterNodeHealthLoop (RECON-B1).
…health conditions + reconciler
…tatus
New types in api/v1alpha1/upgradepolicy_types.go:
- UpgradeProgressPhase enum: upgrading, complete
- UpgradeProgress struct: CompletedNodes, CurrentNode, FailedNode, Phase
- UpgradePolicyStatus.Progress *UpgradeProgress optional field
Enables conductor exec Jobs to write per-node checkpoint state after each
successful node step so retry Jobs can skip already-upgraded nodes. Closes
the C->T feedback gap for partial upgrade completion (RECON-J6, GAP-19).
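How a retry Job might use the checkpoint to skip finished nodes is sketched below. The field names come from the commit message; the field types, the constant names, and the helper remainingNodes are assumptions, and the real exec Job logic is not part of this change.

```go
package main

// UpgradeProgressPhase and UpgradeProgress follow the field list in the
// commit message. Field types are assumptions for this sketch.
type UpgradeProgressPhase string

const (
	UpgradePhaseUpgrading UpgradeProgressPhase = "upgrading"
	UpgradePhaseComplete  UpgradeProgressPhase = "complete"
)

type UpgradeProgress struct {
	CompletedNodes []string
	CurrentNode    string
	FailedNode     string
	Phase          UpgradeProgressPhase
}

// remainingNodes is a hypothetical helper: it returns the nodes that are
// not yet recorded in CompletedNodes, preserving the original order.
func remainingNodes(all []string, progress *UpgradeProgress) []string {
	if progress == nil {
		return all // no checkpoint yet, so every node still needs upgrading
	}
	done := make(map[string]struct{}, len(progress.CompletedNodes))
	for _, n := range progress.CompletedNodes {
		done[n] = struct{}{}
	}
	var remaining []string
	for _, n := range all {
		if _, ok := done[n]; !ok {
			remaining = append(remaining, n)
		}
	}
	return remaining
}
```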
…in + K8s ready check)
… secrets
Read machineconfig from each Talos node during mode=import onboarding via MachineConfigReaderFn
(talos goclient in production, injectable function in tests). Classify nodes by machine.type
into controlplane/worker classes. Write seam-mc-{cluster}-{class} Secrets with sync-status:pending
label and SHA-256 hash. Create MachineConfigSync CRs with reason=import-initial-sync so conductor
injects the ONT-controlled node label via machineconfig-sync capability. Failure is non-fatal:
import proceeds to Ready=True even when MCSOT fails (e.g. node unreachable at import time).
Unit tests: node classification, both-classes from multi-endpoint, Secret idempotency,
MachineConfigSync CR idempotency, all-endpoints-fail non-fatal behavior.
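The classification and naming steps might look like this sketch. It assumes machine.type carries the literal values "controlplane" and "worker" and treats anything else as an error; the helper names are hypothetical, and the real MachineConfigSecretName() signature is not shown in this PR.

```go
package main

import "fmt"

// Node classes named in the commit message.
const (
	classControlPlane = "controlplane"
	classWorker       = "worker"
)

// classifyNode maps a machineconfig machine.type value to a node class.
// Accepting only these two literal values is an assumption of this sketch.
func classifyNode(machineType string) (string, error) {
	switch machineType {
	case "controlplane":
		return classControlPlane, nil
	case "worker":
		return classWorker, nil
	default:
		return "", fmt.Errorf("unrecognized machine.type %q", machineType)
	}
}

// secretNameFor follows the seam-mc-{cluster}-{class} pattern above.
// It is a stand-in for MachineConfigSecretName().
func secretNameFor(cluster, class string) string {
	return fmt.Sprintf("seam-mc-%s-%s", cluster, class)
}
```

Because the Secret name depends only on cluster and class, several endpoints of the same class map to a single Secret, which fits the idempotency tests listed above.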
…c on content change
Add Secret Watch in TalosClusterReconciler.SetupWithManager for machineconfig Secrets
labeled platform.ontai.dev/mc-class. Maps Secret events to TalosCluster reconcile requests
via LabelMachineConfigCluster label.
reconcileMachineConfigSync: detects when SHA-256(data.machineconfig) differs from
platform.ontai.dev/sync-hash label. On content change: patches Secret sync-status to
pending, deletes stale watch-triggered MachineConfigSync CR if present, creates new CR
named {cluster}-mc-sync-{class} with reason=secret-content-changed. No-op when hash
matches (avoids duplicate CRs alongside import-triggered {cluster}-mc-import-{class} CRs).
Unit tests: content-change creates CR, no-change does not create CR, stale CR replaced.
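The change-detection check reduces to comparing a freshly computed digest with the stored label value. The sketch below compares a full hex digest, which is an assumption: a 64-character hex SHA-256 exceeds the 63-character limit on Kubernetes label values, so the stored form is presumably shortened or encoded in a way this PR does not show. The helper names are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// machineConfigHash returns the hex-encoded SHA-256 of the raw
// data.machineconfig bytes.
func machineConfigHash(machineconfig []byte) string {
	sum := sha256.Sum256(machineconfig)
	return hex.EncodeToString(sum[:])
}

// needsSync reports whether the Secret content no longer matches the
// recorded sync hash. A false result is the no-op path.
func needsSync(machineconfig []byte, recordedHash string) bool {
	return machineConfigHash(machineconfig) != recordedHash
}

// syncCRName follows the {cluster}-mc-sync-{class} pattern, which keeps
// watch-triggered CRs distinct from {cluster}-mc-import-{class} CRs.
func syncCRName(cluster, class string) string {
	return cluster + "-mc-sync-" + class
}
```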
TalosCluster.status.deletionStage records progress through the deletion cascade so a reconciler restart can resume from the correct step. Stage is written before each step (pack-execution, pack-installed, runner-config) and on completion. deletionStageReached() uses index-based ordering for skip logic. advanceDeletionStage() treats NotFound as success (object GC'd after all finalizers removed). 4 unit tests added.
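Index-based ordering for the skip logic could be sketched as follows. The three step names come from the commit message; the "complete" stage name, the helper stageIndex, and the deletionStageReached signature are assumptions. Whether a step whose stage equals the recorded one is skipped or re-run is not stated in the source, so this sketch only answers "has the cascade reached this stage".

```go
package main

// deletionStages lists the cascade in order. The final "complete" entry
// is an assumed name for the completion marker.
var deletionStages = []string{
	"pack-execution",
	"pack-installed",
	"runner-config",
	"complete",
}

// stageIndex returns the position of stage in the cascade, or -1 when the
// stage is empty or unknown.
func stageIndex(stage string) int {
	for i, s := range deletionStages {
		if s == stage {
			return i
		}
	}
	return -1
}

// deletionStageReached reports whether the recorded stage is at or past
// the target stage. An empty or unknown recorded stage reaches nothing.
func deletionStageReached(recorded, target string) bool {
	r, t := stageIndex(recorded), stageIndex(target)
	return r >= 0 && t >= 0 && r >= t
}
```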
… day2 reconcilers
Adds retryCount/maxRetry fields to all five day2 CRD specs
(MachineConfigSync, UpgradePolicy, EtcdMaintenance, NodeMaintenance,
NodeOperation). Retry logic uses a deterministic job name encoding the retry
count to avoid job naming conflicts without requiring explicit deletion.
Permanent failure sets HumanInterventionRequired on TalosCluster.
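One way to encode the retry count in the Job name is sketched below. The "-r{n}" suffix format, both helper names, and the "retryCount >= maxRetry" exhaustion rule are assumptions; the commit message only states that the name is deterministic and includes the retry count.

```go
package main

import "fmt"

// retryJobName derives a Job name from a base name and the retry count,
// so each attempt gets a distinct, reproducible name. The suffix format
// is an assumption for this sketch.
func retryJobName(base string, retryCount int) string {
	return fmt.Sprintf("%s-r%d", base, retryCount)
}

// retriesExhausted reports whether another attempt is allowed. Treating
// retryCount >= maxRetry as permanent failure is an assumed rule.
func retriesExhausted(retryCount, maxRetry int) bool {
	return retryCount >= maxRetry
}
```

Because the name is a pure function of the base name and the count, re-running the reconciler for the same attempt resolves to the same Job, while the next attempt cannot collide with a failed Job that still exists.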
Add ConditionTypeNodeInfrastructureReady to TalosCluster condition constants. True when all nodes: machineconfig applied, ont-controlled label injected, talosconfig endpoints current. Prerequisite for Kubernetes-layer B selections (tenant conductor RuntimeDrift remediation gate). RECON-H2.
…ON-C9
Adds reconcileNodeRosterRefresh, triggered by the
platform.ontai.dev/refresh-node-roster annotation on a stable-Ready
TalosCluster. It:
- re-reads the live node roster via ensureMachineConfigSecrets
- marks vanished per-node secrets as 'decommissioned' (audit preserved, INV-006)
- emits a NodeRosterRefreshed Event
- clears the annotation
5 unit tests pass.
… gzip compression, node-rollback op type + wipe field; fix hash-over-uncompressed bug in reconcileMachineConfigSync
…kubeconfig mount for scale-up
NodeOperationSpec gains TargetNodeIP (required for scale-up) and NodeRole
(controlplane|worker, defaults to worker). Rollback added to
NodeOperationType enum. nodeoperation_reconciler adds node-rollback
capability mapping and addKubeconfigMount call for scale-up jobs (RECON-J2
kubeconfig pattern).
…r clean
Remove spec.capi.enabled and all CAPI dual-path reconcilers from platform
and conductor. Zero live usage (both lab clusters use direct bootstrap).
This eliminates ~1300 LOC of dual-path complexity across four day-2
reconcilers, TalosClusterReconciler helpers, and compiler input structs.
Changes:
- api/seam/v1alpha1/taloscluster_types.go: remove CAPIConfig,
  CAPIControlPlaneConfig, CAPIWorkerConfig, CAPICiliumPackRefInput types and
  spec.capi field
- api/v1alpha1/: remove CAPI-specific condition constants from four day-2
  types
- internal/controller/: collapse four dual-path reconcilers to
  direct/conductor path only; remove CAPI helper functions from
  taloscluster_helpers.go
- test/: delete four CAPI-only test files; update remaining tests to remove
  CAPI references (role=tenant replaces capi.enabled=true as the finalizer
  gate)
RECON-D1 Status: COMPLETE
…HEAD for all RECON-D1 CAPI removals
Summary
Test plan
go test ./... passes (all unit + integration suites)