feat(platform): RECON tracks A/C/D/F/H/I/J -- MachineConfigSync, NodeAddress, CAPI drop, node health, upgrade checkpoints #29

Merged
ontave merged 34 commits into main from session/pr-merge-reconcile on May 29, 2026

@ontave commented May 29, 2026

Summary

  • RECON-D1: CAPI dual-path fully removed from platform -- TalosCluster reconciler is Conductor-only; all CAPI scaffolding, types, and test fixtures deleted
  • RECON-A2/A4/A8: MachineConfigSync CRD + reconciler; per-node patch secrets; gzip compression and coalesce window for sync operations
  • RECON-A9: NodeAddress field with role classification (control-plane/worker) wired into TalosCluster spec
  • RECON-C8/C9: NodeOperation adds TargetNodeIP + NodeRole + rollback field; post-import node roster refresh via annotation
  • RECON-C5/C7: per-node health check loop; node maintenance mode detection
  • RECON-F5/F6: platform controller reconnect jitter; federation stream improvements
  • RECON-H2/H4: ConditionTypeNodeInfrastructureReady constant; RunnerConfig node-rollback op type + wipe field
  • RECON-I4: IdentityProvider/IdentityBinding live deployment wiring
  • RECON-J6/J7: upgrade progress checkpoint (completedNodes) + two-layer C proof (Talos+K8s Ready) -- see conductor for exec side
  • Merge: origin/main (93542ef k8s-drift+migration commit) -- CAPI constants/types taken from local HEAD (RECON-D1 removal)
  • Closes platform PR #27 (feat: k8s version drift corrective UpgradePolicy in DriftSignalReconciler); session/25c-k8s-drift-remediation is superseded by local commits

Test plan

  • go test ./... passes (all unit + integration suites)
  • No CAPI import in any non-provider reconciler file
  • MachineConfigSync reconciler creates/updates machineconfig secrets correctly
  • NodeAddress field serializes with role field

ontave added 30 commits May 7, 2026 07:33
When a drift-k8s-version-{cluster} DriftSignal arrives (emitted by
KubernetesVersionDriftLoop on the tenant conductor), create a corrective
UpgradePolicy (type=kubernetes, targetKubernetesVersion=spec.kubernetesVersion)
in seam-tenant-{cluster}. UpgradePolicyReconciler picks it up and submits
a kube-upgrade executor Job to bring the cluster back to declared state.

Routing: InfrastructureTalosCluster signals are now distinguished by name
prefix -- drift-k8s-version-* routes to handleKubernetesVersionDrift,
all others continue to handleTalosVersionDrift.

1 unit test: TestDriftSignalReconciler_K8sVersionDrift_CreatesUpgradePolicy.
All 7 DriftSignal unit tests pass.
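
The prefix-based routing described above can be sketched roughly as follows. This is a minimal illustration, not the reconciler's actual code: the `driftHandler` type and `routeDriftSignal` function are hypothetical, and only the prefix and the two handler names come from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// k8sVersionDriftPrefix is the signal-name prefix named in the commit message.
const k8sVersionDriftPrefix = "drift-k8s-version-"

// driftHandler is a hypothetical label for the handler a signal is routed to.
type driftHandler string

const (
	handlerKubernetesVersionDrift driftHandler = "handleKubernetesVersionDrift"
	handlerTalosVersionDrift      driftHandler = "handleTalosVersionDrift"
)

// routeDriftSignal picks a handler from the DriftSignal name alone:
// the k8s-version prefix wins, everything else falls through to Talos.
func routeDriftSignal(signalName string) driftHandler {
	if strings.HasPrefix(signalName, k8sVersionDriftPrefix) {
		return handlerKubernetesVersionDrift
	}
	return handlerTalosVersionDrift
}

func main() {
	// The second name is made up for illustration.
	for _, name := range []string{"drift-k8s-version-ccs-mgmt", "some-other-signal"} {
		fmt.Printf("%s -> %s\n", name, routeDriftSignal(name))
	}
}
```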
…paths

reconcileVersionUpgrade now derives UpgradePolicy type from which version
fields are set: talosVersion only -> UpgradeTypeTalos (existing), kubernetesVersion
only -> UpgradeTypeKubernetes, both -> UpgradeTypeStack. Two new unit tests:
TestTalosCluster_VersionUpgrade_KubernetesOnly_CreatesKubePolicy and
TestTalosCluster_VersionUpgrade_Stack_CreatesBothVersions. All 8 version
upgrade tests pass.
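
As a rough sketch of the derivation rule above: the constant names are taken from the commit message, while the string values (other than `kubernetes`, which appears in an earlier commit), the `deriveUpgradeType` function, and its handling of the neither-set case are assumptions.

```go
package main

import "fmt"

// UpgradeType mirrors the three policy types named in the commit message.
// Only the "kubernetes" value is confirmed by the source; the others are assumed.
type UpgradeType string

const (
	UpgradeTypeTalos      UpgradeType = "talos"
	UpgradeTypeKubernetes UpgradeType = "kubernetes"
	UpgradeTypeStack      UpgradeType = "stack"
)

// deriveUpgradeType (hypothetical name) maps which version fields are set
// to a policy type. ok is false when neither field is set.
func deriveUpgradeType(talosVersion, kubernetesVersion string) (upgradeType UpgradeType, ok bool) {
	switch {
	case talosVersion != "" && kubernetesVersion != "":
		return UpgradeTypeStack, true
	case kubernetesVersion != "":
		return UpgradeTypeKubernetes, true
	case talosVersion != "":
		return UpgradeTypeTalos, true
	default:
		return "", false
	}
}

func main() {
	// Version strings are placeholders.
	got, _ := deriveUpgradeType("v1.9.0", "v1.32.0")
	fmt.Println(got)
}
```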
…uster}

The UpgradePolicy was created in tc.Namespace (seam-system for imported
clusters). Conductor's stackUpgradeHandler reads the UpgradePolicy from
tenantNamespace(clusterRef) = seam-tenant-{cluster}, so the executor Job
looked in the wrong namespace and could not find the policy.

Fix: create UpgradePolicy in seam-tenant-{tc.Name} where the
platform-executor SA, talosconfig Secret, and Conductor executor all
already live. Closes STACK-UPGRADE-UP-NAMESPACE; STACK-UPGRADE-MGMT-SA
and STACK-UPGRADE-TALOSCONFIG-SCOPE are superseded.

Tests: update all UpgradePolicy namespace lookups to seam-tenant-ccs-mgmt.
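
The namespace mismatch can be pictured with a small sketch. `tenantNamespace` is named in the commit message, but its signature here is an assumption, and the cluster name is taken from the test note above.

```go
package main

import "fmt"

// tenantNamespace follows the seam-tenant-{cluster} convention from the
// commit message. The signature is assumed for illustration.
func tenantNamespace(clusterName string) string {
	return "seam-tenant-" + clusterName
}

func main() {
	tcNamespace, tcName := "seam-system", "ccs-mgmt"

	// Before the fix the policy was written to the TalosCluster's own
	// namespace, while the executor looked in the tenant namespace.
	fmt.Println("written (before):", tcNamespace)
	fmt.Println("read by executor:", tenantNamespace(tcName))
	fmt.Println("written (after): ", tenantNamespace(tcName))
}
```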
… under seam.ontai.dev

Defines TalosCluster under api/seam/v1alpha1 (seam.ontai.dev/v1alpha1).
Removes the dead InfrastructureTalosCluster stub from api/v1alpha1.
Adds seam.ontai.dev_talosclusters.yaml CRD manifest.
Updates main.go, reconciler, and all consumer tests to the new type.
Also adds CRD manifests for day-2 types produced during session 25.

Replace seam-core -> seam in go.mod replace/require. Update all Go
import paths from github.com/ontai-dev/seam-core/ to
github.com/ontai-dev/seam/. Add seam-sdk replace + require. Update
runnerconfig_cr.go type aliases to use post-MIGRATION-3.8 names
(RunnerConfig, RunnerConfigSpec, RunnerConfigStatus).

Replace ../seam-core with ../seam following the seam-core -> seam
filesystem rename. Module path github.com/ontai-dev/seam was already
updated in Phase 4; this aligns the local path pointer.
…ontai.dev

Update rbacPolicyGVK, rbacProfileGVK and APIGroups arrays from security.ontai.dev
to guardian.ontai.dev in taloscluster_helpers.go and associated tests.
…names

- Replace testdata/crds/infrastructure.ontai.dev_infrastructurerunnerconfigs.yaml
  with seam.ontai.dev_runnerconfigs.yaml (current seam CRD, same group/kind)
- Comments: InfrastructureTalosCluster -> TalosCluster,
  InfrastructureTalosClusterOperationResult -> ClusterLog,
  seam-core -> seam (module/repo references, not schema doc names)
All 3 platform test packages pass (unit, integration/capi, integration/day2).

Fresh documentation from current codebase. seam-core references replaced
with seam. wrapper references replaced with dispatcher. TalosCluster and
ClusterLog ownership under seam.ontai.dev clarified. platform.ontai.dev
day-2 CRD catalog updated to match current Go types.

All short-lived day2 operation CRs (NodeOperation, NodeMaintenance,
EtcdMaintenance, PKIRotation, MaintenanceBundle, MachineConfigBackup,
MachineConfigRestore) now self-delete 6 hours after their completion
condition transitions to True. The reconciler requeues with the exact
remaining duration so no polling occurs. ClusterLog retains the
permanent operational record; the CR is ephemeral.

Introduces day2TTLExpired helper and day2OperationTTL constant in
operational_job_base.go. MaintenanceBundle also applies the TTL to its
Degraded terminal condition since that is equally a final state.
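
A minimal sketch of the TTL check and exact-remaining requeue described above. The names `day2TTLExpired` and `day2OperationTTL` come from the commit message; the signature, return values, and the demo timestamp are assumptions. In a controller-runtime reconciler the remaining duration would typically be returned as `RequeueAfter`.

```go
package main

import (
	"fmt"
	"time"
)

// day2OperationTTL is the 6-hour retention stated in the commit message.
const day2OperationTTL = 6 * time.Hour

// day2TTLExpired (signature assumed) reports whether the TTL has elapsed
// since the completion condition transitioned, and otherwise how long is
// left, so the caller can requeue once instead of polling.
func day2TTLExpired(completedAt, now time.Time) (remaining time.Duration, expired bool) {
	deadline := completedAt.Add(day2OperationTTL)
	if !now.Before(deadline) {
		return 0, true
	}
	return deadline.Sub(now), false
}

// demoCompletedAt is an arbitrary timestamp used only by the demo below.
var demoCompletedAt = time.Date(2026, time.May, 7, 7, 33, 0, 0, time.UTC)

func main() {
	now := demoCompletedAt.Add(90 * time.Minute)
	if remaining, expired := day2TTLExpired(demoCompletedAt, now); expired {
		fmt.Println("delete the CR")
	} else {
		fmt.Println("requeue after", remaining)
	}
}
```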
… tag always

conductorExecuteImageName was "conductor-execute" but the built image is
"conductor-exec" (INV-011). executorImageTag was overriding to "dev" in lab
builds which also conflicts with INV-011 (conductor exec tracks Talos version).
Both fixed: image name corrected, tag always uses tc.Spec.TalosVersion.
…ig schema + MachineConfigSync CRD + health conditions

RECON-A1: Add machineconfig secret label constants and naming helpers in
  internal/controller/machineconfig_labels.go. Defines all platform.ontai.dev/
  label keys, sync status values, class values, and MachineConfigSecretName().

RECON-A5 (partial): Add MachineConfigSync CRD type in
  api/v1alpha1/machineconfigsync_types.go. Full spec/status schema with
  clusterRef, nodeClass, forceApply, reason, and lineage fields. DeepCopy
  methods added to zz_generated.deepcopy.go. Reconciler and exec capability
  handler remain pending.

RECON-B2: Add NodeHealthSummary, HumanInterventionRequired, CapacitySaturation,
  DiskPressure condition type constants and health reason constants to
  taloscluster_types.go. NodeHealthAnnotation constant for per-node JSON summary.
  Written by conductor ClusterNodeHealthLoop (RECON-B1).
…tatus

New types in api/v1alpha1/upgradepolicy_types.go:
- UpgradeProgressPhase enum: upgrading, complete
- UpgradeProgress struct: CompletedNodes, CurrentNode, FailedNode, Phase
- UpgradePolicyStatus.Progress *UpgradeProgress optional field

Enables conductor exec Jobs to write per-node checkpoint state after each
successful node step so retry Jobs can skip already-upgraded nodes. Closes
the C->T feedback gap for partial upgrade completion (RECON-J6, GAP-19).
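
To illustrate how a retry Job might use the checkpoint, here is a sketch under stated assumptions: the struct and field names follow the list above, but the field types, the constant names, and the `pendingNodes` helper are hypothetical, and the node names are made up.

```go
package main

import "fmt"

// UpgradeProgressPhase holds the two enum values listed in the commit.
type UpgradeProgressPhase string

const (
	UpgradeProgressPhaseUpgrading UpgradeProgressPhase = "upgrading"
	UpgradeProgressPhaseComplete  UpgradeProgressPhase = "complete"
)

// UpgradeProgress uses the field names from the commit; types are assumed.
type UpgradeProgress struct {
	CompletedNodes []string
	CurrentNode    string
	FailedNode     string
	Phase          UpgradeProgressPhase
}

// pendingNodes (hypothetical helper) returns the nodes a retry still has
// to upgrade, preserving order and skipping checkpointed nodes.
func pendingNodes(all []string, progress *UpgradeProgress) []string {
	if progress == nil {
		return append([]string(nil), all...)
	}
	done := make(map[string]struct{}, len(progress.CompletedNodes))
	for _, n := range progress.CompletedNodes {
		done[n] = struct{}{}
	}
	var pending []string
	for _, n := range all {
		if _, ok := done[n]; !ok {
			pending = append(pending, n)
		}
	}
	return pending
}

func main() {
	progress := &UpgradeProgress{
		CompletedNodes: []string{"cp-1", "cp-2"},
		FailedNode:     "cp-3",
		Phase:          UpgradeProgressPhaseUpgrading,
	}
	fmt.Println(pendingNodes([]string{"cp-1", "cp-2", "cp-3", "w-1"}, progress))
}
```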
… secrets

Read machineconfig from each Talos node during mode=import onboarding via MachineConfigReaderFn
(talos goclient in production, injectable function in tests). Classify nodes by machine.type
into controlplane/worker classes. Write seam-mc-{cluster}-{class} Secrets with sync-status:pending
label and SHA-256 hash. Create MachineConfigSync CRs with reason=import-initial-sync so conductor
injects the ONT-controlled node label via machineconfig-sync capability. Failure is non-fatal:
import proceeds to Ready=True even when MCSOT fails (e.g. node unreachable at import time).

Unit tests: node classification, both-classes from multi-endpoint, Secret idempotency,
MachineConfigSync CR idempotency, all-endpoints-fail non-fatal behavior.
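
The classification and Secret naming steps can be sketched as below. `MachineConfigSecretName` and the `seam-mc-{cluster}-{class}` pattern come from the commits; the signature, the `classifyNode` and `secretNamesFor` helpers, and the choice to skip unrecognized machine types are assumptions made for illustration.

```go
package main

import "fmt"

const (
	classControlPlane = "controlplane"
	classWorker       = "worker"
)

// classifyNode (hypothetical) maps a machineconfig machine.type value
// to one of the two node classes named in the commit message.
func classifyNode(machineType string) (string, error) {
	switch machineType {
	case "controlplane":
		return classControlPlane, nil
	case "worker":
		return classWorker, nil
	default:
		return "", fmt.Errorf("unrecognized machine.type %q", machineType)
	}
}

// MachineConfigSecretName follows the seam-mc-{cluster}-{class} pattern.
// The signature is assumed.
func MachineConfigSecretName(cluster, class string) string {
	return fmt.Sprintf("seam-mc-%s-%s", cluster, class)
}

// secretNamesFor returns one Secret name per class seen, in first-seen
// order. Unclassifiable nodes are skipped rather than failing the import.
func secretNamesFor(cluster string, machineTypes []string) []string {
	seen := map[string]bool{}
	var names []string
	for _, mt := range machineTypes {
		class, err := classifyNode(mt)
		if err != nil || seen[class] {
			continue
		}
		seen[class] = true
		names = append(names, MachineConfigSecretName(cluster, class))
	}
	return names
}

func main() {
	fmt.Println(secretNamesFor("ccs-mgmt", []string{"controlplane", "worker", "worker"}))
}
```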
…c on content change

Add Secret Watch in TalosClusterReconciler.SetupWithManager for machineconfig Secrets
labeled platform.ontai.dev/mc-class. Maps Secret events to TalosCluster reconcile requests
via LabelMachineConfigCluster label.

reconcileMachineConfigSync: detects when SHA-256(data.machineconfig) differs from
platform.ontai.dev/sync-hash label. On content change: patches Secret sync-status to
pending, deletes stale watch-triggered MachineConfigSync CR if present, creates new CR
named {cluster}-mc-sync-{class} with reason=secret-content-changed. No-op when hash
matches (avoids duplicate CRs alongside import-triggered {cluster}-mc-import-{class} CRs).

Unit tests: content-change creates CR, no-change does not create CR, stale CR replaced.
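
A minimal sketch of the content-change check, assuming the hash is a hex-encoded SHA-256 of the raw machineconfig bytes. The helper names are hypothetical. Note that a full hex SHA-256 digest is 64 characters while Kubernetes label values are limited to 63, so the real label encoding must differ in some way that the commit message does not show; this sketch simply compares full digests.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// machineConfigHash (hypothetical) hashes the raw machineconfig bytes.
func machineConfigHash(raw []byte) string {
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:])
}

// needsSync reports whether the Secret content no longer matches the
// hash recorded at the last sync.
func needsSync(raw []byte, recordedHash string) bool {
	return machineConfigHash(raw) != recordedHash
}

// syncCRName follows the {cluster}-mc-sync-{class} pattern from the commit.
func syncCRName(cluster, class string) string {
	return cluster + "-mc-sync-" + class
}

func main() {
	recorded := machineConfigHash([]byte("machine:\n  type: worker\n"))
	edited := []byte("machine:\n  type: worker\n  # edited\n")

	if needsSync(edited, recorded) {
		fmt.Println("create", syncCRName("ccs-mgmt", "worker"), "with reason=secret-content-changed")
	}
}
```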
TalosCluster.status.deletionStage records progress through the deletion cascade so a
reconciler restart can resume from the correct step. Stage is written before each step
(pack-execution, pack-installed, runner-config) and on completion. deletionStageReached()
uses index-based ordering for skip logic. advanceDeletionStage() treats NotFound as
success (object GC'd after all finalizers removed). 4 unit tests added.
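
One way to picture the index-based resume logic is sketched below. The three step names come from the commit message and `deletionStageReached` is named there, but its signature, the `complete` stage name, and the rule that the recorded step itself is re-run (because the stage is written before the step executes) are assumptions.

```go
package main

import "fmt"

// deletionStages lists the cascade in order. "complete" is an assumed
// name for the final stage written on completion.
var deletionStages = []string{"pack-execution", "pack-installed", "runner-config", "complete"}

// stageIndex returns the position of a stage, or -1 if it is unknown or empty.
func stageIndex(stage string) int {
	for i, s := range deletionStages {
		if s == stage {
			return i
		}
	}
	return -1
}

// deletionStageReached (signature assumed) reports whether the recorded
// stage is at or past the given stage in the cascade.
func deletionStageReached(recorded, stage string) bool {
	r, s := stageIndex(recorded), stageIndex(stage)
	return r >= 0 && s >= 0 && r >= s
}

// stepsToRun (hypothetical) returns the steps a restarted reconciler still
// has to execute. Steps before the recorded stage are skipped; the recorded
// step is re-run since it may not have finished.
func stepsToRun(recorded string) []string {
	var steps []string
	for _, step := range deletionStages[:len(deletionStages)-1] {
		if step != recorded && deletionStageReached(recorded, step) {
			continue
		}
		steps = append(steps, step)
	}
	return steps
}

func main() {
	fmt.Println(stepsToRun("pack-installed"))
}
```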
… day2 reconcilers

Adds retryCount/maxRetry fields to all five day2 CRD specs (MachineConfigSync,
UpgradePolicy, EtcdMaintenance, NodeMaintenance, NodeOperation). Retry logic
uses a deterministic job name encoding the retry count to avoid job naming
conflicts without requiring explicit deletion. Permanent failure sets
HumanInterventionRequired on TalosCluster.
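
The deterministic retry naming might look roughly like this. The `-retry-N` suffix format, both helper names, and the example base name are assumptions; the commit message only says the job name encodes the retry count and that exhausting retries sets HumanInterventionRequired.

```go
package main

import "fmt"

// retryJobName (hypothetical format) encodes the retry count in the Job
// name, so each attempt gets a distinct, predictable name and the previous
// Job does not need to be deleted first.
func retryJobName(base string, retryCount int) string {
	return fmt.Sprintf("%s-retry-%d", base, retryCount)
}

// shouldRetry reports whether another attempt is allowed. When it returns
// false the caller would mark the failure as permanent.
func shouldRetry(retryCount, maxRetry int) bool {
	return retryCount < maxRetry
}

func main() {
	const maxRetry = 2
	for retryCount := 0; ; retryCount++ {
		fmt.Println("submit", retryJobName("ccs-mgmt-etcd-maintenance", retryCount))
		if !shouldRetry(retryCount, maxRetry) {
			fmt.Println("set HumanInterventionRequired on TalosCluster")
			break
		}
	}
}
```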
Add ConditionTypeNodeInfrastructureReady to TalosCluster condition constants.
True when all nodes: machineconfig applied, ont-controlled label injected,
talosconfig endpoints current. Prerequisite for Kubernetes-layer B selections
(tenant conductor RuntimeDrift remediation gate). RECON-H2.
…ON-C9

Adds reconcileNodeRosterRefresh triggered by platform.ontai.dev/refresh-node-roster
annotation on a stable-Ready TalosCluster; re-reads live node roster via
ensureMachineConfigSecrets, marks vanished per-node secrets as 'decommissioned'
(audit preserved, INV-006), emits NodeRosterRefreshed Event, clears annotation.
5 unit tests pass.
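
The decommissioning step of the roster refresh can be sketched as a set difference. The `decommissioned` value comes from the commit message; the `markVanished` helper, the in-memory map standing in for per-node Secrets, and the `synced` status value are assumptions.

```go
package main

import (
	"fmt"
	"sort"
)

const statusDecommissioned = "decommissioned"

// markVanished (hypothetical) flags every known node that is missing from
// the live roster. Entries are relabeled rather than deleted, so the audit
// trail stays intact. It returns the newly marked nodes in sorted order.
func markVanished(secretStatus map[string]string, liveNodes []string) []string {
	live := make(map[string]struct{}, len(liveNodes))
	for _, n := range liveNodes {
		live[n] = struct{}{}
	}
	var marked []string
	for node, status := range secretStatus {
		if _, ok := live[node]; ok || status == statusDecommissioned {
			continue
		}
		secretStatus[node] = statusDecommissioned
		marked = append(marked, node)
	}
	sort.Strings(marked)
	return marked
}

func main() {
	// Node names and the "synced" status are placeholders.
	status := map[string]string{"cp-1": "synced", "w-1": "synced", "w-2": "synced"}
	fmt.Println(markVanished(status, []string{"cp-1", "w-1"}))
	fmt.Println(status["w-2"])
}
```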
… gzip compression, node-rollback op type + wipe field; fix hash-over-uncompressed bug in reconcileMachineConfigSync
…kubeconfig mount for scale-up

NodeOperationSpec gains TargetNodeIP (required for scale-up) and NodeRole
(controlplane|worker, defaults to worker). Rollback added to NodeOperationType
enum. nodeoperation_reconciler adds node-rollback capability mapping and
addKubeconfigMount call for scale-up jobs (RECON-J2 kubeconfig pattern).
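
A sketch of the defaulting and validation rules stated above. The field names `TargetNodeIP` and `NodeRole` come from the commit; the `Operation` field, the `scale-up` string value, and both methods are assumptions (in the real CRD these rules may be enforced by schema markers instead).

```go
package main

import (
	"errors"
	"fmt"
)

// NodeOperationSpec is a trimmed, assumed shape of the spec.
type NodeOperationSpec struct {
	Operation    string // assumed field; "scale-up" is an assumed enum value
	TargetNodeIP string
	NodeRole     string
}

// applyDefaults sets NodeRole to worker when it is left empty.
func (s *NodeOperationSpec) applyDefaults() {
	if s.NodeRole == "" {
		s.NodeRole = "worker"
	}
}

// validate enforces the role enum and the scale-up requirement.
func (s *NodeOperationSpec) validate() error {
	if s.NodeRole != "controlplane" && s.NodeRole != "worker" {
		return fmt.Errorf("nodeRole must be controlplane or worker, got %q", s.NodeRole)
	}
	if s.Operation == "scale-up" && s.TargetNodeIP == "" {
		return errors.New("targetNodeIP is required for scale-up")
	}
	return nil
}

func main() {
	spec := NodeOperationSpec{Operation: "scale-up"}
	spec.applyDefaults()
	fmt.Println(spec.NodeRole, spec.validate())

	spec.TargetNodeIP = "10.0.0.5" // placeholder address
	fmt.Println(spec.validate())
}
```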
ontave added 2 commits May 28, 2026 19:39
…r clean

Remove spec.capi.enabled and all CAPI dual-path reconcilers from platform and
conductor. Zero live usage (both lab clusters use direct bootstrap). This
eliminates ~1300 LOC of dual-path complexity across four day-2 reconcilers,
TalosClusterReconciler helpers, and compiler input structs.

Changes:
- api/seam/v1alpha1/taloscluster_types.go: remove CAPIConfig, CAPIControlPlaneConfig,
  CAPIWorkerConfig, CAPICiliumPackRefInput types and spec.capi field
- api/v1alpha1/: remove CAPI-specific condition constants from four day-2 types
- internal/controller/: collapse four dual-path reconcilers to direct/conductor path only;
  remove CAPI helper functions from taloscluster_helpers.go
- test/: delete four CAPI-only test files; update remaining tests to remove CAPI
  references (role=tenant replaces capi.enabled=true as the finalizer gate)

RECON-D1 Status: COMPLETE
@ontave ontave merged commit 59412f9 into main May 29, 2026
2 checks passed