fix+feat: session 25y-25z'''' -- health loop namespace fix, DriftSignal resolution, MachineConfig CRD migration, RECON tracks #45

Merged: ontave merged 15 commits into main from feature/recon-cmn1-pack-source-version-loop on Jun 1, 2026

ontave commented Jun 1, 2026

Summary

  • Fix: all TalosCluster lookups in conductor health loop files now use seam-system (was ont-system -- caused silent not-found on every health status write)
  • Fix: SeamMembership principal corrected to system:serviceaccount:ont-system:conductor; NodeRegistrationDrift DriftSignals auto-resolve when controlled label present
  • Fix: machineconfig-sync reconstructed machineconfig YAML now includes version/debug/persist header fields
  • Fix: PLT-BUG-4 PackReceipt kind, CG-8 MGMT_KUBECONFIG_PATH env var, CG-9 dispatcher-runner emitted for tenant clusters
  • Feat: MachineConfig CRD migration Phases 3a/3b/4a/4b -- compiler emits MachineConfig CRs (not Secrets), capability reads CRs, upgrade uses order-based node iteration + powercycle reboot mode
  • Feat: ESOHealthLoop, PolicyReportDriftLoop, VulnerabilityDriftLoop, BackupHealthLoop added (RECON tracks)
  • Feat: PackSourceVersionLoop -- upstream Helm chart version drift detection (RECON-CMN1)
  • Feat: watchdog capabilities wired
  • Feat: federation ADR-F6 stream connection pool (RECON-F6)
  • Feat: compiler emits extensions-maximum PermissionSet (RECON-CMN2)

Test plan

  • go test ./internal/agent/... passes (health loop namespace fix verified by test suite)
  • go test ./cmd/compiler/... passes (MachineConfig CR output format verified)
  • go test ./internal/capability/... passes (machineconfig-sync CR path verified)
  • Live ccs-mgmt: all NodeRegistrationDrift DriftSignals show state=resolved
  • Live ccs-mgmt: TalosCluster ccs-mgmt health conditions populated within 1 check cycle post-restart

Generated with Claude Code

ontave added 15 commits May 29, 2026 10:08

Adds semaphore-bounded concurrent stream limit (D1), token-bucket
admission rate limiter (D2), and Prometheus metrics (D4) to
FederationServer. Env vars FEDERATION_MAX_CONCURRENT_STREAMS (default 50,
range 1-1000) and FEDERATION_ADMISSION_RATE (default 5, must be >0)
configure the pool at startup. ActiveStreamCount() exposes the current
gauge value for health checks.

Five unit tests cover: semaphore rejection at limit, concurrent admission
up to limit, activeCount increment/decrement on connect/disconnect, and
both env-var parser edge cases.
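
A minimal sketch of the semaphore-bounded admission described above, assuming a buffered channel as the semaphore. The names `streamPool`, `parseMaxStreams`, and `ErrStreamLimit` are hypothetical, and falling back to the default on invalid input is an assumption; the commit does not say how out-of-range values are handled. The token-bucket limiter (D2) and metrics (D4) are left out.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

const (
	defaultMaxStreams = 50   // default from the commit message
	minMaxStreams     = 1    // lower bound of the documented range
	maxMaxStreams     = 1000 // upper bound of the documented range
)

// ErrStreamLimit is returned when every slot is in use (hypothetical name).
var ErrStreamLimit = errors.New("federation: concurrent stream limit reached")

// parseMaxStreams parses a FEDERATION_MAX_CONCURRENT_STREAMS-style value.
// Assumption: empty, non-numeric, or out-of-range input yields the default.
func parseMaxStreams(raw string) int {
	n, err := strconv.Atoi(raw)
	if err != nil || n < minMaxStreams || n > maxMaxStreams {
		return defaultMaxStreams
	}
	return n
}

// streamPool bounds concurrent streams with a buffered channel used as a semaphore.
type streamPool struct {
	slots chan struct{}
}

func newStreamPool(limit int) *streamPool {
	return &streamPool{slots: make(chan struct{}, limit)}
}

// acquire takes a slot without blocking and rejects when the pool is full.
func (p *streamPool) acquire() error {
	select {
	case p.slots <- struct{}{}:
		return nil
	default:
		return ErrStreamLimit
	}
}

// release frees a slot; call it exactly once per successful acquire.
func (p *streamPool) release() { <-p.slots }

// activeStreamCount reports the number of slots currently held.
func (p *streamPool) activeStreamCount() int { return len(p.slots) }

func main() {
	pool := newStreamPool(parseMaxStreams("2"))
	fmt.Println(pool.acquire(), pool.acquire()) // both admitted
	fmt.Println(pool.acquire())                 // rejected at the limit
	pool.release()
	fmt.Println(pool.activeStreamCount())
}
```

Rejecting instead of blocking keeps a full pool from holding connection goroutines open; a blocking acquire would be the alternative if callers should queue.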
…ble (RECON-CMN2)

Adds extensions-maximum as a second Layer 1 PermissionSet in guardian-permissionsets.yaml.
Covers CRD groups for all ten ONT extension operator categories (EXT-1 through EXT-10):
external-secrets.io, kyverno.io, aquasecurity.github.io, velero.io, cost.grafana.com,
monitoring.coreos.com, apiextensions.crossplane.io, pkg.crossplane.io.

Updates TestEnable_OnlyManagementMaximumPermissionSet -> TestEnable_BootstrapPermissionSetCount
to expect 2 PermissionSet documents. Adds TestEnable_ExtensionsMaximumPermissionSet.
…ityDriftLoop, BackupHealthLoop

Add 4 management-only extension drift loops:

- ESOHealthLoop (RECON-K3): detects ExternalSecret sync failures (Ready=False /
  Synced=False), emits ExternalSecretSyncFailed DriftSignal. Skips when ESO
  CRDs not installed.

- PolicyReportDriftLoop (RECON-L2): detects Kyverno fail results in
  ClusterPolicyReport and PolicyReport CRs, emits KyvernoPolicyViolation
  DriftSignal. Skips when Kyverno CRDs not installed.

- VulnerabilityDriftLoop (RECON-M2): detects CRITICAL severity vulnerabilities
  in Trivy Operator VulnerabilityReport CRs, emits VulnerableImageDetected
  DriftSignal. Skips when Trivy CRDs not installed.

- BackupHealthLoop (RECON-N2): detects Velero BSL unavailability and RPO
  breaches (last successful backup older than 25h), emits BackupStorageUnavailable
  and BackupRPOBreached DriftSignals. Skips when Velero CRDs not installed.

All 4 loops:
- Support AutonomyLevel=observe-only gate (log-only when restricted).
- Have unit tests with fake dynamic client covering signal emit, confirm, and
  observe-only cases.
- Have e2e stubs with skip reasons referencing RECON backlog IDs.
- Are wired in kernel/agent.go (management role only, with ocWatcher).

Also adds unstructuredNestedSlice helper in eso_health_loop.go (package-private,
used by all 4 loops for status condition parsing).
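
The status condition parsing shared by the loops can be sketched roughly as follows, using plain maps in place of unstructured objects. `nestedSlice` and `conditionIsFalse` are hypothetical stand-ins, not the actual `unstructuredNestedSlice` helper, and the object layout (`status.conditions` entries with `type` and `status` keys) is assumed from the usual Kubernetes convention.

```go
package main

import "fmt"

// nestedSlice walks obj along fields and returns the slice stored at that path.
// The second return value is false when the path is missing or not a slice.
func nestedSlice(obj map[string]interface{}, fields ...string) ([]interface{}, bool) {
	var cur interface{} = obj
	for _, f := range fields {
		m, ok := cur.(map[string]interface{})
		if !ok {
			return nil, false
		}
		cur, ok = m[f]
		if !ok {
			return nil, false
		}
	}
	s, ok := cur.([]interface{})
	return s, ok
}

// conditionIsFalse reports whether status.conditions contains an entry with
// the given type whose status is "False".
func conditionIsFalse(obj map[string]interface{}, condType string) bool {
	conds, ok := nestedSlice(obj, "status", "conditions")
	if !ok {
		return false
	}
	for _, c := range conds {
		cm, ok := c.(map[string]interface{})
		if !ok {
			continue // skip malformed entries instead of failing the whole check
		}
		if cm["type"] == condType && cm["status"] == "False" {
			return true
		}
	}
	return false
}

func main() {
	// Shape of an ExternalSecret whose sync has failed (illustrative data).
	es := map[string]interface{}{
		"status": map[string]interface{}{
			"conditions": []interface{}{
				map[string]interface{}{"type": "Ready", "status": "False"},
			},
		},
	}
	fmt.Println(conditionIsFalse(es, "Ready"))
	fmt.Println(conditionIsFalse(es, "Synced"))
}
```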
…k-deploy

Two bugs discovered during ccs-mgmt management cluster bootstrap:

1. capability_publisher.go: runnerConfigGVR used the pre-refactor API
   group infrastructure.ontai.dev/infrastructurerunnerconfigs instead of
   seam.ontai.dev/runnerconfigs. Conductor could not publish capabilities
   to RunnerConfig, blocking all PackExecution dispatch.

2. wrapper.go: applyParsedManifest and ensureNamespaces did not set
   Force: true on server-side apply PatchOptions. Phase C kubectl apply
   took field ownership on Namespace resources; conductor-exec pack-deploy
   Jobs then failed with field ownership conflicts on every re-apply.

Both fixes are required for PackDelivery to succeed on clusters that
were bootstrapped with Phase C raw kubectl apply.
…able phase

Sources the URI from the guardian-db-app secret so LineageController can archive
LineageRecords to CNPG on root declaration deletion. See seam-schema.md §4.
…onfiguration

Adds writeSeamDeclaringPrincipalWebhook() to phase 3 output. The generated
MutatingWebhookConfiguration intercepts CREATE for talosclusters and
packdeliveries and routes to the seam webhook server at
/mutate-root-declaration-declaring-principal.

This wires the declaring-principal stamping into cluster bootstrap so every
new root declaration carries the requesting principal identity, enabling
LineageController to populate spec.rootBinding.declaringPrincipal with a
real actor rather than system:unknown.

Also removes stale seam.ontai.dev_runnerconfigs.yaml from conductor config/crd;
RunnerConfig is owned by seam per Decision 13.

Add 4 watchdog remediation capability handlers (pod-restart, resource-patch,
force-volume-detach, credential-refresh) triggered by RuntimeDrift DriftSignals.

Replace the placeholder Kueue Job submission in RuntimeDriftHandler with real
Job construction: capability selection from failureReason, execute image from
RunnerConfig.spec.runnerImage, kubeconfig Secret mount, Kueue watchdog-queue
LocalQueue admission in ont-system.

Add watchdog-queue LocalQueue to compiler enable phase 05 output. 12 capability
handler tests and 3 compiler tests cover the new functionality.

When MC_NODE_IP env var is set, machineconfig-sync capability targets only that node IP
rather than all nodes from talosconfig. Supports machineconfig.yaml data key (compiler
per-node secret format) as fallback alongside gzip-compressed machineconfig key.
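
A rough sketch of the two behaviours above: single-node targeting via MC_NODE_IP and the data key fallback. The helper names (`targetNodes`, `readMachineConfig`, `gzipBytes`) are hypothetical, and the lookup order (gzip key first, then machineconfig.yaml) is an assumption based on the word "fallback".

```go
package main

import (
	"bytes"
	"compress/gzip"
	"errors"
	"fmt"
	"io"
	"os"
)

const (
	gzipKey  = "machineconfig"      // gzip-compressed payload
	plainKey = "machineconfig.yaml" // compiler per-node format, uncompressed
)

// targetNodes narrows the node list to nodeIP when it is set.
func targetNodes(nodeIP string, all []string) []string {
	if nodeIP != "" {
		return []string{nodeIP}
	}
	return all
}

// readMachineConfig returns the machineconfig bytes from a Secret-style data map.
// Assumption: the gzip key wins when both keys are present.
func readMachineConfig(data map[string][]byte) ([]byte, error) {
	if raw, ok := data[gzipKey]; ok {
		zr, err := gzip.NewReader(bytes.NewReader(raw))
		if err != nil {
			return nil, fmt.Errorf("open gzip payload: %w", err)
		}
		defer zr.Close()
		return io.ReadAll(zr)
	}
	if raw, ok := data[plainKey]; ok {
		return raw, nil
	}
	return nil, errors.New("no machineconfig key in data")
}

// gzipBytes compresses b; writes to a bytes.Buffer do not fail, so errors are ignored.
func gzipBytes(b []byte) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write(b)
	zw.Close()
	return buf.Bytes()
}

func main() {
	nodes := targetNodes(os.Getenv("MC_NODE_IP"), []string{"10.0.0.1", "10.0.0.2"})
	fmt.Println(nodes)

	cfg, err := readMachineConfig(map[string][]byte{plainKey: []byte("machine: {}\n")})
	fmt.Printf("%q %v\n", cfg, err)
}
```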
…, CG-9 dispatcher-runner for tenant, pre-existing API group in test

PLT-BUG-4: packinstance_pull_loop.go used pre-migration kind name
InfrastructurePackReceipt; API server requires PackReceipt.

CG-8: buildOperatorDeployment now adds MGMT_KUBECONFIG_PATH env var and
conductor-mgmt-kubeconfig volume to tenant conductor Deployments. Without
this env var the gate in agent.go silently disables every drift loop that
requires management cluster access (TalosVersionDrift, KubernetesVersionDrift,
PackReceiptDrift, PackPodHealth, OperatorContext loops).

CG-9: writePhase5PostBootstrap now generates pack-deploy-queue.yaml and
dispatcher-runner.yaml for all cluster roles including tenant. watchdog-queue.yaml
remains management-cluster-only. The enable script applies these files to
ccs-mgmt (where seam-tenant-ccs-dev resources live), not to ccs-dev.

Test: capability_publisher_test.go updated from pre-migration infrastructure.ontai.dev/
infrastructurerunnerconfigs to current seam.ontai.dev/runnerconfigs. All unit
tests pass.

Compiler now generates MachineConfig CRs (platform.ontai.dev/v1alpha1)
instead of Secrets. Adds addnode subcommand for post-bootstrap node
template generation. machineconfig-sync reads MachineConfig CRs via
DynamicClient. Upgrade capability uses powercycle reboot and orders
nodes ascending by spec.order from MachineConfig CRs.

- compiler bootstrap: buildMachineConfigCR replaces buildMachineConfigSecret;
  MachineConfig CR YAML output with typed fields (role, order, nodeIP,
  nodeHostname, clusterRef) and unstructured spec.machine/spec.cluster
- compiler addnode: new subcommand generates MachineConfig CR template
  with --existing-cr cloning or skeleton placeholder output
- machineconfig-sync: DynamicClient fetch of MachineConfig CR,
  reconstructMachineConfigYAML splits machine/cluster sections;
  machineConfigSyncDataKey constant extracted to shared file
- platform_upgrade: nodesFromMachineConfigCRs lists and sorts CRs
  ascending by spec.order; falls back to TalosClient.Nodes();
  RebootPowercycle replaces Reboot for post-stage node cycling
- TalosNodeClient: RebootPowercycle added to interface; all stubs updated
- 24 new tests across addnode, machineconfig-sync, and upgrade packages
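
The order-based node iteration might look roughly like this. The `machineConfig` struct and `upgradeOrder` function are hypothetical simplifications of the CR and of nodesFromMachineConfigCRs; falling back only when no CRs are found is an assumption, since the commit does not state the fallback condition.

```go
package main

import (
	"fmt"
	"sort"
)

// machineConfig is a trimmed, hypothetical view of a MachineConfig CR.
type machineConfig struct {
	NodeIP string // spec.nodeIP
	Order  int    // spec.order
}

// upgradeOrder returns node IPs sorted ascending by Order.
// Assumption: the fallback (for example the Talos node list) is used only
// when no CRs exist.
func upgradeOrder(crs []machineConfig, fallback func() []string) []string {
	if len(crs) == 0 {
		return fallback()
	}
	// Copy before sorting so the caller's slice is left untouched.
	sorted := append([]machineConfig(nil), crs...)
	// Stable sort keeps list order for CRs that share the same Order value.
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Order < sorted[j].Order
	})
	ips := make([]string, 0, len(sorted))
	for _, mc := range sorted {
		ips = append(ips, mc.NodeIP)
	}
	return ips
}

func main() {
	crs := []machineConfig{
		{NodeIP: "10.0.0.3", Order: 3},
		{NodeIP: "10.0.0.1", Order: 1},
		{NodeIP: "10.0.0.2", Order: 2},
	}
	fmt.Println(upgradeOrder(crs, nil))
	fmt.Println(upgradeOrder(nil, func() []string { return []string{"10.0.0.9"} }))
}
```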
…sist header fields

Talos v1alpha1 machineconfig requires top-level version, debug, persist fields.
Omitting them caused 'this config change can't be applied in immediate mode'
when Talos diffed the incoming config against the running one.
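
As an illustration of the fix, the reconstruction step could prepend the header fields before the machine and cluster sections. This is a sketch only: `withHeader` is a hypothetical name, the sections are taken as pre-rendered YAML strings, and the header values shown (v1alpha1, false, true) are assumed typical values, not taken from this change.

```go
package main

import (
	"fmt"
	"strings"
)

// indent shifts every non-empty line of a YAML fragment two spaces right.
func indent(block string) string {
	lines := strings.Split(strings.TrimRight(block, "\n"), "\n")
	for i, l := range lines {
		if l != "" {
			lines[i] = "  " + l
		}
	}
	return strings.Join(lines, "\n") + "\n"
}

// withHeader assembles a machineconfig document from its machine and cluster
// sections, emitting the top-level header fields first.
func withHeader(machine, cluster string) string {
	var b strings.Builder
	b.WriteString("version: v1alpha1\n") // assumed value
	b.WriteString("debug: false\n")      // assumed value
	b.WriteString("persist: true\n")     // assumed value
	b.WriteString("machine:\n" + indent(machine))
	b.WriteString("cluster:\n" + indent(cluster))
	return b.String()
}

func main() {
	fmt.Print(withHeader("type: controlplane\n", "clusterName: demo\n"))
}
```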
…nDrift auto-resolution

identity.go: correct PrincipalRef from seam-system to ont-system (conductor runs in ont-system;
seam-system caused DomainIdentityMismatch / Validated=False on SeamMembership seam-conductor).

cluster_node_health_loop.go: add resolveNodeRegistrationDrift() -- after each checkNodeRegistration
pass, patches any DriftSignal of kind NodeRegistrationDrift to state=resolved for nodes that now
carry the ont.platform.dev/controlled=true label. No DriftSignalController exists in seam; signals
must be resolved by the component that detected them once the condition clears.

Tests: assertion on ont-system principalRef; three table-driven tests for resolve (label present,
label absent, already resolved).
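
The resolve pass can be sketched as a pure function over in-memory data. `driftSignal` and `resolveSignals` are hypothetical simplifications (the real code patches DriftSignal objects through the API server), and the "pending" state in the example is an assumed placeholder for any unresolved state.

```go
package main

import "fmt"

const controlledLabel = "ont.platform.dev/controlled"

// driftSignal is a trimmed, hypothetical view of a DriftSignal.
type driftSignal struct {
	Kind  string
	Node  string
	State string
}

// resolveSignals marks NodeRegistrationDrift signals as resolved when their
// node now carries controlled=true. It returns how many signals changed.
func resolveSignals(signals []driftSignal, nodeLabels map[string]map[string]string) int {
	resolved := 0
	for i := range signals {
		s := &signals[i]
		if s.Kind != "NodeRegistrationDrift" || s.State == "resolved" {
			continue // other kinds and already-resolved signals are left alone
		}
		// Indexing a missing node yields a nil map, which reads as "label absent".
		if nodeLabels[s.Node][controlledLabel] == "true" {
			s.State = "resolved"
			resolved++
		}
	}
	return resolved
}

func main() {
	signals := []driftSignal{
		{Kind: "NodeRegistrationDrift", Node: "node-a", State: "pending"},
		{Kind: "NodeRegistrationDrift", Node: "node-b", State: "pending"},
	}
	labels := map[string]map[string]string{
		"node-a": {controlledLabel: "true"},
	}
	fmt.Println(resolveSignals(signals, labels), signals)
}
```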
…h loop

TalosCluster CRs live in seam-system, not in the operator namespace (ont-system).
All health loop files were using l.namespace (ont-system) for TalosCluster patches,
causing "not found" errors that silently dropped: NodeHealthSummary annotation/condition,
HumanInterventionRequired status patches, DiskPressure condition, endpoint drift condition,
etcd health annotation, PKI expiry reads, capacity saturation condition.

Fix: import the namespaces package and use namespaces.SeamSystem in the affected health loop files
(cluster_node_health_loop, cluster_pki_expiry, cluster_disk_pressure,
cluster_endpoint_drift, cluster_etcd_health). Update test fixtures accordingly
(makeTalosCluster / makeTalosClusterWithPKIExpiry namespace arg: seam-system).
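
To illustrate the failure mode, here is a small sketch with an in-memory store standing in for the API server. `clusterStore` and its `get` method are hypothetical; only the two namespace names come from the commit message. The point is that the lookup namespace must be where the TalosCluster lives, not where the operator runs, and that the not-found error needs to be surfaced.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	seamSystem = "seam-system" // where TalosCluster CRs live
	ontSystem  = "ont-system"  // where the operator runs
)

var errNotFound = errors.New("taloscluster not found")

// clusterStore is a hypothetical stand-in for the API server, keyed by namespace/name.
type clusterStore map[string]struct{}

// get returns errNotFound (wrapped) when no object exists under namespace/name.
func (s clusterStore) get(namespace, name string) error {
	if _, ok := s[namespace+"/"+name]; !ok {
		return fmt.Errorf("%s/%s: %w", namespace, name, errNotFound)
	}
	return nil
}

func main() {
	store := clusterStore{seamSystem + "/ccs-mgmt": {}}

	// Looking in the operator namespace misses the object.
	fmt.Println(store.get(ontSystem, "ccs-mgmt"))

	// Looking in seam-system finds it.
	fmt.Println(store.get(seamSystem, "ccs-mgmt"))
}
```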
ontave merged commit 40c5965 into main on Jun 1, 2026
1 check failed