fix+feat: session 25y-25z'''' -- health loop namespace fix, DriftSignal resolution, MachineConfig CRD migration, RECON tracks #45
Merged
Adds semaphore-bounded concurrent stream limit (D1), token-bucket admission rate limiter (D2), and Prometheus metrics (D4) to FederationServer. Env vars FEDERATION_MAX_CONCURRENT_STREAMS (default 50, range 1-1000) and FEDERATION_ADMISSION_RATE (default 5, must be >0) configure the pool at startup. ActiveStreamCount() exposes the current gauge value for health checks. Five unit tests cover: semaphore rejection at limit, concurrent admission up to limit, activeCount increment/decrement on connect/disconnect, and both env-var parser edge cases.
…art version drift detection
…ble (RECON-CMN2)

Adds extensions-maximum as a second Layer 1 PermissionSet in guardian-permissionsets.yaml. Covers CRD groups for all ten ONT extension operator categories (EXT-1 through EXT-10): external-secrets.io, kyverno.io, aquasecurity.github.io, velero.io, cost.grafana.com, monitoring.coreos.com, apiextensions.crossplane.io, pkg.crossplane.io.

Updates TestEnable_OnlyManagementMaximumPermissionSet -> TestEnable_BootstrapPermissionSetCount to expect 2 PermissionSet documents. Adds TestEnable_ExtensionsMaximumPermissionSet.
…ityDriftLoop, BackupHealthLoop

Add 4 management-only extension drift loops:
- ESOHealthLoop (RECON-K3): detects ExternalSecret sync failures (Ready=False / Synced=False), emits ExternalSecretSyncFailed DriftSignal. Skips when ESO CRDs not installed.
- PolicyReportDriftLoop (RECON-L2): detects Kyverno fail results in ClusterPolicyReport and PolicyReport CRs, emits KyvernoPolicyViolation DriftSignal. Skips when Kyverno CRDs not installed.
- VulnerabilityDriftLoop (RECON-M2): detects CRITICAL severity vulnerabilities in Trivy Operator VulnerabilityReport CRs, emits VulnerableImageDetected DriftSignal. Skips when Trivy CRDs not installed.
- BackupHealthLoop (RECON-N2): detects Velero BSL unavailability and RPO breaches (last successful backup older than 25h), emits BackupStorageUnavailable and BackupRPOBreached DriftSignals. Skips when Velero CRDs not installed.

All 4 loops:
- Support AutonomyLevel=observe-only gate (log-only when restricted).
- Have unit tests with fake dynamic client covering signal emit, confirm, and observe-only cases.
- Have e2e stubs with skip reasons referencing RECON backlog IDs.
- Are wired in kernel/agent.go (management role only, with ocWatcher).

Also adds unstructuredNestedSlice helper in eso_health_loop.go (package-private, used by all 4 loops for status condition parsing).
…k-deploy

Two bugs discovered during ccs-mgmt management cluster bootstrap:

1. capability_publisher.go: runnerConfigGVR used the pre-refactor API group infrastructure.ontai.dev/infrastructurerunnerconfigs instead of seam.ontai.dev/runnerconfigs. Conductor could not publish capabilities to RunnerConfig, blocking all PackExecution dispatch.
2. wrapper.go: applyParsedManifest and ensureNamespaces did not set Force: true on server-side apply PatchOptions. Phase C kubectl apply took field ownership on Namespace resources; conductor-exec pack-deploy Jobs then failed with field ownership conflicts on every re-apply.

Both fixes are required for PackDelivery to succeed on clusters that were bootstrapped with Phase C raw kubectl apply.
…able phase

Sources the URI from guardian-db-app secret so LineageController can archive LineageRecords to CNPG on root declaration deletion. seam-schema.md §4.
…onfiguration

Adds writeSeamDeclaringPrincipalWebhook() to phase 3 output. The generated MutatingWebhookConfiguration intercepts CREATE for talosclusters and packdeliveries and routes to the seam webhook server at /mutate-root-declaration-declaring-principal. This wires the declaring-principal stamping into cluster bootstrap so every new root declaration carries the requesting principal identity, enabling LineageController to populate spec.rootBinding.declaringPrincipal with a real actor rather than system:unknown.

Also removes stale seam.ontai.dev_runnerconfigs.yaml from conductor config/crd; RunnerConfig is owned by seam per Decision 13.
Add 4 watchdog remediation capability handlers (pod-restart, resource-patch, force-volume-detach, credential-refresh) triggered by RuntimeDrift DriftSignals. Replace the placeholder Kueue Job submission in RuntimeDriftHandler with real Job construction: capability selection from failureReason, execute image from RunnerConfig.spec.runnerImage, kubeconfig Secret mount, Kueue watchdog-queue LocalQueue admission in ont-system. Add watchdog-queue LocalQueue to compiler enable phase 05 output. 12 capability handler tests and 3 compiler tests cover the new functionality.
When MC_NODE_IP env var is set, machineconfig-sync capability targets only that node IP rather than all nodes from talosconfig. Supports machineconfig.yaml data key (compiler per-node secret format) as fallback alongside gzip-compressed machineconfig key.
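A minimal sketch of the node targeting and data key fallback described above, under stated assumptions: `targetNodes`, `machineConfigFromData`, and `gzipBytes` are hypothetical names, and the lookup order (gzip-compressed `machineconfig` first, then plain `machineconfig.yaml`) is one reading of "fallback" rather than confirmed behaviour.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"errors"
	"fmt"
	"io"
)

const (
	gzipKey = "machineconfig"      // gzip-compressed payload
	yamlKey = "machineconfig.yaml" // plain YAML, compiler per-node secret format
)

// targetNodes returns only nodeIP when it is set (as MC_NODE_IP would be), else all nodes.
func targetNodes(all []string, nodeIP string) []string {
	if nodeIP != "" {
		return []string{nodeIP}
	}
	return all
}

// machineConfigFromData prefers the gzip key and falls back to the plain YAML key.
func machineConfigFromData(data map[string][]byte) ([]byte, error) {
	if raw, ok := data[gzipKey]; ok {
		zr, err := gzip.NewReader(bytes.NewReader(raw))
		if err != nil {
			return nil, fmt.Errorf("open gzip payload: %w", err)
		}
		defer zr.Close()
		return io.ReadAll(zr)
	}
	if raw, ok := data[yamlKey]; ok {
		return raw, nil
	}
	return nil, errors.New("no machineconfig data key found")
}

// gzipBytes compresses b; used here only to build example input.
func gzipBytes(b []byte) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write(b)
	zw.Close()
	return buf.Bytes()
}

func main() {
	data := map[string][]byte{gzipKey: gzipBytes([]byte("version: v1alpha1\n"))}
	cfg, err := machineConfigFromData(data)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("config: %q\n", cfg)
	fmt.Println("targets:", targetNodes([]string{"10.0.0.1", "10.0.0.2"}, "10.0.0.2"))
}
```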
…, CG-9 dispatcher-runner for tenant, pre-existing API group in test

PLT-BUG-4: packinstance_pull_loop.go used pre-migration kind name InfrastructurePackReceipt; API server requires PackReceipt.

CG-8: buildOperatorDeployment now adds MGMT_KUBECONFIG_PATH env var and conductor-mgmt-kubeconfig volume to tenant conductor Deployments. Without this env var the gate in agent.go silently disables every drift loop that requires management cluster access (TalosVersionDrift, KubernetesVersionDrift, PackReceiptDrift, PackPodHealth, OperatorContext loops).

CG-9: writePhase5PostBootstrap now generates pack-deploy-queue.yaml and dispatcher-runner.yaml for all cluster roles including tenant. watchdog-queue.yaml remains management-cluster-only. The enable script applies these files to ccs-mgmt (where seam-tenant-ccs-dev resources live), not to ccs-dev.

Test: capability_publisher_test.go updated from pre-migration infrastructure.ontai.dev/infrastructurerunnerconfigs to current seam.ontai.dev/runnerconfigs. All unit tests pass.
Compiler now generates MachineConfig CRs (platform.ontai.dev/v1alpha1) instead of Secrets. Adds addnode subcommand for post-bootstrap node template generation. machineconfig-sync reads MachineConfig CRs via DynamicClient. Upgrade capability uses powercycle reboot and orders nodes ascending by spec.order from MachineConfig CRs.

- compiler bootstrap: buildMachineConfigCR replaces buildMachineConfigSecret; MachineConfig CR YAML output with typed fields (role, order, nodeIP, nodeHostname, clusterRef) and unstructured spec.machine/spec.cluster
- compiler addnode: new subcommand generates MachineConfig CR template with --existing-cr cloning or skeleton placeholder output
- machineconfig-sync: DynamicClient fetch of MachineConfig CR, reconstructMachineConfigYAML splits machine/cluster sections; machineConfigSyncDataKey constant extracted to shared file
- platform_upgrade: nodesFromMachineConfigCRs lists and sorts CRs ascending by spec.order; falls back to TalosClient.Nodes(); RebootPowercycle replaces Reboot for post-stage node cycling
- TalosNodeClient: RebootPowercycle added to interface; all stubs updated
- 24 new tests across addnode, machineconfig-sync, and upgrade packages
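The upgrade ordering can be pictured with a small sketch like the one below. It assumes a simplified `machineConfig` struct holding only the `order` and `nodeIP` fields, and a hypothetical `upgradeOrder` function with a fallback callback standing in for TalosClient.Nodes(); the real nodesFromMachineConfigCRs reads CRs through a DynamicClient, which is not reproduced here.

```go
package main

import (
	"fmt"
	"sort"
)

// machineConfig is a simplified stand-in for the typed fields of a MachineConfig CR spec.
type machineConfig struct {
	NodeIP string
	Order  int64
}

// upgradeOrder returns node IPs sorted ascending by Order.
// When no CRs are available it falls back to the supplied node lister.
func upgradeOrder(crs []machineConfig, fallback func() []string) []string {
	if len(crs) == 0 {
		return fallback()
	}
	sorted := append([]machineConfig(nil), crs...) // copy so the caller's slice is untouched
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].Order < sorted[j].Order })
	ips := make([]string, 0, len(sorted))
	for _, mc := range sorted {
		ips = append(ips, mc.NodeIP)
	}
	return ips
}

func main() {
	crs := []machineConfig{
		{NodeIP: "10.0.0.3", Order: 3},
		{NodeIP: "10.0.0.1", Order: 1},
		{NodeIP: "10.0.0.2", Order: 2},
	}
	fromTalosconfig := func() []string { return []string{"10.0.0.9"} }
	fmt.Println(upgradeOrder(crs, fromTalosconfig))
	fmt.Println(upgradeOrder(nil, fromTalosconfig))
}
```

A stable sort is used so that CRs sharing the same order value keep their listing order, which seemed the least surprising choice for a sketch.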
…sist header fields

Talos v1alpha1 machineconfig requires top-level version, debug, persist fields. Omitting them caused 'this config change can't be applied in immediate mode' when Talos diffed the incoming config against the running one.
…nDrift auto-resolution

identity.go: correct PrincipalRef from seam-system to ont-system (conductor runs in ont-system; seam-system caused DomainIdentityMismatch / Validated=False on SeamMembership seam-conductor).

cluster_node_health_loop.go: add resolveNodeRegistrationDrift() -- after each checkNodeRegistration pass, patches any DriftSignal of kind NodeRegistrationDrift to state=resolved for nodes that now carry the ont.platform.dev/controlled=true label. No DriftSignalController exists in seam; signals must be resolved by the component that detected them once the condition clears.

Tests: assertion on ont-system principalRef; three table-driven tests for resolve (label present, label absent, already resolved).
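The resolution pass can be sketched as a pure function over in-memory data. Everything here is illustrative: the `driftSignal` struct, the `"pending"` state name, and the `resolveDrift` function are assumptions for the example, and the real resolveNodeRegistrationDrift() patches DriftSignal objects through the API server rather than returning a slice.

```go
package main

import "fmt"

const controlledLabel = "ont.platform.dev/controlled"

// driftSignal is a simplified, hypothetical view of a DriftSignal object.
type driftSignal struct {
	Kind  string
	Node  string
	State string
}

// resolveDrift returns a copy of signals in which every unresolved NodeRegistrationDrift
// signal is marked resolved when its node now carries the controlled=true label.
func resolveDrift(signals []driftSignal, nodeLabels map[string]map[string]string) []driftSignal {
	out := make([]driftSignal, len(signals))
	copy(out, signals)
	for i := range out {
		s := &out[i]
		if s.Kind != "NodeRegistrationDrift" || s.State == "resolved" {
			continue
		}
		// Indexing a missing node yields a nil map, which safely reads as "".
		if nodeLabels[s.Node][controlledLabel] == "true" {
			s.State = "resolved"
		}
	}
	return out
}

func main() {
	signals := []driftSignal{
		{Kind: "NodeRegistrationDrift", Node: "cp-1", State: "pending"},
		{Kind: "NodeRegistrationDrift", Node: "w-1", State: "pending"},
	}
	labels := map[string]map[string]string{
		"cp-1": {controlledLabel: "true"},
		"w-1":  {},
	}
	for _, s := range resolveDrift(signals, labels) {
		fmt.Printf("%s on %s: %s\n", s.Kind, s.Node, s.State)
	}
}
```

The three branches map onto the table-driven cases named in the commit message: label present, label absent, and already resolved.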
…h loop

TalosCluster CRs live in seam-system, not in the operator namespace (ont-system). All health loop files were using l.namespace (ont-system) for TalosCluster patches, causing "not found" errors that silently dropped: NodeHealthSummary annotation/condition, HumanInterventionRequired status patches, DiskPressure condition, endpoint drift condition, etcd health annotation, PKI expiry reads, capacity saturation condition.

Fix: import namespaces package and use namespaces.SeamSystem in all six files (cluster_node_health_loop, cluster_pki_expiry, cluster_disk_pressure, cluster_endpoint_drift, cluster_etcd_health). Update test fixtures accordingly (makeTalosCluster / makeTalosClusterWithPKIExpiry namespace arg: seam-system).
Summary
- seam-system (was ont-system -- caused silent not-found on every health status write)
- system:serviceaccount:ont-system:conductor; NodeRegistrationDrift DriftSignals auto-resolve when controlled label present

Test plan
- go test ./internal/agent/... passes (health loop namespace fix verified by test suite)
- go test ./cmd/compiler/... passes (MachineConfig CR output format verified)
- go test ./internal/capability/... passes (machineconfig-sync CR path verified)

Generated with Claude Code