feat(governor): Lane H PR-2 — TOML policy file loader + validator by joelteply · Pull Request #1350 · CambrianTech/continuum

joelteply · 2026-05-16T23:04:39Z

Re-opens #1349 which GitHub auto-closed when its base (#1345 feat/substrate-governor-pr1-types) was deleted at merge time. Rebased onto canary cleanly (skipped the now-already-merged PR-1 commit).

Summary

Lane H PR-2 per GENOME-FOUNDRY-SENTINEL #1327 Part 11 'Policy File Format'. PR-1 (#1345, merged) shipped the published GovernorPolicy shape. This PR-2 reads a TOML file matching the spec's schema and converts it to a GovernorPolicy. PR-3 wires the file watcher + cascade state machine.

What ships

src/workers/continuum-core/src/governor/policy_file.rs:

PolicyFile + file-format sibling structs (TierSizesFile, CadenceMultipliersFile, etc.) — snake_case for TOML idiom, separate from camelCase wire-format types in types.rs.
parse_policy_text(text) — pure parser, testable with embedded TOML strings.
load_policy_file(path) — thin file-opener.
validate() — enforces semantic invariants: recall_weights sum to 1.0 within 1% tolerance; tier_sizes all > 0; cadence_multipliers >= 1.0.
into_governor_policy(file, hw_class, ts) — composes file + HardwareClass + timestamp into the published GovernorPolicy.
PolicyFileError typed enum (Io / Toml / RecallWeightsImbalanced / InvalidTierSize / InvalidCadenceMultiplier).

Failure-mode discipline

Imbalanced weights → typed err with sum named (not silently rescaled).
Zero tier_size → typed err per-field.
Cadence < 1.0 → typed err (almost certainly a typo).
TOML/IO errors propagate typed.
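
The validation rules and typed errors above can be sketched roughly as follows. This is a minimal std-only illustration, not the code in policy_file.rs: `PolicySketch`, `PolicySketchError`, the `(field, value)` pair layout, the numeric types, and the order of the checks are all assumptions made for the example, and the Io / Toml variants are left out so the block needs no external crates.

```rust
use std::fmt;

/// Tolerance named in the PR text (1%).
pub const RECALL_WEIGHTS_TOLERANCE: f64 = 0.01;

/// Hypothetical, trimmed stand-in for PolicyFile. The real struct is
/// deserialized from TOML and has more sections.
#[derive(Debug, Clone)]
pub struct PolicySketch {
    /// (field name, value) pairs standing in for the tier_sizes table.
    pub tier_sizes: Vec<(&'static str, u64)>,
    /// (field name, value) pairs standing in for cadence_multipliers.
    pub cadence_multipliers: Vec<(&'static str, f64)>,
    pub recall_weights: Vec<f64>,
}

/// Typed validation errors mirroring the variant names in the PR text.
#[derive(Debug, Clone, PartialEq)]
pub enum PolicySketchError {
    RecallWeightsImbalanced { sum: f64, tolerance: f64 },
    InvalidTierSize { field: &'static str, value: u64 },
    InvalidCadenceMultiplier { field: &'static str, value: f64 },
}

impl fmt::Display for PolicySketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::RecallWeightsImbalanced { sum, tolerance } => write!(
                f,
                "recall_weights sum to {}, expected 1.0 within {}",
                sum, tolerance
            ),
            Self::InvalidTierSize { field, value } => {
                write!(f, "tier_sizes.{} = {}, must be > 0", field, value)
            }
            Self::InvalidCadenceMultiplier { field, value } => {
                write!(f, "cadence_multipliers.{} = {}, must be >= 1.0", field, value)
            }
        }
    }
}

impl std::error::Error for PolicySketchError {}

impl PolicySketch {
    /// Reject the policy instead of silently repairing it, so the operator
    /// sees exactly which value was wrong.
    pub fn validate(&self) -> Result<(), PolicySketchError> {
        let sum: f64 = self.recall_weights.iter().sum();
        if (sum - 1.0).abs() > RECALL_WEIGHTS_TOLERANCE {
            return Err(PolicySketchError::RecallWeightsImbalanced {
                sum,
                tolerance: RECALL_WEIGHTS_TOLERANCE,
            });
        }
        for &(field, value) in &self.tier_sizes {
            if value == 0 {
                return Err(PolicySketchError::InvalidTierSize { field, value });
            }
        }
        for &(field, value) in &self.cadence_multipliers {
            // NaN is rejected too, since it cannot satisfy >= 1.0.
            if value.is_nan() || value < 1.0 {
                return Err(PolicySketchError::InvalidCadenceMultiplier { field, value });
            }
        }
        Ok(())
    }
}
```

Returning the offending field and value in the error, rather than rescaling or clamping, is what lets the boundary cases (exact 1.0 sum, exact 1.0 cadence) be tested as accepted while anything outside is reported by name.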

Tests

17 passing on cargo test --lib --features metal,accelerate governor::policy_file:: (+ 36 PR-1 tests = 53 total in governor::).

Canonical M-Air + Blackwell 5090 policies from spec both parse + validate
Every validation rule with field-named error
Boundary tests: exact 1.0 recall sum, exact 1.0 cadence
Full pipeline: hw_probe → classify_hardware → parse_policy_text → into_governor_policy
I/O smoke + invalid TOML + nonexistent path

Stack

feat(governor): Lane H PR-1 — governor-types + classify_hardware bridge from hw_probe #1345 governor PR-1 (MERGED at e091973 via codex's conflict-resolve)
This PR (PR-2): TOML loader + validator
Future PR-3: file watcher (notify) + policy selection + cascade state machine + LocalSubstrateGovernor reference impl + arc_swap publish
Future PR-4: PressureBroker → governor wiring

Per GENOME-FOUNDRY-SENTINEL #1327 Part 11 'Policy File Format'. Stacks on #1345 (PR-1 governor-types).

What ships in src/workers/continuum-core/src/governor/policy_file.rs:
- PolicyFile + file-format sibling structs (TierSizesFile, CadenceMultipliersFile, ConcurrencyCapsFile, FederationCadenceFile, RecallScoreWeightsFile, SpeculationFileSection, ConsolidationFileSection): snake_case for TOML idiom, separate from wire-format camelCase types in types.rs
- parse_policy_text(text): pure parser (no I/O), testable with embedded TOML strings
- load_policy_file(path): thin file-opener wrapping parse_policy_text
- validate(): enforces semantic invariants:
  * recall_weights sum to 1.0 within RECALL_WEIGHTS_TOLERANCE (0.01)
  * tier_sizes all > 0 (zero would disable a tier; not supported)
  * cadence_multipliers >= 1.0 (< 1.0 would speed up cadence; typo)
- into_governor_policy(file, hw_class, ts): composes file + caller-supplied HardwareClass + timestamp into the published GovernorPolicy
- PolicyFileError typed enum with Display + Error + From for io::Error + toml::de::Error

Failure-mode discipline:
- imbalanced recall_weights returns RecallWeightsImbalanced { sum, tolerance }, not silently rescaled. Operator sees what they typed.
- zero tier_size returns InvalidTierSize { field, value } per-field.
- cadence_multiplier < 1.0 returns InvalidCadenceMultiplier { field, value }.
- TOML syntax errors propagate as PolicyFileError::Toml.
- Missing file returns PolicyFileError::Io with the path named.

Tests: 17 passing on cargo test --lib --features metal,accelerate governor::policy_file::
- canonical M-Air policy parses + validates (from spec)
- canonical Blackwell 5090 policy parses + validates (same schema, larger numbers; pins scaling)
- imbalanced recall_weights rejected (with sum named)
- exact-1.0 recall_weights accepted (boundary)
- zero l1_lora_layers rejected (with field named)
- zero any tier_size rejected (loop over all fields)
- cadence_multiplier < 1.0 rejected (with field + value)
- cadence_multiplier = 1.0 accepted (boundary)
- into_governor_policy composes correctly with hw_class
- load_policy_file reads valid file (I/O smoke)
- load_policy_file nonexistent → Io err
- load_policy_file invalid TOML → Toml err
- PolicyFileError Display + Error trait
- From<io::Error> + From<toml::de::Error>
- SpeculationLevel kebab-case strings parse (off/conservative/balanced/aggressive)
- ConsolidationSchedule kebab-case strings parse (always/idle/idle-plugged-in/manual)
- full pipeline: hw_probe → classify_hardware → parse_policy_text → into_governor_policy

Stack:
- #1335 hw_probe (MERGED)
- #1345 PR-1 governor-types (OPEN)
- This PR (PR-2): TOML loader + validator
- Future PR-3: file watcher (notify crate) + policy selection by HardwareClass fingerprint + cascade state machine + LocalSubstrateGovernor reference impl + arc_swap publish
- Future PR-4: PressureBroker → governor wiring

VDD evidence: N/A (pure parser + validator). Evidence with PR-3 when governor reads policy in production.

joelteply · 2026-05-16T23:07:36Z

Rebased onto current canary after #1345/#1348 landed. Note: #1345 is now merged; this PR is no longer stacked on an open PR. I also tightened PR-2 before reopening: removed stale/fallback wording from the loader docs, updated the governor module comments, and replaced bare test unwraps with explicit expectations in the owned test file.

Local proof on rebased branch:
- cargo test --lib --features metal,accelerate governor:: => 53 passed
- precommit for cleanup commit: TypeScript passed, clippy baseline held at 148, browser tests skipped because local jtag/core socket prerequisites were down
- pre-push: TypeScript clean, ESLint ratchet held, Rust compile clean, Rust tests passed, native arm64 image push completed

Stacks on #1352 (codex's PR-3a policy_selector, MERGED). Per GENOME-FOUNDRY-SENTINEL #1327 Part 11.

LocalSubstrateGovernor is the reference impl of the SubstrateGovernor trait (from #1345 PR-1). Holds the live policy behind arc_swap for wait-free reads; mutex-protected snapshot history for telemetry.

What ships in src/workers/continuum-core/src/governor/local.rs:
- LocalSubstrateGovernor struct: Arc<ArcSwap<GovernorPolicy>> for policy + Mutex<SnapshotState> for cascade-transition-count + recent-signals ring
- new(initial_policy) constructor: ready to serve current_policy() immediately
- set_candidates(Vec<PolicyFile>): file watcher (PR-3d) will call this on fs change events; for PR-3b, set manually
- try_hardware_detected(hw) → Result<(), PolicySelectionError>: fallible variant for callers that want the typed error
- on_hardware_detected(hw): trait method, swallows errors per spec (logs/telemetry surface them separately)
- on_pressure_signal(signal): records into ring (PR-3c adds threshold + cascade logic; PR-3b only records)
- snapshot() → GovernorSnapshot: telemetry consumer reads this
- candidate_count(): diagnostic for 'did the file watcher load anything?'

Concurrency model (matches spec's 'never blocks reads'):
- Reads: arc_swap.load_full() → Arc<GovernorPolicy> clone (wait-free)
- Writes: arc_swap.store(Arc::new(new_policy)) + mutex on snapshot state for transition-count bump (~µs hold)
- Tests prove the wait-free guarantee: many_concurrent_reads_dont_block + concurrent_read_during_write_sees_consistent_snapshot

What this PR DOES NOT do:
- Cascade state machine + threshold/hysteresis (PR-3c)
- File watcher / hot reload (PR-3d)
- PressureBroker subscription wiring (PR-4)
- Built-in default policy fallback (caller handles NoMatchingPolicy)

Failure-mode discipline:
- on_hardware_detected with no matching candidate KEEPS previous policy (trait swallows error per spec; operator monitors via snapshot.cascade_transition_count, which stays unchanged on Err)
- on_hardware_detected with empty candidates is a no-op (first-boot before file watcher loads anything; governor still serves initial_policy)
- cascade_transition_count increments per PUBLISH, not per call; failed selections don't count
- on_pressure_signal does NOT bump cascade_transition_count in PR-3b (test pins this so PR-3c lands the threshold logic together)

Tests: 16 passing on cargo test --lib --features metal,accelerate governor::local:: (79 total governor:: across PR-1/PR-2/PR-3a/PR-3b)
- new() serves initial policy immediately
- candidate_count reflects set_candidates
- on_hardware_detected publishes matching policy
- try_hardware_detected returns NoMatchingPolicy err
- on_hardware_detected no-match KEEPS previous policy
- on_hardware_detected empty candidates no-op
- Successive hardware_detected publishes multiple times
- on_pressure_signal records signal
- recent_signals ring capped at RECENT_SIGNALS_CAPACITY=32 (FIFO eviction)
- snapshot includes policy + signals
- cascade_transition_count increments per publish
- cascade_transition_count UNCHANGED on no-match
- on_pressure_signal does NOT transition in PR-3b (PR-3c adds it)
- many_concurrent_reads_dont_block (Arc<Self> + 16 threads × 1000 reads each)
- concurrent_read_during_write_sees_consistent_snapshot (writer mutates + reader observes Arc snapshots that are always one of {1, 2, 8}; no torn read)
- current_policy returns same Arc when no writes (Arc::ptr_eq)

Added deps: arc-swap = '1.7' (tiny crate, no transitive deps).

Coordination: ceded my own PR-3a (#1351 closed) in favor of codex's #1352 which has stricter AmbiguousPolicy refusal + hardware_fingerprint diagnostic surface. This PR-3b rebased onto codex's policy_selector API (arg order: select_policy(policies, hw), not (hw, policies)) + imports updated.

Stack:
- #1335 hw_probe (MERGED)
- #1345 PR-1 governor-types (MERGED)
- #1350 PR-2 TOML loader (MERGED)
- #1352 PR-3a policy_selector (codex's, MERGED)
- This PR (PR-3b): LocalSubstrateGovernor + arc_swap publish
- Future PR-3c: cascade state machine + hysteresis (5 steps; restore-speculation-one-step-later anti-oscillation rule per spec)
- Future PR-3d: file watcher (notify crate)
- Future PR-4: PressureBroker → governor wiring

VDD evidence: N/A (pure-state impl). Evidence with PR-3c when the cascade is wired + with PR-4 when actual pressure signals flow.

Co-authored-by: Test <test@test.com>
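
To make the publish/read split and the capped signal ring concrete, here is a std-only sketch. It deliberately swaps `ArcSwap` for `RwLock<Arc<_>>` so the block compiles without the arc-swap crate; that substitution is not wait-free, so it only illustrates the shape (readers hold an `Arc` snapshot, writers replace the pointer), not the guarantee the PR describes. `GovernorSketch`, `PolicySketch`, `publish`, and the `f32` signal type are hypothetical names chosen for the example.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex, RwLock};

/// Ring capacity named in the PR's test list.
pub const RECENT_SIGNALS_CAPACITY: usize = 32;

/// Hypothetical stand-in for GovernorPolicy.
#[derive(Debug, Clone, PartialEq)]
pub struct PolicySketch {
    pub policy_version: u64,
}

#[derive(Default)]
struct SnapshotState {
    cascade_transition_count: u64,
    /// Signal type is an assumption; the real ring stores pressure signals.
    recent_signals: VecDeque<f32>,
}

pub struct GovernorSketch {
    /// Stand-in for Arc<ArcSwap<GovernorPolicy>>. RwLock can block readers
    /// briefly during a write, which ArcSwap is used to avoid.
    policy: RwLock<Arc<PolicySketch>>,
    snapshot: Mutex<SnapshotState>,
}

impl GovernorSketch {
    pub fn new(initial: PolicySketch) -> Self {
        Self {
            policy: RwLock::new(Arc::new(initial)),
            snapshot: Mutex::new(SnapshotState::default()),
        }
    }

    /// Readers get a cheap Arc clone; an old snapshot stays valid and
    /// internally consistent even after a later publish.
    pub fn current_policy(&self) -> Arc<PolicySketch> {
        let guard = self.policy.read().unwrap();
        Arc::clone(&*guard)
    }

    /// Replace the whole policy, then count the transition. The count moves
    /// per publish, never per failed selection.
    pub fn publish(&self, next: PolicySketch) {
        *self.policy.write().unwrap() = Arc::new(next);
        self.snapshot.lock().unwrap().cascade_transition_count += 1;
    }

    /// Record a signal, evicting the oldest entry once the ring is full.
    pub fn on_pressure_signal(&self, signal: f32) {
        let mut state = self.snapshot.lock().unwrap();
        if state.recent_signals.len() == RECENT_SIGNALS_CAPACITY {
            state.recent_signals.pop_front();
        }
        state.recent_signals.push_back(signal);
    }

    pub fn transition_count(&self) -> u64 {
        self.snapshot.lock().unwrap().cascade_transition_count
    }

    pub fn recent_signals(&self) -> Vec<f32> {
        self.snapshot.lock().unwrap().recent_signals.iter().copied().collect()
    }
}
```

Because a publish swaps in a freshly allocated `Arc` rather than mutating the policy in place, a reader never observes a half-updated policy; that is the property the consistent-snapshot test in the list above pins.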

Stacks on canary post-#1360 merge. PR-3c2 wired cascade evaluator into on_pressure_signal to update cascade_step. This PR-3c3 ships apply_cascade_step_to_policy, the pure function that ACTUALLY transforms tier_sizes/cadence/concurrency/speculation/federation/consolidation per the cascade step.

Per spec §'Adjustment Cascade' table:
- Step 0: unchanged (normal operation)
- Step 1: speculation_aggressiveness drops one notch toward Off (Aggressive → Balanced → Conservative → Off → Off)
- Step 2: cumulative + personas_concurrent -= 1 (floor 1) + defer non-realtime (cadence_multipliers.delayed/.background = max(current, 2.0))
- Step 3: cumulative + tier_sizes.l1_lora_layers + l1_kv_tokens shrunk to 75% (floor 1)
- Step 4: cumulative + federation_pull_cadence.pull_cadence_seconds = MAX_FEDERATION_PULL_CADENCE_SECONDS (3600s = once-per-hour)
- Step 5: cumulative + consolidation_schedule = Manual (operator must explicitly trigger; substrate stops on its own under max pressure)

Transformations are CUMULATIVE: step N includes all transformations from steps 1..N. Caller passes BASE policy (cascade_step=0) and step; function returns a NEW policy with cascade_step + transformations applied. Caller is responsible for bumping policy_version + updating committed_at_ms at publish time.

Pure function: no I/O, no state, no globals. Deterministic.

Anti-oscillation note (caller responsibility, documented in fn docstring): the spec's 'restore-speculation-one-step-later' rule lives in the WIRING layer (LocalSubstrateGovernor follow-up), not this pure transformation. When retreating N → N-1, caller applies step N-1 for everything EXCEPT speculation, which uses step N for one more cycle. This separation keeps apply_cascade_step_to_policy a clean deterministic mapping.

Also documented (test pins this): apply_cascade_step_to_policy is NOT reversible from a transformed policy. apply(transformed, 0) does NOT restore base; the caller must hold the original base separately and re-apply step 0 from it. LocalSubstrateGovernor will need to evolve to store base + active separately (PR-3c4).

Constants:
- MAX_FEDERATION_PULL_CADENCE_SECONDS = 3600 (once-per-hour ceiling). Pinned by test to catch silent tuning.

Tests: 46 passing on cargo test --lib --features metal,accelerate governor::cascade:: (30 from PR-3c1 + 16 new)

NEW (16) for apply_cascade_step_to_policy:
- step 0 == base except cascade_step (identity)
- step 1 drops Aggressive → Balanced
- step 1 covers full speculation ladder (4 variants)
- step 2 personas-1 + cumulative speculation drop
- step 2 personas floor at 1 (defensive)
- step 2 stretches non-realtime cadence (delayed + background → 2.0)
- step 2 doesn't shrink already-stretched cadence (max-not-set semantics)
- step 3 shrinks l1 by 25% (8→6, 16384→12288)
- step 3 l1 floors at 1 (1*0.75=0.75→0→max(0,1)=1)
- step 4 federation_pull_cadence_seconds = MAX (60→3600)
- step 5 consolidation = Manual
- step 5 cumulative: all prior transformations applied
- step > MAX clamps to MAX (defensive against caller bugs)
- determinism
- not reversible from transformed (documented limitation, test pinned)
- MAX_FEDERATION_PULL_CADENCE_SECONDS const pinned

Stack:
- #1345 PR-1 governor-types (MERGED)
- #1350 PR-2 TOML loader (MERGED)
- #1352 PR-3a policy_selector (MERGED)
- #1354 PR-3b LocalSubstrateGovernor (MERGED)
- #1356 PR-3c1 cascade evaluator (MERGED)
- #1360 PR-3c2 cascade wiring + time-in-step gate (MERGED)
- This PR (PR-3c3): apply_cascade_step_to_policy field rewrites
- Future PR-3c4: wire apply_cascade_step_to_policy into LocalSubstrateGovernor + restore-speculation-one-step-later semantics + base-vs-active policy split
- Future PR-3d: file watcher (notify crate)
- Future PR-4: PressureBroker → governor wiring

VDD evidence: N/A (pure transformation). Evidence with PR-3c4 wiring + PR-4 + downstream consumers reading the throttled policy.

Coordination: explicit claim posted to airc 00:25Z; codex on orthogonal VDD work per their 00:25:13Z broadcast. No collision.

Co-authored-by: Test <test@test.com>
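
The cumulative cascade table in the commit message above can be read as a single pure function. The sketch below follows the per-step rules as stated there; the flattened `PolicySketch` struct, its field types, the enum names, `MAX_CASCADE_STEP`, and the integer `n * 3 / 4` rounding are assumptions for illustration and may differ from the real GovernorPolicy and apply_cascade_step_to_policy.

```rust
/// Assumed name for the highest cascade step (the text describes steps 0..5).
pub const MAX_CASCADE_STEP: u8 = 5;
/// Once-per-hour ceiling named in the commit message.
pub const MAX_FEDERATION_PULL_CADENCE_SECONDS: u64 = 3600;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Speculation { Off, Conservative, Balanced, Aggressive }

impl Speculation {
    /// One notch toward Off; Off stays Off.
    fn one_notch_down(self) -> Self {
        match self {
            Speculation::Aggressive => Speculation::Balanced,
            Speculation::Balanced => Speculation::Conservative,
            Speculation::Conservative | Speculation::Off => Speculation::Off,
        }
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Consolidation { Always, Idle, IdlePluggedIn, Manual }

/// Hypothetical flattened stand-in for GovernorPolicy.
#[derive(Debug, Clone, PartialEq)]
pub struct PolicySketch {
    pub cascade_step: u8,
    pub speculation: Speculation,
    pub personas_concurrent: u32,
    pub cadence_delayed: f64,
    pub cadence_background: f64,
    pub l1_lora_layers: u64,
    pub l1_kv_tokens: u64,
    pub pull_cadence_seconds: u64,
    pub consolidation: Consolidation,
}

/// Shrink to 75% with integer truncation, never below 1.
fn shrink_to_75_percent(n: u64) -> u64 {
    (n * 3 / 4).max(1)
}

/// Pure mapping from (base policy, step) to a new policy. Each `step >= k`
/// branch adds that step's change on top of the earlier ones, which is what
/// makes the table cumulative.
pub fn apply_cascade_step(base: &PolicySketch, step: u8) -> PolicySketch {
    let step = step.min(MAX_CASCADE_STEP); // defensive clamp
    let mut p = base.clone();
    p.cascade_step = step;
    if step >= 1 {
        p.speculation = p.speculation.one_notch_down();
    }
    if step >= 2 {
        p.personas_concurrent = p.personas_concurrent.saturating_sub(1).max(1);
        // max(), not assignment: an already-stretched cadence is kept.
        p.cadence_delayed = p.cadence_delayed.max(2.0);
        p.cadence_background = p.cadence_background.max(2.0);
    }
    if step >= 3 {
        p.l1_lora_layers = shrink_to_75_percent(p.l1_lora_layers);
        p.l1_kv_tokens = shrink_to_75_percent(p.l1_kv_tokens);
    }
    if step >= 4 {
        p.pull_cadence_seconds = MAX_FEDERATION_PULL_CADENCE_SECONDS;
    }
    if step >= 5 {
        p.consolidation = Consolidation::Manual;
    }
    p
}
```

Applying the function to an already-transformed policy with step 0 leaves the earlier changes in place, which is why the caller has to keep the untouched base policy around and always derive from it.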

…ase/active split + restore-speculation-one-step-later

Stacks on #1364 (PR-3c3 apply_cascade_step_to_policy, MERGED). PR-3c3 shipped the pure function. PR-3c4 wires it into LocalSubstrateGovernor with the base-vs-active policy split + the spec's restore-speculation-one-step-later anti-oscillation rule.

What changed in local.rs:
- LocalSubstrateGovernor.base_policy: Mutex<GovernorPolicy> field added. Holds the canonical un-throttled policy (cascade_step always 0). Cascade transitions re-derive active from base via apply_cascade_step_to_policy, never from the already-throttled current. This addresses PR-3c3's not-reversible-from-transformed documented limitation.
- SnapshotState.pending_speculation_retreat: bool added. Tracks whether the cascade just retreated; if true, the NEXT Hold or Retreat restores speculation to the lower-step value. The first retreat keeps speculation at the higher-step (pre-retreat) value for one more cycle.
- new() initializes base_policy from the supplied initial_policy (cascade_step normalized to 0 on the base; active keeps the supplied cascade_step).
- try_hardware_detected() refreshes base_policy + resets cascade (step 0, last_step_change_ms now, pending_speculation_retreat cleared). New hardware = fresh start; existing pressure state discarded.
- on_pressure_signal() rewired:
  * Same time-in-step gate as PR-3c2 (Advance from step > 0 within MIN_TIME_IN_STEP_MS → Hold; emergency bypasses; retreat never gated)
  * On step change: clone base_policy + call apply_cascade_step_to_policy + bump policy_version + update committed_at_ms
  * On retreat: also apply prev_step's speculation to next_policy (one-step-later semantics) + set pending_speculation_retreat
  * On Advance after pending-retreat: clear marker (new pressure re-throttles speculation immediately)
  * On Hold with pending marker: deliver the restoration (publish new policy with current_step's speculation; clear marker)

Restore-speculation-one-step-later rationale (from spec): Speculation thrash is the most user-visible cascade flapping. By keeping speculation throttled for ONE EXTRA cycle after the cascade retreats, we dampen the most observable form of oscillation while letting the rest of the policy (tier sizes, cadence, concurrency) restore immediately. The cost is one cycle of slightly-throttled speculation; the benefit is no observable flicker between Aggressive and Balanced (or whatever pair the cascade is bouncing between).

Failure-mode discipline:
- Base policy is the ONLY source of truth for transformations. Active is always derived; never mutated in place.
- Restore-one-step-later is typed (bool marker, not a magic time comparison or a sentinel value).
- Hardware change wipes pending retreat marker. New hardware = clean slate; old cascade state doesn't bleed into new policy.

Tests: 29 passing on cargo test --lib --features metal,accelerate governor::local:: (22 prior + 7 new for PR-3c4)

NEW (7):
- advance_derives_active_from_base_with_step_transformations
- emergency_advance_applies_full_throttle_transformations (full step-5 cumulative: tier_sizes shrunk, federation maxed, consolidation Manual, speculation dropped, personas-1)
- retreat_holds_speculation_for_one_more_cycle (anti-oscillation rule pinned: Advance 0→1 drops Aggr→Balanced; Retreat 1→0 KEEPS Balanced; next Hold RESTORES Aggressive)
- advance_during_pending_retreat_clears_marker
- hardware_detected_refreshes_base_and_resets_cascade
- advance_then_retreat_returns_to_base_values_modulo_speculation_dampening (proves derive-from-base prevents compounding transformations; this was PR-3c3's not-reversible warning)
- (helpers: policy_with_l1, policy_with_l1_nvidia)

Stack:
- #1345 / #1350 / #1352 / #1354 / #1356 / #1360 / #1364: Lane H PRs MERGED
- This PR (PR-3c4): wire apply_cascade_step_to_policy + base/active split + restore-speculation-one-step-later
- Future PR-3d: file watcher (notify crate), hot-reload policy file changes via set_candidates
- Future PR-4: PressureBroker → governor wiring (subscribe to typed pressure events from broker)

VDD evidence: N/A (wiring + state machine). Evidence with PR-4 + harness measurements when real pressure flows + downstream consumers read throttled policy fields.

Coordination: explicit claim posted 00:40Z; codex on demand-aligned-recall PR-1 per their 00:40:22Z broadcast. claude-tab-1 on whatever-next. No collision.
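
The restore-speculation-one-step-later rule described above is easiest to follow as a small state machine. This sketch tracks only the speculation field and the pending marker; `CascadeSketch`, `Decision`, `speculation_at`, and the choice to treat a Retreat at step 0 as a Hold are assumptions for illustration. The time-in-step gate, emergency advance, and the other policy fields are intentionally omitted.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Speculation { Off, Conservative, Balanced, Aggressive }

/// Hypothetical evaluator outcome for one pressure signal.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Decision { Advance, Hold, Retreat }

/// Speculation derived from the BASE value for a given step: any step >= 1
/// is one notch below base, step 0 is base itself.
fn speculation_at(base: Speculation, step: u8) -> Speculation {
    if step == 0 {
        return base;
    }
    match base {
        Speculation::Aggressive => Speculation::Balanced,
        Speculation::Balanced => Speculation::Conservative,
        Speculation::Conservative | Speculation::Off => Speculation::Off,
    }
}

pub struct CascadeSketch {
    base: Speculation,
    pub step: u8,
    pub active: Speculation,
    pub pending_speculation_retreat: bool,
}

impl CascadeSketch {
    pub fn new(base: Speculation) -> Self {
        Self { base, step: 0, active: base, pending_speculation_retreat: false }
    }

    pub fn on_decision(&mut self, decision: Decision) {
        match decision {
            Decision::Advance => {
                // New pressure re-throttles immediately and clears the marker.
                self.step = (self.step + 1).min(5);
                self.active = speculation_at(self.base, self.step);
                self.pending_speculation_retreat = false;
            }
            Decision::Retreat if self.step > 0 => {
                // Step drops now, but speculation keeps the pre-retreat
                // step's value for one more cycle.
                let prev_step = self.step;
                self.step -= 1;
                self.active = speculation_at(self.base, prev_step);
                self.pending_speculation_retreat = true;
            }
            Decision::Hold | Decision::Retreat => {
                // Deliver the delayed restoration, if one is pending.
                if self.pending_speculation_retreat {
                    self.active = speculation_at(self.base, self.step);
                    self.pending_speculation_retreat = false;
                }
            }
        }
    }
}
```

Every branch derives `active` from `base` rather than from the previous `active` value, mirroring the base-vs-active split: repeated advance and retreat cycles cannot compound the one-notch drop.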

github-actions Bot added the size: XL label May 16, 2026

Test added 2 commits May 16, 2026 18:05

chore(governor): tighten policy loader diagnostics

6965137

joelteply force-pushed the feat/substrate-governor-pr2-toml-loader branch from 1d4cc2d to 6965137 Compare May 16, 2026 23:07

joelteply merged commit dcddcae into canary May 16, 2026
3 checks passed

joelteply deleted the feat/substrate-governor-pr2-toml-loader branch May 16, 2026 23:08

This was referenced May 16, 2026

feat(governor): Lane H PR-3a — policy selection from HardwareClass + applies_to fingerprint #1351

Closed

feat(governor): Lane H PR-3b — LocalSubstrateGovernor reference impl + arc_swap publish #1354

Merged

This was referenced May 16, 2026

feat(governor): Lane H PR-3c1 — cascade evaluator pure function + CascadeThresholds #1356

Merged

feat(governor): Lane H PR-3c2 — wire cascade evaluator into on_pressure_signal + time-in-step gate #1360

Merged

joelteply mentioned this pull request May 17, 2026

feat(governor): Lane H PR-3c4 — wire apply_cascade_step_to_policy + base/active split + restore-speculation-one-step-later #1365

Merged

joelteply merged 2 commits into canary from feat/substrate-governor-pr2-toml-loader
