feat(governor): Lane H PR-3a — policy selection from HardwareClass + applies_to fingerprint #1351

Closed
joelteply wants to merge 1 commit into canary from feat/substrate-governor-pr3a-policy-selection

@joelteply
Contributor

Summary

Lane H PR-3a per GENOME-FOUNDRY-SENTINEL #1327 Part 11. Stacks on #1345 (PR-1 types, MERGED) + #1350 (PR-2 TOML loader, MERGED).

PR-2 shipped file → PolicyFile. This PR-3a ships the SELECTION layer: given a HardwareClass + a list of PolicyFiles, pick the right one.

Splitting PR-3 into atomic sub-slices: this PR-3a (selection), future PR-3b (LocalSubstrateGovernor reference impl with arc_swap), PR-3c (cascade state machine + hysteresis), PR-3d (file watcher).

Match algorithm

Comma-separated constraints in applies_to:

  • Silicon tag: apple-m / nvidia / amd / vulkan / none
  • Thermal tag: thinandlight / workstation / server / mobile
  • UMA tag: uma (redundant with apple-m, for reader clarity)
  • Numeric range: vram_mb=lo..hi, ram_mb=lo..hi (inclusive both ends)

ALL constraints must hold. Multiple matches → longest applies_to wins (most specific). Zero matches → typed NoMatchingPolicy error with HardwareClass + candidate_count — never silent default to a wrong-hardware policy.
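
The matching rules above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `HardwareClass` fields, the `select` and `constraint_holds` helpers, and the plain `String` errors are all assumptions made for the example, and candidates are reduced to their bare `applies_to` strings. Ties between equally long matches go to the first candidate here, which the description does not specify.

```rust
use std::ops::RangeInclusive;

// Hypothetical stand-in for the real HardwareClass; field names are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct HardwareClass {
    pub silicon: &'static str, // e.g. "apple-m", "nvidia"
    pub thermal: &'static str, // e.g. "thinandlight", "workstation"
    pub uma: bool,
    pub vram_mb: u64,
    pub ram_mb: u64,
}

const SILICON: [&str; 5] = ["apple-m", "nvidia", "amd", "vulkan", "none"];
const THERMAL: [&str; 4] = ["thinandlight", "workstation", "server", "mobile"];

/// Parse "lo..hi" into an inclusive range; None on any syntax problem.
fn parse_range(text: &str) -> Option<RangeInclusive<u64>> {
    let (lo, hi) = text.split_once("..")?;
    let lo: u64 = lo.trim().parse().ok()?;
    let hi: u64 = hi.trim().parse().ok()?;
    if hi < lo { None } else { Some(lo..=hi) }
}

/// Ok(true/false) when the constraint is well-formed; Err for malformed or unknown ones.
fn constraint_holds(c: &str, hw: &HardwareClass) -> Result<bool, String> {
    if let Some((field, range)) = c.split_once('=') {
        let range = parse_range(range).ok_or_else(|| format!("malformed range in `{c}`"))?;
        return match field.trim() {
            "vram_mb" => Ok(range.contains(&hw.vram_mb)),
            "ram_mb" => Ok(range.contains(&hw.ram_mb)),
            other => Err(format!("unknown range field `{other}`")),
        };
    }
    if SILICON.iter().any(|t| *t == c) {
        return Ok(hw.silicon == c);
    }
    if THERMAL.iter().any(|t| *t == c) {
        return Ok(hw.thermal == c);
    }
    if c == "uma" {
        return Ok(hw.uma);
    }
    // Unknown tags are an error, never a silent wildcard.
    Err(format!("unknown constraint tag `{c}`"))
}

/// Pick the matching `applies_to` string; the longest match wins.
pub fn select<'a>(hw: &HardwareClass, candidates: &'a [&'a str]) -> Result<&'a str, String> {
    let mut best: Option<&'a str> = None;
    for &applies_to in candidates {
        let mut all_hold = true;
        // Whitespace around constraints is tolerated; an empty string has no
        // constraints and therefore acts as a wildcard.
        for c in applies_to.split(',').map(str::trim).filter(|c| !c.is_empty()) {
            all_hold &= constraint_holds(c, hw)?;
        }
        if all_hold && best.map_or(true, |b| applies_to.len() > b.len()) {
            best = Some(applies_to);
        }
    }
    best.ok_or_else(|| format!("no matching policy among {} candidates", candidates.len()))
}
```

Because `select` reads only its arguments, calling it twice with the same hardware and candidates yields the same result, which is the determinism property the test plan pins.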

Failure-mode discipline

  • NoMatchingPolicy on zero matches (named candidate count + named hardware)
  • MalformedConstraint for range syntax errors (field + reason named)
  • UnknownConstraintTag for unrecognized tags — no silent wildcard interpretation
  • Pure function — same (hardware_class, candidates) always returns same result. No I/O, no globals.
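
One way the typed errors listed above could look is sketched below. The variant names come from this description; the field types, the `Display` wording, and the `parse_range` helper are assumptions for illustration only.

```rust
use std::fmt;

// Variant names follow the PR description; field types are assumed.
#[derive(Debug, Clone, PartialEq)]
pub enum PolicySelectionError {
    NoMatchingPolicy { hardware: String, candidate_count: usize },
    MalformedConstraint { field: String, reason: String },
    UnknownConstraintTag { tag: String },
}

impl fmt::Display for PolicySelectionError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::NoMatchingPolicy { hardware, candidate_count } => {
                write!(f, "no policy matches {hardware} ({candidate_count} candidates)")
            }
            Self::MalformedConstraint { field, reason } => {
                write!(f, "malformed constraint `{field}`: {reason}")
            }
            Self::UnknownConstraintTag { tag } => {
                write!(f, "unknown constraint tag `{tag}`")
            }
        }
    }
}

impl std::error::Error for PolicySelectionError {}

/// Hypothetical helper: parse "lo..hi" for `field`, naming the field and the
/// reason in the error instead of falling back to a default range.
pub fn parse_range(field: &str, text: &str) -> Result<(u64, u64), PolicySelectionError> {
    let malformed = |reason: &str| PolicySelectionError::MalformedConstraint {
        field: field.to_string(),
        reason: reason.to_string(),
    };
    let (lo, hi) = text.split_once("..").ok_or_else(|| malformed("missing `..`"))?;
    let lo: u64 = lo.trim().parse().map_err(|_| malformed("lower bound is not a number"))?;
    let hi: u64 = hi.trim().parse().map_err(|_| malformed("upper bound is not a number"))?;
    if hi < lo {
        return Err(malformed("upper bound is below lower bound"));
    }
    Ok((lo, hi))
}
```

Carrying the field name and reason in the error lets a caller log exactly which part of `applies_to` was rejected.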

Test plan

23 passing on cargo test --lib --features metal,accelerate governor::policy_selection::

  • M-Air policy matches M2 Air hardware (canonical Mac path)
  • Blackwell policy matches Blackwell hardware (canonical discrete-GPU)
  • Multiple candidates → only matching returned
  • Multiple matches → longest applies_to wins (tiebreaker)
  • Empty candidates → NoMatchingPolicy { candidate_count=0 }
  • Silicon + thermal tag must match (per-variant)
  • Range inclusive at lower + upper boundary (off-by-one defense)
  • Range misses one-below-lower + one-above-upper
  • VRAM range matches Blackwell
  • UMA tag holds for Apple, fails for discrete
  • Unknown tag → typed err with tag named
  • Range without .. → MalformedConstraint
  • Range non-numeric → MalformedConstraint
  • Range hi < lo → MalformedConstraint (nonsense rejected)
  • Unknown range field (cpu_ghz) → MalformedConstraint
  • Whitespace tolerated in applies_to
  • Empty applies_to acts as wildcard (documented)
  • PolicySelectionError: Display + Error trait
  • Determinism

Stack

VDD evidence

N/A — pure function. Evidence with PR-3b when LocalSubstrateGovernor publishes via arc_swap in production.

…applies_to fingerprint

Per GENOME-FOUNDRY-SENTINEL #1327 Part 11. Stacks on #1350 (PR-2 TOML loader).

PR-2 ships file → PolicyFile. This PR-3a ships the SELECTION layer:
given a HardwareClass + a list of PolicyFile, pick the right one.

Match algorithm (documented in module docstring):
- Comma-separated constraints in applies_to string
- Constraint kinds: silicon tag (apple-m/nvidia/amd/vulkan/none),
  thermal tag (thinandlight/workstation/server/mobile), uma tag
  (redundant with apple-m, for reader clarity), numeric range
  (vram_mb=lo..hi, ram_mb=lo..hi, both inclusive)
- ALL constraints must hold
- If multiple files match, LONGEST applies_to wins (most specific)
- ZERO matches → typed NoMatchingPolicy error with HardwareClass + candidate_count

Pure function. Same (hardware_class, candidates) always returns same result.

Failure-mode discipline:
- NoMatchingPolicy on zero matches (never silent default to wrong-hardware policy)
- MalformedConstraint with field + reason for range syntax errors
- UnknownConstraintTag for unrecognized tags (no silent wildcard interpretation)

Tests: 23 passing on cargo test --lib --features metal,accelerate
governor::policy_selection::

- M-Air policy matches M2 Air hardware (canonical Mac path)
- Blackwell policy matches Blackwell hardware (canonical discrete-GPU path)
- Multiple candidates → only matching returned
- Multiple matches → longest applies_to wins (tiebreaker)
- Empty candidates → NoMatchingPolicy candidate_count=0
- Silicon tag must match (each variant)
- Thermal tag must match
- Range inclusive at lower + upper boundary (off-by-one defense)
- Range misses one-below-lower + one-above-upper
- vram_mb range matches Blackwell
- UMA tag holds for Apple, fails for discrete
- Unknown tag → typed err with tag named
- Range without '..' → MalformedConstraint
- Range non-numeric lo → MalformedConstraint
- Range with hi < lo → MalformedConstraint
- Unknown range field (cpu_ghz) → MalformedConstraint
- Whitespace tolerated in applies_to
- Empty applies_to acts as wildcard (documented)
- PolicySelectionError: Display + Error trait
- Determinism

Stack:
- #1345 governor PR-1 (MERGED)
- #1350 governor PR-2 TOML loader (OPEN)
- This PR (PR-3a): policy selection
- Future PR-3b: LocalSubstrateGovernor reference impl with arc_swap
- Future PR-3c: cascade state machine + hysteresis
- Future PR-3d: file watcher (notify crate)
- Future PR-4: PressureBroker → governor wiring
@joelteply
Contributor Author

Closing in favor of codex's #1352 which addresses the same scope (policy selection from HardwareClass + applies_to). We shipped this in parallel within ~10min — coordination miss on my end (should have checked queue before claiming PR-3a).

codex's #1352 adds an AmbiguousMatch refusal (stricter than my longest-applies_to tiebreaker) and a stable hardware-fingerprint surface for diagnostics/VDD. Both designs are sound; #1352 wins on ambiguity refusal, which is better aligned with the no-silent-defaults rule.

Will rebase my PR-3b LocalSubstrateGovernor work onto #1352's selector API once it merges + take a harness lane item in parallel (T1 #1 chat-roundtrip-live-harness is unclaimed).

@joelteply joelteply closed this May 16, 2026
joelteply added a commit that referenced this pull request May 16, 2026
Stacks on #1352 (codex's PR-3a policy_selector, MERGED). Per
GENOME-FOUNDRY-SENTINEL #1327 Part 11.

LocalSubstrateGovernor is the reference impl of the SubstrateGovernor
trait (from #1345 PR-1). Holds the live policy behind arc_swap for
wait-free reads; mutex-protected snapshot history for telemetry.

What ships in src/workers/continuum-core/src/governor/local.rs:

- LocalSubstrateGovernor struct: Arc<ArcSwap<GovernorPolicy>> for
  policy + Mutex<SnapshotState> for cascade-transition-count +
  recent-signals ring
- new(initial_policy) constructor — ready to serve current_policy()
  immediately
- set_candidates(Vec<PolicyFile>) — file watcher (PR-3d) will call
  this on fs change events; for PR-3b, set manually
- try_hardware_detected(hw) → Result<(), PolicySelectionError> —
  fallible variant for callers that want the typed error
- on_hardware_detected(hw) — trait method, swallows errors per spec
  (logs/telemetry surface them separately)
- on_pressure_signal(signal) — records into ring (PR-3c adds threshold
  + cascade logic; PR-3b only records)
- snapshot() → GovernorSnapshot — telemetry consumer reads this
- candidate_count() — diagnostic for 'did the file watcher load anything?'

Concurrency model (matches spec's 'never blocks reads'):

- Reads: arc_swap.load_full() → Arc<GovernorPolicy> clone (wait-free)
- Writes: arc_swap.store(Arc::new(new_policy)) + mutex on snapshot
  state for transition-count bump (~µs hold)
- Tests prove the wait-free guarantee: many_concurrent_reads_dont_block
  + concurrent_read_during_write_sees_consistent_snapshot

What this PR DOES NOT do:
- Cascade state machine + threshold/hysteresis (PR-3c)
- File watcher / hot reload (PR-3d)
- PressureBroker subscription wiring (PR-4)
- Built-in default policy fallback (caller handles NoMatchingPolicy)

Failure-mode discipline:
- on_hardware_detected with no matching candidate KEEPS previous
  policy (trait swallows error per spec — operator monitors via
  snapshot.cascade_transition_count which stays unchanged on Err)
- on_hardware_detected with empty candidates is a no-op (first-boot
  before file watcher loads anything — governor still serves initial_policy)
- cascade_transition_count increments per PUBLISH, not per call —
  failed selections don't count
- on_pressure_signal does NOT bump cascade_transition_count in PR-3b
  (test pins this so PR-3c lands the threshold logic together)
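
The publish semantics described above (keep the previous policy on a failed
selection, count only real publishes) can be sketched with the standard
library alone. This is an illustration under stated assumptions, not the code
in local.rs: a std RwLock<Arc<..>> stands in for arc_swap's ArcSwap, so reads
here are NOT wait-free, and the Governor type, its fields, and the
tag-equality "selection" are hypothetical simplifications.

```rust
use std::sync::{Arc, Mutex, RwLock};

#[derive(Debug, PartialEq)]
pub struct GovernorPolicy {
    pub name: String,
}

/// Hypothetical, simplified governor. RwLock<Arc<_>> is a stand-in for ArcSwap.
pub struct Governor {
    live: RwLock<Arc<GovernorPolicy>>,
    transitions: Mutex<u64>,
    // (silicon tag, policy name) pairs; the real code holds PolicyFile values.
    candidates: Mutex<Vec<(String, String)>>,
}

impl Governor {
    pub fn new(initial: &str) -> Self {
        Governor {
            live: RwLock::new(Arc::new(GovernorPolicy { name: initial.to_string() })),
            transitions: Mutex::new(0),
            candidates: Mutex::new(Vec::new()),
        }
    }

    pub fn set_candidates(&self, candidates: Vec<(String, String)>) {
        *self.candidates.lock().unwrap() = candidates;
    }

    /// Readers get a cheap Arc clone of whatever policy is currently published.
    pub fn current_policy(&self) -> Arc<GovernorPolicy> {
        Arc::clone(&self.live.read().unwrap())
    }

    /// Fallible variant: on Err nothing is published and the counter is untouched.
    pub fn try_hardware_detected(&self, silicon: &str) -> Result<(), String> {
        let candidates = self.candidates.lock().unwrap();
        let chosen = candidates
            .iter()
            .find(|(tag, _)| tag.as_str() == silicon)
            .map(|(_, name)| name.clone())
            .ok_or_else(|| format!("no matching policy for {silicon}"))?;
        drop(candidates);
        *self.live.write().unwrap() = Arc::new(GovernorPolicy { name: chosen });
        *self.transitions.lock().unwrap() += 1; // counted per publish, not per call
        Ok(())
    }

    /// Infallible variant: swallows the error, so the previous policy stays live.
    pub fn on_hardware_detected(&self, silicon: &str) {
        let _ = self.try_hardware_detected(silicon);
    }

    pub fn cascade_transition_count(&self) -> u64 {
        *self.transitions.lock().unwrap()
    }
}
```

With no candidates loaded, on_hardware_detected leaves the initial policy in
place and the count at zero, which mirrors the first-boot case above.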

Tests: 16 passing on cargo test --lib --features metal,accelerate
governor::local:: (79 total governor:: across PR-1/PR-2/PR-3a/PR-3b)

- new() serves initial policy immediately
- candidate_count reflects set_candidates
- on_hardware_detected publishes matching policy
- try_hardware_detected returns NoMatchingPolicy err
- on_hardware_detected no-match KEEPS previous policy
- on_hardware_detected empty candidates no-op
- Successive hardware_detected publishes multiple times
- on_pressure_signal records signal
- recent_signals ring capped at RECENT_SIGNALS_CAPACITY=32 (FIFO eviction)
- snapshot includes policy + signals
- cascade_transition_count increments per publish
- cascade_transition_count UNCHANGED on no-match
- on_pressure_signal does NOT transition in PR-3b (PR-3c adds it)
- many_concurrent_reads_dont_block (Arc<Self> + 16 threads × 1000 reads each)
- concurrent_read_during_write_sees_consistent_snapshot (writer mutates +
  reader observes Arc snapshots that are always one of {1, 2, 8} — no torn read)
- current_policy returns same Arc when no writes (Arc::ptr_eq)
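
A capped ring with FIFO eviction, as exercised by the recent_signals test
above, might look like the sketch below. Only RECENT_SIGNALS_CAPACITY = 32 is
taken from this message; the RecentSignals type and its methods are
hypothetical names for illustration.

```rust
use std::collections::VecDeque;

pub const RECENT_SIGNALS_CAPACITY: usize = 32;

/// Hypothetical ring of the most recent signals; the oldest entry is evicted first.
pub struct RecentSignals<T> {
    ring: VecDeque<T>,
}

impl<T> RecentSignals<T> {
    pub fn new() -> Self {
        RecentSignals { ring: VecDeque::with_capacity(RECENT_SIGNALS_CAPACITY) }
    }

    /// Record a signal, dropping the oldest one once the ring is full.
    pub fn record(&mut self, signal: T) {
        if self.ring.len() == RECENT_SIGNALS_CAPACITY {
            self.ring.pop_front();
        }
        self.ring.push_back(signal);
    }

    pub fn len(&self) -> usize {
        self.ring.len()
    }

    /// Oldest-to-newest view, e.g. for copying into a snapshot.
    pub fn iter(&self) -> impl Iterator<Item = &T> {
        self.ring.iter()
    }
}
```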

Added deps: arc-swap = '1.7' (tiny crate, no transitive deps).

Coordination: ceded my own PR-3a (#1351 closed) in favor of codex's
#1352 which has stricter AmbiguousPolicy refusal + hardware_fingerprint
diagnostic surface. This PR-3b rebased onto codex's policy_selector API
(arg order: select_policy(policies, hw), not (hw, policies)) +
imports updated.

Stack:
- #1335 hw_probe (MERGED)
- #1345 PR-1 governor-types (MERGED)
- #1350 PR-2 TOML loader (MERGED)
- #1352 PR-3a policy_selector (codex's, MERGED)
- This PR (PR-3b): LocalSubstrateGovernor + arc_swap publish
- Future PR-3c: cascade state machine + hysteresis (5 steps; restore-
  speculation-one-step-later anti-oscillation rule per spec)
- Future PR-3d: file watcher (notify crate)
- Future PR-4: PressureBroker → governor wiring

VDD evidence N/A — pure-state impl. Evidence with PR-3c when the
cascade is wired + with PR-4 when actual pressure signals flow.

Co-authored-by: Test <test@test.com>