feat: orc eval + tiered verification #11
Merged
The measurable-gate + tiered-router feature both validation studies named as the top gap. Eval and tiering share one dependency, a user-owned labeled gold set, so the gate is measured on the real corpus and the cheap->expensive router is calibrated, never tuned blind. Captures the data model (gold_claim, eval_run, tiered_policy; schema v2), the metrics-library extraction, orc eval run/show/label/import/calibrate, the tiered_verify meta-mode with a configurable cross-family top judge, and the calibrate achievability guard behind the 0.95 default. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Six staged, TDD task groups: metrics library extraction, gold store + schema v2, orc eval run/show, tiered_verify meta-mode, calibrate loop with achievability guard, and docs. Each stage independently testable. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Adds gold_claim, eval_run, and tiered_policy tables for the eval + tiered-verification feature. ensure_schema() re-runs the idempotent CREATE-IF-NOT-EXISTS script when a workspace's stored version lags, so existing v1 workspaces gain the tables the first time newer orc opens them. Replaces the prior unchecked-version handling with a real forward migration. resolve() runs it on open. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
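The forward migration described in this commit can be sketched as follows. This is a minimal illustration, not orc's code: it assumes the workspace is a SQLite database and that the schema version lives in `PRAGMA user_version`, neither of which the PR states. Only the three table names and the `ensure_schema` name come from the commit message; the column layouts are placeholders.

```python
import sqlite3

SCHEMA_VERSION = 2  # v2 adds the eval + tiered-verification tables

# Hypothetical column layouts; only the table names come from the PR.
SCHEMA_SCRIPT = """
CREATE TABLE IF NOT EXISTS gold_claim (
    id INTEGER PRIMARY KEY, claim TEXT, label TEXT, corpus_version TEXT);
CREATE TABLE IF NOT EXISTS eval_run (
    id INTEGER PRIMARY KEY, metrics_json TEXT);
CREATE TABLE IF NOT EXISTS tiered_policy (
    id INTEGER PRIMARY KEY, threshold REAL, target REAL);
"""


def ensure_schema(conn: sqlite3.Connection) -> bool:
    """Re-run the idempotent script when the stored version lags.

    Returns True if a migration ran, False if the workspace was current.
    """
    stored = conn.execute("PRAGMA user_version").fetchone()[0]
    if stored >= SCHEMA_VERSION:
        return False
    # CREATE TABLE IF NOT EXISTS leaves existing v1 tables untouched.
    conn.executescript(SCHEMA_SCRIPT)
    # PRAGMA values cannot be bound as parameters; the constant is an int.
    conn.execute(f"PRAGMA user_version = {SCHEMA_VERSION}")
    conn.commit()
    return True
```

Because every statement is `CREATE TABLE IF NOT EXISTS`, calling `ensure_schema` twice is harmless: the second call sees the bumped version and returns without touching the database.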
The gold set's producer surface. `orc eval import` seeds from a YAML file (the existing claims fixture format); `orc eval label <run_id>` promotes or corrects a real verdict into gold, pulling the claim and corpus_version straight from the trace so the label is grounded in exactly what orc verified; `orc eval gold list` shows entries and flags stale chunk-level labels (corpus_version behind the workspace). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Verifies every gold claim frozen against its labeled corpus_version, inside a per-claim traced Run tagged with the eval id, so an eval is inspectable claim-by-claim and replayable. Aggregates exact-match judge accuracy, supported-class precision/recall/F1, confidence calibration (reliability bins + ECE), and retrieval recall@k where chunk-level labels exist. Persists to eval_run; load_eval reloads it. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
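For readers unfamiliar with the calibration metrics named above, here is a minimal sketch of reliability bins and expected calibration error (ECE) using equal-width bins. The function names and the 10-bin default are assumptions for illustration; the PR does not show how `orc.metrics` bins confidences.

```python
def reliability_bins(confidences, correct, n_bins=10):
    """Group predictions into equal-width confidence bins.

    Assumes confidences lie in [0, 1]. Returns one
    (mean_confidence, accuracy, count) tuple per non-empty bin.
    """
    sums = [[0.0, 0, 0] for _ in range(n_bins)]  # conf sum, n correct, n total
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        sums[idx][0] += conf
        sums[idx][1] += int(ok)
        sums[idx][2] += 1
    return [(s / n, c / n, n) for s, c, n in sums if n]


def expected_calibration_error(confidences, correct, n_bins=10):
    """Count-weighted mean of |accuracy - mean confidence| over the bins."""
    bins = reliability_bins(confidences, correct, n_bins)
    total = sum(n for _, _, n in bins)
    if total == 0:
        return 0.0
    return sum(n / total * abs(acc - conf) for conf, acc, n in bins)
```

A judge that reports 0.9 confidence on four claims and gets three right has a gap of |0.75 - 0.9| in its single bin, so its ECE is about 0.15.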
orc eval run scores the gate against the gold set and prints judge accuracy, supported-class P/R/F1, confidence ECE, retrieval recall, and a stale-label warning (--json for the full metrics dict). orc eval show reprints a persisted eval report by id. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
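The exact-match accuracy and supported-class P/R/F1 that `orc eval run` prints can be illustrated with a small sketch. The function name, the `"supported"`/`"unsupported"` label strings, and the returned dict keys are assumptions; the PR only names the metrics.

```python
def supported_class_scores(gold, predicted, positive="supported"):
    """Exact-match accuracy plus precision/recall/F1 for one positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = sum(g == p for g, p in pairs) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Treating "supported" as the positive class means precision answers "when the gate says supported, how often is it right?", which is the number that matters when a supported verdict lets a claim through.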
mode="tiered" runs a cheap Tier-1 binary judge on every claim and ships its verdict when confidence clears the calibrated escalation threshold; below it, the claim escalates to a stronger Tier-2 evidence judge — optionally a different model family (top_judge_model) so the escalation judge doesn't share Tier 1's blind spots. The deciding tier, both confidences, and the escalation reason are recorded in the trace. The threshold comes from the workspace's tiered_policy (set by `orc eval calibrate`); with no policy a conservative default routes and a warning fires. Lives in a new modes/ package and takes the skill instance to call self.run per tier — same pattern as decomposed — so no import cycle. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
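The routing rule in this commit can be sketched in a few lines. This is an illustration under stated assumptions, not the module's code: the judges are modelled as plain callables returning `(verdict, confidence)`, and `TieredResult` and its field names are hypothetical stand-ins for whatever the trace records.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical judge signature: claim text -> (verdict, confidence)
Judge = Callable[[str], Tuple[str, float]]


@dataclass
class TieredResult:
    verdict: str
    deciding_tier: int
    tier1_confidence: float
    tier2_confidence: Optional[float]  # None when Tier 2 never ran
    reason: str


def tiered_verify(claim: str, tier1: Judge, tier2: Judge,
                  threshold: float) -> TieredResult:
    """Ship the Tier-1 verdict if its confidence clears the threshold;
    otherwise escalate to the stronger Tier-2 judge."""
    verdict1, conf1 = tier1(claim)
    if conf1 >= threshold:
        return TieredResult(verdict1, 1, conf1, None,
                            "tier-1 confidence cleared threshold")
    verdict2, conf2 = tier2(claim)
    return TieredResult(
        verdict2, 2, conf1, conf2,
        f"tier-1 confidence {conf1:.2f} below threshold {threshold:.2f}",
    )
```

Passing the judges in as callables keeps the router independent of which model backs each tier, which is what makes a cross-family Tier-2 judge a configuration choice rather than a code change.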
sweep_threshold finds the lowest Tier-1 confidence cutoff whose accepted claims reach the target accuracy — lowest because it accepts the most at Tier 1 and escalates the fewest. When no cutoff reaches the target (Tier 1 caps below it) the result is achievable=False with the best accepted accuracy, so the caller can refuse to write an always-escalate policy. calibrate() runs the gold set through the cheap binary judge and sweeps. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
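A minimal sketch of the sweep, assuming Tier 1 has already produced one confidence and one correct/incorrect flag per gold claim. `SweepResult` and its fields are hypothetical. The threshold returned on an unachievable result is simply the cutoff with the best accepted accuracy, which is this sketch's choice and may not match the fallback orc stores.

```python
from dataclasses import dataclass


@dataclass
class SweepResult:
    achievable: bool
    threshold: float
    accepted_accuracy: float
    escalation_rate: float  # share of claims that would go to Tier 2


def sweep_threshold(confidences, correct, target=0.95):
    """Find the lowest cutoff whose accepted claims (confidence >= cutoff)
    reach the target accuracy."""
    n = len(confidences)
    if n == 0:
        raise ValueError("cannot calibrate on an empty gold set")
    best = None
    for cutoff in sorted(set(confidences)):
        # Never empty: the cutoff itself is one of the confidences.
        accepted = [ok for conf, ok in zip(confidences, correct) if conf >= cutoff]
        accuracy = sum(accepted) / len(accepted)
        result = SweepResult(accuracy >= target, cutoff, accuracy,
                             1 - len(accepted) / n)
        if result.achievable:
            return result  # cutoffs ascend, so the first hit is the lowest
        if best is None or accuracy > best.accepted_accuracy:
            best = result
    return best  # achievable=False, carrying the best accepted accuracy
```

Returning `achievable=False` instead of an extreme cutoff lets the caller decide what to do, for example refusing to write a policy that would escalate every claim.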
Closes the eval<->tiering loop: runs the gold set through Tier 1, sweeps for the lowest threshold meeting --target (default 0.95), writes the tiered_policy that tiered_verify reads, and reports the escalation rate so the cost is visible immediately. When the target is unachievable it says so on stderr (best accuracy + the stored fallback threshold) rather than silently configuring always-escalate. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
README commands + an honesty note that orc eval measures the unsupported-claims coverage row against the user's own labelled gold set (and still cannot measure faithful-but-wrong corpus content); CHANGELOG Unreleased entries; the same one-line caveat mirrored into the competitive and EU AI Act docs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Context
Both validation studies (adversarial web research + Delphi panel) named the same top gap: orc's gate was unmeasured on the user's own corpus, and any tiered cost-saving router would be tuned blind. This makes the gate measurable on a user-owned labelled gold set, then uses that measurement to calibrate a cheap→expensive tiered router. Eval and tiering share one dependency — the gold set — by design.
Built to the approved spec (`docs/superpowers/specs/2026-06-12-orc-eval-tiered-verification-design.md`) and plan (`docs/superpowers/plans/2026-06-12-orc-eval-tiered-verification.md`): 15 commits across 6 stages, every behavior TDD'd.

What's in it
- Metrics library (`orc.metrics`): confusion/scores extracted from the benchmark (published HaluBench numbers byte-identical, verified) plus new confidence calibration (reliability bins + ECE).
- Gold store + schema v2 (`gold_claim`/`eval_run`/`tiered_policy` + a real forward migration on workspace open). `orc eval import` seeds from YAML; `orc eval label <run_id>` promotes/corrects a real verdict into gold, pinned to its corpus version.
- `orc eval run`/`show`: judge accuracy, supported-class P/R/F1, calibration ECE, retrieval recall@k; each gold claim is verified frozen against its corpus version in a per-claim traced Run (so an eval is replayable claim-by-claim).
- `tiered_verify` (`verify --mode tiered`): a cheap Tier-1 binary judge runs on every claim, escalating to a stronger (optionally cross-family `top_judge_model`) Tier-2 judge below the calibrated threshold; deciding tier + reason recorded in the trace.
- `orc eval calibrate`: derives the lowest threshold meeting `--target` (default 0.95) from the gold set and writes the policy `tiered_verify` reads. Achievability guard: if Tier 1 can't reach the target at any cutoff, it says so (best accuracy + fallback threshold) instead of silently configuring always-escalate.

Testing
`uv run pytest -q`: 395 passed, 2 skipped, ~3.5s; ruff clean. +28 tests, all RED-first. Schema v2 auto-migrates existing v1 workspaces on open.

Deliberate deviations from the plan
- `modes/` extraction (plan Task 4.1): a pure refactor with circular-import risk; not worth doing unsupervised. `tiered` is a fresh additive module instead; the `modes/` package now exists so moving the other two later is low-friction.

Built autonomously across several loop ticks, stage-by-stage. Not merged; for your review.
🤖 Generated with Claude Code