feat: orc eval + tiered verification by Thormatt · Pull Request #11 · Thormatt/orc

Thormatt · 2026-06-12T20:41:33Z

Context

Both validation studies (adversarial web research + Delphi panel) named the same top gap: orc's gate was unmeasured on the user's own corpus, and any tiered cost-saving router would be tuned blind. This makes the gate measurable on a user-owned labelled gold set, then uses that measurement to calibrate a cheap→expensive tiered router. Eval and tiering share one dependency — the gold set — by design.

Built to the approved spec (docs/superpowers/specs/2026-06-12-orc-eval-tiered-verification-design.md) and plan (docs/superpowers/plans/2026-06-12-orc-eval-tiered-verification.md), 15 commits across 6 stages, every behavior TDD'd.

What's in it

Metrics library (orc.metrics) — confusion/scores extracted from the benchmark (published HaluBench numbers byte-identical, verified) + new confidence calibration (reliability bins + ECE).
Gold set — schema v2 (gold_claim/eval_run/tiered_policy + a real forward migration on workspace open). orc eval import seeds from YAML; orc eval label <run_id> promotes/corrects a real verdict into gold, pinned to its corpus version.
orc eval run/show — judge accuracy, supported-class P/R/F1, calibration ECE, retrieval recall@k; each gold claim verified frozen against its corpus version in a per-claim traced Run (so an eval is replayable claim-by-claim).
tiered_verify (verify --mode tiered) — cheap Tier-1 binary judge on every claim, escalate to a stronger (optionally cross-family top_judge_model) Tier-2 below the calibrated threshold; deciding tier + reason recorded in the trace.
orc eval calibrate — derives the lowest threshold meeting --target (default 0.95) from the gold set and writes the policy tiered_verify reads. Achievability guard: if Tier 1 can't reach the target at any cutoff, it says so (best accuracy + fallback threshold) instead of silently configuring always-escalate.
Docs — README commands, coverage-ceiling honesty note (eval measures the unsupported-claims row against your labels; still can't catch faithful-but-wrong corpus), CHANGELOG, compliance/positioning caveats.
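
For readers unfamiliar with the calibration metric named above, here is a minimal sketch of reliability bins and expected calibration error (ECE). It is an illustration only: `Bin`, `reliability_bins`, and `expected_calibration_error` are hypothetical names, and equal-width binning with 10 bins is an assumption, not necessarily what orc.metrics does.

```python
from dataclasses import dataclass


@dataclass
class Bin:
    lo: float               # lower edge of the confidence bin
    hi: float               # upper edge of the confidence bin
    count: int              # number of predictions in the bin
    mean_confidence: float  # average stated confidence in the bin
    accuracy: float         # fraction of predictions in the bin that were correct


def reliability_bins(confidences, correct, n_bins=10):
    """Group predictions into equal-width confidence bins (assumed scheme)."""
    buckets = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        buckets[idx].append((conf, ok))
    bins = []
    for i, items in enumerate(buckets):
        if not items:
            continue  # empty bins contribute nothing to ECE
        n = len(items)
        mean_conf = sum(c for c, _ in items) / n
        acc = sum(1 for _, ok in items if ok) / n
        bins.append(Bin(i / n_bins, (i + 1) / n_bins, n, mean_conf, acc))
    return bins


def expected_calibration_error(bins):
    """Count-weighted average of |accuracy - mean confidence| over the bins."""
    total = sum(b.count for b in bins)
    if total == 0:
        return 0.0
    return sum(b.count / total * abs(b.accuracy - b.mean_confidence) for b in bins)
```

A judge that reports 0.9 confidence on four claims and gets three right lands in a single bin with accuracy 0.75, giving an ECE of about 0.15 under these assumptions.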

Testing

uv run pytest -q: 395 passed, 2 skipped, ~3.5s; ruff clean. +28 tests, all RED-first. Schema v2 auto-migrates existing v1 workspaces on open.

Deliberate deviations from the plan

Skipped the decomposed/arithmetic modes/ extraction (plan Task 4.1) — a pure refactor with circular-import risk; not worth doing unsupervised. tiered is a fresh additive module instead; the modes/ package now exists so moving the other two later is low-friction.
Corrected the plan's calibrate sweep test — the sketched expectation (threshold 0.98) was wrong; the lowest threshold meeting the target is 0.80 (minimal escalation). Verified by hand.

Built autonomously across several loop ticks, stage-by-stage. Not merged — for your review.

🤖 Generated with Claude Code

The measurable-gate + tiered-router feature both validation studies named as the top gap. Eval and tiering share one dependency, a user-owned labeled gold set, so the gate is measured on the real corpus and the cheap->expensive router is calibrated, never tuned blind. Captures the data model (gold_claim, eval_run, tiered_policy; schema v2), the metrics-library extraction, orc eval run/show/label/import/calibrate, the tiered_verify meta-mode with a configurable cross-family top judge, and the calibrate achievability guard behind the 0.95 default. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Six staged, TDD task groups: metrics library extraction, gold store + schema v2, orc eval run/show, tiered_verify meta-mode, calibrate loop with achievability guard, and docs. Each stage independently testable. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Adds gold_claim, eval_run, and tiered_policy tables for the eval + tiered-verification feature. ensure_schema() re-runs the idempotent CREATE-IF-NOT-EXISTS script when a workspace's stored version lags, so existing v1 workspaces gain the tables the first time newer orc opens them. Replaces the prior unchecked-version handling with a real forward migration. resolve() runs it on open. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
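
The forward-migration idea described above can be sketched roughly as follows. This assumes a SQLite-backed workspace with the version kept in `PRAGMA user_version`; the commit does not say how orc stores it, and the column layouts below are placeholders, not orc's real schema.

```python
import sqlite3

SCHEMA_VERSION = 2

# Idempotent script: safe to re-run on a workspace that already has some tables.
# Column definitions are hypothetical placeholders for illustration.
SCHEMA_SCRIPT = """
CREATE TABLE IF NOT EXISTS gold_claim (
    id INTEGER PRIMARY KEY,
    claim TEXT NOT NULL,
    label TEXT NOT NULL,
    corpus_version TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS eval_run (
    id INTEGER PRIMARY KEY,
    metrics_json TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS tiered_policy (
    id INTEGER PRIMARY KEY,
    threshold REAL NOT NULL,
    target REAL NOT NULL
);
"""


def ensure_schema(conn: sqlite3.Connection) -> bool:
    """Re-run the idempotent script when the stored version lags.

    Returns True if a migration ran, False if the workspace was current.
    """
    current = conn.execute("PRAGMA user_version").fetchone()[0]
    if current >= SCHEMA_VERSION:
        return False
    conn.executescript(SCHEMA_SCRIPT)
    # PRAGMA statements cannot take bound parameters, so the constant is formatted in.
    conn.execute(f"PRAGMA user_version = {SCHEMA_VERSION}")
    conn.commit()
    return True
```

Because every statement is `CREATE TABLE IF NOT EXISTS`, running the script against a v1 workspace only adds the missing tables and leaves existing data untouched.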

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The gold set's producer surface. `orc eval import` seeds from a YAML file (the existing claims fixture format); `orc eval label <run_id>` promotes or corrects a real verdict into gold, pulling the claim and corpus_version straight from the trace so the label is grounded in exactly what orc verified; `orc eval gold list` shows entries and flags stale chunk-level labels (corpus_version behind the workspace). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Verifies every gold claim frozen against its labeled corpus_version, inside a per-claim traced Run tagged with the eval id, so an eval is inspectable claim-by-claim and replayable. Aggregates exact-match judge accuracy, supported-class precision/recall/F1, confidence calibration (reliability bins + ECE), and retrieval recall@k where chunk-level labels exist. Persists to eval_run; load_eval reloads it. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
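
As a rough illustration of the aggregates listed above, the following sketch computes exact-match accuracy, supported-class precision/recall/F1, and recall@k. The function names and the literal label strings "supported" and "unsupported" are assumptions made for the example, not orc's actual API or label vocabulary.

```python
def accuracy(gold, predicted):
    """Exact-match accuracy: fraction of claims where the verdict equals the gold label."""
    if not gold:
        return 0.0
    return sum(1 for g, p in zip(gold, predicted) if g == p) / len(gold)


def supported_prf(gold, predicted, positive="supported"):
    """Precision, recall, and F1 with the supported class treated as positive."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def recall_at_k(retrieved_ids, relevant_ids, k):
    """Share of labeled relevant chunks found in the top-k retrieved chunks.

    Returns None when the claim has no chunk-level labels, so callers can
    skip it rather than count it as zero.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return None
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)
```

Returning `None` for unlabeled claims mirrors the "where chunk-level labels exist" condition: retrieval recall is averaged only over claims that carry such labels.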

orc eval run scores the gate against the gold set and prints judge accuracy, supported-class P/R/F1, confidence ECE, retrieval recall, and a stale-label warning (--json for the full metrics dict). orc eval show reprints a persisted eval report by id. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

mode="tiered" runs a cheap Tier-1 binary judge on every claim and ships its verdict when confidence clears the calibrated escalation threshold; below it, the claim escalates to a stronger Tier-2 evidence judge, optionally from a different model family (top_judge_model) so the escalation judge doesn't share Tier 1's blind spots. The deciding tier, both confidences, and the escalation reason are recorded in the trace. The threshold comes from the workspace's tiered_policy (set by `orc eval calibrate`); with no policy, routing falls back to a conservative default threshold and a warning fires. The mode lives in a new modes/ package and takes the skill instance so it can call self.run per tier (the same pattern as decomposed), which avoids an import cycle. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
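
The routing rule described above might look something like the sketch below. Everything here is illustrative: `Verdict`, `TieredResult`, the callable judges, the `DEFAULT_THRESHOLD` value of 0.9, and the choice of `>=` for "clears the threshold" are assumptions, not orc's implementation.

```python
import warnings
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    label: str
    confidence: float


@dataclass
class TieredResult:
    verdict: Verdict
    deciding_tier: int
    tier1_confidence: float
    tier2_confidence: Optional[float]
    reason: str


DEFAULT_THRESHOLD = 0.9  # hypothetical conservative default, used when no policy exists


def tiered_verify(
    claim: str,
    tier1: Callable[[str], Verdict],
    tier2: Callable[[str], Verdict],
    threshold: Optional[float] = None,
) -> TieredResult:
    """Run the cheap judge first; escalate only when its confidence is below the cutoff."""
    if threshold is None:
        warnings.warn("no tiered policy found; using the default threshold")
        threshold = DEFAULT_THRESHOLD
    first = tier1(claim)
    if first.confidence >= threshold:
        return TieredResult(first, 1, first.confidence, None, "tier1 cleared threshold")
    second = tier2(claim)  # the stronger judge is only paid for on escalation
    reason = f"tier1 confidence {first.confidence:.2f} below threshold {threshold:.2f}"
    return TieredResult(second, 2, first.confidence, second.confidence, reason)
```

Keeping both confidences and the reason on the result is what makes the decision auditable afterwards, which is the role the trace plays in the description above.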

sweep_threshold finds the lowest Tier-1 confidence cutoff whose accepted claims reach the target accuracy — lowest because it accepts the most at Tier 1 and escalates the fewest. When no cutoff reaches the target (Tier 1 caps below it) the result is achievable=False with the best accepted accuracy, so the caller can refuse to write an always-escalate policy. calibrate() runs the gold set through the cheap binary judge and sweeps. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
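
A minimal sketch of such a sweep follows. It is not the project's code: `SweepResult` and its fields are hypothetical, candidate cutoffs are assumed to be the observed Tier-1 confidences, "accepted" is assumed to mean confidence at or above the cutoff, and the fallback returned in the unachievable case (the cutoff with the best accepted accuracy) is one possible choice.

```python
from dataclasses import dataclass


@dataclass
class SweepResult:
    achievable: bool
    threshold: float
    accepted_accuracy: float
    escalation_rate: float


def sweep_threshold(confidences, correct, target=0.95):
    """Find the lowest cutoff whose accepted claims reach the target accuracy."""
    pairs = list(zip(confidences, correct))
    if not pairs:
        raise ValueError("gold set is empty")
    best = None
    # Ascending order, so the first cutoff that meets the target is the lowest one.
    for cutoff in sorted(set(confidences)):
        accepted = [ok for conf, ok in pairs if conf >= cutoff]
        acc = sum(accepted) / len(accepted)  # never empty: cutoff is an observed value
        rate = 1 - len(accepted) / len(pairs)
        if acc >= target:
            return SweepResult(True, cutoff, acc, rate)
        if best is None or acc > best.accepted_accuracy:
            best = SweepResult(False, cutoff, acc, rate)
    # No cutoff reached the target: report the best seen instead of pretending.
    return best
```

With Tier-1 confidences 0.6, 0.7, 0.8, 0.9, 0.95 where only the top three are correct, the cutoffs 0.6 and 0.7 give accepted accuracies of 0.60 and 0.75, and 0.8 is the first to reach 1.0, so the sweep returns 0.8 with 40% of claims escalated.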

Closes the eval<->tiering loop: runs the gold set through Tier 1, sweeps for the lowest threshold meeting --target (default 0.95), writes the tiered_policy that tiered_verify reads, and reports the escalation rate so the cost is visible immediately. When the target is unachievable it says so on stderr (best accuracy + the stored fallback threshold) rather than silently configuring always-escalate. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

README commands + an honesty note that orc eval measures the unsupported-claims coverage row against the user's own labelled gold set (and still cannot measure faithful-but-wrong corpus content); CHANGELOG Unreleased entries; the same one-line caveat mirrored into the competitive and EU AI Act docs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Thormatt and others added 15 commits June 12, 2026 14:09

feat(metrics): confusion + scores library

370bfc5

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

feat(metrics): confidence calibration (reliability bins + ECE)

16620e1

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

refactor(benchmarks): use orc.metrics scoring library

9f451bf

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

feat(eval): gold-set store (add + list)

73caa59

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

feat(eval): tiered_policy store (load/save)

b06c1ef

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Thormatt merged commit 3a5b854 into main Jun 12, 2026
3 checks passed

Thormatt deleted the feat/eval-tiered-verification branch June 12, 2026 21:17

Thormatt merged 15 commits into main from feat/eval-tiered-verification
