feat: orc eval + tiered verification#11

Merged
Thormatt merged 15 commits into main from feat/eval-tiered-verification
Jun 12, 2026

@Thormatt

Owner

Context

Both validation studies (adversarial web research + Delphi panel) named the same top gap: orc's gate was unmeasured on the user's own corpus, and any tiered cost-saving router would be tuned blind. This makes the gate measurable on a user-owned labelled gold set, then uses that measurement to calibrate a cheap→expensive tiered router. Eval and tiering share one dependency — the gold set — by design.

Built to the approved spec (docs/superpowers/specs/2026-06-12-orc-eval-tiered-verification-design.md) and plan (docs/superpowers/plans/2026-06-12-orc-eval-tiered-verification.md), 15 commits across 6 stages, every behavior TDD'd.

What's in it

  • Metrics library (orc.metrics) — confusion/scores extracted from the benchmark (published HaluBench numbers byte-identical, verified) + new confidence calibration (reliability bins + ECE).
  • Gold set — schema v2 (gold_claim/eval_run/tiered_policy + a real forward migration on workspace open). orc eval import seeds from YAML; orc eval label <run_id> promotes/corrects a real verdict into gold, pinned to its corpus version.
  • orc eval run/show — judge accuracy, supported-class P/R/F1, calibration ECE, retrieval recall@k; each gold claim verified frozen against its corpus version in a per-claim traced Run (so an eval is replayable claim-by-claim).
  • tiered_verify (verify --mode tiered) — cheap Tier-1 binary judge on every claim, escalate to a stronger (optionally cross-family top_judge_model) Tier-2 below the calibrated threshold; deciding tier + reason recorded in the trace.
  • orc eval calibrate — derives the lowest threshold meeting --target (default 0.95) from the gold set and writes the policy tiered_verify reads. Achievability guard: if Tier 1 can't reach the target at any cutoff, it says so (best accuracy + fallback threshold) instead of silently configuring always-escalate.
  • Docs — README commands, coverage-ceiling honesty note (eval measures the unsupported-claims row against your labels; still can't catch faithful-but-wrong corpus), CHANGELOG, compliance/positioning caveats.

Testing

uv run pytest -q: 395 passed, 2 skipped, ~3.5s; ruff clean. +28 tests, all RED-first. Schema v2 auto-migrates existing v1 workspaces on open.

Deliberate deviations from the plan

  • Skipped the decomposed/arithmetic modes/ extraction (plan Task 4.1) — a pure refactor with circular-import risk; not worth doing unsupervised. tiered is a fresh additive module instead; the modes/ package now exists so moving the other two later is low-friction.
  • Corrected the plan's calibrate sweep test — the sketched expectation (threshold 0.98) was wrong; the lowest threshold meeting the target is 0.80 (minimal escalation). Verified by hand.

Built autonomously across several loop ticks, stage-by-stage. Not merged — for your review.

🤖 Generated with Claude Code

Thormatt and others added 15 commits June 12, 2026 14:09
The measurable-gate + tiered-router feature both validation studies
named as the top gap. Eval and tiering share one dependency — a
user-owned labeled gold set — so the gate is measured on the real
corpus and the cheap->expensive router is calibrated, never tuned
blind. Captures the data model (gold_claim, eval_run, tiered_policy;
schema v2), the metrics-library extraction, orc eval run/show/label/
import/calibrate, the tiered_verify meta-mode with a configurable
cross-family top judge, and the calibrate achievability guard behind
the 0.95 default.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Six staged, TDD task groups: metrics library extraction, gold store +
schema v2, orc eval run/show, tiered_verify meta-mode, calibrate loop
with achievability guard, and docs. Each stage independently testable.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Adds gold_claim, eval_run, and tiered_policy tables for the eval +
tiered-verification feature. ensure_schema() re-runs the idempotent
CREATE-IF-NOT-EXISTS script when a workspace's stored version lags,
so existing v1 workspaces gain the tables the first time newer orc
opens them. Replaces the prior unchecked-version handling with a real
forward migration. resolve() runs it on open.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
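
The forward migration described above can be sketched roughly as follows. This is a minimal illustration, not orc's actual code: it assumes a SQLite-backed workspace, and the `meta` table, the column layouts, and the `stored_version` helper are hypothetical. Only the table names and the `ensure_schema` name come from the commit message.

```python
import sqlite3

SCHEMA_VERSION = 2

# Hypothetical DDL: the table names come from the PR, the columns are invented.
SCHEMA_SCRIPT = """
CREATE TABLE IF NOT EXISTS meta (key TEXT PRIMARY KEY, value TEXT NOT NULL);
CREATE TABLE IF NOT EXISTS gold_claim (
    id INTEGER PRIMARY KEY,
    claim TEXT NOT NULL,
    label TEXT NOT NULL,
    corpus_version INTEGER NOT NULL
);
CREATE TABLE IF NOT EXISTS eval_run (
    id INTEGER PRIMARY KEY,
    metrics_json TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS tiered_policy (
    id INTEGER PRIMARY KEY,
    threshold REAL NOT NULL,
    target REAL NOT NULL
);
"""


def stored_version(conn):
    """Return the stored schema version, treating a missing record as v1."""
    has_meta = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = 'meta'"
    ).fetchone()
    if not has_meta:
        return 1
    row = conn.execute(
        "SELECT value FROM meta WHERE key = 'schema_version'"
    ).fetchone()
    return int(row[0]) if row else 1


def ensure_schema(conn):
    """Re-run the idempotent script when the stored version lags.

    Returns True if a migration ran, False if the workspace was current.
    """
    if stored_version(conn) >= SCHEMA_VERSION:
        return False
    # Every statement is CREATE ... IF NOT EXISTS, so re-running is safe.
    conn.executescript(SCHEMA_SCRIPT)
    conn.execute(
        "INSERT OR REPLACE INTO meta (key, value) VALUES ('schema_version', ?)",
        (str(SCHEMA_VERSION),),
    )
    conn.commit()
    return True
```

Because the script only creates tables that are missing, opening an already-migrated workspace is a no-op, which is what makes it safe to run on every open.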
The gold set's producer surface. `orc eval import` seeds from a YAML
file (the existing claims fixture format); `orc eval label <run_id>`
promotes or corrects a real verdict into gold, pulling the claim and
corpus_version straight from the trace so the label is grounded in
exactly what orc verified; `orc eval gold list` shows entries and
flags stale chunk-level labels (corpus_version behind the workspace).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
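
As an illustration of the stale-label flag, here is a small sketch. The `GoldEntry` shape and the `is_stale` helper are hypothetical, and it assumes corpus versions are comparable integers; the commit message only states that a chunk-level label is stale when its corpus_version is behind the workspace.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldEntry:
    # Hypothetical record shape; field names are assumptions.
    claim: str
    label: str
    corpus_version: int
    has_chunk_labels: bool


def is_stale(entry, workspace_version):
    """A chunk-level label is stale once the workspace corpus has moved on."""
    return entry.has_chunk_labels and entry.corpus_version < workspace_version


entries = [
    GoldEntry("Revenue grew 12% in Q3", "supported", 3, True),
    GoldEntry("The policy covers flood damage", "unsupported", 5, True),
    GoldEntry("The contract ends in 2027", "supported", 2, False),
]
stale = [e.claim for e in entries if is_stale(e, workspace_version=5)]
```

Verdict-only labels (no chunk ids) are not flagged in this sketch, since they carry no chunk references that a re-indexed corpus could invalidate.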
Verifies every gold claim frozen against its labeled corpus_version,
inside a per-claim traced Run tagged with the eval id, so an eval is
inspectable claim-by-claim and replayable. Aggregates exact-match
judge accuracy, supported-class precision/recall/F1, confidence
calibration (reliability bins + ECE), and retrieval recall@k where
chunk-level labels exist. Persists to eval_run; load_eval reloads it.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
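
For readers unfamiliar with the calibration and retrieval metrics named above, a minimal sketch follows. It uses the standard equal-width-bin definition of expected calibration error and the usual recall@k; the function names, the bin count, and the input shapes are assumptions, not orc.metrics' actual API.

```python
def reliability_bins(confidences, correct, n_bins=10):
    """Group predictions into equal-width confidence bins."""
    bins = [{"count": 0, "conf_sum": 0.0, "hits": 0} for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 belongs in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx]["count"] += 1
        bins[idx]["conf_sum"] += conf
        bins[idx]["hits"] += int(ok)
    return bins


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean gap between accuracy and mean confidence per bin."""
    total = len(confidences)
    if total == 0:
        return 0.0
    ece = 0.0
    for b in reliability_bins(confidences, correct, n_bins):
        if b["count"] == 0:
            continue
        accuracy = b["hits"] / b["count"]
        mean_conf = b["conf_sum"] / b["count"]
        ece += (b["count"] / total) * abs(accuracy - mean_conf)
    return ece


def recall_at_k(retrieved_ids, gold_ids, k):
    """Fraction of gold chunks found in the top-k retrieved chunks.

    Returns None when the claim has no chunk-level labels.
    """
    gold = set(gold_ids)
    if not gold:
        return None
    return len(gold & set(retrieved_ids[:k])) / len(gold)
```

A judge that reports 0.9 confidence but is right three times out of four has an ECE of about 0.15: the gap between what it claims and what it delivers.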
orc eval run scores the gate against the gold set and prints judge
accuracy, supported-class P/R/F1, confidence ECE, retrieval recall,
and a stale-label warning (--json for the full metrics dict). orc eval
show reprints a persisted eval report by id.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
mode="tiered" runs a cheap Tier-1 binary judge on every claim and
ships its verdict when confidence clears the calibrated escalation
threshold; below it, the claim escalates to a stronger Tier-2
evidence judge — optionally a different model family
(top_judge_model) so the escalation judge doesn't share Tier 1's
blind spots. The deciding tier, both confidences, and the escalation
reason are recorded in the trace. The threshold comes from the
workspace's tiered_policy (set by `orc eval calibrate`); with no
policy, routing falls back to a conservative default and a warning fires.

Lives in a new modes/ package and takes the skill instance to call
self.run per tier — same pattern as decomposed — so no import cycle.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
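
The routing rule can be sketched as below. This is an illustration under stated assumptions rather than the real tiered_verify: judges are modeled as plain callables returning `(verdict, confidence)`, the `TieredResult` shape is hypothetical, and treating a confidence equal to the threshold as "clears" is a guess.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical judge signature: claim -> (verdict, confidence in [0, 1]).
Judge = Callable[[str], Tuple[str, float]]


@dataclass(frozen=True)
class TieredResult:
    verdict: str
    deciding_tier: int
    tier1_confidence: float
    tier2_confidence: Optional[float]
    reason: str


def tiered_verify(claim: str, tier1: Judge, tier2: Judge, threshold: float) -> TieredResult:
    """Ship the cheap verdict when it is confident enough, else escalate."""
    verdict1, conf1 = tier1(claim)
    if conf1 >= threshold:  # assumption: equality counts as clearing
        return TieredResult(verdict1, 1, conf1, None, "tier1 confidence met threshold")
    verdict2, conf2 = tier2(claim)
    reason = f"tier1 confidence {conf1:.2f} below threshold {threshold:.2f}"
    return TieredResult(verdict2, 2, conf1, conf2, reason)


# Stub judges standing in for real model calls.
def cheap_judge(claim: str) -> Tuple[str, float]:
    return ("supported", 0.65 if "maybe" in claim else 0.97)


def strong_judge(claim: str) -> Tuple[str, float]:
    return ("unsupported", 0.91)


confident = tiered_verify("The report was filed in May", cheap_judge, strong_judge, 0.8)
escalated = tiered_verify("The report was maybe filed in May", cheap_judge, strong_judge, 0.8)
```

The result keeps both confidences and the reason, so a trace can show which tier decided and why.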
sweep_threshold finds the lowest Tier-1 confidence cutoff whose
accepted claims reach the target accuracy — lowest because it accepts
the most at Tier 1 and escalates the fewest. When no cutoff reaches
the target (Tier 1 caps below it) the result is achievable=False with
the best accepted accuracy, so the caller can refuse to write an
always-escalate policy. calibrate() runs the gold set through the
cheap binary judge and sweeps.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
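
A rough sketch of the sweep follows, assuming the gold set has already been run through Tier 1 and reduced to `(confidence, was_correct)` pairs. The `SweepResult` fields, the use of observed confidences as candidate cutoffs, and the choice of fallback threshold are assumptions; only the `sweep_threshold` name, the lowest-cutoff rule, and the `achievable` flag come from the commit message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SweepResult:
    threshold: float
    achievable: bool
    accepted_accuracy: float
    escalation_rate: float


def sweep_threshold(samples, target=0.95):
    """Find the lowest cutoff whose accepted claims reach the target accuracy.

    samples: list of (tier1_confidence, tier1_was_correct) pairs.
    """
    if not samples:
        raise ValueError("cannot calibrate on an empty gold set")
    total = len(samples)
    best = None
    # Ascending order, so the first cutoff that meets the target is the lowest.
    for cutoff in sorted({conf for conf, _ in samples}):
        accepted = [ok for conf, ok in samples if conf >= cutoff]
        accuracy = sum(accepted) / len(accepted)
        result = SweepResult(
            threshold=cutoff,
            achievable=accuracy >= target,
            accepted_accuracy=accuracy,
            escalation_rate=1 - len(accepted) / total,
        )
        if result.achievable:
            return result
        if best is None or accuracy > best.accepted_accuracy:
            best = result
    # No cutoff reached the target: report the best one, marked unachievable.
    return best


reachable = sweep_threshold(
    [(0.6, False), (0.7, True), (0.8, True), (0.9, True), (0.98, True)]
)
capped = sweep_threshold([(0.5, False), (0.9, True), (0.9, False)])
```

In the first call the cutoff 0.7 is chosen because everything at or above it is correct and only one claim in five escalates. In the second, Tier 1 tops out at 50% accuracy, so the result carries `achievable=False` and the caller can decline to write the policy.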
Closes the eval<->tiering loop: runs the gold set through Tier 1,
sweeps for the lowest threshold meeting --target (default 0.95),
writes the tiered_policy that tiered_verify reads, and reports the
escalation rate so the cost is visible immediately. When the target
is unachievable it says so on stderr (best accuracy + the stored
fallback threshold) rather than silently configuring always-escalate.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
README commands + an honesty note that orc eval measures the
unsupported-claims coverage row against the user's own labelled gold
set (and still cannot measure faithful-but-wrong corpus content);
CHANGELOG Unreleased entries; the same one-line caveat mirrored into
the competitive and EU AI Act docs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@Thormatt Thormatt merged commit 3a5b854 into main Jun 12, 2026
3 checks passed
@Thormatt Thormatt deleted the feat/eval-tiered-verification branch June 12, 2026 21:17