From 3c0e5dec6301d0afaf093e874dcf2c008f48fa7b Mon Sep 17 00:00:00 2001
From: Thormatt
Date: Fri, 12 Jun 2026 14:09:03 -0400
Subject: [PATCH 01/15] docs(spec): orc eval + tiered verification design
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The measurable-gate + tiered-router feature both validation studies named
as the top gap. Eval and tiering share one dependency — a user-owned
labeled gold set — so the gate is measured on the real corpus and the
cheap->expensive router is calibrated, never tuned blind.

Captures the data model (gold_claim, eval_run, tiered_policy; schema v2),
the metrics-library extraction, orc eval run/show/label/
import/calibrate, the tiered_verify meta-mode with a configurable
cross-family top judge, and the calibrate achievability guard behind the
0.95 default.

Co-Authored-By: Claude Fable 5
---
 ...-12-orc-eval-tiered-verification-design.md | 239 ++++++++++++++++++
 1 file changed, 239 insertions(+)
 create mode 100644 docs/superpowers/specs/2026-06-12-orc-eval-tiered-verification-design.md

diff --git a/docs/superpowers/specs/2026-06-12-orc-eval-tiered-verification-design.md b/docs/superpowers/specs/2026-06-12-orc-eval-tiered-verification-design.md
new file mode 100644
index 0000000..d65017b
--- /dev/null
+++ b/docs/superpowers/specs/2026-06-12-orc-eval-tiered-verification-design.md
@@ -0,0 +1,239 @@
+# orc eval + tiered verification — design
+
+**Status:** approved design, pre-implementation
+**Date:** 2026-06-12
+
+## Context
+
+Two independent validation studies of orc's architecture (a 106-agent adversarial
+web-research run and a 4-model Delphi panel) converged on the same top gap: the
+verification gate is **unmeasured on the user's own corpus**, and any tiered
+cost-saving routing built on top of it would be **tuned blind**. The Delphi panel
+was explicit — "an unmeasured verification gate is theater regardless of retrieval
+quality," and a labeled gold set is the *shared* prerequisite for both judge
+calibration and retrieval-recall evaluation. Both studies also recommended tiered
+verification (cheap pass for all claims, expensive escalation only when needed) and
+warned it cannot be tuned without that gold set.
+
+This feature makes the gate measurable on a user-owned, labeled gold set, then uses
+that measurement to calibrate a cheap→expensive tiered router. Eval and tiering
+share one dependency — the gold set — by design.
+
+orc already provides most of the machinery: the benchmark harness has the scoring
+math (`_confusion`/`_scores`/`_per_source_breakdown` in `benchmarks/faithfulness/run.py`),
+`verify_claim.run()` already accepts a per-call `model` and `mode` (so cross-family
+judging and tiered escalation need no core surgery), there is a precedent labeled
+format (`tests/fixtures/claims.yaml`), and the schema is cleanly versioned for an
+additive bump.
+
+## Goals
+
+1. A per-workspace **gold set** of human-confirmed (claim → verdict) labels, seeded
+   by import and grown by promoting real verdicts.
+2. `orc eval` that measures the gate on that gold set: judge accuracy
+   (precision/recall/F1 per mode and domain), confidence **calibration**, and
+   **retrieval recall** where chunk-level labels exist.
+3. A **tiered_verify** strategy: cheap Haiku pass → escalate to an expensive
+   (optionally cross-family) judge when confidence is below a *calibrated*
+   threshold, with the deciding tier recorded in the trace.
+4. `orc eval calibrate` that closes the loop: derive the escalation threshold from
+   the gold set so tiering is never tuned blind.
+
+## Non-goals (this iteration)
+
+- Corpus provenance/freshness controls (the "faithful-but-wrong corpus content"
+  failure mode no gold set can catch — documented, not built).
+- A hosted/web calibration dashboard.
+- Automatic gold-set generation (gold entries are always human-confirmed).
+
+## Data model
+
+New per-workspace table, additive **schema v2** (current is v1; `db.py` stamps
+`SCHEMA_VERSION` in `schema_meta`):
+
+```sql
+CREATE TABLE gold_claim (
+    gold_id            TEXT PRIMARY KEY,      -- ULID
+    workspace          TEXT NOT NULL,
+    claim              TEXT NOT NULL,
+    expected_label     TEXT NOT NULL,         -- supported|contradicted|not_found|partial
+    corpus_version     INTEGER NOT NULL,      -- snapshot the label is valid against
+    relevant_chunk_ids TEXT,                  -- JSON list, nullable (retrieval-recall gold)
+    source             TEXT NOT NULL,         -- import|promoted
+    source_run_id      TEXT,                  -- the run a promoted label came from
+    note               TEXT,
+    added_at           TEXT NOT NULL,
+    added_by           TEXT
+);
+CREATE INDEX idx_gold_claim_workspace ON gold_claim(workspace);
+```
+
+**Corpus-version pinning is load-bearing.** Chunk IDs change on re-ingest, so
+`relevant_chunk_ids` are valid only for the `corpus_version` they were labeled
+against. Retrieval-recall eval therefore runs **frozen** against each entry's
+`corpus_version` (reusing the replay machinery — `bm25_search`/`retrieve()` already
+accept `corpus_version`). Judge-accuracy labels (the verdict) survive re-ingest;
+only chunk-relevance is version-bound. Eval flags entries whose `corpus_version`
+lags the workspace's current version so stale chunk labels are visible, not silent.
+
+The schema bump needs a real migration step (the current `db.py` has only a broad
+`suppress(Exception)` ALTER): add a minimal forward migration that creates
+`gold_claim` on open when `schema_version < 2`, then re-stamps. (This also retires
+an existing low-severity finding about unchecked schema version.)
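
Editor's sketch (not part of the patch): the forward migration described in the Data model section could look roughly like the following. It is a minimal illustration under stated assumptions, not the project's actual `db.py`. The `schema_meta(key, value)` layout, the `'schema_version'` key name, and the helper names `read_schema_version` and `migrate` are all hypothetical, and `IF NOT EXISTS` is added here so the step can be re-run safely.

```python
import sqlite3

SCHEMA_VERSION = 2  # target version after this migration

# DDL copied from the spec; IF NOT EXISTS is an addition so re-running is harmless.
GOLD_CLAIM_DDL = """
CREATE TABLE IF NOT EXISTS gold_claim (
    gold_id            TEXT PRIMARY KEY,
    workspace          TEXT NOT NULL,
    claim              TEXT NOT NULL,
    expected_label     TEXT NOT NULL,
    corpus_version     INTEGER NOT NULL,
    relevant_chunk_ids TEXT,
    source             TEXT NOT NULL,
    source_run_id      TEXT,
    note               TEXT,
    added_at           TEXT NOT NULL,
    added_by           TEXT
);
CREATE INDEX IF NOT EXISTS idx_gold_claim_workspace ON gold_claim(workspace);
"""


def read_schema_version(conn: sqlite3.Connection) -> int:
    """Return the stamped version; assumes a key/value schema_meta table (hypothetical layout)."""
    row = conn.execute(
        "SELECT value FROM schema_meta WHERE key = 'schema_version'"
    ).fetchone()
    return int(row[0]) if row else 1  # an unstamped database is treated as v1


def migrate(conn: sqlite3.Connection) -> int:
    """Bring the database forward to SCHEMA_VERSION and return the resulting version."""
    current = read_schema_version(conn)
    if current > SCHEMA_VERSION:
        # Explicit check instead of a broad suppress(Exception): fail loudly on newer files.
        raise RuntimeError(f"database schema v{current} is newer than supported v{SCHEMA_VERSION}")
    if current < 2:
        conn.executescript(GOLD_CLAIM_DDL)  # create the gold_claim table and its index
        with conn:  # commit the re-stamp, or roll it back on error
            conn.execute(
                "INSERT OR REPLACE INTO schema_meta (key, value) VALUES ('schema_version', ?)",
                (str(SCHEMA_VERSION),),
            )
    return read_schema_version(conn)
```

Calling `migrate(conn)` right after opening a workspace database would create `gold_claim` once and leave an already-migrated file untouched, which is the "on open when `schema_version < 2`, then re-stamps" behavior the spec asks for.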
+
+## Metrics library (extraction)
+
+Move the benchmark-private scoring into an importable `src/orc/metrics/` package so
+both `benchmarks/` and `orc eval` share one implementation:
+
+- `metrics/confusion.py` — `confusion(items, *, predicted, expected, positive_label) -> Confusion`
+- `metrics/scores.py` — `scores(confusion) -> Scores` (accuracy, precision, recall, F1)
+- `metrics/breakdown.py` — `per_group(items, key, ...) -> dict[str, GroupResult]`
+- `metrics/calibration.py` — **new**: `reliability_bins(items, n_bins=10) -> list[Bin]`
+  and `expected_calibration_error(bins) -> float`. A bin holds
+  (confidence range, count, predicted_mean, actual_accuracy).
+
+`benchmarks/faithfulness/run.py` is updated to import these instead of its private
+copies (no behavioral change; the published numbers must stay identical — pinned by
+the existing benchmark tests).
+
+## Gold-set CLI
+
+- `orc eval import -w WS` — seed from the `claims.yaml` format
+  (`id`, `text`, `expected`, optional `relevant_chunk_ids`, optional `note`).
+  Stamps `source=import` and the workspace's current `corpus_version`.
+- `orc eval label --verdict