From 3c0e5dec6301d0afaf093e874dcf2c008f48fa7b Mon Sep 17 00:00:00 2001
From: Thormatt <thormatt@gmail.com>
Date: Fri, 12 Jun 2026 14:09:03 -0400
Subject: [PATCH 01/15] docs(spec): orc eval + tiered verification design
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Design for the measurable-gate + tiered-router feature that both
validation studies named as the top gap. Eval and tiering share one
dependency, a user-owned labeled gold set, so the gate is measured on
the real corpus and the cheap->expensive router is calibrated, never
tuned blind. Captures the data model (gold_claim, eval_run,
tiered_policy; schema v2), the metrics-library extraction, orc eval
run/show/label/import/calibrate, the tiered_verify meta-mode with a
configurable cross-family top judge, and the calibrate achievability
guard behind the 0.95 default.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---
 ...-12-orc-eval-tiered-verification-design.md | 239 ++++++++++++++++++
 1 file changed, 239 insertions(+)
 create mode 100644 docs/superpowers/specs/2026-06-12-orc-eval-tiered-verification-design.md

diff --git a/docs/superpowers/specs/2026-06-12-orc-eval-tiered-verification-design.md b/docs/superpowers/specs/2026-06-12-orc-eval-tiered-verification-design.md
new file mode 100644
index 0000000..d65017b
--- /dev/null
+++ b/docs/superpowers/specs/2026-06-12-orc-eval-tiered-verification-design.md
@@ -0,0 +1,239 @@
+# orc eval + tiered verification — design
+
+**Status:** approved design, pre-implementation
+**Date:** 2026-06-12
+
+## Context
+
+Two independent validation studies of orc's architecture (a 106-agent adversarial
+web-research run and a 4-model Delphi panel) converged on the same top gap: the
+verification gate is **unmeasured on the user's own corpus**, and any tiered
+cost-saving routing built on top of it would be **tuned blind**. The Delphi panel
+was explicit — "an unmeasured verification gate is theater regardless of retrieval
+quality," and a labeled gold set is the *shared* prerequisite for both judge
+calibration and retrieval-recall evaluation. Both studies also recommended tiered
+verification (cheap pass for all claims, expensive escalation only when needed) and
+warned it cannot be tuned without that gold set.
+
+This feature makes the gate measurable on a user-owned, labeled gold set, then uses
+that measurement to calibrate a cheap→expensive tiered router. Eval and tiering
+share one dependency — the gold set — by design.
+
+orc already provides most of the machinery: the benchmark harness has the scoring
+math (`_confusion`/`_scores`/`_per_source_breakdown` in `benchmarks/faithfulness/run.py`),
+`verify_claim.run()` already accepts a per-call `model` and `mode` (so cross-family
+judging and tiered escalation need no core surgery), there is a precedent labeled
+format (`tests/fixtures/claims.yaml`), and the schema is cleanly versioned for an
+additive bump.
+
+## Goals
+
+1. A per-workspace **gold set** of human-confirmed (claim → verdict) labels, seeded
+   by import and grown by promoting real verdicts.
+2. `orc eval` that measures the gate on that gold set: judge accuracy
+   (precision/recall/F1 per mode and domain), confidence **calibration**, and
+   **retrieval recall** where chunk-level labels exist.
+3. A **tiered_verify** strategy: cheap Haiku pass → escalate to an expensive
+   (optionally cross-family) judge when confidence is below a *calibrated*
+   threshold, with the deciding tier recorded in the trace.
+4. `orc eval calibrate` that closes the loop: derive the escalation threshold from
+   the gold set so tiering is never tuned blind.
+
+## Non-goals (this iteration)
+
+- Corpus provenance/freshness controls (the "faithful-but-wrong corpus content"
+  failure mode no gold set can catch — documented, not built).
+- A hosted/web calibration dashboard.
+- Automatic gold-set generation (gold entries are always human-confirmed).
+
+## Data model
+
+New per-workspace tables (`gold_claim` here; `eval_run` and `tiered_policy` below),
+additive **schema v2** (current is v1; `db.py` stamps `SCHEMA_VERSION` in `schema_meta`):
+
+```sql
+CREATE TABLE gold_claim (
+    gold_id            TEXT PRIMARY KEY,          -- ULID
+    workspace          TEXT NOT NULL,
+    claim              TEXT NOT NULL,
+    expected_label     TEXT NOT NULL,             -- supported|contradicted|not_found|partial
+    corpus_version     INTEGER NOT NULL,          -- snapshot the label is valid against
+    relevant_chunk_ids TEXT,                      -- JSON list, nullable (retrieval-recall gold)
+    source             TEXT NOT NULL,             -- import|promoted
+    source_run_id      TEXT,                      -- the run a promoted label came from
+    note               TEXT,
+    added_at           TEXT NOT NULL,
+    added_by           TEXT
+);
+CREATE INDEX idx_gold_claim_workspace ON gold_claim(workspace);
+```
+
+**Corpus-version pinning is load-bearing.** Chunk IDs change on re-ingest, so
+`relevant_chunk_ids` are valid only for the `corpus_version` they were labeled
+against. Retrieval-recall eval therefore runs **frozen** against each entry's
+`corpus_version` (reusing the replay machinery — `bm25_search`/`retrieve()` already
+accept `corpus_version`). Judge-accuracy labels (the verdict) survive re-ingest;
+only chunk-relevance is version-bound. Eval flags entries whose `corpus_version`
+lags the workspace's current version so stale chunk labels are visible, not silent.
+
+The schema bump needs a real migration step (the current `db.py` has only a broad
+`suppress(Exception)` ALTER): add a minimal forward migration that creates
+`gold_claim` on open when `schema_version < 2`, then re-stamps. (This also retires
+an existing low-severity finding about unchecked schema version.)
+
+## Metrics library (extraction)
+
+Move the benchmark-private scoring into an importable `src/orc/metrics/` package so
+both `benchmarks/` and `orc eval` share one implementation:
+
+- `metrics/confusion.py` — `confusion(items, *, predicted, expected, positive_label) -> Confusion`
+- `metrics/scores.py` — `scores(confusion) -> Scores` (accuracy, precision, recall, F1)
+- `metrics/breakdown.py` — `per_group(items, key, ...) -> dict[str, GroupResult]`
+- `metrics/calibration.py` — **new**: `reliability_bins(items, n_bins=10) -> list[Bin]`
+  and `expected_calibration_error(bins) -> float`. A bin holds
+  (confidence range, count, predicted_mean, actual_accuracy).
+
+`benchmarks/faithfulness/run.py` is updated to import these instead of its private
+copies (no behavioral change; the published numbers must stay identical — pinned by
+the existing benchmark tests).
+
+## Gold-set CLI
+
+- `orc eval import <file.yaml> -w WS` — seed from the `claims.yaml` format
+  (`id`, `text`, `expected`, optional `relevant_chunk_ids`, optional `note`).
+  Stamps `source=import` and the workspace's current `corpus_version`.
+- `orc eval label <run_id> --verdict <label> [--relevant <chunk_id>...] [--note ...]`
+  — promote/correct a real verdict into gold. Pulls `claim` and `corpus_version`
+  from the trace (`load_trace`), stamps `source=promoted`, `source_run_id`.
+- `orc eval gold list -w WS [--json]` — list entries (with stale-version flag).
+
+## `orc eval run`
+
+For every gold claim in the workspace, verify against its pinned `corpus_version`
+and compute:
+
+- **Judge accuracy** — `confusion` → `scores`, broken down per mode and per domain.
+  The 4-label verdict is mapped to correct/incorrect against `expected_label`
+  (exact match; `partial` is its own class, not folded into FAIL as the benchmark
+  does — eval is about the gate's own labels, not a binary PASS/FAIL task).
+- **Calibration** — reliability bins + Expected Calibration Error over predicted
+  confidence. This is the artifact that surfaces the escalation threshold.
+- **Retrieval recall@k** — for entries with `relevant_chunk_ids`, the fraction of
+  labeled-relevant chunks that frozen retrieval surfaced in the top k.
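+
+The recall@k definition above can be sketched as follows. This is a minimal
+illustration, not orc's implementation: `recall_at_k` and its parameters are
+hypothetical names, and returning `None` for unlabeled entries is an assumption.
+
+```python
+def recall_at_k(relevant: list[str], retrieved: list[str], *, k: int) -> float | None:
+    """Fraction of labeled-relevant chunk IDs found in the top-k retrieved IDs.
+
+    Returns None when the entry carries no chunk labels (assumed behavior), so
+    the caller can leave it out of the average.
+    """
+    wanted = set(relevant)
+    if not wanted:
+        return None
+    top_k = set(retrieved[:k])
+    return len(wanted & top_k) / len(wanted)
+
+
+# Two of the four labeled chunks appear in the top 3 retrieved IDs.
+print(recall_at_k(["c1", "c2", "c3", "c4"], ["c2", "x", "c1", "y", "c3"], k=3))  # 0.5
+```
+
+Averaging only the non-`None` values keeps entries without chunk labels out of the
+metric instead of scoring them as hits or misses.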
+
+Each eval is itself auditable and replayable: an `eval_run` row
+(`eval_id`, `workspace`, `created_at`, `config_json`, `metrics_json`) plus the
+per-claim verify Runs it spawned — each a normal trace, tagged with `eval_id` in its
+inputs. `orc eval show <eval_id> [--json]` prints the report (console table by
+default). The per-claim runs mean an eval can be inspected claim-by-claim and
+replayed like any other orc run.
+
+## Tiered verification
+
+A new `tiered_verify` meta-strategy, a sibling to the existing `decomposed` and
+`arithmetic` meta-modes. **Refactor note:** `verify_claim.py` is ~800 lines and
+already dispatches `decomposed`/`arithmetic` to internal `_run_*` helpers; extract
+those plus the new tiered strategy into a `directives/research/skills/modes/`
+submodule (`modes/decomposed.py`, `modes/arithmetic.py`, `modes/tiered.py`) so the
+core `verify_claim.run()` stays a thin dispatcher. This is a targeted improvement of
+code being touched, not an unrelated refactor.
+
+Tier policy:
+
+- **Tier 1** — Haiku, binary mode, on every claim (cheap).
+- **Escalate to Tier 2** — Sonnet, evidence mode + decomposed — when Tier-1
+  confidence `<` the calibrated threshold.
+- **Top-tier judge model is configurable.** Default Sonnet (Anthropic). A user may
+  set a true cross-family judge (e.g. a GPT/Gemini/Llama model via OpenRouter) to
+  break the self-consistency bias both studies flagged. orc already routes any model
+  string and handles OpenRouter, so this is configuration, not new plumbing.
+- The trace records **which tier decided and why**: both verdicts, the Tier-1
+  confidence, the threshold, and the escalation reason. Tiering is auditable.
+
+`tiered_verify` is reachable via `mode="tiered"` and can be wired into
+`route_to_mode` for a domain that should default to it.
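+
+The tier policy above amounts to a single comparison plus a record of what happened.
+A rough sketch, assuming each judge is a callable that returns `(verdict, confidence)`;
+`Judge`, `TieredDecision`, and `tiered_decide` are hypothetical names for illustration
+and do not describe the real `verify_claim` interface.
+
+```python
+from dataclasses import dataclass
+from typing import Callable
+
+# Hypothetical judge signature: claim -> (verdict, confidence).
+Judge = Callable[[str], tuple[str, float]]
+
+
+@dataclass(frozen=True)
+class TieredDecision:
+    verdict: str
+    decided_by: str  # "tier1" or "tier2"
+    tier1_verdict: str
+    tier1_confidence: float
+    tier2_verdict: str | None
+    threshold: float
+    reason: str
+
+
+def tiered_decide(
+    claim: str, *, tier1: Judge, tier2: Judge, threshold: float
+) -> TieredDecision:
+    t1_verdict, t1_conf = tier1(claim)
+    if t1_conf >= threshold:
+        # Cheap verdict is trusted; the expensive judge is never called.
+        return TieredDecision(
+            verdict=t1_verdict, decided_by="tier1", tier1_verdict=t1_verdict,
+            tier1_confidence=t1_conf, tier2_verdict=None, threshold=threshold,
+            reason="tier-1 confidence met threshold",
+        )
+    # Escalate when tier-1 confidence < threshold; keep both verdicts for the trace.
+    t2_verdict, _ = tier2(claim)
+    return TieredDecision(
+        verdict=t2_verdict, decided_by="tier2", tier1_verdict=t1_verdict,
+        tier1_confidence=t1_conf, tier2_verdict=t2_verdict, threshold=threshold,
+        reason="tier-1 confidence below threshold",
+    )
+```
+
+Passing the judges in as callables keeps the top-judge model a configuration choice,
+and lets a test assert which tier decided without any network call.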
+
+### The calibration loop
+
+`orc eval calibrate -w WS [--target 0.95] [--tier1-model ...] [--top-judge ...]`:
+
+1. Run the gold set through **Tier 1 only** (Haiku binary), recording each verdict's
+   confidence and correctness.
+2. Sweep the confidence threshold; find the lowest cutoff at which Tier-1-accepted
+   claims reach the `--target` accuracy (default **0.95**).
+3. **Achievability guard:** if no cutoff reaches the target (Tier-1 accuracy caps
+   below it at every confidence level), report that plainly — *"Tier 1 cannot reach
+   0.95 at any cutoff on this gold set (max 0.91 at conf≥0.97); escalating all
+   claims — lower --target or improve the gold set"* — rather than silently writing
+   an always-escalate policy.
+4. **Always report the resulting escalation rate** (fraction of gold claims that
+   would escalate at the chosen threshold) so the cost implication is visible
+   immediately.
+5. Write a `tiered_policy` row into `orc.db` (one row per workspace, replacing any
+   prior policy — keyed by workspace, same store as `gold_claim`/`eval_run` so it
+   travels with the workspace backup and is queryable):
+   `{workspace, tier1_model, tier2_model, top_judge_model?, escalation_threshold,
+   target, calibrated_at, calibrated_against_eval_id, n_gold}`. (The effects
+   allow-list stays in `config.toml`; calibration state is data, not policy a human
+   hand-edits, so it lives in the DB.)
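+
+Steps 2 to 4 can be sketched as one sweep over the observed Tier-1 confidences. This
+is an illustration under stated assumptions, not the planned implementation:
+`SweepResult` and `sweep_threshold` are hypothetical names, candidate cutoffs are
+taken to be the confidences seen in the gold set, and "accepted" means
+`confidence >= cutoff`.
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass(frozen=True)
+class SweepResult:
+    threshold: float | None  # None: no cutoff reaches the target
+    accepted_accuracy: float  # at `threshold`, or the best seen if unreachable
+    escalation_rate: float  # fraction of gold claims that would escalate
+
+
+def sweep_threshold(
+    results: list[tuple[float, bool]], *, target: float = 0.95
+) -> SweepResult:
+    """results: (tier-1 confidence, tier-1 verdict was correct) per gold claim."""
+    if not results:
+        raise ValueError("gold set is empty")
+    best_acc = 0.0
+    # Try cutoffs lowest first so the first hit is the lowest qualifying cutoff.
+    for cutoff in sorted({conf for conf, _ in results}):
+        accepted = [ok for conf, ok in results if conf >= cutoff]
+        acc = sum(accepted) / len(accepted)
+        if acc >= target:
+            escalated = len(results) - len(accepted)
+            return SweepResult(cutoff, acc, escalated / len(results))
+        best_acc = max(best_acc, acc)
+    # Achievability guard: report the failure instead of a silent policy.
+    return SweepResult(None, best_acc, 1.0)
+
+
+gold = [(0.99, True), (0.97, True), (0.90, False), (0.85, True), (0.60, False)]
+print(sweep_threshold(gold))  # threshold=0.97, accepted_accuracy=1.0, escalation_rate=0.6
+```
+
+A `threshold` of `None` is the signal to print the "cannot reach target" message with
+`accepted_accuracy` as the observed maximum.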
+
+`tiered_verify` reads `tiered_policy`; if absent, it falls back to a documented
+default threshold and warns once that tiering is uncalibrated. This is the loop the
+studies demanded: tiering is tuned on the gold set, never blind, and the policy
+records which eval calibrated it.
+
+**Why 0.95:** orc is verification-first, so the default leans toward quality (only
+auto-accept a cheap verdict when it is ~95% trustworthy). The achievability guard
+plus escalation-rate reporting prevent the over-escalation failure mode. `--target`
+lets cost-sensitive users dial it down.
+
+## Honesty / coverage ceiling
+
+The coverage-ceiling docs (README, `docs/compliance/eu-ai-act.md`,
+`docs/positioning/competitive.md`) gain a sentence: `orc eval` measures judge
+accuracy and retrieval recall **against the user's own labels** — it quantifies how
+well the gate matches the gold set, and cannot detect faithful-but-wrong corpus
+content (the third failure-mode class; no gold set can). A stale gold set produces
+confident-but-miscalibrated gating, so eval flags stale-corpus-version entries and
+`tiered_policy` records when it was last calibrated.
+
+## Testing
+
+All TDD, no network, against a deterministic fake LLM (the existing `tests/_fake_llm`
+pattern) and `FakeEmbedder` where retrieval is involved:
+
+- `metrics/` — hand-computed confusion matrices, scores, reliability bins, and a
+  known-ECE fixture. Benchmark tests must stay green after the extraction (proves no
+  behavioral drift).
+- gold CLI — import round-trip, label/promote pulls claim+corpus_version from a real
+  trace, stale-version flagging, `--json`.
+- `orc eval run` — judge accuracy on a scripted gold set (fake verdicts), calibration
+  bins, retrieval recall@k with `relevant_chunk_ids`, eval_run + per-claim run trace
+  tagging, frozen corpus_version pinning.
+- `tiered_verify` — Tier-1 accept above threshold, escalation below it, top-judge
+  model override reaches the Tier-2 call (assert the model string), trace records
+  both verdicts + escalation reason, uncalibrated fallback warns.
+- `orc eval calibrate` — threshold sweep finds the right cutoff on a scripted
+  reliability curve, the achievability guard fires when the target is unreachable,
+  escalation rate reported, `tiered_policy` persisted and read back by `tiered_verify`.
+
+## Build order (staged plan)
+
+1. **Metrics extraction** — `src/orc/metrics/` (confusion/scores/breakdown +
+   calibration); rewire `benchmarks/` to import it; benchmark tests stay green.
+2. **Gold store** — `gold_claim` table + schema-v2 migration + gold CLI
+   (import / label / list).
+3. **`orc eval run` / `show`** — judge accuracy + calibration + retrieval recall;
+   `eval_run` table + per-claim run tagging.
+4. **`tiered_verify`** — `modes/` extraction + the tiered strategy + trace records.
+5. **`orc eval calibrate`** — threshold sweep + achievability guard + `tiered_policy`
+   + wire `tiered_verify` to read it.
+6. **Docs** — coverage-ceiling sentence, CHANGELOG Unreleased entries, README
+   commands.
+
+Each stage is independently testable and shippable; stages 4–5 depend on 1–3.
+
+## Open questions
+
+None blocking. The `--target` default (0.95) is decided with the achievability guard
+above; the cross-family top judge is opt-in configuration with an Anthropic default.

From 727df3d515f49edb9ef97d0e7a01df14b2668788 Mon Sep 17 00:00:00 2001
From: Thormatt <thormatt@gmail.com>
Date: Fri, 12 Jun 2026 14:17:04 -0400
Subject: [PATCH 02/15] docs(plan): orc eval + tiered verification
 implementation plan

Six staged, TDD task groups: metrics library extraction, gold store +
schema v2, orc eval run/show, tiered_verify meta-mode, calibrate loop
with achievability guard, and docs. Each stage independently testable.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---
 ...2026-06-12-orc-eval-tiered-verification.md | 875 ++++++++++++++++++
 1 file changed, 875 insertions(+)
 create mode 100644 docs/superpowers/plans/2026-06-12-orc-eval-tiered-verification.md

diff --git a/docs/superpowers/plans/2026-06-12-orc-eval-tiered-verification.md b/docs/superpowers/plans/2026-06-12-orc-eval-tiered-verification.md
new file mode 100644
index 0000000..457212c
--- /dev/null
+++ b/docs/superpowers/plans/2026-06-12-orc-eval-tiered-verification.md
@@ -0,0 +1,875 @@
+# orc eval + tiered verification — Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Make orc's verification gate measurable on a user-owned labeled gold set, then use that measurement to calibrate a cheap→expensive tiered verification router.
+
+**Architecture:** A new `src/orc/metrics/` library (extracted from the benchmark) computes confusion/scores/calibration. A per-workspace `gold_claim` table (schema v2) stores human-confirmed labels. `orc eval run` scores the gate against the gold set; `orc eval calibrate` derives a tiered escalation threshold; a `tiered_verify` meta-mode escalates cheap→expensive using that threshold. Eval runs are themselves traced and replayable.
+
+**Tech Stack:** Python 3.11+, click CLI, SQLite (per-workspace `orc.db`), the existing fake-LLM test harness (`tests/_fake_llm.py`), pytest.
+
+**Spec:** `docs/superpowers/specs/2026-06-12-orc-eval-tiered-verification-design.md`
+
+**Conventions (apply to every task):**
+- TDD: write the failing test, run it, confirm it fails for the right reason, implement minimally, confirm green, commit.
+- Run the full suite (`uv run pytest -q`) and `uv run ruff check src tests` before each commit; both must be clean.
+- Frozen dataclasses, keyword-only kwargs, docstrings explaining WHY. Commit subjects ≤50 chars, body explains why, end with `Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>`.
+- Branch: `feat/eval-tiered-verification` (already created).
+
+---
+
+## Stage 1 — Metrics library
+
+Extract the benchmark's private scoring into an importable package and add calibration. Verdicts here use the 4-label vocabulary directly (`supported`/`contradicted`/`not_found`/`partial`); correctness is exact match of predicted vs expected label.
+
+### Task 1.1: Confusion + scores
+
+**Files:**
+- Create: `src/orc/metrics/__init__.py`
+- Create: `src/orc/metrics/scoring.py`
+- Test: `tests/unit/test_metrics_scoring.py`
+
+- [ ] **Step 1: Write the failing test**
+
+```python
+# tests/unit/test_metrics_scoring.py
+from orc.metrics.scoring import LabeledResult, confusion, scores
+
+
+def test_confusion_counts_exact_label_matches() -> None:
+    results = [
+        LabeledResult(predicted="supported", expected="supported"),
+        LabeledResult(predicted="supported", expected="not_found"),
+        LabeledResult(predicted="not_found", expected="not_found"),
+        LabeledResult(predicted="not_found", expected="supported"),
+        LabeledResult(predicted=None, expected="supported"),  # errored, skipped
+    ]
+    cm = confusion(results, positive="supported")
+    assert cm == {"tp": 1, "fp": 1, "tn": 1, "fn": 1}
+
+
+def test_scores_precision_recall_f1_accuracy() -> None:
+    s = scores({"tp": 3, "fp": 1, "tn": 4, "fn": 2})
+    assert s["accuracy"] == 0.7
+    assert s["precision"] == 0.75
+    assert round(s["recall"], 4) == 0.6
+    assert round(s["f1"], 4) == 0.6667
+
+
+def test_scores_empty_is_zero() -> None:
+    assert scores({"tp": 0, "fp": 0, "tn": 0, "fn": 0})["f1"] == 0.0
+```
+
+- [ ] **Step 2: Run test to verify it fails**
+
+Run: `uv run pytest tests/unit/test_metrics_scoring.py -q`
+Expected: FAIL — `ModuleNotFoundError: No module named 'orc.metrics'`
+
+- [ ] **Step 3: Write minimal implementation**
+
+```python
+# src/orc/metrics/__init__.py
+"""Scoring + calibration metrics shared by benchmarks and `orc eval`."""
+```
+
+```python
+# src/orc/metrics/scoring.py
+"""Confusion matrix and precision/recall/F1 over exact-label predictions.
+
+Positive class is caller-chosen (e.g. "supported"); everything else is the
+negative class. Predictions of None (the claim errored) are skipped, not
+counted as wrong — an eval distinguishes "judged incorrectly" from "could not
+judge"."""
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+
+
+@dataclass(frozen=True)
+class LabeledResult:
+    predicted: str | None
+    expected: str
+
+
+def confusion(results: list[LabeledResult], *, positive: str) -> dict[str, int]:
+    tp = fp = tn = fn = 0
+    for r in results:
+        if r.predicted is None:
+            continue
+        pred_pos = r.predicted == positive
+        exp_pos = r.expected == positive
+        if pred_pos and exp_pos:
+            tp += 1
+        elif pred_pos and not exp_pos:
+            fp += 1
+        elif not pred_pos and not exp_pos:
+            tn += 1
+        else:
+            fn += 1
+    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}
+
+
+def scores(cm: dict[str, int]) -> dict[str, float]:
+    tp, fp, tn, fn = cm["tp"], cm["fp"], cm["tn"], cm["fn"]
+    n = tp + fp + tn + fn
+    if n == 0:
+        return {"accuracy": 0.0, "precision": 0.0, "recall": 0.0, "f1": 0.0}
+    precision = tp / (tp + fp) if (tp + fp) else 0.0
+    recall = tp / (tp + fn) if (tp + fn) else 0.0
+    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
+    return {"accuracy": (tp + tn) / n, "precision": precision, "recall": recall, "f1": f1}
+```
+
+- [ ] **Step 4: Run test to verify it passes**
+
+Run: `uv run pytest tests/unit/test_metrics_scoring.py -q`
+Expected: PASS (3 passed)
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add src/orc/metrics tests/unit/test_metrics_scoring.py
+git commit -m "feat(metrics): confusion + scores library"
+```
+
+### Task 1.2: Calibration (reliability bins + ECE)
+
+**Files:**
+- Create: `src/orc/metrics/calibration.py`
+- Test: `tests/unit/test_metrics_calibration.py`
+
+- [ ] **Step 1: Write the failing test**
+
+```python
+# tests/unit/test_metrics_calibration.py
+from orc.metrics.calibration import ConfidenceResult, expected_calibration_error, reliability_bins
+
+
+def test_reliability_bins_group_by_confidence_decile() -> None:
+    # Two claims at ~0.95 (one right), two at ~0.55 (both right).
+    results = [
+        ConfidenceResult(confidence=0.95, correct=True),
+        ConfidenceResult(confidence=0.92, correct=False),
+        ConfidenceResult(confidence=0.55, correct=True),
+        ConfidenceResult(confidence=0.51, correct=True),
+    ]
+    bins = reliability_bins(results, n_bins=10)
+    top = next(b for b in bins if b.lo <= 0.95 < b.hi or b.hi == 1.0 and b.lo <= 0.95)
+    assert top.count == 2
+    assert top.accuracy == 0.5
+    assert round(top.mean_confidence, 3) == 0.935
+
+
+def test_ece_is_weighted_gap_between_confidence_and_accuracy() -> None:
+    # Perfectly calibrated: confidence == accuracy in every bin -> ECE 0.
+    perfect = [ConfidenceResult(confidence=1.0, correct=True) for _ in range(4)]
+    assert expected_calibration_error(reliability_bins(perfect, n_bins=10)) == 0.0
+    # Overconfident: conf 1.0 but half wrong -> ECE 0.5.
+    over = (
+        [ConfidenceResult(confidence=1.0, correct=True) for _ in range(2)]
+        + [ConfidenceResult(confidence=1.0, correct=False) for _ in range(2)]
+    )
+    assert expected_calibration_error(reliability_bins(over, n_bins=10)) == 0.5
+```
+
+- [ ] **Step 2: Run test to verify it fails**
+
+Run: `uv run pytest tests/unit/test_metrics_calibration.py -q`
+Expected: FAIL — `ModuleNotFoundError: No module named 'orc.metrics.calibration'`
+
+- [ ] **Step 3: Write minimal implementation**
+
+```python
+# src/orc/metrics/calibration.py
+"""Confidence calibration: do the gate's confidence scores mean what they say?
+
+A well-calibrated judge that reports 0.9 confidence is right ~90% of the time.
+reliability_bins groups predictions by confidence and reports actual accuracy
+per bin; ECE is the count-weighted average gap between stated confidence and
+realized accuracy. This is the signal `orc eval calibrate` uses to choose a
+tier-1 escalation threshold."""
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+
+
+@dataclass(frozen=True)
+class ConfidenceResult:
+    confidence: float
+    correct: bool
+
+
+@dataclass(frozen=True)
+class Bin:
+    lo: float
+    hi: float
+    count: int
+    mean_confidence: float
+    accuracy: float
+
+
+def reliability_bins(results: list[ConfidenceResult], *, n_bins: int = 10) -> list[Bin]:
+    out: list[Bin] = []
+    for i in range(n_bins):
+        # i / n_bins, not i * 0.1: 9 * 0.1 > 0.9 would put confidence 0.9 in the bin below.
+        lo = i / n_bins
+        hi = 1.0 if i == n_bins - 1 else (i + 1) / n_bins
+        # Top bin is closed on the right so confidence==1.0 lands somewhere.
+        members = [
+            r for r in results
+            if r.confidence >= lo and (r.confidence < hi or (hi == 1.0 and r.confidence <= hi))
+        ]
+        if not members:
+            continue
+        count = len(members)
+        out.append(
+            Bin(
+                lo=lo,
+                hi=hi,
+                count=count,
+                mean_confidence=sum(r.confidence for r in members) / count,
+                accuracy=sum(1 for r in members if r.correct) / count,
+            )
+        )
+    return out
+
+
+def expected_calibration_error(bins: list[Bin]) -> float:
+    total = sum(b.count for b in bins)
+    if total == 0:
+        return 0.0
+    return sum(b.count * abs(b.mean_confidence - b.accuracy) for b in bins) / total
+```
+
+- [ ] **Step 4: Run test to verify it passes**
+
+Run: `uv run pytest tests/unit/test_metrics_calibration.py -q`
+Expected: PASS (2 passed)
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add src/orc/metrics/calibration.py tests/unit/test_metrics_calibration.py
+git commit -m "feat(metrics): reliability bins + ECE"
+```
+
+### Task 1.3: Rewire the benchmark to the shared library
+
+**Files:**
+- Modify: `benchmarks/faithfulness/run.py` (replace private `_confusion`/`_scores` with imports from `orc.metrics.scoring`; adapt the PASS/FAIL binary attrs by constructing `LabeledResult(predicted=r.orc_binary, expected=r.ground_truth)` with `positive="PASS"`).
+
+- [ ] **Step 1: Confirm the benchmark tests exist and pass on main**
+
+Run: `uv run pytest tests/ -q -k benchmark` (if none, the benchmark has no unit tests — then this task only needs the import swap + a manual `python -c` smoke).
+Expected: baseline recorded.
+
+- [ ] **Step 2: Swap the implementation, keep `_confusion`/`_scores` as thin shims**
+
+```python
+# benchmarks/faithfulness/run.py — replace the bodies, keep the names so the
+# rest of the file is untouched:
+from orc.metrics.scoring import LabeledResult, confusion as _confusion_lib, scores as _scores
+
+def _confusion(results, binary_attr):
+    labeled = [
+        LabeledResult(predicted=getattr(r, binary_attr), expected=r.ground_truth)
+        for r in results
+    ]
+    return _confusion_lib(labeled, positive="PASS")
+```
+
+Keep `_scores` pointing at the library, which already returns `accuracy`/`precision`/`recall`/`f1`. If the benchmark's own report assembly reads a `precision_pass` key, grep for `precision_pass` in `run.py` and update those readers to `precision`.
+
+- [ ] **Step 3: Smoke the scoring path**
+
+Run: `uv run python -c "from benchmarks.faithfulness.run import _confusion, _scores; print(_scores(_confusion([], 'orc_binary')))"`
+Expected: `{'accuracy': 0.0, 'precision': 0.0, 'recall': 0.0, 'f1': 0.0}`
+
+- [ ] **Step 4: Full suite**
+
+Run: `uv run pytest -q && uv run ruff check src tests`
+Expected: all green.
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add benchmarks/faithfulness/run.py
+git commit -m "refactor(benchmarks): use orc.metrics"
+```
+
+---
+
+## Stage 2 — Gold store (schema v2 + CLI)
+
+### Task 2.1: schema v2 + gold_claim/eval_run/tiered_policy tables + migration
+
+**Files:**
+- Modify: `src/orc/storage/schema.sql` (append three tables; bump header to v2)
+- Modify: `src/orc/storage/db.py` (`SCHEMA_VERSION = 2`; add `ensure_schema(conn)` that re-runs the idempotent `IF NOT EXISTS` script + re-stamps when stored version < 2)
+- Modify: `src/orc/storage/workspace.py` (`resolve()` calls `ensure_schema` so existing v1 workspaces gain the tables on open)
+- Test: `tests/unit/test_schema_migration.py`
+
+- [ ] **Step 1: Write the failing test**
+
+```python
+# tests/unit/test_schema_migration.py
+from orc.paths import workspace_db_path
+from orc.storage import db
+from orc.storage import workspace as ws_module
+
+
+def test_existing_v1_workspace_gains_gold_tables_on_resolve(orc_home, monkeypatch) -> None:
+    # Create at v1 by forcing the old version, then resolve under v2 code.
+    monkeypatch.setattr(db, "SCHEMA_VERSION", 1)
+    ws_module.create("legacy")
+    monkeypatch.setattr(db, "SCHEMA_VERSION", 2)
+    ws_module.resolve("legacy")  # must migrate
+    with db.open_connection(workspace_db_path("legacy")) as conn:
+        names = {r["name"] for r in conn.execute(
+            "SELECT name FROM sqlite_master WHERE type='table'")}
+        assert {"gold_claim", "eval_run", "tiered_policy"} <= names
+        ver = conn.execute(
+            "SELECT value FROM schema_meta WHERE key='schema_version'").fetchone()["value"]
+        assert ver == "2"
+```
+
+- [ ] **Step 2: Run test to verify it fails**
+
+Run: `uv run pytest tests/unit/test_schema_migration.py -q`
+Expected: FAIL — `gold_claim` not in table set.
+
+- [ ] **Step 3: Implement**
+
+Append to `src/orc/storage/schema.sql`:
+
+```sql
+CREATE TABLE IF NOT EXISTS gold_claim (
+    gold_id            TEXT PRIMARY KEY,
+    workspace          TEXT NOT NULL,
+    claim              TEXT NOT NULL,
+    expected_label     TEXT NOT NULL,
+    corpus_version     INTEGER NOT NULL,
+    relevant_chunk_ids TEXT,
+    source             TEXT NOT NULL,
+    source_run_id      TEXT,
+    note               TEXT,
+    added_at           TEXT NOT NULL,
+    added_by           TEXT
+);
+CREATE INDEX IF NOT EXISTS idx_gold_claim_workspace ON gold_claim(workspace);
+
+CREATE TABLE IF NOT EXISTS eval_run (
+    eval_id      TEXT PRIMARY KEY,
+    workspace    TEXT NOT NULL,
+    created_at   TEXT NOT NULL,
+    config_json  TEXT NOT NULL,
+    metrics_json TEXT NOT NULL
+);
+
+CREATE TABLE IF NOT EXISTS tiered_policy (
+    workspace                  TEXT PRIMARY KEY,
+    tier1_model                TEXT NOT NULL,
+    tier2_model                TEXT NOT NULL,
+    top_judge_model            TEXT,
+    escalation_threshold       REAL NOT NULL,
+    target                     REAL NOT NULL,
+    calibrated_at              TEXT NOT NULL,
+    calibrated_against_eval_id TEXT,
+    n_gold                     INTEGER NOT NULL
+);
+```
+
+In `src/orc/storage/db.py`: set `SCHEMA_VERSION = 2`; add
+
+```python
+def ensure_schema(conn: sqlite3.Connection) -> None:
+    """Bring a connection's schema up to SCHEMA_VERSION. All tables use
+    CREATE TABLE IF NOT EXISTS, so re-running the script is the migration for
+    additive v1->v2 (gold_claim/eval_run/tiered_policy). Re-stamps the version."""
+    row = conn.execute("SELECT value FROM schema_meta WHERE key='schema_version'").fetchone()
+    stored = int(row["value"]) if row else 1
+    if stored >= SCHEMA_VERSION:
+        return
+    bootstrap_schema(conn)
+```
+
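+In `src/orc/storage/workspace.py` `resolve()`: after opening the connection, call `db.ensure_schema(conn)` before returning the Workspace. The re-stamp relies on `bootstrap_schema` overwriting the stored `schema_version` row; if it only stamps a fresh database, add an explicit version update to `ensure_schema`.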
+
+- [ ] **Step 4: Run test to verify it passes**
+
+Run: `uv run pytest tests/unit/test_schema_migration.py -q && uv run pytest -q`
+Expected: PASS; full suite green.
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add src/orc/storage tests/unit/test_schema_migration.py
+git commit -m "feat(storage): schema v2 gold/eval tables"
+```
+
+### Task 2.2: Gold store module (insert / list)
+
+**Files:**
+- Create: `src/orc/eval/__init__.py`
+- Create: `src/orc/eval/gold.py`
+- Test: `tests/unit/test_gold_store.py`
+
+- [ ] **Step 1: Write the failing test**
+
+```python
+# tests/unit/test_gold_store.py
+import pytest
+
+from orc.eval import gold
+from orc.storage import workspace as ws_module
+
+
+def test_add_and_list_gold_claim(orc_home) -> None:
+    ws_module.create("demo")
+    gid = gold.add(
+        "demo", claim="The sky is blue", expected_label="supported",
+        corpus_version=0, source="import", note="seed",
+    )
+    [g] = gold.list_gold("demo")
+    assert g.gold_id == gid
+    assert g.claim == "The sky is blue"
+    assert g.expected_label == "supported"
+    assert g.relevant_chunk_ids is None
+    assert g.source == "import"
+
+
+def test_add_rejects_unknown_label(orc_home) -> None:
+    ws_module.create("demo")
+    with pytest.raises(ValueError, match="expected_label"):
+        gold.add("demo", claim="x", expected_label="maybe", corpus_version=0, source="import")
+```
+
+- [ ] **Step 2: Run test to verify it fails**
+
+Run: `uv run pytest tests/unit/test_gold_store.py -q`
+Expected: FAIL — `No module named 'orc.eval'`
+
+- [ ] **Step 3: Implement**
+
+```python
+# src/orc/eval/gold.py
+"""Per-workspace gold-set store: human-confirmed (claim -> verdict) labels."""
+
+from __future__ import annotations
+
+import json
+from dataclasses import dataclass
+
+from orc.core.clock import now_iso
+from orc.core.ids import new_id
+from orc.paths import workspace_db_path
+from orc.storage.db import open_connection, transaction
+
+VALID_LABELS = frozenset({"supported", "contradicted", "not_found", "partial"})
+
+
+@dataclass(frozen=True)
+class GoldClaim:
+    gold_id: str
+    workspace: str
+    claim: str
+    expected_label: str
+    corpus_version: int
+    relevant_chunk_ids: list[str] | None
+    source: str
+    source_run_id: str | None
+    note: str | None
+    added_at: str
+    added_by: str | None
+
+
+def add(
+    workspace: str,
+    *,
+    claim: str,
+    expected_label: str,
+    corpus_version: int,
+    source: str,
+    relevant_chunk_ids: list[str] | None = None,
+    source_run_id: str | None = None,
+    note: str | None = None,
+    added_by: str | None = None,
+) -> str:
+    if expected_label not in VALID_LABELS:
+        raise ValueError(f"expected_label must be one of {sorted(VALID_LABELS)}")
+    gold_id = new_id()
+    with open_connection(workspace_db_path(workspace)) as conn, transaction(conn):
+        conn.execute(
+            "INSERT INTO gold_claim(gold_id, workspace, claim, expected_label, "
+            "corpus_version, relevant_chunk_ids, source, source_run_id, note, "
+            "added_at, added_by) VALUES (?,?,?,?,?,?,?,?,?,?,?)",
+            (
+                gold_id, workspace, claim, expected_label, corpus_version,
+                json.dumps(relevant_chunk_ids) if relevant_chunk_ids else None,
+                source, source_run_id, note, now_iso(), added_by,
+            ),
+        )
+    return gold_id
+
+
+def list_gold(workspace: str) -> list[GoldClaim]:
+    with open_connection(workspace_db_path(workspace)) as conn:
+        rows = conn.execute(
+            "SELECT * FROM gold_claim WHERE workspace=? ORDER BY added_at", (workspace,)
+        ).fetchall()
+    return [
+        GoldClaim(
+            gold_id=r["gold_id"], workspace=r["workspace"], claim=r["claim"],
+            expected_label=r["expected_label"], corpus_version=r["corpus_version"],
+            relevant_chunk_ids=json.loads(r["relevant_chunk_ids"]) if r["relevant_chunk_ids"] else None,
+            source=r["source"], source_run_id=r["source_run_id"], note=r["note"],
+            added_at=r["added_at"], added_by=r["added_by"],
+        )
+        for r in rows
+    ]
+```
+
+Note: confirm `orc.core.clock.now_iso` and `orc.core.ids.new_id` exist (grep; they are used across the codebase). If the clock helper has a different name, match it.
+
+- [ ] **Step 4: Run test to verify it passes**
+
+Run: `uv run pytest tests/unit/test_gold_store.py -q`
+Expected: PASS.
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add src/orc/eval tests/unit/test_gold_store.py
+git commit -m "feat(eval): gold-set store (add + list)"
+```
+
+### Task 2.3: Gold CLI — import / label / list
+
+**Files:**
+- Create: `src/orc/cli_commands/eval_cmd.py` (a click group `eval` with subcommands; the module is named `eval_cmd` to avoid shadowing the `eval` builtin)
+- Modify: `src/orc/cli.py` (register `eval_cmd.eval_group`)
+- Test: `tests/unit/test_eval_cli.py`
+
+- [ ] **Step 1: Write the failing test**
+
+```python
+# tests/unit/test_eval_cli.py
+import json
+from click.testing import CliRunner
+from orc.cli import main
+from orc.eval import gold
+from orc.storage import workspace as ws_module
+
+
+def test_eval_import_seeds_gold_from_yaml(orc_home, tmp_path) -> None:
+    ws_module.create("demo")
+    f = tmp_path / "claims.yaml"
+    f.write_text(
+        "- id: c1\n  text: The sky is blue\n  expected: supported\n"
+        "- id: c2\n  text: Pigs fly\n  expected: not_found\n"
+    )
+    res = CliRunner().invoke(main, ["eval", "import", str(f), "-w", "demo"])
+    assert res.exit_code == 0, res.output
+    labels = {g.expected_label for g in gold.list_gold("demo")}
+    assert labels == {"supported", "not_found"}
+
+
+def test_eval_label_promotes_a_real_verdict(orc_home, monkeypatch) -> None:
+    # Build one verify run via the fake-LLM idiom (reuse the helper pattern from
+    # tests/unit/test_verify_claim_modes.py), capture its run_id, then promote.
+    ...  # see test_verify_claim_modes.py for _run_skill + fake client setup
+```
+
+(For the second test, follow the exact fake-LLM run setup in `tests/unit/test_verify_claim_modes.py` — create a workspace, ingest a corpus, run `verify_claim` under a `FakeAnthropic`, take `result["_run_id"]`, then invoke `["eval", "label", run_id, "--verdict", "supported", "-w", "demo"]` and assert a promoted `GoldClaim` exists with `source="promoted"`, `source_run_id=run_id`, and `corpus_version` pulled from the trace.)
+
+- [ ] **Step 2: Run test to verify it fails**
+
+Run: `uv run pytest tests/unit/test_eval_cli.py -q`
+Expected: FAIL — `No such command 'eval'`.
+
+- [ ] **Step 3: Implement**
+
+```python
+# src/orc/cli_commands/eval_cmd.py
+"""`orc eval ...` — gold set, gate measurement, and tiered calibration."""
+
+from __future__ import annotations
+
+import json as json_lib
+from pathlib import Path
+
+import click
+import yaml
+
+from orc.errors import WorkspaceNotFoundError
+from orc.eval import gold
+from orc.storage import workspace as ws_module
+from orc.storage.trace_store import load_trace
+
+
+@click.group("eval")
+def eval_group() -> None:
+    """Measure and calibrate the verification gate against a gold set."""
+
+
+@eval_group.command("import")
+@click.argument("path", type=click.Path(exists=True, dir_okay=False, path_type=Path))
+@click.option("--workspace", "-w", default=None)
+def import_command(path: Path, workspace: str | None) -> None:
+    """Seed gold claims from a YAML file (id/text/expected[/relevant_chunk_ids/note])."""
+    try:
+        ws = ws_module.resolve(workspace)
+    except WorkspaceNotFoundError as exc:
+        raise click.ClickException(str(exc)) from exc
+    items = yaml.safe_load(path.read_text()) or []
+    n = 0
+    for item in items:
+        gold.add(
+            ws.name,
+            claim=item["text"],
+            expected_label=item["expected"],
+            corpus_version=ws.corpus_version,
+            relevant_chunk_ids=item.get("relevant_chunk_ids"),
+            source="import",
+            note=item.get("note"),
+        )
+        n += 1
+    click.echo(f"Imported {n} gold claim(s) into {ws.name}")
+
+
+@eval_group.command("label")
+@click.argument("run_id")
+@click.option("--verdict", required=True,
+              type=click.Choice(["supported", "contradicted", "not_found", "partial"]))
+@click.option("--relevant", "relevant", multiple=True, help="Relevant chunk id (repeatable)")
+@click.option("--workspace", "-w", default=None)
+@click.option("--note", default=None)
+def label_command(run_id, verdict, relevant, workspace, note) -> None:
+    """Promote/correct a real verdict into the gold set."""
+    trace = load_trace(run_id)
+    claim = (trace.get("inputs") or {}).get("claim") or (trace.get("output") or {}).get("claim")
+    if not claim:
+        raise click.ClickException(f"Run {run_id} has no claim to label")
+    gold.add(
+        trace["workspace"],
+        claim=claim,
+        expected_label=verdict,
+        corpus_version=trace["corpus_version"],
+        relevant_chunk_ids=list(relevant) or None,
+        source="promoted",
+        source_run_id=run_id,
+        note=note,
+    )
+    click.echo(f"Labelled run {run_id} as {verdict} in {trace['workspace']}")
+
+
+@eval_group.command("gold")
+@click.argument("action", type=click.Choice(["list"]))
+@click.option("--workspace", "-w", default=None)
+@click.option("--json", "as_json", is_flag=True)
+def gold_command(action, workspace, as_json) -> None:
+    """Inspect the gold set (currently: list)."""
+    try:
+        ws = ws_module.resolve(workspace)
+    except WorkspaceNotFoundError as exc:
+        raise click.ClickException(str(exc)) from exc
+    items = gold.list_gold(ws.name)
+    stale = [g for g in items if g.relevant_chunk_ids and g.corpus_version < ws.corpus_version]
+    if as_json:
+        click.echo(json_lib.dumps([
+            {"gold_id": g.gold_id, "claim": g.claim, "expected_label": g.expected_label,
+             "corpus_version": g.corpus_version, "source": g.source,
+             "stale_chunk_labels": g in stale}
+            for g in items], indent=2))
+        return
+    for g in items:
+        flag = "  [stale chunk labels]" if g in stale else ""
+        click.echo(f"{g.gold_id}  {g.expected_label:<12} {g.claim[:60]}{flag}")
+```
+
+Register in `src/orc/cli.py`: `from orc.cli_commands import eval_cmd` and `main.add_command(eval_cmd.eval_group)`. Confirm `yaml` (pyyaml) is already a dependency (it is — used by manifests).
+
+- [ ] **Step 4: Run test to verify it passes**
+
+Run: `uv run pytest tests/unit/test_eval_cli.py -q && uv run pytest -q`
+Expected: PASS; full suite green.
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add src/orc/cli_commands/eval_cmd.py src/orc/cli.py tests/unit/test_eval_cli.py
+git commit -m "feat(cli): orc eval import/label/gold list"
+```
+
+---
+
+## Stage 3 — `orc eval run` / `show`
+
+### Task 3.1: Eval runner (judge accuracy + calibration + retrieval recall)
+
+**Files:**
+- Create: `src/orc/eval/runner.py` (`run_eval(workspace, *, mode=None, k=10) -> EvalReport`)
+- Test: `tests/unit/test_eval_runner.py`
+
+The runner, for each gold claim: opens a Run tagged `inputs={"_eval_id": eval_id, "gold_id": ...}`, calls `verify_claim.run(workspace, run, claim=g.claim, mode=mode, corpus_version=g.corpus_version)`, records `LabeledResult(predicted=verdict_label, expected=g.expected_label)` and `ConfidenceResult(confidence, correct)`. For gold with `relevant_chunk_ids`, compute recall@k = |retrieved ∩ relevant| / |relevant| using `result["retrieval_chunk_ids"]`. Aggregates via `orc.metrics`. Persists an `eval_run` row.
+
+- [ ] **Step 1: Write the failing test** — script a `FakeAnthropic` returning known verdicts for two gold claims (one correct, one wrong), assert `report.judge["f1"]`, `report.calibration_ece`, and `report.retrieval_recall` match hand-computed values, and that an `eval_run` row was written. Reuse the corpus-ingest + fake-client setup from `tests/unit/test_verify_claim_modes.py`.
+
+- [ ] **Step 2: Run — FAIL** (`No module named 'orc.eval.runner'`).
+
+- [ ] **Step 3: Implement** `run_eval` per the description above, returning a frozen `EvalReport(eval_id, n, judge: dict, per_mode: dict, per_domain: dict, calibration_ece: float, reliability: list[Bin], retrieval_recall: float | None, stale_entries: int)`. Persist `eval_run(eval_id, workspace, created_at, config_json, metrics_json)`. Each per-claim verify is a normal traced Run (so the eval is replayable claim-by-claim).
+
+- [ ] **Step 4: Run — PASS**, then full suite.
+
+- [ ] **Step 5: Commit** `feat(eval): run_eval — judge accuracy, calibration, recall`.
+
+### Task 3.2: `orc eval run` / `orc eval show` CLI
+
+**Files:**
+- Modify: `src/orc/cli_commands/eval_cmd.py` (`run` + `show` subcommands)
+- Modify: `src/orc/eval/runner.py` (add `load_eval(workspace, eval_id) -> EvalReport`)
+- Test: extend `tests/unit/test_eval_cli.py`
+
+- [ ] **Step 1** Failing CLI test: import a 2-claim gold set, run `["eval", "run", "-w", "demo", "--json"]` under a fake client, assert the JSON carries `judge.f1`, `calibration.ece`, `n`; then `["eval", "show", eval_id, "-w", "demo", "--json"]` round-trips the same metrics.
+- [ ] **Step 2** Run — FAIL (`No such command 'run'`).
+- [ ] **Step 3** Implement `run_command` (rich table by default: per-mode/domain scores, an ECE line, a reliability table, recall@k, a stale-entries warning; `--json` emits the metrics dict) and `show_command` (loads the persisted `eval_run`).
+- [ ] **Step 4** Run — PASS + full suite.
+- [ ] **Step 5** Commit `feat(cli): orc eval run/show`.
+
+---
+
+## Stage 4 — tiered_verify
+
+### Task 4.1: Extract existing meta-modes into `modes/`
+
+**Files:**
+- Create: `src/orc/directives/research/skills/modes/__init__.py`
+- Create: `src/orc/directives/research/skills/modes/decomposed.py` (move `_run_decomposed` + `_decompose_claim`)
+- Create: `src/orc/directives/research/skills/modes/arithmetic.py` (move `_run_arithmetic`)
+- Modify: `src/orc/directives/research/skills/verify_claim.py` (import the moved helpers; the dispatcher stays)
+- Test: existing `tests/unit/test_verify_claim_modes.py` must stay green unchanged.
+
+- [ ] **Step 1** Run the existing mode tests to record the green baseline: `uv run pytest tests/unit/test_verify_claim_modes.py -q`.
+- [ ] **Step 2** Move `_run_decomposed`/`_decompose_claim` to `modes/decomposed.py` and `_run_arithmetic` to `modes/arithmetic.py`, re-exporting them from `verify_claim` (import at top). Keep signatures identical (they already take `self`/explicit kwargs). No behavior change.
+- [ ] **Step 3** Run the same tests — still PASS unchanged (this is a pure refactor; no new test, the existing suite is the guard).
+- [ ] **Step 4** Full suite + ruff.
+- [ ] **Step 5** Commit `refactor(verify): extract meta-modes into modes/`.
+
+### Task 4.2: tiered_verify meta-mode
+
+**Files:**
+- Create: `src/orc/directives/research/skills/modes/tiered.py` (`run_tiered(...)`)
+- Modify: `src/orc/directives/research/skills/verify_claim.py` (dispatch `mode == "tiered"` to `run_tiered`)
+- Create: `src/orc/eval/policy.py` (`load_policy(workspace) -> TieredPolicy | None`, `save_policy(...)`)
+- Test: `tests/unit/test_tiered_verify.py`
+
+- [ ] **Step 1: Write the failing test**
+
+```python
+# tests/unit/test_tiered_verify.py — sketch
+# Fake client scripted so Tier-1 (binary) returns faithful=True confidence=0.99
+# -> accept without escalating (only ONE llm call recorded).
+def test_tier1_accept_above_threshold(...):
+    ...
+    assert result["label"] == "supported"
+    assert result["tier"] == 1
+    # exactly one verdict call (no escalation)
+
+# Tier-1 confidence 0.60 < threshold -> escalate; Tier-2 evidence verdict wins,
+# trace records both verdicts + escalation reason + the tier-2 model.
+def test_low_confidence_escalates_to_tier2(...):
+    ...
+    assert result["tier"] == 2
+    assert result["escalated"] is True
+    # tier-2 used the configured top_judge_model when set
+```
+
+- [ ] **Step 2** Run — FAIL (`mode "tiered"` unknown / no `run_tiered`).
+- [ ] **Step 3** Implement `run_tiered`: load `TieredPolicy` (or a documented default `escalation_threshold=0.9`, `tier1_model="claude-haiku-4-5"`, `tier2_model="claude-sonnet-4-6"`, warn once if no policy). Tier 1 = `verify_claim.run(..., mode="binary", model=policy.tier1_model)`. If `confidence >= threshold` return it tagged `tier=1, escalated=False`. Else Tier 2 = `verify_claim.run(..., mode="evidence", model=policy.top_judge_model or policy.tier2_model)`, return tagged `tier=2, escalated=True`, and `run.record("tiered", {...both verdicts, threshold, reason...})`. Add `"tiered"` to the valid-mode set; optionally let `route_to_mode` map a domain to it.
+- [ ] **Step 4** Run — PASS + full suite.
+- [ ] **Step 5** Commit `feat(verify): tiered_verify meta-mode`.
+
+---
+
+## Stage 5 — `orc eval calibrate`
+
+### Task 5.1: Threshold sweep + achievability guard
+
+**Files:**
+- Create: `src/orc/eval/calibrate.py` (`calibrate(workspace, *, target=0.95, tier1_model, tier2_model, top_judge=None) -> CalibrationResult`)
+- Test: `tests/unit/test_calibrate.py`
+
+`calibrate` runs the gold set through Tier 1 only (Haiku binary), collects `ConfidenceResult`s, then sweeps candidate thresholds (the distinct observed confidences) and, for each, computes the accuracy of *accepted* claims (confidence ≥ threshold). It returns the lowest threshold whose accepted-accuracy ≥ target, the escalation rate at that threshold, and `achievable: bool`. When no threshold reaches the target it returns `achievable=False` with the max accepted-accuracy and the threshold that achieved it.
+
+- [ ] **Step 1: Write the failing test**
+
+```python
+# tests/unit/test_calibrate.py — pure function over a scripted reliability curve
+from orc.eval.calibrate import sweep_threshold
+from orc.metrics.calibration import ConfidenceResult
+
+
+def test_sweep_finds_lowest_threshold_meeting_target() -> None:
+    results = [
+        ConfidenceResult(0.99, True), ConfidenceResult(0.98, True),
+        ConfidenceResult(0.80, True), ConfidenceResult(0.79, False),
+    ]
+    r = sweep_threshold(results, target=0.95)
+    assert r.achievable is True
+    assert r.threshold == 0.80          # lowest cutoff meeting target: >=0.80 -> 3/3 correct
+    assert r.escalation_rate == 0.25    # 1 of 4 falls below
+
+
+def test_sweep_reports_unachievable_target() -> None:
+    results = [ConfidenceResult(0.99, False), ConfidenceResult(0.98, True)]
+    r = sweep_threshold(results, target=0.95)
+    assert r.achievable is False
+    assert r.max_accuracy == 0.5  # best cutoff (0.98) accepts both claims: 1 of 2 correct
+```
+
+- [ ] **Step 2** Run — FAIL (`No module named 'orc.eval.calibrate'`).
+- [ ] **Step 3** Implement `sweep_threshold` (pure, fully unit-testable) + `calibrate` (runs Tier 1 over the gold set via a fresh eval, then sweeps). Define `CalibrationResult(achievable, threshold, escalation_rate, accepted_accuracy, max_accuracy)`.
+- [ ] **Step 4** Run — PASS + full suite.
+- [ ] **Step 5** Commit `feat(eval): calibrate threshold sweep + guard`.
+
+### Task 5.2: `orc eval calibrate` CLI → tiered_policy
+
+**Files:**
+- Modify: `src/orc/cli_commands/eval_cmd.py` (`calibrate` subcommand)
+- Modify: `src/orc/eval/policy.py` (`save_policy`)
+- Test: extend `tests/unit/test_eval_cli.py` + `tests/unit/test_tiered_verify.py` (policy read-back)
+
+- [ ] **Step 1** Failing test: with a scripted gold set + fake client, `["eval", "calibrate", "-w", "demo", "--target", "0.95"]` writes a `tiered_policy` row whose `escalation_threshold` matches the sweep, prints the escalation rate, and on an unachievable target prints the guard message and a nonzero-but-graceful note. Then `run_tiered` reads that policy back (assert `load_policy("demo").escalation_threshold`).
+- [ ] **Step 2** Run — FAIL (`No such command 'calibrate'`).
+- [ ] **Step 3** Implement `calibrate_command`: run `calibrate`, on success `save_policy(...)` and echo the threshold + escalation rate; on `achievable=False` echo the guard message (*"Tier 1 cannot reach {target} at any cutoff (max {max_accuracy:.2f}); escalating all claims — lower --target or improve the gold set"*) and still save a policy with the best threshold so behavior is defined.
+- [ ] **Step 4** Run — PASS + full suite.
+- [ ] **Step 5** Commit `feat(cli): orc eval calibrate -> tiered_policy`.
+
+---
+
+## Stage 6 — Docs
+
+### Task 6.1: Coverage ceiling, CHANGELOG, README
+
+**Files:**
+- Modify: `README.md` (commands: `orc eval import|label|run|show|calibrate`, `verify --mode tiered`; one coverage-ceiling sentence about eval measuring against the user's own labels)
+- Modify: `docs/compliance/eu-ai-act.md`, `docs/positioning/competitive.md` (the same one-line caveat)
+- Modify: `CHANGELOG.md` (Unreleased: gold set, `orc eval`, tiered verification, calibration)
+
+- [ ] **Step 1** Add the README command lines + the coverage-ceiling sentence: *"`orc eval` measures judge accuracy and retrieval recall against your own labelled gold set — it quantifies how well the gate matches your labels and cannot detect faithful-but-wrong corpus content (no gold set can)."*
+- [ ] **Step 2** Mirror the one-line caveat into the two docs; add CHANGELOG Unreleased bullets.
+- [ ] **Step 3** Run `uv run pytest -q` (docs change nothing) + a manual `uv run orc eval --help` to confirm the command tree renders.
+- [ ] **Step 4** Commit `docs: document orc eval + tiered verification`.
+
+### Task 6.2: PR
+
+- [ ] Push `feat/eval-tiered-verification`; open a PR summarizing the gold set → eval → calibrate → tiered loop, with the suite/ruff results and the calibrate achievability-guard behavior. Do not merge.
+
+---
+
+## Self-review notes
+
+- **Spec coverage:** gold store (2.1–2.3) ✓; import+promote (2.3) ✓; judge accuracy + calibration + retrieval recall (3.1) ✓; corpus-version frozen retrieval recall (3.1, uses `g.corpus_version`) ✓; eval as traced/replayable runs (3.1) ✓; tiered_verify + configurable cross-family top judge (4.2) ✓; modes/ extraction (4.1) ✓; calibrate + achievability guard + escalation rate (5.1–5.2) ✓; tiered_policy in orc.db (2.1, 5.2) ✓; metrics extraction (1.1–1.3) ✓; coverage-ceiling honesty (6.1) ✓; schema-v2 migration (2.1) ✓.
+- **Type consistency:** `LabeledResult`/`ConfidenceResult`/`Bin`/`GoldClaim`/`TieredPolicy`/`CalibrationResult`/`EvalReport` named consistently across tasks; `verify_claim.run()` kwargs (`mode`, `model`, `corpus_version`) match the real signature confirmed in the spec.
+- **Stage independence:** 1 ships alone (library); 2 ships alone (gold store usable); 3 needs 1+2; 4 needs nothing new but is most useful after 3; 5 needs 3+4. Each stage is independently green.
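
Note on Task 5.1 of the plan above (trailing commentary, not part of the diff): the threshold sweep is described only in prose, so here is a minimal sketch of how such a pure function might look. It is an illustration under assumptions, not the planned implementation: `ConfidenceResult` is redefined locally to mirror the planned `orc.metrics.calibration` dataclass, `SweepResult` is a hypothetical stand-in for `CalibrationResult`, and the choice to report the lowest cutoff that attains the maximum accuracy in the unachievable case is my own reading of "the threshold that achieved it".

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ConfidenceResult:
    """Local mirror of the planned orc.metrics.calibration.ConfidenceResult."""
    confidence: float
    correct: bool


@dataclass(frozen=True)
class SweepResult:
    """Hypothetical stand-in for the planned CalibrationResult."""
    achievable: bool
    threshold: float
    escalation_rate: float
    accepted_accuracy: float
    max_accuracy: float


def sweep_threshold(results: list[ConfidenceResult], *, target: float) -> SweepResult:
    """Return the lowest cutoff whose accepted claims reach `target` accuracy."""
    if not results:
        raise ValueError("sweep_threshold needs at least one result")
    n = len(results)
    rows: list[tuple[float, float, float]] = []  # (threshold, accuracy, escalation)
    for t in sorted({r.confidence for r in results}):
        # t is an observed confidence, so the accepted set is never empty.
        accepted = [r for r in results if r.confidence >= t]
        accuracy = sum(r.correct for r in accepted) / len(accepted)
        rows.append((t, accuracy, (n - len(accepted)) / n))
    max_accuracy = max(acc for _, acc, _ in rows)
    # Candidates are ascending, so the first row meeting the target is the
    # lowest passing cutoff, which also gives the lowest escalation rate.
    for t, acc, esc in rows:
        if acc >= target:
            return SweepResult(True, t, esc, acc, max_accuracy)
    # Guard: no cutoff reaches the target; report the best one instead.
    t, acc, esc = next(row for row in rows if row[1] == max_accuracy)
    return SweepResult(False, t, esc, acc, max_accuracy)


demo = sweep_threshold(
    [
        ConfidenceResult(0.97, True), ConfidenceResult(0.91, True),
        ConfidenceResult(0.85, False), ConfidenceResult(0.60, True),
    ],
    target=0.95,
)
print(demo.threshold, demo.escalation_rate)  # 0.91 0.5
```

In the demo, cutoffs 0.60 and 0.85 accept a wrong claim (accuracy 3/4 and 2/3), while 0.91 accepts two claims that are both correct, so 0.91 is the lowest cutoff meeting the 0.95 target and half of the claims escalate.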

From 370bfc5ef5abddaa8edfdb6bde06433424a0ef27 Mon Sep 17 00:00:00 2001
From: Thormatt <thormatt@gmail.com>
Date: Fri, 12 Jun 2026 14:38:08 -0400
Subject: [PATCH 03/15] feat(metrics): confusion + scores library

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---
 src/orc/metrics/__init__.py        |  1 +
 src/orc/metrics/scoring.py         | 45 ++++++++++++++++++++++++++++++
 tests/unit/test_metrics_scoring.py | 25 +++++++++++++++++
 3 files changed, 71 insertions(+)
 create mode 100644 src/orc/metrics/__init__.py
 create mode 100644 src/orc/metrics/scoring.py
 create mode 100644 tests/unit/test_metrics_scoring.py
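
Reviewer note (below the three-dash line, so not part of the commit): the expected values in `test_scores_precision_recall_f1_accuracy` can be reproduced by hand. The sketch below re-derives them with exact fractions. It deliberately does not import `orc.metrics.scoring`; `scores_exact` is a hypothetical helper used only for this check, written to follow the same formulas and zero-denominator convention as `scores()` in this patch.

```python
from fractions import Fraction


def scores_exact(tp: int, fp: int, tn: int, fn: int) -> dict[str, Fraction]:
    """Accuracy/precision/recall/F1 in exact arithmetic; empty denominators give 0."""
    zero = Fraction(0)
    n = tp + fp + tn + fn
    if n == 0:
        return {"accuracy": zero, "precision": zero, "recall": zero, "f1": zero}
    precision = Fraction(tp, tp + fp) if (tp + fp) else zero
    recall = Fraction(tp, tp + fn) if (tp + fn) else zero
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else zero
    return {
        "accuracy": Fraction(tp + tn, n),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# The confusion matrix used by the unit test in this patch.
s = scores_exact(tp=3, fp=1, tn=4, fn=2)
for name, value in s.items():
    print(f"{name}: {value} = {float(value):.4f}")
```

For this matrix the exact values are accuracy 7/10, precision 3/4, recall 3/5, and F1 2/3, which round to the 0.7, 0.75, 0.6, and 0.6667 asserted in the test.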

diff --git a/src/orc/metrics/__init__.py b/src/orc/metrics/__init__.py
new file mode 100644
index 0000000..6343f35
--- /dev/null
+++ b/src/orc/metrics/__init__.py
@@ -0,0 +1 @@
+"""Scoring + calibration metrics shared by benchmarks and `orc eval`."""
diff --git a/src/orc/metrics/scoring.py b/src/orc/metrics/scoring.py
new file mode 100644
index 0000000..74aeaf5
--- /dev/null
+++ b/src/orc/metrics/scoring.py
@@ -0,0 +1,45 @@
+"""Confusion matrix and precision/recall/F1 over exact-label predictions.
+
+Positive class is caller-chosen (e.g. "supported"); everything else is the
+negative class. Predictions of None (the claim errored) are skipped, not
+counted as wrong — an eval distinguishes "judged incorrectly" from "could not
+judge"."""
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+
+
+@dataclass(frozen=True)
+class LabeledResult:
+    predicted: str | None
+    expected: str
+
+
+def confusion(results: list[LabeledResult], *, positive: str) -> dict[str, int]:
+    tp = fp = tn = fn = 0
+    for r in results:
+        if r.predicted is None:
+            continue
+        pred_pos = r.predicted == positive
+        exp_pos = r.expected == positive
+        if pred_pos and exp_pos:
+            tp += 1
+        elif pred_pos and not exp_pos:
+            fp += 1
+        elif not pred_pos and not exp_pos:
+            tn += 1
+        else:
+            fn += 1
+    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}
+
+
+def scores(cm: dict[str, int]) -> dict[str, float]:
+    tp, fp, tn, fn = cm["tp"], cm["fp"], cm["tn"], cm["fn"]
+    n = tp + fp + tn + fn
+    if n == 0:
+        return {"accuracy": 0.0, "precision": 0.0, "recall": 0.0, "f1": 0.0}
+    precision = tp / (tp + fp) if (tp + fp) else 0.0
+    recall = tp / (tp + fn) if (tp + fn) else 0.0
+    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
+    return {"accuracy": (tp + tn) / n, "precision": precision, "recall": recall, "f1": f1}
diff --git a/tests/unit/test_metrics_scoring.py b/tests/unit/test_metrics_scoring.py
new file mode 100644
index 0000000..5b119a3
--- /dev/null
+++ b/tests/unit/test_metrics_scoring.py
@@ -0,0 +1,25 @@
+from orc.metrics.scoring import LabeledResult, confusion, scores
+
+
+def test_confusion_counts_exact_label_matches() -> None:
+    results = [
+        LabeledResult(predicted="supported", expected="supported"),
+        LabeledResult(predicted="supported", expected="not_found"),
+        LabeledResult(predicted="not_found", expected="not_found"),
+        LabeledResult(predicted="not_found", expected="supported"),
+        LabeledResult(predicted=None, expected="supported"),  # errored, skipped
+    ]
+    cm = confusion(results, positive="supported")
+    assert cm == {"tp": 1, "fp": 1, "tn": 1, "fn": 1}
+
+
+def test_scores_precision_recall_f1_accuracy() -> None:
+    s = scores({"tp": 3, "fp": 1, "tn": 4, "fn": 2})
+    assert s["accuracy"] == 0.7
+    assert s["precision"] == 0.75
+    assert round(s["recall"], 4) == 0.6
+    assert round(s["f1"], 4) == 0.6667
+
+
+def test_scores_empty_is_zero() -> None:
+    assert scores({"tp": 0, "fp": 0, "tn": 0, "fn": 0})["f1"] == 0.0

From 16620e140195d3818014104552c79c972797d1a4 Mon Sep 17 00:00:00 2001
From: Thormatt <thormatt@gmail.com>
Date: Fri, 12 Jun 2026 14:38:42 -0400
Subject: [PATCH 04/15] feat(metrics): confidence calibration (reliability bins
 + ECE)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---
 src/orc/metrics/calibration.py         | 59 ++++++++++++++++++++++++++
 tests/unit/test_metrics_calibration.py | 28 ++++++++++++
 2 files changed, 87 insertions(+)
 create mode 100644 src/orc/metrics/calibration.py
 create mode 100644 tests/unit/test_metrics_calibration.py
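
Reviewer note (below the three-dash line, so not part of the commit): a worked ECE calculation over the four results used in `test_reliability_bins_group_by_confidence_decile`. This is a self-contained sketch, not the library code. It assumes floor-based decile binning, which agrees with the range comparisons in `reliability_bins` for these values but may differ for a confidence that sits exactly on a bin edge.

```python
from collections import defaultdict

# (confidence, correct) pairs taken from the unit test in this patch.
results = [(0.95, True), (0.92, False), (0.55, True), (0.51, True)]

# Group into deciles; min(..., 9) keeps confidence == 1.0 in the top bin.
bins: dict[int, list[tuple[float, bool]]] = defaultdict(list)
for confidence, correct in results:
    bins[min(int(confidence * 10), 9)].append((confidence, correct))

# ECE is the count-weighted mean of |mean confidence - accuracy| per bin.
ece = 0.0
for index in sorted(bins):
    members = bins[index]
    mean_confidence = sum(c for c, _ in members) / len(members)
    accuracy = sum(1 for _, ok in members if ok) / len(members)
    gap = abs(mean_confidence - accuracy)
    ece += len(members) * gap / len(results)
    print(f"bin {index}: n={len(members)} conf={mean_confidence:.3f} "
          f"acc={accuracy:.2f} gap={gap:.3f}")

print(f"ECE = {ece:.4f}")
```

By hand: the 0.9 to 1.0 bin has mean confidence 0.935 and accuracy 0.5 (gap 0.435), the 0.5 to 0.6 bin has mean confidence 0.53 and accuracy 1.0 (gap 0.47), so ECE = (2 * 0.435 + 2 * 0.47) / 4 = 0.4525. Note that the lower bin is underconfident and still contributes to ECE, because the metric takes the absolute gap.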

diff --git a/src/orc/metrics/calibration.py b/src/orc/metrics/calibration.py
new file mode 100644
index 0000000..35a5764
--- /dev/null
+++ b/src/orc/metrics/calibration.py
@@ -0,0 +1,59 @@
+"""Confidence calibration: do the gate's confidence scores mean what they say?
+
+A well-calibrated judge that reports 0.9 confidence is right ~90% of the time.
+reliability_bins groups predictions by confidence and reports actual accuracy
+per bin; ECE is the count-weighted average gap between stated confidence and
+realized accuracy. This is the signal `orc eval calibrate` uses to choose a
+tier-1 escalation threshold."""
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+
+
+@dataclass(frozen=True)
+class ConfidenceResult:
+    confidence: float
+    correct: bool
+
+
+@dataclass(frozen=True)
+class Bin:
+    lo: float
+    hi: float
+    count: int
+    mean_confidence: float
+    accuracy: float
+
+
+def reliability_bins(results: list[ConfidenceResult], *, n_bins: int = 10) -> list[Bin]:
+    width = 1.0 / n_bins
+    out: list[Bin] = []
+    for i in range(n_bins):
+        lo = i * width
+        hi = 1.0 if i == n_bins - 1 else (i + 1) * width
+        # Top bin is closed on the right so confidence==1.0 lands somewhere.
+        members = [
+            r for r in results
+            if r.confidence >= lo and (r.confidence < hi or (hi == 1.0 and r.confidence <= hi))
+        ]
+        if not members:
+            continue
+        count = len(members)
+        out.append(
+            Bin(
+                lo=lo,
+                hi=hi,
+                count=count,
+                mean_confidence=sum(r.confidence for r in members) / count,
+                accuracy=sum(1 for r in members if r.correct) / count,
+            )
+        )
+    return out
+
+
+def expected_calibration_error(bins: list[Bin]) -> float:
+    total = sum(b.count for b in bins)
+    if total == 0:
+        return 0.0
+    return sum(b.count * abs(b.mean_confidence - b.accuracy) for b in bins) / total
diff --git a/tests/unit/test_metrics_calibration.py b/tests/unit/test_metrics_calibration.py
new file mode 100644
index 0000000..2db7148
--- /dev/null
+++ b/tests/unit/test_metrics_calibration.py
@@ -0,0 +1,28 @@
+from orc.metrics.calibration import ConfidenceResult, expected_calibration_error, reliability_bins
+
+
+def test_reliability_bins_group_by_confidence_decile() -> None:
+    # Two claims at ~0.95 (one right), two at ~0.55 (both right).
+    results = [
+        ConfidenceResult(confidence=0.95, correct=True),
+        ConfidenceResult(confidence=0.92, correct=False),
+        ConfidenceResult(confidence=0.55, correct=True),
+        ConfidenceResult(confidence=0.51, correct=True),
+    ]
+    bins = reliability_bins(results, n_bins=10)
+    top = next(b for b in bins if b.lo <= 0.95 < b.hi or (b.hi == 1.0 and b.lo <= 0.95))
+    assert top.count == 2
+    assert top.accuracy == 0.5
+    assert round(top.mean_confidence, 3) == 0.935
+
+
+def test_ece_is_weighted_gap_between_confidence_and_accuracy() -> None:
+    # Perfectly calibrated: confidence == accuracy in every bin -> ECE 0.
+    perfect = [ConfidenceResult(confidence=1.0, correct=True) for _ in range(4)]
+    assert expected_calibration_error(reliability_bins(perfect, n_bins=10)) == 0.0
+    # Overconfident: conf 1.0 but half wrong -> ECE 0.5.
+    over = (
+        [ConfidenceResult(confidence=1.0, correct=True) for _ in range(2)]
+        + [ConfidenceResult(confidence=1.0, correct=False) for _ in range(2)]
+    )
+    assert expected_calibration_error(reliability_bins(over, n_bins=10)) == 0.5

From 9f451bfe22854b138a6a70ff8baca6c2a3951967 Mon Sep 17 00:00:00 2001
From: Thormatt <thormatt@gmail.com>
Date: Fri, 12 Jun 2026 14:39:53 -0400
Subject: [PATCH 05/15] refactor(benchmarks): use orc.metrics scoring library

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---
 benchmarks/faithfulness/run.py | 47 +++++++++++++++-------------------
 1 file changed, 20 insertions(+), 27 deletions(-)

diff --git a/benchmarks/faithfulness/run.py b/benchmarks/faithfulness/run.py
index e945cc7..7d7c4d2 100644
--- a/benchmarks/faithfulness/run.py
+++ b/benchmarks/faithfulness/run.py
@@ -113,37 +113,27 @@ def _load_dataset(n: int, source_filter: str | None) -> list[dict[str, Any]]:
 
 
 def _confusion(results: list[ItemResult], binary_attr: str) -> dict[str, int]:
-    tp = fp = tn = fn = 0
-    for r in results:
-        if getattr(r, binary_attr) is None:
-            continue
-        pred = getattr(r, binary_attr)
-        gt = r.ground_truth
-        if pred == "PASS" and gt == "PASS":
-            tp += 1
-        elif pred == "PASS" and gt == "FAIL":
-            fp += 1
-        elif pred == "FAIL" and gt == "FAIL":
-            tn += 1
-        elif pred == "FAIL" and gt == "PASS":
-            fn += 1
-    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}
+    # Thin adapter over orc.metrics.scoring (PASS as the positive class) so the
+    # benchmark and `orc eval` share one implementation. Skips items whose
+    # binary attr is None (errored/unscored), same as before.
+    labeled = [
+        LabeledResult(predicted=getattr(r, binary_attr), expected=r.ground_truth)
+        for r in results
+    ]
+    return _confusion_lib(labeled, positive="PASS")
 
 
 def _scores(cm: dict[str, int]) -> dict[str, float]:
-    """Treat PASS as the positive class. Reviewers may re-score with FAIL-positive."""
-    tp, fp, tn, fn = cm["tp"], cm["fp"], cm["tn"], cm["fn"]
-    n = tp + fp + tn + fn
-    if n == 0:
-        return {"accuracy": 0.0, "precision_pass": 0.0, "recall_pass": 0.0, "f1_pass": 0.0}
-    precision = tp / (tp + fp) if (tp + fp) else 0.0
-    recall = tp / (tp + fn) if (tp + fn) else 0.0
-    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
+    """Treat PASS as the positive class. Reviewers may re-score with FAIL-positive.
+
+    Adapts the shared library's generic keys to this report's `*_pass` keys so
+    the downstream report assembly is untouched."""
+    s = _scores_lib(cm)
     return {
-        "accuracy": (tp + tn) / n,
-        "precision_pass": precision,
-        "recall_pass": recall,
-        "f1_pass": f1,
+        "accuracy": s["accuracy"],
+        "precision_pass": s["precision"],
+        "recall_pass": s["recall"],
+        "f1_pass": s["f1"],
     }
 
 
@@ -187,6 +177,9 @@ def _run_lynx_style_one(item: dict[str, Any], orc_home: Path) -> ItemResult:
 from orc.directives.research.routing import (  # noqa: E402
     BENCHMARK_SOURCE_TO_MODE as SOURCE_TO_MODE,
 )
+from orc.metrics.scoring import LabeledResult  # noqa: E402
+from orc.metrics.scoring import confusion as _confusion_lib  # noqa: E402
+from orc.metrics.scoring import scores as _scores_lib  # noqa: E402
 
 
 def _run_with_mode(item: dict[str, Any], orc_home: Path, mode: str) -> ItemResult:

From 994f61bcf9124b85d500186740ed720ea7c4bfbf Mon Sep 17 00:00:00 2001
From: Thormatt <thormatt@gmail.com>
Date: Fri, 12 Jun 2026 15:08:38 -0400
Subject: [PATCH 06/15] =?UTF-8?q?feat(storage):=20schema=20v2=20=E2=80=94?=
 =?UTF-8?q?=20gold/eval/policy=20tables=20+=20migration?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds gold_claim, eval_run, and tiered_policy tables for the eval +
tiered-verification feature. ensure_schema() re-runs the idempotent
CREATE-IF-NOT-EXISTS script when a workspace's stored version lags,
so existing v1 workspaces gain the tables the first time newer orc
opens them. Replaces the prior unchecked-version handling with a real
forward migration. resolve() runs it on open.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---
 src/orc/storage/db.py               | 18 ++++++++++++++-
 src/orc/storage/schema.sql          | 36 +++++++++++++++++++++++++++++
 src/orc/storage/workspace.py        | 12 +++++++++-
 tests/unit/test_schema_migration.py | 22 ++++++++++++++++++
 tests/unit/test_workspace.py        |  2 +-
 5 files changed, 87 insertions(+), 3 deletions(-)
 create mode 100644 tests/unit/test_schema_migration.py
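
Note on the additive-migration approach: the sketch below is a minimal,
standalone illustration, not the orc implementation. It assumes that
bootstrap_schema() runs the idempotent script and then stamps the current
version into schema_meta (its body is not part of this patch). The trimmed
table set and the names bootstrap() and ensure() are hypothetical stand-ins.

```python
import sqlite3

SCHEMA_VERSION = 2  # mirrors the constant bumped in this patch

# Hypothetical, trimmed stand-in for schema.sql: every statement is idempotent.
SCHEMA_SQL = """
CREATE TABLE IF NOT EXISTS schema_meta (key TEXT PRIMARY KEY, value TEXT NOT NULL);
CREATE TABLE IF NOT EXISTS workspace (name TEXT PRIMARY KEY);
CREATE TABLE IF NOT EXISTS gold_claim (gold_id TEXT PRIMARY KEY, claim TEXT NOT NULL);
"""


def bootstrap(conn: sqlite3.Connection, version: int) -> None:
    """Assumed behaviour of bootstrap_schema: run the script, stamp the version."""
    conn.executescript(SCHEMA_SQL)
    conn.execute(
        "INSERT OR REPLACE INTO schema_meta(key, value) VALUES ('schema_version', ?)",
        (str(version),),
    )
    conn.commit()


def ensure(conn: sqlite3.Connection) -> bool:
    """Return True if a migration ran, False if the schema was already current."""
    row = conn.execute(
        "SELECT value FROM schema_meta WHERE key='schema_version'"
    ).fetchone()
    stored = int(row[0]) if row else 1
    if stored >= SCHEMA_VERSION:
        return False
    bootstrap(conn, SCHEMA_VERSION)
    return True


# Simulate a workspace created by an older release: v1 tables only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    "CREATE TABLE schema_meta (key TEXT PRIMARY KEY, value TEXT NOT NULL);"
    "CREATE TABLE workspace (name TEXT PRIMARY KEY);"
    "INSERT INTO schema_meta VALUES ('schema_version', '1');"
)
first = ensure(conn)   # runs the script, adds gold_claim, stamps version 2
second = ensure(conn)  # already current, so nothing happens
tables = {
    r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
}
version = conn.execute(
    "SELECT value FROM schema_meta WHERE key='schema_version'"
).fetchone()[0]
```

This only covers additive bumps. A change that alters or drops an existing
column would need an explicit per-version step, since CREATE TABLE IF NOT
EXISTS leaves an existing table untouched.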

diff --git a/src/orc/storage/db.py b/src/orc/storage/db.py
index d7bf05d..39f01ee 100644
--- a/src/orc/storage/db.py
+++ b/src/orc/storage/db.py
@@ -14,7 +14,7 @@
 from importlib.resources import files
 from pathlib import Path
 
-SCHEMA_VERSION = 1
+SCHEMA_VERSION = 2
 
 
 def schema_sql() -> str:
@@ -45,6 +45,22 @@ def bootstrap_schema(conn: sqlite3.Connection) -> None:
     )
 
 
+def ensure_schema(conn: sqlite3.Connection) -> None:
+    """Bring a connection's schema up to SCHEMA_VERSION.
+
+    Every table uses CREATE TABLE IF NOT EXISTS, so re-running the script is the
+    migration for additive bumps (v1 -> v2 added gold_claim/eval_run/
+    tiered_policy). Cheap to no-op when already current, so callers can invoke it
+    on every workspace open without a version probe of their own."""
+    row = conn.execute(
+        "SELECT value FROM schema_meta WHERE key='schema_version'"
+    ).fetchone()
+    stored = int(row["value"]) if row else 1
+    if stored >= SCHEMA_VERSION:
+        return
+    bootstrap_schema(conn)
+
+
 @contextmanager
 def transaction(conn: sqlite3.Connection) -> Iterator[None]:
     conn.execute("BEGIN IMMEDIATE")
diff --git a/src/orc/storage/schema.sql b/src/orc/storage/schema.sql
index e58d626..fbe35f5 100644
--- a/src/orc/storage/schema.sql
+++ b/src/orc/storage/schema.sql
@@ -134,3 +134,39 @@ CREATE TABLE IF NOT EXISTS approval_decision (
 );
 
 CREATE INDEX IF NOT EXISTS idx_approval_decision_approval ON approval_decision(approval_id);
+
+-- v2: gold set, eval runs, and tiered-verification calibration.
+CREATE TABLE IF NOT EXISTS gold_claim (
+    gold_id            TEXT PRIMARY KEY,
+    workspace          TEXT NOT NULL,
+    claim              TEXT NOT NULL,
+    expected_label     TEXT NOT NULL,
+    corpus_version     INTEGER NOT NULL,
+    relevant_chunk_ids TEXT,
+    source             TEXT NOT NULL,
+    source_run_id      TEXT,
+    note               TEXT,
+    added_at           TEXT NOT NULL,
+    added_by           TEXT
+);
+CREATE INDEX IF NOT EXISTS idx_gold_claim_workspace ON gold_claim(workspace);
+
+CREATE TABLE IF NOT EXISTS eval_run (
+    eval_id      TEXT PRIMARY KEY,
+    workspace    TEXT NOT NULL,
+    created_at   TEXT NOT NULL,
+    config_json  TEXT NOT NULL,
+    metrics_json TEXT NOT NULL
+);
+
+CREATE TABLE IF NOT EXISTS tiered_policy (
+    workspace                  TEXT PRIMARY KEY,
+    tier1_model                TEXT NOT NULL,
+    tier2_model                TEXT NOT NULL,
+    top_judge_model            TEXT,
+    escalation_threshold       REAL NOT NULL,
+    target                     REAL NOT NULL,
+    calibrated_at              TEXT NOT NULL,
+    calibrated_against_eval_id TEXT,
+    n_gold                     INTEGER NOT NULL
+);
diff --git a/src/orc/storage/workspace.py b/src/orc/storage/workspace.py
index 8db2744..d57696b 100644
--- a/src/orc/storage/workspace.py
+++ b/src/orc/storage/workspace.py
@@ -14,7 +14,13 @@
     workspace_traces_dir,
     workspaces_root,
 )
-from orc.storage.db import SCHEMA_VERSION, bootstrap_schema, open_connection, transaction
+from orc.storage.db import (
+    SCHEMA_VERSION,
+    bootstrap_schema,
+    ensure_schema,
+    open_connection,
+    transaction,
+)
 
 
 @dataclass(frozen=True)
@@ -78,6 +84,10 @@ def resolve(name: str | None) -> Workspace:
         raise WorkspaceNotFoundError(f"Workspace {resolved_name!r} not found")
 
     with open_connection(db_path) as conn:
+        # Additive migrations run on open so a workspace created by an older orc
+        # gains new tables (gold_claim/eval_run/tiered_policy) the first time a
+        # newer orc touches it. No-op once current.
+        ensure_schema(conn)
         row = conn.execute(
             "SELECT name, schema_version, created_at, embedding_model, corpus_version "
             "FROM workspace WHERE name = ?",
diff --git a/tests/unit/test_schema_migration.py b/tests/unit/test_schema_migration.py
new file mode 100644
index 0000000..586364d
--- /dev/null
+++ b/tests/unit/test_schema_migration.py
@@ -0,0 +1,22 @@
+from orc.paths import workspace_db_path
+from orc.storage import db
+from orc.storage import workspace as ws_module
+
+
+def test_existing_v1_workspace_gains_gold_tables_on_resolve(orc_home, monkeypatch) -> None:
+    # Create at v1 by forcing the old version, then resolve under v2 code.
+    monkeypatch.setattr(db, "SCHEMA_VERSION", 1)
+    ws_module.create("legacy")
+    monkeypatch.setattr(db, "SCHEMA_VERSION", 2)
+    ws_module.resolve("legacy")  # must migrate
+
+    with db.open_connection(workspace_db_path("legacy")) as conn:
+        names = {
+            r["name"]
+            for r in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
+        }
+        assert {"gold_claim", "eval_run", "tiered_policy"} <= names
+        ver = conn.execute(
+            "SELECT value FROM schema_meta WHERE key='schema_version'"
+        ).fetchone()["value"]
+        assert ver == "2"
diff --git a/tests/unit/test_workspace.py b/tests/unit/test_workspace.py
index 2d98f9f..d76274e 100644
--- a/tests/unit/test_workspace.py
+++ b/tests/unit/test_workspace.py
@@ -17,7 +17,7 @@
 def test_create_makes_dirs_and_db(orc_home: Path) -> None:
     ws = ws_module.create("demo")
     assert ws.name == "demo"
-    assert ws.schema_version == 1
+    assert ws.schema_version == 2
     assert ws.corpus_version == 0
     assert ws.embedding_model is None
     assert workspace_db_path("demo").exists()

From 73caa59ffa2df0071d737190d612e9e48b67367f Mon Sep 17 00:00:00 2001
From: Thormatt <thormatt@gmail.com>
Date: Fri, 12 Jun 2026 15:09:20 -0400
Subject: [PATCH 07/15] feat(eval): gold-set store (add + list)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---
 src/orc/eval/__init__.py      |  1 +
 src/orc/eval/gold.py          | 96 +++++++++++++++++++++++++++++++++++
 tests/unit/test_gold_store.py | 45 ++++++++++++++++
 3 files changed, 142 insertions(+)
 create mode 100644 src/orc/eval/__init__.py
 create mode 100644 src/orc/eval/gold.py
 create mode 100644 tests/unit/test_gold_store.py
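
Note on how relevant_chunk_ids is stored: the column is TEXT, so the list is
JSON-encoded on write and decoded on read. The sketch below isolates that
round trip and the label check from gold.py. The helper names
encode_chunk_ids(), decode_chunk_ids() and check_label() are hypothetical and
exist only for illustration.

```python
import json

VALID_LABELS = frozenset({"supported", "contradicted", "not_found", "partial"})


def encode_chunk_ids(ids):
    # Same expression as in add(): a falsy value (None or []) is stored as NULL.
    return json.dumps(ids) if ids else None


def decode_chunk_ids(raw):
    # Same expression as in list_gold(): NULL reads back as None.
    return json.loads(raw) if raw else None


def check_label(label):
    # Mirrors the guard at the top of add().
    if label not in VALID_LABELS:
        raise ValueError(f"expected_label must be one of {sorted(VALID_LABELS)}")
    return label


round_trip = decode_chunk_ids(encode_chunk_ids(["01ABC", "01DEF"]))
empty_list = decode_chunk_ids(encode_chunk_ids([]))

try:
    check_label("maybe")
    rejected = False
except ValueError:
    rejected = True
```

One consequence worth knowing: an empty list and None are indistinguishable
after a round trip, because both are written as NULL.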

diff --git a/src/orc/eval/__init__.py b/src/orc/eval/__init__.py
new file mode 100644
index 0000000..bc4efa6
--- /dev/null
+++ b/src/orc/eval/__init__.py
@@ -0,0 +1 @@
+"""Gate measurement: gold set, eval runs, and tiered-verification calibration."""
diff --git a/src/orc/eval/gold.py b/src/orc/eval/gold.py
new file mode 100644
index 0000000..5055cdd
--- /dev/null
+++ b/src/orc/eval/gold.py
@@ -0,0 +1,96 @@
+"""Per-workspace gold-set store: human-confirmed (claim -> verdict) labels.
+
+A gold entry pins the `corpus_version` it was labeled against, because
+chunk-level relevance (`relevant_chunk_ids`) is only valid for that snapshot —
+chunk IDs change on re-ingest. Judge-accuracy labels (the verdict) survive
+re-ingest; retrieval-recall labels must be read frozen against this version."""
+
+from __future__ import annotations
+
+import json
+from dataclasses import dataclass
+
+from orc.core.clock import now_iso
+from orc.core.ids import new_id
+from orc.paths import workspace_db_path
+from orc.storage.db import open_connection, transaction
+
+VALID_LABELS = frozenset({"supported", "contradicted", "not_found", "partial"})
+
+
+@dataclass(frozen=True)
+class GoldClaim:
+    gold_id: str
+    workspace: str
+    claim: str
+    expected_label: str
+    corpus_version: int
+    relevant_chunk_ids: list[str] | None
+    source: str
+    source_run_id: str | None
+    note: str | None
+    added_at: str
+    added_by: str | None
+
+
+def add(
+    workspace: str,
+    *,
+    claim: str,
+    expected_label: str,
+    corpus_version: int,
+    source: str,
+    relevant_chunk_ids: list[str] | None = None,
+    source_run_id: str | None = None,
+    note: str | None = None,
+    added_by: str | None = None,
+) -> str:
+    if expected_label not in VALID_LABELS:
+        raise ValueError(f"expected_label must be one of {sorted(VALID_LABELS)}")
+    gold_id = new_id()
+    with open_connection(workspace_db_path(workspace)) as conn, transaction(conn):
+        conn.execute(
+            "INSERT INTO gold_claim(gold_id, workspace, claim, expected_label, "
+            "corpus_version, relevant_chunk_ids, source, source_run_id, note, "
+            "added_at, added_by) VALUES (?,?,?,?,?,?,?,?,?,?,?)",
+            (
+                gold_id,
+                workspace,
+                claim,
+                expected_label,
+                corpus_version,
+                json.dumps(relevant_chunk_ids) if relevant_chunk_ids else None,
+                source,
+                source_run_id,
+                note,
+                now_iso(),
+                added_by,
+            ),
+        )
+    return gold_id
+
+
+def list_gold(workspace: str) -> list[GoldClaim]:
+    with open_connection(workspace_db_path(workspace)) as conn:
+        rows = conn.execute(
+            "SELECT * FROM gold_claim WHERE workspace=? ORDER BY added_at, gold_id",
+            (workspace,),
+        ).fetchall()
+    return [
+        GoldClaim(
+            gold_id=r["gold_id"],
+            workspace=r["workspace"],
+            claim=r["claim"],
+            expected_label=r["expected_label"],
+            corpus_version=r["corpus_version"],
+            relevant_chunk_ids=json.loads(r["relevant_chunk_ids"])
+            if r["relevant_chunk_ids"]
+            else None,
+            source=r["source"],
+            source_run_id=r["source_run_id"],
+            note=r["note"],
+            added_at=r["added_at"],
+            added_by=r["added_by"],
+        )
+        for r in rows
+    ]
diff --git a/tests/unit/test_gold_store.py b/tests/unit/test_gold_store.py
new file mode 100644
index 0000000..c859a6c
--- /dev/null
+++ b/tests/unit/test_gold_store.py
@@ -0,0 +1,45 @@
+import pytest
+
+from orc.eval import gold
+from orc.storage import workspace as ws_module
+
+
+def test_add_and_list_gold_claim(orc_home) -> None:
+    ws_module.create("demo")
+    gid = gold.add(
+        "demo",
+        claim="The sky is blue",
+        expected_label="supported",
+        corpus_version=0,
+        source="import",
+        note="seed",
+    )
+    [g] = gold.list_gold("demo")
+    assert g.gold_id == gid
+    assert g.claim == "The sky is blue"
+    assert g.expected_label == "supported"
+    assert g.relevant_chunk_ids is None
+    assert g.source == "import"
+
+
+def test_add_preserves_relevant_chunk_ids(orc_home) -> None:
+    ws_module.create("demo")
+    gold.add(
+        "demo",
+        claim="x",
+        expected_label="supported",
+        corpus_version=3,
+        source="promoted",
+        relevant_chunk_ids=["01ABC", "01DEF"],
+        source_run_id="01RUN",
+    )
+    [g] = gold.list_gold("demo")
+    assert g.relevant_chunk_ids == ["01ABC", "01DEF"]
+    assert g.corpus_version == 3
+    assert g.source_run_id == "01RUN"
+
+
+def test_add_rejects_unknown_label(orc_home) -> None:
+    ws_module.create("demo")
+    with pytest.raises(ValueError, match="expected_label"):
+        gold.add("demo", claim="x", expected_label="maybe", corpus_version=0, source="import")

From bd9a89b0fb99970e0ded4acd8fe5f5d1b1933faa Mon Sep 17 00:00:00 2001
From: Thormatt <thormatt@gmail.com>
Date: Fri, 12 Jun 2026 15:11:31 -0400
Subject: [PATCH 08/15] feat(cli): orc eval import/label/gold list

The gold set's producer surface. `orc eval import` seeds from a YAML
file (the existing claims fixture format); `orc eval label <run_id>`
promotes or corrects a real verdict into gold, pulling the claim and
corpus_version straight from the trace so the label is grounded in
exactly what orc verified; `orc eval gold list` shows entries and
flags stale chunk-level labels (corpus_version behind the workspace).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---
 src/orc/cli.py                   |   2 +
 src/orc/cli_commands/eval_cmd.py | 126 +++++++++++++++++++++++++++++++
 tests/unit/test_eval_cli.py      |  75 ++++++++++++++++++
 3 files changed, 203 insertions(+)
 create mode 100644 src/orc/cli_commands/eval_cmd.py
 create mode 100644 tests/unit/test_eval_cli.py
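
Note on the stale flag: an entry is flagged only when it carries chunk-level
labels and was pinned to a corpus_version older than the workspace's current
one. Verdict-only entries are never flagged. The sketch below restates that
rule outside the CLI. The Entry class and stale_ids() are hypothetical names
for illustration; the condition itself is the one in gold_command.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    """Hypothetical stand-in for the three GoldClaim fields the rule reads."""
    gold_id: str
    corpus_version: int
    relevant_chunk_ids: Optional[list[str]]


def stale_ids(entries, workspace_corpus_version):
    # Chunk IDs change on re-ingest, so chunk-level labels pinned to an older
    # corpus_version can no longer be trusted against the current corpus.
    return {
        e.gold_id
        for e in entries
        if e.relevant_chunk_ids and e.corpus_version < workspace_corpus_version
    }


entries = [
    Entry("g1", corpus_version=1, relevant_chunk_ids=["01ABC"]),  # old, chunk-labeled
    Entry("g2", corpus_version=1, relevant_chunk_ids=None),       # old, verdict only
    Entry("g3", corpus_version=3, relevant_chunk_ids=["01DEF"]),  # current
]
flagged = stale_ids(entries, workspace_corpus_version=3)
```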

diff --git a/src/orc/cli.py b/src/orc/cli.py
index 39ce9ad..5f23692 100644
--- a/src/orc/cli.py
+++ b/src/orc/cli.py
@@ -3,6 +3,7 @@
 from orc import __version__
 from orc.cli_commands import approve as approve_cmd
 from orc.cli_commands import audit as audit_cmd
+from orc.cli_commands import eval_cmd
 from orc.cli_commands import execute as execute_cmd
 from orc.cli_commands import ingest as ingest_cmd
 from orc.cli_commands import mcp as mcp_cmd
@@ -36,6 +37,7 @@ def main() -> None:
 main.add_command(execute_cmd.execute_command)
 main.add_command(worker_cmd.worker_command)
 main.add_command(audit_cmd.audit_group)
+main.add_command(eval_cmd.eval_group)
 main.add_command(mcp_cmd.mcp)
 
 
diff --git a/src/orc/cli_commands/eval_cmd.py b/src/orc/cli_commands/eval_cmd.py
new file mode 100644
index 0000000..f98a759
--- /dev/null
+++ b/src/orc/cli_commands/eval_cmd.py
@@ -0,0 +1,126 @@
+"""`orc eval ...` — gold set, gate measurement, and tiered calibration."""
+
+from __future__ import annotations
+
+import json as json_lib
+from pathlib import Path
+
+import click
+import yaml
+
+from orc.errors import WorkspaceNotFoundError
+from orc.eval import gold
+from orc.storage import workspace as ws_module
+from orc.storage.trace_store import load_trace
+
+_LABELS = ["supported", "contradicted", "not_found", "partial"]
+
+
+@click.group("eval")
+def eval_group() -> None:
+    """Measure and calibrate the verification gate against a gold set."""
+
+
+@eval_group.command("import")
+@click.argument("path", type=click.Path(exists=True, dir_okay=False, path_type=Path))
+@click.option("--workspace", "-w", default=None, help="Workspace name (env: ORC_DEFAULT_WORKSPACE)")
+def import_command(path: Path, workspace: str | None) -> None:
+    """Seed gold claims from a YAML file (id/text/expected[/relevant_chunk_ids/note])."""
+    ws = _resolve(workspace)
+    items = yaml.safe_load(path.read_text()) or []
+    n = 0
+    for item in items:
+        gold.add(
+            ws.name,
+            claim=item["text"],
+            expected_label=item["expected"],
+            corpus_version=ws.corpus_version,
+            relevant_chunk_ids=item.get("relevant_chunk_ids"),
+            source="import",
+            note=item.get("note"),
+        )
+        n += 1
+    click.echo(f"Imported {n} gold claim(s) into {ws.name}")
+
+
+@eval_group.command("label")
+@click.argument("run_id")
+@click.option("--verdict", required=True, type=click.Choice(_LABELS))
+@click.option("--relevant", "relevant", multiple=True, help="Relevant chunk id (repeatable)")
+@click.option("--workspace", "-w", default=None)
+@click.option("--note", default=None)
+def label_command(
+    run_id: str,
+    verdict: str,
+    relevant: tuple[str, ...],
+    workspace: str | None,
+    note: str | None,
+) -> None:
+    """Promote/correct a real verdict into the gold set.
+
+    Pulls the claim and corpus_version straight from the run's trace, so a
+    promoted label is grounded in exactly what orc verified."""
+    try:
+        trace = load_trace(run_id)
+    except Exception as exc:  # TraceNotFoundError and friends
+        raise click.ClickException(f"Run {run_id} not found: {exc}") from exc
+    claim = (trace.get("inputs") or {}).get("claim") or (trace.get("output") or {}).get("claim")
+    if not claim:
+        raise click.ClickException(f"Run {run_id} has no claim to label")
+    gold.add(
+        trace["workspace"],
+        claim=claim,
+        expected_label=verdict,
+        corpus_version=trace["corpus_version"],
+        relevant_chunk_ids=list(relevant) or None,
+        source="promoted",
+        source_run_id=run_id,
+        note=note,
+    )
+    click.echo(f"Labelled run {run_id} as {verdict} in {trace['workspace']}")
+
+
+@eval_group.command("gold")
+@click.argument("action", type=click.Choice(["list"]))
+@click.option("--workspace", "-w", default=None)
+@click.option("--json", "as_json", is_flag=True)
+def gold_command(action: str, workspace: str | None, as_json: bool) -> None:
+    """Inspect the gold set (currently: list)."""
+    ws = _resolve(workspace)
+    items = gold.list_gold(ws.name)
+    stale = {
+        g.gold_id
+        for g in items
+        if g.relevant_chunk_ids and g.corpus_version < ws.corpus_version
+    }
+    if as_json:
+        click.echo(
+            json_lib.dumps(
+                [
+                    {
+                        "gold_id": g.gold_id,
+                        "claim": g.claim,
+                        "expected_label": g.expected_label,
+                        "corpus_version": g.corpus_version,
+                        "source": g.source,
+                        "stale_chunk_labels": g.gold_id in stale,
+                    }
+                    for g in items
+                ],
+                indent=2,
+            )
+        )
+        return
+    if not items:
+        click.echo(f"No gold claims in {ws.name}")
+        return
+    for g in items:
+        flag = "  [stale chunk labels]" if g.gold_id in stale else ""
+        click.echo(f"{g.gold_id}  {g.expected_label:<12} {g.claim[:60]}{flag}")
+
+
+def _resolve(workspace: str | None) -> ws_module.Workspace:
+    try:
+        return ws_module.resolve(workspace)
+    except WorkspaceNotFoundError as exc:
+        raise click.ClickException(str(exc)) from exc
diff --git a/tests/unit/test_eval_cli.py b/tests/unit/test_eval_cli.py
new file mode 100644
index 0000000..90efdbd
--- /dev/null
+++ b/tests/unit/test_eval_cli.py
@@ -0,0 +1,75 @@
+"""`orc eval` gold-set CLI: import from YAML, promote a real verdict, list."""
+
+from __future__ import annotations
+
+from click.testing import CliRunner
+
+from orc import directives
+from orc.cli import main
+from orc.eval import gold
+from orc.ingest.pipeline import ingest as do_ingest
+from orc.llm import client as client_module
+from orc.runs import open_run
+from orc.storage import workspace as ws_module
+from tests._fake_llm import FakeAnthropic, make_verdict_response
+
+
+def test_eval_import_seeds_gold_from_yaml(orc_home, tmp_path) -> None:
+    ws_module.create("demo")
+    f = tmp_path / "claims.yaml"
+    f.write_text(
+        "- id: c1\n  text: The sky is blue\n  expected: supported\n"
+        "- id: c2\n  text: Pigs fly\n  expected: not_found\n"
+    )
+    res = CliRunner().invoke(main, ["eval", "import", str(f), "-w", "demo"])
+    assert res.exit_code == 0, res.output
+    labels = {g.expected_label for g in gold.list_gold("demo")}
+    assert labels == {"supported", "not_found"}
+
+
+def test_eval_label_promotes_a_real_verdict(orc_home, tmp_path, monkeypatch) -> None:
+    # Build one real verify run, then promote its verdict into the gold set.
+    ws = ws_module.create("demo")
+    corpus = tmp_path / "corpus"
+    corpus.mkdir()
+    (corpus / "doc.md").write_text("# Doc\n\nThe sky is blue on a clear day.\n")
+    do_ingest(ws, str(corpus))
+
+    fake = FakeAnthropic(responses=[make_verdict_response(label="supported", confidence=0.9)])
+    monkeypatch.setattr(client_module, "_client", fake)
+    monkeypatch.setattr(client_module, "_factory", None)
+
+    skill = directives.get("research").skills["verify_claim"]
+    with open_run(ws, directive="research", skill="verify_claim", inputs={}) as run:
+        result = skill.run(workspace=ws, run=run, claim="The sky is blue")
+        run.close(output=result)
+    run_id = run.run_id
+
+    res = CliRunner().invoke(main, ["eval", "label", run_id, "--verdict", "contradicted", "-w", "demo"])
+    assert res.exit_code == 0, res.output
+
+    [g] = gold.list_gold("demo")
+    assert g.expected_label == "contradicted"  # human corrected the model
+    assert g.claim == "The sky is blue"
+    assert g.source == "promoted"
+    assert g.source_run_id == run_id
+    assert g.corpus_version == ws.corpus_version
+
+
+def test_eval_label_unknown_run_fails_cleanly(orc_home) -> None:
+    ws_module.create("demo")
+    res = CliRunner().invoke(main, ["eval", "label", "01NOSUCHRUN", "--verdict", "supported", "-w", "demo"])
+    assert res.exit_code != 0
+    assert "01NOSUCHRUN" in res.output
+
+
+def test_eval_gold_list_json(orc_home) -> None:
+    import json
+
+    ws_module.create("demo")
+    gold.add("demo", claim="x", expected_label="supported", corpus_version=0, source="import")
+    res = CliRunner().invoke(main, ["eval", "gold", "list", "-w", "demo", "--json"])
+    assert res.exit_code == 0, res.output
+    [item] = json.loads(res.output)
+    assert item["expected_label"] == "supported"
+    assert item["stale_chunk_labels"] is False

From d6f77b74ce25c21dc82f0e633783ded95696b449 Mon Sep 17 00:00:00 2001
From: Thormatt <thormatt@gmail.com>
Date: Fri, 12 Jun 2026 15:40:24 -0400
Subject: [PATCH 09/15] =?UTF-8?q?feat(eval):=20run=5Feval=20=E2=80=94=20ju?=
 =?UTF-8?q?dge=20accuracy,=20calibration,=20recall?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Verifies every gold claim frozen against its labeled corpus_version,
inside a per-claim traced Run tagged with the eval id, so an eval is
inspectable claim-by-claim and replayable. Aggregates exact-match
judge accuracy, supported-class precision/recall/F1, confidence
calibration (reliability bins + ECE), and retrieval recall@k where
chunk-level labels exist. Persists to eval_run; load_eval reloads it.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---
 src/orc/eval/runner.py         | 164 +++++++++++++++++++++++++++++++++
 tests/unit/test_eval_runner.py |  70 ++++++++++++++
 2 files changed, 234 insertions(+)
 create mode 100644 src/orc/eval/runner.py
 create mode 100644 tests/unit/test_eval_runner.py
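
Note on the calibration figure: the runner delegates to
orc.metrics.calibration, which is not part of this patch. The sketch below is
a standalone version of the usual expected-calibration-error definition (the
per-bin gap between mean confidence and accuracy, weighted by bin size) and
assumes equal-width bins. The function name ece() is hypothetical. It
reproduces the 0.35 asserted in test_eval_runner.py.

```python
def ece(results, n_bins=10):
    """results: list of (confidence, correct) pairs, confidence in [0, 1]."""
    total = len(results)
    if total == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, correct in results:
        # Equal-width bins; a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    err = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        err += (len(b) / total) * abs(mean_conf - accuracy)
    return err


# One correct verdict at 0.9 and one wrong verdict at 0.6 fall in separate
# bins, each holding half the items: 0.5 * |0.9 - 1| + 0.5 * |0.6 - 0|.
value = ece([(0.9, True), (0.6, False)])
```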

diff --git a/src/orc/eval/runner.py b/src/orc/eval/runner.py
new file mode 100644
index 0000000..402f7bb
--- /dev/null
+++ b/src/orc/eval/runner.py
@@ -0,0 +1,164 @@
+"""Run the verification gate against a workspace's gold set and score it.
+
+Each gold claim is verified frozen against the `corpus_version` it was labeled
+on (so retrieval-recall labels stay valid), inside its own traced Run tagged
+with the eval id — an eval is therefore inspectable claim-by-claim and
+replayable like any other orc run. The aggregate (judge accuracy, confidence
+calibration, retrieval recall) is persisted to the eval_run table."""
+
+from __future__ import annotations
+
+import json
+from dataclasses import asdict, dataclass
+
+from orc import directives
+from orc.core.clock import now_iso
+from orc.core.ids import new_id
+from orc.eval import gold as gold_store
+from orc.metrics.calibration import (
+    Bin,
+    ConfidenceResult,
+    expected_calibration_error,
+    reliability_bins,
+)
+from orc.metrics.scoring import LabeledResult, confusion, scores
+from orc.paths import workspace_db_path
+from orc.runs import open_run
+from orc.storage import workspace as ws_module
+from orc.storage.db import open_connection, transaction
+
+
+@dataclass(frozen=True)
+class EvalReport:
+    eval_id: str
+    workspace: str
+    mode: str
+    n: int
+    accuracy: float
+    supported_precision: float
+    supported_recall: float
+    supported_f1: float
+    calibration_ece: float
+    reliability: list[Bin]
+    retrieval_recall: float | None
+    n_retrieval_labeled: int
+    stale_entries: int
+
+
+def recall(*, retrieved: list[str], relevant: list[str]) -> float | None:
+    """Recall@k of relevant chunks among retrieved. None when there is nothing
+    to recall (no relevant chunks labeled)."""
+    if not relevant:
+        return None
+    hit = len(set(retrieved) & set(relevant))
+    return hit / len(relevant)
+
+
+def run_eval(workspace: str, *, mode: str = "evidence", k: int = 10) -> EvalReport:
+    ws = ws_module.resolve(workspace)
+    items = gold_store.list_gold(ws.name)
+    if not items:
+        raise ValueError(f"workspace {ws.name!r} has no gold claims to evaluate")
+
+    eval_id = new_id()
+    skill = directives.get("research").skills["verify_claim"]
+
+    labeled: list[LabeledResult] = []
+    confidences: list[ConfidenceResult] = []
+    recalls: list[float] = []
+    stale = 0
+
+    for g in items:
+        with open_run(
+            ws,
+            directive="research",
+            skill="verify_claim",
+            inputs={"_eval_id": eval_id, "gold_id": g.gold_id, "claim": g.claim},
+        ) as run:
+            result = skill.run(
+                workspace=ws, run=run, claim=g.claim, mode=mode,
+                corpus_version=g.corpus_version,
+            )
+            run.close(output=result)
+
+        predicted = result["label"]
+        correct = predicted == g.expected_label
+        labeled.append(LabeledResult(predicted=predicted, expected=g.expected_label))
+        confidences.append(ConfidenceResult(confidence=float(result["confidence"]), correct=correct))
+
+        if g.relevant_chunk_ids:
+            r = recall(retrieved=result.get("retrieval_chunk_ids", []), relevant=g.relevant_chunk_ids)
+            if r is not None:
+                recalls.append(r)
+            if g.corpus_version < ws.corpus_version:
+                stale += 1
+
+    cm = confusion(labeled, positive="supported")
+    sc = scores(cm)
+    n_correct = sum(1 for r in confidences if r.correct)
+    bins = reliability_bins(confidences, n_bins=10)
+
+    report = EvalReport(
+        eval_id=eval_id,
+        workspace=ws.name,
+        mode=mode,
+        n=len(items),
+        accuracy=n_correct / len(items),
+        supported_precision=sc["precision"],
+        supported_recall=sc["recall"],
+        supported_f1=sc["f1"],
+        calibration_ece=expected_calibration_error(bins),
+        reliability=bins,
+        retrieval_recall=(sum(recalls) / len(recalls)) if recalls else None,
+        n_retrieval_labeled=len(recalls),
+        stale_entries=stale,
+    )
+    _persist(ws.name, report, mode=mode, k=k)
+    return report
+
+
+def _persist(workspace: str, report: EvalReport, *, mode: str, k: int) -> None:
+    with open_connection(workspace_db_path(workspace)) as conn, transaction(conn):
+        conn.execute(
+            "INSERT INTO eval_run(eval_id, workspace, created_at, config_json, metrics_json) "
+            "VALUES (?,?,?,?,?)",
+            (
+                report.eval_id,
+                workspace,
+                now_iso(),
+                json.dumps({"mode": mode, "k": k}),
+                json.dumps(_metrics_dict(report)),
+            ),
+        )
+
+
+def _metrics_dict(report: EvalReport) -> dict:
+    d = asdict(report)
+    d["reliability"] = [asdict(b) for b in report.reliability]
+    return d
+
+
+def load_eval(workspace: str, eval_id: str) -> EvalReport:
+    with open_connection(workspace_db_path(workspace)) as conn:
+        row = conn.execute(
+            "SELECT metrics_json FROM eval_run WHERE workspace=? AND eval_id=?",
+            (workspace, eval_id),
+        ).fetchone()
+    if row is None:
+        raise KeyError(f"no eval_run {eval_id!r} in {workspace!r}")
+    m = json.loads(row["metrics_json"])
+    return EvalReport(
+        eval_id=m["eval_id"],
+        workspace=m["workspace"],
+        mode=m["mode"],
+        n=m["n"],
+        accuracy=m["accuracy"],
+        supported_precision=m["supported_precision"],
+        supported_recall=m["supported_recall"],
+        supported_f1=m["supported_f1"],
+        calibration_ece=m["calibration_ece"],
+        reliability=[Bin(**b) for b in m["reliability"]],
+        retrieval_recall=m["retrieval_recall"],
+        n_retrieval_labeled=m["n_retrieval_labeled"],
+        stale_entries=m["stale_entries"],
+    )
diff --git a/tests/unit/test_eval_runner.py b/tests/unit/test_eval_runner.py
new file mode 100644
index 0000000..90bf3f0
--- /dev/null
+++ b/tests/unit/test_eval_runner.py
@@ -0,0 +1,70 @@
+"""The eval runner scores the gate against a workspace's gold set."""
+
+from __future__ import annotations
+
+import pytest
+from click.testing import CliRunner  # noqa: F401  (kept for parity; unused here)
+
+from orc.eval import gold
+from orc.eval.runner import recall, run_eval
+from orc.ingest.pipeline import ingest as do_ingest
+from orc.llm import client as client_module
+from orc.storage import workspace as ws_module
+from tests._fake_llm import FakeAnthropic, make_verdict_response
+
+
+def test_recall_at_k_is_intersection_over_relevant() -> None:
+    assert recall(retrieved=["a", "b", "c"], relevant=["b", "d"]) == 0.5
+    assert recall(retrieved=["a", "b"], relevant=["a", "b"]) == 1.0
+    assert recall(retrieved=[], relevant=["a"]) == 0.0
+    assert recall(retrieved=["a"], relevant=[]) is None  # nothing to recall
+
+
+def _setup(orc_home, tmp_path) -> tuple[str, str, int]:
+    from orc.paths import workspace_db_path
+    from orc.storage.db import open_connection
+
+    ws = ws_module.create("demo")
+    corpus = tmp_path / "corpus"
+    corpus.mkdir()
+    (corpus / "doc.md").write_text("# Doc\n\nThe sky is blue on a clear day.\n")
+    do_ingest(ws, str(corpus))
+    with open_connection(workspace_db_path("demo")) as conn:
+        chunk_id = conn.execute("SELECT chunk_id FROM chunk ORDER BY seq LIMIT 1").fetchone()["chunk_id"]
+    # Ingest bumps corpus_version; gold must pin the version where chunks exist.
+    cv = ws_module.resolve("demo").corpus_version
+    return ws.name, chunk_id, cv
+
+
+def test_run_eval_scores_judge_accuracy_and_persists(orc_home, tmp_path, monkeypatch) -> None:
+    name, chunk_id, cv = _setup(orc_home, tmp_path)
+    # Multi-word claims so BM25 retrieves the chunk (single-char claims drop out).
+    gold.add(name, claim="The sky is blue", expected_label="supported", corpus_version=cv, source="import")
+    gold.add(name, claim="The sky is green", expected_label="contradicted", corpus_version=cv, source="import")
+
+    # Model says supported (citing a real chunk so the guard keeps it) to both:
+    # claim 1 correct, claim 2 wrong -> accuracy 0.5.
+    fake = FakeAnthropic(responses=[
+        make_verdict_response(label="supported", confidence=0.9, supporting_chunk_ids=[chunk_id]),
+        make_verdict_response(label="supported", confidence=0.6, supporting_chunk_ids=[chunk_id]),
+    ])
+    monkeypatch.setattr(client_module, "_client", fake)
+    monkeypatch.setattr(client_module, "_factory", None)
+
+    report = run_eval(name, mode="evidence")
+    assert report.n == 2
+    assert report.accuracy == 0.5
+    # one correct at 0.9, one wrong at 0.6 -> ECE = (|0.9-1| + |0.6-0|)/2 = 0.35
+    assert round(report.calibration_ece, 4) == 0.35
+
+    # The eval is itself persisted and reloadable.
+    from orc.eval.runner import load_eval
+    again = load_eval(name, report.eval_id)
+    assert again.accuracy == report.accuracy
+    assert again.n == 2
+
+
+def test_run_eval_empty_gold_raises(orc_home) -> None:
+    ws_module.create("demo")
+    with pytest.raises(ValueError, match="no gold"):
+        run_eval("demo", mode="evidence")

From 2ddb16901c68f358679bc0e3c68f7367cd910cc7 Mon Sep 17 00:00:00 2001
From: Thormatt <thormatt@gmail.com>
Date: Fri, 12 Jun 2026 15:41:32 -0400
Subject: [PATCH 10/15] feat(cli): orc eval run/show

orc eval run scores the gate against the gold set and prints judge
accuracy, supported-class P/R/F1, confidence ECE, retrieval recall,
and a stale-label warning (--json for the full metrics dict). orc eval
show reprints a persisted eval report by id.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---
 src/orc/cli_commands/eval_cmd.py | 63 ++++++++++++++++++++++++++++++++
 tests/unit/test_eval_cli.py      | 40 ++++++++++++++++++++
 2 files changed, 103 insertions(+)
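
A minimal sketch of the two less obvious numbers `orc eval run` prints, retrieval recall and confidence ECE. The function names `recall_at_k` and `expected_calibration_error` and the ten-bin layout are assumptions made for illustration; the metrics library in this series may bin or name things differently.

```python
from __future__ import annotations


def recall_at_k(retrieved: list[str], relevant: list[str]) -> float | None:
    """Fraction of the relevant chunk ids that appear in the retrieved list."""
    if not relevant:
        return None  # nothing to recall, so the metric is undefined
    return len(set(retrieved) & set(relevant)) / len(set(relevant))


def expected_calibration_error(
    results: list[tuple[float, bool]], *, n_bins: int = 10
) -> float:
    """Binned ECE: size-weighted mean of |avg confidence - accuracy| per bin."""
    if not results:
        return 0.0
    bins: dict[int, list[tuple[float, bool]]] = {}
    for confidence, correct in results:
        index = min(int(confidence * n_bins), n_bins - 1)  # 1.0 goes in the top bin
        bins.setdefault(index, []).append((confidence, correct))
    total = len(results)
    ece = 0.0
    for members in bins.values():
        avg_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_confidence - accuracy)
    return ece
```

With one correct verdict at confidence 0.9 and one wrong verdict at 0.6, each lands in its own bin, so the result is (0.1 + 0.6) / 2, which matches the 0.35 asserted in the runner test.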

diff --git a/src/orc/cli_commands/eval_cmd.py b/src/orc/cli_commands/eval_cmd.py
index f98a759..79d62f9 100644
--- a/src/orc/cli_commands/eval_cmd.py
+++ b/src/orc/cli_commands/eval_cmd.py
@@ -80,6 +80,69 @@ def label_command(
     click.echo(f"Labelled run {run_id} as {verdict} in {trace['workspace']}")
 
 
+@eval_group.command("run")
+@click.option("--workspace", "-w", default=None)
+@click.option("--mode", default="evidence", help="Verify mode to evaluate")
+@click.option("--k", type=int, default=10, help="Retrieval depth for recall@k")
+@click.option("--json", "as_json", is_flag=True)
+def run_command(workspace: str | None, mode: str, k: int, as_json: bool) -> None:
+    """Score the gate against the workspace's gold set."""
+    from orc.eval.runner import run_eval
+
+    ws = _resolve(workspace)
+    try:
+        report = run_eval(ws.name, mode=mode, k=k)
+    except ValueError as exc:
+        raise click.ClickException(str(exc)) from exc
+    if as_json:
+        click.echo(_report_json(report))
+        return
+    click.echo(f"eval {report.eval_id}  mode={report.mode}  n={report.n}")
+    click.echo(f"  judge accuracy : {report.accuracy:.3f}")
+    click.echo(
+        f"  supported P/R/F1: {report.supported_precision:.3f} / "
+        f"{report.supported_recall:.3f} / {report.supported_f1:.3f}"
+    )
+    click.echo(f"  calibration ECE: {report.calibration_ece:.3f}  (lower = better calibrated)")
+    if report.retrieval_recall is not None:
+        click.echo(
+            f"  retrieval recall: {report.retrieval_recall:.3f}  "
+            f"({report.n_retrieval_labeled} labelled)"
+        )
+    if report.stale_entries:
+        click.echo(
+            f"  warning: {report.stale_entries} gold entr(ies) have chunk labels "
+            f"older than the current corpus — recall measured frozen.",
+            err=True,
+        )
+
+
+@eval_group.command("show")
+@click.argument("eval_id")
+@click.option("--workspace", "-w", default=None)
+@click.option("--json", "as_json", is_flag=True)
+def show_command(eval_id: str, workspace: str | None, as_json: bool) -> None:
+    """Reprint a persisted eval report."""
+    from orc.eval.runner import load_eval
+
+    ws = _resolve(workspace)
+    try:
+        report = load_eval(ws.name, eval_id)
+    except KeyError as exc:
+        raise click.ClickException(f"No eval {eval_id} in {ws.name}") from exc
+    click.echo(_report_json(report) if as_json else
+               f"eval {report.eval_id}  mode={report.mode}  n={report.n}  "
+               f"accuracy={report.accuracy:.3f}  ECE={report.calibration_ece:.3f}")
+
+
+def _report_json(report: object) -> str:
+    from dataclasses import asdict
+
+    d = asdict(report)
+    d["reliability"] = [asdict(b) for b in report.reliability]  # type: ignore[attr-defined]
+    return json_lib.dumps(d, indent=2)
+
+
 @eval_group.command("gold")
 @click.argument("action", type=click.Choice(["list"]))
 @click.option("--workspace", "-w", default=None)
diff --git a/tests/unit/test_eval_cli.py b/tests/unit/test_eval_cli.py
index 90efdbd..2e77535 100644
--- a/tests/unit/test_eval_cli.py
+++ b/tests/unit/test_eval_cli.py
@@ -73,3 +73,43 @@ def test_eval_gold_list_json(orc_home) -> None:
     [item] = json.loads(res.output)
     assert item["expected_label"] == "supported"
     assert item["stale_chunk_labels"] is False
+
+
+def test_eval_run_and_show_roundtrip(orc_home, tmp_path, monkeypatch) -> None:
+    import json as json_lib
+
+    from orc.paths import workspace_db_path
+    from orc.storage.db import open_connection
+
+    ws = ws_module.create("demo")
+    corpus = tmp_path / "corpus"
+    corpus.mkdir()
+    (corpus / "doc.md").write_text("# Doc\n\nThe sky is blue on a clear day.\n")
+    do_ingest(ws, str(corpus))
+    with open_connection(workspace_db_path("demo")) as conn:
+        chunk_id = conn.execute("SELECT chunk_id FROM chunk ORDER BY seq LIMIT 1").fetchone()["chunk_id"]
+    cv = ws_module.resolve("demo").corpus_version
+    gold.add("demo", claim="The sky is blue", expected_label="supported", corpus_version=cv, source="import")
+
+    fake = FakeAnthropic(responses=[
+        make_verdict_response(label="supported", confidence=0.9, supporting_chunk_ids=[chunk_id]),
+    ])
+    monkeypatch.setattr(client_module, "_client", fake)
+    monkeypatch.setattr(client_module, "_factory", None)
+
+    res = CliRunner().invoke(main, ["eval", "run", "-w", "demo", "--json"])
+    assert res.exit_code == 0, res.output
+    payload = json_lib.loads(res.output)
+    assert payload["n"] == 1
+    assert payload["accuracy"] == 1.0
+
+    res2 = CliRunner().invoke(main, ["eval", "show", payload["eval_id"], "-w", "demo"])
+    assert res2.exit_code == 0, res2.output
+    assert payload["eval_id"] in res2.output
+
+
+def test_eval_run_with_no_gold_fails_cleanly(orc_home) -> None:
+    ws_module.create("demo")
+    res = CliRunner().invoke(main, ["eval", "run", "-w", "demo"])
+    assert res.exit_code != 0
+    assert "gold" in res.output.lower()

From b06c1ef0cb0edc8b6aedcd1e88ba2db3b9265a77 Mon Sep 17 00:00:00 2001
From: Thormatt <thormatt@gmail.com>
Date: Fri, 12 Jun 2026 16:08:20 -0400
Subject: [PATCH 11/15] feat(eval): tiered_policy store (load/save)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---
 src/orc/eval/policy.py           | 71 ++++++++++++++++++++++++++++++++
 tests/unit/test_tiered_policy.py | 37 +++++++++++++++++
 2 files changed, 108 insertions(+)
 create mode 100644 src/orc/eval/policy.py
 create mode 100644 tests/unit/test_tiered_policy.py
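
A small sketch of why `INSERT OR REPLACE` yields one policy row per workspace: the replace only triggers when the insert would violate a uniqueness constraint. The trimmed schema below, with `workspace` as the primary key, is an assumption for illustration; the real `tiered_policy` table comes from the schema migration elsewhere in this series and has more columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
# Assumed, trimmed schema: only enough columns to show the replace semantics.
conn.execute(
    "CREATE TABLE tiered_policy ("
    "workspace TEXT PRIMARY KEY, "
    "escalation_threshold REAL NOT NULL, "
    "n_gold INTEGER NOT NULL)"
)


def save(workspace: str, threshold: float, n_gold: int) -> None:
    """Write the policy row, replacing any earlier row for the same workspace."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO tiered_policy"
            "(workspace, escalation_threshold, n_gold) VALUES (?,?,?)",
            (workspace, threshold, n_gold),
        )


save("demo", 0.8, 1)
save("demo", 0.95, 2)  # same primary key, so the first row is replaced
rows = conn.execute("SELECT * FROM tiered_policy").fetchall()
```

After both calls `rows` holds a single row carrying the second threshold, which is the behaviour `test_save_policy_replaces_prior` checks. Without a primary key or unique constraint on `workspace`, the same statement would simply append a second row.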

diff --git a/src/orc/eval/policy.py b/src/orc/eval/policy.py
new file mode 100644
index 0000000..88b0fe8
--- /dev/null
+++ b/src/orc/eval/policy.py
@@ -0,0 +1,71 @@
+"""Tiered-verification policy: the calibrated escalation threshold and the
+models for each tier, one row per workspace.
+
+`orc eval calibrate` writes it from the gold set; `tiered_verify` reads it.
+Stored in orc.db (not config.toml) because it is calibration *state* — derived
+data stamped with which eval produced it — not policy a human hand-edits."""
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+
+from orc.core.clock import now_iso
+from orc.paths import workspace_db_path
+from orc.storage.db import open_connection, transaction
+
+
+@dataclass(frozen=True)
+class TieredPolicy:
+    workspace: str
+    tier1_model: str
+    tier2_model: str
+    top_judge_model: str | None
+    escalation_threshold: float
+    target: float
+    calibrated_at: str
+    calibrated_against_eval_id: str | None
+    n_gold: int
+
+
+def save_policy(
+    workspace: str,
+    *,
+    tier1_model: str,
+    tier2_model: str,
+    top_judge_model: str | None,
+    escalation_threshold: float,
+    target: float,
+    calibrated_against_eval_id: str | None,
+    n_gold: int,
+) -> None:
+    with open_connection(workspace_db_path(workspace)) as conn, transaction(conn):
+        conn.execute(
+            "INSERT OR REPLACE INTO tiered_policy(workspace, tier1_model, tier2_model, "
+            "top_judge_model, escalation_threshold, target, calibrated_at, "
+            "calibrated_against_eval_id, n_gold) VALUES (?,?,?,?,?,?,?,?,?)",
+            (
+                workspace, tier1_model, tier2_model, top_judge_model,
+                escalation_threshold, target, now_iso(),
+                calibrated_against_eval_id, n_gold,
+            ),
+        )
+
+
+def load_policy(workspace: str) -> TieredPolicy | None:
+    with open_connection(workspace_db_path(workspace)) as conn:
+        row = conn.execute(
+            "SELECT * FROM tiered_policy WHERE workspace=?", (workspace,)
+        ).fetchone()
+    if row is None:
+        return None
+    return TieredPolicy(
+        workspace=row["workspace"],
+        tier1_model=row["tier1_model"],
+        tier2_model=row["tier2_model"],
+        top_judge_model=row["top_judge_model"],
+        escalation_threshold=row["escalation_threshold"],
+        target=row["target"],
+        calibrated_at=row["calibrated_at"],
+        calibrated_against_eval_id=row["calibrated_against_eval_id"],
+        n_gold=row["n_gold"],
+    )
diff --git a/tests/unit/test_tiered_policy.py b/tests/unit/test_tiered_policy.py
new file mode 100644
index 0000000..a163cd1
--- /dev/null
+++ b/tests/unit/test_tiered_policy.py
@@ -0,0 +1,37 @@
+from orc.eval.policy import TieredPolicy, load_policy, save_policy
+from orc.storage import workspace as ws_module
+
+
+def test_load_policy_is_none_before_calibration(orc_home) -> None:
+    ws_module.create("demo")
+    assert load_policy("demo") is None
+
+
+def test_save_then_load_policy_roundtrip(orc_home) -> None:
+    ws_module.create("demo")
+    save_policy(
+        "demo",
+        tier1_model="claude-haiku-4-5",
+        tier2_model="claude-sonnet-4-6",
+        top_judge_model="gpt-4o",
+        escalation_threshold=0.92,
+        target=0.95,
+        calibrated_against_eval_id="01EVAL",
+        n_gold=40,
+    )
+    p = load_policy("demo")
+    assert isinstance(p, TieredPolicy)
+    assert p.escalation_threshold == 0.92
+    assert p.top_judge_model == "gpt-4o"
+    assert p.n_gold == 40
+
+
+def test_save_policy_replaces_prior(orc_home) -> None:
+    ws_module.create("demo")
+    save_policy("demo", tier1_model="h", tier2_model="s", top_judge_model=None,
+                escalation_threshold=0.8, target=0.9, calibrated_against_eval_id=None, n_gold=1)
+    save_policy("demo", tier1_model="h", tier2_model="s", top_judge_model=None,
+                escalation_threshold=0.95, target=0.99, calibrated_against_eval_id=None, n_gold=2)
+    p = load_policy("demo")
+    assert p.escalation_threshold == 0.95
+    assert p.n_gold == 2

From 2db53fccba0a0a3de4ed744b11a5c0be264156ae Mon Sep 17 00:00:00 2001
From: Thormatt <thormatt@gmail.com>
Date: Fri, 12 Jun 2026 16:10:08 -0400
Subject: [PATCH 12/15] feat(verify): tiered_verify meta-mode
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

mode="tiered" runs a cheap Tier-1 binary judge on every claim and
ships its verdict when confidence clears the calibrated escalation
threshold; below it, the claim escalates to a stronger Tier-2
evidence judge — optionally a different model family
(top_judge_model) so the escalation judge doesn't share Tier 1's
blind spots. The deciding tier, both confidences, and the escalation
reason are recorded in the trace. The threshold comes from the
workspace's tiered_policy (set by `orc eval calibrate`); with no
policy, a conservative default (0.9) is used and a warning fires.

Lives in a new modes/ package and takes the skill instance to call
self.run per tier — same pattern as decomposed — so no import cycle.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---
 .../research/skills/modes/__init__.py         |  1 +
 .../research/skills/modes/tiered.py           | 88 +++++++++++++++++
 .../research/skills/verify_claim.py           | 22 ++++-
 tests/unit/test_tiered_verify.py              | 97 +++++++++++++++++++
 4 files changed, 207 insertions(+), 1 deletion(-)
 create mode 100644 src/orc/directives/research/skills/modes/__init__.py
 create mode 100644 src/orc/directives/research/skills/modes/tiered.py
 create mode 100644 tests/unit/test_tiered_verify.py
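
The routing rule described in the commit message can be sketched without any of orc's machinery. Everything here is hypothetical scaffolding: `route`, the `Judge` alias, and the stub judges stand in for `run_tiered` and the real binary and evidence modes, and only the comparison against the threshold mirrors the patch.

```python
from __future__ import annotations

from typing import Any, Callable

Verdict = dict[str, Any]  # at least {"label": str, "confidence": float}
Judge = Callable[[str], Verdict]


def route(claim: str, *, tier1: Judge, tier2: Judge, threshold: float) -> Verdict:
    """Ship the Tier-1 verdict when it is confident enough, else ask Tier 2."""
    first = tier1(claim)
    if first["confidence"] >= threshold:
        return {**first, "tier": 1, "escalated": False}
    second = tier2(claim)  # only paid for when Tier 1 is unsure
    return {**second, "tier": 2, "escalated": True}


calls: list[str] = []


def cheap_confident(claim: str) -> Verdict:
    calls.append("tier1")
    return {"label": "supported", "confidence": 0.99}


def cheap_unsure(claim: str) -> Verdict:
    calls.append("tier1")
    return {"label": "supported", "confidence": 0.5}


def strong(claim: str) -> Verdict:
    calls.append("tier2")
    return {"label": "contradicted", "confidence": 0.95}


accepted = route("The sky is blue", tier1=cheap_confident, tier2=strong, threshold=0.9)
calls_after_accept = list(calls)
escalated = route("The sky is blue", tier1=cheap_unsure, tier2=strong, threshold=0.9)
```

The first call stops after Tier 1; the second runs both judges and returns the Tier-2 label, which is the same pair of outcomes the unit tests in this patch assert on.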

diff --git a/src/orc/directives/research/skills/modes/__init__.py b/src/orc/directives/research/skills/modes/__init__.py
new file mode 100644
index 0000000..5f8175a
--- /dev/null
+++ b/src/orc/directives/research/skills/modes/__init__.py
@@ -0,0 +1 @@
+"""verify_claim meta-strategies that orchestrate sub-verifications."""
diff --git a/src/orc/directives/research/skills/modes/tiered.py b/src/orc/directives/research/skills/modes/tiered.py
new file mode 100644
index 0000000..5880015
--- /dev/null
+++ b/src/orc/directives/research/skills/modes/tiered.py
@@ -0,0 +1,88 @@
+"""Tiered verification: a cheap pass first, an expensive one only when needed.
+
+Tier 1 is a cheap binary judge on every claim. When its confidence clears the
+calibrated escalation threshold, that verdict ships. Otherwise the claim
+escalates to Tier 2 — a stronger evidence-mode judge, optionally a different
+model *family* (set `top_judge_model` to e.g. a GPT/Gemini/Llama model via
+OpenRouter) so the escalation judge doesn't share Tier 1's blind spots.
+
+The threshold comes from `orc eval calibrate` (tuned on the gold set, never
+guessed). With no policy, a conservative default threshold is used and a
+warning fires so the operator knows tiering is uncalibrated.
+
+`run_tiered` takes the skill instance (`self`) and calls `self.run(...)` for
+each tier — the same pattern decomposed/arithmetic use — so this module never
+imports verify_claim and there is no import cycle."""
+
+from __future__ import annotations
+
+import warnings
+from typing import Any
+
+from orc.eval.policy import load_policy
+
+_DEFAULT_TIER1_MODEL = "claude-haiku-4-5"
+_DEFAULT_TIER2_MODEL = "claude-sonnet-4-6"
+_DEFAULT_THRESHOLD = 0.9
+
+
+def run_tiered(
+    *,
+    self: Any,
+    workspace: Any,
+    run: Any,
+    claim: str,
+    model: str | None,  # unused: each tier's model comes from the policy
+    k: int,
+    retrieval_pool: int,
+    max_tokens: int,
+    client: Any,
+    corpus_version: int | None,
+    evidence_id: str | None,
+) -> dict[str, Any]:
+    policy = load_policy(workspace.name)
+    if policy is None:
+        warnings.warn(
+            f"workspace {workspace.name!r} is not calibrated for tiered verify; "
+            "run `orc eval calibrate` to tune the threshold. Using default "
+            f"{_DEFAULT_THRESHOLD}.",
+            UserWarning,
+            stacklevel=2,
+        )
+        tier1_model = _DEFAULT_TIER1_MODEL
+        tier2_model = _DEFAULT_TIER2_MODEL
+        top_judge = None
+        threshold = _DEFAULT_THRESHOLD
+    else:
+        tier1_model = policy.tier1_model
+        tier2_model = policy.tier2_model
+        top_judge = policy.top_judge_model
+        threshold = policy.escalation_threshold
+
+    # Tier 1: cheap binary judge on every claim.
+    tier1 = self.run(
+        workspace=workspace, run=run, claim=claim, mode="binary",
+        model=tier1_model, k=k, retrieval_pool=retrieval_pool, max_tokens=max_tokens,
+        client=client, corpus_version=corpus_version, evidence_id=evidence_id,
+    )
+    if tier1["confidence"] >= threshold:
+        run.record("tiered", {
+            "tier": 1, "escalated": False, "threshold": threshold,
+            "tier1_confidence": tier1["confidence"], "tier1_model": tier1_model,
+        })
+        return {**tier1, "tier": 1, "escalated": False}
+
+    # Tier 2: stronger judge (optionally cross-family) decides.
+    tier2_judge = top_judge or tier2_model
+    tier2 = self.run(
+        workspace=workspace, run=run, claim=claim, mode="evidence",
+        model=tier2_judge, k=k, retrieval_pool=retrieval_pool, max_tokens=max_tokens,
+        client=client, corpus_version=corpus_version, evidence_id=evidence_id,
+    )
+    run.record("tiered", {
+        "tier": 2, "escalated": True, "threshold": threshold,
+        "tier1_confidence": tier1["confidence"], "tier1_label": tier1["label"],
+        "tier1_model": tier1_model, "tier2_model": tier2_judge,
+        "reason": "tier1_confidence_below_threshold",
+    })
+    return {**tier2, "tier": 2, "escalated": True}
diff --git a/src/orc/directives/research/skills/verify_claim.py b/src/orc/directives/research/skills/verify_claim.py
index 8345e3f..7a52c28 100644
--- a/src/orc/directives/research/skills/verify_claim.py
+++ b/src/orc/directives/research/skills/verify_claim.py
@@ -321,9 +321,29 @@ def run(
             raise ValueError("claim must be a non-empty string")
         if mode is None:
             mode = route_to_mode(domain) or "evidence"
-        if mode not in {"evidence", "judgment", "binary", "decomposed", "arithmetic"}:
+        if mode not in {"evidence", "judgment", "binary", "decomposed", "arithmetic", "tiered"}:
             raise ValueError(f"unknown verify mode: {mode!r}")
 
+        # Tiered mode is a meta-strategy: a cheap Tier-1 judge on every claim,
+        # escalating to an expensive (optionally cross-family) Tier-2 only below
+        # the calibrated threshold. Like decomposed, it delegates via self.run.
+        if mode == "tiered":
+            from orc.directives.research.skills.modes.tiered import run_tiered
+
+            return run_tiered(
+                self=self,
+                workspace=workspace,
+                run=run,
+                claim=claim,
+                model=model,
+                k=k,
+                retrieval_pool=retrieval_pool,
+                max_tokens=max_tokens,
+                client=client,
+                corpus_version=corpus_version,
+                evidence_id=evidence_id,
+            )
+
         # Decomposed mode is a meta-strategy: it decomposes the claim then
         # delegates each atom to a binary verify. Handle it before the regular
         # retrieval/LLM path.
diff --git a/tests/unit/test_tiered_verify.py b/tests/unit/test_tiered_verify.py
new file mode 100644
index 0000000..f8abf15
--- /dev/null
+++ b/tests/unit/test_tiered_verify.py
@@ -0,0 +1,97 @@
+"""tiered_verify: cheap Tier-1 pass, escalate to Tier-2 below a threshold."""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+import pytest
+
+from orc import directives
+from orc.eval.policy import save_policy
+from orc.ingest.pipeline import ingest as do_ingest
+from orc.llm import client as client_module
+from orc.paths import workspace_db_path
+from orc.runs import open_run
+from orc.storage import workspace as ws_module
+from orc.storage.db import open_connection
+from tests._fake_llm import FakeAnthropic, FakeContentBlock, FakeResponse, make_verdict_response
+
+
+def _binary(*, faithful: bool, confidence: float) -> FakeResponse:
+    return FakeResponse(
+        content=[
+            FakeContentBlock(
+                type="tool_use",
+                name="record_binary_verdict",
+                input={"faithful": faithful, "confidence": confidence, "reasoning": "r"},
+            )
+        ]
+    )
+
+
+def _setup(orc_home: Path, tmp_path: Path) -> tuple[str, str, int]:
+    ws = ws_module.create("demo")
+    corpus = tmp_path / "corpus"
+    corpus.mkdir()
+    (corpus / "doc.md").write_text("# Doc\n\nThe sky is blue on a clear day.\n")
+    do_ingest(ws, str(corpus))
+    with open_connection(workspace_db_path("demo")) as conn:
+        cid = conn.execute("SELECT chunk_id FROM chunk ORDER BY seq LIMIT 1").fetchone()["chunk_id"]
+    return ws.name, cid, ws_module.resolve("demo").corpus_version
+
+
+def _run(name: str, **kwargs) -> dict:
+    ws = ws_module.resolve(name)
+    skill = directives.get("research").skills["verify_claim"]
+    with open_run(ws, directive="research", skill="verify_claim", inputs={}) as run:
+        result = skill.run(workspace=ws, run=run, **kwargs)
+        run.close(output=result)
+    return result
+
+
+def test_tier1_accepts_above_threshold_without_escalating(orc_home, tmp_path, monkeypatch) -> None:
+    name, _cid, cv = _setup(orc_home, tmp_path)
+    save_policy(name, tier1_model="claude-haiku-4-5", tier2_model="claude-sonnet-4-6",
+                top_judge_model=None, escalation_threshold=0.9, target=0.95,
+                calibrated_against_eval_id=None, n_gold=10)
+    # Tier-1 binary returns high confidence -> accept, no Tier-2 call.
+    fake = FakeAnthropic(responses=[_binary(faithful=True, confidence=0.99)])
+    monkeypatch.setattr(client_module, "_client", fake)
+    monkeypatch.setattr(client_module, "_factory", None)
+
+    result = _run(name, claim="The sky is blue", mode="tiered", corpus_version=cv)
+    assert result["label"] == "supported"
+    assert result["tier"] == 1
+    assert result["escalated"] is False
+    assert len(fake.calls) == 1  # only Tier-1 ran
+
+
+def test_low_confidence_escalates_to_tier2(orc_home, tmp_path, monkeypatch) -> None:
+    name, cid, cv = _setup(orc_home, tmp_path)
+    save_policy(name, tier1_model="claude-haiku-4-5", tier2_model="claude-sonnet-4-6",
+                top_judge_model=None, escalation_threshold=0.9, target=0.95,
+                calibrated_against_eval_id=None, n_gold=10)
+    # Tier-1 low confidence -> escalate; Tier-2 evidence verdict decides.
+    fake = FakeAnthropic(responses=[
+        _binary(faithful=True, confidence=0.5),
+        make_verdict_response(label="contradicted", confidence=0.95, contradicting_chunk_ids=[cid]),
+    ])
+    monkeypatch.setattr(client_module, "_client", fake)
+    monkeypatch.setattr(client_module, "_factory", None)
+
+    result = _run(name, claim="The sky is blue", mode="tiered", corpus_version=cv)
+    assert result["tier"] == 2
+    assert result["escalated"] is True
+    assert result["label"] == "contradicted"
+    assert len(fake.calls) == 2
+
+
+def test_uncalibrated_workspace_warns_and_uses_default(orc_home, tmp_path, monkeypatch) -> None:
+    name, _cid, cv = _setup(orc_home, tmp_path)  # no save_policy
+    fake = FakeAnthropic(responses=[_binary(faithful=True, confidence=0.99)])
+    monkeypatch.setattr(client_module, "_client", fake)
+    monkeypatch.setattr(client_module, "_factory", None)
+
+    with pytest.warns(UserWarning, match="not calibrated"):
+        result = _run(name, claim="The sky is blue", mode="tiered", corpus_version=cv)
+    assert result["tier"] == 1  # default threshold still routes

From 424fc9b3f54ca443cb4fd9f1d0343b7ef20fd2c1 Mon Sep 17 00:00:00 2001
From: Thormatt <thormatt@gmail.com>
Date: Fri, 12 Jun 2026 16:37:51 -0400
Subject: [PATCH 13/15] feat(eval): calibrate threshold sweep + guard
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

sweep_threshold finds the lowest Tier-1 confidence cutoff whose
accepted claims reach the target accuracy — lowest because it accepts
the most at Tier 1 and escalates the fewest. When no cutoff reaches
the target (Tier 1 caps below it) the result is achievable=False with
the best accepted accuracy, so the caller can refuse to write an
always-escalate policy. calibrate() runs the gold set through the
cheap binary judge and sweeps.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---
 src/orc/eval/calibrate.py    | 99 ++++++++++++++++++++++++++++++++++++
 tests/unit/test_calibrate.py | 32 ++++++++++++
 2 files changed, 131 insertions(+)
 create mode 100644 src/orc/eval/calibrate.py
 create mode 100644 tests/unit/test_calibrate.py
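
To make the sweep concrete, here is a worked trace over the same four results used in `test_sweep_finds_lowest_threshold_meeting_target`. The helper `sweep_table` is a hypothetical illustration, not part of the patch: it tabulates accepted accuracy and escalation rate at each candidate cutoff, so the "lowest threshold meeting the target" choice can be read off directly.

```python
from __future__ import annotations


def sweep_table(results: list[tuple[float, bool]]) -> list[tuple[float, float, float]]:
    """Return (threshold, accepted_accuracy, escalation_rate) per distinct confidence."""
    rows = []
    for t in sorted({confidence for confidence, _ in results}):
        # Accepted = claims Tier 1 would ship at this cutoff (never empty here,
        # because t is itself one of the observed confidences).
        accepted = [ok for confidence, ok in results if confidence >= t]
        accuracy = sum(accepted) / len(accepted)
        escalated = sum(1 for confidence, _ in results if confidence < t) / len(results)
        rows.append((t, accuracy, escalated))
    return rows


gold = [(0.99, True), (0.98, True), (0.80, True), (0.79, False)]
table = sweep_table(gold)
# Rows are in ascending threshold order, so the first hit is the lowest cutoff.
lowest_meeting = next((row for row in table if row[1] >= 0.95), None)
```

At a cutoff of 0.79 everything is accepted and accuracy is 0.75; at 0.80 the one wrong item is escalated and accepted accuracy becomes 1.0 with a 25% escalation rate. Higher cutoffs are just as accurate but escalate more, which is why the lowest qualifying threshold is the one kept.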

diff --git a/src/orc/eval/calibrate.py b/src/orc/eval/calibrate.py
new file mode 100644
index 0000000..f7f0230
--- /dev/null
+++ b/src/orc/eval/calibrate.py
@@ -0,0 +1,99 @@
+"""Derive a tiered escalation threshold from the gold set.
+
+Runs every gold claim through Tier 1 (the cheap binary judge), then sweeps the
+confidence cutoff to find the *lowest* threshold at which Tier-1-accepted claims
+still reach the target accuracy — lowest because that accepts the most at Tier 1
+and escalates the fewest. If no cutoff reaches the target (Tier 1 caps below it),
+the result is reported as unachievable rather than silently writing a policy
+that escalates everything; the caller surfaces that and the achievable maximum."""
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+
+from orc import directives
+from orc.eval import gold as gold_store
+from orc.metrics.calibration import ConfidenceResult
+from orc.runs import open_run
+from orc.storage import workspace as ws_module
+
+DEFAULT_TIER1_MODEL = "claude-haiku-4-5"
+DEFAULT_TIER2_MODEL = "claude-sonnet-4-6"
+
+
+@dataclass(frozen=True)
+class CalibrationResult:
+    achievable: bool
+    threshold: float
+    escalation_rate: float
+    accepted_accuracy: float
+    max_accuracy: float
+    n: int
+
+
+def sweep_threshold(results: list[ConfidenceResult], *, target: float) -> CalibrationResult:
+    n = len(results)
+    if n == 0:
+        return CalibrationResult(False, 1.0, 0.0, 0.0, 0.0, 0)
+
+    thresholds = sorted({r.confidence for r in results})
+    max_accuracy = 0.0
+    max_at = thresholds[-1]
+    meeting: tuple[float, float, float] | None = None
+
+    for t in thresholds:  # ascending: first to meet target is the lowest
+        accepted = [r for r in results if r.confidence >= t]
+        if not accepted:
+            continue
+        acc = sum(1 for r in accepted if r.correct) / len(accepted)
+        esc = sum(1 for r in results if r.confidence < t) / n
+        if acc > max_accuracy:
+            max_accuracy = acc
+            max_at = t
+        if acc >= target and meeting is None:
+            meeting = (t, acc, esc)
+
+    if meeting is not None:
+        t, acc, esc = meeting
+        return CalibrationResult(True, t, esc, acc, max_accuracy, n)
+
+    esc_at_max = sum(1 for r in results if r.confidence < max_at) / n
+    return CalibrationResult(False, max_at, esc_at_max, max_accuracy, max_accuracy, n)
+
+
+def _tier1_results(workspace, *, tier1_model: str) -> list[ConfidenceResult]:
+    """Run every gold claim through the cheap binary judge and score it.
+
+    Binary collapses to grounded/ungrounded, so a "supported" verdict is correct
+    when the gold label is "supported"; anything else is correct when the gold
+    label is not "supported"."""
+    ws = ws_module.resolve(workspace)
+    items = gold_store.list_gold(ws.name)
+    skill = directives.get("research").skills["verify_claim"]
+    out: list[ConfidenceResult] = []
+    for g in items:
+        with open_run(ws, directive="research", skill="verify_claim", inputs={"claim": g.claim}) as run:
+            result = skill.run(
+                workspace=ws, run=run, claim=g.claim, mode="binary",
+                model=tier1_model, corpus_version=g.corpus_version,
+            )
+            run.close(output=result)
+        predicted_supported = result["label"] == "supported"
+        expected_supported = g.expected_label == "supported"
+        out.append(
+            ConfidenceResult(
+                confidence=float(result["confidence"]),
+                correct=predicted_supported == expected_supported,
+            )
+        )
+    return out
+
+
+def calibrate(
+    workspace: str,
+    *,
+    target: float = 0.95,
+    tier1_model: str = DEFAULT_TIER1_MODEL,
+) -> CalibrationResult:
+    results = _tier1_results(workspace, tier1_model=tier1_model)
+    return sweep_threshold(results, target=target)
diff --git a/tests/unit/test_calibrate.py b/tests/unit/test_calibrate.py
new file mode 100644
index 0000000..d8d91d3
--- /dev/null
+++ b/tests/unit/test_calibrate.py
@@ -0,0 +1,32 @@
+from orc.eval.calibrate import sweep_threshold
+from orc.metrics.calibration import ConfidenceResult
+
+
+def test_sweep_finds_lowest_threshold_meeting_target() -> None:
+    # Lowest threshold that still hits the accuracy target accepts the most at
+    # Tier 1 (minimal escalation). At >=0.80 accepted accuracy is 1.0.
+    results = [
+        ConfidenceResult(0.99, True),
+        ConfidenceResult(0.98, True),
+        ConfidenceResult(0.80, True),
+        ConfidenceResult(0.79, False),
+    ]
+    r = sweep_threshold(results, target=0.95)
+    assert r.achievable is True
+    assert r.threshold == 0.80
+    assert r.accepted_accuracy == 1.0
+    assert r.escalation_rate == 0.25  # only the 0.79 item falls below 0.80
+
+
+def test_sweep_reports_unachievable_target() -> None:
+    # The top-confidence item is wrong, so no cutoff reaches 0.95.
+    results = [ConfidenceResult(0.99, False), ConfidenceResult(0.98, True)]
+    r = sweep_threshold(results, target=0.95)
+    assert r.achievable is False
+    assert r.max_accuracy == 0.5  # best accepted subset accuracy
+
+
+def test_sweep_empty_results_is_unachievable() -> None:
+    r = sweep_threshold([], target=0.95)
+    assert r.achievable is False
+    assert r.escalation_rate == 0.0

From c0db3a6166e9b7203668ce6c3c16ae245b12c178 Mon Sep 17 00:00:00 2001
From: Thormatt <thormatt@gmail.com>
Date: Fri, 12 Jun 2026 16:38:54 -0400
Subject: [PATCH 14/15] feat(cli): orc eval calibrate -> tiered_policy

Closes the eval<->tiering loop: runs the gold set through Tier 1,
sweeps for the lowest threshold meeting --target (default 0.95),
writes the tiered_policy that tiered_verify reads, and reports the
escalation rate so the cost is visible immediately. When the target
is unachievable it says so on stderr (best accuracy + the stored
fallback threshold) rather than silently configuring always-escalate.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---
 src/orc/cli_commands/eval_cmd.py | 55 ++++++++++++++++++++++++++++++++
 tests/unit/test_eval_cli.py      | 34 ++++++++++++++++++++
 2 files changed, 89 insertions(+)

diff --git a/src/orc/cli_commands/eval_cmd.py b/src/orc/cli_commands/eval_cmd.py
index 79d62f9..be7dc69 100644
--- a/src/orc/cli_commands/eval_cmd.py
+++ b/src/orc/cli_commands/eval_cmd.py
@@ -143,6 +143,61 @@ def _report_json(report: object) -> str:
     return json_lib.dumps(d, indent=2)
 
 
+@eval_group.command("calibrate")
+@click.option("--workspace", "-w", default=None)
+@click.option("--target", type=float, default=0.95, show_default=True,
+              help="Required Tier-1-accepted accuracy")
+@click.option("--tier1-model", default=None, help="Cheap Tier-1 judge model")
+@click.option("--tier2-model", default=None, help="Expensive Tier-2 judge model")
+@click.option("--top-judge", default=None,
+              help="Tier-2 model override (e.g. a cross-family judge via OpenRouter)")
+def calibrate_command(
+    workspace: str | None,
+    target: float,
+    tier1_model: str | None,
+    tier2_model: str | None,
+    top_judge: str | None,
+) -> None:
+    """Derive the tiered escalation threshold from the gold set and store it."""
+    from orc.eval.calibrate import DEFAULT_TIER1_MODEL, DEFAULT_TIER2_MODEL, calibrate
+    from orc.eval.policy import save_policy
+
+    ws = _resolve(workspace)
+    t1 = tier1_model or DEFAULT_TIER1_MODEL
+    t2 = tier2_model or DEFAULT_TIER2_MODEL
+    result = calibrate(ws.name, target=target, tier1_model=t1)
+    if result.n == 0:
+        raise click.ClickException(
+            f"{ws.name} has no gold claims to calibrate against — `orc eval import` first"
+        )
+
+    save_policy(
+        ws.name,
+        tier1_model=t1,
+        tier2_model=t2,
+        top_judge_model=top_judge,
+        escalation_threshold=result.threshold,
+        target=target,
+        calibrated_against_eval_id=None,
+        n_gold=result.n,
+    )
+    if result.achievable:
+        click.echo(
+            f"Calibrated on {result.n} gold claim(s): escalate below confidence "
+            f"{result.threshold:.3f} (Tier-1 accepts {1 - result.escalation_rate:.0%}, "
+            f"escalates {result.escalation_rate:.0%}; accepted accuracy "
+            f"{result.accepted_accuracy:.3f})."
+        )
+    else:
+        click.echo(
+            f"Tier 1 cannot reach {target:.2f} accuracy at any cutoff on this gold "
+            f"set (max {result.max_accuracy:.2f}). Stored threshold "
+            f"{result.threshold:.3f} (escalates {result.escalation_rate:.0%}); "
+            f"lower --target or improve the gold set.",
+            err=True,
+        )
+
+
 @eval_group.command("gold")
 @click.argument("action", type=click.Choice(["list"]))
 @click.option("--workspace", "-w", default=None)
diff --git a/tests/unit/test_eval_cli.py b/tests/unit/test_eval_cli.py
index 2e77535..76f04e9 100644
--- a/tests/unit/test_eval_cli.py
+++ b/tests/unit/test_eval_cli.py
@@ -113,3 +113,37 @@ def test_eval_run_with_no_gold_fails_cleanly(orc_home) -> None:
     res = CliRunner().invoke(main, ["eval", "run", "-w", "demo"])
     assert res.exit_code != 0
     assert "gold" in res.output.lower()
+
+
+def test_eval_calibrate_writes_policy_and_tiered_reads_it(orc_home, tmp_path, monkeypatch) -> None:
+    from orc.eval.policy import load_policy
+
+    ws = ws_module.create("demo")
+    corpus = tmp_path / "corpus"
+    corpus.mkdir()
+    (corpus / "doc.md").write_text("# Doc\n\nThe sky is blue on a clear day.\n")
+    do_ingest(ws, str(corpus))
+    cv = ws_module.resolve("demo").corpus_version
+    gold.add("demo", claim="The sky is blue", expected_label="supported", corpus_version=cv, source="import")
+    gold.add("demo", claim="The grass is blue", expected_label="not_found", corpus_version=cv, source="import")
+
+    from tests._fake_llm import FakeContentBlock, FakeResponse
+
+    def _binary(faithful, confidence):
+        return FakeResponse(content=[FakeContentBlock(
+            type="tool_use", name="record_binary_verdict",
+            input={"faithful": faithful, "confidence": confidence, "reasoning": "r"})])
+
+    # Tier-1 binary: claim 1 faithful@0.97 (correct), claim 2 unfaithful@0.96 (correct).
+    fake = FakeAnthropic(responses=[_binary(True, 0.97), _binary(False, 0.96)])
+    monkeypatch.setattr(client_module, "_client", fake)
+    monkeypatch.setattr(client_module, "_factory", None)
+
+    res = CliRunner().invoke(main, ["eval", "calibrate", "-w", "demo", "--target", "0.95"])
+    assert res.exit_code == 0, res.output
+    assert "Calibrated" in res.output
+
+    policy = load_policy("demo")
+    assert policy is not None
+    assert policy.target == 0.95
+    assert 0.0 < policy.escalation_threshold <= 0.97

From 40c60f19fdc2dca95ef34f83b88a899de79d7218 Mon Sep 17 00:00:00 2001
From: Thormatt <thormatt@gmail.com>
Date: Fri, 12 Jun 2026 16:40:59 -0400
Subject: [PATCH 15/15] docs: document orc eval + tiered verification

README commands + an honesty note that orc eval measures the
unsupported-claims coverage row against the user's own labelled gold
set (and still cannot measure faithful-but-wrong corpus content);
CHANGELOG Unreleased entries; the same one-line caveat mirrored into
the competitive and EU AI Act docs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---
 CHANGELOG.md                    | 11 +++++++++++
 README.md                       |  7 +++++++
 docs/compliance/eu-ai-act.md    |  4 ++++
 docs/positioning/competitive.md |  4 ++++
 4 files changed, 26 insertions(+)
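
Reviewer note (not part of the commit): the CHANGELOG entry below mentions
"reliability bins + ECE". For readers unfamiliar with the metric, here is a
minimal sketch of the standard expected calibration error over equal-width
confidence bins. The function name and the choice of 10 bins are assumptions
made for illustration; orc's metrics library may bin differently.

```python
def expected_calibration_error(scored, n_bins=10):
    """Standard ECE: weighted mean of |accuracy - mean confidence| per bin.

    `scored` is a list of (confidence, correct) pairs with confidence in
    [0, 1]. Bins are equal-width; a confidence of exactly 1.0 falls in the
    last bin.
    """
    n = len(scored)
    if n == 0:
        raise ValueError("no scored claims")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in scored:
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece
```

A judge that reports 0.5 confidence and is right half the time scores 0.0
(perfectly calibrated on that bin); one that reports 0.95 and is always right
scores 0.05, since it is slightly underconfident.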

diff --git a/CHANGELOG.md b/CHANGELOG.md
index da15434..85b3e9f 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -17,6 +17,17 @@ Version numbers follow [SemVer](https://semver.org/spec/v2.0.0.html).
   CLI (the approval queue's producer surface); `orc approve list --json`.
 - **`orc report <run_id>...`** — render trace(s) into a self-contained HTML
   artifact reusing the trace design language.
+- **Gold set + `orc eval`** — measure the gate on a user-owned labelled gold
+  set: judge accuracy, confidence calibration (reliability bins + ECE), and
+  retrieval recall. `orc eval import` seeds from YAML; `orc eval label`
+  promotes/corrects a real verdict into gold (frozen to its corpus version).
+- **Tiered verification** (`verify --mode tiered`) — a cheap Tier-1 binary
+  judge on every claim, escalating to a stronger (optionally cross-family)
+  Tier-2 judge only below a calibrated confidence threshold; the deciding
+  tier and reason are recorded in the trace.
+- **`orc eval calibrate`** — derive the tiered escalation threshold from the
+  gold set (lowest cutoff meeting `--target`, default 0.95), with an
+  achievability guard that refuses to silently configure always-escalate.
 
 ### Planned
 
diff --git a/README.md b/README.md
index c6899cb..8371691 100644
--- a/README.md
+++ b/README.md
@@ -29,6 +29,8 @@ Orc's guarantee is **"every claim is traceable to a cited source"** — not "eve
 | **Unsupported claims** — the model says `supported` when the cited evidence doesn't actually back the claim | **Caught partially.** This is an LLM-judge decision, with LLM-judge limits — the faithfulness benchmark (F1 0.864) is the measured error rate, not a guarantee. |
 | **Faithful-but-wrong** — the corpus itself is wrong, stale, or poisoned, and the claim cites it faithfully | **Not caught.** Orc verifies against your corpus, not against the world. Mitigate with corpus provenance and freshness controls: ingest only sources you trust (sha256 + source path are recorded automatically) and re-verify with `orc replay --live` after corpus updates. |
 
+Don't take the partial-coverage row on faith: **`orc eval`** measures judge accuracy, confidence calibration, and retrieval recall against *your own* labelled gold set, so you can quantify how well the gate performs on your corpus instead of trusting a headline number. Like the gate itself, it cannot detect faithful-but-wrong corpus content; no gold set can.
+
 Built for **research analysts, editorial teams, legal & compliance, agentic-workflow engineers** — anyone whose AI work product has to survive a second reviewer six months later.
 
 ## Quickstart
@@ -80,6 +82,11 @@ orc verify --file <path>               extract + verify every claim in a draft
 orc verify --url <url>                 same, from a URL
 orc research "<topic>" [-w <name>]     corpus-grounded synthesis with citations
 orc report <run_id>... [-o out.html]   render trace(s) as a shareable HTML report
+orc verify "<claim>" --mode tiered     cheap judge first, escalate only when unsure
+orc eval import <file.yaml> [-w <n>]   seed a labelled gold set
+orc eval label <run_id> --verdict <v>  promote/correct a real verdict into gold
+orc eval run [-w <name>] [--json]      score the gate (accuracy, calibration, recall)
+orc eval calibrate [-w <name>]         derive the tiered escalation threshold from gold
 orc trace show <run_id>                full trace JSON
 orc trace list [-w <name>]             recent runs
 orc replay <run_id> [--live]           re-execute a recorded run
diff --git a/docs/compliance/eu-ai-act.md b/docs/compliance/eu-ai-act.md
index 17b7eed..5307525 100644
--- a/docs/compliance/eu-ai-act.md
+++ b/docs/compliance/eu-ai-act.md
@@ -271,6 +271,10 @@ Honest framing matters here.
    the corpus is wrong, stale, or poisoned, a claim that cites it faithfully
    will pass. The mitigation is the Article 10 data-governance work above:
    corpus provenance, freshness, and review remain the deployer's obligation.
+   `orc eval` lets a deployer quantify the unsupported-claims coverage on
+   their own labelled gold set (judge accuracy, calibration, retrieval
+   recall) — evidence of accuracy and robustness for Article 15 — but it
+   cannot measure faithful-but-wrong corpus content; no gold set can.
 
 ---
 
diff --git a/docs/positioning/competitive.md b/docs/positioning/competitive.md
index 52c5059..87001ed 100644
--- a/docs/positioning/competitive.md
+++ b/docs/positioning/competitive.md
@@ -268,6 +268,10 @@ Honest gaps, kept current so prospects know what they're buying:
   caught at all. Corpus provenance and freshness controls are the
   mitigation. Post-hoc judges share the same ceiling: they score
   consistency with the provided context, not the truth of the context.
+  `orc eval` measures the unsupported-claims row against a user-owned
+  labelled gold set (judge accuracy, calibration, retrieval recall), so the
+  ceiling is quantified per corpus rather than asserted — but it cannot
+  measure the faithful-but-wrong row, because no gold set can.
 
 ---