Thormatt · Jun 12, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -17,6 +17,17 @@ Version numbers follow [SemVer](https://semver.org/spec/v2.0.0.html).
   CLI (the approval queue's producer surface); `orc approve list --json`.
 - **`orc report <run_id>...`** — render trace(s) into a self-contained HTML
   artifact reusing the trace design language.
+- **Gold set + `orc eval`** — measure the gate on a user-owned labelled gold
+  set: judge accuracy, confidence calibration (reliability bins + ECE), and
+  retrieval recall. `orc eval import` seeds from YAML; `orc eval label`
+  promotes/corrects a real verdict into gold (frozen to its corpus version).
+- **Tiered verification** (`verify --mode tiered`) — a cheap Tier-1 binary
+  judge on every claim, escalating to a stronger (optionally cross-family)
+  Tier-2 judge only below a calibrated confidence threshold; the deciding
+  tier and reason are recorded in the trace.
+- **`orc eval calibrate`** — derive the tiered escalation threshold from the
+  gold set (lowest cutoff meeting `--target`, default 0.95), with an
+  achievability guard that refuses to silently configure always-escalate.
 
 ### Planned
 

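The changelog entry above mentions reliability bins and ECE (expected calibration error) without showing the arithmetic. The following is a minimal sketch of one common way to compute them, assuming equal-width bins and confidences in [0, 1]. The names `Bin`, `reliability_bins`, and `expected_calibration_error` are hypothetical and are not taken from the `orc` codebase; the binning scheme `orc eval` actually uses may differ.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Bin:
    """One non-empty reliability bin (hypothetical structure, not orc's)."""
    lo: float
    hi: float
    count: int
    mean_confidence: float
    accuracy: float


def reliability_bins(
    confidences: Iterable[float], correct: Iterable[bool], n_bins: int = 10
) -> List[Bin]:
    """Group predictions into equal-width confidence bins over [0, 1]."""
    buckets: List[list] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp the index so a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        buckets[idx].append((conf, ok))

    bins: List[Bin] = []
    for i, bucket in enumerate(buckets):
        if not bucket:
            continue  # empty bins contribute nothing to ECE
        n = len(bucket)
        bins.append(
            Bin(
                lo=i / n_bins,
                hi=(i + 1) / n_bins,
                count=n,
                mean_confidence=sum(c for c, _ in bucket) / n,
                accuracy=sum(1 for _, ok in bucket if ok) / n,
            )
        )
    return bins


def expected_calibration_error(bins: List[Bin]) -> float:
    """Count-weighted mean of |accuracy - mean confidence| across bins."""
    total = sum(b.count for b in bins)
    if total == 0:
        return 0.0
    return sum(b.count / total * abs(b.accuracy - b.mean_confidence) for b in bins)
```

A perfectly calibrated judge would score an ECE of 0; a judge that reports 0.65 confidence but is right only half the time in that bin contributes a gap of 0.15, weighted by the share of predictions falling there.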
diff --git a/README.md b/README.md
@@ -29,6 +29,8 @@ Orc's guarantee is **"every claim is traceable to a cited source"** — not "eve
 | **Unsupported claims** — the model says `supported` when the cited evidence doesn't actually back the claim | **Caught partially.** This is an LLM-judge decision, with LLM-judge limits — the faithfulness benchmark (F1 0.864) is the measured error rate, not a guarantee. |
 | **Faithful-but-wrong** — the corpus itself is wrong, stale, or poisoned, and the claim cites it faithfully | **Not caught.** Orc verifies against your corpus, not against the world. Mitigate with corpus provenance and freshness controls: ingest only sources you trust (sha256 + source path are recorded automatically) and re-verify with `orc replay --live` after corpus updates. |
 
+Don't take the partial-coverage row on faith: **`orc eval`** measures judge accuracy, confidence calibration, and retrieval recall against *your own* labelled gold set, so you can quantify how well the gate matches your corpus instead of trusting a headline number. It cannot detect faithful-but-wrong corpus content either — no gold set can.
+
 Built for **research analysts, editorial teams, legal & compliance, agentic-workflow engineers** — anyone whose AI work product has to survive a second reviewer six months later.
 
 ## Quickstart
@@ -80,6 +82,11 @@ orc verify --file <path>               extract + verify every claim in a draft
 orc verify --url <url>                 same, from a URL
 orc research "<topic>" [-w <name>]     corpus-grounded synthesis with citations
 orc report <run_id>... [-o out.html]   render trace(s) as a shareable HTML report
+orc verify "<claim>" --mode tiered     cheap judge first, escalate only when unsure
+orc eval import <file.yaml> [-w <n>]   seed a labelled gold set
+orc eval label <run_id> --verdict <v>  promote/correct a real verdict into gold
+orc eval run [-w <name>] [--json]      score the gate (accuracy, calibration, recall)
+orc eval calibrate [-w <name>]         tune the tiered escalation threshold
 orc trace show <run_id>                full trace JSON
 orc trace list [-w <name>]             recent runs
 orc replay <run_id> [--live]           re-execute a recorded run

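The `orc verify --mode tiered` and `orc eval calibrate` rows above describe an escalation rule and a threshold search. The sketch below illustrates one plausible reading: the threshold is the lowest cutoff at which Tier-1 verdicts kept at or above it reach the target accuracy on the gold set, and the search refuses to return a value when no cutoff qualifies. The functions `calibrate_threshold` and `deciding_tier`, and the `(confidence, was_correct)` sample shape, are assumptions made for illustration and are not the `orc` implementation.

```python
from typing import List, Tuple

# Each sample: (Tier-1 confidence, whether the Tier-1 verdict matched gold).
# This shape is an assumption made for the sketch.
Sample = Tuple[float, bool]


def calibrate_threshold(samples: List[Sample], target: float = 0.95) -> float:
    """Return the lowest cutoff whose kept Tier-1 verdicts meet `target`.

    "Kept" means confidence >= cutoff, i.e. verdicts that would not escalate.
    Raises ValueError when no cutoff meets the target, rather than returning
    a value that would quietly send every claim to Tier 2.
    """
    if not samples:
        raise ValueError("gold set is empty; nothing to calibrate against")

    # Only observed confidences can change which verdicts are kept.
    for cutoff in sorted({conf for conf, _ in samples}):
        kept = [ok for conf, ok in samples if conf >= cutoff]
        accuracy = sum(kept) / len(kept)  # kept is never empty here
        if accuracy >= target:
            return cutoff

    raise ValueError(
        f"no cutoff reaches target accuracy {target}; "
        "refusing to configure always-escalate"
    )


def deciding_tier(tier1_confidence: float, threshold: float) -> str:
    """Escalate to Tier 2 only when Tier-1 confidence falls below threshold."""
    return "tier1" if tier1_confidence >= threshold else "tier2"


if __name__ == "__main__":
    gold = [(0.55, False), (0.70, True), (0.80, False), (0.90, True), (0.95, True)]
    threshold = calibrate_threshold(gold)
    print(threshold, deciding_tier(0.85, threshold), deciding_tier(0.92, threshold))
```

Scanning cutoffs in ascending order keeps as many claims on the cheap tier as the target allows; raising an error on an unreachable target makes the failure visible instead of silently doubling judge cost.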
diff --git a/benchmarks/faithfulness/run.py b/benchmarks/faithfulness/run.py
@@ -113,37 +113,27 @@ def _load_dataset(n: int, source_filter: str | None) -> list[dict[str, Any]]:
 
 
 def _confusion(results: list[ItemResult], binary_attr: str) -> dict[str, int]:
-    tp = fp = tn = fn = 0
-    for r in results:
-        if getattr(r, binary_attr) is None:
-            continue
-        pred = getattr(r, binary_attr)
-        gt = r.ground_truth
-        if pred == "PASS" and gt == "PASS":
-            tp += 1
-        elif pred == "PASS" and gt == "FAIL":
-            fp += 1
-        elif pred == "FAIL" and gt == "FAIL":
-            tn += 1
-        elif pred == "FAIL" and gt == "PASS":
-            fn += 1
-    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}
+    # Thin adapter over orc.metrics.scoring (PASS as the positive class) so the
+    # benchmark and `orc eval` share one implementation. Skips items whose
+    # binary attr is None (errored/unscored), same as before.
+    labeled = [
+        LabeledResult(predicted=getattr(r, binary_attr), expected=r.ground_truth)
+        for r in results
+    ]
+    return _confusion_lib(labeled, positive="PASS")
 
 
 def _scores(cm: dict[str, int]) -> dict[str, float]:
-    """Treat PASS as the positive class. Reviewers may re-score with FAIL-positive."""
-    tp, fp, tn, fn = cm["tp"], cm["fp"], cm["tn"], cm["fn"]
-    n = tp + fp + tn + fn
-    if n == 0:
-        return {"accuracy": 0.0, "precision_pass": 0.0, "recall_pass": 0.0, "f1_pass": 0.0}
-    precision = tp / (tp + fp) if (tp + fp) else 0.0
-    recall = tp / (tp + fn) if (tp + fn) else 0.0
-    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
+    """Treat PASS as the positive class. Reviewers may re-score with FAIL-positive.
+
+    Adapts the shared library's generic keys to this report's `*_pass` keys so
+    the downstream report assembly is untouched."""
+    s = _scores_lib(cm)
     return {
-        "accuracy": (tp + tn) / n,
-        "precision_pass": precision,
-        "recall_pass": recall,
-        "f1_pass": f1,
+        "accuracy": s["accuracy"],
+        "precision_pass": s["precision"],
+        "recall_pass": s["recall"],
+        "f1_pass": s["f1"],
     }
 
 
@@ -187,6 +177,9 @@ def _run_lynx_style_one(item: dict[str, Any], orc_home: Path) -> ItemResult:
 from orc.directives.research.routing import (  # noqa: E402
     BENCHMARK_SOURCE_TO_MODE as SOURCE_TO_MODE,
 )
+from orc.metrics.scoring import LabeledResult  # noqa: E402
+from orc.metrics.scoring import confusion as _confusion_lib  # noqa: E402
+from orc.metrics.scoring import scores as _scores_lib  # noqa: E402
 
 
 def _run_with_mode(item: dict[str, Any], orc_home: Path, mode: str) -> ItemResult:

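The adapter in `benchmarks/faithfulness/run.py` depends on `orc.metrics.scoring` skipping `None` predictions and returning generic `accuracy` / `precision` / `recall` / `f1` keys, but the library itself is not part of this diff. The sketch below reconstructs that contract from the removed benchmark code and the call sites. The names `LabeledResult`, `confusion`, and `scores` match the imports, while the bodies are assumptions. One difference to note: the removed code counted only `"PASS"` and `"FAIL"` labels, whereas this generic version treats any non-positive label as negative.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class LabeledResult:
    predicted: Optional[str]  # None means errored/unscored
    expected: str


def confusion(results: Iterable[LabeledResult], positive: str) -> Dict[str, int]:
    """Binary confusion counts with `positive` as the positive class."""
    tp = fp = tn = fn = 0
    for r in results:
        if r.predicted is None:
            continue  # skip unscored items, as the old benchmark loop did
        pred_pos = r.predicted == positive
        exp_pos = r.expected == positive
        if pred_pos and exp_pos:
            tp += 1
        elif pred_pos and not exp_pos:
            fp += 1
        elif not pred_pos and not exp_pos:
            tn += 1
        else:
            fn += 1
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}


def scores(cm: Dict[str, int]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1; zero when a denominator is empty."""
    tp, fp, tn, fn = cm["tp"], cm["fp"], cm["tn"], cm["fn"]
    n = tp + fp + tn + fn
    if n == 0:
        return {"accuracy": 0.0, "precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = (2 * precision * recall / denom) if denom else 0.0
    return {
        "accuracy": (tp + tn) / n,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

If the real library does not skip `None` predictions, the list comprehension in `_confusion` would need its own `if getattr(r, binary_attr) is not None` filter to preserve the previous behaviour.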
diff --git a/docs/compliance/eu-ai-act.md b/docs/compliance/eu-ai-act.md
@@ -271,6 +271,10 @@ Honest framing matters here.
    the corpus is wrong, stale, or poisoned, a claim that cites it faithfully
    will pass. The mitigation is the Article 10 data-governance work above:
    corpus provenance, freshness, and review remain the deployer's obligation.
+   `orc eval` lets a deployer quantify the unsupported-claims coverage on
+   their own labelled gold set (judge accuracy, calibration, retrieval
+   recall) — evidence of accuracy and robustness for Article 15 — but it
+   cannot measure faithful-but-wrong corpus content; no gold set can.
 
 ---
 

diff --git a/docs/positioning/competitive.md b/docs/positioning/competitive.md
@@ -268,6 +268,10 @@ Honest gaps, kept current so prospects know what they're buying:
   caught at all. Corpus provenance and freshness controls are the
   mitigation. Post-hoc judges share the same ceiling: they score
   consistency with the provided context, not the truth of the context.
+  `orc eval` measures the unsupported-claims row against a user-owned
+  labelled gold set (judge accuracy, calibration, retrieval recall), so the
+  ceiling is quantified per corpus rather than asserted — but it cannot
+  measure the faithful-but-wrong row, because no gold set can.
 
 ---