Merged
11 changes: 11 additions & 0 deletions CHANGELOG.md
@@ -17,6 +17,17 @@ Version numbers follow [SemVer](https://semver.org/spec/v2.0.0.html).
CLI (the approval queue's producer surface); `orc approve list --json`.
- **`orc report <run_id>...`** — render trace(s) into a self-contained HTML
artifact reusing the trace design language.
- **Gold set + `orc eval`** — measure the gate on a user-owned labelled gold
set: judge accuracy, confidence calibration (reliability bins + ECE), and
retrieval recall. `orc eval import` seeds from YAML; `orc eval label`
promotes/corrects a real verdict into gold (frozen to its corpus version).
- **Tiered verification** (`verify --mode tiered`) — a cheap Tier-1 binary
judge on every claim, escalating to a stronger (optionally cross-family)
Tier-2 judge only below a calibrated confidence threshold; the deciding
tier and reason are recorded in the trace.
- **`orc eval calibrate`** — derive the tiered escalation threshold from the
gold set (lowest cutoff meeting `--target`, default 0.95), with an
achievability guard that refuses to silently configure always-escalate.
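
The threshold rule described in the `orc eval calibrate` entry can be sketched roughly as follows. This is a minimal illustration, not Orc's implementation: `GoldItem`, `calibrate_threshold`, and `should_escalate` are hypothetical names, and it assumes "meeting `--target`" means that Tier-1 verdicts at or above the cutoff reach the target accuracy on the gold set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldItem:
    """Hypothetical gold-set row: Tier-1 confidence and whether Tier-1 was right."""

    confidence: float
    tier1_correct: bool


def calibrate_threshold(items: list[GoldItem], target: float = 0.95) -> float:
    """Return the lowest cutoff whose kept (non-escalated) items meet `target`.

    Raises ValueError rather than returning a cutoff that escalates everything.
    """
    if not items:
        raise ValueError("gold set is empty; cannot calibrate")
    # Candidate cutoffs are the observed confidences, tried lowest first.
    for cutoff in sorted({item.confidence for item in items}):
        kept = [item for item in items if item.confidence >= cutoff]
        accuracy = sum(item.tier1_correct for item in kept) / len(kept)
        if accuracy >= target:
            return cutoff
    # Assumed achievability guard: fail loudly instead of always escalating.
    raise ValueError(
        f"no cutoff reaches target accuracy {target}; "
        "refusing to configure always-escalate"
    )


def should_escalate(confidence: float, threshold: float) -> bool:
    """Tier-1 verdicts below the threshold go to the Tier-2 judge."""
    return confidence < threshold


gold = [
    GoldItem(0.55, False),
    GoldItem(0.70, True),
    GoldItem(0.80, False),
    GoldItem(0.90, True),
    GoldItem(0.97, True),
]
threshold = calibrate_threshold(gold, target=0.95)
```

On this toy gold set the lowest cutoff that qualifies is 0.9: every lower cutoff keeps at least one wrong Tier-1 verdict and falls short of the 0.95 target.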

### Planned

7 changes: 7 additions & 0 deletions README.md
@@ -29,6 +29,8 @@ Orc's guarantee is **"every claim is traceable to a cited source"** — not "eve
| **Unsupported claims** — the model says `supported` when the cited evidence doesn't actually back the claim | **Caught partially.** This is an LLM-judge decision, with LLM-judge limits — the faithfulness benchmark (F1 0.864) is the measured error rate, not a guarantee. |
| **Faithful-but-wrong** — the corpus itself is wrong, stale, or poisoned, and the claim cites it faithfully | **Not caught.** Orc verifies against your corpus, not against the world. Mitigate with corpus provenance and freshness controls: ingest only sources you trust (sha256 + source path are recorded automatically) and re-verify with `orc replay --live` after corpus updates. |

Don't take the partial-coverage row on faith: **`orc eval`** measures judge accuracy, confidence calibration, and retrieval recall against *your own* labelled gold set, so you can quantify how well the gate matches your corpus instead of trusting a headline number. It cannot detect faithful-but-wrong corpus content either — no gold set can.
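
For readers unfamiliar with the calibration metric mentioned above, here is a minimal sketch of reliability bins and expected calibration error (ECE). It assumes equal-width confidence bins and a count-weighted gap between mean confidence and accuracy; the function names are illustrative and are not taken from Orc's code.

```python
def reliability_bins(
    confidences: list[float], correct: list[bool], n_bins: int = 10
) -> list[tuple[int, float, float]]:
    """Group predictions into equal-width confidence bins.

    Returns (count, mean_confidence, accuracy) for each non-empty bin.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    buckets: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is clamped into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        buckets[index].append((conf, ok))
    bins = []
    for bucket in buckets:
        if not bucket:
            continue
        count = len(bucket)
        mean_conf = sum(conf for conf, _ in bucket) / count
        accuracy = sum(1 for _, ok in bucket if ok) / count
        bins.append((count, mean_conf, accuracy))
    return bins


def expected_calibration_error(
    confidences: list[float], correct: list[bool], n_bins: int = 10
) -> float:
    """Count-weighted average of |accuracy - mean confidence| over the bins."""
    bins = reliability_bins(confidences, correct, n_bins)
    total = sum(count for count, _, _ in bins)
    if total == 0:
        return 0.0
    return sum(
        count / total * abs(accuracy - mean_conf)
        for count, mean_conf, accuracy in bins
    )


ece = expected_calibration_error(
    confidences=[0.9, 0.9, 0.6, 0.6],
    correct=[True, True, True, False],
)
```

In the toy input both occupied bins are off by 0.1 (confidence 0.9 against accuracy 1.0, and 0.6 against 0.5), so `ece` comes out at approximately 0.1. A well-calibrated judge drives this number toward zero.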

Built for **research analysts, editorial teams, legal & compliance, agentic-workflow engineers** — anyone whose AI work product has to survive a second reviewer six months later.

## Quickstart
@@ -80,6 +82,11 @@ orc verify --file <path> extract + verify every claim in a draft
orc verify --url <url> same, from a URL
orc research "<topic>" [-w <name>] corpus-grounded synthesis with citations
orc report <run_id>... [-o out.html] render trace(s) as a shareable HTML report
orc verify "<claim>" --mode tiered cheap judge first, escalate only when unsure
orc eval import <file.yaml> [-w <n>] seed a labelled gold set
orc eval label <run_id> --verdict <v> promote/correct a real verdict into gold
orc eval run [-w <name>] [--json] score the gate (accuracy, calibration, recall)
orc eval calibrate [-w <name>] tune the tiered escalation threshold
orc trace show <run_id> full trace JSON
orc trace list [-w <name>] recent runs
orc replay <run_id> [--live] re-execute a recorded run
47 changes: 20 additions & 27 deletions benchmarks/faithfulness/run.py
@@ -113,37 +113,27 @@ def _load_dataset(n: int, source_filter: str | None) -> list[dict[str, Any]]:


 def _confusion(results: list[ItemResult], binary_attr: str) -> dict[str, int]:
-    tp = fp = tn = fn = 0
-    for r in results:
-        if getattr(r, binary_attr) is None:
-            continue
-        pred = getattr(r, binary_attr)
-        gt = r.ground_truth
-        if pred == "PASS" and gt == "PASS":
-            tp += 1
-        elif pred == "PASS" and gt == "FAIL":
-            fp += 1
-        elif pred == "FAIL" and gt == "FAIL":
-            tn += 1
-        elif pred == "FAIL" and gt == "PASS":
-            fn += 1
-    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}
+    # Thin adapter over orc.metrics.scoring (PASS as the positive class) so the
+    # benchmark and `orc eval` share one implementation. Skips items whose
+    # binary attr is None (errored/unscored), same as before.
+    labeled = [
+        LabeledResult(predicted=getattr(r, binary_attr), expected=r.ground_truth)
+        for r in results
+    ]
+    return _confusion_lib(labeled, positive="PASS")
 
 
 def _scores(cm: dict[str, int]) -> dict[str, float]:
-    """Treat PASS as the positive class. Reviewers may re-score with FAIL-positive."""
-    tp, fp, tn, fn = cm["tp"], cm["fp"], cm["tn"], cm["fn"]
-    n = tp + fp + tn + fn
-    if n == 0:
-        return {"accuracy": 0.0, "precision_pass": 0.0, "recall_pass": 0.0, "f1_pass": 0.0}
-    precision = tp / (tp + fp) if (tp + fp) else 0.0
-    recall = tp / (tp + fn) if (tp + fn) else 0.0
-    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
+    """Treat PASS as the positive class. Reviewers may re-score with FAIL-positive.
+
+    Adapts the shared library's generic keys to this report's `*_pass` keys so
+    the downstream report assembly is untouched."""
+    s = _scores_lib(cm)
     return {
-        "accuracy": (tp + tn) / n,
-        "precision_pass": precision,
-        "recall_pass": recall,
-        "f1_pass": f1,
+        "accuracy": s["accuracy"],
+        "precision_pass": s["precision"],
+        "recall_pass": s["recall"],
+        "f1_pass": s["f1"],
     }
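
The diff does not show `orc.metrics.scoring` itself. As a rough guide to what the adapters above delegate to, here is a hypothetical stand-in reconstructed from the removed inline code. The names `LabeledResult`, `confusion`, and `scores` and the generic `precision`/`recall`/`f1` keys come from the diff; the bodies are assumptions, including the choice to treat every non-positive label as negative.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledResult:
    """One prediction paired with its ground-truth label (assumed shape)."""

    predicted: str | None
    expected: str


def confusion(labeled: list[LabeledResult], positive: str = "PASS") -> dict[str, int]:
    """Count tp/fp/tn/fn, skipping items whose prediction is None."""
    tp = fp = tn = fn = 0
    for item in labeled:
        if item.predicted is None:
            continue  # errored or unscored item
        predicted_positive = item.predicted == positive
        expected_positive = item.expected == positive
        if predicted_positive and expected_positive:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif expected_positive:
            fn += 1
        else:
            tn += 1
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}


def scores(cm: dict[str, int]) -> dict[str, float]:
    """Accuracy, precision, recall, and F1, with 0.0 for empty denominators."""
    tp, fp, tn, fn = cm["tp"], cm["fp"], cm["tn"], cm["fn"]
    n = tp + fp + tn + fn
    if n == 0:
        return {"accuracy": 0.0, "precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": (tp + tn) / n, "precision": precision, "recall": recall, "f1": f1}


labeled = [
    LabeledResult("PASS", "PASS"),
    LabeledResult("PASS", "FAIL"),
    LabeledResult("FAIL", "FAIL"),
    LabeledResult("FAIL", "PASS"),
    LabeledResult(None, "PASS"),  # skipped: no prediction
]
cm = confusion(labeled, positive="PASS")
result = scores(cm)
```

Under these assumptions the benchmark's `_scores` only has to rename the generic keys to their `*_pass` equivalents, which is what the new adapter does.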


@@ -187,6 +177,9 @@ def _run_lynx_style_one(item: dict[str, Any], orc_home: Path) -> ItemResult:
 from orc.directives.research.routing import (  # noqa: E402
     BENCHMARK_SOURCE_TO_MODE as SOURCE_TO_MODE,
 )
+from orc.metrics.scoring import LabeledResult  # noqa: E402
+from orc.metrics.scoring import confusion as _confusion_lib  # noqa: E402
+from orc.metrics.scoring import scores as _scores_lib  # noqa: E402


def _run_with_mode(item: dict[str, Any], orc_home: Path, mode: str) -> ItemResult:
4 changes: 4 additions & 0 deletions docs/compliance/eu-ai-act.md
@@ -271,6 +271,10 @@ Honest framing matters here.
the corpus is wrong, stale, or poisoned, a claim that cites it faithfully
will pass. The mitigation is the Article 10 data-governance work above:
corpus provenance, freshness, and review remain the deployer's obligation.
`orc eval` lets a deployer quantify the unsupported-claims coverage on
their own labelled gold set (judge accuracy, calibration, retrieval
recall) — evidence of accuracy and robustness for Article 15 — but it
cannot measure faithful-but-wrong corpus content; no gold set can.
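
The source does not define how retrieval recall is computed. One common reading, sketched here under that assumption, is the fraction of gold-labelled evidence sources that appear among the retrieved results; `retrieval_recall` and its parameters are hypothetical names.

```python
from __future__ import annotations


def retrieval_recall(
    retrieved_ids: list[str], gold_ids: set[str], k: int | None = None
) -> float:
    """Fraction of gold evidence ids found in the (optionally top-k) retrieved list."""
    if not gold_ids:
        raise ValueError("gold_ids must not be empty")
    # Ranked list: keep only the first k results when k is given.
    top = retrieved_ids if k is None else retrieved_ids[:k]
    hits = gold_ids.intersection(top)
    return len(hits) / len(gold_ids)


full_recall = retrieval_recall(["a", "b", "c"], {"a", "c", "d"})
top1_recall = retrieval_recall(["a", "b", "c"], {"a", "c", "d"}, k=1)
```

Here two of the three gold sources are retrieved overall, and only one of them within the top result, so the per-corpus number depends on the cutoff as well as the retriever.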

---

4 changes: 4 additions & 0 deletions docs/positioning/competitive.md
@@ -268,6 +268,10 @@ Honest gaps, kept current so prospects know what they're buying:
caught at all. Corpus provenance and freshness controls are the
mitigation. Post-hoc judges share the same ceiling: they score
consistency with the provided context, not the truth of the context.
`orc eval` measures the unsupported-claims row against a user-owned
labelled gold set (judge accuracy, calibration, retrieval recall), so the
ceiling is quantified per corpus rather than asserted — but it cannot
measure the faithful-but-wrong row, because no gold set can.

---
