Merged
30 changes: 30 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,30 @@
name: CI

on:
  push:
    branches: [main]
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        # Matches requires-python >=3.11 and the advertised classifiers.
        python-version: ["3.11", "3.12", "3.13"]
    steps:
      - uses: actions/checkout@v4

      - name: Install uv
        uses: astral-sh/setup-uv@v5
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install dependencies
        run: uv sync --extra dev

      - name: Run tests
        run: uv run pytest -q

      - name: Lint
        run: uv run ruff check src tests
55 changes: 55 additions & 0 deletions .github/workflows/release.yml
@@ -0,0 +1,55 @@
# Publishes to PyPI via Trusted Publishing (OIDC) — no API token is stored
# in this repo. One-time setup on PyPI before the first tagged release:
#
#   1. Create (or claim) the "orc-ai" project on https://pypi.org.
#   2. Under the project's Publishing settings, add a Trusted Publisher:
#        owner:       Thormatt
#        repository:  orc
#        workflow:    release.yml
#        environment: pypi
#   3. In this GitHub repo, create an environment named "pypi"
#      (Settings → Environments) — optionally with required reviewers.
#
# Then `git tag v0.2.0 && git push --tags` publishes automatically.
name: Release

on:
  push:
    tags: ["v*"]

jobs:
  publish:
    runs-on: ubuntu-latest
    environment: pypi
    permissions:
      # Required for PyPI Trusted Publishing (OIDC token exchange).
      id-token: write
    steps:
      - uses: actions/checkout@v4

      - name: Install uv
        uses: astral-sh/setup-uv@v5
        with:
          python-version: "3.12"

      - name: Check tag matches pyproject version
        # Tagging v0.3.0 on a 0.2.0 pyproject would otherwise silently
        # publish the wrong version.
        run: |
          PYPROJECT_VERSION=$(uv run python -c "import tomllib; print(tomllib.load(open('pyproject.toml','rb'))['project']['version'])")
          TAG_VERSION="${GITHUB_REF_NAME#v}"
          if [ "$PYPROJECT_VERSION" != "$TAG_VERSION" ]; then
            echo "Tag $GITHUB_REF_NAME does not match pyproject version $PYPROJECT_VERSION" >&2
            exit 1
          fi

      - name: Run tests
        run: |
          uv sync --extra dev
          uv run pytest -q

      - name: Build sdist and wheel
        run: uv build

      - name: Publish to PyPI
        uses: pypa/gh-action-pypi-publish@release/v1
63 changes: 53 additions & 10 deletions CHANGELOG.md
@@ -7,8 +7,39 @@ Version numbers follow [SemVer](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

### Planned

- `gads` directive (Google Ads agentic analysis: lens-based decomposition,
read-only MCP integration, evidence-bound recommendation verification).
- `orc eval consistency|perturb|retrieval|regression` reliability commands.
- Voyage-AI or local-`sentence-transformers` embeddings + hybrid retrieval (RRF over BM25 + vector).
- Hosted runtime (scheduled triggers, web dashboard, team workspaces).
- Decomposition + arithmetic combined for DROP-shaped multi-step claims.

## [0.2.0] — 2026-06-11

First PyPI release. The distribution is named **`orc-ai`** — `orc` is taken on
PyPI by an unrelated project — but the import package (`import orc`) and the
CLI command (`orc`) are unchanged.

### Added

- **PDF ingestion** — `orc ingest report.pdf` now works alongside markdown,
text, json, and URLs. Text is extracted page-by-page via `pypdf`, and the
PDF metadata title is used when the body carries no markdown-style heading
(typical for credit memos and contracts). (`src/orc/ingest/loaders.py`)
- **Product domain routing** — `--domain` / `domain=` on `verify_claim` takes
product domains (`general`, `legal`, `clinical`, `biomedical`, `financial`,
`numeric`), each mapped to the verify mode that scored best on the benchmark
family the domain generalizes. The HaluBench `source_ds` names stay accepted
as benchmark-only aliases (`BENCHMARK_SOURCE_TO_MODE`) so the published F1
numbers remain reproducible, but dataset names are no longer the product
surface. Unknown domains still raise `UnknownDomainError`.
(`src/orc/directives/research/routing.py`)
- **CI + release workflows** — `.github/workflows/ci.yml` runs `pytest` +
`ruff` on pushes to `main` and on pull requests; `.github/workflows/release.yml`
builds sdist + wheel with uv on `v*` tags and publishes to PyPI via Trusted
Publishing (OIDC, no long-lived token in the repo).
- **Isolated write paths (Phase 1)** — the effect plane that makes the Approval
invariant enforceable rather than aspirational (see
`docs/design/0001-isolated-write-paths.md`):
@@ -36,6 +67,28 @@ Version numbers follow [SemVer](https://semver.org/spec/v2.0.0.html).

### Fixed (hardening)

- **SSRF guard hardened against DNS rebinding** — `load_url` now connects to
the exact IP it vetted (re-pinned on every redirect hop) instead of letting
the HTTP client re-resolve the hostname at request time, closing the
validate-then-connect TOCTOU window a low-TTL DNS record could exploit. A
`transport` injection seam keeps the loader testable without real sockets.
(`src/orc/ingest/loaders.py`)
- **Decomposed-mode negative voting** — atoms run in binary mode, which can
only say faithful or unfaithful; the negative vote now keys off `not_found`
and a negative net aggregates back to `not_found` instead of `contradicted`
— a distinction the atoms never actually made.
(`src/orc/directives/research/skills/verify_claim.py`)
- **Citation guard covers judgment mode** — judgment-mode verdicts pass
through the same hallucinated-chunk-ID filter and no-valid-grounding
downgrade as evidence mode, instead of shipping unguarded citations.
- **UTF-8-exact chunking** — chunk windows are computed at the byte level and
snapped forward to UTF-8 character starts, so a cl100k token boundary that
falls inside a multi-byte character (routine for CJK and emoji) can no
longer corrupt chunk text. (`src/orc/ingest/chunker.py`)
- **Offline guard covers the full credential surface** — the autouse test
fixture strips `ANTHROPIC_API_KEY`, `OPENROUTER_API_KEY`, *and*
`ORC_PROVIDER`, so a developer's shell environment can't leak live LLM
calls into the default suite. (`tests/conftest.py`)
- **Replay determinism** — LLM sampling is now pinned to `temperature=0` at the
`messages_create` chokepoint, so `orc replay` re-issues the recorded decision
rather than a fresh sample. (`src/orc/llm/client.py`)
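
The validate-then-connect fix described in the SSRF entry above can be sketched roughly as follows. This is not the code in `src/orc/ingest/loaders.py`: `is_public_ip`, `resolve_all`, and `pin_host` are hypothetical names, and the injectable `resolver` argument only imitates the idea of a test seam. The point it illustrates is that the hostname is resolved once, every returned address is vetted, and the caller then connects to the returned IP instead of resolving the name again.

```python
import ipaddress
import socket
from typing import Callable


def is_public_ip(ip: str) -> bool:
    """Reject loopback, private, link-local and other non-routable addresses."""
    addr = ipaddress.ip_address(ip)
    return not (
        addr.is_private
        or addr.is_loopback
        or addr.is_link_local
        or addr.is_multicast
        or addr.is_reserved
        or addr.is_unspecified
    )


def resolve_all(host: str) -> list[str]:
    """Resolve a hostname once and return every address it maps to."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    return [info[4][0] for info in infos]


def pin_host(host: str, resolver: Callable[[str], list[str]] = resolve_all) -> str:
    """Vet every resolved address and return the one IP the caller must use."""
    ips = resolver(host)
    if not ips:
        raise ValueError(f"{host!r} did not resolve")
    for ip in ips:
        if not is_public_ip(ip):
            raise ValueError(f"{host!r} resolves to non-public address {ip}")
    # The caller connects to this exact IP; resolving the name a second time
    # would reopen the window a low-TTL DNS record can exploit.
    return ips[0]


# A fake resolver keeps the example offline.
print(pin_host("example.test", lambda host: ["93.184.216.34"]))
```

In a design like this the same vetting would need to run again for each redirect target, since a redirect can point at a different host.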
@@ -60,16 +113,6 @@ Version numbers follow [SemVer](https://semver.org/spec/v2.0.0.html).
- README invariants reworded to match what the code enforces (approval-queue
isolation flagged as roadmap, not yet implemented).

-### Planned
-
-- `gads` directive (Google Ads agentic analysis: lens-based decomposition,
-  read-only MCP integration, evidence-bound recommendation verification).
-- `orc eval consistency|perturb|retrieval|regression` reliability commands.
-- Voyage-AI or local-`sentence-transformers` embeddings + hybrid retrieval (RRF over BM25 + vector).
-- PDF ingestion.
-- Hosted runtime (scheduled triggers, web dashboard, team workspaces).
-- Decomposition + arithmetic combined for DROP-shaped multi-step claims.

## [0.1.4] — 2026-05-19

### Added
23 changes: 12 additions & 11 deletions README.md
@@ -27,8 +27,8 @@ Built for **research analysts, editorial teams, legal & compliance, agentic-work
# Install
uv pip install git+https://github.com/Thormatt/orc

-# Or, once published to PyPI:
-# uv pip install orc
+# Or, once published to PyPI (the CLI command and import name stay `orc`):
+# uv pip install orc-ai

# Set up credentials (either of these works; OpenRouter takes priority if both set)
export ANTHROPIC_API_KEY=sk-ant-...
@@ -63,7 +63,7 @@ claude mcp add orc -- uv run --directory $(pwd) orc mcp serve
```
orc workspace create <name> create a new workspace
orc workspace list list workspaces
-orc ingest <path-or-url> [-w <name>] add evidence (md, txt, urls)
+orc ingest <path-or-url> [-w <name>] add evidence (md, txt, json, pdf, urls)
orc search "<query>" [-w <name>] BM25 retrieval, no LLM
orc verify "<claim>" [-w <name>] verify a single claim
orc verify --file <path> extract + verify every claim in a draft
@@ -111,7 +111,7 @@ A `.env` file in the repo root or at `$ORC_HOME/.env` is auto-loaded. Shell-expo

## Project status

-`v0.1.4` — current. Faithfulness benchmark headline (HaluBench, stratified 504-item subsample, source-aware routing):
+`v0.2.0` — current. Faithfulness benchmark headline (HaluBench, stratified 504-item subsample, source-aware routing; measured on v0.1.4, runtime unchanged since):

| Metric | Score |
|---|---:|
@@ -122,13 +122,14 @@ A `.env` file in the repo root or at `$ORC_HOME/.env` is auto-loaded. Shell-expo

> **0.864 is competitive with Patronus AI's Lynx-70B published home-court F1 of 0.85** — not a same-set head-to-head: orc's number comes from a stratified 504-item HaluBench subsample, with source-aware routing tuned on that same subsample, while Lynx reported on the full benchmark. It is achieved with a general-purpose Claude Sonnet 4.6 call (no fine-tuning) plus a safe arithmetic evaluator the model can invoke for numeric claims. Orc additionally produces chunk-level citations, deterministic replay against a frozen corpus snapshot, audit-export bundles that can be self-contained (`--include-evidence`), and a multi-approver gate for high-risk verdicts — artifacts the competitive set of post-hoc faithfulness judges does not produce.

-What shipped in this version:
+What shipped in v0.2.0:

-- `domain=` parameter on `verify_claim` + `--domain` CLI flag → source-aware routing is a real product feature, not a benchmark variant.
-- `--include-evidence` flag on `orc audit export` → optional self-contained bundles (workspace DB + evidence files included) for offline regulator handoff.
-- `mode="arithmetic"` for numeric claims — multi-turn LLM loop with a safe AST-walking calculator. FinanceBench F1 climbed 0.736 → 0.916.
-- Citation guard: an evidence-mode verdict can no longer ship as `supported` with zero valid citations (downgraded to `not_found` and the dropped IDs land in the trace).
-- Self-hosting any open-weight 70B judge: the runtime is model-agnostic — pass `model="llama-3.3-70b-instruct"` (or even Lynx itself) at any compatible endpoint and every artifact above is unchanged.
+- **PDF ingestion** — `orc ingest report.pdf` (and PDF URLs) extracts text via pypdf, with metadata titles, owner-locked-PDF handling, and loud rejection of scanned/image-only files (OCR not yet supported).
+- **Product domain routing** — `domain=` now takes real domains (`general`, `legal`, `clinical`, `biomedical`, `financial`, `numeric`); the HaluBench source names stay accepted as benchmark-only aliases so published numbers remain reproducible.
+- **Hardening from a full code review** — SSRF guard now pins the validated IP against DNS rebinding, decomposed mode can vote against a claim, the citation guard covers judgment mode, chunking is UTF-8-exact for CJK/emoji corpora.
+- **PyPI packaging as `orc-ai`** (the name `orc` was taken; CLI command and import name remain `orc`), plus CI and tag-triggered release workflows.

Shipped earlier in v0.1.4: `--include-evidence` self-contained audit bundles, `mode="arithmetic"` with a safe AST-walking calculator (FinanceBench F1 0.736 → 0.916), the evidence-mode citation guard, and model-agnostic self-hosting of any open-weight judge.
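
The UTF-8-exact chunking mentioned in the hardening bullet relies on one property of the encoding: continuation bytes always match the bit pattern `10xxxxxx`, so a cut that lands inside a character can be moved forward until it reaches a byte that starts a character. The following is a minimal sketch of that idea, assuming fixed-size byte windows in place of the cl100k token windows the project uses; `snap_forward` and `chunk_bytes` are hypothetical names, not the functions in `src/orc/ingest/chunker.py`.

```python
def snap_forward(data: bytes, offset: int) -> int:
    """Move offset forward to the next UTF-8 character start (or len(data))."""
    # Continuation bytes have the bit pattern 10xxxxxx.
    while offset < len(data) and (data[offset] & 0xC0) == 0x80:
        offset += 1
    return offset


def chunk_bytes(text: str, window: int) -> list[str]:
    """Split text into chunks of roughly `window` bytes without cutting a character."""
    if window < 1:
        raise ValueError("window must be at least 1 byte")
    data = text.encode("utf-8")
    chunks: list[str] = []
    start = 0
    while start < len(data):
        # Propose a cut, then snap it forward so it never lands mid-character.
        end = snap_forward(data, min(start + window, len(data)))
        chunks.append(data[start:end].decode("utf-8"))
        start = end
    return chunks


print(chunk_bytes("日本語🙂ab", 4))  # ['日本', '語🙂', 'ab']
```

Because every cut sits on a character start, each chunk decodes cleanly and joining the chunks reproduces the input exactly.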

Live walkthrough: **[pagenta.app/p/thorm/orc-how-it-works](https://pagenta.app/p/thorm/orc-how-it-works)** — six-scene visual explainer. Full pitch: **[pagenta.app/p/thorm/orc-pitch](https://pagenta.app/p/thorm/orc-pitch)**.

@@ -153,7 +154,7 @@ Live LLM tests are gated behind `ORC_TEST_ALLOW_LIVE_LLM=1` and require a real A
## Roadmap

- Embedding-based retrieval (hybrid BM25 + vector via `sqlite-vec`)
-- PDF ingestion
+- OCR for scanned/image-only PDFs
- Long-running directives (scheduled triggers, cloud execution)
- `marketing` directive (assisted-only at first, autonomous behind approval gates later)
- `legal` / `gads` / `code-review` directives — same runtime, new skill packages
4 changes: 3 additions & 1 deletion benchmarks/faithfulness/run.py
@@ -184,7 +184,9 @@ def _run_lynx_style_one(item: dict[str, Any], orc_home: Path) -> ItemResult:
# subsample. Prose-heavy sources where corpus citations help → evidence mode.
# Single-passage numeric/extraction tasks → binary mode. Mixed natural-language
# Q+A → judgment mode.
-from orc.directives.research.routing import DOMAIN_TO_MODE as SOURCE_TO_MODE  # noqa: E402
+from orc.directives.research.routing import (  # noqa: E402
+    BENCHMARK_SOURCE_TO_MODE as SOURCE_TO_MODE,
+)


def _run_with_mode(item: dict[str, Any], orc_home: Path, mode: str) -> ItemResult:
7 changes: 5 additions & 2 deletions pyproject.toml
@@ -2,9 +2,11 @@
requires = ["hatchling"]
build-backend = "hatchling.build"

# Distribution name is "orc-ai" — "orc" is taken on PyPI by an unrelated
# project. The import package stays `orc` and the CLI command stays `orc`.
[project]
-name = "orc"
-version = "0.1.4"
+name = "orc-ai"
+version = "0.2.0"
description = "The verification runtime for AI that has to be defensible. Evidence-bound claim verification, structured citations, trace + replay, MCP-ready CLI."
readme = "README.md"
requires-python = ">=3.11"
@@ -45,6 +47,7 @@ dependencies = [
"rich>=13.0",
"python-ulid>=2.0",
"python-dotenv>=1.0",
"pypdf>=4.0",
]

[project.optional-dependencies]
2 changes: 1 addition & 1 deletion src/orc/__init__.py
@@ -1 +1 @@
-__version__ = "0.1.4"
+__version__ = "0.2.0"
2 changes: 1 addition & 1 deletion src/orc/cli_commands/verify.py
@@ -43,7 +43,7 @@
@click.option(
"--domain",
default=None,
-help="Route mode by domain hint (e.g. 'pubmedQA', 'DROP', 'FinanceBench')",
+help="Route mode by domain hint (e.g. 'financial', 'clinical', 'legal')",
)
@click.option("--yes", "-y", is_flag=True, help="Skip the confirmation prompt for batch verify")
@click.option("--json", "as_json", is_flag=True, help="Emit raw JSON instead of formatted output")
64 changes: 47 additions & 17 deletions src/orc/directives/research/routing.py
@@ -1,10 +1,12 @@
"""Domain → verify-mode routing.

-Callers can pass `domain="pubmedQA"` (or any other registered domain) to
-`verify_claim` and the runtime picks the best mode empirically — derived from
-the per-source-ds F1 breakdown in the HaluBench benchmark. The benchmark's
-`SOURCE_TO_MODE` is now a thin import from this dict so the runtime and the
-benchmark routing can never drift.
+Callers pass a product domain (`domain="clinical"`, `domain="financial"`, ...)
+to `verify_claim` and the runtime picks the verify mode that performed best on
+the benchmark family that domain generalizes — derived from the per-source-ds
+F1 breakdown in the HaluBench benchmark. The HaluBench `source_ds` names stay
+accepted as benchmark aliases (`BENCHMARK_SOURCE_TO_MODE`) so the published
+benchmark numbers remain reproducible, but the product surface is the domain
+map: dataset names are benchmark artifacts, not domains a customer has.

In production this lives behind a workspace tag, a manifest hint, or an
explicit `--domain` flag on the verify call. Unknown domains raise rather than
@@ -18,12 +20,36 @@


 class UnknownDomainError(OrcError):
-    """Raised when a caller passes a domain not present in DOMAIN_TO_MODE."""
+    """Raised when a caller passes a domain that is neither a product domain
+    (DOMAIN_TO_MODE) nor a benchmark source alias (BENCHMARK_SOURCE_TO_MODE)."""


-# Empirically derived from per-source-ds F1 on the HaluBench 504-item stratified
-# subsample. See docs/benchmarks/results-2026-05-19-source-routed.md.
+# Product domains. Each mode is derived from the benchmark family the domain
+# generalizes — per-source-ds F1 on the HaluBench 504-item stratified
+# subsample (docs/benchmarks/results-2026-05-19-source-routed.md).
DOMAIN_TO_MODE: dict[str, str] = {
# RAGTruth / covidQA family: prose-heavy retrieval QA where chunk-level
# citations carry the verdict.
"general": "evidence",
# No benchmark evidence for legal yet. Evidence mode is the deliberate
# default because chunk-level citations matter most in legal review.
"legal": "evidence",
# pubmedQA family: yes/no verdicts over a single passage.
"clinical": "binary",
# Alias of clinical — same pubmedQA family.
"biomedical": "binary",
# FinanceBench family: claims that hinge on derived numbers.
"financial": "arithmetic",
# DROP family: reading comprehension over numeric/tabular passages where
# the answer is a single extracted or computed value.
"numeric": "binary",
}

# HaluBench source_ds names, pinned exactly as published. The benchmark's
# SOURCE_TO_MODE imports this dict, so reproducibility of the published F1
# numbers cannot drift as product domains evolve. Do not edit without a
# benchmark re-run (docs/benchmarks/results-2026-05-19-source-routed.md).
BENCHMARK_SOURCE_TO_MODE: dict[str, str] = {
"covidQA": "evidence",
"RAGTruth": "evidence",
"halueval": "judgment",
@@ -36,16 +62,20 @@ class UnknownDomainError(OrcError):
 def route_to_mode(domain: str | None) -> str | None:
     """Return the routed mode for `domain`, or None if `domain` is None.

-    Raises UnknownDomainError when `domain` is a string not in DOMAIN_TO_MODE.
-    Callers must validate at their surface; we don't silently fall through to
-    a default — that would mask config typos and make replay non-deterministic.
+    Product domains resolve first; HaluBench source_ds names are accepted as
+    benchmark aliases so existing callers and published numbers keep working.
+    Raises UnknownDomainError otherwise — we don't silently fall through to a
+    default; that would mask config typos and make replay non-deterministic.
     """
     if domain is None:
         return None
-    try:
+    if domain in DOMAIN_TO_MODE:
         return DOMAIN_TO_MODE[domain]
-    except KeyError as exc:
-        known = sorted(DOMAIN_TO_MODE.keys())
-        raise UnknownDomainError(
-            f"unknown domain {domain!r}; known: {known}"
-        ) from exc
+    if domain in BENCHMARK_SOURCE_TO_MODE:
+        return BENCHMARK_SOURCE_TO_MODE[domain]
+    domains = sorted(DOMAIN_TO_MODE)
+    aliases = sorted(BENCHMARK_SOURCE_TO_MODE)
+    raise UnknownDomainError(
+        f"unknown domain {domain!r}; domains: {domains} "
+        f"(benchmark source aliases also accepted: {aliases})"
+    )