Thormatt · Jun 12, 2026
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -0,0 +1,30 @@
+name: CI
+
+on:
+  push:
+    branches: [main]
+  pull_request:
+
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        # Matches requires-python >=3.11 and the advertised classifiers.
+        python-version: ["3.11", "3.12", "3.13"]
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+        with:
+          python-version: ${{ matrix.python-version }}
+
+      - name: Install dependencies
+        run: uv sync --extra dev
+
+      - name: Run tests
+        run: uv run pytest -q
+
+      - name: Lint
+        run: uv run ruff check src tests
diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml
@@ -0,0 +1,55 @@
+# Publishes to PyPI via Trusted Publishing (OIDC) — no API token is stored
+# in this repo. One-time setup on PyPI before the first tagged release:
+#
+#   1. Create (or claim) the "orc-ai" project on https://pypi.org.
+#   2. Under the project's Publishing settings, add a Trusted Publisher:
+#        owner:       Thormatt
+#        repository:  orc
+#        workflow:    release.yml
+#        environment: pypi
+#   3. In this GitHub repo, create an environment named "pypi"
+#      (Settings → Environments) — optionally with required reviewers.
+#
+# Then `git tag v0.2.0 && git push origin v0.2.0` publishes automatically.
+name: Release
+
+on:
+  push:
+    tags: ["v*"]
+
+jobs:
+  publish:
+    runs-on: ubuntu-latest
+    environment: pypi
+    permissions:
+      # Required for PyPI Trusted Publishing (OIDC token exchange).
+      id-token: write
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+        with:
+          python-version: "3.12"
+
+      - name: Check tag matches pyproject version
+        # Tagging v0.3.0 on a 0.2.0 pyproject would otherwise silently
+        # publish the wrong version.
+        run: |
+          PYPROJECT_VERSION=$(uv run python -c "import tomllib; print(tomllib.load(open('pyproject.toml','rb'))['project']['version'])")
+          TAG_VERSION="${GITHUB_REF_NAME#v}"
+          if [ "$PYPROJECT_VERSION" != "$TAG_VERSION" ]; then
+            echo "Tag $GITHUB_REF_NAME does not match pyproject version $PYPROJECT_VERSION" >&2
+            exit 1
+          fi
+
+      - name: Run tests
+        run: |
+          uv sync --extra dev
+          uv run pytest -q
+
+      - name: Build sdist and wheel
+        run: uv build
+
+      - name: Publish to PyPI
+        uses: pypa/gh-action-pypi-publish@release/v1
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -7,8 +7,39 @@ Version numbers follow [SemVer](https://semver.org/spec/v2.0.0.html).
 
 ## [Unreleased]
 
+### Planned
+
+- `gads` directive (Google Ads agentic analysis: lens-based decomposition,
+  read-only MCP integration, evidence-bound recommendation verification).
+- `orc eval consistency|perturb|retrieval|regression` reliability commands.
+- Voyage-AI or local-`sentence-transformers` embeddings + hybrid retrieval (RRF over BM25 + vector).
+- Hosted runtime (scheduled triggers, web dashboard, team workspaces).
+- Decomposition + arithmetic combined for DROP-shaped multi-step claims.
+
+## [0.2.0] — 2026-06-11
+
+First PyPI release. The distribution is named **`orc-ai`** — `orc` is taken on
+PyPI by an unrelated project — but the import package (`import orc`) and the
+CLI command (`orc`) are unchanged.
+
 ### Added
 
+- **PDF ingestion** — `orc ingest report.pdf` now works alongside markdown,
+  text, json, and URLs. Text is extracted page-by-page via `pypdf`, and the
+  PDF metadata title is used when the body carries no markdown-style heading
+  (typical for credit memos and contracts). (`src/orc/ingest/loaders.py`)
+- **Product domain routing** — `--domain` / `domain=` on `verify_claim` takes
+  product domains (`general`, `legal`, `clinical`, `biomedical`, `financial`,
+  `numeric`), each mapped to the verify mode that scored best on the benchmark
+  family the domain generalizes. The HaluBench `source_ds` names stay accepted
+  as benchmark-only aliases (`BENCHMARK_SOURCE_TO_MODE`) so the published F1
+  numbers remain reproducible, but dataset names are no longer the product
+  surface. Unknown domains still raise `UnknownDomainError`.
+  (`src/orc/directives/research/routing.py`)
+- **CI + release workflows** — `.github/workflows/ci.yml` runs `pytest` +
+  `ruff` on pushes to `main` and on pull requests; `.github/workflows/release.yml`
+  builds sdist + wheel with uv on `v*` tags and publishes to PyPI via Trusted
+  Publishing (OIDC, no long-lived token in the repo).
 - **Isolated write paths (Phase 1)** — the effect plane that makes the Approval
   invariant enforceable rather than aspirational (see
   `docs/design/0001-isolated-write-paths.md`):
@@ -36,6 +67,28 @@ Version numbers follow [SemVer](https://semver.org/spec/v2.0.0.html).
 
 ### Fixed (hardening)
 
+- **SSRF guard hardened against DNS rebinding** — `load_url` now connects to
+  the exact IP it vetted (re-pinned on every redirect hop) instead of letting
+  the HTTP client re-resolve the hostname at request time, closing the
+  validate-then-connect TOCTOU window a low-TTL DNS record could exploit. A
+  `transport` injection seam keeps the loader testable without real sockets.
+  (`src/orc/ingest/loaders.py`)
+- **Decomposed-mode negative voting** — atoms run in binary mode, which can
+  only say faithful or unfaithful, so the negative vote now keys off
+  `not_found`, and a net-negative tally aggregates to `not_found` rather than
+  `contradicted`, a distinction the atoms never actually made.
+  (`src/orc/directives/research/skills/verify_claim.py`)
+- **Citation guard covers judgment mode** — judgment-mode verdicts pass
+  through the same hallucinated-chunk-ID filter and no-valid-grounding
+  downgrade as evidence mode, instead of shipping unguarded citations.
+- **UTF-8-exact chunking** — chunk windows are computed at the byte level and
+  snapped forward to UTF-8 character starts, so a cl100k token boundary that
+  falls inside a multi-byte character (routine for CJK and emoji) can no
+  longer corrupt chunk text. (`src/orc/ingest/chunker.py`)
+- **Offline guard covers the full credential surface** — the autouse test
+  fixture strips `ANTHROPIC_API_KEY`, `OPENROUTER_API_KEY`, *and*
+  `ORC_PROVIDER`, so a developer's shell environment can't trigger live LLM
+  calls in the default suite. (`tests/conftest.py`)
 - **Replay determinism** — LLM sampling is now pinned to `temperature=0` at the
   `messages_create` chokepoint, so `orc replay` re-issues the recorded decision
   rather than a fresh sample. (`src/orc/llm/client.py`)
@@ -60,16 +113,6 @@ Version numbers follow [SemVer](https://semver.org/spec/v2.0.0.html).
 - README invariants reworded to match what the code enforces (approval-queue
   isolation flagged as roadmap, not yet implemented).
 
-### Planned
-
-- `gads` directive (Google Ads agentic analysis: lens-based decomposition,
-  read-only MCP integration, evidence-bound recommendation verification).
-- `orc eval consistency|perturb|retrieval|regression` reliability commands.
-- Voyage-AI or local-`sentence-transformers` embeddings + hybrid retrieval (RRF over BM25 + vector).
-- PDF ingestion.
-- Hosted runtime (scheduled triggers, web dashboard, team workspaces).
-- Decomposition + arithmetic combined for DROP-shaped multi-step claims.
-
 ## [0.1.4] — 2026-05-19
 
 ### Added
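Reviewer note on the "SSRF guard hardened against DNS rebinding" entry: the validate-then-connect idea can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in `src/orc/ingest/loaders.py`. The names `pin_host`, `is_public_ip`, and the injectable `resolve` parameter are hypothetical, and the deny-list below is one plausible policy rather than the project's.

```python
import ipaddress
import socket
from collections.abc import Callable

Resolver = Callable[[str], list[str]]


def system_resolver(hostname: str) -> list[str]:
    """Resolve a hostname to IP strings with the OS resolver."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    return [info[4][0] for info in infos]


def is_public_ip(ip: str) -> bool:
    """Reject loopback, private, link-local, and other non-routable ranges."""
    addr = ipaddress.ip_address(ip)
    return not (
        addr.is_private
        or addr.is_loopback
        or addr.is_link_local
        or addr.is_multicast
        or addr.is_reserved
        or addr.is_unspecified
    )


def pin_host(hostname: str, resolve: Resolver = system_resolver) -> str:
    """Resolve once, vet every answer, and return the one IP to connect to."""
    ips = resolve(hostname)
    if not ips or not all(is_public_ip(ip) for ip in ips):
        raise ValueError(f"refusing to fetch {hostname!r}: resolves to {ips}")
    return ips[0]
```

The caller would then open its connection to the returned IP while still sending the original hostname for the Host header and TLS, and repeat the same pinning on each redirect hop, so the HTTP client never resolves the name a second time.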

diff --git a/README.md b/README.md
@@ -27,8 +27,8 @@ Built for **research analysts, editorial teams, legal & compliance, agentic-work
 # Install
 uv pip install git+https://github.com/Thormatt/orc
 
-# Or, once published to PyPI:
-# uv pip install orc
+# Or, once published to PyPI (the CLI command and import name stay `orc`):
+# uv pip install orc-ai
 
 # Set up credentials (either of these works; OpenRouter takes priority if both set)
 export ANTHROPIC_API_KEY=sk-ant-...
@@ -63,7 +63,7 @@ claude mcp add orc -- uv run --directory $(pwd) orc mcp serve
 ```
 orc workspace create <name>            create a new workspace
 orc workspace list                     list workspaces
-orc ingest <path-or-url> [-w <name>]   add evidence (md, txt, urls)
+orc ingest <path-or-url> [-w <name>]   add evidence (md, txt, json, pdf, urls)
 orc search "<query>" [-w <name>]       BM25 retrieval, no LLM
 orc verify "<claim>" [-w <name>]       verify a single claim
 orc verify --file <path>               extract + verify every claim in a draft
@@ -111,7 +111,7 @@ A `.env` file in the repo root or at `$ORC_HOME/.env` is auto-loaded. Shell-expo
 
 ## Project status
 
-`v0.1.4` — current. Faithfulness benchmark headline (HaluBench, stratified 504-item subsample, source-aware routing):
+`v0.2.0` — current. Faithfulness benchmark headline (HaluBench, stratified 504-item subsample, source-aware routing; measured on v0.1.4, benchmark source routing unchanged since):
 
 | Metric | Score |
 |---|---:|
@@ -122,13 +122,14 @@ A `.env` file in the repo root or at `$ORC_HOME/.env` is auto-loaded. Shell-expo
 
 > **0.864 is competitive with Patronus AI's Lynx-70B published home-court F1 of 0.85** — not a same-set head-to-head: orc's number comes from a stratified 504-item HaluBench subsample, with source-aware routing tuned on that same subsample, while Lynx reported on the full benchmark. It is achieved with a general-purpose Claude Sonnet 4.6 call (no fine-tuning) plus a safe arithmetic evaluator the model can invoke for numeric claims. Orc additionally produces chunk-level citations, deterministic replay against a frozen corpus snapshot, audit-export bundles that can be self-contained (`--include-evidence`), and a multi-approver gate for high-risk verdicts — artifacts the competitive set of post-hoc faithfulness judges does not produce.
 
-What shipped in this version:
+What shipped in v0.2.0:
 
-- `domain=` parameter on `verify_claim` + `--domain` CLI flag → source-aware routing is a real product feature, not a benchmark variant.
-- `--include-evidence` flag on `orc audit export` → optional self-contained bundles (workspace DB + evidence files included) for offline regulator handoff.
-- `mode="arithmetic"` for numeric claims — multi-turn LLM loop with a safe AST-walking calculator. FinanceBench F1 climbed 0.736 → 0.916.
-- Citation guard: an evidence-mode verdict can no longer ship as `supported` with zero valid citations (downgraded to `not_found` and the dropped IDs land in the trace).
-- Self-hosting any open-weight 70B judge: the runtime is model-agnostic — pass `model="llama-3.3-70b-instruct"` (or even Lynx itself) at any compatible endpoint and every artifact above is unchanged.
+- **PDF ingestion** — `orc ingest report.pdf` (and PDF URLs) extracts text via pypdf, with metadata titles, owner-locked-PDF handling, and loud rejection of scanned/image-only files (OCR not yet supported).
+- **Product domain routing** — `domain=` now takes real domains (`general`, `legal`, `clinical`, `biomedical`, `financial`, `numeric`); the HaluBench source names stay accepted as benchmark-only aliases so published numbers remain reproducible.
+- **Hardening from a full code review** — SSRF guard now pins the validated IP against DNS rebinding, decomposed mode can vote against a claim, the citation guard covers judgment mode, and chunking is UTF-8-exact for CJK/emoji corpora.
+- **PyPI packaging as `orc-ai`** (the name `orc` was taken; CLI command and import name remain `orc`), plus CI and tag-triggered release workflows.
+
+Shipped earlier in v0.1.4: `--include-evidence` self-contained audit bundles, `mode="arithmetic"` with a safe AST-walking calculator (FinanceBench F1 0.736 → 0.916), the evidence-mode citation guard, and model-agnostic self-hosting of any open-weight judge.
 
 Live walkthrough: **[pagenta.app/p/thorm/orc-how-it-works](https://pagenta.app/p/thorm/orc-how-it-works)** — six-scene visual explainer. Full pitch: **[pagenta.app/p/thorm/orc-pitch](https://pagenta.app/p/thorm/orc-pitch)**.
 
@@ -153,7 +154,7 @@ Live LLM tests are gated behind `ORC_TEST_ALLOW_LIVE_LLM=1` and require a real A
 ## Roadmap
 
 - Embedding-based retrieval (hybrid BM25 + vector via `sqlite-vec`)
-- PDF ingestion
+- OCR for scanned/image-only PDFs
 - Long-running directives (scheduled triggers, cloud execution)
 - `marketing` directive (assisted-only at first, autonomous behind approval gates later)
 - `legal` / `gads` / `code-review` directives — same runtime, new skill packages
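Reviewer note on "chunking is UTF-8-exact": snapping a byte offset forward to a character start relies on the fact that UTF-8 continuation bytes always match the bit pattern `10xxxxxx`. The sketch below shows that technique in isolation. The helper names `snap_to_char_start` and `slice_chunk` are hypothetical and the real `src/orc/ingest/chunker.py` may differ.

```python
def snap_to_char_start(data: bytes, offset: int) -> int:
    """Move offset forward until it is not on a UTF-8 continuation byte."""
    # Continuation bytes have the bit pattern 10xxxxxx.
    while offset < len(data) and (data[offset] & 0xC0) == 0x80:
        offset += 1
    return offset


def slice_chunk(data: bytes, start: int, end: int) -> str:
    """Decode the window [start, end) after snapping both edges forward."""
    start = snap_to_char_start(data, start)
    end = snap_to_char_start(data, end)
    return data[start:end].decode("utf-8")


raw = "日本語 ok".encode("utf-8")
# Offsets 1 and 7 both fall inside three-byte characters.
print(slice_chunk(raw, 1, 7))  # 本語
```

Because both edges snap in the same direction, adjacent windows that share a boundary offset neither overlap nor drop bytes.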

diff --git a/benchmarks/faithfulness/run.py b/benchmarks/faithfulness/run.py
@@ -184,7 +184,9 @@ def _run_lynx_style_one(item: dict[str, Any], orc_home: Path) -> ItemResult:
 # subsample. Prose-heavy sources where corpus citations help → evidence mode.
 # Single-passage numeric/extraction tasks → binary mode. Mixed natural-language
 # Q+A → judgment mode.
-from orc.directives.research.routing import DOMAIN_TO_MODE as SOURCE_TO_MODE  # noqa: E402
+from orc.directives.research.routing import (  # noqa: E402
+    BENCHMARK_SOURCE_TO_MODE as SOURCE_TO_MODE,
+)
 
 
 def _run_with_mode(item: dict[str, Any], orc_home: Path, mode: str) -> ItemResult:

diff --git a/pyproject.toml b/pyproject.toml
@@ -2,9 +2,11 @@
 requires = ["hatchling"]
 build-backend = "hatchling.build"
 
+# Distribution name is "orc-ai" — "orc" is taken on PyPI by an unrelated
+# project. The import package stays `orc` and the CLI command stays `orc`.
 [project]
-name = "orc"
-version = "0.1.4"
+name = "orc-ai"
+version = "0.2.0"
 description = "The verification runtime for AI that has to be defensible. Evidence-bound claim verification, structured citations, trace + replay, MCP-ready CLI."
 readme = "README.md"
 requires-python = ">=3.11"
@@ -45,6 +47,7 @@ dependencies = [
     "rich>=13.0",
     "python-ulid>=2.0",
     "python-dotenv>=1.0",
+    "pypdf>=4.0",
 ]
 
 [project.optional-dependencies]

diff --git a/src/orc/__init__.py b/src/orc/__init__.py
@@ -1 +1 @@
-__version__ = "0.1.4"
+__version__ = "0.2.0"
diff --git a/src/orc/cli_commands/verify.py b/src/orc/cli_commands/verify.py
@@ -43,7 +43,7 @@
 @click.option(
     "--domain",
     default=None,
-    help="Route mode by domain hint (e.g. 'pubmedQA', 'DROP', 'FinanceBench')",
+    help="Route mode by domain hint (e.g. 'financial', 'clinical', 'legal')",
 )
 @click.option("--yes", "-y", is_flag=True, help="Skip the confirmation prompt for batch verify")
 @click.option("--json", "as_json", is_flag=True, help="Emit raw JSON instead of formatted output")

diff --git a/src/orc/directives/research/routing.py b/src/orc/directives/research/routing.py
@@ -1,10 +1,12 @@
 """Domain → verify-mode routing.
 
-Callers can pass `domain="pubmedQA"` (or any other registered domain) to
-`verify_claim` and the runtime picks the best mode empirically — derived from
-the per-source-ds F1 breakdown in the HaluBench benchmark. The benchmark's
-`SOURCE_TO_MODE` is now a thin import from this dict so the runtime and the
-benchmark routing can never drift.
+Callers pass a product domain (`domain="clinical"`, `domain="financial"`, ...)
+to `verify_claim` and the runtime picks the verify mode that performed best on
+the benchmark family that domain generalizes — derived from the per-source-ds
+F1 breakdown in the HaluBench benchmark. The HaluBench `source_ds` names stay
+accepted as benchmark aliases (`BENCHMARK_SOURCE_TO_MODE`) so the published
+benchmark numbers remain reproducible, but the product surface is the domain
+map: dataset names are benchmark artifacts, not domains a customer has.
 
 In production this lives behind a workspace tag, a manifest hint, or an
 explicit `--domain` flag on the verify call. Unknown domains raise rather than
@@ -18,12 +20,36 @@
 
 
 class UnknownDomainError(OrcError):
-    """Raised when a caller passes a domain not present in DOMAIN_TO_MODE."""
+    """Raised when a caller passes a domain that is neither a product domain
+    (DOMAIN_TO_MODE) nor a benchmark source alias (BENCHMARK_SOURCE_TO_MODE)."""
 
 
-# Empirically derived from per-source-ds F1 on the HaluBench 504-item stratified
-# subsample. See docs/benchmarks/results-2026-05-19-source-routed.md.
+# Product domains. Each mode is derived from the benchmark family the domain
+# generalizes — per-source-ds F1 on the HaluBench 504-item stratified
+# subsample (docs/benchmarks/results-2026-05-19-source-routed.md).
 DOMAIN_TO_MODE: dict[str, str] = {
+    # RAGTruth / covidQA family: prose-heavy retrieval QA where chunk-level
+    # citations carry the verdict.
+    "general": "evidence",
+    # No benchmark evidence for legal yet. Evidence mode is the deliberate
+    # default because chunk-level citations matter most in legal review.
+    "legal": "evidence",
+    # pubmedQA family: yes/no verdicts over a single passage.
+    "clinical": "binary",
+    # Alias of clinical — same pubmedQA family.
+    "biomedical": "binary",
+    # FinanceBench family: claims that hinge on derived numbers.
+    "financial": "arithmetic",
+    # DROP family: reading comprehension over numeric/tabular passages where
+    # the answer is a single extracted or computed value.
+    "numeric": "binary",
+}
+
+# HaluBench source_ds names, pinned exactly as published. The benchmark's
+# SOURCE_TO_MODE imports this dict, so reproducibility of the published F1
+# numbers cannot drift as product domains evolve. Do not edit without a
+# benchmark re-run (docs/benchmarks/results-2026-05-19-source-routed.md).
+BENCHMARK_SOURCE_TO_MODE: dict[str, str] = {
     "covidQA": "evidence",
     "RAGTruth": "evidence",
     "halueval": "judgment",
@@ -36,16 +62,20 @@ class UnknownDomainError(OrcError):
 def route_to_mode(domain: str | None) -> str | None:
     """Return the routed mode for `domain`, or None if `domain` is None.
 
-    Raises UnknownDomainError when `domain` is a string not in DOMAIN_TO_MODE.
-    Callers must validate at their surface; we don't silently fall through to
-    a default — that would mask config typos and make replay non-deterministic.
+    Product domains resolve first; HaluBench source_ds names are accepted as
+    benchmark aliases so existing callers and published numbers keep working.
+    Raises UnknownDomainError otherwise — we don't silently fall through to a
+    default; that would mask config typos and make replay non-deterministic.
     """
     if domain is None:
         return None
-    try:
+    if domain in DOMAIN_TO_MODE:
         return DOMAIN_TO_MODE[domain]
-    except KeyError as exc:
-        known = sorted(DOMAIN_TO_MODE.keys())
-        raise UnknownDomainError(
-            f"unknown domain {domain!r}; known: {known}"
-        ) from exc
+    if domain in BENCHMARK_SOURCE_TO_MODE:
+        return BENCHMARK_SOURCE_TO_MODE[domain]
+    domains = sorted(DOMAIN_TO_MODE)
+    aliases = sorted(BENCHMARK_SOURCE_TO_MODE)
+    raise UnknownDomainError(
+        f"unknown domain {domain!r}; domains: {domains} "
+        f"(benchmark source aliases also accepted: {aliases})"
+    )