feat: PDF ingestion, product domain routing, PyPI release prep (orc-ai 0.2.0) #6

Closed
Thormatt wants to merge 4 commits into fix/review-hardening from feat/review-followups

Conversation

@Thormatt

Owner

Context

Follow-ups from the repo review, as requested. Stacked on #5 (review-hardening) — merge that first; GitHub will retarget this PR to main when the base branch is deleted.

Changes

  • PDF ingestion (feat(ingest)): orc ingest report.pdf and PDF URLs now work: pypdf extraction, titles from PDF metadata, owner-locked PDFs opened via decrypt(""), and loud rejection of encrypted, unparseable, or scanned files. All existing SSRF guards sit in front of the PDF branch, unchanged. 8 new tests; fixture PDFs are hand-rolled in-test (no binaries, no network).
  • Product domain routing (feat(routing)): --domain financial|clinical|legal|biomedical|numeric|general replaces HaluBench dataset names as the product surface. Dataset names remain benchmark-only aliases (BENCHMARK_SOURCE_TO_MODE) so published F1 numbers stay reproducible; the benchmark imports that map directly.
  • Release prep (chore(release)): distribution renamed to orc-ai (PyPI name orc is taken; CLI/import stay orc), version 0.2.0 everywhere (pyproject, __version__, README, CHANGELOG). New CI workflow (pytest+ruff, Python 3.11–3.13) and tag-triggered release workflow via PyPI Trusted Publishing, with a tag↔version guard and tests-before-publish.

Testing

  • uv run pytest -q: 294 passed, 2 skipped, ~3s; ruff check: clean
  • uv build produces orc_ai-0.2.0 sdist+wheel; uv run orc --version → 0.2.0
  • Every behavior change TDD'd (RED observed before GREEN); combined diff cross-reviewed by an independent agent, all high/medium findings fixed

Before tagging v0.2.0 (one-time, manual)

  1. Create the orc-ai project on PyPI and add a Trusted Publisher (owner Thormatt, repo orc, workflow release.yml, environment pypi) — steps are in the release.yml header.
  2. Create a GitHub environment named pypi.
  3. Then git tag v0.2.0 && git push --tags publishes.

Known deferrals (flagged in review, intentionally out of scope): magic-bytes PDF sniffing for application/octet-stream URLs, partial-scan heuristics (one text page in an otherwise scanned PDF ingests), OCR.

🤖 Generated with Claude Code

Thormatt and others added 4 commits June 11, 2026 22:36
The #1 product gap from the review: the target users' corpora (credit
memos, clinical summaries, contracts) are PDFs, and ingest rejected
them outright.

orc ingest now accepts .pdf files and application/pdf URLs: per-page
text extraction with pypdf, pages joined with blank lines, PDF
metadata /Title preferred over the heading-scan fallback. Owner-
password-locked PDFs with an empty user password (common for
distributed contracts) open via decrypt(""); truly password-protected,
unparseable, and scanned/image-only PDFs are refused loudly — silently
ingesting an empty corpus would produce confident not_found verdicts
downstream. pypdf internals never leak; callers see ValueError.

All existing URL guards (SSRF pinning, redirect re-validation, size
limit) sit in front of the PDF branch unchanged. The request
User-Agent now derives from orc.__version__ instead of a hardcoded
stale string.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
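
The extraction rules in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the function name `extract_pdf_text` and the error messages are assumptions, and the function takes any object shaped like `pypdf.PdfReader` (`is_encrypted`, `decrypt`, `pages`, `metadata`) so the sketch runs without pypdf installed.

```python
from typing import Any, Optional


def extract_pdf_text(reader: Any) -> tuple[Optional[str], str]:
    """Return (title, text) from a pypdf.PdfReader-like object.

    Hypothetical helper: raises ValueError for password-protected,
    unparseable, or text-free (scanned/image-only) PDFs.
    """
    if reader.is_encrypted:
        # Owner-locked PDFs often have an empty user password. pypdf's
        # decrypt() returns a falsy value (PasswordType.NOT_DECRYPTED == 0)
        # when the password does not open the file.
        if not reader.decrypt(""):
            raise ValueError("PDF is password-protected")

    try:
        # Per-page extraction; extract_text() may return an empty string.
        pages = [(page.extract_text() or "").strip() for page in reader.pages]
    except Exception as exc:
        # Simplification: wrap any parser failure so library internals
        # do not leak to the caller.
        raise ValueError("PDF could not be parsed") from exc

    # Join non-empty pages with blank lines.
    text = "\n\n".join(p for p in pages if p)
    if not text:
        # Refuse loudly rather than ingest an empty corpus.
        raise ValueError("PDF has no extractable text (scanned or image-only?)")

    meta = reader.metadata
    title = (meta.title or None) if meta is not None else None
    return title, text
```

With pypdf installed, a call would look like `extract_pdf_text(pypdf.PdfReader("report.pdf"))`; a `None` title signals that the caller should fall back to its heading scan.
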
domain= previously only accepted HaluBench dataset names (covidQA,
halueval, DROP, ...) — benchmark artifacts promoted to the product
API. No legal team has a claim from domain "halueval".

DOMAIN_TO_MODE is now the product map: general/legal -> evidence,
clinical/biomedical -> binary, financial -> arithmetic, numeric ->
binary, each annotated with the benchmark family it generalizes
(legal honestly marked as having no benchmark evidence yet). The six
HaluBench source names moved unchanged to BENCHMARK_SOURCE_TO_MODE
and still route as aliases, so published F1 numbers stay reproducible
— the benchmark now imports that map directly and cannot drift when
product domains evolve. CLI/MCP help and the verify_claim docstring
teach the product domains instead of dataset names.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
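
A rough sketch of the two-map routing described in the commit above. The `DOMAIN_TO_MODE` entries follow the commit message. Everything else is an assumption for illustration: the function name `resolve_mode` is hypothetical, the alias map lists only the three dataset names the message spells out, and the modes assigned to those aliases are placeholders rather than the project's real values.

```python
# Product domains, as listed in the commit message.
DOMAIN_TO_MODE = {
    "general": "evidence",
    "legal": "evidence",
    "clinical": "binary",
    "biomedical": "binary",
    "financial": "arithmetic",
    "numeric": "binary",
}

# Benchmark-only aliases. Only the names quoted in the commit message
# appear here, and the modes are PLACEHOLDERS chosen for illustration.
BENCHMARK_SOURCE_TO_MODE = {
    "covidQA": "binary",
    "halueval": "evidence",
    "DROP": "arithmetic",
}


def resolve_mode(domain: str) -> str:
    """Map a product domain (or a benchmark alias) to a verification mode."""
    if domain in DOMAIN_TO_MODE:
        return DOMAIN_TO_MODE[domain]
    if domain in BENCHMARK_SOURCE_TO_MODE:
        # Aliases still route, so benchmark runs stay reproducible.
        return BENCHMARK_SOURCE_TO_MODE[domain]
    known = ", ".join(sorted(DOMAIN_TO_MODE))
    raise ValueError(f"unknown domain {domain!r}; expected one of: {known}")
```

Keeping the alias map separate means product domains can be added or remapped without touching the table the benchmark imports.
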
The PyPI name "orc" is taken by an unrelated project; the
distribution becomes orc-ai while the import package and CLI command
stay orc. Version 0.2.0 covers the review-hardening fixes plus PDF
ingestion and product domain routing (see CHANGELOG).

Adds minimal CI (pytest + ruff on a 3.11-3.13 matrix) and a
tag-triggered release workflow using PyPI Trusted Publishing — no
token stored; the one-time publisher setup is documented in the
workflow header. The release job refuses to publish when the tag
doesn't match the pyproject version and runs the suite before
building. README project-status section now attributes features to
the right versions.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Found live: a claim-dense draft hit extract_claims' 1024-token cap,
the forced tool call came back cut off (stop_reason=max_tokens), and
the partial input parsed as zero claims — so verify --file reported
"No claims extracted" and the caller's gate passed vacuously.

Truncation now escalates the budget (1024 -> 4096 -> 16384), each
attempt is recorded in the trace with its stop_reason and a truncated
flag, and extraction that still truncates at the ceiling raises
rather than returning a partial claim list that would let unextracted
claims bypass verification silently.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
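
The escalation behavior in the commit above can be illustrated with a small sketch. The budgets (1024, 4096, 16384) and the `max_tokens` stop reason come from the commit message; the names `Attempt` and `extract_with_escalation`, the injected `call` function, and the choice of `RuntimeError` are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable

BUDGETS = (1024, 4096, 16384)


@dataclass
class Attempt:
    max_tokens: int
    stop_reason: str
    truncated: bool


def extract_with_escalation(
    call: Callable[[int], tuple[list[str], str]],
    trace: list[Attempt],
) -> list[str]:
    """Retry extraction with larger budgets until the output is not truncated.

    `call(max_tokens)` is a stand-in for the model request; it returns
    (claims, stop_reason).
    """
    for budget in BUDGETS:
        claims, stop_reason = call(budget)
        truncated = stop_reason == "max_tokens"
        # Record every attempt, including the failed ones.
        trace.append(Attempt(budget, stop_reason, truncated))
        if not truncated:
            return claims
    # Still truncated at the ceiling: raise instead of returning a partial
    # list that would let unextracted claims skip verification.
    raise RuntimeError(f"claim extraction truncated at {BUDGETS[-1]} tokens")
```

Passing the request in as a function keeps the retry policy testable without a network call, and the trace shows how many escalations a given draft needed.
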