feat: PDF ingestion, product domain routing, PyPI release prep (orc-ai 0.2.0) #6
Closed
Thormatt wants to merge 4 commits into
The #1 product gap from the review: the target users' corpora (credit memos, clinical summaries, contracts) are PDFs, and ingest rejected them outright.

orc ingest now accepts .pdf files and application/pdf URLs: per-page text extraction with pypdf, pages joined with blank lines, PDF metadata /Title preferred over the heading-scan fallback. Owner-password-locked PDFs with an empty user password (common for distributed contracts) open via decrypt(""). Truly password-protected, unparseable, and scanned/image-only PDFs are refused loudly, because silently ingesting an empty corpus would produce confident not_found verdicts downstream. pypdf internals never leak; callers see ValueError.

All existing URL guards (SSRF pinning, redirect re-validation, size limit) sit in front of the PDF branch unchanged. The request User-Agent now derives from orc.__version__ instead of a hardcoded stale string.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
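The extraction path described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the function names `join_pages` and `extract_pdf_text` are hypothetical, dropping empty pages before joining is an assumption, and the pypdf import is done lazily only so the pure helper can be exercised without the dependency installed.

```python
import io
from typing import Iterable, Optional, Tuple


def join_pages(page_texts: Iterable[Optional[str]]) -> str:
    """Join per-page text with blank lines; refuse when nothing is extractable."""
    # Assumption: pages with no text are dropped rather than kept as empty slots.
    pages = [t.strip() for t in page_texts if t and t.strip()]
    if not pages:
        # A scanned or image-only PDF yields no text; fail loudly instead of
        # ingesting an empty corpus.
        raise ValueError("PDF has no extractable text (scanned or image-only?)")
    return "\n\n".join(pages)


def extract_pdf_text(data: bytes) -> Tuple[Optional[str], str]:
    """Return (metadata title or None, joined page text) for raw PDF bytes."""
    # Third-party dependency, imported lazily for this sketch.
    from pypdf import PdfReader
    from pypdf.errors import PdfReadError

    try:
        reader = PdfReader(io.BytesIO(data))
        # Owner-locked PDFs with an empty user password open with "".
        # decrypt() returns a falsy value when the password does not work.
        if reader.is_encrypted and not reader.decrypt(""):
            raise ValueError("PDF is password-protected")
        title = reader.metadata.title if reader.metadata else None
        texts = [page.extract_text() or "" for page in reader.pages]
    except PdfReadError as exc:
        # Callers see ValueError, not pypdf's own exception types.
        raise ValueError(f"could not parse PDF: {exc}") from None
    return title, join_pages(texts)
```

Keeping the empty-text check in a separate helper means the "refuse loudly" rule can be tested without building fixture PDFs.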
domain= previously accepted only HaluBench dataset names (covidQA, halueval, DROP, ...), which are benchmark artifacts promoted to the product API. No legal team has a claim from domain "halueval".

DOMAIN_TO_MODE is now the product map: general/legal -> evidence, clinical/biomedical -> binary, financial -> arithmetic, numeric -> binary, each annotated with the benchmark family it generalizes (legal honestly marked as having no benchmark evidence yet). The six HaluBench source names moved unchanged to BENCHMARK_SOURCE_TO_MODE and still route as aliases, so published F1 numbers stay reproducible: the benchmark now imports that map directly and cannot drift when product domains evolve. CLI/MCP help and the verify_claim docstring teach the product domains instead of dataset names.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
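The routing described here might look roughly like the sketch below. The DOMAIN_TO_MODE entries are taken from the commit message; `resolve_mode`, its `aliases` parameter, and the ValueError on unknown domains are assumptions made for illustration. The contents of BENCHMARK_SOURCE_TO_MODE are not given in this PR text, so the alias map is passed in rather than spelled out.

```python
from typing import Mapping, Optional

# Product domains and the verification mode each routes to (from the commit message).
DOMAIN_TO_MODE = {
    "general": "evidence",
    "legal": "evidence",
    "clinical": "binary",
    "biomedical": "binary",
    "financial": "arithmetic",
    "numeric": "binary",
}


def resolve_mode(domain: str, aliases: Optional[Mapping[str, str]] = None) -> str:
    """Resolve a domain to a mode, checking product domains before aliases."""
    if domain in DOMAIN_TO_MODE:
        return DOMAIN_TO_MODE[domain]
    # Benchmark source names still route, but only as aliases.
    if aliases and domain in aliases:
        return aliases[domain]
    # Assumption: unknown domains are rejected rather than silently defaulted.
    raise ValueError(f"unknown domain: {domain!r}")


# Usage with a clearly hypothetical alias entry:
print(resolve_mode("financial"))
print(resolve_mode("exampleSource", aliases={"exampleSource": "binary"}))
```

Keeping the alias map separate from the product map is what lets the benchmark import its own names directly without being affected when product domains change.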
The PyPI name "orc" is taken by an unrelated project; the distribution becomes orc-ai while the import package and CLI command stay orc. Version 0.2.0 covers the review-hardening fixes plus PDF ingestion and product domain routing (see CHANGELOG).

Adds minimal CI (pytest + ruff on a 3.11-3.13 matrix) and a tag-triggered release workflow using PyPI Trusted Publishing: no token is stored, and the one-time publisher setup is documented in the workflow header. The release job refuses to publish when the tag doesn't match the pyproject version and runs the suite before building. README project-status section now attributes features to the right versions.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Found live: a claim-dense draft hit extract_claims' 1024-token cap, the forced tool call came back cut off (stop_reason=max_tokens), and the partial input parsed as zero claims, so verify --file reported "No claims extracted" and the caller's gate passed vacuously.

Truncation now escalates the budget (1024 -> 4096 -> 16384), each attempt is recorded in the trace with its stop_reason and a truncated flag, and extraction that still truncates at the ceiling raises rather than returning a partial claim list that would let unextracted claims bypass verification silently.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
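The escalation logic above can be illustrated with a small sketch. The budgets and the stop_reason value come from the commit message; everything else is an assumption for illustration: `extract_with_escalation` is a hypothetical name, the model call is injected as a callable returning `(stop_reason, claims)`, and the trace entry keys and the RuntimeError type are invented here.

```python
from typing import Callable, Dict, List, Sequence, Tuple

BUDGETS = (1024, 4096, 16384)  # token budgets tried in order


def extract_with_escalation(
    call_model: Callable[[int], Tuple[str, List[str]]],
    budgets: Sequence[int] = BUDGETS,
) -> Tuple[List[str], List[Dict]]:
    """Retry extraction with larger budgets until the output is not truncated."""
    trace: List[Dict] = []
    for max_tokens in budgets:
        stop_reason, claims = call_model(max_tokens)
        truncated = stop_reason == "max_tokens"
        # Record every attempt, including the ones that were cut off.
        trace.append(
            {"max_tokens": max_tokens, "stop_reason": stop_reason, "truncated": truncated}
        )
        if not truncated:
            return claims, trace
    # Still truncated at the ceiling: raise instead of returning a partial list.
    raise RuntimeError(f"claim extraction truncated at {budgets[-1]} tokens")


# Usage with a stand-in model that needs more than 4096 tokens:
def fake_model(max_tokens: int) -> Tuple[str, List[str]]:
    if max_tokens < 16384:
        return "max_tokens", []
    return "tool_use", ["claim A", "claim B"]


claims, trace = extract_with_escalation(fake_model)
print(claims, [t["truncated"] for t in trace])
```

Raising at the ceiling is the important part: an empty or partial claim list is indistinguishable from "nothing to verify" unless the truncation is surfaced.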
Context
Follow-ups from the repo review, as requested. Stacked on #5 (review-hardening): merge that first; GitHub will retarget this PR to main when the base branch is deleted.

Changes
- feat(ingest): orc ingest report.pdf and PDF URLs work: pypdf extraction, metadata titles, owner-locked PDFs opened via decrypt(""), loud rejection of encrypted/unparseable/scanned files. All SSRF guards unchanged in front. 8 new tests, fixture PDFs hand-rolled in-test (no binaries, no network).
- feat(routing): --domain financial|clinical|legal|biomedical|numeric|general replaces HaluBench dataset names as the product surface. Dataset names remain benchmark-only aliases (BENCHMARK_SOURCE_TO_MODE) so published F1 numbers stay reproducible; the benchmark imports that map directly.
- chore(release): distribution renamed to orc-ai (PyPI name orc is taken; CLI/import stay orc), version 0.2.0 everywhere (pyproject, __version__, README, CHANGELOG). New CI workflow (pytest + ruff, Python 3.11–3.13) and tag-triggered release workflow via PyPI Trusted Publishing, with a tag↔version guard and tests-before-publish.

Testing
- uv run pytest -q: 294 passed, 2 skipped, ~3s; ruff check: clean
- uv build produces orc_ai-0.2.0 sdist+wheel; uv run orc --version → 0.2.0

Before tagging v0.2.0 (one-time, manual)
- PyPI Trusted Publisher setup ([...] Thormatt, repo orc, workflow release.yml, environment pypi); steps are in the release.yml header.
- [...] pypi.
- git tag v0.2.0 && git push --tags publishes.

Known deferrals (flagged in review, intentionally out of scope): magic-bytes PDF sniffing for application/octet-stream URLs, partial-scan heuristics (one text page in an otherwise scanned PDF ingests), OCR.

🤖 Generated with Claude Code