T040: LLMProvider Protocol, request/response types, and error hierarchy#39
T040: LLMProvider Protocol, request/response types, and error hierarchy#39likith1908 wants to merge 15 commits into
Conversation
Core adapter (iris-ocr-paddleocr): - PaddleOCREngine backed by PaddleOCRVL pipeline (v1.6), satisfying contract clauses C-OCR-001 through C-OCR-010 and C-OCR-LOCAL-001 - PDF rasterisation via PyMuPDF at 150 DPI; PNG/JPEG/TIFF accepted directly - Offline mode: IRIS_PADDLEOCR_OFFLINE=1 + IRIS_PADDLEOCR_MODEL_PATH point the VL recognition sub-model at a local HuggingFace snapshot directory; startup raises OCRUnavailable immediately if offline flag is set without path - Result mapping from PaddleOCRVLResult (dict-like, parsing_res_list) to OCRPageResult; VLM returns no per-block confidence so confidence=1.0 - BBox format is axis-aligned [x1,y1,x2,y2] rect, not polygon - converted via _rect_to_bbox with negative-coord clamp and min-size=1 enforcement - 29 unit tests, 96% coverage; live test gated on IRIS_OCR_LIVE_PADDLEOCR=1 Fixes carried forward from code review (Datalab / ADI): - DatalabOCREngine._poll() was an unbounded loop; added max_poll_seconds=300 and deadline check mirroring the ADI pattern (test added) - ocr-adi/tests/__init__.py and ocr-datalab/tests/__init__.py deleted; with importlib mode these caused all three adapter test_unit modules to resolve to the same module name, running ADI tests for every adapter - All three adapter live test files now carry pytest.mark.slow alongside their skipif guard so intent is explicit when running the full suite Workspace changes: - Root pyproject.toml: iris_ocr_paddleocr added to coverage sources, mypy overrides, importlinter; ruff exclude added for PaddleOCR-VL-1.6-snapshot - .gitignore: PaddleOCR-VL-1.6-snapshot/ added (1.8 GB model, local-only) - .env.example: PaddleOCR offline mode vars and model download instructions - huggingface_hub>=0.23 added to ocr-paddleocr dependencies
extract() was holding the event loop during PyMuPDF rasterisation and pipeline.predict() (up to ~15s per doc). Wrapped both _to_images and the per-page inference loop in asyncio.to_thread so concurrent requests are not frozen during OCR. Also flipped T035 to [x] in tasks.md and added the offline/airgap caveat about PP-DocLayoutV3 and doc-preprocessor sub-models requiring ~/.paddlex pre-seeding for a true airgap.
paddlepaddle was already resolved as a transitive dep via paddleocr/paddlex but was not declared explicitly. A partial-sync of the venv (e.g. during rebase operations) could leave it missing until a full uv sync --all-packages ran. Declaring it explicitly ensures it is always present after any uv sync.
Core adapter (iris-ocr-local): - TesseractEngine backed by pytesseract wrapping the system tesseract binary; satisfies C-OCR-001 through C-OCR-010 and C-OCR-LOCAL-001 (inherent - no network calls at any point) - PDF rasterisation via PyMuPDF at 150 DPI, same pattern as iris-ocr-paddleocr - image_to_data() provides per-word bboxes in pixels and confidence [0-100]; adapter normalises confidence to [0.0-1.0], averages per page - Words grouped by block_num; blocks joined with double newline in markdown - Rows with conf == -1 (layout rows) and empty text stripped before mapping - IRIS_TESSERACT_CMD env var overrides the binary path (Windows / custom paths) - 28 unit tests, all contract clauses mocked via _pytesseract injection seam - Live test gated on IRIS_OCR_LIVE_LOCAL=1; result: 182 ms, confidence 0.96, 3 bboxes, markdown "IRIS live test" Adapter READMEs (T039 partial - all four OCR adapters): - ocr-adi/README.md: Azure setup, env vars, live test command, limitations - ocr-datalab/README.md: API key setup, polling flow, bbox/confidence notes - ocr-paddleocr/README.md: model download, GPU install, offline mode, limits - ocr-local/README.md: platform install commands, IRIS_TESSERACT_CMD, live test Makefile: - install: cross-platform tesseract check via shutil.which after uv sync; prints per-platform install instructions if binary is absent - distclean (Linux + Windows): note that system packages are not auto-removed Workspace: - iris_ocr_local added to coverage sources in root pyproject.toml - tasks.md: T035 marked complete
README files for adi, datalab, local, and paddleocr are kept untracked locally. They will be added in the T039 PR which covers all adapter documentation together.
extract() was blocking the event loop during PyMuPDF rasterisation and pytesseract.image_to_data() subprocess calls. Wrapped both in asyncio.to_thread so concurrent requests are not frozen during OCR. Also flipped T036 to [x] in tasks.md.
Correctness: re-apply tesseract_cmd per inference call to prevent global state clobber between concurrent engines; wrap per-page get_pixmap and Image.frombytes in try/except OCRMalformedDocument; replace TIFF while-True frame loop with ImageSequence.Iterator; sort blocks by block_num for correct reading order on multi-column documents; move _map_result inside _run_page try/except so KeyError raises OCRUnavailable not a raw exception. Infrastructure: add missing --cov flags for all four adapter packages to make test-cov. Tests: strengthen C-OCR-002 to assert full five-word phrase and total_pages; add per-page content assertions to test_c003; add test_binary_not_found_raises_unavailable. Documentation: note C-OCR-011 deferral to T038 in module docstring.
pyproject.toml [tool.coverage.run] source already lists all packages; duplicate flags in the Makefile were a second source of truth that would silently drift when a new adapter is added. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
tests/contract/test_ocr_contract.py: 66 tests + 1 skipped placeholder. C-OCR-001 through C-OCR-010 and C-OCR-LOCAL-001 verified against all five adapters (in-memory, adi, datalab, paddleocr, local) with mocked externals. C-OCR-007 excludes in-memory: documented gap from PR#32 - fake engine returns pre-registered results and cannot detect malformed PDFs without a parser. C-OCR-011 skipped pending T038 (OTEL spans).
- C-OCR-LOCAL-001: add monkeypatch socket.socket.connect guard so the test fails loudly if a local adapter ever opens a network connection, not just when extraction returns pages. - Lines 123 and 258: remove doubled # pragma: allowlist secret comment.
Adds iris_engine.ocr.tracing with instrument_extract() async context manager and log_extract_success(). Every adapter's extract() is wrapped: span name ocr.extract, attributes ocr.adapter_id, ocr.tenant_id, ocr.product_slug, ocr.document_id, ocr.content_type, ocr.total_pages, ocr.total_latency_ms, ocr.success. OCRError subtypes recorded with ocr.error_category; unexpected exceptions re-raised with StatusCode.ERROR. InMemoryOCREngine also wrapped so C-OCR-011 covers all 5 adapters. conftest.py wires an InMemorySpanExporter at module level; span_exporter fixture clears it per test. C-OCR-011 parametrised across all 5 adapters. opentelemetry-api>=1.20 added to iris-engine deps; opentelemetry-sdk>=1.20 added to root dev deps. mypy ignore_missing_imports added for opentelemetry (no upstream type stubs).
Four new README files - one per OCR adapter - documenting prerequisites, environment variables, running tests, how each adapter works, and known limitations. PaddleOCR README includes a full airgapped deployment section covering PADDLEX_HOME and PADDLE_PDX_DISABLE_MODEL_SOURCE_CHECK for environments without internet access.
… (T040) Introduces iris_engine/contracts/llm_provider.py with the LLMProvider Protocol, LLMRequest, LLMResponse, LLMUsage dataclasses, and the full typed error hierarchy (LLMUnavailable, LLMRateLimited, LLMAuthenticationFailed, LLMSchemaViolation, LLMContextWindowExceeded, LLMContentFiltered, LLMInvalidRequest). No adapter imports; the module is importable in a bare iris-engine install.
anmolg1997
left a comment
There was a problem hiding this comment.
Approved — substance is a clean, faithful port of the contract. One small tracking fix to land before merge (below).
What I verified
iris_engine/contracts/llm_provider.py matches contracts/llm-provider.md exactly:
- All 7 typed errors present and correctly mapped:
LLMUnavailable(5xx/network/timeout),LLMRateLimited(429),LLMAuthenticationFailed(401/403),LLMSchemaViolation,LLMContextWindowExceeded,LLMContentFiltered,LLMInvalidRequest(400), all underLLMError. LLMUsage/LLMResponse/LLMRequestfrozen dataclasses with the documented fields.LLMRequestdefaults are sensible (temperature=0.0for deterministic extraction,schema=None,model_hintfor the extraction/classification/summary/chat routing the gateway will use later).LLMResponse.structured: BaseModel | None— pydantic typed, populated only when a schema was supplied.adapter_iddoc-comment correctly notes "in-memory" for the stub.- Reuses
TenantContextfromocr_enginerather than redefining it, and re-exports it in__all__— good, one shared context type across both contracts. - Module docstring states the layer rule ("No adapter imports... importable without any LLM dependency").
iris_enginestays at the bottom; lint confirms 3 contracts kept.
Gates: lint clean, mypy clean on 20 source files, full suite 373 passed.
One fix before merge
tasks/004-llm-adapter-set/tasks.md still shows - [ ] **T040**. Every prior task PR flipped its checkbox in the same PR; this one didn't. Left as-is, main will show T040 as not-done even though it's complete — and the docs-ci structural check won't catch a stale checkbox.
Please flip it to - [x] **T040**. You're rebasing this branch anyway as the stack lands, so it's a zero-cost addition to the same push. Not requesting changes since the code needs nothing — just don't let it merge without the flip.
(Optional: T040's acceptance text references the OTEL span on complete carrying structured_output_used — that span lands with the LLM equivalent of T038, not here. If you want to mirror the OCR pattern, scope T040's acceptance to "Protocol type-checks under mypy strict; imports with zero adapter deps" the way you did for T030, and let the span clause ride with the later instrumentation task. Your call.)
Solid foundation for the four LLM adapters. Reviewing #40 (selector + stub) next.
9ee71d9 to
e4c4eb4
Compare
Summary
Add
iris_engine/contracts/llm_provider.pydefining theLLMProviderProtocol, frozen dataclasses (LLMRequest,LLMResponse,LLMUsage), and the typed error hierarchy (8 classes) for the LLM adapter workstream. Single file, zero adapter dependencies, importable in a bare iris-engine install.Task reference
T040004-llm-adapter-setWhy
Establishes the single authoritative contract that all four LLM adapters (T044-T047) and the selector (T041) implement against. Must land before any adapter package can be written.
Acceptance criteria
mypy --strictpasses on the module.from iris_engine.contracts.llm_provider import LLMProviderworks in a bare iris-engine install).docs-ciworkflow passes (markdown lint + tasks structural check).PR checklist
tasks.mdentry is satisfied.docs-ciworkflow passes (markdown lint + tasks structural check).Notes for the reviewer
TenantContextis re-exported fromocr_engine.pyrather than redefined. One authoritative definition prevents drift between OCR and LLM callsites.LLMRequest.stopislist[str] | None(nottuple) per the contract doc. The dataclass isfrozen=Truefor immutability semantics;hash()is never called on request instances so the mutable field is acceptable.LLMResponse.structuredisBaseModel | Noneat the Protocol level. Callers narrow it against the schema they passed in. Using a generic TypeVar on the Protocol method would complicate structural subtype checking.contracts/is omitted from coverage (omit = ["*/iris_engine/contracts/*"]) - same treatment asocr_engine.py.Rebase note
This PR is stacked on #38 (T038+T039). Once #38 merges, rebase onto
mainbefore merging this one.