T040: LLMProvider Protocol, request/response types, and error hierarchy by likith1908 · Pull Request #39 · auropro-hyd/IRIS

likith1908 · 2026-06-17T06:56:49Z

Summary

Add iris_engine/contracts/llm_provider.py defining the LLMProvider Protocol, frozen dataclasses (LLMRequest, LLMResponse, LLMUsage), and the typed error hierarchy (8 classes) for the LLM adapter workstream. Single file, zero adapter dependencies, importable in a bare iris-engine install.

Task reference

Task ID: T040
Workstream: 004-llm-adapter-set

Why

Establishes the single authoritative contract that all four LLM adapters (T044-T047) and the selector (T041) implement against. Must land before any adapter package can be written.

Acceptance criteria

mypy --strict passes on the module.
Module imports with zero adapter dependencies (from iris_engine.contracts.llm_provider import LLMProvider works in a bare iris-engine install).
docs-ci workflow passes (markdown lint + tasks structural check).
No em-dashes introduced in new prose.
All internal links resolve.
No secrets, credentials, or personal names introduced.

PR checklist

Every acceptance criterion in the task's tasks.md entry is satisfied.
docs-ci workflow passes (markdown lint + tasks structural check).
No em-dashes introduced in new prose.
All internal links resolve.
No secrets, credentials, or personal names introduced.

Notes for the reviewer

TenantContext is re-exported from ocr_engine.py rather than redefined. One authoritative definition prevents drift between OCR and LLM callsites.
LLMRequest.stop is list[str] | None (not tuple) per the contract doc. The dataclass is frozen=True for immutability semantics; hash() is never called on request instances so the mutable field is acceptable.
LLMResponse.structured is BaseModel | None at the Protocol level. Callers narrow it against the schema they passed in. Using a generic TypeVar on the Protocol method would complicate structural subtype checking.
Error hierarchy is shallow and explicit: each class maps to exactly one failure category (401/403, 429, 5xx, 400, content-policy). No inheritance between siblings; callers check the type, not the message.
contracts/ is omitted from coverage (omit = ["*/iris_engine/contracts/*"]) - same treatment as ocr_engine.py.

Rebase note

This PR is stacked on #38 (T038+T039). Once #38 merges, rebase onto main before merging this one.

Core adapter (iris-ocr-paddleocr): - PaddleOCREngine backed by PaddleOCRVL pipeline (v1.6), satisfying contract clauses C-OCR-001 through C-OCR-010 and C-OCR-LOCAL-001 - PDF rasterisation via PyMuPDF at 150 DPI; PNG/JPEG/TIFF accepted directly - Offline mode: IRIS_PADDLEOCR_OFFLINE=1 + IRIS_PADDLEOCR_MODEL_PATH point the VL recognition sub-model at a local HuggingFace snapshot directory; startup raises OCRUnavailable immediately if offline flag is set without path - Result mapping from PaddleOCRVLResult (dict-like, parsing_res_list) to OCRPageResult; VLM returns no per-block confidence so confidence=1.0 - BBox format is axis-aligned [x1,y1,x2,y2] rect, not polygon - converted via _rect_to_bbox with negative-coord clamp and min-size=1 enforcement - 29 unit tests, 96% coverage; live test gated on IRIS_OCR_LIVE_PADDLEOCR=1 Fixes carried forward from code review (Datalab / ADI): - DatalabOCREngine._poll() was an unbounded loop; added max_poll_seconds=300 and deadline check mirroring the ADI pattern (test added) - ocr-adi/tests/__init__.py and ocr-datalab/tests/__init__.py deleted; with importlib mode these caused all three adapter test_unit modules to resolve to the same module name, running ADI tests for every adapter - All three adapter live test files now carry pytest.mark.slow alongside their skipif guard so intent is explicit when running the full suite Workspace changes: - Root pyproject.toml: iris_ocr_paddleocr added to coverage sources, mypy overrides, importlinter; ruff exclude added for PaddleOCR-VL-1.6-snapshot - .gitignore: PaddleOCR-VL-1.6-snapshot/ added (1.8 GB model, local-only) - .env.example: PaddleOCR offline mode vars and model download instructions - huggingface_hub>=0.23 added to ocr-paddleocr dependencies

extract() was holding the event loop during PyMuPDF rasterisation and pipeline.predict() (up to ~15s per doc). Wrapped both _to_images and the per-page inference loop in asyncio.to_thread so concurrent requests are not frozen during OCR. Also flipped T035 to [x] in tasks.md and added the offline/airgap caveat about PP-DocLayoutV3 and doc-preprocessor sub-models requiring ~/.paddlex pre-seeding for a true airgap.

paddlepaddle was already resolved as a transitive dep via paddleocr/paddlex but was not declared explicitly. A partial-sync of the venv (e.g. during rebase operations) could leave it missing until a full uv sync --all-packages ran. Declaring it explicitly ensures it is always present after any uv sync.

Core adapter (iris-ocr-local): - TesseractEngine backed by pytesseract wrapping the system tesseract binary; satisfies C-OCR-001 through C-OCR-010 and C-OCR-LOCAL-001 (inherent - no network calls at any point) - PDF rasterisation via PyMuPDF at 150 DPI, same pattern as iris-ocr-paddleocr - image_to_data() provides per-word bboxes in pixels and confidence [0-100]; adapter normalises confidence to [0.0-1.0], averages per page - Words grouped by block_num; blocks joined with double newline in markdown - Rows with conf == -1 (layout rows) and empty text stripped before mapping - IRIS_TESSERACT_CMD env var overrides the binary path (Windows / custom paths) - 28 unit tests, all contract clauses mocked via _pytesseract injection seam - Live test gated on IRIS_OCR_LIVE_LOCAL=1; result: 182 ms, confidence 0.96, 3 bboxes, markdown "IRIS live test" Adapter READMEs (T039 partial - all four OCR adapters): - ocr-adi/README.md: Azure setup, env vars, live test command, limitations - ocr-datalab/README.md: API key setup, polling flow, bbox/confidence notes - ocr-paddleocr/README.md: model download, GPU install, offline mode, limits - ocr-local/README.md: platform install commands, IRIS_TESSERACT_CMD, live test Makefile: - install: cross-platform tesseract check via shutil.which after uv sync; prints per-platform install instructions if binary is absent - distclean (Linux + Windows): note that system packages are not auto-removed Workspace: - iris_ocr_local added to coverage sources in root pyproject.toml - tasks.md: T035 marked complete

README files for adi, datalab, local, and paddleocr are kept untracked locally. They will be added in the T039 PR which covers all adapter documentation together.

extract() was blocking the event loop during PyMuPDF rasterisation and pytesseract.image_to_data() subprocess calls. Wrapped both in asyncio.to_thread so concurrent requests are not frozen during OCR. Also flipped T036 to [x] in tasks.md.

Correctness: re-apply tesseract_cmd per inference call to prevent global state clobber between concurrent engines; wrap per-page get_pixmap and Image.frombytes in try/except OCRMalformedDocument; replace TIFF while-True frame loop with ImageSequence.Iterator; sort blocks by block_num for correct reading order on multi-column documents; move _map_result inside _run_page try/except so KeyError raises OCRUnavailable not a raw exception. Infrastructure: add missing --cov flags for all four adapter packages to make test-cov. Tests: strengthen C-OCR-002 to assert full five-word phrase and total_pages; add per-page content assertions to test_c003; add test_binary_not_found_raises_unavailable. Documentation: note C-OCR-011 deferral to T038 in module docstring.

pyproject.toml [tool.coverage.run] source already lists all packages; duplicate flags in the Makefile were a second source of truth that would silently drift when a new adapter is added. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

tests/contract/test_ocr_contract.py: 66 tests + 1 skipped placeholder. C-OCR-001 through C-OCR-010 and C-OCR-LOCAL-001 verified against all five adapters (in-memory, adi, datalab, paddleocr, local) with mocked externals. C-OCR-007 excludes in-memory: documented gap from PR#32 - fake engine returns pre-registered results and cannot detect malformed PDFs without a parser. C-OCR-011 skipped pending T038 (OTEL spans).

- C-OCR-LOCAL-001: add monkeypatch socket.socket.connect guard so the test fails loudly if a local adapter ever opens a network connection, not just when extraction returns pages. - Lines 123 and 258: remove doubled # pragma: allowlist secret comment.

Adds iris_engine.ocr.tracing with instrument_extract() async context manager and log_extract_success(). Every adapter's extract() is wrapped: span name ocr.extract, attributes ocr.adapter_id, ocr.tenant_id, ocr.product_slug, ocr.document_id, ocr.content_type, ocr.total_pages, ocr.total_latency_ms, ocr.success. OCRError subtypes recorded with ocr.error_category; unexpected exceptions re-raised with StatusCode.ERROR. InMemoryOCREngine also wrapped so C-OCR-011 covers all 5 adapters. conftest.py wires an InMemorySpanExporter at module level; span_exporter fixture clears it per test. C-OCR-011 parametrised across all 5 adapters. opentelemetry-api>=1.20 added to iris-engine deps; opentelemetry-sdk>=1.20 added to root dev deps. mypy ignore_missing_imports added for opentelemetry (no upstream type stubs).

Four new README files - one per OCR adapter - documenting prerequisites, environment variables, running tests, how each adapter works, and known limitations. PaddleOCR README includes a full airgapped deployment section covering PADDLEX_HOME and PADDLE_PDX_DISABLE_MODEL_SOURCE_CHECK for environments without internet access.

… (T040) Introduces iris_engine/contracts/llm_provider.py with the LLMProvider Protocol, LLMRequest, LLMResponse, LLMUsage dataclasses, and the full typed error hierarchy (LLMUnavailable, LLMRateLimited, LLMAuthenticationFailed, LLMSchemaViolation, LLMContextWindowExceeded, LLMContentFiltered, LLMInvalidRequest). No adapter imports; the module is importable in a bare iris-engine install.

anmolg1997

Approved — substance is a clean, faithful port of the contract. One small tracking fix to land before merge (below).

What I verified

iris_engine/contracts/llm_provider.py matches contracts/llm-provider.md exactly:

All 7 typed errors present and correctly mapped: LLMUnavailable (5xx/network/timeout), LLMRateLimited (429), LLMAuthenticationFailed (401/403), LLMSchemaViolation, LLMContextWindowExceeded, LLMContentFiltered, LLMInvalidRequest (400), all under LLMError.
LLMUsage / LLMResponse / LLMRequest frozen dataclasses with the documented fields. LLMRequest defaults are sensible (temperature=0.0 for deterministic extraction, schema=None, model_hint for the extraction/classification/summary/chat routing the gateway will use later).
LLMResponse.structured: BaseModel | None — pydantic typed, populated only when a schema was supplied. adapter_id doc-comment correctly notes "in-memory" for the stub.
Reuses TenantContext from ocr_engine rather than redefining it, and re-exports it in __all__ — good, one shared context type across both contracts.
Module docstring states the layer rule ("No adapter imports... importable without any LLM dependency"). iris_engine stays at the bottom; lint confirms 3 contracts kept.

Gates: lint clean, mypy clean on 20 source files, full suite 373 passed.

One fix before merge

tasks/004-llm-adapter-set/tasks.md still shows - [ ] **T040**. Every prior task PR flipped its checkbox in the same PR; this one didn't. Left as-is, main will show T040 as not-done even though it's complete — and the docs-ci structural check won't catch a stale checkbox.

Please flip it to - [x] **T040**. You're rebasing this branch anyway as the stack lands, so it's a zero-cost addition to the same push. Not requesting changes since the code needs nothing — just don't let it merge without the flip.

(Optional: T040's acceptance text references the OTEL span on complete carrying structured_output_used — that span lands with the LLM equivalent of T038, not here. If you want to mirror the OCR pattern, scope T040's acceptance to "Protocol type-checks under mypy strict; imports with zero adapter deps" the way you did for T030, and let the span clause ride with the later instrumentation task. Your call.)

Solid foundation for the four LLM adapters. Reviewing #40 (selector + stub) next.

likith1908 and others added 14 commits June 16, 2026 12:03

ci: trigger CI after rebase onto main (T035)

773ad0d

Remove per-adapter READMEs from T036 - deferred to T039

7140abf

README files for adi, datalab, local, and paddleocr are kept untracked locally. They will be added in the T039 PR which covers all adapter documentation together.

likith1908 requested a review from anmolg1997 as a code owner June 17, 2026 06:56

likith1908 mentioned this pull request Jun 17, 2026

T041 + T042: LLM selector, StubLLMProvider, and generate-schemas #40

Open

14 tasks

anmolg1997 mentioned this pull request Jun 17, 2026

T038 + T039: OTEL span instrumentation and per-adapter READMEs #38

Open

13 tasks

anmolg1997 approved these changes Jun 17, 2026

View reviewed changes

Flip T040 checkbox to done in tasks.md

6b33e1f

likith1908 force-pushed the T038-T039-otel-spans-readmes branch from 9ee71d9 to e4c4eb4 Compare June 17, 2026 13:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

T040: LLMProvider Protocol, request/response types, and error hierarchy#39

T040: LLMProvider Protocol, request/response types, and error hierarchy#39
likith1908 wants to merge 15 commits into
T038-T039-otel-spans-readmesfrom
T040-llm-provider-protocol

likith1908 commented Jun 17, 2026

Uh oh!

anmolg1997 left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

likith1908 commented Jun 17, 2026

Summary

Task reference

Why

Acceptance criteria

PR checklist

Notes for the reviewer

Rebase note

Uh oh!

anmolg1997 left a comment

Choose a reason for hiding this comment

What I verified

One fix before merge

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants