Skip to content

T040: LLMProvider Protocol, request/response types, and error hierarchy#39

Open
likith1908 wants to merge 15 commits into
T038-T039-otel-spans-readmesfrom
T040-llm-provider-protocol
Open

T040: LLMProvider Protocol, request/response types, and error hierarchy#39
likith1908 wants to merge 15 commits into
T038-T039-otel-spans-readmesfrom
T040-llm-provider-protocol

Conversation

@likith1908

Copy link
Copy Markdown
Contributor

Summary

Add iris_engine/contracts/llm_provider.py defining the LLMProvider Protocol, frozen dataclasses (LLMRequest, LLMResponse, LLMUsage), and the typed error hierarchy (8 classes) for the LLM adapter workstream. Single file, zero adapter dependencies, importable in a bare iris-engine install.

Task reference

  • Task ID: T040
  • Workstream: 004-llm-adapter-set

Why

Establishes the single authoritative contract that all four LLM adapters (T044-T047) and the selector (T041) implement against. Must land before any adapter package can be written.

Acceptance criteria

  • mypy --strict passes on the module.
  • Module imports with zero adapter dependencies (from iris_engine.contracts.llm_provider import LLMProvider works in a bare iris-engine install).
  • docs-ci workflow passes (markdown lint + tasks structural check).
  • No em-dashes introduced in new prose.
  • All internal links resolve.
  • No secrets, credentials, or personal names introduced.

PR checklist

  • Every acceptance criterion in the task's tasks.md entry is satisfied.
  • docs-ci workflow passes (markdown lint + tasks structural check).
  • No em-dashes introduced in new prose.
  • All internal links resolve.
  • No secrets, credentials, or personal names introduced.

Notes for the reviewer

  • TenantContext is re-exported from ocr_engine.py rather than redefined. One authoritative definition prevents drift between OCR and LLM callsites.
  • LLMRequest.stop is list[str] | None (not tuple) per the contract doc. The dataclass is frozen=True for immutability semantics; hash() is never called on request instances so the mutable field is acceptable.
  • LLMResponse.structured is BaseModel | None at the Protocol level. Callers narrow it against the schema they passed in. Using a generic TypeVar on the Protocol method would complicate structural subtype checking.
  • Error hierarchy is shallow and explicit: each class maps to exactly one failure category (401/403, 429, 5xx, 400, content-policy). No inheritance between siblings; callers check the type, not the message.
  • contracts/ is omitted from coverage (omit = ["*/iris_engine/contracts/*"]) - same treatment as ocr_engine.py.

Rebase note

This PR is stacked on #38 (T038+T039). Once #38 merges, rebase onto main before merging this one.

likith1908 and others added 14 commits June 16, 2026 12:03
Core adapter (iris-ocr-paddleocr):
- PaddleOCREngine backed by PaddleOCRVL pipeline (v1.6), satisfying
  contract clauses C-OCR-001 through C-OCR-010 and C-OCR-LOCAL-001
- PDF rasterisation via PyMuPDF at 150 DPI; PNG/JPEG/TIFF accepted directly
- Offline mode: IRIS_PADDLEOCR_OFFLINE=1 + IRIS_PADDLEOCR_MODEL_PATH point
  the VL recognition sub-model at a local HuggingFace snapshot directory;
  startup raises OCRUnavailable immediately if offline flag is set without path
- Result mapping from PaddleOCRVLResult (dict-like, parsing_res_list) to
  OCRPageResult; VLM returns no per-block confidence so confidence=1.0
- BBox format is axis-aligned [x1,y1,x2,y2] rect, not polygon - converted
  via _rect_to_bbox with negative-coord clamp and min-size=1 enforcement
- 29 unit tests, 96% coverage; live test gated on IRIS_OCR_LIVE_PADDLEOCR=1

Fixes carried forward from code review (Datalab / ADI):
- DatalabOCREngine._poll() was an unbounded loop; added max_poll_seconds=300
  and deadline check mirroring the ADI pattern (test added)
- ocr-adi/tests/__init__.py and ocr-datalab/tests/__init__.py deleted; with
  importlib mode these caused all three adapter test_unit modules to resolve
  to the same module name, running ADI tests for every adapter
- All three adapter live test files now carry pytest.mark.slow alongside their
  skipif guard so intent is explicit when running the full suite

Workspace changes:
- Root pyproject.toml: iris_ocr_paddleocr added to coverage sources, mypy
  overrides, importlinter; ruff exclude added for PaddleOCR-VL-1.6-snapshot
- .gitignore: PaddleOCR-VL-1.6-snapshot/ added (1.8 GB model, local-only)
- .env.example: PaddleOCR offline mode vars and model download instructions
- huggingface_hub>=0.23 added to ocr-paddleocr dependencies
extract() was holding the event loop during PyMuPDF rasterisation and
pipeline.predict() (up to ~15s per doc). Wrapped both _to_images and the
per-page inference loop in asyncio.to_thread so concurrent requests are
not frozen during OCR. Also flipped T035 to [x] in tasks.md and added
the offline/airgap caveat about PP-DocLayoutV3 and doc-preprocessor
sub-models requiring ~/.paddlex pre-seeding for a true airgap.
paddlepaddle was already resolved as a transitive dep via paddleocr/paddlex
but was not declared explicitly. A partial-sync of the venv (e.g. during
rebase operations) could leave it missing until a full uv sync --all-packages
ran. Declaring it explicitly ensures it is always present after any uv sync.
Core adapter (iris-ocr-local):
- TesseractEngine backed by pytesseract wrapping the system tesseract binary;
  satisfies C-OCR-001 through C-OCR-010 and C-OCR-LOCAL-001 (inherent - no
  network calls at any point)
- PDF rasterisation via PyMuPDF at 150 DPI, same pattern as iris-ocr-paddleocr
- image_to_data() provides per-word bboxes in pixels and confidence [0-100];
  adapter normalises confidence to [0.0-1.0], averages per page
- Words grouped by block_num; blocks joined with double newline in markdown
- Rows with conf == -1 (layout rows) and empty text stripped before mapping
- IRIS_TESSERACT_CMD env var overrides the binary path (Windows / custom paths)
- 28 unit tests, all contract clauses mocked via _pytesseract injection seam
- Live test gated on IRIS_OCR_LIVE_LOCAL=1; result: 182 ms, confidence 0.96,
  3 bboxes, markdown "IRIS live test"

Adapter READMEs (T039 partial - all four OCR adapters):
- ocr-adi/README.md: Azure setup, env vars, live test command, limitations
- ocr-datalab/README.md: API key setup, polling flow, bbox/confidence notes
- ocr-paddleocr/README.md: model download, GPU install, offline mode, limits
- ocr-local/README.md: platform install commands, IRIS_TESSERACT_CMD, live test

Makefile:
- install: cross-platform tesseract check via shutil.which after uv sync;
  prints per-platform install instructions if binary is absent
- distclean (Linux + Windows): note that system packages are not auto-removed

Workspace:
- iris_ocr_local added to coverage sources in root pyproject.toml
- tasks.md: T035 marked complete
README files for adi, datalab, local, and paddleocr are kept
untracked locally. They will be added in the T039 PR which
covers all adapter documentation together.
extract() was blocking the event loop during PyMuPDF rasterisation and
pytesseract.image_to_data() subprocess calls. Wrapped both in
asyncio.to_thread so concurrent requests are not frozen during OCR.
Also flipped T036 to [x] in tasks.md.
Correctness: re-apply tesseract_cmd per inference call to prevent global
state clobber between concurrent engines; wrap per-page get_pixmap and
Image.frombytes in try/except OCRMalformedDocument; replace TIFF while-True
frame loop with ImageSequence.Iterator; sort blocks by block_num for correct
reading order on multi-column documents; move _map_result inside _run_page
try/except so KeyError raises OCRUnavailable not a raw exception.
Infrastructure: add missing --cov flags for all four adapter packages to
make test-cov. Tests: strengthen C-OCR-002 to assert full five-word phrase
and total_pages; add per-page content assertions to test_c003; add
test_binary_not_found_raises_unavailable. Documentation: note C-OCR-011
deferral to T038 in module docstring.
pyproject.toml [tool.coverage.run] source already lists all packages;
duplicate flags in the Makefile were a second source of truth that
would silently drift when a new adapter is added.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
tests/contract/test_ocr_contract.py: 66 tests + 1 skipped placeholder.
C-OCR-001 through C-OCR-010 and C-OCR-LOCAL-001 verified against all five
adapters (in-memory, adi, datalab, paddleocr, local) with mocked externals.
C-OCR-007 excludes in-memory: documented gap from PR#32 - fake engine returns
pre-registered results and cannot detect malformed PDFs without a parser.
C-OCR-011 skipped pending T038 (OTEL spans).
- C-OCR-LOCAL-001: add monkeypatch socket.socket.connect guard so the
  test fails loudly if a local adapter ever opens a network connection,
  not just when extraction returns pages.
- Lines 123 and 258: remove doubled # pragma: allowlist secret comment.
Adds iris_engine.ocr.tracing with instrument_extract() async context manager
and log_extract_success(). Every adapter's extract() is wrapped: span name
ocr.extract, attributes ocr.adapter_id, ocr.tenant_id, ocr.product_slug,
ocr.document_id, ocr.content_type, ocr.total_pages, ocr.total_latency_ms,
ocr.success. OCRError subtypes recorded with ocr.error_category; unexpected
exceptions re-raised with StatusCode.ERROR.

InMemoryOCREngine also wrapped so C-OCR-011 covers all 5 adapters.
conftest.py wires an InMemorySpanExporter at module level; span_exporter
fixture clears it per test. C-OCR-011 parametrised across all 5 adapters.

opentelemetry-api>=1.20 added to iris-engine deps; opentelemetry-sdk>=1.20
added to root dev deps. mypy ignore_missing_imports added for opentelemetry
(no upstream type stubs).
Four new README files - one per OCR adapter - documenting prerequisites,
environment variables, running tests, how each adapter works, and known
limitations. PaddleOCR README includes a full airgapped deployment section
covering PADDLEX_HOME and PADDLE_PDX_DISABLE_MODEL_SOURCE_CHECK for
environments without internet access.
… (T040)

Introduces iris_engine/contracts/llm_provider.py with the LLMProvider
Protocol, LLMRequest, LLMResponse, LLMUsage dataclasses, and the full
typed error hierarchy (LLMUnavailable, LLMRateLimited,
LLMAuthenticationFailed, LLMSchemaViolation, LLMContextWindowExceeded,
LLMContentFiltered, LLMInvalidRequest). No adapter imports; the module
is importable in a bare iris-engine install.

@anmolg1997 anmolg1997 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved — substance is a clean, faithful port of the contract. One small tracking fix to land before merge (below).

What I verified

iris_engine/contracts/llm_provider.py matches contracts/llm-provider.md exactly:

  • All 7 typed errors present and correctly mapped: LLMUnavailable (5xx/network/timeout), LLMRateLimited (429), LLMAuthenticationFailed (401/403), LLMSchemaViolation, LLMContextWindowExceeded, LLMContentFiltered, LLMInvalidRequest (400), all under LLMError.
  • LLMUsage / LLMResponse / LLMRequest frozen dataclasses with the documented fields. LLMRequest defaults are sensible (temperature=0.0 for deterministic extraction, schema=None, model_hint for the extraction/classification/summary/chat routing the gateway will use later).
  • LLMResponse.structured: BaseModel | None — pydantic typed, populated only when a schema was supplied. adapter_id doc-comment correctly notes "in-memory" for the stub.
  • Reuses TenantContext from ocr_engine rather than redefining it, and re-exports it in __all__ — good, one shared context type across both contracts.
  • Module docstring states the layer rule ("No adapter imports... importable without any LLM dependency"). iris_engine stays at the bottom; lint confirms 3 contracts kept.

Gates: lint clean, mypy clean on 20 source files, full suite 373 passed.

One fix before merge

tasks/004-llm-adapter-set/tasks.md still shows - [ ] **T040**. Every prior task PR flipped its checkbox in the same PR; this one didn't. Left as-is, main will show T040 as not-done even though it's complete — and the docs-ci structural check won't catch a stale checkbox.

Please flip it to - [x] **T040**. You're rebasing this branch anyway as the stack lands, so it's a zero-cost addition to the same push. Not requesting changes since the code needs nothing — just don't let it merge without the flip.

(Optional: T040's acceptance text references the OTEL span on complete carrying structured_output_used — that span lands with the LLM equivalent of T038, not here. If you want to mirror the OCR pattern, scope T040's acceptance to "Protocol type-checks under mypy strict; imports with zero adapter deps" the way you did for T030, and let the span clause ride with the later instrumentation task. Your call.)

Solid foundation for the four LLM adapters. Reviewing #40 (selector + stub) next.

@likith1908 likith1908 force-pushed the T038-T039-otel-spans-readmes branch from 9ee71d9 to e4c4eb4 Compare June 17, 2026 13:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants