Implement ADI OCR adapter, py.typed marker, .env.example OCR vars (T033) by likith1908 · Pull Request #33 · auropro-hyd/IRIS

likith1908 · 2026-06-08T09:46:16Z

Summary

Implements T033: the Azure Document Intelligence OCR adapter. All ADI HTTP traffic goes through raw `httpx.AsyncClient` with no Azure SDK dependency, consistent with the decision recorded in plan.md key choice 3.

`packages/iris-adapters/ocr-adi/src/iris_ocr_adi/client.py`: `AdiOCREngine` - async POST to start analysis, polling loop on `Operation-Location`, full mapping of ADI analysis result to `OCRResult`; typed errors for every HTTP and ADI failure mode; poll deadline guard via `max_poll_seconds=300.0`
`packages/iris-adapters/ocr-adi/src/iris_ocr_adi/init.py`: re-exports `AdiOCREngine`
`packages/iris-adapters/ocr-adi/pyproject.toml`: dependencies (`httpx>=0.27`, `iris-engine`)
`packages/iris-adapters/ocr-adi/tests/test_unit.py`: 26 unit tests, all ADI HTTP calls mocked; covers C-OCR-001 through C-OCR-010, full HTTP error mapping, poll deadline, and edge cases
`packages/iris-adapters/ocr-adi/tests/test_live.py`: C-OCR-LIVE-001, gated on `IRIS_OCR_LIVE_ADI=1`
`packages/iris-adapters/ocr-adi/tests/init.py`: empty package marker to prevent pytest module name collision when multiple adapter packages share the same test file names
`packages/iris-engine/src/iris_engine/py.typed`: PEP 561 typed marker so downstream consumers see `iris-engine` as a typed package
`.env.example`: OCR adapter section documenting `IRIS_ADI_ENDPOINT`, `IRIS_ADI_API_KEY`, `IRIS_ADI_MODEL`, `IRIS_OCR_LIVE_ADI`
`tasks/003-ocr-adapter-set/tasks.md`: T033 marked complete; acceptance text updated with explicit clause references

Task reference

Task ID: `T033`
Workstream: `003-ocr-adapter-set`

Acceptance criteria

T033

PR checklist:

Every acceptance criterion in the task's `tasks.md` entry is satisfied.
`docs-ci` workflow passes (markdown lint + tasks structural check).
No em-dashes introduced in new prose.
All internal links resolve.
No secrets, credentials, or personal names introduced.

Notes for the reviewer

No Azure SDK

`httpx.AsyncClient` is used directly. The Azure SDK (`azure-ai-documentintelligence`) is not a dependency. This keeps the package lightweight and testable with a mock client injected via `_http_client`. Token-based auth (Entra) is deferred to the OIDC workstream per plan.md key choice 3.

ADI async flow

ADI analysis is a two-step operation: POST returns HTTP 202 and an `Operation-Location` URL; GET that URL until `status == "succeeded"`. The polling loop is in `_poll()`. `poll_interval` defaults to 1.0 s in production but is set to 0.0 in tests so they do not sleep. A `deadline = time.monotonic() + max_poll_seconds` check at the top of each poll iteration ensures the loop cannot hang indefinitely if ADI keeps returning `"running"`.

Polygon-to-BBox unit conversion

ADI returns polygons (8 floats: x0,y0 to x3,y3 going clockwise). The unit is per-page: `"inch"` for PDFs and TIFFs, `"pixel"` for PNG and JPEG. `_polygon_to_bbox()` scales inch coordinates to pixels at 96 DPI (the standard screen approximation). `max(1, ...)` ensures zero-width/height boxes - which would fail C-OCR-004 - never occur from rounding.

Out of scope additions (beyond T033 acceptance)

The following were added beyond what T033 explicitly required. They are included because they reflect real ADI behaviour and prevent silent swallowing of typed errors:

Addition	Reason
`OCRDocumentTooLarge` for ADI `ContentTooLarge` error code	ADI returns this at the application level (in the poll body), not as an HTTP status code. Mapping it to the correct typed error prevents it falling through to the generic `OCRUnavailable` path.
Missing `Operation-Location` header handling	ADI guarantees this header on 202 but a proxy or misconfiguration can strip it. Without this guard the code would raise an untyped `AttributeError`.
Non-JSON poll response handling	A gateway timeout or WAF block can return HTML. The bare `response.json()` call would raise an untyped `JSONDecodeError`.
ADI `status == "failed"` with generic error code	Distinct from `InvalidContent` and `ContentTooLarge` - maps to `OCRUnavailable` so the fallback path can trigger.
`py.typed` marker on `iris-engine`	Without it, mypy treats all `iris_engine` imports as untyped in downstream packages. This is the correct PEP 561 fix; it is not adapter-specific but was discovered when wiring the ADI adapter.
`.env.example` OCR section	T039 owns adapter READMEs; this is the env var documentation at repo root.
`tests/init.py`	Marks the test directory as a package. Required step toward resolving the pytest module name collision that surfaces when multiple adapter packages each have a `test_unit.py`. The full fix (adding `--import-mode=importlib` to pytest config) lands in T034 once the second adapter exists.

C-OCR-006 bytes-in-error test

C-OCR-006 in the contract explicitly states "the error does not leak the document bytes in its message". `test_c006_error_does_not_contain_bytes` wraps the call in `pytest.raises(OCRUnsupportedContentType)` so a non-raising `extract` fails loudly. The implementation raises `OCRUnsupportedContentType(f"ADI adapter does not support content type {content_type!r}")` - the content type string, not the bytes.

Coverage

`iris_ocr_adi` is included in `[tool.coverage.run] source`. Current coverage: 97.76% (gate: 95%). Three additional tests were added after the initial review to reach this level: `test_poll_timeout_raises_unavailable`, `test_http_4xx_catch_all_raises_unavailable`, `test_poll_running_then_succeeded`.

…T033)

likith1908 · 2026-06-09T09:27:13Z

Live test result: C-OCR-LIVE-001

Run against the real ADI endpoint with a synthetic single-page PDF containing the text IRIS live test.

adapter_id   : adi
total_pages  : 1
latency_ms   : 5213

--- page 1 ---
confidence   : 0.9883
bboxes       : 3
markdown     :
IRIS live test

What this confirms:

Authentication against the real Azure endpoint works (IRIS_ADI_ENDPOINT + IRIS_ADI_API_KEY)
POST + Operation-Location poll loop completed successfully (~5s, 2-3 poll cycles - normal for ADI)
ADI read the embedded text correctly (IRIS live test)
Confidence is in [0, 1] - 0.9883 (C-OCR-005)
3 bboxes returned with valid coordinates - word-level spans for the three words (C-OCR-004)
adapter_id == "adi" in the result (C-OCR-009)

likith1908 · 2026-06-09T09:34:18Z

@anmolg1997 , Please review this PR.

anmolg1997

Approved — first real adapter and it sets a high bar. One robustness gap to close (poll deadline, below); fine to fold into this PR or land as an immediate follow-up commit, your call.

What I verified

Adapter implementation (iris_ocr_adi/client.py)

No Azure SDK — raw httpx.AsyncClient per plan.md key choice 3. The _http_client injection seam makes the whole thing testable without patching. Right design.
Two-step ADI flow modeled correctly: POST → 202 + Operation-Location, then GET-poll until succeeded/failed. poll_interval=0.0 in tests so the suite stays fast (full run is 1.1s).
Error taxonomy mapping is complete and correct:
- HTTP 401/403 → OCRAuthenticationFailed, 429 → OCRRateLimited, 5xx → OCRUnavailable, timeout/connect → OCRUnavailable
- ADI app-level InvalidContent → OCRMalformedDocument, ContentTooLarge → OCRDocumentTooLarge, generic failed → OCRUnavailable (so the fallback engine from T031 can fire)
- Missing Operation-Location and non-JSON poll bodies → typed OCRUnavailable instead of raw AttributeError/JSONDecodeError. These "out of scope additions" are exactly the defensive shell a real adapter needs — good judgment call, and the PR body documents why each exists.
Polygon→BBox conversion handles ADI's per-page unit split (inch for PDF/TIFF at 96 DPI, pixel for PNG/JPEG), and max(1, ...) guards rounding to zero-width boxes that would violate C-OCR-004.
Confidence clamped to [0,1] after averaging word confidences. Empty-words page defaults to 1.0 — defensible (no evidence of low quality).
time.monotonic() for latency — correct clock for intervals.

Tests

22 unit tests, every contract clause C-OCR-001 through C-OCR-010 named in the test function (test_c00X_...). This naming convention makes T037's parametrised contract suite almost mechanical to write — the clause-to-test mapping is already explicit. Keep this convention for T034-T036.
HTTP error mapping covered by 10 dedicated tests including the two defensive paths (missing header, non-JSON body).
Live test correctly gated on IRIS_OCR_LIVE_ADI=1 with pytest.mark.skipif, self-contained fixture PDF, no committed credentials.

Supporting changes

py.typed on iris-engine — correct PEP 561 fix, and the PR body explains why it surfaced here. Without it every downstream adapter would see iris_engine as untyped under strict mypy.
.env.example OCR section documents all four new vars with the allowlist pragma on the placeholder key. Doc-and-implementation-together, again.
T033 acceptance text updated with explicit clause references — fifth instance of the inline spec-fix pattern.
mypy now checks 15 source files; iris_ocr_adi.* dropped out of the unused-overrides list, meaning the strict override is now actually exercising the adapter.

One gap to close: the poll loop has no deadline

_poll() is while True with no max duration. Each individual GET has the client timeout, but a stuck operation that keeps returning status: "running" forever (or a long-degraded ADI that answers every poll politely) hangs extract() indefinitely — the per-request timeout never fires because each request succeeds. In production behind the T031 fallback wrapper, this also means the fallback never gets a chance.

Suggested shape:

def __init__(self, ..., poll_interval: float = 1.0, max_poll_seconds: float = 300.0, ...):
    self._max_poll_seconds = max_poll_seconds

async def _poll(self, client, operation_url):
    deadline = time.monotonic() + self._max_poll_seconds
    while True:
        if time.monotonic() > deadline:
            raise OCRUnavailable(
                f"ADI analysis did not complete within {self._max_poll_seconds:.0f}s"
            )
        ...

Plus one test: poll body always running, max_poll_seconds=0.01, assert OCRUnavailable. Five minutes of work; my preference is folding it into this PR so the first adapter ships hang-proof, but an immediate follow-up commit works too.

Two minor notes (genuinely non-blocking)

test_c006_error_does_not_contain_bytes uses try/except — if extract doesn't raise, the test silently passes. Wrap in pytest.raises(OCRUnsupportedContentType) and assert on the exception object so the no-raise path fails loudly.
Adapter coverage isn't measured: [tool.coverage.run] source = ["iris_engine", "iris_config"] doesn't include iris_ocr_adi, so the 22 tests don't count toward the gate. best-practices.md section 3 says 80% on iris-engine and every adapter package. Suggest adding "iris_ocr_adi" (and successors as they land) to the source list — or switching to measuring packages/ wholesale once a few adapters exist. Fine as part of T034 or a tiny hygiene commit.

Looking ahead

T034 (Datalab) is next and can reuse this exact shape: injection seam, clause-named tests, gated live test, .env.example section. The PR body format here (clause checklist + "out of scope additions" table with reasons) is the best one yet — keep it.

- Add poll deadline (max_poll_seconds=300) to prevent infinite hang - Fix test_c006 silent pass with pytest.raises - Add iris_ocr_adi to coverage source; coverage now 97.76% - Add tests/__init__.py to prevent same-name collision across adapter packages - Add 3 new tests: poll deadline, poll timeout, running->succeeded transition

likith1908 · 2026-06-10T06:41:28Z

All review points addressed:

Poll deadline (required): max_poll_seconds=300.0 added to AdiOCREngine.__init__. _poll() checks deadline at the top of each iteration and raises OCRUnavailable if exceeded. New test: test_poll_deadline_exceeded_raises_unavailable with max_poll_seconds=0.0 confirms the path.

test_c006 silent pass (non-blocking): Replaced bare try/except with pytest.raises(OCRUnsupportedContentType) - a non-raising extract now fails loudly.

Coverage source (non-blocking): Added iris_ocr_adi to [tool.coverage.run] source. Three additional tests added (poll timeout, 4xx catch-all, running->succeeded transition). Coverage is now 97.76% (gate: 95%).

Bonus: Added tests/__init__.py to prevent pytest module name collision when multiple adapter packages have same-named test files (test_unit.py, test_live.py).

anmolg1997

Re-approving on 2c4240d. All three review items closed:

Poll deadline — max_poll_seconds: float = 300.0 with the monotonic-deadline check at the top of the loop, exactly the suggested shape, plus test_poll_deadline_exceeded_raises_unavailable proving it fires on a perpetual running status.
Bytes-leak test — now pytest.raises(...) with assertion on exc_info.value; the no-raise path fails loudly.
Adapter coverage — iris_ocr_adi added to [tool.coverage.run] source.

Bonus tests beyond the asks: poll-phase timeout, 4xx catch-all, and running→succeeded transition (the happy path through the poll loop's wait branch). 26 unit tests now. All 10 checks green. Ready to merge — and since #34/#35 will conflict on pyproject.toml coverage source and .env.example, merge this first and rebase those.

Implement ADI OCR adapter with httpx client and contract test suite (…

3df277b

…T033)

likith1908 force-pushed the T033-adi-adapter branch from abac774 to 3df277b Compare June 8, 2026 09:55

likith1908 changed the base branch from T031-T032-selector-inmemory to main June 8, 2026 09:55

likith1908 marked this pull request as ready for review June 8, 2026 13:11

likith1908 requested a review from anmolg1997 as a code owner June 8, 2026 13:11

trigger ci

801099c

likith1908 marked this pull request as draft June 9, 2026 04:50

likith1908 marked this pull request as ready for review June 9, 2026 09:33

anmolg1997 previously approved these changes Jun 10, 2026

View reviewed changes

likith1908 dismissed anmolg1997’s stale review via 2c4240d June 10, 2026 06:40

likith1908 requested a review from anmolg1997 June 10, 2026 06:41

likith1908 mentioned this pull request Jun 10, 2026

Implement Datalab OCR adapter with httpx client and contract test suite (T034) #34

Merged

19 tasks

anmolg1997 approved these changes Jun 12, 2026

View reviewed changes

likith1908 merged commit 45fd430 into main Jun 12, 2026
10 checks passed

anmolg1997 mentioned this pull request Jun 16, 2026

T036: Implement Tesseract local OCR adapter #36

Merged

18 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement ADI OCR adapter, py.typed marker, .env.example OCR vars (T033)#33

Implement ADI OCR adapter, py.typed marker, .env.example OCR vars (T033)#33
likith1908 merged 3 commits into
mainfrom
T033-adi-adapter

likith1908 commented Jun 8, 2026 •

edited

Loading

Uh oh!

likith1908 commented Jun 9, 2026 •

edited

Loading

Uh oh!

likith1908 commented Jun 9, 2026

Uh oh!

anmolg1997 left a comment

Uh oh!

likith1908 commented Jun 10, 2026

Uh oh!

anmolg1997 left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

likith1908 commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Task reference

Acceptance criteria

Notes for the reviewer

No Azure SDK

ADI async flow

Polygon-to-BBox unit conversion

Out of scope additions (beyond T033 acceptance)

C-OCR-006 bytes-in-error test

Coverage

Uh oh!

likith1908 commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Live test result: C-OCR-LIVE-001

Uh oh!

likith1908 commented Jun 9, 2026

Uh oh!

anmolg1997 left a comment

Choose a reason for hiding this comment

What I verified

One gap to close: the poll loop has no deadline

Two minor notes (genuinely non-blocking)

Looking ahead

Uh oh!

likith1908 commented Jun 10, 2026

Uh oh!

anmolg1997 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

likith1908 commented Jun 8, 2026 •

edited

Loading

likith1908 commented Jun 9, 2026 •

edited

Loading