Skip to content

Implement ADI OCR adapter, py.typed marker, .env.example OCR vars (T033)#33

Merged
likith1908 merged 3 commits into
mainfrom
T033-adi-adapter
Jun 12, 2026
Merged

Implement ADI OCR adapter, py.typed marker, .env.example OCR vars (T033)#33
likith1908 merged 3 commits into
mainfrom
T033-adi-adapter

Conversation

@likith1908

@likith1908 likith1908 commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Summary

Implements T033: the Azure Document Intelligence OCR adapter. All ADI HTTP traffic goes through raw `httpx.AsyncClient` with no Azure SDK dependency, consistent with the decision recorded in plan.md key choice 3.

  • `packages/iris-adapters/ocr-adi/src/iris_ocr_adi/client.py`: `AdiOCREngine` - async POST to start analysis, polling loop on `Operation-Location`, full mapping of ADI analysis result to `OCRResult`; typed errors for every HTTP and ADI failure mode; poll deadline guard via `max_poll_seconds=300.0`
  • `packages/iris-adapters/ocr-adi/src/iris_ocr_adi/init.py`: re-exports `AdiOCREngine`
  • `packages/iris-adapters/ocr-adi/pyproject.toml`: dependencies (`httpx>=0.27`, `iris-engine`)
  • `packages/iris-adapters/ocr-adi/tests/test_unit.py`: 26 unit tests, all ADI HTTP calls mocked; covers C-OCR-001 through C-OCR-010, full HTTP error mapping, poll deadline, and edge cases
  • `packages/iris-adapters/ocr-adi/tests/test_live.py`: C-OCR-LIVE-001, gated on `IRIS_OCR_LIVE_ADI=1`
  • `packages/iris-adapters/ocr-adi/tests/init.py`: empty package marker to prevent pytest module name collision when multiple adapter packages share the same test file names
  • `packages/iris-engine/src/iris_engine/py.typed`: PEP 561 typed marker so downstream consumers see `iris-engine` as a typed package
  • `.env.example`: OCR adapter section documenting `IRIS_ADI_ENDPOINT`, `IRIS_ADI_API_KEY`, `IRIS_ADI_MODEL`, `IRIS_OCR_LIVE_ADI`
  • `tasks/003-ocr-adapter-set/tasks.md`: T033 marked complete; acceptance text updated with explicit clause references

Task reference

  • Task ID: `T033`
  • Workstream: `003-ocr-adapter-set`

Acceptance criteria

T033

  • C-OCR-001: `test_c001_id_is_adi`, `test_c001_version_is_semver`
  • C-OCR-002: `test_c002_pdf_extracts_markdown`
  • C-OCR-003: `test_c003_multipage_ordering`
  • C-OCR-004: `test_c004_bbox_non_negative_xy_positive_wh`
  • C-OCR-005: `test_c005_confidence_in_range`
  • C-OCR-006: `test_c006_unsupported_content_type_raises`, `test_c006_error_does_not_contain_bytes`
  • C-OCR-007: `test_c007_adi_invalid_content_raises_malformed`
  • C-OCR-008: `test_c008_empty_bytes_raises_malformed`
  • C-OCR-009: `test_c009_adapter_id_in_result`
  • C-OCR-010: `test_c010_png_input_accepted`
  • HTTP error mapping (401, 403, 429, 5xx, timeout): covered by six dedicated tests
  • Poll deadline: `test_poll_deadline_exceeded_raises_unavailable` with `max_poll_seconds=0.0`
  • Live clause `test_live.py` runs under `IRIS_OCR_LIVE_ADI=1`
  • C-OCR-011 (OTEL span): deferred to T038

PR checklist:

  • Every acceptance criterion in the task's `tasks.md` entry is satisfied.
  • `docs-ci` workflow passes (markdown lint + tasks structural check).
  • No em-dashes introduced in new prose.
  • All internal links resolve.
  • No secrets, credentials, or personal names introduced.

Notes for the reviewer

No Azure SDK

`httpx.AsyncClient` is used directly. The Azure SDK (`azure-ai-documentintelligence`) is not a dependency. This keeps the package lightweight and testable with a mock client injected via `_http_client`. Token-based auth (Entra) is deferred to the OIDC workstream per plan.md key choice 3.

ADI async flow

ADI analysis is a two-step operation: POST returns HTTP 202 and an `Operation-Location` URL; GET that URL until `status == "succeeded"`. The polling loop is in `_poll()`. `poll_interval` defaults to 1.0 s in production but is set to 0.0 in tests so they do not sleep. A `deadline = time.monotonic() + max_poll_seconds` check at the top of each poll iteration ensures the loop cannot hang indefinitely if ADI keeps returning `"running"`.

Polygon-to-BBox unit conversion

ADI returns polygons (8 floats: x0,y0 to x3,y3 going clockwise). The unit is per-page: `"inch"` for PDFs and TIFFs, `"pixel"` for PNG and JPEG. `_polygon_to_bbox()` scales inch coordinates to pixels at 96 DPI (the standard screen approximation). `max(1, ...)` ensures zero-width/height boxes - which would fail C-OCR-004 - never occur from rounding.

Out of scope additions (beyond T033 acceptance)

The following were added beyond what T033 explicitly required. They are included because they reflect real ADI behaviour and prevent silent swallowing of typed errors:

Addition Reason
`OCRDocumentTooLarge` for ADI `ContentTooLarge` error code ADI returns this at the application level (in the poll body), not as an HTTP status code. Mapping it to the correct typed error prevents it falling through to the generic `OCRUnavailable` path.
Missing `Operation-Location` header handling ADI guarantees this header on 202 but a proxy or misconfiguration can strip it. Without this guard the code would raise an untyped `AttributeError`.
Non-JSON poll response handling A gateway timeout or WAF block can return HTML. The bare `response.json()` call would raise an untyped `JSONDecodeError`.
ADI `status == "failed"` with generic error code Distinct from `InvalidContent` and `ContentTooLarge` - maps to `OCRUnavailable` so the fallback path can trigger.
`py.typed` marker on `iris-engine` Without it, mypy treats all `iris_engine` imports as untyped in downstream packages. This is the correct PEP 561 fix; it is not adapter-specific but was discovered when wiring the ADI adapter.
`.env.example` OCR section T039 owns adapter READMEs; this is the env var documentation at repo root.
`tests/init.py` Marks the test directory as a package. Required step toward resolving the pytest module name collision that surfaces when multiple adapter packages each have a `test_unit.py`. The full fix (adding `--import-mode=importlib` to pytest config) lands in T034 once the second adapter exists.

C-OCR-006 bytes-in-error test

C-OCR-006 in the contract explicitly states "the error does not leak the document bytes in its message". `test_c006_error_does_not_contain_bytes` wraps the call in `pytest.raises(OCRUnsupportedContentType)` so a non-raising `extract` fails loudly. The implementation raises `OCRUnsupportedContentType(f"ADI adapter does not support content type {content_type!r}")` - the content type string, not the bytes.

Coverage

`iris_ocr_adi` is included in `[tool.coverage.run] source`. Current coverage: 97.76% (gate: 95%). Three additional tests were added after the initial review to reach this level: `test_poll_timeout_raises_unavailable`, `test_http_4xx_catch_all_raises_unavailable`, `test_poll_running_then_succeeded`.

@likith1908 likith1908 changed the base branch from T031-T032-selector-inmemory to main June 8, 2026 09:55
@likith1908 likith1908 marked this pull request as ready for review June 8, 2026 13:11
@likith1908 likith1908 requested a review from anmolg1997 as a code owner June 8, 2026 13:11
@likith1908 likith1908 marked this pull request as draft June 9, 2026 04:50
@likith1908

likith1908 commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

Live test result: C-OCR-LIVE-001

Run against the real ADI endpoint with a synthetic single-page PDF containing the text IRIS live test.

adapter_id   : adi
total_pages  : 1
latency_ms   : 5213

--- page 1 ---
confidence   : 0.9883
bboxes       : 3
markdown     :
IRIS live test

What this confirms:

  • Authentication against the real Azure endpoint works (IRIS_ADI_ENDPOINT + IRIS_ADI_API_KEY)
  • POST + Operation-Location poll loop completed successfully (~5s, 2-3 poll cycles - normal for ADI)
  • ADI read the embedded text correctly (IRIS live test)
  • Confidence is in [0, 1] - 0.9883 (C-OCR-005)
  • 3 bboxes returned with valid coordinates - word-level spans for the three words (C-OCR-004)
  • adapter_id == "adi" in the result (C-OCR-009)
image

@likith1908 likith1908 marked this pull request as ready for review June 9, 2026 09:33
@likith1908

Copy link
Copy Markdown
Contributor Author

@anmolg1997 , Please review this PR.

anmolg1997
anmolg1997 previously approved these changes Jun 10, 2026

@anmolg1997 anmolg1997 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved — first real adapter and it sets a high bar. One robustness gap to close (poll deadline, below); fine to fold into this PR or land as an immediate follow-up commit, your call.

What I verified

Adapter implementation (iris_ocr_adi/client.py)

  • No Azure SDK — raw httpx.AsyncClient per plan.md key choice 3. The _http_client injection seam makes the whole thing testable without patching. Right design.
  • Two-step ADI flow modeled correctly: POST → 202 + Operation-Location, then GET-poll until succeeded/failed. poll_interval=0.0 in tests so the suite stays fast (full run is 1.1s).
  • Error taxonomy mapping is complete and correct:
    • HTTP 401/403 → OCRAuthenticationFailed, 429 → OCRRateLimited, 5xx → OCRUnavailable, timeout/connect → OCRUnavailable
    • ADI app-level InvalidContentOCRMalformedDocument, ContentTooLargeOCRDocumentTooLarge, generic failedOCRUnavailable (so the fallback engine from T031 can fire)
    • Missing Operation-Location and non-JSON poll bodies → typed OCRUnavailable instead of raw AttributeError/JSONDecodeError. These "out of scope additions" are exactly the defensive shell a real adapter needs — good judgment call, and the PR body documents why each exists.
  • Polygon→BBox conversion handles ADI's per-page unit split (inch for PDF/TIFF at 96 DPI, pixel for PNG/JPEG), and max(1, ...) guards rounding to zero-width boxes that would violate C-OCR-004.
  • Confidence clamped to [0,1] after averaging word confidences. Empty-words page defaults to 1.0 — defensible (no evidence of low quality).
  • time.monotonic() for latency — correct clock for intervals.

Tests

  • 22 unit tests, every contract clause C-OCR-001 through C-OCR-010 named in the test function (test_c00X_...). This naming convention makes T037's parametrised contract suite almost mechanical to write — the clause-to-test mapping is already explicit. Keep this convention for T034-T036.
  • HTTP error mapping covered by 10 dedicated tests including the two defensive paths (missing header, non-JSON body).
  • Live test correctly gated on IRIS_OCR_LIVE_ADI=1 with pytest.mark.skipif, self-contained fixture PDF, no committed credentials.

Supporting changes

  • py.typed on iris-engine — correct PEP 561 fix, and the PR body explains why it surfaced here. Without it every downstream adapter would see iris_engine as untyped under strict mypy.
  • .env.example OCR section documents all four new vars with the allowlist pragma on the placeholder key. Doc-and-implementation-together, again.
  • T033 acceptance text updated with explicit clause references — fifth instance of the inline spec-fix pattern.
  • mypy now checks 15 source files; iris_ocr_adi.* dropped out of the unused-overrides list, meaning the strict override is now actually exercising the adapter.

One gap to close: the poll loop has no deadline

_poll() is while True with no max duration. Each individual GET has the client timeout, but a stuck operation that keeps returning status: "running" forever (or a long-degraded ADI that answers every poll politely) hangs extract() indefinitely — the per-request timeout never fires because each request succeeds. In production behind the T031 fallback wrapper, this also means the fallback never gets a chance.

Suggested shape:

def __init__(self, ..., poll_interval: float = 1.0, max_poll_seconds: float = 300.0, ...):
    self._max_poll_seconds = max_poll_seconds

async def _poll(self, client, operation_url):
    deadline = time.monotonic() + self._max_poll_seconds
    while True:
        if time.monotonic() > deadline:
            raise OCRUnavailable(
                f"ADI analysis did not complete within {self._max_poll_seconds:.0f}s"
            )
        ...

Plus one test: poll body always running, max_poll_seconds=0.01, assert OCRUnavailable. Five minutes of work; my preference is folding it into this PR so the first adapter ships hang-proof, but an immediate follow-up commit works too.

Two minor notes (genuinely non-blocking)

  1. test_c006_error_does_not_contain_bytes uses try/except — if extract doesn't raise, the test silently passes. Wrap in pytest.raises(OCRUnsupportedContentType) and assert on the exception object so the no-raise path fails loudly.
  2. Adapter coverage isn't measured: [tool.coverage.run] source = ["iris_engine", "iris_config"] doesn't include iris_ocr_adi, so the 22 tests don't count toward the gate. best-practices.md section 3 says 80% on iris-engine and every adapter package. Suggest adding "iris_ocr_adi" (and successors as they land) to the source list — or switching to measuring packages/ wholesale once a few adapters exist. Fine as part of T034 or a tiny hygiene commit.

Looking ahead

T034 (Datalab) is next and can reuse this exact shape: injection seam, clause-named tests, gated live test, .env.example section. The PR body format here (clause checklist + "out of scope additions" table with reasons) is the best one yet — keep it.

- Add poll deadline (max_poll_seconds=300) to prevent infinite hang
- Fix test_c006 silent pass with pytest.raises
- Add iris_ocr_adi to coverage source; coverage now 97.76%
- Add tests/__init__.py to prevent same-name collision across adapter packages
- Add 3 new tests: poll deadline, poll timeout, running->succeeded transition
@likith1908

Copy link
Copy Markdown
Contributor Author

All review points addressed:

Poll deadline (required): max_poll_seconds=300.0 added to AdiOCREngine.__init__. _poll() checks deadline at the top of each iteration and raises OCRUnavailable if exceeded. New test: test_poll_deadline_exceeded_raises_unavailable with max_poll_seconds=0.0 confirms the path.

test_c006 silent pass (non-blocking): Replaced bare try/except with pytest.raises(OCRUnsupportedContentType) - a non-raising extract now fails loudly.

Coverage source (non-blocking): Added iris_ocr_adi to [tool.coverage.run] source. Three additional tests added (poll timeout, 4xx catch-all, running->succeeded transition). Coverage is now 97.76% (gate: 95%).

Bonus: Added tests/__init__.py to prevent pytest module name collision when multiple adapter packages have same-named test files (test_unit.py, test_live.py).

@anmolg1997 anmolg1997 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-approving on 2c4240d. All three review items closed:

  1. Poll deadlinemax_poll_seconds: float = 300.0 with the monotonic-deadline check at the top of the loop, exactly the suggested shape, plus test_poll_deadline_exceeded_raises_unavailable proving it fires on a perpetual running status.
  2. Bytes-leak test — now pytest.raises(...) with assertion on exc_info.value; the no-raise path fails loudly.
  3. Adapter coverageiris_ocr_adi added to [tool.coverage.run] source.

Bonus tests beyond the asks: poll-phase timeout, 4xx catch-all, and running→succeeded transition (the happy path through the poll loop's wait branch). 26 unit tests now. All 10 checks green. Ready to merge — and since #34/#35 will conflict on pyproject.toml coverage source and .env.example, merge this first and rebase those.

@likith1908 likith1908 merged commit 45fd430 into main Jun 12, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants