Implement ADI OCR adapter, py.typed marker, .env.example OCR vars (T033)#33
Conversation
abac774 to
3df277b
Compare
|
@anmolg1997 , Please review this PR. |
anmolg1997
left a comment
There was a problem hiding this comment.
Approved — first real adapter and it sets a high bar. One robustness gap to close (poll deadline, below); fine to fold into this PR or land as an immediate follow-up commit, your call.
What I verified
Adapter implementation (iris_ocr_adi/client.py)
- No Azure SDK — raw
httpx.AsyncClientper plan.md key choice 3. The_http_clientinjection seam makes the whole thing testable without patching. Right design. - Two-step ADI flow modeled correctly: POST → 202 +
Operation-Location, then GET-poll untilsucceeded/failed.poll_interval=0.0in tests so the suite stays fast (full run is 1.1s). - Error taxonomy mapping is complete and correct:
- HTTP 401/403 →
OCRAuthenticationFailed, 429 →OCRRateLimited, 5xx →OCRUnavailable, timeout/connect →OCRUnavailable - ADI app-level
InvalidContent→OCRMalformedDocument,ContentTooLarge→OCRDocumentTooLarge, genericfailed→OCRUnavailable(so the fallback engine from T031 can fire) - Missing
Operation-Locationand non-JSON poll bodies → typedOCRUnavailableinstead of rawAttributeError/JSONDecodeError. These "out of scope additions" are exactly the defensive shell a real adapter needs — good judgment call, and the PR body documents why each exists.
- HTTP 401/403 →
- Polygon→BBox conversion handles ADI's per-page unit split (inch for PDF/TIFF at 96 DPI, pixel for PNG/JPEG), and
max(1, ...)guards rounding to zero-width boxes that would violate C-OCR-004. - Confidence clamped to [0,1] after averaging word confidences. Empty-words page defaults to 1.0 — defensible (no evidence of low quality).
time.monotonic()for latency — correct clock for intervals.
Tests
- 22 unit tests, every contract clause C-OCR-001 through C-OCR-010 named in the test function (
test_c00X_...). This naming convention makes T037's parametrised contract suite almost mechanical to write — the clause-to-test mapping is already explicit. Keep this convention for T034-T036. - HTTP error mapping covered by 10 dedicated tests including the two defensive paths (missing header, non-JSON body).
- Live test correctly gated on
IRIS_OCR_LIVE_ADI=1withpytest.mark.skipif, self-contained fixture PDF, no committed credentials.
Supporting changes
py.typedon iris-engine — correct PEP 561 fix, and the PR body explains why it surfaced here. Without it every downstream adapter would seeiris_engineas untyped under strict mypy..env.exampleOCR section documents all four new vars with the allowlist pragma on the placeholder key. Doc-and-implementation-together, again.- T033 acceptance text updated with explicit clause references — fifth instance of the inline spec-fix pattern.
- mypy now checks 15 source files;
iris_ocr_adi.*dropped out of the unused-overrides list, meaning the strict override is now actually exercising the adapter.
One gap to close: the poll loop has no deadline
_poll() is while True with no max duration. Each individual GET has the client timeout, but a stuck operation that keeps returning status: "running" forever (or a long-degraded ADI that answers every poll politely) hangs extract() indefinitely — the per-request timeout never fires because each request succeeds. In production behind the T031 fallback wrapper, this also means the fallback never gets a chance.
Suggested shape:
def __init__(self, ..., poll_interval: float = 1.0, max_poll_seconds: float = 300.0, ...):
self._max_poll_seconds = max_poll_seconds
async def _poll(self, client, operation_url):
deadline = time.monotonic() + self._max_poll_seconds
while True:
if time.monotonic() > deadline:
raise OCRUnavailable(
f"ADI analysis did not complete within {self._max_poll_seconds:.0f}s"
)
...Plus one test: poll body always running, max_poll_seconds=0.01, assert OCRUnavailable. Five minutes of work; my preference is folding it into this PR so the first adapter ships hang-proof, but an immediate follow-up commit works too.
Two minor notes (genuinely non-blocking)
test_c006_error_does_not_contain_bytesusestry/except— ifextractdoesn't raise, the test silently passes. Wrap inpytest.raises(OCRUnsupportedContentType)and assert on the exception object so the no-raise path fails loudly.- Adapter coverage isn't measured:
[tool.coverage.run] source = ["iris_engine", "iris_config"]doesn't includeiris_ocr_adi, so the 22 tests don't count toward the gate.best-practices.mdsection 3 says 80% on iris-engine and every adapter package. Suggest adding"iris_ocr_adi"(and successors as they land) to the source list — or switching to measuringpackages/wholesale once a few adapters exist. Fine as part of T034 or a tiny hygiene commit.
Looking ahead
T034 (Datalab) is next and can reuse this exact shape: injection seam, clause-named tests, gated live test, .env.example section. The PR body format here (clause checklist + "out of scope additions" table with reasons) is the best one yet — keep it.
- Add poll deadline (max_poll_seconds=300) to prevent infinite hang - Fix test_c006 silent pass with pytest.raises - Add iris_ocr_adi to coverage source; coverage now 97.76% - Add tests/__init__.py to prevent same-name collision across adapter packages - Add 3 new tests: poll deadline, poll timeout, running->succeeded transition
|
All review points addressed: Poll deadline (required): test_c006 silent pass (non-blocking): Replaced bare Coverage source (non-blocking): Added Bonus: Added |
anmolg1997
left a comment
There was a problem hiding this comment.
Re-approving on 2c4240d. All three review items closed:
- Poll deadline —
max_poll_seconds: float = 300.0with the monotonic-deadline check at the top of the loop, exactly the suggested shape, plustest_poll_deadline_exceeded_raises_unavailableproving it fires on a perpetualrunningstatus. - Bytes-leak test — now
pytest.raises(...)with assertion onexc_info.value; the no-raise path fails loudly. - Adapter coverage —
iris_ocr_adiadded to[tool.coverage.run] source.
Bonus tests beyond the asks: poll-phase timeout, 4xx catch-all, and running→succeeded transition (the happy path through the poll loop's wait branch). 26 unit tests now. All 10 checks green. Ready to merge — and since #34/#35 will conflict on pyproject.toml coverage source and .env.example, merge this first and rebase those.

Summary
Implements T033: the Azure Document Intelligence OCR adapter. All ADI HTTP traffic goes through raw `httpx.AsyncClient` with no Azure SDK dependency, consistent with the decision recorded in plan.md key choice 3.
Task reference
Acceptance criteria
T033
PR checklist:
Notes for the reviewer
No Azure SDK
`httpx.AsyncClient` is used directly. The Azure SDK (`azure-ai-documentintelligence`) is not a dependency. This keeps the package lightweight and testable with a mock client injected via `_http_client`. Token-based auth (Entra) is deferred to the OIDC workstream per plan.md key choice 3.
ADI async flow
ADI analysis is a two-step operation: POST returns HTTP 202 and an `Operation-Location` URL; GET that URL until `status == "succeeded"`. The polling loop is in `_poll()`. `poll_interval` defaults to 1.0 s in production but is set to 0.0 in tests so they do not sleep. A `deadline = time.monotonic() + max_poll_seconds` check at the top of each poll iteration ensures the loop cannot hang indefinitely if ADI keeps returning `"running"`.
Polygon-to-BBox unit conversion
ADI returns polygons (8 floats: x0,y0 to x3,y3 going clockwise). The unit is per-page: `"inch"` for PDFs and TIFFs, `"pixel"` for PNG and JPEG. `_polygon_to_bbox()` scales inch coordinates to pixels at 96 DPI (the standard screen approximation). `max(1, ...)` ensures zero-width/height boxes - which would fail C-OCR-004 - never occur from rounding.
Out of scope additions (beyond T033 acceptance)
The following were added beyond what T033 explicitly required. They are included because they reflect real ADI behaviour and prevent silent swallowing of typed errors:
C-OCR-006 bytes-in-error test
C-OCR-006 in the contract explicitly states "the error does not leak the document bytes in its message". `test_c006_error_does_not_contain_bytes` wraps the call in `pytest.raises(OCRUnsupportedContentType)` so a non-raising `extract` fails loudly. The implementation raises `OCRUnsupportedContentType(f"ADI adapter does not support content type {content_type!r}")` - the content type string, not the bytes.
Coverage
`iris_ocr_adi` is included in `[tool.coverage.run] source`. Current coverage: 97.76% (gate: 95%). Three additional tests were added after the initial review to reach this level: `test_poll_timeout_raises_unavailable`, `test_http_4xx_catch_all_raises_unavailable`, `test_poll_running_then_succeeded`.