Implement Datalab OCR adapter with httpx client and contract test suite (T034)#34
Conversation
Question 1: confidence field typeDatalab's convert endpoint always returns
Question 2: bounding boxesThe Datalab convert endpoint does not return word-level polygon data; |
anmolg1997
left a comment
There was a problem hiding this comment.
One change needed before merge, and it's the exact item we just closed on #33: DatalabOCREngine._poll() has no deadline.
The blocker
_poll() is while True with no max duration — same shape the ADI client had before commit 2c4240d added max_poll_seconds. A Datalab job stuck in processing (or a degraded API that answers every poll without ever reaching complete/failed) hangs extract() forever; each GET succeeds so the per-request timeout never fires, and the T031 fallback never gets its chance.
This branch was cut before the ADI fix landed, so it predates the standard — but the standard is now set. Please port the same treatment:
def __init__(self, api_key, mode="balanced", *, poll_interval=2.0, max_poll_seconds=300.0, _http_client=None):
self._max_poll_seconds = max_poll_seconds
async def _poll(self, client, check_url):
deadline = time.monotonic() + self._max_poll_seconds
while True:
if time.monotonic() > deadline:
raise OCRUnavailable(
f"Datalab conversion did not complete within {self._max_poll_seconds:.0f}s"
)
await asyncio.sleep(self._poll_interval)
...Plus the matching test (perpetual processing status, max_poll_seconds=0.0, assert OCRUnavailable), mirroring test_poll_deadline_exceeded_raises_unavailable from the ADI suite.
Going forward: treat max_poll_seconds as part of the adapter pattern for every polling adapter (PaddleOCR/local won't poll HTTP, but any future one that does). Worth adding a line to the contract doc's implementation notes so T034+ adapters inherit it by spec rather than by review.
Everything else is approved-quality
Adapter (iris_ocr_datalab/client.py)
- Same proven shape as ADI: injection seam, typed-error taxonomy, defensive non-JSON and missing-
request_check_urlhandling. - Datalab-specific mappings are right: 413 →
OCRDocumentTooLarge(Datalab signals size at HTTP level, unlike ADI's app-level code), 400 →OCRMalformedDocument, 529 →OCRUnavailable(Datalab's overloaded status). complete-with-success: falsehandled separately fromstatus: failed— covers Datalab's two distinct failure signalling paths. Both tested.- Honest documentation of the two contract compromises in the module docstring: bboxes always empty (convert endpoint returns markdown only) and confidence derived from
parse_quality_scorenormalised from the 0-5 scale. The tasks.md acceptance note records the bbox limitation too — future consumers won't be surprised. - Page splitting on marker's
---delimiters with pad/trim reconciliation againstpage_count, both directions tested.
Tests — 31 unit tests, clause-named, including the C-OCR-004 adaptation (test_c004_bboxes_are_empty_list — correct reading of the clause given the endpoint's capability) and test_c005_missing_quality_score_defaults_to_0. The 0.0-default for a missing quality score is the conservative choice (unknown quality routes to human review rather than sailing through); note it's the opposite default from ADI's no-words case (1.0). Both defensible in context; no change needed.
Infra
--import-mode=importlib+ tests__init__.pyfiles — correct fix for thetest_unit.pymodule-name collision across adapter packages. This was going to bite the moment two adapters existed; good that it's solved in the PR that creates the second adapter.iris_ocr_datalabadded to coverage source. Pattern holding.
Stacking note: base is T033-adi-adapter, so after #33 squash-merges, rebase this onto main (the ADI files in this diff drop out) and the deadline fix can ride the same push. CI will then run against the true diff.
Re-review will be quick — the substance is done.
- Add 4 unit tests to reach 97% branch coverage on iris_ocr_datalab - Add --import-mode=importlib to resolve test_unit.py name collision between ocr-adi and ocr-datalab (hyphenated parent dirs prevent __init__.py-only fix) - Add pythonpath for scaffold tests that do sibling imports broken by importlib mode
92d74f3 to
71aaac3
Compare
_poll had no upper bound so a stuck Datalab job would loop forever. Added max_poll_seconds=300.0 constructor param and a deadline check at the top of each poll iteration; raises OCRUnavailable once exceeded. Test test_poll_deadline_exceeded_raises_unavailable covers the new path.
anmolg1997
left a comment
There was a problem hiding this comment.
Approved on 350ae9e. The one blocker from my prior review is closed.
Poll deadline — max_poll_seconds: float = 300.0 with the monotonic deadline check at the top of _poll(), matching the ADI treatment exactly, plus test_poll_deadline_exceeded_raises_unavailable. The unbounded loop is gone; a Datalab job stuck in processing now surfaces OCRUnavailable and lets the T031 fallback fire.
Verified locally on the rebased branch:
- Datalab suite: 32 passed, 1 skipped (live, gated).
- Full default suite: 238 passed, 2 skipped, 12 deselected — no module collision now that
--import-mode=importlibis in and the tests carry__init__.pyconsistently. make lint: 3 contracts kept, 0 broken.- T034 flipped to
[x]with the bbox-limitation note in the acceptance text.
Everything from the original substance review still holds (typed-error taxonomy, 413→TooLarge, 400→Malformed, 529→Unavailable, complete-with-success:false handling, page pad/trim). Ready to merge — this is the bottom of the stack, so it lands first and #35/#36/#37 rebase onto the new main behind it.
Summary
Implements T034: the Datalab OCR adapter. All Datalab HTTP traffic goes through raw
httpx.AsyncClientwith no Datalab SDK runtime dependency, consistent with the adapter pattern established in T033.packages/iris-adapters/ocr-datalab/src/iris_ocr_datalab/client.py:DatalabOCREngine- async POST to/api/v1/convert, polling loop untilstatus == true, page splitting on\\n\\n---\\n\\naligned topage_count, full typed error mapping; poll deadline guard viamax_poll_seconds=300.0packages/iris-adapters/ocr-datalab/src/iris_ocr_datalab/__init__.py: re-exportsDatalabOCREnginepackages/iris-adapters/ocr-datalab/pyproject.toml: dependencies (httpx>=0.27,iris-engine)packages/iris-adapters/ocr-datalab/tests/test_unit.py: 31 unit tests, all Datalab HTTP calls mocked; covers C-OCR-001 through C-OCR-010 plus full HTTP error mapping and page alignment edge casespackages/iris-adapters/ocr-datalab/tests/test_live.py: C-OCR-LIVE-001, gated onIRIS_OCR_LIVE_DATALAB=1packages/iris-adapters/ocr-datalab/tests/__init__.py: empty package marker, required for--import-mode=importlibcollision fixpyproject.toml: adds--import-mode=importlibandpythonpathto[tool.pytest.ini_options]; addsiris_ocr_datalabto[tool.coverage.run] sourcetasks/003-ocr-adapter-set/tasks.md: T034 marked completeTask reference
T034003-ocr-adapter-setAcceptance criteria
T034
test_c001_id_is_datalab,test_c001_version_is_semvertest_c002_pdf_extracts_markdowntest_c003_multipage_orderingtest_c004_bbox_always_empty_list(Datalab convert does not return polygon data - see open question)test_c005_confidence_in_range,test_c005_missing_quality_score_defaults_to_0test_c006_unsupported_content_type_raises,test_c006_error_does_not_contain_bytestest_c007_failed_status_raises_malformedtest_c008_empty_bytes_raises_malformedtest_c009_adapter_id_in_resulttest_c010_png_input_acceptedtest_poll_deadline_exceeded_raises_unavailablewithmax_poll_seconds=0.0test_live.pyruns underIRIS_OCR_LIVE_DATALAB=1PR checklist:
tasks.mdentry is satisfied.docs-ciworkflow passes (markdown lint + tasks structural check).Live test result
Run against the real Datalab endpoint with a 3-page PDF.
Authentication, polling, page splitting, and markdown extraction all work correctly. Bboxes are empty because the convert endpoint does not return polygon data (see open question 2).
Notes for the reviewer
No Datalab SDK
httpx.AsyncClientis used directly. The Datalab Python SDK is not a runtime dependency. The_http_clientinjection seam makes the adapter fully testable with a mock client.Datalab async flow
Datalab analysis is a two-step operation: POST to
/api/v1/convertreturns{"success": true, "request_check_url": "..."}; GET that URL untilstatus == true. The polling loop is in_poll().poll_intervaldefaults to 3.0 s in production but is set to 0.0 in tests. Adeadline = time.monotonic() + max_poll_secondscheck at the top of each iteration prevents the loop hanging if Datalab keeps returningstatus: falseindefinitely.Page splitting
Datalab returns all pages as a single markdown string separated by
\\n\\n---\\n\\n. The client splits on that separator and aligns the resulting list topage_countfrom the response body: if there are more splits than pages the list is trimmed; if there are fewer it is padded with empty strings. Both cases are covered bytest_page_count_fewer_than_split_trims_pagesandtest_page_count_more_than_split_pads_pages.Out of scope additions (beyond T034 acceptance)
--import-mode=importlibin pytest configocr-adiandocr-datalabhave a file namedtest_unit.py. Pytest's default mode resolves both totests.test_unit- a collision. Adding__init__.pyto thetests/directories alone does not fix this because the import root is still the hyphenated parent directory, which Python cannot use as a package name. Importlib mode assigns each test file a unique identity based on its full filesystem path.pythonpath = ["tests/001-project-scaffold"]sys.path, which breaks sibling imports in the scaffold tests (from test_t001_workspace import ...). Adding the directory topythonpathrestores that behaviour without reverting the importlib fix.Rebase Note
This PR is stacked on #33. Once #33 merges, rebase onto
mainbefore merging this one.