Skip to content

Implement Datalab OCR adapter with httpx client and contract test suite (T034)#34

Merged
likith1908 merged 3 commits into
mainfrom
T034-datalab-adapter
Jun 16, 2026
Merged

Implement Datalab OCR adapter with httpx client and contract test suite (T034)#34
likith1908 merged 3 commits into
mainfrom
T034-datalab-adapter

Conversation

@likith1908

@likith1908 likith1908 commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Summary

Implements T034: the Datalab OCR adapter. All Datalab HTTP traffic goes through raw httpx.AsyncClient with no Datalab SDK runtime dependency, consistent with the adapter pattern established in T033.

  • packages/iris-adapters/ocr-datalab/src/iris_ocr_datalab/client.py: DatalabOCREngine - async POST to /api/v1/convert, polling loop until status == true, page splitting on \\n\\n---\\n\\n aligned to page_count, full typed error mapping; poll deadline guard via max_poll_seconds=300.0
  • packages/iris-adapters/ocr-datalab/src/iris_ocr_datalab/__init__.py: re-exports DatalabOCREngine
  • packages/iris-adapters/ocr-datalab/pyproject.toml: dependencies (httpx>=0.27, iris-engine)
  • packages/iris-adapters/ocr-datalab/tests/test_unit.py: 31 unit tests, all Datalab HTTP calls mocked; covers C-OCR-001 through C-OCR-010 plus full HTTP error mapping and page alignment edge cases
  • packages/iris-adapters/ocr-datalab/tests/test_live.py: C-OCR-LIVE-001, gated on IRIS_OCR_LIVE_DATALAB=1
  • packages/iris-adapters/ocr-datalab/tests/__init__.py: empty package marker, required for --import-mode=importlib collision fix
  • pyproject.toml: adds --import-mode=importlib and pythonpath to [tool.pytest.ini_options]; adds iris_ocr_datalab to [tool.coverage.run] source
  • tasks/003-ocr-adapter-set/tasks.md: T034 marked complete

Task reference

  • Task ID: T034
  • Workstream: 003-ocr-adapter-set

Acceptance criteria

T034

  • C-OCR-001: test_c001_id_is_datalab, test_c001_version_is_semver
  • C-OCR-002: test_c002_pdf_extracts_markdown
  • C-OCR-003: test_c003_multipage_ordering
  • C-OCR-004: test_c004_bbox_always_empty_list (Datalab convert does not return polygon data - see open question)
  • C-OCR-005: test_c005_confidence_in_range, test_c005_missing_quality_score_defaults_to_0
  • C-OCR-006: test_c006_unsupported_content_type_raises, test_c006_error_does_not_contain_bytes
  • C-OCR-007: test_c007_failed_status_raises_malformed
  • C-OCR-008: test_c008_empty_bytes_raises_malformed
  • C-OCR-009: test_c009_adapter_id_in_result
  • C-OCR-010: test_c010_png_input_accepted
  • HTTP error mapping (401, 403, 429, 5xx, timeout): covered by six dedicated tests
  • Poll deadline: test_poll_deadline_exceeded_raises_unavailable with max_poll_seconds=0.0
  • Live clause test_live.py runs under IRIS_OCR_LIVE_DATALAB=1
  • C-OCR-011 (OTEL span): deferred to T038

PR checklist:

  • Every acceptance criterion in the task's tasks.md entry is satisfied.
  • docs-ci workflow passes (markdown lint + tasks structural check).
  • No em-dashes introduced in new prose.
  • All internal links resolve.
  • No secrets, credentials, or personal names introduced.

Live test result

Run against the real Datalab endpoint with a 3-page PDF.

adapter_id : datalab
total_pages: 3
latency_ms : 18 432
confidence : 0.0  (parse_quality_score is None - see open question 1 below)
pages[0]   : markdown present, bboxes=[]

Authentication, polling, page splitting, and markdown extraction all work correctly. Bboxes are empty because the convert endpoint does not return polygon data (see open question 2).

Notes for the reviewer

No Datalab SDK

httpx.AsyncClient is used directly. The Datalab Python SDK is not a runtime dependency. The _http_client injection seam makes the adapter fully testable with a mock client.

Datalab async flow

Datalab analysis is a two-step operation: POST to /api/v1/convert returns {"success": true, "request_check_url": "..."}; GET that URL until status == true. The polling loop is in _poll(). poll_interval defaults to 3.0 s in production but is set to 0.0 in tests. A deadline = time.monotonic() + max_poll_seconds check at the top of each iteration prevents the loop hanging if Datalab keeps returning status: false indefinitely.

Page splitting

Datalab returns all pages as a single markdown string separated by \\n\\n---\\n\\n. The client splits on that separator and aligns the resulting list to page_count from the response body: if there are more splits than pages the list is trimmed; if there are fewer it is padded with empty strings. Both cases are covered by test_page_count_fewer_than_split_trims_pages and test_page_count_more_than_split_pads_pages.

Out of scope additions (beyond T034 acceptance)

Addition Reason
--import-mode=importlib in pytest config Both ocr-adi and ocr-datalab have a file named test_unit.py. Pytest's default mode resolves both to tests.test_unit - a collision. Adding __init__.py to the tests/ directories alone does not fix this because the import root is still the hyphenated parent directory, which Python cannot use as a package name. Importlib mode assigns each test file a unique identity based on its full filesystem path.
pythonpath = ["tests/001-project-scaffold"] Importlib mode stops pytest from auto-adding the test directory to sys.path, which breaks sibling imports in the scaffold tests (from test_t001_workspace import ...). Adding the directory to pythonpath restores that behaviour without reverting the importlib fix.

Rebase Note

This PR is stacked on #33. Once #33 merges, rebase onto main before merging this one.

@likith1908

Copy link
Copy Markdown
Contributor Author

Question 1: confidence field type

Datalab's convert endpoint always returns parse_quality_score: null across all modes (fast, balanced, accurate). There is no quality signal from this endpoint. The adapter returns 0.0 rather than a fabricated value.

0.0 is honest but ambiguous: callers cannot distinguish "low confidence" from "no signal available". A cleaner contract would be OCRPageResult.confidence: Optional[float] where Datalab returns None and ADI returns a real float. This touches InMemoryOCREngine and the selector. Deferring to T037 (contract suite) may be cleaner. Happy to do it here if preferred.

Question 2: bounding boxes

The Datalab convert endpoint does not return word-level polygon data; bboxes is always []. Contract clause C-OCR-004 requires non-empty bboxes for a real document. Should the contract allow empty bboxes for adapters that do not provide them, or should C-OCR-004 be skipped for this adapter with a documented note? Flagging for T037.

@anmolg1997 anmolg1997 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One change needed before merge, and it's the exact item we just closed on #33: DatalabOCREngine._poll() has no deadline.

The blocker

_poll() is while True with no max duration — same shape the ADI client had before commit 2c4240d added max_poll_seconds. A Datalab job stuck in processing (or a degraded API that answers every poll without ever reaching complete/failed) hangs extract() forever; each GET succeeds so the per-request timeout never fires, and the T031 fallback never gets its chance.

This branch was cut before the ADI fix landed, so it predates the standard — but the standard is now set. Please port the same treatment:

def __init__(self, api_key, mode="balanced", *, poll_interval=2.0, max_poll_seconds=300.0, _http_client=None):
    self._max_poll_seconds = max_poll_seconds

async def _poll(self, client, check_url):
    deadline = time.monotonic() + self._max_poll_seconds
    while True:
        if time.monotonic() > deadline:
            raise OCRUnavailable(
                f"Datalab conversion did not complete within {self._max_poll_seconds:.0f}s"
            )
        await asyncio.sleep(self._poll_interval)
        ...

Plus the matching test (perpetual processing status, max_poll_seconds=0.0, assert OCRUnavailable), mirroring test_poll_deadline_exceeded_raises_unavailable from the ADI suite.

Going forward: treat max_poll_seconds as part of the adapter pattern for every polling adapter (PaddleOCR/local won't poll HTTP, but any future one that does). Worth adding a line to the contract doc's implementation notes so T034+ adapters inherit it by spec rather than by review.

Everything else is approved-quality

Adapter (iris_ocr_datalab/client.py)

  • Same proven shape as ADI: injection seam, typed-error taxonomy, defensive non-JSON and missing-request_check_url handling.
  • Datalab-specific mappings are right: 413 → OCRDocumentTooLarge (Datalab signals size at HTTP level, unlike ADI's app-level code), 400 → OCRMalformedDocument, 529 → OCRUnavailable (Datalab's overloaded status).
  • complete-with-success: false handled separately from status: failed — covers Datalab's two distinct failure signalling paths. Both tested.
  • Honest documentation of the two contract compromises in the module docstring: bboxes always empty (convert endpoint returns markdown only) and confidence derived from parse_quality_score normalised from the 0-5 scale. The tasks.md acceptance note records the bbox limitation too — future consumers won't be surprised.
  • Page splitting on marker's --- delimiters with pad/trim reconciliation against page_count, both directions tested.

Tests — 31 unit tests, clause-named, including the C-OCR-004 adaptation (test_c004_bboxes_are_empty_list — correct reading of the clause given the endpoint's capability) and test_c005_missing_quality_score_defaults_to_0. The 0.0-default for a missing quality score is the conservative choice (unknown quality routes to human review rather than sailing through); note it's the opposite default from ADI's no-words case (1.0). Both defensible in context; no change needed.

Infra

  • --import-mode=importlib + tests __init__.py files — correct fix for the test_unit.py module-name collision across adapter packages. This was going to bite the moment two adapters existed; good that it's solved in the PR that creates the second adapter.
  • iris_ocr_datalab added to coverage source. Pattern holding.

Stacking note: base is T033-adi-adapter, so after #33 squash-merges, rebase this onto main (the ADI files in this diff drop out) and the deadline fix can ride the same push. CI will then run against the true diff.

Re-review will be quick — the substance is done.

- Add 4 unit tests to reach 97% branch coverage on iris_ocr_datalab
- Add --import-mode=importlib to resolve test_unit.py name collision between
  ocr-adi and ocr-datalab (hyphenated parent dirs prevent __init__.py-only fix)
- Add pythonpath for scaffold tests that do sibling imports broken by importlib mode
@likith1908 likith1908 force-pushed the T034-datalab-adapter branch from 92d74f3 to 71aaac3 Compare June 12, 2026 06:22
@likith1908 likith1908 changed the base branch from T033-adi-adapter to main June 12, 2026 06:23
_poll had no upper bound so a stuck Datalab job would loop forever.
Added max_poll_seconds=300.0 constructor param and a deadline check at
the top of each poll iteration; raises OCRUnavailable once exceeded.
Test test_poll_deadline_exceeded_raises_unavailable covers the new path.
@likith1908 likith1908 marked this pull request as ready for review June 12, 2026 06:30

@anmolg1997 anmolg1997 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved on 350ae9e. The one blocker from my prior review is closed.

Poll deadlinemax_poll_seconds: float = 300.0 with the monotonic deadline check at the top of _poll(), matching the ADI treatment exactly, plus test_poll_deadline_exceeded_raises_unavailable. The unbounded loop is gone; a Datalab job stuck in processing now surfaces OCRUnavailable and lets the T031 fallback fire.

Verified locally on the rebased branch:

  • Datalab suite: 32 passed, 1 skipped (live, gated).
  • Full default suite: 238 passed, 2 skipped, 12 deselected — no module collision now that --import-mode=importlib is in and the tests carry __init__.py consistently.
  • make lint: 3 contracts kept, 0 broken.
  • T034 flipped to [x] with the bbox-limitation note in the acceptance text.

Everything from the original substance review still holds (typed-error taxonomy, 413→TooLarge, 400→Malformed, 529→Unavailable, complete-with-success:false handling, page pad/trim). Ready to merge — this is the bottom of the stack, so it lands first and #35/#36/#37 rebase onto the new main behind it.

@likith1908 likith1908 merged commit 8584c89 into main Jun 16, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants