T037: Parametrised OCR contract suite#37
Conversation
anmolg1997
left a comment
There was a problem hiding this comment.
Approved on 33f0e54. This is the keystone of WS003 and it does exactly what the workstream needed — one parametrised suite proving all five adapters honour the same contract.
What I verified
Structure
@pytest.mark.parametrize("adapter_id", _ALL)across the clause functions, with three audience lists:_ALL(5 adapters),_CONTENT_VALIDATING(the 4 real ones, for C-OCR-007),_LOCAL(local + paddleocr, for C-OCR-LOCAL-001). Clean separation — each clause runs against exactly the adapters that can satisfy it.pytestmark = pytest.mark.contracttags the whole module. Thecontractmarker is registered in pyproject and is NOT in theaddoptsdeselect list, so these run in the defaultmake testlane — verified: 66 contract tests execute on every PR, not just on demand.- Each adapter built through its own established injection seam (httpx AsyncMock for ADI/Datalab, MagicMock pipeline for PaddleOCR, MagicMock pytesseract for local, pre-registered responses for in-memory). No network, no subprocess, no model load. Runs in 0.43s.
Coverage decisions, both correct
- C-OCR-007 excludes in-memory —
InMemoryOCREnginelooks up bydocument_idand never parses bytes, so it structurally cannot detect a malformed PDF. The exclusion was documented back in #32's clause table; consistent. - C-OCR-LOCAL-001 covers local + paddleocr — the two adapters that run entirely on-device.
Gates
- Contract suite: 66 passed, 1 skipped (C-OCR-011, deferred to T038).
- Full default suite: 368 passed, 1 skipped, 16 deselected.
- lint: 3 contracts kept, 0 broken. T037 flipped.
Two non-blocking notes
-
C-OCR-LOCAL-001 doesn't actually prove "no network".
test_c_ocr_local_001_no_outbound_networkrunsextract()and assertstotal_pages >= 1— it verifies extraction works, not that no socket was opened. The claim is true by construction (neither adapter imports httpx), but the test as written would still pass if someone later added a network call to the local adapter. A stronger version installs a socket guard for the duration:def test_c_ocr_local_001_no_outbound_network(adapter_id, monkeypatch): def _deny(*a, **k): raise AssertionError("local adapter attempted a network connection") monkeypatch.setattr(socket.socket, "connect", _deny) result = _run(_engine(adapter_id).extract(_CTX, _DOC_ID, _make_pdf(), "application/pdf")) assert result.total_pages >= 1
Then the test fails loudly the day a network call sneaks in — which is the whole point of an airgap clause. Worth a follow-up; not worth a round-trip now.
-
Doubled comment on line 123 —
api_key="test-key", # pragma: allowlist secret # pragma: allowlist secret. Harmless, just trim one. (_datalab_engineon line 157 has the single correct form.)
WS003 status
T030 through T037 all done. Remaining: T038 (OTEL spans on every adapter's extract, which un-skips C-OCR-011 here) and T039 (per-adapter READMEs — and the place to land the PaddleOCR airgap pre-seed instructions and the live-thread confirmation I asked for on #35).
A natural shape for the close-out: T038 wires the spans and flips the skip on test_c011_otel_span_carries_tenant_id in this very suite — so that test function is the acceptance target. T039 is pure docs. Could even be one PR since they're both cross-adapter sweeps.
Strong sprint — four adapters plus a unifying contract suite, all landed clean. Once #34→#37 merge in stack order, WS003 is effectively complete bar OTEL and docs.
33f0e54 to
3d7aa44
Compare
851653d to
ba6181f
Compare
3d7aa44 to
688774f
Compare
ba6181f to
98ee039
Compare
anmolg1997
left a comment
There was a problem hiding this comment.
Re-approving on 2938ddd. You picked up both non-blocking notes from my last review even though neither gated the merge:
- C-OCR-LOCAL-001 now actually proves no network —
monkeypatch.setattr(socket.socket, "connect", _deny)wraps the extract call, so the test fails loudly if local/paddleocr ever opens a socket. That's the real airgap guarantee, not just 'extraction works'. Verified: 66 contract tests pass with the guard in place. - Doubled
# pragma: allowlist secretremoved.
Full suite 368 passed, lint clean. This is the top of the stack — once #35 → #36 → #37 merge in order, WS003's adapter sprint is complete bar T038/T039.
Appreciate you tightening these without being asked to.
98ee039 to
2156122
Compare
2938ddd to
193ab70
Compare
047a1e5 to
5baccdf
Compare
tests/contract/test_ocr_contract.py: 66 tests + 1 skipped placeholder. C-OCR-001 through C-OCR-010 and C-OCR-LOCAL-001 verified against all five adapters (in-memory, adi, datalab, paddleocr, local) with mocked externals. C-OCR-007 excludes in-memory: documented gap from PR#32 - fake engine returns pre-registered results and cannot detect malformed PDFs without a parser. C-OCR-011 skipped pending T038 (OTEL spans).
- C-OCR-LOCAL-001: add monkeypatch socket.socket.connect guard so the test fails loudly if a local adapter ever opens a network connection, not just when extraction returns pages. - Lines 123 and 258: remove doubled # pragma: allowlist secret comment.
193ab70 to
b6c54b9
Compare
Summary
Implements T037: a single parametrised contract suite that verifies every OCR adapter against every contract clause in one run. Previously each adapter had its own per-clause unit tests; this suite adds a cross-adapter layer that catches any adapter drifting from the shared protocol.
tests/contract/test_ocr_contract.py: 67 tests (66 pass, 1 skipped). All five adapters (in-memory, adi, datalab, paddleocr, local) are exercised against C-OCR-001 through C-OCR-010 and C-OCR-LOCAL-001 using injected mocks - no network calls, no subprocess calls, runs in under 1 second undermake test.Task reference
T037003-ocr-adapter-setAcceptance criteria
T037
test_c001_id_is_correct,test_c001_version_is_semver- all 5 adapterstest_c002_pdf_extracts_markdown- all 5 adapterstest_c003_multipage_ordering,test_c003_page_numbers_start_at_one- all 5 adapterstest_c004_bbox_non_negative_xy_positive_wh- all 5 adapterstest_c005_confidence_in_range- all 5 adapterstest_c006_unsupported_content_type_raises,test_c006_error_does_not_contain_bytes- all 5 adapterstest_c007_malformed_pdf_raises_malformed- 4 adapters (adi, datalab, paddleocr, local); in-memory excluded - see Notestest_c008_empty_bytes_raises_malformed- all 5 adapterstest_c009_adapter_id_in_result- all 5 adapterstest_c010_png_input_accepted- all 5 adapterstest_c_ocr_local_001_no_outbound_network- local, paddleocrtest_c011_otel_span_carries_tenant_id- skipped, deferred to T038PR checklist:
tasks.mdentry is satisfied.docs-ciworkflow passes (markdown lint + tasks structural check).Notes for the reviewer
Mock injection strategy
Each adapter is built using the same injection seam established in its own package:
_http_client: httpx.AsyncClientAsyncMockreturning a canned succeeded poll response_http_client: httpx.AsyncClientAsyncMockreturning a canned complete poll response_pipeline: AnyMagicMockwithpredict()returning_MockResultobjects_pytesseract: AnyMagicMockwithimage_to_data()returning word data dictsresponses: dict[UUID, OCRResult]OCRResultkeyed on fixed_DOC_IDC-OCR-007 exclusion for in-memory
InMemoryOCREnginecannot detect malformed PDFs because it never parses the document bytes - it looks up the pre-registered response bydocument_idand returns it regardless of content. This is not a new finding: PR #32 (T032) explicitly documented it in a clause coverage table:The four real adapters all pass C-OCR-007. The HTTP adapters (adi, datalab) use separate
_malformed_enginebuilders that configure their mocks to return error responses (ADI:InvalidContentpoll failure; Datalab: HTTP 400), matching the server-side validation path their real implementations use.C-OCR-007 for HTTP adapters needs error-mode mocks
The standard
_adi_engine()and_datalab_engine()builders return success responses unconditionally. For C-OCR-007,_malformed_engine(adapter_id)selects a separate builder (_adi_engine_malformed,_datalab_engine_malformed) that configures the mock to return the appropriate error response. Inference adapters (paddleocr, local) validate locally viapymupdf.open(), so the standard engine handles C-OCR-007 automatically when passedb"not a valid pdf".Factory pattern
Tests use a
_engine(adapter_id, pages=N, text=...)factory instead of pytest fixtures. This keeps each test self-contained and avoids fixture parametrize indirection. Multi-page tests (C-OCR-003) call_engine(adapter_id, pages=3)which configures the mock to return 3 pages viaside_effectfor inference adapters and\n\n---\n\nseparator in the Datalab markdown.Out of scope additions (beyond T037 acceptance)
None. This PR adds only
tests/contract/test_ocr_contract.pyand flips T037 intasks.md.Rebase note
This PR is stacked on #36. Once #36 merges, rebase onto
mainbefore merging this one.