Skip to content

T037: Parametrised OCR contract suite#37

Merged
likith1908 merged 2 commits into
mainfrom
T037-ocr-contract-suite
Jun 17, 2026
Merged

T037: Parametrised OCR contract suite#37
likith1908 merged 2 commits into
mainfrom
T037-ocr-contract-suite

Conversation

@likith1908

@likith1908 likith1908 commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Summary

Implements T037: a single parametrised contract suite that verifies every OCR adapter against every contract clause in one run. Previously each adapter had its own per-clause unit tests; this suite adds a cross-adapter layer that catches any adapter drifting from the shared protocol.

  • tests/contract/test_ocr_contract.py: 67 tests (66 pass, 1 skipped). All five adapters (in-memory, adi, datalab, paddleocr, local) are exercised against C-OCR-001 through C-OCR-010 and C-OCR-LOCAL-001 using injected mocks - no network calls, no subprocess calls, runs in under 1 second under make test.

Task reference

  • Task ID: T037
  • Workstream: 003-ocr-adapter-set

Acceptance criteria

T037

  • C-OCR-001: test_c001_id_is_correct, test_c001_version_is_semver - all 5 adapters
  • C-OCR-002: test_c002_pdf_extracts_markdown - all 5 adapters
  • C-OCR-003: test_c003_multipage_ordering, test_c003_page_numbers_start_at_one - all 5 adapters
  • C-OCR-004: test_c004_bbox_non_negative_xy_positive_wh - all 5 adapters
  • C-OCR-005: test_c005_confidence_in_range - all 5 adapters
  • C-OCR-006: test_c006_unsupported_content_type_raises, test_c006_error_does_not_contain_bytes - all 5 adapters
  • C-OCR-007: test_c007_malformed_pdf_raises_malformed - 4 adapters (adi, datalab, paddleocr, local); in-memory excluded - see Notes
  • C-OCR-008: test_c008_empty_bytes_raises_malformed - all 5 adapters
  • C-OCR-009: test_c009_adapter_id_in_result - all 5 adapters
  • C-OCR-010: test_c010_png_input_accepted - all 5 adapters
  • C-OCR-LOCAL-001: test_c_ocr_local_001_no_outbound_network - local, paddleocr
  • C-OCR-011 (OTEL span): test_c011_otel_span_carries_tenant_id - skipped, deferred to T038

PR checklist:

  • Every acceptance criterion in the task's tasks.md entry is satisfied.
  • docs-ci workflow passes (markdown lint + tasks structural check).
  • No em-dashes introduced in new prose.
  • All internal links resolve.
  • No secrets, credentials, or personal names introduced.

Notes for the reviewer

Mock injection strategy

Each adapter is built using the same injection seam established in its own package:

Adapter Seam How mocked
adi _http_client: httpx.AsyncClient AsyncMock returning a canned succeeded poll response
datalab _http_client: httpx.AsyncClient AsyncMock returning a canned complete poll response
paddleocr _pipeline: Any MagicMock with predict() returning _MockResult objects
local _pytesseract: Any MagicMock with image_to_data() returning word data dicts
in-memory responses: dict[UUID, OCRResult] Pre-registered OCRResult keyed on fixed _DOC_ID

C-OCR-007 exclusion for in-memory

InMemoryOCREngine cannot detect malformed PDFs because it never parses the document bytes - it looks up the pre-registered response by document_id and returns it regardless of content. This is not a new finding: PR #32 (T032) explicitly documented it in a clause coverage table:

C-OCR-007 | Malformed PDF raises OCRMalformedDocument | Not covered - in-memory cannot detect malformed PDFs without a parser; only empty bytes detected

The four real adapters all pass C-OCR-007. The HTTP adapters (adi, datalab) use separate _malformed_engine builders that configure their mocks to return error responses (ADI: InvalidContent poll failure; Datalab: HTTP 400), matching the server-side validation path their real implementations use.

C-OCR-007 for HTTP adapters needs error-mode mocks

The standard _adi_engine() and _datalab_engine() builders return success responses unconditionally. For C-OCR-007, _malformed_engine(adapter_id) selects a separate builder (_adi_engine_malformed, _datalab_engine_malformed) that configures the mock to return the appropriate error response. Inference adapters (paddleocr, local) validate locally via pymupdf.open(), so the standard engine handles C-OCR-007 automatically when passed b"not a valid pdf".

Factory pattern

Tests use a _engine(adapter_id, pages=N, text=...) factory instead of pytest fixtures. This keeps each test self-contained and avoids fixture parametrize indirection. Multi-page tests (C-OCR-003) call _engine(adapter_id, pages=3) which configures the mock to return 3 pages via side_effect for inference adapters and \n\n---\n\n separator in the Datalab markdown.

Out of scope additions (beyond T037 acceptance)

None. This PR adds only tests/contract/test_ocr_contract.py and flips T037 in tasks.md.

Rebase note

This PR is stacked on #36. Once #36 merges, rebase onto main before merging this one.

@likith1908 likith1908 changed the base branch from main to T036-tesseract-adapter June 16, 2026 06:14
@likith1908 likith1908 marked this pull request as ready for review June 16, 2026 06:16
@likith1908 likith1908 requested a review from anmolg1997 as a code owner June 16, 2026 06:16

@anmolg1997 anmolg1997 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved on 33f0e54. This is the keystone of WS003 and it does exactly what the workstream needed — one parametrised suite proving all five adapters honour the same contract.

What I verified

Structure

  • @pytest.mark.parametrize("adapter_id", _ALL) across the clause functions, with three audience lists: _ALL (5 adapters), _CONTENT_VALIDATING (the 4 real ones, for C-OCR-007), _LOCAL (local + paddleocr, for C-OCR-LOCAL-001). Clean separation — each clause runs against exactly the adapters that can satisfy it.
  • pytestmark = pytest.mark.contract tags the whole module. The contract marker is registered in pyproject and is NOT in the addopts deselect list, so these run in the default make test lane — verified: 66 contract tests execute on every PR, not just on demand.
  • Each adapter built through its own established injection seam (httpx AsyncMock for ADI/Datalab, MagicMock pipeline for PaddleOCR, MagicMock pytesseract for local, pre-registered responses for in-memory). No network, no subprocess, no model load. Runs in 0.43s.

Coverage decisions, both correct

  • C-OCR-007 excludes in-memoryInMemoryOCREngine looks up by document_id and never parses bytes, so it structurally cannot detect a malformed PDF. The exclusion was documented back in #32's clause table; consistent.
  • C-OCR-LOCAL-001 covers local + paddleocr — the two adapters that run entirely on-device.

Gates

  • Contract suite: 66 passed, 1 skipped (C-OCR-011, deferred to T038).
  • Full default suite: 368 passed, 1 skipped, 16 deselected.
  • lint: 3 contracts kept, 0 broken. T037 flipped.

Two non-blocking notes

  1. C-OCR-LOCAL-001 doesn't actually prove "no network". test_c_ocr_local_001_no_outbound_network runs extract() and asserts total_pages >= 1 — it verifies extraction works, not that no socket was opened. The claim is true by construction (neither adapter imports httpx), but the test as written would still pass if someone later added a network call to the local adapter. A stronger version installs a socket guard for the duration:

    def test_c_ocr_local_001_no_outbound_network(adapter_id, monkeypatch):
        def _deny(*a, **k):
            raise AssertionError("local adapter attempted a network connection")
        monkeypatch.setattr(socket.socket, "connect", _deny)
        result = _run(_engine(adapter_id).extract(_CTX, _DOC_ID, _make_pdf(), "application/pdf"))
        assert result.total_pages >= 1

    Then the test fails loudly the day a network call sneaks in — which is the whole point of an airgap clause. Worth a follow-up; not worth a round-trip now.

  2. Doubled comment on line 123api_key="test-key", # pragma: allowlist secret # pragma: allowlist secret. Harmless, just trim one. (_datalab_engine on line 157 has the single correct form.)

WS003 status

T030 through T037 all done. Remaining: T038 (OTEL spans on every adapter's extract, which un-skips C-OCR-011 here) and T039 (per-adapter READMEs — and the place to land the PaddleOCR airgap pre-seed instructions and the live-thread confirmation I asked for on #35).

A natural shape for the close-out: T038 wires the spans and flips the skip on test_c011_otel_span_carries_tenant_id in this very suite — so that test function is the acceptance target. T039 is pure docs. Could even be one PR since they're both cross-adapter sweeps.

Strong sprint — four adapters plus a unifying contract suite, all landed clean. Once #34#37 merge in stack order, WS003 is effectively complete bar OTEL and docs.

@likith1908 likith1908 force-pushed the T037-ocr-contract-suite branch from 33f0e54 to 3d7aa44 Compare June 16, 2026 06:35
@likith1908 likith1908 force-pushed the T036-tesseract-adapter branch from 851653d to ba6181f Compare June 16, 2026 06:35
@likith1908 likith1908 force-pushed the T037-ocr-contract-suite branch from 3d7aa44 to 688774f Compare June 16, 2026 06:53
@likith1908 likith1908 force-pushed the T036-tesseract-adapter branch from ba6181f to 98ee039 Compare June 16, 2026 06:53
anmolg1997
anmolg1997 previously approved these changes Jun 16, 2026

@anmolg1997 anmolg1997 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-approving on 2938ddd. You picked up both non-blocking notes from my last review even though neither gated the merge:

  1. C-OCR-LOCAL-001 now actually proves no networkmonkeypatch.setattr(socket.socket, "connect", _deny) wraps the extract call, so the test fails loudly if local/paddleocr ever opens a socket. That's the real airgap guarantee, not just 'extraction works'. Verified: 66 contract tests pass with the guard in place.
  2. Doubled # pragma: allowlist secret removed.

Full suite 368 passed, lint clean. This is the top of the stack — once #35#36#37 merge in order, WS003's adapter sprint is complete bar T038/T039.

Appreciate you tightening these without being asked to.

@likith1908 likith1908 force-pushed the T036-tesseract-adapter branch from 98ee039 to 2156122 Compare June 16, 2026 08:41
@likith1908 likith1908 force-pushed the T037-ocr-contract-suite branch from 2938ddd to 193ab70 Compare June 16, 2026 08:41
@likith1908 likith1908 force-pushed the T036-tesseract-adapter branch 2 times, most recently from 047a1e5 to 5baccdf Compare June 17, 2026 10:40
@likith1908 likith1908 changed the base branch from T036-tesseract-adapter to main June 17, 2026 10:49
@likith1908 likith1908 dismissed anmolg1997’s stale review June 17, 2026 10:49

The base branch was changed.

tests/contract/test_ocr_contract.py: 66 tests + 1 skipped placeholder.
C-OCR-001 through C-OCR-010 and C-OCR-LOCAL-001 verified against all five
adapters (in-memory, adi, datalab, paddleocr, local) with mocked externals.
C-OCR-007 excludes in-memory: documented gap from PR#32 - fake engine returns
pre-registered results and cannot detect malformed PDFs without a parser.
C-OCR-011 skipped pending T038 (OTEL spans).
- C-OCR-LOCAL-001: add monkeypatch socket.socket.connect guard so the
  test fails loudly if a local adapter ever opens a network connection,
  not just when extraction returns pages.
- Lines 123 and 258: remove doubled # pragma: allowlist secret comment.
@likith1908 likith1908 force-pushed the T037-ocr-contract-suite branch from 193ab70 to b6c54b9 Compare June 17, 2026 10:49

@anmolg1997 anmolg1997 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-approved on b6c54b9 — clean rebase onto main after #36 landed, own diff is the contract suite only (470 lines, identical to reviewed, with the socket-guard and de-duplicated pragma still in place). All green. Clear to merge; then rebase #38 onto the new main (it's currently DIRTY) and ping me.

@likith1908 likith1908 merged commit d9c650b into main Jun 17, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants