T036: Implement Tesseract local OCR adapter by likith1908 · Pull Request #36 · auropro-hyd/IRIS

likith1908 · 2026-06-15T06:18:39Z

Summary

Implements T036: the Tesseract local OCR adapter. Wraps pytesseract (a thin Python subprocess wrapper around the tesseract binary). No network calls are ever made; C-OCR-LOCAL-001 is satisfied unconditionally. PDF pages are rasterised by PyMuPDF at 150 DPI, then each image is passed to pytesseract.image_to_data() which returns per-word confidence scores in [0, 100] and axis-aligned bounding boxes.

packages/iris-adapters/ocr-local/src/iris_ocr_local/client.py: TesseractEngine - lazy pytesseract import with binary-presence check at init; _to_images() handles PDF via PyMuPDF and images via ImageSequence.Iterator; _run_page() re-applies tesseract_cmd before each subprocess call to avoid global-state clobber; _map_result() groups words by block_num in sorted numeric order and normalises per-word confidence to [0.0, 1.0]
packages/iris-adapters/ocr-local/src/iris_ocr_local/__init__.py: re-exports TesseractEngine
packages/iris-adapters/ocr-local/pyproject.toml: dependencies (pytesseract>=0.3.10, pymupdf, Pillow, iris-engine)
packages/iris-adapters/ocr-local/tests/test_unit.py: 29 unit tests, all inference mocked via _pytesseract injection seam; covers C-OCR-001 through C-OCR-010, C-OCR-LOCAL-001, result mapping, block grouping, and bbox conversion
packages/iris-adapters/ocr-local/tests/test_live.py: C-OCR-LIVE-001, gated on IRIS_OCR_LIVE_LOCAL=1 and pytest.mark.slow
Makefile: make install now prints a warning when tesseract binary is absent; make test-cov adds --cov=iris_ocr_local to explicit coverage flags
pyproject.toml: iris_ocr_local added to [tool.coverage.run] source
uv.lock: pytesseract 0.3.13 added
tasks/003-ocr-adapter-set/tasks.md: T036 marked complete

Task reference

Task ID: T036
Workstream: 003-ocr-adapter-set

Acceptance criteria

T036

PR checklist:

Every acceptance criterion in the task's tasks.md entry is satisfied.
docs-ci workflow passes (markdown lint + tasks structural check).
No em-dashes introduced in new prose.
All internal links resolve.
No secrets, credentials, or personal names introduced.

Live test result

Run against a 2-page PDF with the system Tesseract binary on my machine.

adapter      : local
total_pages  : 2
latency_ms   : 5794  (wall 5797 ms)
page 1       : confidence=0.70, bboxes=208
page 2       : confidence=0.61, bboxes=243

Notes for the reviewer

Global state in pytesseract.tesseract_cmd

pytesseract.pytesseract.tesseract_cmd is a module-level string that the library reads immediately before spawning the subprocess. It is process-wide; two TesseractEngine instances configured with different binary paths cannot safely run concurrently. The fix applied here is: _load_pytesseract() returns (module, cmd), the engine stores self._tesseract_cmd, and _run_page() re-writes the global before every subprocess call. This is safe for the single-engine production case. Concurrent multi-binary usage is documented in the module docstring as unsupported.
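A minimal sketch of the re-apply pattern described above, not the actual `TesseractEngine` code. `make_fake_pytesseract` and `EngineSketch` are hypothetical names; the fake namespace stands in for the real module (in the spirit of the `_pytesseract` injection seam) so the example runs without the binary installed.

```python
from types import SimpleNamespace


def make_fake_pytesseract():
    """Hypothetical stand-in mimicking pytesseract's module-level tesseract_cmd."""
    inner = SimpleNamespace(tesseract_cmd="tesseract")
    calls = []

    def image_to_data(image):
        # Like the real library, read the global at call time and record it.
        calls.append(inner.tesseract_cmd)
        return {"text": [], "conf": []}

    return SimpleNamespace(pytesseract=inner, image_to_data=image_to_data, calls=calls)


class EngineSketch:
    def __init__(self, module, tesseract_cmd):
        self._pytesseract = module
        self._tesseract_cmd = tesseract_cmd

    def run_page(self, image):
        # Re-write the process-wide global before every call so a value set by
        # another engine instance earlier does not leak into this call.
        self._pytesseract.pytesseract.tesseract_cmd = self._tesseract_cmd
        return self._pytesseract.image_to_data(image)


fake = make_fake_pytesseract()
engine_a = EngineSketch(fake, "/opt/tess-a/tesseract")  # illustrative paths
engine_b = EngineSketch(fake, "/opt/tess-b/tesseract")
engine_a.run_page(None)
engine_b.run_page(None)
engine_a.run_page(None)
print(fake.calls)
```

This only protects interleaved sequential calls. Two threads writing different paths at the same time can still race between the write and the subprocess spawn, which is why concurrent multi-binary usage stays unsupported.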

Block ordering in TSV output

pytesseract.image_to_data() returns TSV rows in layout-visit order, not necessarily ascending block_num order. For multi-column documents, block 2 may appear before block 1 in the row list. _map_result() collects words into blocks: dict[int, list[str]] and emits them with sorted(blocks.items()) to guarantee numeric block order in the output markdown. test_words_from_multiple_blocks_joined_by_double_newline and the multipage ordering test both exercise this path.
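The grouping step can be sketched as follows. This is an illustration under assumptions, not the real `_map_result()`: `blocks_to_markdown` is a hypothetical name, the input mirrors the dict-of-lists shape that `image_to_data()` produces with `output_type=Output.DICT`, and joining words within a block by a single space is assumed.

```python
from collections import defaultdict


def blocks_to_markdown(data):
    """Group words by block_num and join blocks in ascending numeric order."""
    blocks = defaultdict(list)
    for block_num, conf, text in zip(data["block_num"], data["conf"], data["text"]):
        word = text.strip()
        # conf == -1 marks layout rows (page, block, line), not recognised words.
        if float(conf) == -1 or not word:
            continue
        blocks[int(block_num)].append(word)
    # sorted() on the items orders by the integer block number, regardless of
    # the order in which rows were visited.
    return "\n\n".join(" ".join(words) for _, words in sorted(blocks.items()))


# Block 2 appears before block 1 in the row list, as can happen with columns.
sample = {
    "block_num": [0, 2, 2, 1, 1],
    "conf": [-1, 91, 88, 95, 90],
    "text": ["", "right", "column", "left", "column"],
}
markdown = blocks_to_markdown(sample)
print(markdown)
```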

asyncio.to_thread usage

Both _to_images() (PyMuPDF PDF rasterisation) and the _run_page() loop (Tesseract subprocess) are CPU-bound and blocking. Both are offloaded via asyncio.to_thread so the adapter does not block the event loop during extraction. The lambda captures pytesseract and tesseract_cmd as local variables before the to_thread call to avoid closing over self unnecessarily.
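A small sketch of the offloading shape, assuming nothing about the adapter beyond what is described above. `rasterise` and `run_pages` are hypothetical blocking stand-ins that sleep instead of calling PyMuPDF or Tesseract; the heartbeat task is only there to show that the event loop keeps running while they execute.

```python
import asyncio
import time


def rasterise(document):
    """Blocking stand-in for PDF rasterisation (hypothetical)."""
    time.sleep(0.05)
    return [f"{document}-page-{i}" for i in (1, 2)]


def run_pages(images, tesseract_cmd):
    """Blocking stand-in for the per-page OCR loop (hypothetical)."""
    time.sleep(0.05)
    return [f"{tesseract_cmd}:{image}" for image in images]


async def extract(document, tesseract_cmd="tesseract"):
    images = await asyncio.to_thread(rasterise, document)
    # The lambda closes over local variables only, not over an engine instance.
    return await asyncio.to_thread(lambda: run_pages(images, tesseract_cmd))


async def main():
    ticks = 0

    async def heartbeat():
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    beat = asyncio.create_task(heartbeat())
    result = await extract("doc")
    beat.cancel()
    return result, ticks


result, ticks = asyncio.run(main())
print(result, "heartbeat ticks during extract:", ticks)
```

If the two blocking functions were called directly inside `extract`, the heartbeat would not advance until they returned.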

IRIS_TESSERACT_CMD env var

On Windows the Tesseract binary is typically at a path not on PATH (e.g. C:\Program Files\Tesseract-OCR\tesseract.exe). IRIS_TESSERACT_CMD overrides pytesseract.tesseract_cmd at init. The tesseract_cmd constructor parameter takes precedence over the env var; the env var is a deployment-level fallback.
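The precedence rule can be sketched like this. `resolve_tesseract_cmd` is a hypothetical helper, and the final fallback to the bare name `tesseract` (pytesseract's own default) is an assumption about what the adapter does when neither source is set.

```python
import os


def resolve_tesseract_cmd(tesseract_cmd=None, environ=None):
    """Constructor argument wins, then IRIS_TESSERACT_CMD, then the default."""
    env = os.environ if environ is None else environ
    if tesseract_cmd:
        return tesseract_cmd
    # Deployment-level fallback; an empty value is treated as unset.
    return env.get("IRIS_TESSERACT_CMD") or "tesseract"


windows_env = {"IRIS_TESSERACT_CMD": r"C:\Program Files\Tesseract-OCR\tesseract.exe"}
print(resolve_tesseract_cmd(environ=windows_env))
print(resolve_tesseract_cmd("/usr/local/bin/tesseract", environ=windows_env))
print(resolve_tesseract_cmd(environ={}))
```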

Out of scope additions (beyond T036 acceptance)

| Addition | Reason |
| --- | --- |
| `make install` tesseract binary check | Without `tesseract-ocr` installed at the OS level, the adapter fails silently at first use with an unhelpful `TesseractNotFoundError`. The check in `make install` surfaces the missing system dependency immediately with install instructions for Ubuntu, macOS, and Windows. |
| `--cov=iris_ocr_local` in `make test-cov` | The explicit `--cov` flags in `test-cov` override `[tool.coverage.run] source` in `pyproject.toml`. The existing four adapters were covered but `iris_ocr_local` was absent from the flags, so its coverage was silently excluded from the report and the 95% gate. |

Considerations

Confidence is a real measured signal for this adapter
Unlike the PaddleOCR-VL adapter (which has no per-block score and returns 1.0 as a placeholder), Tesseract returns a per-word confidence in [0, 100] from its internal classifier. The adapter normalises to [0.0, 1.0] and averages across all valid words per page. Pages with no detected words return confidence=0.0. This is the most semantically meaningful confidence value across the four adapters.
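As a rough illustration of the per-page averaging described above (`page_confidence` is a hypothetical name, and the filtering of layout rows and empty text follows the rules stated elsewhere in this PR):

```python
def page_confidence(confs, texts):
    """Average Tesseract word confidences, normalised from [0, 100] to [0.0, 1.0]."""
    valid = [
        float(conf) / 100.0
        for conf, text in zip(confs, texts)
        # Skip layout rows (conf == -1) and rows with no recognised text.
        if float(conf) != -1 and text.strip()
    ]
    if not valid:
        # No detected words on the page.
        return 0.0
    return sum(valid) / len(valid)


print(page_confidence([-1, 100, 50], ["", "IRIS", "test"]))
print(page_confidence([-1, -1], ["", ""]))
```

`float(conf)` is used because, depending on the pytesseract version and output type, the `conf` column may arrive as strings, ints, or floats.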

Binary must be installed at the OS level
pytesseract is a Python package; it wraps the tesseract binary via subprocess. The binary itself is not a Python dependency and is not installed by uv sync. make install now checks for the binary and prints install instructions if absent. Deployment images must include tesseract-ocr (and a language pack such as tesseract-ocr-eng) as a system package.
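A sketch of the kind of presence check involved, using `shutil.which` from the standard library. `tesseract_install_hint` is a hypothetical helper and the hint text is illustrative; it is not the message the Makefile actually prints.

```python
import shutil


def tesseract_install_hint(binary="tesseract"):
    """Return None if the binary is on PATH, else a short install hint."""
    if shutil.which(binary) is not None:
        return None
    return (
        f"'{binary}' not found on PATH. Install it at the OS level, e.g.\n"
        "  Ubuntu:  sudo apt-get install tesseract-ocr tesseract-ocr-eng\n"
        "  macOS:   brew install tesseract\n"
        "  Windows: install Tesseract and point IRIS_TESSERACT_CMD at tesseract.exe"
    )


hint = tesseract_install_hint()
print(hint or "tesseract binary found")
```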

Language packs
Tesseract requires a language data file to be installed alongside the binary. The default language is eng (English). Multi-language documents require additional packs installed at the OS level. The adapter does not currently expose a lang parameter; this is a follow-on item if multi-language support is needed.

DPI choice for PDF rasterisation
150 DPI matches the PaddleOCR adapter default. At 72 DPI the rasterised image is too coarse for Tesseract to produce accurate results. 300 DPI increases image size and per-page inference time approximately 4x relative to 150 DPI. The 150 DPI default can be overridden via the dpi constructor parameter.
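The 4x figure follows from pixel count scaling with the square of the DPI. A quick arithmetic sketch, assuming a US Letter page (612 x 792 PDF points, 1 point = 1/72 inch); `raster_size` is a hypothetical helper and no PyMuPDF call is made:

```python
def raster_size(width_pt, height_pt, dpi):
    """Pixel dimensions of a PDF page rendered at `dpi` (1 pt = 1/72 inch)."""
    zoom = dpi / 72.0
    return round(width_pt * zoom), round(height_pt * zoom)


sizes = {dpi: raster_size(612, 792, dpi) for dpi in (72, 150, 300)}
for dpi, (width, height) in sizes.items():
    print(f"{dpi} DPI: {width} x {height} px = {width * height} pixels")

pixels_150 = sizes[150][0] * sizes[150][1]
pixels_300 = sizes[300][0] * sizes[300][1]
print("300 vs 150 DPI pixel ratio:", pixels_300 / pixels_150)
```

Whether inference time tracks pixel count this closely depends on the page content and the Tesseract build; the ratio above only covers image size.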

Rebase note

This PR is stacked on #35. Once #35 merges, rebase onto main before merging this one.

anmolg1997

Approved on 664cef3. Cleanest adapter of the four — the patterns from #33-#35 reviews are all baked in from the first commit.

What I verified

Implementation (iris_ocr_local/client.py)

to_thread offload from the start — both rasterisation and the image_to_data loop. No event-loop block. You internalised the #35 feedback without being asked; that's the goal.
id = "local" — correct member of VALID_ADAPTER_IDS (the package is ocr-local; "local" not "tesseract").
Binary-presence check at init via get_tesseract_version() → OCRUnavailable with an actionable install message. Fail-fast, same shape as PaddleOCR's offline guard.
tesseract_cmd re-applied inside _run_page before each subprocess call, and the module docstring honestly documents that pytesseract.tesseract_cmd is process-wide global state — so concurrent engines with different binaries aren't safe. Exactly the kind of caveat that saves a future debugging session.
_map_result correctly skips conf==-1 layout rows and empty-text words, groups by block_num in sorted numeric order, normalises conf [0,100]→[0,1], averages per page, 0.0 on no words. C-OCR-004 guards (max(0,...) / max(1,...)) in place.
Multi-frame TIFF via ImageSequence.Iterator; malformed PDF/image → OCRMalformedDocument.

DX touches (genuinely good)

make install now warns when the tesseract binary is absent, with per-OS install commands. This is the right place for it — a Python-only uv sync can't surface a missing system dependency, and a new engineer would otherwise hit a confusing runtime error.
distclean notes that system packages aren't removed. Honest about the boundary of what the Makefile controls.

Gates

ocr-local suite: 29 passed, 1 deselected (live, slow).
Full default suite: 302 passed, 16 deselected — four adapter suites coexisting cleanly under importlib.
mypy: clean on 18 source files. lint: 3 contracts kept, 0 broken.
T036 flipped; C-OCR-LOCAL-001 satisfied inherently (local subprocess, no network).

One minor cleanup (not blocking)

make test-cov now lists every package explicitly as --cov=iris_engine --cov=iris_config --cov=iris_ocr_adi ... --cov=iris_ocr_local, while [tool.coverage.run] source in pyproject.toml also lists the same set. Two sources of truth — adding adapter #5 means editing both, and they can silently drift. Since source already covers it, the Makefile could drop the explicit --cov= flags and just run pytest --cov --cov-report=... (coverage reads source from config). Or keep the Makefile authoritative and thin out the config. Either way, one place. Fold into T037 or a tiny hygiene commit — not worth a round-trip here.

Stack status

This is adapter four of four. With #34/#35/#36 approved and #37 (contract suite) next, WS003's adapter sprint is essentially done. The contract suite is where all four get verified against the same clauses uniformly — reviewing that now.

anmolg1997

Re-confirming on 98ee039 — pure rebase onto the updated stack, own diff byte-identical to what I approved at 664cef3 (Tesseract adapter + Makefile install-warning + coverage). Still CLEAN. No re-review needed beyond confirming the rebase introduced nothing.

The base branch was changed.

Core adapter (iris-ocr-local):
- TesseractEngine backed by pytesseract wrapping the system tesseract binary; satisfies C-OCR-001 through C-OCR-010 and C-OCR-LOCAL-001 (inherent - no network calls at any point)
- PDF rasterisation via PyMuPDF at 150 DPI, same pattern as iris-ocr-paddleocr
- image_to_data() provides per-word bboxes in pixels and confidence [0-100]; adapter normalises confidence to [0.0-1.0], averages per page
- Words grouped by block_num; blocks joined with double newline in markdown
- Rows with conf == -1 (layout rows) and empty text stripped before mapping
- IRIS_TESSERACT_CMD env var overrides the binary path (Windows / custom paths)
- 28 unit tests, all contract clauses mocked via _pytesseract injection seam
- Live test gated on IRIS_OCR_LIVE_LOCAL=1; result: 182 ms, confidence 0.96, 3 bboxes, markdown "IRIS live test"

Adapter READMEs (T039 partial - all four OCR adapters):
- ocr-adi/README.md: Azure setup, env vars, live test command, limitations
- ocr-datalab/README.md: API key setup, polling flow, bbox/confidence notes
- ocr-paddleocr/README.md: model download, GPU install, offline mode, limits
- ocr-local/README.md: platform install commands, IRIS_TESSERACT_CMD, live test

Makefile:
- install: cross-platform tesseract check via shutil.which after uv sync; prints per-platform install instructions if binary is absent
- distclean (Linux + Windows): note that system packages are not auto-removed

Workspace:
- iris_ocr_local added to coverage sources in root pyproject.toml
- tasks.md: T035 marked complete

README files for adi, datalab, local, and paddleocr are kept untracked locally. They will be added in the T039 PR which covers all adapter documentation together.

extract() was blocking the event loop during PyMuPDF rasterisation and pytesseract.image_to_data() subprocess calls. Wrapped both in asyncio.to_thread so concurrent requests are not frozen during OCR. Also flipped T036 to [x] in tasks.md.

Correctness: re-apply tesseract_cmd per inference call to prevent global state clobber between concurrent engines; wrap per-page get_pixmap and Image.frombytes in try/except OCRMalformedDocument; replace TIFF while-True frame loop with ImageSequence.Iterator; sort blocks by block_num for correct reading order on multi-column documents; move _map_result inside _run_page try/except so KeyError raises OCRUnavailable not a raw exception.
Infrastructure: add missing --cov flags for all four adapter packages to make test-cov.
Tests: strengthen C-OCR-002 to assert full five-word phrase and total_pages; add per-page content assertions to test_c003; add test_binary_not_found_raises_unavailable.
Documentation: note C-OCR-011 deferral to T038 in module docstring.

pyproject.toml [tool.coverage.run] source already lists all packages; duplicate flags in the Makefile were a second source of truth that would silently drift when a new adapter is added.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

anmolg1997

Re-approved on 5baccdf — clean rebase onto main after #35 landed, own diff byte-identical to what I reviewed (Tesseract adapter + Makefile + coverage, 677 insertions). All checks green. Clear to merge.

likith1908 marked this pull request as ready for review June 16, 2026 04:54

likith1908 requested a review from anmolg1997 as a code owner June 16, 2026 04:54

likith1908 mentioned this pull request Jun 16, 2026

T037: Parametrised OCR contract suite #37

Merged

17 tasks

anmolg1997 mentioned this pull request Jun 16, 2026

Implement Datalab OCR adapter with httpx client and contract test suite (T034) #34

Merged

19 tasks

anmolg1997 approved these changes Jun 16, 2026


likith1908 force-pushed the T036-tesseract-adapter branch from 851653d to ba6181f Compare June 16, 2026 06:35

likith1908 force-pushed the T035-paddleocr-adapter branch from 94ec43d to e1c8b2a Compare June 16, 2026 06:35

likith1908 force-pushed the T036-tesseract-adapter branch from ba6181f to 98ee039 Compare June 16, 2026 06:53

anmolg1997 mentioned this pull request Jun 16, 2026

T035: Implement PaddleOCR-VL-1.6 local OCR adapter #35

Merged

18 tasks

anmolg1997 previously approved these changes Jun 16, 2026


likith1908 force-pushed the T036-tesseract-adapter branch from 98ee039 to 2156122 Compare June 16, 2026 08:41

This was referenced Jun 17, 2026

T038 + T039: OTEL span instrumentation and per-adapter READMEs #38

Open

T041 + T042: LLM selector, StubLLMProvider, and generate-schemas #40

Open

likith1908 force-pushed the T036-tesseract-adapter branch from 2156122 to 047a1e5 Compare June 17, 2026 10:37

likith1908 changed the base branch from T035-paddleocr-adapter to main June 17, 2026 10:38

likith1908 and others added 5 commits June 17, 2026 16:10

Remove per-adapter READMEs from T036 - deferred to T039

6646013


likith1908 force-pushed the T036-tesseract-adapter branch from 047a1e5 to 5baccdf Compare June 17, 2026 10:40

anmolg1997 approved these changes Jun 17, 2026


likith1908 merged commit 4bc2e0e into main Jun 17, 2026
10 checks passed

likith1908 merged 5 commits into main from T036-tesseract-adapter
