T036: Implement Tesseract local OCR adapter #36

Merged
likith1908 merged 5 commits into main from T036-tesseract-adapter on Jun 17, 2026

@likith1908 likith1908 commented Jun 15, 2026

Contributor

Summary

Implements T036: the Tesseract local OCR adapter. Wraps pytesseract (a thin Python subprocess wrapper around the tesseract binary). No network calls are ever made; C-OCR-LOCAL-001 is satisfied unconditionally. PDF pages are rasterised by PyMuPDF at 150 DPI, then each image is passed to pytesseract.image_to_data() which returns per-word confidence scores in [0, 100] and axis-aligned bounding boxes.

  • packages/iris-adapters/ocr-local/src/iris_ocr_local/client.py: TesseractEngine - lazy pytesseract import with binary-presence check at init; _to_images() handles PDF via PyMuPDF and images via ImageSequence.Iterator; _run_page() re-applies tesseract_cmd before each subprocess call to avoid global-state clobber; _map_result() groups words by block_num in sorted numeric order and normalises per-word confidence to [0.0, 1.0]
  • packages/iris-adapters/ocr-local/src/iris_ocr_local/__init__.py: re-exports TesseractEngine
  • packages/iris-adapters/ocr-local/pyproject.toml: dependencies (pytesseract>=0.3.10, pymupdf, Pillow, iris-engine)
  • packages/iris-adapters/ocr-local/tests/test_unit.py: 29 unit tests, all inference mocked via _pytesseract injection seam; covers C-OCR-001 through C-OCR-010, C-OCR-LOCAL-001, result mapping, block grouping, and bbox conversion
  • packages/iris-adapters/ocr-local/tests/test_live.py: C-OCR-LIVE-001, gated on IRIS_OCR_LIVE_LOCAL=1 and pytest.mark.slow
  • Makefile: make install now prints a warning when tesseract binary is absent; make test-cov adds --cov=iris_ocr_local to explicit coverage flags
  • pyproject.toml: iris_ocr_local added to [tool.coverage.run] source
  • uv.lock: pytesseract 0.3.13 added
  • tasks/003-ocr-adapter-set/tasks.md: T036 marked complete

Task reference

  • Task ID: T036
  • Workstream: 003-ocr-adapter-set

Acceptance criteria

T036

  • C-OCR-001: test_c001_id_is_local, test_c001_version_is_semver
  • C-OCR-002: test_c002_pdf_extracts_markdown
  • C-OCR-003: test_c003_multipage_ordering, test_c003_page_numbers_start_at_one
  • C-OCR-004: test_c004_bbox_non_negative_xy_positive_wh
  • C-OCR-005: test_c005_confidence_in_range, test_c005_confidence_is_mean_of_word_scores, test_c005_no_words_gives_confidence_zero
  • C-OCR-006: test_c006_unsupported_content_type_raises, test_c006_error_does_not_contain_bytes
  • C-OCR-007: test_c007_malformed_pdf_raises_malformed
  • C-OCR-008: test_c008_empty_bytes_raises_malformed
  • C-OCR-009: test_c009_adapter_id_in_result
  • C-OCR-010: test_c010_png_input_accepted, test_c010_jpeg_input_accepted
  • C-OCR-LOCAL-001: no outbound network (inherent; tesseract is a local subprocess, never connects to a network)
  • Live clause: test_live.py runs under IRIS_OCR_LIVE_LOCAL=1
  • C-OCR-011 (OTEL span): deferred to T038

PR checklist:

  • Every acceptance criterion in the task's tasks.md entry is satisfied.
  • docs-ci workflow passes (markdown lint + tasks structural check).
  • No em-dashes introduced in new prose.
  • All internal links resolve.
  • No secrets, credentials, or personal names introduced.

Live test result

Run against a 2-page PDF with the system Tesseract binary on my machine.

adapter      : local
total_pages  : 2
latency_ms   : 5794  (wall 5797 ms)
page 1       : confidence=0.70, bboxes=208
page 2       : confidence=0.61, bboxes=243

Notes for the reviewer

Global state in pytesseract.tesseract_cmd

pytesseract.pytesseract.tesseract_cmd is a module-level string that the library reads immediately before spawning the subprocess. It is process-wide; two TesseractEngine instances configured with different binary paths cannot safely run concurrently. The fix applied here is: _load_pytesseract() returns (module, cmd), the engine stores self._tesseract_cmd, and _run_page() re-writes the global before every subprocess call. This is safe for the single-engine production case. Concurrent multi-binary usage is documented in the module docstring as unsupported.
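A minimal sketch of the re-apply pattern described above. It does not import pytesseract; `fake_pytesseract`, `fake_image_to_data`, and `EngineSketch` are hypothetical stand-ins that only mimic the shape of the module-level `tesseract_cmd` global, so the actual adapter code may differ.

```python
import types

# Stand-in for the pytesseract package. The nested namespace mirrors the
# real attribute path pytesseract.pytesseract.tesseract_cmd (illustration only).
fake_pytesseract = types.SimpleNamespace(
    pytesseract=types.SimpleNamespace(tesseract_cmd="tesseract")
)


def fake_image_to_data(image):
    """Pretend subprocess call: report which binary would be spawned."""
    return fake_pytesseract.pytesseract.tesseract_cmd


class EngineSketch:
    def __init__(self, tesseract_cmd):
        # Each engine remembers its own binary path.
        self._tesseract_cmd = tesseract_cmd

    def run_page(self, image):
        # Re-write the process-wide global immediately before the call so a
        # path left behind by another engine is never used by accident.
        fake_pytesseract.pytesseract.tesseract_cmd = self._tesseract_cmd
        return fake_image_to_data(image)


engine_a = EngineSketch("/usr/bin/tesseract")
engine_b = EngineSketch("/opt/tesseract5/bin/tesseract")
print(engine_a.run_page(None))
print(engine_b.run_page(None))
print(engine_a.run_page(None))
```

Interleaved sequential calls pick up the right path each time. Two threads could still race between the assignment and the spawn, which is why concurrent multi-binary usage stays unsupported.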

Block ordering in TSV output

pytesseract.image_to_data() returns TSV rows in layout-visit order, not necessarily ascending block_num order. For multi-column documents, block 2 may appear before block 1 in the row list. _map_result() collects words into blocks: dict[int, list[str]] and emits them with sorted(blocks.items()) to guarantee numeric block order in the output markdown. test_words_from_multiple_blocks_joined_by_double_newline and the multipage ordering test both exercise this path.
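The grouping step can be sketched as follows. The input mimics the dict form of `image_to_data()` output (parallel lists keyed by column name); `blocks_to_markdown` is a hypothetical name and the sample data is invented for illustration, so this is not the adapter's `_map_result()` itself.

```python
from collections import defaultdict


def blocks_to_markdown(data):
    """Group words by block_num and emit blocks in ascending numeric order."""
    blocks = defaultdict(list)
    for block_num, conf, text in zip(data["block_num"], data["conf"], data["text"]):
        # Skip layout rows (conf == -1) and rows with no text.
        if float(conf) == -1 or not text.strip():
            continue
        blocks[int(block_num)].append(text.strip())
    # sorted() restores numeric block order regardless of row order.
    return "\n\n".join(" ".join(words) for _, words in sorted(blocks.items()))


# Block 2 rows arrive before block 1 rows, as can happen on multi-column pages.
sample = {
    "block_num": [2, 2, 1, 1, 1],
    "conf": [91, 88, -1, 95, 90],
    "text": ["second", "column", "", "first", "column"],
}
print(blocks_to_markdown(sample))
```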

asyncio.to_thread usage

Both _to_images() (PyMuPDF PDF rasterisation) and the _run_page() loop (Tesseract subprocess) are CPU-bound and blocking. Both are offloaded via asyncio.to_thread so the adapter does not block the event loop during extraction. The lambda captures pytesseract and tesseract_cmd as local variables before the to_thread call to avoid closing over self unnecessarily.
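A small sketch of the offload pattern, assuming Python 3.9+ for `asyncio.to_thread`. `blocking_ocr` and `EngineSketch` are hypothetical placeholders (a `time.sleep` stands in for rasterisation and the Tesseract subprocess); the ticker task only exists to show that the event loop keeps running while the blocking work is in progress.

```python
import asyncio
import time


def blocking_ocr(cmd, page):
    """Placeholder for a blocking call such as a Tesseract subprocess."""
    time.sleep(0.05)
    return f"{cmd}:{page}"


class EngineSketch:
    def __init__(self, cmd):
        self._cmd = cmd

    async def extract(self, pages):
        # Bind to a local first so the lambda does not close over self.
        cmd = self._cmd
        return await asyncio.to_thread(
            lambda: [blocking_ocr(cmd, page) for page in pages]
        )


async def main():
    ticks = 0

    async def ticker():
        # Counts event-loop iterations while OCR runs in a worker thread.
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0.01)

    task = asyncio.create_task(ticker())
    result = await EngineSketch("tesseract").extract([1, 2])
    task.cancel()
    return result, ticks


result, ticks = asyncio.run(main())
print(result, "ticks while extracting:", ticks)
```

If `extract` called `blocking_ocr` directly, the ticker would not advance until extraction finished.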

IRIS_TESSERACT_CMD env var

On Windows the Tesseract binary is typically at a path not on PATH (e.g. C:\Program Files\Tesseract-OCR\tesseract.exe). IRIS_TESSERACT_CMD overrides pytesseract.tesseract_cmd at init. The tesseract_cmd constructor parameter takes precedence over the env var; the env var is a deployment-level fallback.
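The precedence order can be sketched like this. `resolve_tesseract_cmd` and its `environ` parameter are hypothetical names introduced for illustration; the final fallback of `"tesseract"` assumes the binary is resolved from PATH when nothing is configured.

```python
import os


def resolve_tesseract_cmd(tesseract_cmd=None, environ=None):
    """Constructor argument first, then IRIS_TESSERACT_CMD, then PATH lookup."""
    env = os.environ if environ is None else environ
    return tesseract_cmd or env.get("IRIS_TESSERACT_CMD") or "tesseract"


windows_env = {"IRIS_TESSERACT_CMD": r"C:\Program Files\Tesseract-OCR\tesseract.exe"}
print(resolve_tesseract_cmd(environ=windows_env))
print(resolve_tesseract_cmd("/opt/bin/tesseract", environ=windows_env))
print(resolve_tesseract_cmd(environ={}))
```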

Out of scope additions (beyond T036 acceptance)

| Addition | Reason |
| --- | --- |
| make install tesseract binary check | Without tesseract-ocr installed at the OS level, the adapter fails silently at first use with an unhelpful TesseractNotFoundError. The check in make install surfaces the missing system dependency immediately with install instructions for Ubuntu, macOS, and Windows. |
| --cov=iris_ocr_local in make test-cov | The explicit --cov flags in test-cov override [tool.coverage.run] source in pyproject.toml. The existing four adapters were covered but iris_ocr_local was absent from the flags, so its coverage was silently excluded from the report and the 95% gate. |

Considerations

Confidence is a real measured signal for this adapter
Unlike the PaddleOCR-VL adapter (which has no per-block score and returns 1.0 as a placeholder), Tesseract returns a per-word confidence in [0, 100] from its internal classifier. The adapter normalises to [0.0, 1.0] and averages across all valid words per page. Pages with no detected words return confidence=0.0. This is the most semantically meaningful confidence value across the four adapters.
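As a worked example of the normalise-and-average rule, here is a minimal sketch. `page_confidence` is a hypothetical helper, and treating any negative score as a non-word row is an assumption based on the conf == -1 layout rows mentioned in this PR.

```python
def page_confidence(word_confs):
    """Mean of per-word scores, rescaled from [0, 100] to [0.0, 1.0]."""
    # Negative scores mark layout rows rather than recognised words.
    valid = [float(c) / 100.0 for c in word_confs if float(c) >= 0]
    # A page with no detected words reports 0.0 rather than raising.
    return sum(valid) / len(valid) if valid else 0.0


print(page_confidence([100, -1, 50]))
print(page_confidence([-1, -1]))
```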

Binary must be installed at the OS level
pytesseract is a Python package; it wraps the tesseract binary via subprocess. The binary itself is not a Python dependency and is not installed by uv sync. make install now checks for the binary and prints install instructions if absent. Deployment images must include tesseract-ocr (and a language pack such as tesseract-ocr-eng) as a system package.
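A sketch of what such a presence check might look like, using `shutil.which`. The function name, message wording, and the specific install commands in `INSTALL_HINTS` are assumptions for illustration and are not copied from the Makefile.

```python
import shutil

# Illustrative per-platform hints (assumed, not taken from the Makefile).
INSTALL_HINTS = {
    "Ubuntu": "sudo apt-get install tesseract-ocr tesseract-ocr-eng",
    "macOS": "brew install tesseract",
    "Windows": "install Tesseract-OCR and set IRIS_TESSERACT_CMD to the tesseract.exe path",
}


def check_tesseract(which=shutil.which):
    """Return a status message; `which` is injectable so the check is testable."""
    path = which("tesseract")
    if path:
        return f"tesseract found at {path}"
    lines = ["WARNING: tesseract binary not found on PATH."]
    lines += [f"  {os_name}: {hint}" for os_name, hint in INSTALL_HINTS.items()]
    return "\n".join(lines)


print(check_tesseract())
```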

Language packs
Tesseract requires a language data file to be installed alongside the binary. The default language is eng (English). Multi-language documents require additional packs installed at the OS level. The adapter does not currently expose a lang parameter; this is a follow-on item if multi-language support is needed.

DPI choice for PDF rasterisation
150 DPI matches the PaddleOCR adapter default. At 72 DPI the rasterised image is too coarse for Tesseract to produce accurate results. 300 DPI quadruples the pixel count relative to 150 DPI and increases per-page inference time approximately 4x. The 150 DPI default can be overridden via the dpi constructor parameter.
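The pixel-count arithmetic behind the DPI trade-off can be checked with a short sketch. It assumes PDF page sizes are expressed in points (72 per inch) and uses an A4 page as the example; `raster_size` is a hypothetical helper, and the rasteriser's own rounding may differ by a pixel.

```python
def raster_size(width_pt, height_pt, dpi):
    """Pixel dimensions of a page rasterised at the given DPI."""
    zoom = dpi / 72.0  # PDF user space is 72 points per inch
    return round(width_pt * zoom), round(height_pt * zoom)


A4_POINTS = (595, 842)
pixels = {}
for dpi in (72, 150, 300):
    width, height = raster_size(*A4_POINTS, dpi)
    pixels[dpi] = width * height
    print(f"{dpi} DPI: {width} x {height} px")

ratio = pixels[300] / pixels[150]
print(f"300 DPI vs 150 DPI pixel ratio: {ratio:.2f}")
```

Doubling the DPI doubles both dimensions, so the pixel count grows by roughly four. Inference time depends on more than pixel count, so the 4x timing figure remains an approximation.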

Rebase note

This PR is stacked on #35. Once #35 merges, rebase onto main before merging this one.

@anmolg1997 anmolg1997 left a comment

Collaborator

Approved on 664cef3. Cleanest adapter of the four — the patterns from #33-#35 reviews are all baked in from the first commit.

What I verified

Implementation (iris_ocr_local/client.py)

  • to_thread offload from the start — both rasterisation and the image_to_data loop. No event-loop block. You internalised the #35 feedback without being asked; that's the goal.
  • id = "local" — correct member of VALID_ADAPTER_IDS (the package is ocr-local; "local" not "tesseract").
  • Binary-presence check at init via get_tesseract_version() → OCRUnavailable with an actionable install message. Fail-fast, same shape as PaddleOCR's offline guard.
  • tesseract_cmd re-applied inside _run_page before each subprocess call, and the module docstring honestly documents that pytesseract.tesseract_cmd is process-wide global state — so concurrent engines with different binaries aren't safe. Exactly the kind of caveat that saves a future debugging session.
  • _map_result correctly skips conf==-1 layout rows and empty-text words, groups by block_num in sorted numeric order, normalises conf [0,100]→[0,1], averages per page, 0.0 on no words. C-OCR-004 guards (max(0,...) / max(1,...)) in place.
  • Multi-frame TIFF via ImageSequence.Iterator; malformed PDF/image → OCRMalformedDocument.
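The C-OCR-004 guards mentioned above can be sketched as a small clamp. `to_bbox` and the dict field names are hypothetical; only the max(0, ...) and max(1, ...) shape comes from the review.

```python
def to_bbox(left, top, width, height):
    """Clamp a Tesseract pixel box so x/y are non-negative and w/h are positive."""
    return {
        "x": max(0, left),
        "y": max(0, top),
        "width": max(1, width),
        "height": max(1, height),
    }


print(to_bbox(12, 40, 85, 17))
print(to_bbox(-3, 0, 0, -5))
```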

DX touches (genuinely good)

  • make install now warns when the tesseract binary is absent, with per-OS install commands. This is the right place for it — a Python-only uv sync can't surface a missing system dependency, and a new engineer would otherwise hit a confusing runtime error.
  • distclean notes that system packages aren't removed. Honest about the boundary of what the Makefile controls.

Gates

  • ocr-local suite: 29 passed, 1 deselected (live, slow).
  • Full default suite: 302 passed, 16 deselected — four adapter suites coexisting cleanly under importlib.
  • mypy: clean on 18 source files. lint: 3 contracts kept, 0 broken.
  • T036 flipped; C-OCR-LOCAL-001 satisfied inherently (local subprocess, no network).

One minor cleanup (not blocking)

make test-cov now lists every package explicitly as --cov=iris_engine --cov=iris_config --cov=iris_ocr_adi ... --cov=iris_ocr_local, while [tool.coverage.run] source in pyproject.toml also lists the same set. Two sources of truth — adding adapter #5 means editing both, and they can silently drift. Since source already covers it, the Makefile could drop the explicit --cov= flags and just run pytest --cov --cov-report=... (coverage reads source from config). Or keep the Makefile authoritative and thin out the config. Either way, one place. Fold into T037 or a tiny hygiene commit — not worth a round-trip here.

Stack status

This is adapter four of four. With #34/#35/#36 approved and #37 (contract suite) next, WS003's adapter sprint is essentially done. The contract suite is where all four get verified against the same clauses uniformly — reviewing that now.

@likith1908 likith1908 force-pushed the T036-tesseract-adapter branch from 851653d to ba6181f on June 16, 2026 06:35
@likith1908 likith1908 force-pushed the T035-paddleocr-adapter branch from 94ec43d to e1c8b2a on June 16, 2026 06:35
@likith1908 likith1908 force-pushed the T036-tesseract-adapter branch from ba6181f to 98ee039 on June 16, 2026 06:53
anmolg1997 previously approved these changes Jun 16, 2026

@anmolg1997 anmolg1997 left a comment

Collaborator

Re-confirming on 98ee039 — pure rebase onto the updated stack, own diff byte-identical to what I approved at 664cef3 (Tesseract adapter + Makefile install-warning + coverage). Still CLEAN. No re-review needed beyond confirming the rebase introduced nothing.

@likith1908 likith1908 force-pushed the T036-tesseract-adapter branch from 98ee039 to 2156122 on June 16, 2026 08:41
@likith1908 likith1908 force-pushed the T036-tesseract-adapter branch from 2156122 to 047a1e5 on June 17, 2026 10:37
@likith1908 likith1908 changed the base branch from T035-paddleocr-adapter to main June 17, 2026 10:38
@likith1908 likith1908 dismissed anmolg1997’s stale review June 17, 2026 10:38

The base branch was changed.

likith1908 and others added 5 commits June 17, 2026 16:10
Core adapter (iris-ocr-local):
- TesseractEngine backed by pytesseract wrapping the system tesseract binary;
  satisfies C-OCR-001 through C-OCR-010 and C-OCR-LOCAL-001 (inherent - no
  network calls at any point)
- PDF rasterisation via PyMuPDF at 150 DPI, same pattern as iris-ocr-paddleocr
- image_to_data() provides per-word bboxes in pixels and confidence [0-100];
  adapter normalises confidence to [0.0-1.0], averages per page
- Words grouped by block_num; blocks joined with double newline in markdown
- Rows with conf == -1 (layout rows) and empty text stripped before mapping
- IRIS_TESSERACT_CMD env var overrides the binary path (Windows / custom paths)
- 28 unit tests, all contract clauses mocked via _pytesseract injection seam
- Live test gated on IRIS_OCR_LIVE_LOCAL=1; result: 182 ms, confidence 0.96,
  3 bboxes, markdown "IRIS live test"

Adapter READMEs (T039 partial - all four OCR adapters):
- ocr-adi/README.md: Azure setup, env vars, live test command, limitations
- ocr-datalab/README.md: API key setup, polling flow, bbox/confidence notes
- ocr-paddleocr/README.md: model download, GPU install, offline mode, limits
- ocr-local/README.md: platform install commands, IRIS_TESSERACT_CMD, live test

Makefile:
- install: cross-platform tesseract check via shutil.which after uv sync;
  prints per-platform install instructions if binary is absent
- distclean (Linux + Windows): note that system packages are not auto-removed

Workspace:
- iris_ocr_local added to coverage sources in root pyproject.toml
- tasks.md: T035 marked complete

README files for adi, datalab, local, and paddleocr are kept
untracked locally. They will be added in the T039 PR which
covers all adapter documentation together.

extract() was blocking the event loop during PyMuPDF rasterisation and
pytesseract.image_to_data() subprocess calls. Wrapped both in
asyncio.to_thread so concurrent requests are not frozen during OCR.
Also flipped T036 to [x] in tasks.md.

Correctness: re-apply tesseract_cmd per inference call to prevent global
state clobber between concurrent engines; wrap per-page get_pixmap and
Image.frombytes in try/except OCRMalformedDocument; replace TIFF while-True
frame loop with ImageSequence.Iterator; sort blocks by block_num for correct
reading order on multi-column documents; move _map_result inside _run_page
try/except so KeyError raises OCRUnavailable not a raw exception.
Infrastructure: add missing --cov flags for all four adapter packages to
make test-cov. Tests: strengthen C-OCR-002 to assert full five-word phrase
and total_pages; add per-page content assertions to test_c003; add
test_binary_not_found_raises_unavailable. Documentation: note C-OCR-011
deferral to T038 in module docstring.

pyproject.toml [tool.coverage.run] source already lists all packages;
duplicate flags in the Makefile were a second source of truth that
would silently drift when a new adapter is added.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@likith1908 likith1908 force-pushed the T036-tesseract-adapter branch from 047a1e5 to 5baccdf on June 17, 2026 10:40

@anmolg1997 anmolg1997 left a comment

Collaborator

Re-approved on 5baccdf — clean rebase onto main after #35 landed, own diff byte-identical to what I reviewed (Tesseract adapter + Makefile + coverage, 677 insertions). All checks green. Clear to merge.

@likith1908 likith1908 merged commit 4bc2e0e into main Jun 17, 2026
10 checks passed