T036: Implement Tesseract local OCR adapter #36
anmolg1997
left a comment
Approved on 664cef3. Cleanest adapter of the four — the patterns from #33-#35 reviews are all baked in from the first commit.
What I verified
Implementation (iris_ocr_local/client.py)
- `to_thread` offload from the start: both rasterisation and the `image_to_data` loop. No event-loop block. You internalised the #35 feedback without being asked; that's the goal.
- `id = "local"`: correct member of `VALID_ADAPTER_IDS` (the package is `ocr-local`; "local" not "tesseract").
- Binary-presence check at init via `get_tesseract_version()` → `OCRUnavailable` with an actionable install message. Fail-fast, same shape as PaddleOCR's offline guard.
- `tesseract_cmd` re-applied inside `_run_page` before each subprocess call, and the module docstring honestly documents that `pytesseract.tesseract_cmd` is process-wide global state, so concurrent engines with different binaries aren't safe. Exactly the kind of caveat that saves a future debugging session.
- `_map_result` correctly skips conf==-1 layout rows and empty-text words, groups by `block_num` in sorted numeric order, normalises conf [0,100]→[0,1], averages per page, 0.0 on no words. C-OCR-004 guards (max(0,...) / max(1,...)) in place.
- Multi-frame TIFF via `ImageSequence.Iterator`; malformed PDF/image → `OCRMalformedDocument`.
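For readers who have not opened the diff, the result-mapping behaviour verified above can be sketched roughly as follows. This is an illustration only, not the adapter's `_map_result`: the function name `map_words` and the sample data are hypothetical, and the input is assumed to be a plain dict shaped like the column-wise output of `pytesseract.image_to_data()` in dict mode.

```python
from collections import defaultdict


def map_words(data: dict[str, list]) -> tuple[str, float]:
    """Turn image_to_data-style columns into (markdown, mean confidence)."""
    blocks: dict[int, list[str]] = defaultdict(list)
    scores: list[float] = []
    for text, conf, block_num in zip(data["text"], data["conf"], data["block_num"]):
        conf = float(conf)  # conf may arrive as a string, so coerce first
        if conf < 0 or not text.strip():
            continue  # conf == -1 marks layout rows; blank text carries no word
        blocks[int(block_num)].append(text.strip())
        scores.append(min(1.0, conf / 100.0))  # [0, 100] -> [0.0, 1.0]
    # Emit blocks in ascending block_num order, separated by a blank line.
    markdown = "\n\n".join(" ".join(words) for _, words in sorted(blocks.items()))
    confidence = sum(scores) / len(scores) if scores else 0.0
    return markdown, confidence


# Hypothetical rows: block 2 is visited before block 1, as can happen
# on multi-column layouts.
sample = {
    "text": ["", "world", "Hello", " ", "IRIS"],
    "conf": [-1, 90, 96, 95, 100],
    "block_num": [0, 2, 1, 1, 2],
}
markdown, confidence = map_words(sample)
```

Sorting on the integer `block_num` key is what restores reading order here; sorting on string keys would put block 10 before block 2.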
DX touches (genuinely good)
- `make install` now warns when the `tesseract` binary is absent, with per-OS install commands. This is the right place for it: a Python-only `uv sync` can't surface a missing system dependency, and a new engineer would otherwise hit a confusing runtime error.
- `distclean` notes that system packages aren't removed. Honest about the boundary of what the Makefile controls.
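A check of this kind can be sketched with `shutil.which`, which the commit message names as the mechanism. The function name `tesseract_warning`, the hint table, and the exact install commands below are assumptions for illustration; the Makefile's real wording is not shown on this page.

```python
import shutil
import sys
from typing import Optional

# Hypothetical per-platform hints, keyed by sys.platform prefix.
INSTALL_HINTS = {
    "linux": "sudo apt-get install tesseract-ocr",
    "darwin": "brew install tesseract",
    "win32": "choco install tesseract",
}


def tesseract_warning(binary: str = "tesseract", platform: str = sys.platform) -> Optional[str]:
    """Return a warning string if the binary is not on PATH, else None."""
    if shutil.which(binary) is not None:
        return None
    hint = next((cmd for key, cmd in INSTALL_HINTS.items() if platform.startswith(key)), None)
    suffix = f" Install it with: {hint}" if hint else ""
    return f"WARNING: '{binary}' not found on PATH.{suffix}"


message = tesseract_warning()
if message:
    # Warn only; a missing system package should not fail the Python install.
    print(message, file=sys.stderr)
```

Returning a message instead of exiting keeps the check advisory, which matches the "warns" behaviour described above.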
Gates
- ocr-local suite: 29 passed, 1 deselected (live, slow).
- Full default suite: 302 passed, 16 deselected — four adapter suites coexisting cleanly under importlib.
- mypy: clean on 18 source files. lint: 3 contracts kept, 0 broken.
- T036 flipped; C-OCR-LOCAL-001 satisfied inherently (local subprocess, no network).
One minor cleanup (not blocking)
make test-cov now lists every package explicitly as --cov=iris_engine --cov=iris_config --cov=iris_ocr_adi ... --cov=iris_ocr_local, while [tool.coverage.run] source in pyproject.toml also lists the same set. Two sources of truth — adding adapter #5 means editing both, and they can silently drift. Since source already covers it, the Makefile could drop the explicit --cov= flags and just run pytest --cov --cov-report=... (coverage reads source from config). Or keep the Makefile authoritative and thin out the config. Either way, one place. Fold into T037 or a tiny hygiene commit — not worth a round-trip here.
Stack status
This is adapter four of four. With #34/#35/#36 approved and #37 (contract suite) next, WS003's adapter sprint is essentially done. The contract suite is where all four get verified against the same clauses uniformly — reviewing that now.
851653d to ba6181f
94ec43d to e1c8b2a
ba6181f to 98ee039
anmolg1997
left a comment
Re-confirming on 98ee039 — pure rebase onto the updated stack, own diff byte-identical to what I approved at 664cef3 (Tesseract adapter + Makefile install-warning + coverage). Still CLEAN. No re-review needed beyond confirming the rebase introduced nothing.
98ee039 to 2156122
2156122 to 047a1e5
Core adapter (iris-ocr-local):
- TesseractEngine backed by pytesseract wrapping the system tesseract binary; satisfies C-OCR-001 through C-OCR-010 and C-OCR-LOCAL-001 (inherent - no network calls at any point)
- PDF rasterisation via PyMuPDF at 150 DPI, same pattern as iris-ocr-paddleocr
- image_to_data() provides per-word bboxes in pixels and confidence [0-100]; adapter normalises confidence to [0.0-1.0], averages per page
- Words grouped by block_num; blocks joined with double newline in markdown
- Rows with conf == -1 (layout rows) and empty text stripped before mapping
- IRIS_TESSERACT_CMD env var overrides the binary path (Windows / custom paths)
- 28 unit tests, all contract clauses mocked via _pytesseract injection seam
- Live test gated on IRIS_OCR_LIVE_LOCAL=1; result: 182 ms, confidence 0.96, 3 bboxes, markdown "IRIS live test"
Adapter READMEs (T039 partial - all four OCR adapters):
- ocr-adi/README.md: Azure setup, env vars, live test command, limitations
- ocr-datalab/README.md: API key setup, polling flow, bbox/confidence notes
- ocr-paddleocr/README.md: model download, GPU install, offline mode, limits
- ocr-local/README.md: platform install commands, IRIS_TESSERACT_CMD, live test
Makefile:
- install: cross-platform tesseract check via shutil.which after uv sync; prints per-platform install instructions if binary is absent
- distclean (Linux + Windows): note that system packages are not auto-removed
Workspace:
- iris_ocr_local added to coverage sources in root pyproject.toml
- tasks.md: T035 marked complete
README files for adi, datalab, local, and paddleocr are kept untracked locally. They will be added in the T039 PR which covers all adapter documentation together.
extract() was blocking the event loop during PyMuPDF rasterisation and pytesseract.image_to_data() subprocess calls. Wrapped both in asyncio.to_thread so concurrent requests are not frozen during OCR. Also flipped T036 to [x] in tasks.md.
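The effect of this change can be illustrated with a small, self-contained sketch. It does not use the adapter's code: `blocking_ocr` is a hypothetical stand-in for rasterisation or a Tesseract subprocess call, and the heartbeat task exists only to show that the event loop keeps running while the blocking work is on a worker thread.

```python
import asyncio
import time


def blocking_ocr(page: int) -> str:
    """Stand-in for rasterisation or a Tesseract subprocess call."""
    time.sleep(0.05)
    return f"page {page}"


async def extract(pages: list[int]) -> list[str]:
    # Pages run one at a time, but each call lives on a worker thread,
    # so the event loop stays free for other coroutines.
    return [await asyncio.to_thread(blocking_ocr, page) for page in pages]


async def main() -> tuple[list[str], int]:
    ticks = 0

    async def heartbeat() -> None:
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    beat = asyncio.create_task(heartbeat())
    results = await extract([1, 2, 3])
    beat.cancel()
    await asyncio.gather(beat, return_exceptions=True)  # absorb the cancellation
    return results, ticks


results, ticks = asyncio.run(main())
```

If `blocking_ocr` were called directly inside `extract`, the heartbeat would not advance until all pages had finished, which is the freeze the commit describes.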
Correctness: re-apply tesseract_cmd per inference call to prevent global state clobber between concurrent engines; wrap per-page get_pixmap and Image.frombytes in try/except OCRMalformedDocument; replace TIFF while-True frame loop with ImageSequence.Iterator; sort blocks by block_num for correct reading order on multi-column documents; move _map_result inside _run_page try/except so KeyError raises OCRUnavailable not a raw exception.
Infrastructure: add missing --cov flags for all four adapter packages to make test-cov.
Tests: strengthen C-OCR-002 to assert full five-word phrase and total_pages; add per-page content assertions to test_c003; add test_binary_not_found_raises_unavailable.
Documentation: note C-OCR-011 deferral to T038 in module docstring.
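The first correctness item, re-applying `tesseract_cmd` before each call, can be sketched without the real library. Everything below is a stand-in: `fake_pytesseract` plays the role of the module holding the process-wide setting, and `Engine`, `run_page`, and the binary paths are hypothetical names chosen for the illustration.

```python
from types import SimpleNamespace

# Stand-in for the library module: one process-wide setting shared by all engines.
fake_pytesseract = SimpleNamespace(tesseract_cmd="tesseract")


def fake_image_to_data(image: str) -> str:
    """Pretend subprocess call: reports which binary the global pointed at."""
    return f"{fake_pytesseract.tesseract_cmd} <- {image}"


class Engine:
    def __init__(self, tesseract_cmd: str) -> None:
        self._tesseract_cmd = tesseract_cmd

    def run_page(self, image: str) -> str:
        # Re-apply this engine's binary immediately before the call, because
        # another engine may have overwritten the shared global since init.
        fake_pytesseract.tesseract_cmd = self._tesseract_cmd
        return fake_image_to_data(image)


a = Engine("/usr/bin/tesseract")
b = Engine("/opt/tesseract5/bin/tesseract")
first = a.run_page("page-1")
second = b.run_page("page-1")
third = a.run_page("page-2")  # uses a's binary even though b ran in between
```

This protects interleaved calls, not overlapping ones: two threads could still race between the assignment and the call, so engines with different binaries remain unsafe to run truly concurrently.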
pyproject.toml [tool.coverage.run] source already lists all packages; duplicate flags in the Makefile were a second source of truth that would silently drift when a new adapter is added.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
047a1e5 to 5baccdf
anmolg1997
left a comment
Re-approved on 5baccdf — clean rebase onto main after #35 landed, own diff byte-identical to what I reviewed (Tesseract adapter + Makefile + coverage, 677 insertions). All checks green. Clear to merge.
Summary
Implements T036: the Tesseract local OCR adapter. Wraps `pytesseract` (a thin Python subprocess wrapper around the `tesseract` binary). No network calls are ever made; C-OCR-LOCAL-001 is satisfied unconditionally. PDF pages are rasterised by PyMuPDF at 150 DPI, then each image is passed to `pytesseract.image_to_data()`, which returns per-word confidence scores in [0, 100] and axis-aligned bounding boxes.
- `packages/iris-adapters/ocr-local/src/iris_ocr_local/client.py`: `TesseractEngine` - lazy `pytesseract` import with binary-presence check at init; `_to_images()` handles PDF via PyMuPDF and images via `ImageSequence.Iterator`; `_run_page()` re-applies `tesseract_cmd` before each subprocess call to avoid global-state clobber; `_map_result()` groups words by `block_num` in sorted numeric order and normalises per-word confidence to [0.0, 1.0]
- `packages/iris-adapters/ocr-local/src/iris_ocr_local/__init__.py`: re-exports `TesseractEngine`
- `packages/iris-adapters/ocr-local/pyproject.toml`: dependencies (`pytesseract>=0.3.10`, `pymupdf`, `Pillow`, `iris-engine`)
- `packages/iris-adapters/ocr-local/tests/test_unit.py`: 29 unit tests, all inference mocked via `_pytesseract` injection seam; covers C-OCR-001 through C-OCR-010, C-OCR-LOCAL-001, result mapping, block grouping, and bbox conversion
- `packages/iris-adapters/ocr-local/tests/test_live.py`: C-OCR-LIVE-001, gated on `IRIS_OCR_LIVE_LOCAL=1` and `pytest.mark.slow`
- `Makefile`: `make install` now prints a warning when `tesseract` binary is absent; `make test-cov` adds `--cov=iris_ocr_local` to explicit coverage flags
- `pyproject.toml`: `iris_ocr_local` added to `[tool.coverage.run] source`
- `uv.lock`: `pytesseract 0.3.13` added
- `tasks/003-ocr-adapter-set/tasks.md`: T036 marked complete
Task reference
T036 | 003-ocr-adapter-set
Acceptance criteria
T036
- `test_c001_id_is_local`, `test_c001_version_is_semver`
- `test_c002_pdf_extracts_markdown`
- `test_c003_multipage_ordering`, `test_c003_page_numbers_start_at_one`
- `test_c004_bbox_non_negative_xy_positive_wh`
- `test_c005_confidence_in_range`, `test_c005_confidence_is_mean_of_word_scores`, `test_c005_no_words_gives_confidence_zero`
- `test_c006_unsupported_content_type_raises`, `test_c006_error_does_not_contain_bytes`
- `test_c007_malformed_pdf_raises_malformed`
- `test_c008_empty_bytes_raises_malformed`
- `test_c009_adapter_id_in_result`
- `test_c010_png_input_accepted`, `test_c010_jpeg_input_accepted`
- (`tesseract` is a local subprocess, never connects to a network)
- `test_live.py` runs under `IRIS_OCR_LIVE_LOCAL=1`
PR checklist:
- `tasks.md` entry is satisfied.
- `docs-ci` workflow passes (markdown lint + tasks structural check).
Live test result
Run against a 2-page PDF with the system Tesseract binary on my machine.
Notes for the reviewer
Global state in pytesseract.tesseract_cmd
`pytesseract.pytesseract.tesseract_cmd` is a module-level string that the library reads immediately before spawning the subprocess. It is process-wide; two `TesseractEngine` instances configured with different binary paths cannot safely run concurrently. The fix applied here is: `_load_pytesseract()` returns `(module, cmd)`, the engine stores `self._tesseract_cmd`, and `_run_page()` re-writes the global before every subprocess call. This is safe for the single-engine production case. Concurrent multi-binary usage is documented in the module docstring as unsupported.
Block ordering in TSV output
`pytesseract.image_to_data()` returns TSV rows in layout-visit order, not necessarily ascending `block_num` order. For multi-column documents, block 2 may appear before block 1 in the row list. `_map_result()` collects words into `blocks: dict[int, list[str]]` and emits them with `sorted(blocks.items())` to guarantee numeric block order in the output markdown. `test_words_from_multiple_blocks_joined_by_double_newline` and the multipage ordering test both exercise this path.
asyncio.to_thread usage
Both `_to_images()` (PyMuPDF PDF rasterisation) and the `_run_page()` loop (Tesseract subprocess) are CPU-bound and blocking. Both are offloaded via `asyncio.to_thread` so the adapter does not block the event loop during extraction. The lambda captures `pytesseract` and `tesseract_cmd` as local variables before the `to_thread` call to avoid closing over `self` unnecessarily.
IRIS_TESSERACT_CMD env var
On Windows the Tesseract binary is typically at a path not on `PATH` (e.g. `C:\Program Files\Tesseract-OCR\tesseract.exe`). `IRIS_TESSERACT_CMD` overrides `pytesseract.tesseract_cmd` at init. The `tesseract_cmd` constructor parameter takes precedence over the env var; the env var is a deployment-level fallback.
Out of scope additions (beyond T036 acceptance)
- `make install` tesseract binary check: Without `tesseract-ocr` installed at the OS level, the adapter fails at first use with an unhelpful `TesseractNotFoundError`. The check in `make install` surfaces the missing system dependency immediately with install instructions for Ubuntu, macOS, and Windows.
- `--cov=iris_ocr_local` in `make test-cov`: `--cov` flags in `test-cov` override `[tool.coverage.run] source` in `pyproject.toml`. The existing four adapters were covered but `iris_ocr_local` was absent from the flags, so its coverage was silently excluded from the report and the 95% gate.
Considerations
Confidence is a real measured signal for this adapter
Unlike the PaddleOCR-VL adapter (which has no per-block score and returns 1.0 as a placeholder), Tesseract returns a per-word confidence in [0, 100] from its internal classifier. The adapter normalises to [0.0, 1.0] and averages across all valid words per page. Pages with no detected words return `confidence=0.0`. This is the most semantically meaningful confidence value across the four adapters.
Binary must be installed at the OS level
`pytesseract` is a Python package; it wraps the `tesseract` binary via subprocess. The binary itself is not a Python dependency and is not installed by `uv sync`. `make install` now checks for the binary and prints install instructions if absent. Deployment images must include `tesseract-ocr` (and a language pack such as `tesseract-ocr-eng`) as a system package.
Language packs
Tesseract requires a language data file to be installed alongside the binary. The default language is `eng` (English). Multi-language documents require additional packs installed at the OS level. The adapter does not currently expose a `lang` parameter; this is a follow-on item if multi-language support is needed.
DPI choice for PDF rasterisation
150 DPI matches the PaddleOCR adapter default. At 72 DPI the rasterised image is too coarse for Tesseract to produce accurate results. 300 DPI increases image size and per-page inference time approximately 4x relative to 150 DPI. The 150 DPI default is configurable via the `dpi` constructor parameter.
Rebase note
This PR is stacked on #35. Once #35 merges, rebase onto `main` before merging this one.