T035: Implement PaddleOCR-VL-1.6 local OCR adapter#35
Conversation
anmolg1997
left a comment
There was a problem hiding this comment.
Strong adapter — the PaddleOCRVL-vs-transformers investigation and the honest documentation of what the VLM does and doesn't provide are exactly the engineering judgment this task needed. Two changes before merge, one housekeeping item.
Change 1: inference blocks the event loop
extract() is async, but everything inside it is synchronous compute: PyMuPDF rasterisation, np.array(image), and — the big one — pipeline.predict(), which is seconds of CPU/GPU inference per page (your own live test logged ~15s cold). Unlike the ADI and Datalab adapters, where every await yields to the loop during network I/O, this adapter holds the event loop hostage for the full inference duration. In the API or worker process, one PaddleOCR extraction freezes every concurrent request — including /healthz.
best-practices.md section 1.1 sets the async rule for I/O; heavy compute in an async context is its sibling: offload to a thread. The fix is two lines:
async def extract(self, ctx, document_id, content, content_type) -> OCRResult:
...
start = time.monotonic()
images = await asyncio.to_thread(_to_images, content, content_type, self._dpi)
ocr_pages = await asyncio.to_thread(
lambda: [_run_page(self._pipeline, img, page_number=i + 1) for i, img in enumerate(images)]
)
...Existing tests keep passing (asyncio.run drives to_thread fine). One thing to verify when you make the change: PaddlePaddle inference from a non-main thread — the live test will confirm. If the library objects to it, fall back to documenting the constraint and we'll address it at the worker level with a process pool in a later workstream; but try the thread first, it almost always just works.
Change 2: T035 not flipped in tasks.md
tasks/003-ocr-adapter-set/tasks.md still shows - [ ] **T035**. Flip it and — given the offline-mode caveat below — extend the acceptance note the same way you did for T034's bbox limitation:
Note: offline mode (
vl_rec_model_dir) covers the VL recognition sub-model only; layout-detection (PP-DocLayoutV3) and doc-preprocessor sub-models are fetched from PaddlePaddle's CDN on first use unless their model dirs are also supplied or~/.paddlex/official_models/is pre-seeded. Full airgap requires pre-seeding; document in T039.
That caveat is currently only visible in a code comment in _load_pipeline(). It's the difference between "works offline" and "works airgapped", and an ops person reading tasks.md should see it. Please also make sure T039's adapter README covers the pre-seed step (~/.paddlex cache + HF snapshot + TRANSFORMERS_OFFLINE=1).
Housekeeping: the Datalab deadline fix is in the wrong PR
This branch carries the max_poll_seconds fix for the Datalab client that my #34 review requested — but #34's own branch doesn't have it. Stacked-PR mechanics: when #34 merges without it, the fix only lands when #35 merges, and #34's merge commit will contain the unbounded loop. Please cherry-pick max_poll_seconds + its test down into the T034-datalab-adapter branch (it'll then drop out of this PR's diff on rebase). Same for the tests/__init__.py deletions and the pytest.mark.slow additions to the ADI/Datalab live tests if those logically belong with #34's collision fix — your call on which PR owns them, but each fix should be in the PR whose feature it patches.
What I verified (all good)
Implementation
PaddleOCRVLlibrary path overtransformers— right call, and the reasoning (bbox structure vs raw text, C-OCR-004 compliance) is documented in the PR body and the module docstring. The note that the original implementation was rewritten after reading the installed library source is the kind of honesty that makes reviews fast.- Injection seam (
_pipeline) bypasses model loading entirely — unit tests run without paddleocr installed since the import is lazy inside_load_pipeline(). 29 clause-named tests, all mocked via_MockResult/_MockBlockmirroring the realparsing_res_listinterface. - Fail-fast at construction when
offline=Truewithoutmodel_path— raises at init rather than hanging at first use. Tested. - PDF rasterisation at 150 DPI via PyMuPDF with malformed-PDF and zero-page guards; multi-frame TIFF → one page per frame; Pillow decode errors →
OCRMalformedDocument. _rect_to_bboxclamps to non-negative x/y and ≥1 width/height — C-OCR-004 safe.- Confidence fixed at 1.0 (VLM emits no per-block score) with 0.0 for empty results — documented compromise, consistent with how Datalab's missing-score case was handled, clause-tested.
Infra
.gitignorefor the 1.8 GB snapshot, ruff exclude, coverage source extended toiris_ocr_paddleocr, model-download instructions inocr-paddleocr/pyproject.tomlcomments and.env.example.- CI stays fast (0.2-0.3 min/job) — uv caching absorbs the paddle dependency weight.
- Live test gated on both
IRIS_OCR_LIVE_PADDLEOCR=1andpytest.mark.slow, with the documented live result in the PR body (real bboxes from the local snapshot).
With the thread offload + tasks.md flip + fix relocation, this is ready. The three-adapter pattern is now established well enough that T036 (Tesseract) should be the fastest of the four.
92d74f3 to
71aaac3
Compare
a28044b to
8e242ab
Compare
anmolg1997
left a comment
There was a problem hiding this comment.
Approved on 94ec43d. Both code changes landed correctly. One verification gap to close manually before this adapter is wired into the worker — flagging it explicitly rather than letting it ride.
What's fixed
Change 1 — event-loop offload. extract() now wraps both _to_images and the per-page inference loop in asyncio.to_thread(...). Correct shape: the blocking PyMuPDF rasterisation and the seconds-long pipeline.predict() no longer hold the event loop. The pipeline = self._pipeline local before the lambda avoids capturing self in the closure — clean.
Change 2 — tasks.md. T035 flipped to [x] with the airgap caveat verbatim (VL sub-model is local; layout-detection + doc-preprocessor still hit PaddlePaddle's CDN unless ~/.paddlex is pre-seeded; full airgap documented in T039). An ops reader now sees the real offline boundary.
Fix relocation — done right. The Datalab max_poll_seconds fix is no longer in this diff; it's in #34 where it belongs. The pytest.mark.slow markers on the ADI/Datalab live tests and the tests/__init__.py additions are the cross-adapter housekeeping that surfaced with the third suite — fine to ride here, they resolve on rebase once #34 merges.
Verified locally
- PaddleOCR suite: 29 passed, 1 deselected (live, slow).
- Full default suite: 273 passed, 15 deselected — no collision across all four adapter test modules under importlib mode.
make lint: 3 contracts kept, 0 broken.
The one thing to verify before relying on this (NOT blocking the merge)
The "Live test result" block in the PR body shows the ~15s cold-load run — but that run predates the to_thread commit. It exercised the synchronous path. The unit tests that cover the threaded path mock the pipeline, so PaddlePaddle inference from an asyncio.to_thread worker thread has not actually been exercised against the real model.
This is the native-ML-lib edge case I flagged in the last review: Paddle's predict() usually works fine off the main thread, but some builds carry main-thread assumptions (signal handlers, CUDA context affinity, paddle's own thread-local state). It can't be caught in CI — it needs the 1.8 GB snapshot on real hardware.
Action: re-run test_live.py on the current threaded HEAD and confirm it still returns the expected result. If it works (most likely), drop a one-line note in the PR body updating the live-test section. If Paddle objects to the worker thread, the fallback is a single-worker ProcessPoolExecutor or documenting the constraint for the worker-integration task — but try the thread first. Please confirm before T038/worker wiring depends on this path; I don't want an unverified-threading assumption baked into the worker later.
Approving because the code is correct, the pattern is right, and every gate that can run is green. The above is a manual confirmation, not a code change.
Solid work on the offload — the lambda-capture detail especially.
Core adapter (iris-ocr-paddleocr): - PaddleOCREngine backed by PaddleOCRVL pipeline (v1.6), satisfying contract clauses C-OCR-001 through C-OCR-010 and C-OCR-LOCAL-001 - PDF rasterisation via PyMuPDF at 150 DPI; PNG/JPEG/TIFF accepted directly - Offline mode: IRIS_PADDLEOCR_OFFLINE=1 + IRIS_PADDLEOCR_MODEL_PATH point the VL recognition sub-model at a local HuggingFace snapshot directory; startup raises OCRUnavailable immediately if offline flag is set without path - Result mapping from PaddleOCRVLResult (dict-like, parsing_res_list) to OCRPageResult; VLM returns no per-block confidence so confidence=1.0 - BBox format is axis-aligned [x1,y1,x2,y2] rect, not polygon - converted via _rect_to_bbox with negative-coord clamp and min-size=1 enforcement - 29 unit tests, 96% coverage; live test gated on IRIS_OCR_LIVE_PADDLEOCR=1 Fixes carried forward from code review (Datalab / ADI): - DatalabOCREngine._poll() was an unbounded loop; added max_poll_seconds=300 and deadline check mirroring the ADI pattern (test added) - ocr-adi/tests/__init__.py and ocr-datalab/tests/__init__.py deleted; with importlib mode these caused all three adapter test_unit modules to resolve to the same module name, running ADI tests for every adapter - All three adapter live test files now carry pytest.mark.slow alongside their skipif guard so intent is explicit when running the full suite Workspace changes: - Root pyproject.toml: iris_ocr_paddleocr added to coverage sources, mypy overrides, importlinter; ruff exclude added for PaddleOCR-VL-1.6-snapshot - .gitignore: PaddleOCR-VL-1.6-snapshot/ added (1.8 GB model, local-only) - .env.example: PaddleOCR offline mode vars and model download instructions - huggingface_hub>=0.23 added to ocr-paddleocr dependencies
extract() was holding the event loop during PyMuPDF rasterisation and pipeline.predict() (up to ~15s per doc). Wrapped both _to_images and the per-page inference loop in asyncio.to_thread so concurrent requests are not frozen during OCR. Also flipped T035 to [x] in tasks.md and added the offline/airgap caveat about PP-DocLayoutV3 and doc-preprocessor sub-models requiring ~/.paddlex pre-seeding for a true airgap.
94ec43d to
e1c8b2a
Compare
anmolg1997
left a comment
There was a problem hiding this comment.
Re-approving on 773ad0d — pure rebase onto main after #34 merged (the two substantive commits are byte-identical to what I approved; the third is an empty CI-trigger). All 10 checks green, paddleocr suite 29 passed, lint clean, 0 unresolved threads.
Reminder, still non-blocking: the live-test block in the body documents the pre-threading ~15s run. PaddlePaddle inference from an asyncio.to_thread worker thread is still unverified against the real model. Fine to merge now, but please re-run test_live.py on this HEAD and confirm before T038 wires spans around it or the worker calls into it. Bottom of the stack now — merge this, then #36 and #37 rebase behind it.
paddlepaddle was already resolved as a transitive dep via paddleocr/paddlex but was not declared explicitly. A partial-sync of the venv (e.g. during rebase operations) could leave it missing until a full uv sync --all-packages ran. Declaring it explicitly ensures it is always present after any uv sync.
|
Re-ran live test on current HEAD ( Cold model load, Also noted: |
anmolg1997
left a comment
There was a problem hiding this comment.
Re-approving on 8289d4d. Only change since my last approval is pinning paddlepaddle>=3.1 as an explicit direct dependency in ocr-paddleocr/pyproject.toml (it was previously transitive through paddleocr). Correct call — the inference engine is a first-order dependency of this adapter and shouldn't rely on paddleocr's extras to pull it; pinning it makes the lockfile honest and protects against paddleocr ever dropping it. All checks green. Bottom of the stack — merge first.
Summary
Implements T035: the PaddleOCR-VL-1.6 local OCR adapter. Unlike the ADI and Datalab adapters which call remote HTTP services, this adapter runs inference entirely on-device using the
paddleocrlibrary and a pre-downloaded HuggingFace model snapshot. Also includes two correctness fixes in the Datalab adapter and a pytest module-collision fix that was only visible once all three adapter test suites existed side-by-side.packages/iris-adapters/ocr-paddleocr/src/iris_ocr_paddleocr/client.py:PaddleOCREngine- PDF rasterisation via PyMuPDF at 150 DPI, per-page inference viaPaddleOCRVL.predict(), result mapping fromPaddleOCRVLResult.parsing_res_listtoOCRPageResult; offline mode controlled byIRIS_PADDLEOCR_OFFLINE=1+IRIS_PADDLEOCR_MODEL_PATH; raisesOCRUnavailableat init if offline flag is set without model pathpackages/iris-adapters/ocr-paddleocr/src/iris_ocr_paddleocr/__init__.py: re-exportsPaddleOCREnginepackages/iris-adapters/ocr-paddleocr/pyproject.toml: dependencies (paddlepaddle>=3.1,paddleocr[doc-parser]>=3.6.0,pymupdf,huggingface_hub>=0.23,iris-engine);paddlepaddleis now an explicit direct dep so it survives anyuv synccall without requiring--all-packagespackages/iris-adapters/ocr-paddleocr/tests/test_unit.py: 29 unit tests, all inference mocked via_pipelineinjection seam; covers C-OCR-001 through C-OCR-010 and C-OCR-LOCAL-001packages/iris-adapters/ocr-paddleocr/tests/test_live.py: C-OCR-LIVE-001, gated onIRIS_OCR_LIVE_PADDLEOCR=1andpytest.mark.slowpackages/iris-adapters/ocr-datalab/src/iris_ocr_datalab/client.py: addedmax_poll_seconds=300.0+ deadline check in_poll()- was an unbounded loop (carried forward from code review)packages/iris-adapters/ocr-datalab/tests/test_unit.py: addedtest_poll_deadline_exceeded_raises_unavailablecovering the new deadline pathpackages/iris-adapters/ocr-adi/tests/__init__.py: deleted - causes module name collision under--import-mode=importlibpackages/iris-adapters/ocr-datalab/tests/__init__.py: deleted - same reasonpackages/iris-adapters/ocr-adi/tests/test_live.py: addedpytest.mark.slowtopytestmarkpackages/iris-adapters/ocr-datalab/tests/test_live.py: addedpytest.mark.slowtopytestmarkpyproject.toml:iris_ocr_paddleocradded to coverage sources, mypy overrides, importlinter; ruffexcludeadded forPaddleOCR-VL-1.6-snapshot/.gitignore:PaddleOCR-VL-1.6-snapshot/added (1.8 GB model directory, local-only, never committed).env.example: PaddleOCR offline mode vars and model download instructions addeduv.lock: updated for new dependenciesTask reference
T035003-ocr-adapter-setAcceptance criteria
T035
test_c001_id_is_paddleocr,test_c001_version_is_semvertest_c002_pdf_extracts_markdowntest_c003_multipage_ordering,test_c003_page_numbers_start_at_onetest_c004_bbox_non_negative_xy_positive_whtest_c005_confidence_in_range,test_c005_vl_confidence_is_one,test_c005_empty_blocks_defaults_to_onetest_c006_unsupported_content_type_raises,test_c006_error_does_not_contain_bytestest_c007_malformed_pdf_raises_malformedtest_c008_empty_bytes_raises_malformedtest_c009_adapter_id_in_resulttest_c010_png_input_accepted,test_c010_jpeg_input_acceptedtest_local_offline_without_model_path_raises_at_inittest_live.pyruns underIRIS_OCR_LIVE_PADDLEOCR=1PR checklist:
tasks.mdentry is satisfied.docs-ciworkflow passes (markdown lint + tasks structural check).Live test result
Run against the local
PaddleOCR-VL-1.6-snapshotin offline mode with a PNG fixture.All pipeline stages loaded from cache. Layout detection (PP-DocLayoutV3) used the ModelScope-cached copy at
~/.paddlex/official_models/PP-DocLayoutV3/. VL recognition used the local HF snapshot atPaddleOCR-VL-1.6-snapshot/.Notes for the reviewer
PaddleOCR library path, not transformers
Two inference paths exist for PaddleOCR-VL-1.6. The
transformerspath (AutoModelForImageTextToText) returns raw generated text only with no bounding box structure, meaning bboxes would always be[]. Thepaddleocrlibrary path (PaddleOCRVL) returns structured output with real bboxes. C-OCR-004 requires valid bounding box coordinates, so thepaddleocrlibrary path was the only viable choice.VLM result structure differs from classic OCR
Classic PaddleOCR returns
rec_texts,rec_scores, anddt_polysarrays. PaddleOCR-VL-1.6 returns aPaddleOCRVLResultthat is dict-like:result["parsing_res_list"]yields a list ofPaddleOCRVLBlockobjects, each with.label,.bbox([x1, y1, x2, y2]integers), and.content(text string). The original implementation assumed the classic structure and was completely rewritten once the real API was read from the installed library source. Unit test mocks (_MockResult,_MockBlock) reflect the actual interface.Offline mode and model path
vl_rec_model_diris the correct parameter name for pointing the VL recognition sub-model at a local directory. The parameter namemodel_dir(a guess based on classic PaddleOCR) is rejected by the library at runtime withValueError: Unknown argument: model_dir.vl_rec_model_diris only passed when bothoffline=Trueandmodel_pathis set; online mode callsPaddleOCRVL(pipeline_version="v1.6")with no model path argument.BBox format is axis-aligned rect, not polygon
Classic PaddleOCR returns 4-point polygons (
[[x0,y0],[x1,y1],[x2,y2],[x3,y3]]). PaddleOCR-VL-1.6 returns axis-aligned rectangles ([x1, y1, x2, y2]). The original_poly_to_bboxfunction was replaced with_rect_to_bbox. Negative coordinates are clamped to 0 and degenerate rects (x1==x2 or y1==y2) get a minimum size of 1 to satisfy C-OCR-004.Out of scope additions (beyond T035 acceptance)
DatalabOCREngine._poll()deadline guard_poll()was an unboundedwhile Trueloop. Thehttpxtimeout covers individual requests but not the loop. A stuck Datalab job would hang the caller forever. ADI hadmax_poll_secondsfrom the start; Datalab was missing it. Found during cross-adapter code review while implementing T035.test_poll_deadline_exceeded_raises_unavailablein Datalab testsocr-adi/tests/__init__.pyandocr-datalab/tests/__init__.py--import-mode=importlib, pytest walks up from the test file looking for__init__.py. All three adapter test dirs had__init__.pybut their parent package dirs did not, so all three resolved totests.test_unit. ADI's module was cached first; Datalab and PaddleOCR tests ran ADI code. Deleting__init__.pygives each file a unique hash-based module name. This reversed the fix added in T034 - the T034 fix was the wrong fix.pytest.mark.slowon ADI and Datalab live testsskipifguard and is consistent across all three adapters.excludeforPaddleOCR-VL-1.6-snapshot/configuration_paddleocr_vl.py,image_processing_paddleocr_vl.py) that fail ruff checks. These are upstream files and should not be linted..gitignoreentry forPaddleOCR-VL-1.6-snapshot/Considerations
Confidence is always 1.0 for this adapter
The VLM backbone does not produce a token-level confidence score per detected region. There is no numeric quality signal available from the pipeline. The adapter returns
confidence=1.0as a placeholder.test_c005_vl_confidence_is_onedocuments this behaviour explicitly. If the contract is later tightened to require a real signal, this adapter would need either a different model or a post-hoc heuristic (e.g. character recognition score from a secondary pass).PP-DocLayoutV3 sub-model is not covered by IRIS_PADDLEOCR_OFFLINE=1
IRIS_PADDLEOCR_OFFLINE=1prevents HuggingFace network calls only. The PaddleOCR-VL pipeline internally uses a second model, PP-DocLayoutV3 (layout detection), which auto-downloads from ModelScope (Alibaba) on first use. For a fully airgapped deployment, PP-DocLayoutV3 must also be pre-staged locally and passed vialayout_detection_model_dir. This is a follow-on item. The current env var set is sufficient for development and for environments where ModelScope is reachable.GPU is required for production throughput
CPU inference is measured at approximately 7-8 minutes per page (tested on a 2-page, 97 KB PDF on the dev machine without a GPU). The production Docker image must install
paddlepaddle-gpufrom the PaddlePaddle custom index after the base install. The basemake installinstalls the CPU wheel only, which is correct for CI and local development.paddlepaddle-gpu is not on PyPI
PaddlePaddle distributes GPU wheels from their own index, not PyPI. A
[gpu]optional extra was attempted butuv syncfailed because the resolver cannot reach non-PyPI sources declared in extras. GPU installation must be handled as a post-step in the production Docker image and is documented in.env.example.DPI choice for PDF rasterisation
150 DPI is the default. 72-96 DPI produces images too small for reliable VLM reading on dense documents. 300 DPI increases image size and inference time significantly. 150 DPI is configurable via the
dpiconstructor parameter if a specific deployment requires adjustment.Rebase note
Rebased onto
main(T034 merged as #34). Stack is clean.Live threading verification
Confirmed on
8289d4d(HEAD, with explicitpaddlepaddledep):IRIS_OCR_LIVE_PADDLEOCR=1 uv run pytest -m slow packages/iris-adapters/ocr-paddleocr/tests/test_live.py- 1 passed in 40.61 s (cold model load). PaddlePaddle inference runs correctly from anasyncio.to_threadworker. The concern flagged by Anmol is resolved.