Background
The initial DoclingTableRefiner (issue #16) runs Docling on the whole PDF whenever the refiner is triggered. This is simple and reliable, but the eval data shows Docling takes ~22 s/page on CPU (no OCR, TableFormer FAST), with full-doc conversions ranging from ~1.5 min (5-page MMWR) to ~9 min (15-page WHO mpox sitrep, 12-page ECDC CDTR).
For longer reports, or for pipelines that touch many PDFs per cron run, this wall-clock time may become a problem. Docling supports DocumentConverter.convert(source, page_range=(start, end)) to limit conversion to a page slice. Restricting Docling to just the pages that contain suspect tables would cut typical refinement cost from minutes to ~22 s per suspect page, i.e. ~22 s in total when a single page is affected.
The eval explored this: the per-page cost test in scripts/eval_docling_ocr_cost.py confirmed that page_range works and produces consistent per-page timings.
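
For illustration, a per-page timing loop could look roughly like the sketch below. This is not the contents of scripts/eval_docling_ocr_cost.py. The helper name time_single_pages is hypothetical, and the converter is passed in as a plain callable so the loop can be exercised without Docling installed. It assumes page_range is 1-based and inclusive, which should be checked against the installed Docling version.

```python
import time
from typing import Callable, Dict, Iterable


def time_single_pages(
    convert: Callable[..., object], source: str, pages: Iterable[int]
) -> Dict[int, float]:
    """Convert each page on its own and return {page_number: seconds}.

    `convert` is any callable that accepts (source, page_range=(start, end)).
    Assumption: page numbers are 1-based and the range is inclusive.
    """
    timings: Dict[int, float] = {}
    for page in pages:
        started = time.perf_counter()
        convert(source, page_range=(page, page))  # single-page slice
        timings[page] = time.perf_counter() - started
    return timings


# Possible usage with Docling (requires the docling package):
# from docling.document_converter import DocumentConverter
# timings = time_single_pages(DocumentConverter().convert, "report.pdf", [3, 7])
```

Reusing one DocumentConverter instance across calls, as in the commented usage, should keep model loading out of all but the first measurement.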
Scope
Once the main refiner is in production, gather data from real traffic and decide whether to switch to per-page conversion.
- Measure the wall-clock distribution of refiner runs over real ingest traffic for ~2 weeks. Capture:
  - Per-doc total refiner time
  - Page count
  - Number of "suspect" table sections detected
  - Their page numbers
- Decide the threshold: at what doc size does full-doc Docling become unacceptable? (Probably when median refiner time exceeds some user-facing budget — TBD.)
- If the data justifies it, implement per-page mode:
  - For allowlist-triggered docs: still run full-doc (no in-tree table to identify suspect pages from).
  - For heuristic-triggered docs: collect the set of page numbers from suspect SectionContents, call convert(source, page_range=(min_page, max_page)) or per-page if non-contiguous.
  - Merge the page-scoped Docling result back into ParsedContent using the same page-number matching the main issue introduced.
- Regression test: ensure per-page mode produces equivalent table content to full-doc mode on the eval PDFs (MMWR, cholera).
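
One way the heuristic-triggered path could choose its convert calls is sketched below, under stated assumptions. It collapses the suspect page numbers into contiguous runs and issues one call per run, so contiguous pages share a single (min_page, max_page) range and isolated pages are converted on their own. The names contiguous_ranges and convert_suspect_pages are hypothetical, the converter is injected as a callable, and page_range is assumed to be 1-based and inclusive. The merge back into ParsedContent is not covered.

```python
from typing import Callable, Iterable, List, Tuple

PageRange = Tuple[int, int]


def contiguous_ranges(pages: Iterable[int]) -> List[PageRange]:
    """Collapse page numbers into sorted, inclusive (start, end) runs."""
    ranges: List[PageRange] = []
    for page in sorted(set(pages)):  # de-duplicate and order the pages
        if ranges and page == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], page)  # extend the current run
        else:
            ranges.append((page, page))  # start a new run
    return ranges


def convert_suspect_pages(
    convert: Callable[..., object], source: str, suspect_pages: Iterable[int]
) -> List[Tuple[PageRange, object]]:
    """Call convert once per contiguous run and pair each result with its range."""
    return [
        (page_range, convert(source, page_range=page_range))
        for page_range in contiguous_ranges(suspect_pages)
    ]


print(contiguous_ranges([7, 3, 4, 4, 9]))  # [(3, 4), (7, 7), (9, 9)]
```

Keeping each result paired with the range that produced it leaves the page numbers available for the page-number matching in the merge step.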
Definition of done
- […] docling_per_page_mode: bool), regression-tested against eval PDFs.
Out of scope
- GPU/sidecar Docling host. That's a deployment decision, separate from the page-range optimisation.
References
- data/docling_eval/ocr/per_page_cost.json
- data/docling_eval/FINDINGS.md