Skip to content

Optimise DoclingTableRefiner: per-page Docling re-run for large PDFs #17

@smodee

Description

@smodee

Background

The initial DoclingTableRefiner (issue #16) runs Docling on the whole PDF whenever the refiner is triggered. This is simple and reliable, but the eval data shows Docling takes ~22 s/page on CPU (no OCR, TableFormer FAST), with full-doc conversions ranging from ~1.5 min (5-page MMWR) to ~9 min (15-page WHO mpox sitrep, 12-page ECDC CDTR).

For longer reports — or for pipelines that touch many PDFs per cron run — this wall-clock may become a problem. Docling supports DocumentConverter.convert(source, page_range=(start, end)) to limit conversion to a page slice. Restricting Docling to just the pages that contain suspect tables would cut typical refinement cost from minutes to ~22 s.

The eval explored this: per-page cost test in scripts/eval_docling_ocr_cost.py confirmed page_range works and produces consistent per-page timings.

Scope

Once the main refiner is in production, gather some data and decide whether to switch.

  1. Measure wall-clock distribution of refiner runs over real ingest traffic for ~2 weeks. Capture:
    • Per-doc total refiner time
    • Page count
    • Number of "suspect" table sections detected
    • Their page numbers
  2. Decide the threshold: at what doc size does full-doc Docling become unacceptable? (Probably when median refiner time exceeds some user-facing budget — TBD.)
  3. If the data justifies it, implement per-page mode:
    • For allowlist-triggered docs: still run full-doc (no in-tree table to identify suspect pages from).
    • For heuristic-triggered docs: collect the set of page numbers from suspect SectionContents, call convert(source, page_range=(min_page, max_page)) or per-page if non-contiguous.
    • Merge the page-scoped Docling result back into ParsedContent using the same page-number matching the main issue introduced.
  4. Regression test: ensure per-page mode produces equivalent table content to full-doc mode on the eval PDFs (MMWR, cholera).

Definition of done

  • Two weeks of refiner-time telemetry collected after the main issue ships.
  • Decision documented in this issue: per-page or stay full-doc?
  • If per-page: implemented behind a config flag (docling_per_page_mode: bool), regression-tested against eval PDFs.

Out of scope

  • GPU/sidecar Docling host. That's a deployment decision, separate from the page-range optimisation.

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or requestpost-benchmarkRevisit after first end-to-end benchmark run

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions