Optimise DoclingTableRefiner: per-page Docling re-run for large PDFs

## Background

The initial `DoclingTableRefiner` (issue #16) runs Docling on the whole PDF whenever the refiner is triggered. This is simple and reliable, but the eval data shows Docling takes **~22 s/page on CPU** (no OCR, TableFormer FAST), with full-doc conversions ranging from ~1.5 min (5-page MMWR) to ~9 min (15-page WHO mpox sitrep, 12-page ECDC CDTR).

For longer reports — or for pipelines that touch many PDFs per cron run — this wall-clock may become a problem. Docling supports `DocumentConverter.convert(source, page_range=(start, end))` to limit conversion to a page slice. Restricting Docling to just the pages that contain suspect tables would cut typical refinement cost from minutes to ~22 s.

The eval explored this: per-page cost test in [`scripts/eval_docling_ocr_cost.py`](scripts/eval_docling_ocr_cost.py) confirmed `page_range` works and produces consistent per-page timings.

## Scope

Once the main refiner is in production, gather some data and decide whether to switch.

1. **Measure wall-clock distribution** of refiner runs over real ingest traffic for ~2 weeks. Capture:
   - Per-doc total refiner time
   - Page count
   - Number of "suspect" table sections detected
   - Their page numbers
2. **Decide the threshold**: at what doc size does full-doc Docling become unacceptable? (Probably when median refiner time exceeds some user-facing budget — TBD.)
3. **If the data justifies it, implement per-page mode**:
   - For allowlist-triggered docs: still run full-doc (no in-tree table to identify suspect pages from).
   - For heuristic-triggered docs: collect the set of page numbers from suspect `SectionContent`s, call `convert(source, page_range=(min_page, max_page))` or per-page if non-contiguous.
   - Merge the page-scoped Docling result back into `ParsedContent` using the same page-number matching the main issue introduced.
4. **Regression test**: ensure per-page mode produces equivalent table content to full-doc mode on the eval PDFs (MMWR, cholera).

## Definition of done

- [ ] Two weeks of refiner-time telemetry collected after the main issue ships.
- [ ] Decision documented in this issue: per-page or stay full-doc?
- [ ] If per-page: implemented behind a config flag (`docling_per_page_mode: bool`), regression-tested against eval PDFs.

## Out of scope

- GPU/sidecar Docling host. That's a deployment decision, separate from the page-range optimisation.

## References

- Main refiner issue: #16
- Per-page cost data: [`data/docling_eval/ocr/per_page_cost.json`](data/docling_eval/ocr/per_page_cost.json)
- Eval write-up: [`data/docling_eval/FINDINGS.md`](data/docling_eval/FINDINGS.md)


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimise DoclingTableRefiner: per-page Docling re-run for large PDFs #17

Background

Scope

Definition of done

Out of scope

References

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Optimise DoclingTableRefiner: per-page Docling re-run for large PDFs #17

Description

Background

Scope

Definition of done

Out of scope

References

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions