Add optional TableFormer skip for faster table processing #20

Merged: artiz merged 1 commit into master from claude/focused-curie-zkkv2p on Jul 1, 2026
@artiz artiz commented Jul 1, 2026

Summary

Add a no_table_former option throughout the conversion pipeline so users can skip loading and running the TableFormer ONNX model for table structure prediction. When the option is set, tables fall back to geometric reconstruction from cell positions, trading table fidelity for faster processing (no model load, no per-table inference).
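
To make the fallback concrete, here is a minimal sketch of what "geometric reconstruction from cell positions" can look like: cells are grouped into rows by vertical position and ordered left to right within each row. The `Cell` type, the `reconstruct_grid` function, and the `row_tol` tolerance are hypothetical names for illustration; the crate's actual fallback may work differently.

```rust
/// A detected table cell: its text plus the top-left corner of its bounding box.
/// Hypothetical type for illustration, not the crate's real representation.
#[derive(Debug, Clone)]
struct Cell {
    text: String,
    x: f32,
    y: f32,
}

/// Group cells into rows by vertical position, then order each row left to right.
/// A cell joins the current row when its `y` is within `row_tol` of the row's first cell.
fn reconstruct_grid(mut cells: Vec<Cell>, row_tol: f32) -> Vec<Vec<String>> {
    // Sort top to bottom so that cells of the same row end up adjacent.
    cells.sort_by(|a, b| a.y.total_cmp(&b.y));

    let mut rows: Vec<Vec<Cell>> = Vec::new();
    for cell in cells {
        let same_row = rows
            .last()
            .map_or(false, |row| (cell.y - row[0].y).abs() <= row_tol);
        if same_row {
            rows.last_mut().unwrap().push(cell);
        } else {
            rows.push(vec![cell]);
        }
    }

    // Within each row, order cells by horizontal position and keep only the text.
    rows.into_iter()
        .map(|mut row| {
            row.sort_by(|a, b| a.x.total_cmp(&b.x));
            row.into_iter().map(|c| c.text).collect()
        })
        .collect()
}
```

A sketch like this recovers a plain grid but has no notion of merged cells or header rows, which is the kind of fidelity given up when the model is skipped.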

Key Changes

  • PDF Pipeline (fleischwolf-pdf):

    • Add no_table_former field to Pipeline struct with builder method
    • Modify Worker::load() to accept no_table_former parameter and conditionally skip TableFormer initialization
    • Create convert_with_options(), convert_image_with_options(), and convert_pages_with_options() functions that accept the flag
    • Update existing convenience functions to delegate to new _with_options variants
    • Export new convert_mets_gbs_with_options() from mets module
  • METS/GBS Support (fleischwolf-pdf/mets.rs):

    • Add convert_mets_gbs_with_options() function with no_table_former parameter
    • Refactor existing convert_mets_gbs() to delegate to new function
  • Document Converter (fleischwolf):

    • Add no_table_former field to DocumentConverter struct
    • Add no_table_former() builder method with documentation
    • Thread the flag through to PDF backend calls and streaming pipeline
    • Update all PDF conversion paths to use _with_options variants
  • Streaming Support (fleischwolf/stream.rs):

    • Pass no_table_former through spawn and run functions
    • Apply flag to Pipeline::new() in PDF streaming path
  • CLI (fleischwolf-cli):

    • Add --no-table-former command-line flag
    • Thread flag through to DocumentConverter and benchmark warm-up function
    • Update usage documentation and help text
  • Documentation (README.md):

    • Document the new --no-table-former CLI option and its performance/fidelity tradeoff
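
The builder method and the delegation from existing convenience functions to the `_with_options` variants follow a common pattern, sketched below. The names `Pipeline`, `no_table_former`, `convert`, and `convert_with_options` come from the list above, but the signatures, the `&str` input, and the `String` return value are assumptions made for this example; the real functions take and return the crate's own types.

```rust
/// Simplified stand-in for the PDF pipeline; only the new flag is modeled.
#[derive(Debug, Default, Clone)]
struct Pipeline {
    no_table_former: bool,
}

impl Pipeline {
    fn new() -> Self {
        Self::default()
    }

    /// Builder method: consume the pipeline, set the flag, return it for chaining.
    fn no_table_former(mut self, skip: bool) -> Self {
        self.no_table_former = skip;
        self
    }
}

/// New variant that accepts the flag. The returned string only reports which
/// table path would be taken (assumed shape, for illustration).
fn convert_with_options(input: &str, no_table_former: bool) -> String {
    let pipeline = Pipeline::new().no_table_former(no_table_former);
    let mode = if pipeline.no_table_former {
        "geometric"
    } else {
        "tableformer"
    };
    format!("{input}: tables via {mode}")
}

/// Existing convenience function: same signature as before, now delegating
/// with the default (TableFormer enabled), so current callers are unaffected.
fn convert(input: &str) -> String {
    convert_with_options(input, false)
}
```

Keeping the old function as a thin wrapper means callers opt in to the faster path explicitly and nothing changes for code that never mentions the flag.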

Implementation Details

  • The flag is applied at worker initialization time, so it must be set before the first conversion
  • All public conversion entry points now have both standard and _with_options variants for backward compatibility
  • The implementation uses conditional initialization rather than runtime checks, minimizing overhead when TableFormer is skipped
  • Streaming mode benefits particularly from this optimization since it processes pages incrementally
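
The first and third points can be illustrated with a small sketch in which the worker holds the model as an `Option` that is filled in only once, at load time. `Worker::load()` and its `no_table_former` parameter are named in this PR; the `TableFormer` stand-in, the `MODEL_LOADS` counter, and `table_structure()` are hypothetical and exist only to make the behavior observable.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Counts model loads so the effect of the flag can be observed (illustration only).
static MODEL_LOADS: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for the TableFormer ONNX model; loading is assumed to be expensive.
struct TableFormer;

impl TableFormer {
    fn load() -> Self {
        MODEL_LOADS.fetch_add(1, Ordering::SeqCst);
        TableFormer
    }

    fn predict(&self, cell_count: usize) -> String {
        format!("tableformer structure over {cell_count} cells")
    }
}

struct Worker {
    /// `None` when the worker was loaded with `no_table_former = true`.
    table_former: Option<TableFormer>,
}

impl Worker {
    /// The flag is consulted once, here. Changing it later has no effect on
    /// a worker that already exists, which is why it must be set up front.
    fn load(no_table_former: bool) -> Self {
        let table_former = if no_table_former {
            None
        } else {
            Some(TableFormer::load())
        };
        Worker { table_former }
    }

    fn table_structure(&self, cell_count: usize) -> String {
        match &self.table_former {
            Some(model) => model.predict(cell_count),
            None => format!("geometric fallback over {cell_count} cells"),
        }
    }
}
```

In this sketch the only per-table cost of the option is a match on an `Option`; the model load and the inference call are what get skipped.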

https://claude.ai/code/session_01StCE48TQ4sw2kQ2zxveNa2

… parsing

Threads a no_table_former flag from the CLI through DocumentConverter,
the streaming path, and the PDF Pipeline down to the per-worker model
load, so tables fall back to geometric reconstruction instead of
running the TableFormer ONNX model. Speeds up parsing (most notably in
streaming mode) when exact table structure isn't needed.
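
As a rough sketch of the first hop in that chain, the snippet below turns the `--no-table-former` command-line flag into a `DocumentConverter` setting. Only the flag name and the `no_table_former()` builder method come from this PR; the hand-rolled argument scan and the `converter_from_args` helper are assumptions for illustration, and the real CLI may parse its arguments differently.

```rust
/// Simplified stand-in for the document converter; only the new flag is modeled.
#[derive(Debug, Default)]
struct DocumentConverter {
    no_table_former: bool,
}

impl DocumentConverter {
    fn new() -> Self {
        Self::default()
    }

    /// Builder method mirroring the one added in this PR.
    fn no_table_former(mut self, skip: bool) -> Self {
        self.no_table_former = skip;
        self
    }
}

/// Hypothetical helper: scan the arguments for the flag and configure the converter.
fn converter_from_args<I, S>(args: I) -> DocumentConverter
where
    I: IntoIterator<Item = S>,
    S: AsRef<str>,
{
    let skip = args
        .into_iter()
        .any(|arg| arg.as_ref() == "--no-table-former");
    DocumentConverter::new().no_table_former(skip)
}
```

In a real binary the arguments would come from `std::env::args().skip(1)`, and the converter would pass the flag on to the streaming path and the PDF pipeline as described above.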
@artiz artiz merged commit abceed8 into master Jul 1, 2026
3 checks passed
@artiz artiz deleted the claude/focused-curie-zkkv2p branch July 1, 2026 16:25