Add optional TableFormer skip for faster table processing #20
Merged
… parsing

Threads a `no_table_former` flag from the CLI through `DocumentConverter`, the streaming path, and the PDF Pipeline down to the per-worker model load, so tables fall back to geometric reconstruction instead of running the TableFormer ONNX model. Speeds up parsing (most notably in streaming mode) when exact table structure isn't needed.
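The flag's effect at the worker level can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: the `TableFormer` stand-in, the `Cell` type, the `uses_model()` helper, the `geometric_rows()` function, and its `y_tolerance` parameter are all hypothetical, and only the idea of `Worker::load()` taking a `no_table_former` parameter comes from the description above. The geometric fallback shown here simply groups cells into rows by vertical position; the actual reconstruction may differ.

```rust
/// Hypothetical stand-in for the loaded TableFormer ONNX session.
struct TableFormer;

impl TableFormer {
    /// Assumed to be the expensive step that the flag lets a worker skip.
    fn load() -> Self {
        TableFormer
    }
}

/// Hypothetical table cell: its text plus the top-left corner of its box.
#[derive(Clone, Debug)]
pub struct Cell {
    pub text: String,
    pub x: f32,
    pub y: f32,
}

pub struct Worker {
    table_former: Option<TableFormer>,
}

impl Worker {
    /// Loads the model only when the caller has not asked to skip it.
    pub fn load(no_table_former: bool) -> Self {
        let table_former = if no_table_former {
            None
        } else {
            Some(TableFormer::load())
        };
        Worker { table_former }
    }

    /// True when table structure would come from the model.
    pub fn uses_model(&self) -> bool {
        self.table_former.is_some()
    }
}

/// Illustrative geometric fallback: cells whose `y` lies within
/// `y_tolerance` of the first cell in a row share that row, and each
/// row is ordered left to right by `x`.
pub fn geometric_rows(cells: &[Cell], y_tolerance: f32) -> Vec<Vec<String>> {
    let mut sorted: Vec<&Cell> = cells.iter().collect();
    sorted.sort_by(|a, b| a.y.total_cmp(&b.y));

    let mut rows: Vec<(f32, Vec<&Cell>)> = Vec::new();
    for cell in sorted {
        let start_new_row = match rows.last() {
            Some((row_y, _)) => (cell.y - *row_y).abs() > y_tolerance,
            None => true,
        };
        if start_new_row {
            rows.push((cell.y, Vec::new()));
        }
        // A row always exists here because one is pushed when `rows` is empty.
        rows.last_mut().unwrap().1.push(cell);
    }

    rows.into_iter()
        .map(|(_, mut row)| {
            row.sort_by(|a, b| a.x.total_cmp(&b.x));
            row.into_iter().map(|c| c.text.clone()).collect::<Vec<String>>()
        })
        .collect()
}
```

Keeping the model in an `Option` means the skip decision is made once at load time, so a worker that never loads the model pays neither the load cost nor any per-table inference cost.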
Summary
Add a `no_table_former` option throughout the conversion pipeline so users can skip loading and running the TableFormer ONNX model for table structure prediction. When TableFormer is skipped, tables fall back to geometric reconstruction from cell positions, trading table fidelity for faster processing (no model load, no per-table inference).

Key Changes

PDF Pipeline (`fleischwolf-pdf`):
- Added a `no_table_former` field to the `Pipeline` struct, with a builder method
- Updated `Worker::load()` to accept a `no_table_former` parameter and conditionally skip TableFormer initialization
- Added `convert_with_options()`, `convert_image_with_options()`, and `convert_pages_with_options()` functions that accept the flag
- Existing functions delegate to the `_with_options` variants
- Exported `convert_mets_gbs_with_options()` from the mets module

METS/GBS Support (`fleischwolf-pdf/mets.rs`):
- Added a `convert_mets_gbs_with_options()` function with a `no_table_former` parameter
- Updated `convert_mets_gbs()` to delegate to the new function

Document Converter (`fleischwolf`):
- Added a `no_table_former` field to the `DocumentConverter` struct
- Added a `no_table_former()` builder method with documentation
- Conversion calls use the `_with_options` variants

Streaming Support (`fleischwolf/stream.rs`):
- Threaded `no_table_former` through the spawn and run functions
- Applied the flag to `Pipeline::new()` in the PDF streaming path

CLI (`fleischwolf-cli`):
- Added a `--no-table-former` command-line flag
- Passed the flag to `DocumentConverter` and the benchmark warm-up function

Documentation (`README.md`):
- Documented the `--no-table-former` CLI option and its performance/fidelity tradeoff

Implementation Details
- Existing functions delegate to the `_with_options` variants for backward compatibility

https://claude.ai/code/session_01StCE48TQ4sw2kQ2zxveNa2
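The backward-compatibility pattern described above can be sketched as follows. This is a simplified illustration under assumptions, not the crate's actual API: the function signatures, the `TableMode` return type, and the `bool` argument on the builder method are hypothetical, and the real `convert_with_options()` and `DocumentConverter::no_table_former()` may take different parameters and return a parsed document.

```rust
/// Hypothetical marker for which table path a conversion would use.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum TableMode {
    TableFormer,
    Geometric,
}

/// New entry point that accepts the flag (signature is an assumption).
pub fn convert_with_options(_path: &str, no_table_former: bool) -> TableMode {
    if no_table_former {
        TableMode::Geometric
    } else {
        TableMode::TableFormer
    }
}

/// The original entry point keeps its signature and delegates with the
/// default (TableFormer enabled), so existing callers are unaffected.
pub fn convert(path: &str) -> TableMode {
    convert_with_options(path, false)
}

/// Hypothetical, pared-down converter that carries the flag as a field.
#[derive(Default)]
pub struct DocumentConverter {
    no_table_former: bool,
}

impl DocumentConverter {
    pub fn new() -> Self {
        Self::default()
    }

    /// Builder-style setter: consumes and returns the converter so calls chain.
    pub fn no_table_former(mut self, skip: bool) -> Self {
        self.no_table_former = skip;
        self
    }

    /// Forwards the stored flag to the `_with_options` variant.
    pub fn convert(&self, path: &str) -> TableMode {
        convert_with_options(path, self.no_table_former)
    }
}
```

With this shape, a caller opts in with `DocumentConverter::new().no_table_former(true)`, while code that never mentions the flag keeps the previous behavior.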