Improve TableFormer model loading and add memory pattern optimization by artiz · Pull Request #18 · artiz/fleischwolf

artiz · 2026-07-01T13:58:26Z

Summary

This PR improves the TableFormer ONNX model loading experience by adding better diagnostics for missing models and optimizing ONNX Runtime memory usage patterns. It also ensures environment variables are properly exported in setup scripts so TableFormer works correctly regardless of the process's working directory.

Key Changes

- Added warn_missing_once() function: Emits a single per-process warning to stderr when the TableFormer models aren't found, explaining the fallback to geometric reconstruction and directing users to set the environment variables. This replaces silent failures, which were particularly problematic for embedding applications and for processes not run from the repo root.
- Optimized ONNX Runtime memory patterns: Modified the build() closure to accept a mem_pattern parameter and disabled memory pattern optimization for the decoder session. The decoder's KV-cache grows by one entry per autoregressive step, so its input shapes differ on every run() call; disabling the optimization avoids re-validating the memory plan on each step.
- Updated setup and test scripts: Added explicit exports of the DOCLING_TABLEFORMER_ENCODER, DOCLING_TABLEFORMER_DECODER, and DOCLING_TABLEFORMER_BBOX environment variables in:
  - scripts/pdf_setup.sh (conditional export when models exist)
  - scripts/pdf_conformance.sh (with defaults)
  - scripts/pdf_groundtruth.sh (with defaults)
  - scripts/performance.sh (conditional export when models exist)
- Updated README.md: Added documentation explaining the optional TableFormer environment variables, the silent fallback behavior, and why explicit configuration is recommended when invoking from directories other than the repo root.
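
To make the CWD dependence described above concrete, here is a minimal sketch of how an env-var override with a relative default might be resolved. The helper names (`resolve_model_path`, `absolutize`) and the per-file default names under `models/tableformer/` are assumptions for illustration; they are not taken from tableformer.rs.

```rust
use std::env;
use std::path::{Path, PathBuf};

/// Hypothetical helper: the environment variable wins when it is set and
/// non-empty; otherwise fall back to the relative default.
fn resolve_model_path(var: &str, default_rel: &str) -> PathBuf {
    match env::var_os(var) {
        Some(v) if !v.is_empty() => PathBuf::from(v),
        _ => PathBuf::from(default_rel),
    }
}

/// Hypothetical helper: anchor a relative path at `base`, so the lookup no
/// longer depends on where the process was started.
fn absolutize(path: &Path, base: &Path) -> PathBuf {
    if path.is_absolute() {
        path.to_path_buf()
    } else {
        base.join(path)
    }
}

fn main() {
    let cwd = env::current_dir().expect("current directory should be readable");
    // Default file names are assumed; the PR only states "models/tableformer/*.onnx".
    let models = [
        ("DOCLING_TABLEFORMER_ENCODER", "models/tableformer/encoder.onnx"),
        ("DOCLING_TABLEFORMER_DECODER", "models/tableformer/decoder.onnx"),
        ("DOCLING_TABLEFORMER_BBOX", "models/tableformer/bbox.onnx"),
    ];
    for (var, default_rel) in models {
        let path = absolutize(&resolve_model_path(var, default_rel), &cwd);
        println!("{var} -> {} (exists: {})", path.display(), path.exists());
    }
}
```

Run from two different directories with the variables unset, the printed paths differ, which is the failure mode the explicit exports are meant to rule out.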

Implementation Details

- The decoder's memory pattern optimization is disabled (mem_pattern=false) while the encoder and bbox sessions keep it enabled (mem_pattern=true), reflecting the decoder's dynamic input shapes.
- The warning uses std::sync::Once to ensure it fires exactly once per process, avoiding log spam while still making the issue visible.
- The conditional exports in pdf_setup.sh and performance.sh only set paths when the model files actually exist, and pdf_conformance.sh and pdf_groundtruth.sh fall back to defaults, maintaining backward compatibility.
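
The once-per-process behavior can be sketched with std::sync::Once as follows. This is an illustration under assumptions, not the code in tableformer.rs: the signature, the boolean return value, and the message wording are all hypothetical.

```rust
use std::path::Path;
use std::sync::Once;

// One guard per process; every caller shares it.
static WARN_MISSING: Once = Once::new();

/// Hypothetical signature: prints the note on the first call only and
/// returns true if this particular call was the one that printed it.
fn warn_missing_once(checked: &[&Path]) -> bool {
    let mut fired = false;
    WARN_MISSING.call_once(|| {
        fired = true;
        eprintln!("note: TableFormer models not found; falling back to geometric table reconstruction");
        for path in checked {
            eprintln!("  checked: {}", path.display());
        }
        eprintln!("  set DOCLING_TABLEFORMER_ENCODER, DOCLING_TABLEFORMER_DECODER and DOCLING_TABLEFORMER_BBOX to enable it");
    });
    fired
}

fn main() {
    // Assumed default locations, used here only as sample input.
    let checked = [
        Path::new("models/tableformer/encoder.onnx"),
        Path::new("models/tableformer/decoder.onnx"),
        Path::new("models/tableformer/bbox.onnx"),
    ];
    // Simulates several tables hitting the fallback: only one note appears on stderr.
    for _ in 0..3 {
        warn_missing_once(&checked);
    }
}
```

Because the guard is a process-wide static, concurrent callers are also covered: `call_once` blocks other threads until the first closure finishes, then turns into a no-op.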

https://claude.ai/code/session_01DofkqhMuAJbL9arnnuVisL

… decoder

TableFormer's model paths (DOCLING_TABLEFORMER_ENCODER/_DECODER/_BBOX) default to relative paths ("models/tableformer/*.onnx") when unset, unlike layout/OCR's env vars, which are consistently documented with absolute paths everywhere. Every doc/script that sets up the manual local-model workflow (README.md, pdf_setup.sh, pdf_conformance.sh, pdf_groundtruth.sh, performance.sh) only exported the layout/OCR vars, never TableFormer's. As a result, anyone running the CLI or an embedding binding (e.g. fleischwolf-node) from any directory other than the exact one the relative default resolves against silently got the geometric table-reconstruction fallback instead of ML table-structure recognition, with zero diagnostic signal.

- tableformer.rs: emit a one-time stderr note when the models can't be found, naming the exact paths checked, instead of returning None with no trace.
- README.md + the four pdf_*.sh/performance.sh scripts: export DOCLING_TABLEFORMER_ENCODER/_DECODER/_BBOX as absolute paths alongside layout/OCR, so local dev setups aren't CWD-dependent (fleischwolf-node's deps.js already did this correctly via installDependencies(); this brings the manual-setup docs/scripts to parity with it).
- tableformer.rs: disable ONNX Runtime's memory-pattern optimizer specifically on the decoder session. The decoder's KV-cache grows by one entry every autoregressive step, so its input shapes differ on every `run()` call; memory-pattern optimization assumes stable shapes to plan buffer reuse, and strace showed the decoder's external-weights file (decoder.onnx.data) being re-opened on what looks like every decode step. This is `ort`'s own documented guidance for variable-input-size sessions. It was not independently verified against live models in this environment (no models available here) and is worth confirming with strace against a real corpus.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01DofkqhMuAJbL9arnnuVisL
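
The shape argument in this commit can be illustrated without ONNX Runtime at all. The sketch below is a stand-in only: `SessionPlan`, `plan_sessions`, `kv_cache_shape`, and the dimensions are hypothetical, and the real build() closure presumably sets the matching option on the `ort` session builder rather than filling in a struct.

```rust
/// Hypothetical stand-in for the options a real session would be built with.
#[derive(Debug, Clone, PartialEq)]
struct SessionPlan {
    model: &'static str,
    mem_pattern: bool,
}

/// Shape of a KV-cache input after `step` decoded tokens. The sequence axis
/// grows by one per step; the axis order and sizes are illustrative.
fn kv_cache_shape(step: usize, batch: usize, hidden: usize) -> [usize; 3] {
    [step, batch, hidden]
}

/// Mirrors the idea of a build() closure that takes a mem_pattern flag:
/// fixed-shape sessions keep the optimization, the decoder does not.
fn plan_sessions() -> Vec<SessionPlan> {
    let build = |model: &'static str, mem_pattern: bool| SessionPlan { model, mem_pattern };
    vec![
        build("encoder.onnx", true),
        build("decoder.onnx", false),
        build("bbox.onnx", true),
    ]
}

fn main() {
    for plan in plan_sessions() {
        println!("{plan:?}");
    }
    // Every autoregressive step presents a new input shape, so a memory plan
    // computed for one step never matches the next one.
    let shapes: Vec<[usize; 3]> = (1..=4).map(|step| kv_cache_shape(step, 1, 512)).collect();
    for pair in shapes.windows(2) {
        assert_ne!(pair[0], pair[1]);
    }
    println!("{shapes:?}");
}
```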

…fig setup

installDependencies() could only auto-fetch pdfium and the OCR model; the layout model (required for any PDF/image conversion) had no public download, so every user hit a hard error on first use and had to either run the Python export toolchain locally or find/host the files themselves, with no convenient way to do either from a plain Node app.

Both models are permissively licensed (docling-project/docling-layout-heron, Apache-2.0; docling-project/docling-models, CDLA-Permissive-2.0/Apache-2.0), so fleischwolf can redistribute the ONNX export with attribution:

- .github/workflows/publish-models.yml: manual workflow that runs the existing scripts/export_{layout,tableformer}.py and publishes the resulting .onnx files as GitHub Release assets (tag models-v1); re-running re-uploads, handling the optional TableFormer decoder.onnx.data external-weights sidecar generically (checked per-file, not hardcoded).
- MODELS_NOTICE.md: attribution for the redistributed models.
- deps.js: installDependencies() now defaults `modelsUrl` to that release, so it works with zero configuration; `{ modelsUrl }` / FLEISCHWOLF_MODELS_URL still override for a custom export/host. Flattens tableformer/*.onnx to tableformer-*.onnx for the fetch (GitHub release assets can't contain "/"), and both error messages (installDependencies, assertMlReady) now print a numbered, copy-pasteable troubleshooting guide instead of a single dense paragraph, shown only when the fetch itself fails.
- scripts/setup_nodejs_dependencies.sh: a curl-pipeable wrapper. Run it from a Node app's directory to pre-fetch everything (installs the `fleischwolf` npm dep if missing, then drives installDependencies()) ahead of time, e.g. in a container build step.
- READMEs (root + fleischwolf-node): document the zero-config flow and the new script.

Verified deps.js end-to-end against the real (not-yet-published) release URL: pdfium/OCR download correctly, and the 404s on the unpublished model assets are caught per-file (TableFormer skips gracefully, layout falls through to the guided error) rather than crashing the whole install.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01DofkqhMuAJbL9arnnuVisL

…ependencies.sh

installDependencies() duplicated what a plain bash script does better: fetch pdfium + the ONNX models straight into ./models and ./.pdfium, which both the Rust CLI and the Node bindings already default to relative to CWD.

- Removes the JS download machinery entirely (deps.js keeps only path resolution + status checks).
- Deletes the now-pointless setup_nodejs_dependencies.sh wrapper.
- Adds the CWD-relative .pdfium/lib fallback to pdfium_backend.rs so the Rust side needs no env vars either.
- publish-models.yml now also re-hosts pdfium + the OCR model (previously only layout/TableFormer), so download_dependencies.sh talks to a single release.
- TableFormer assets are published under their bare names (encoder.onnx, decoder.onnx, bbox.onnx) instead of a tableformer- prefix, matching the script's fixed target paths.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01DofkqhMuAJbL9arnnuVisL
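
The CWD-relative fallback mentioned for pdfium_backend.rs might look roughly like the sketch below. The function name and signature are hypothetical, and the explicit override is passed as a parameter because this commit message does not name the environment variable involved.

```rust
use std::env;
use std::path::{Path, PathBuf};

/// Hypothetical helper: an explicit override wins; otherwise try
/// `.pdfium/lib` under the given working directory.
fn resolve_pdfium_dir(explicit: Option<PathBuf>, cwd: &Path) -> Option<PathBuf> {
    if let Some(dir) = explicit {
        return Some(dir);
    }
    let fallback = cwd.join(".pdfium").join("lib");
    // Only report the fallback when the directory is actually present.
    fallback.is_dir().then_some(fallback)
}

fn main() {
    let cwd = env::current_dir().expect("current directory should be readable");
    match resolve_pdfium_dir(None, &cwd) {
        Some(dir) => println!("pdfium directory: {}", dir.display()),
        None => println!("no .pdfium/lib under {}", cwd.display()),
    }
}
```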

claude added 3 commits July 1, 2026 13:36

artiz merged commit 7725289 into master Jul 1, 2026
3 checks passed

artiz deleted the fix/tableformer-silent-fallback branch July 1, 2026 15:04
