Improve TableFormer model loading and add memory pattern optimization#18

Merged
artiz merged 3 commits into master from fix/tableformer-silent-fallback on Jul 1, 2026
@artiz artiz commented Jul 1, 2026

Owner

Summary

This PR improves the TableFormer ONNX model loading experience by adding better diagnostics for missing models and optimizing ONNX Runtime memory usage patterns. It also ensures environment variables are properly exported in setup scripts so TableFormer works correctly regardless of the process's working directory.

Key Changes

  • Added warn_missing_once() function: Emits a single per-process warning to stderr when TableFormer models aren't found, explaining the fallback to geometric reconstruction and directing users to set environment variables. This replaces silent failures that were particularly problematic for embedding applications or processes not run from the repo root.

  • Optimized ONNX Runtime memory patterns: Modified the build() closure to accept a mem_pattern parameter and disable memory pattern optimization for the decoder session. This avoids repeated re-validation of the memory plan on each autoregressive step, since the decoder's KV-cache grows by one entry per step, causing input shapes to differ on every run() call.

  • Updated setup and test scripts: Added explicit exports of DOCLING_TABLEFORMER_ENCODER, DOCLING_TABLEFORMER_DECODER, and DOCLING_TABLEFORMER_BBOX environment variables in:

    • scripts/pdf_setup.sh (conditional export when models exist)
    • scripts/pdf_conformance.sh (with defaults)
    • scripts/pdf_groundtruth.sh (with defaults)
    • scripts/performance.sh (conditional export when models exist)
  • Updated README.md: Added documentation explaining the optional TableFormer environment variables, the silent fallback behavior, and why explicit configuration is recommended when invoking from non-repo-root directories.

Implementation Details

  • The decoder's memory pattern optimization is disabled (mem_pattern=false) while encoder and bbox use it (mem_pattern=true), reflecting the decoder's dynamic shape requirements.
  • The warning uses std::sync::Once to ensure it fires exactly once per process, avoiding log spam while still making the issue visible.
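  • scripts/pdf_setup.sh and scripts/performance.sh export the paths only when the model files actually exist, while scripts/pdf_conformance.sh and scripts/pdf_groundtruth.sh export them with defaults; both approaches maintain backward compatibility.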
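
The one-time warning described above can be sketched as follows. This is a minimal illustration, not the code from tableformer.rs: the function signature, the return value, and the message wording are assumptions made for the example, and the file name models/tableformer/encoder.onnx is only a plausible instance of the relative default.

```rust
use std::path::Path;
use std::sync::Once;

// Guards the warning so it is printed at most once per process.
static WARN_MISSING: Once = Once::new();

/// Hypothetical stand-in for the PR's `warn_missing_once()`.
/// Returns true only on the call that actually emitted the warning
/// (the return value is an addition for this sketch, to make it testable).
pub fn warn_missing_once(checked: &[&Path]) -> bool {
    let mut emitted = false;
    WARN_MISSING.call_once(|| {
        eprintln!(
            "note: TableFormer models not found; falling back to geometric table reconstruction."
        );
        // Name the exact paths that were checked.
        for path in checked {
            eprintln!("  checked: {}", path.display());
        }
        eprintln!(
            "  set DOCLING_TABLEFORMER_ENCODER, DOCLING_TABLEFORMER_DECODER and \
             DOCLING_TABLEFORMER_BBOX to absolute paths to enable it."
        );
        emitted = true;
    });
    emitted
}
```

Calling the function repeatedly from any thread prints the note once; later calls return immediately, which keeps stderr quiet in long-running embedding processes.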

https://claude.ai/code/session_01DofkqhMuAJbL9arnnuVisL

claude added 3 commits July 1, 2026 13:36
… decoder

TableFormer's model paths (DOCLING_TABLEFORMER_ENCODER/_DECODER/_BBOX) default
to relative paths ("models/tableformer/*.onnx") when unset, unlike layout/OCR's
env vars which are consistently documented with absolute paths everywhere.
Every doc/script that sets up the manual local-model workflow (README.md,
pdf_setup.sh, pdf_conformance.sh, pdf_groundtruth.sh, performance.sh) only
exported the layout/OCR vars, never TableFormer's — so anyone running the CLI
or an embedding binding (e.g. fleischwolf-node) from any directory other than
the exact one the relative default resolves against silently got the
geometric table-reconstruction fallback instead of ML table-structure
recognition, with zero diagnostic signal.
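
To illustrate why a relative default makes the lookup depend on the working directory, here is a small sketch. It assumes the resolution amounts to "use the env var if set, otherwise the relative default, interpreted against the CWD"; the function name `resolve_model_path`, its parameters, and the file name encoder.onnx are hypothetical and not taken from tableformer.rs.

```rust
use std::env;
use std::path::{Path, PathBuf};

/// Resolve a model path: use the environment variable if it is set,
/// otherwise fall back to a relative default.
/// A relative result is joined onto `cwd`, which is why an unset variable
/// makes the lookup depend on where the process was started.
pub fn resolve_model_path(var: &str, relative_default: &str, cwd: &Path) -> PathBuf {
    let raw = env::var_os(var)
        .map(PathBuf::from)
        .unwrap_or_else(|| PathBuf::from(relative_default));
    if raw.is_absolute() {
        raw
    } else {
        cwd.join(raw)
    }
}
```

With the variable unset, the same call made with `/repo` and `/srv/app` as the working directory yields two different files, and only the first is likely to exist. Exporting an absolute path removes that dependence.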

- tableformer.rs: emit a one-time stderr note when the models can't be found,
  naming the exact paths checked, instead of returning None with no trace.
- README.md + the four pdf_*.sh/performance.sh scripts: export
  DOCLING_TABLEFORMER_ENCODER/_DECODER/_BBOX as absolute paths alongside
  layout/OCR, so local dev setups aren't CWD-dependent (fleischwolf-node's
  deps.js already did this correctly via installDependencies(); this brings
  the manual-setup docs/scripts to parity with it).
- tableformer.rs: disable ONNX Runtime's memory-pattern optimizer specifically
  on the decoder session. The decoder's KV-cache grows by one entry every
  autoregressive step, so its input shapes differ on every `run()` call;
  memory-pattern optimization assumes stable shapes to plan buffer reuse, and
  strace showed the decoder's external-weights file (decoder.onnx.data) being
  re-opened on what looks like every decode step. Disabling the optimizer is
  `ort`'s own documented guidance for variable-input-size sessions; the effect
  has not been independently verified against live models in this environment
  (no models available here) and is worth confirming with strace against a
  real corpus.
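
The reasoning about growing shapes can be made concrete with a toy model. The sketch below does not use ONNX Runtime or `ort` at all; it assumes, purely for illustration, that a memory plan is cached per input shape, and counts how often a decode loop presents a shape that has not been seen before. The function name and the single-integer "shape" are simplifications invented for the example.

```rust
use std::collections::HashSet;

/// Toy model of a shape-keyed memory-plan cache: returns how many of `steps`
/// run() calls present an input shape that has not been seen before.
/// `kv_grows` models a decoder whose KV-cache gains one entry per step.
pub fn plan_cache_misses(steps: usize, kv_grows: bool) -> usize {
    let mut seen_shapes: HashSet<usize> = HashSet::new();
    let mut misses = 0;
    for step in 0..steps {
        // Sequence length of the cache fed into this step.
        let cache_len = if kv_grows { step + 1 } else { 1 };
        // `insert` returns true only for a shape not seen before.
        if seen_shapes.insert(cache_len) {
            misses += 1;
        }
    }
    misses
}
```

Under this model a fixed-shape session (encoder, bbox) misses once and reuses its plan afterwards, while the growing decoder misses on every step, so a per-shape plan is never reused for it.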

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01DofkqhMuAJbL9arnnuVisL
…fig setup

installDependencies() could only auto-fetch pdfium and the OCR model; the
layout model (required for any PDF/image conversion) had no public download,
so every user hit a hard error on first use and had to either run the Python
export toolchain locally or find/host the files themselves — with no
convenient way to do either from a plain Node app.

Both models are permissively licensed (docling-project/docling-layout-heron,
Apache-2.0; docling-project/docling-models, CDLA-Permissive-2.0/Apache-2.0),
so fleischwolf can redistribute the ONNX export with attribution:

- .github/workflows/publish-models.yml: manual workflow that runs the existing
  scripts/export_{layout,tableformer}.py and publishes the resulting .onnx
  files as GitHub Release assets (tag models-v1); re-running re-uploads,
  handling the optional TableFormer decoder.onnx.data external-weights
  sidecar generically (checked per-file, not hardcoded).
- MODELS_NOTICE.md: attribution for the redistributed models.
- deps.js: installDependencies() now defaults `modelsUrl` to that release, so
  it works with zero configuration; `{ modelsUrl }` / FLEISCHWOLF_MODELS_URL
  still override for a custom export/host. Flattens tableformer/*.onnx to
  tableformer-*.onnx for the fetch (GitHub release assets can't contain "/"),
  and both error messages (installDependencies, assertMlReady) now print a
  numbered, copy-pasteable troubleshooting guide instead of a single dense
  paragraph, shown only when the fetch itself fails.
- scripts/setup_nodejs_dependencies.sh: a curl-pipeable wrapper — run from a
  Node app's directory to pre-fetch everything (installs the `fleischwolf` npm
  dep if missing, then drives installDependencies()) ahead of time, e.g. in a
  container build step.
- READMEs (root + fleischwolf-node): document the zero-config flow and the new
  script.

Verified deps.js end-to-end against the real (not-yet-published) release URL:
pdfium/OCR download correctly, and the 404s on the unpublished model assets are
caught per-file (TableFormer skips gracefully, layout falls through to the
guided error) rather than crashing the whole install.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01DofkqhMuAJbL9arnnuVisL
…ependencies.sh

installDependencies() duplicated what a plain bash script does better: fetch
pdfium + the ONNX models straight into ./models and ./.pdfium, which both the
Rust CLI and the Node bindings already default to relative to CWD. Removes the
JS download machinery entirely (deps.js keeps only path resolution + status
checks), deletes the now-pointless setup_nodejs_dependencies.sh wrapper, and
adds the CWD-relative .pdfium/lib fallback to pdfium_backend.rs so the Rust
side needs no env vars either.

publish-models.yml now also re-hosts pdfium + the OCR model (previously only
layout/TableFormer), so download_dependencies.sh talks to a single release.
TableFormer assets are published under their bare names (encoder.onnx,
decoder.onnx, bbox.onnx) instead of a tableformer- prefix, matching the
script's fixed target paths.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01DofkqhMuAJbL9arnnuVisL
@artiz artiz merged commit 7725289 into master Jul 1, 2026
3 checks passed
@artiz artiz deleted the fix/tableformer-silent-fallback branch July 1, 2026 15:04