Enhance JATS, ODF, and DOCX backends with improved content extraction by artiz · Pull Request #16 · artiz/fleischwolf

artiz · 2026-07-01T09:47:40Z

Summary

This PR significantly enhances content extraction across multiple document format backends, with major improvements to JATS XML processing, ODF table handling, and DOCX equation rendering.

Key Changes

JATS Backend (`src/backend/jats.rs`)

Ported docling's _walk_linear algorithm for improved document structure traversal
Added comprehensive support for <table-wrap> tables with captions
Added support for <fig> figures with captions
Improved handling of supplementary materials and appendices
Enhanced inline markup and formula processing
Updated documentation to reflect new capabilities

ODF Backend (`src/backend/odf.rs`)

Extended imports to include HashSet and VecDeque for improved data structure handling
Refactored list style tracking with enhanced type definitions
Improved table extraction and processing logic
Better handling of nested structures and list levels

DOCX Backend (`src/backend/docx.rs`)

Improved equation rendering and inline formula handling
Enhanced spacing and formatting around mathematical expressions
Better integration of OMML (Office Math Markup Language) processing

OMML Backend (`src/backend/omml.rs`)

Enhanced LaTeX conversion for mathematical expressions
Improved character escaping and formula rendering

Markdown Output (`fleischwolf-core/src/markdown.rs`)

Added improved table cell rendering logic
Better handling of table structure and formatting
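
The commit notes for this PR say the Markdown table serializer right-aligns numeric columns that use thousands separators (`7,015`). A minimal sketch of that kind of column typing follows. It is not the code in `fleischwolf-core/src/markdown.rs`: the names `looks_numeric`, `column_is_numeric` and `pad_cell` are hypothetical, and comma grouping is deliberately not validated.

```rust
/// Returns true if `cell` looks like a number, tolerating `,` thousands
/// separators and an optional leading sign. Assumption: comma positions
/// are not checked, so "1,2" also counts as numeric in this sketch.
fn looks_numeric(cell: &str) -> bool {
    let s = cell.trim();
    let s = s
        .strip_prefix('-')
        .or_else(|| s.strip_prefix('+'))
        .unwrap_or(s);
    let mut digits = 0;
    let mut dots = 0;
    for c in s.chars() {
        match c {
            '0'..='9' => digits += 1,
            ',' => {}
            '.' => dots += 1,
            _ => return false,
        }
    }
    digits > 0 && dots <= 1
}

/// A column is treated as numeric when every non-empty cell looks numeric
/// and at least one cell is non-empty.
fn column_is_numeric(cells: &[&str]) -> bool {
    let mut seen = false;
    for cell in cells {
        if cell.trim().is_empty() {
            continue;
        }
        if !looks_numeric(cell) {
            return false;
        }
        seen = true;
    }
    seen
}

/// Pads a cell to `width`, right-aligned for numeric columns and
/// left-aligned otherwise.
fn pad_cell(cell: &str, width: usize, numeric: bool) -> String {
    if numeric {
        format!("{:>width$}", cell, width = width)
    } else {
        format!("{:<width$}", cell, width = width)
    }
}
```

Deciding alignment per column rather than per cell keeps a column visually consistent even when some of its cells are empty.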

Test Updates

Updated expected outputs across multiple test fixtures:

JATS documents (pone.0234687, elife-56337, pntd.0008301) now include extracted tables and figures
ODF documents show improved table extraction with proper title handling
DOCX equation tests reflect improved spacing around inline formulas
XBRL and other format tests updated to reflect backend improvements

Notable Implementation Details

The JATS backend now properly extracts and structures complex document elements like tables and figures with their captions
ODF table processing now handles multiple tables and title elements more robustly
Improved whitespace handling in equation rendering for better readability in markdown output

https://claude.ai/code/session_01DofkqhMuAJbL9arnnuVisL

…S quirks

Advance the three "not migrated" items from MIGRATION.md §5.

JATS: port docling's `_walk_linear` so `<body>`/`<back>` now render tables (`<table-wrap>` with caption), figures (`<fig>` caption + picture), bullet lists, `<ref-list>` references with `<element-citation>`/`<mixed-citation>` flattening, `<fn-group>` footnotes and `<disp-formula>` equations. All four JATS fixtures are now byte-for-byte with live docling (were 115–220 diff lines).

DOCX: finish the advanced OMML + inline-equation spacing long tail. omml.rs now does limit-label space escaping (`do_lim`), `\operatorname{argmax}` limit funcs, and pylatexenc's exact two-space symbol padding under a single `"  "->" "` collapse. docx.rs reproduces docling's inline-group serialization so inline equations pick up the doubled space and list items keep their marker (`_handle_equations_in_text`). equations.docx is now byte-exact.

ODF: mixed-style list continuation + empty-list-item level collapse (a port of `_add_odf_list` with `_OdfListState` threaded across sibling lists in both text and presentation walks), and ODS sheet->table region detection (a strict gap-tolerance flood fill splitting a sheet into its data regions, singletons kept as 1x1 tables). odf_table_with_title_01.ods is now byte-exact.

Shared: the Markdown table serializer now right-aligns numeric columns that use thousands separators (`7,015`), matching tabulate's column typing. This improves JATS/XLSX/XBRL tables.

Fixtures regenerated; `cargo fmt`/`clippy -D warnings`/`cargo test` green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01DofkqhMuAJbL9arnnuVisL
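
The ODS region detection mentioned in this commit can be illustrated with a small flood fill. The sketch below is an assumption-laden illustration, not the code in `src/backend/odf.rs`: it treats the sheet as a grid of booleans, uses strict 4-neighbour adjacency (no gap tolerance), and returns inclusive bounding boxes. The names `Region` and `find_regions` are hypothetical.

```rust
use std::collections::{HashSet, VecDeque};

/// Inclusive bounding box of one data region:
/// (min_row, min_col, max_row, max_col).
type Region = (usize, usize, usize, usize);

/// Splits a sheet into connected data regions.
/// `filled[r][c]` is true when the cell at row `r`, column `c` holds data.
/// Rows may have different lengths; missing cells count as empty.
fn find_regions(filled: &[Vec<bool>]) -> Vec<Region> {
    let mut seen: HashSet<(usize, usize)> = HashSet::new();
    let mut regions = Vec::new();

    for (r, row) in filled.iter().enumerate() {
        for (c, &has_data) in row.iter().enumerate() {
            // Skip empty cells and cells already claimed by a region.
            if !has_data || !seen.insert((r, c)) {
                continue;
            }
            let mut bounds: Region = (r, c, r, c);
            let mut queue = VecDeque::from([(r, c)]);

            while let Some((cr, cc)) = queue.pop_front() {
                bounds = (
                    bounds.0.min(cr),
                    bounds.1.min(cc),
                    bounds.2.max(cr),
                    bounds.3.max(cc),
                );
                // wrapping_sub turns 0 - 1 into usize::MAX, which the
                // bounds-checked `get` calls below reject.
                let neighbours = [
                    (cr.wrapping_sub(1), cc),
                    (cr + 1, cc),
                    (cr, cc.wrapping_sub(1)),
                    (cr, cc + 1),
                ];
                for (nr, nc) in neighbours {
                    let is_data = filled
                        .get(nr)
                        .and_then(|row| row.get(nc))
                        .copied()
                        .unwrap_or(false);
                    if is_data && seen.insert((nr, nc)) {
                        queue.push_back((nr, nc));
                    }
                }
            }
            regions.push(bounds);
        }
    }
    regions
}
```

An isolated filled cell comes back as a box whose minimum and maximum coincide, which corresponds to the "singletons kept as 1x1 tables" behaviour described above.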

Complete the ODF "rich table cells" long tail and clean up table handling.

- Rich vs plain cell split (docling's `_odf_cell_is_rich`/`_odf_cell_text`): a cell holding a list, nested table, image or multiple paragraphs renders its full block content flattened to one line (formatting kept, lists as `- item`, nested tables flattened to space-joined cells); a plain cell renders unformatted text (subscripts inline, no markers).
- Table grid rewrite: covered (`table:covered-table-cell`) columns of a col/row span are left blank instead of duplicating the anchor's text, nested-table rows no longer leak into the outer table, and the grid is trimmed to its data bounds, matching docling's `_add_table_from_odf` / `_find_true_data_bounds`.
- Paragraph images emit a placeholder picture before the text (skipping `ObjectReplacements/` previews); run collection recurses into every child so an image's `<svg:desc>` text is picked up; paragraph text is stripped at both ends (a leading `<text:line-break/>` no longer doubles a space).

odt/ods table fixtures are now byte-for-byte with live docling (text_document_01 and _03 tables, odf_table_with_title, and the merged-cell presentation table); `.odp` slide titles/shape-text/notes remain outstanding (tracked in MIGRATION.md §5).

Also adds a `.gitignore` for the local docling ground-truth venv.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01DofkqhMuAJbL9arnnuVisL
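
The table grid rewrite described in this commit (covered cells left blank, grid trimmed to its data bounds) might look roughly like the following. This is a sketch under stated assumptions rather than the port of `_add_table_from_odf`: the `OdfCell` enum and `build_grid` function are hypothetical, and only trailing empty rows and columns are trimmed.

```rust
/// One parsed cell in an ODF table row (hypothetical type for illustration).
/// `Covered` stands for a `table:covered-table-cell`, i.e. a position
/// hidden under the span of an anchor cell.
enum OdfCell {
    Data(String),
    Covered,
}

/// Builds a rectangular text grid. Covered positions stay blank instead of
/// repeating the anchor's text, and trailing all-empty rows and columns
/// are dropped.
fn build_grid(rows: &[Vec<OdfCell>]) -> Vec<Vec<String>> {
    let width = rows.iter().map(|row| row.len()).max().unwrap_or(0);

    let mut grid: Vec<Vec<String>> = rows
        .iter()
        .map(|row| {
            let mut out: Vec<String> = row
                .iter()
                .map(|cell| match cell {
                    OdfCell::Data(text) => text.trim().to_string(),
                    OdfCell::Covered => String::new(),
                })
                .collect();
            // Pad short rows so the grid is rectangular.
            out.resize(width, String::new());
            out
        })
        .collect();

    // Drop trailing rows that contain no text at all.
    while grid
        .last()
        .map_or(false, |row| row.iter().all(|c| c.is_empty()))
    {
        grid.pop();
    }

    // Find the right-most column that holds text and cut everything after it.
    let used = grid
        .iter()
        .filter_map(|row| row.iter().rposition(|c| !c.is_empty()))
        .max()
        .map_or(0, |last| last + 1);
    for row in &mut grid {
        row.truncate(used);
    }
    grid
}
```

Keeping covered positions as empty strings preserves column alignment in the Markdown output while avoiding the duplicated anchor text the commit message refers to.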

…fleischwolf-node crate

The Node.js/Bun N-API bindings (fleischwolf-node, published to npm) landed on master. Remove it from §6 Extensions' unbuilt-work list and add the crate to the §1 architecture tree, matching README.md's Layout table.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01DofkqhMuAJbL9arnnuVisL

Add support for MIME HTML archives, a saved-webpage format docling itself has no backend for.

mail-parser conforms to RFC 2557 (MIME Encapsulation of Aggregate Documents / MHTML), so the new backend parses the archive as a MIME message, extracts the root `text/html` part, and routes it through the existing HTML backend for full Markdown extraction, reusing block structure, tables, lists and links as-is.

Embedded resources (images, CSS) live as sibling MIME parts addressed by `Content-Location` (the resource's original URL, which a saved page's `<img src>` is rewritten to verbatim) or `Content-ID` (`cid:...` references, used by some synthesized parts like inlined stylesheets). Images are resolved against the archive's own parts via the existing `MapImageResolver` (the same mechanism EPUB uses for its archive) and embedded by default. Unlike HTML/EPUB's opt-in `fetch_images` gate, there's no separate network/filesystem read here, just the MIME bytes already parsed, so it matches how DOCX/PPTX embed their blobs by default. Vector formats the `image` crate can't decode (e.g. `image/svg+xml`) are left as unresolved placeholders rather than embedded without pixel dimensions.

Wires `InputFormat::Mhtml` (`.mhtml`/`.mht`) through the converter and the Node.js bindings' format list.

Moves the two MHTML fixtures pushed to the top-level conformance corpus into the crate-level regression corpus instead (matching the `json_docling` precedent: docling has no MHTML backend, so there's nothing to run `scripts/conformance.sh` against; regression fixtures pinning our own output are the only applicable test).

Updates MIGRATION.md (moves MHTML from the §6 Extensions roadmap into the §2 format table, now that it's implemented) and README.md (documents the new format in Status and Image extraction).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01DofkqhMuAJbL9arnnuVisL
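
The resource lookup this commit describes (sibling parts addressed by `Content-Location` or by `cid:` references) can be sketched with a plain map. The code below does not use mail-parser or the project's `MapImageResolver`; the `Part` struct and the `index_parts` and `resolve` functions are hypothetical stand-ins, and the parts are assumed to be already parsed.

```rust
use std::collections::HashMap;

/// An already-parsed MIME part (hypothetical type for illustration).
struct Part {
    content_location: Option<String>,
    content_id: Option<String>,
    content_type: String,
    body: Vec<u8>,
}

/// Indexes parts under every key an `<img src>` might use:
/// the Content-Location URL, and `cid:<id>` for a Content-ID.
/// The first part registered under a key wins.
fn index_parts(parts: &[Part]) -> HashMap<String, usize> {
    let mut index = HashMap::new();
    for (i, part) in parts.iter().enumerate() {
        if let Some(location) = &part.content_location {
            index.entry(location.trim().to_string()).or_insert(i);
        }
        if let Some(id) = &part.content_id {
            // Content-ID header values are wrapped in angle brackets,
            // while cid: URLs reference the bare id.
            let bare = id.trim().trim_start_matches('<').trim_end_matches('>');
            index.entry(format!("cid:{}", bare)).or_insert(i);
        }
    }
    index
}

/// Resolves an image reference to its part, or None when the reference is
/// unknown or points at SVG, which this sketch leaves as a placeholder.
fn resolve<'a>(
    src: &str,
    parts: &'a [Part],
    index: &HashMap<String, usize>,
) -> Option<&'a Part> {
    let part = &parts[*index.get(src.trim())?];
    if part.content_type.eq_ignore_ascii_case("image/svg+xml") {
        return None;
    }
    Some(part)
}
```

Because the lookup only touches bytes that were already parsed from the archive, no network or filesystem access is involved, which is the reason the commit gives for embedding these images by default.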

artiz force-pushed the feature/extend-docs-formats-support branch 2 times, most recently from 30af77d to d092d98 Compare July 1, 2026 10:29

claude added 3 commits July 1, 2026 11:51

artiz force-pushed the feature/extend-docs-formats-support branch from d092d98 to 333bf77 Compare July 1, 2026 11:56

artiz and others added 2 commits July 1, 2026 14:05

feat(mhtml): add MHTML examples

a9a6aef

artiz merged commit 80894da into master Jul 1, 2026
3 checks passed

artiz deleted the feature/extend-docs-formats-support branch July 1, 2026 12:51

artiz merged 5 commits into master from feature/extend-docs-formats-support


2 participants
