Enhance JATS, ODF, and DOCX backends with improved content extraction#16

Merged
artiz merged 5 commits into master from feature/extend-docs-formats-support
Jul 1, 2026
@artiz artiz commented Jul 1, 2026

Summary

This PR improves content extraction in several document format backends, chiefly JATS XML processing, ODF table handling, and DOCX equation rendering.

Key Changes

JATS Backend (src/backend/jats.rs)

  • Ported docling's _walk_linear algorithm for improved document structure traversal
  • Added comprehensive support for <table-wrap> tables with captions
  • Added support for <fig> figures with captions
  • Improved handling of supplementary materials and appendices
  • Enhanced inline markup and formula processing
  • Updated documentation to reflect new capabilities

ODF Backend (src/backend/odf.rs)

  • Extended imports to include HashSet and VecDeque for improved data structure handling
  • Refactored list style tracking with enhanced type definitions
  • Improved table extraction and processing logic
  • Better handling of nested structures and list levels

DOCX Backend (src/backend/docx.rs)

  • Improved equation rendering and inline formula handling
  • Enhanced spacing and formatting around mathematical expressions
  • Better integration of OMML (Office Math Markup Language) processing

OMML Backend (src/backend/omml.rs)

  • Enhanced LaTeX conversion for mathematical expressions
  • Improved character escaping and formula rendering

Markdown Output (fleischwolf-core/src/markdown.rs)

  • Added improved table cell rendering logic
  • Better handling of table structure and formatting

Test Updates

Updated expected outputs across multiple test fixtures:

  • JATS documents (pone.0234687, elife-56337, pntd.0008301) now include extracted tables and figures
  • ODF documents show improved table extraction with proper title handling
  • DOCX equation tests reflect improved spacing around inline formulas
  • XBRL and other table-bearing fixtures updated for the shared Markdown table serializer change

Notable Implementation Details

  • The JATS backend now properly extracts and structures complex document elements like tables and figures with their captions
  • ODF table processing now handles multiple tables and title elements more robustly
  • Improved whitespace handling in equation rendering for better readability in markdown output

https://claude.ai/code/session_01DofkqhMuAJbL9arnnuVisL

@artiz artiz force-pushed the feature/extend-docs-formats-support branch 2 times, most recently from 30af77d to d092d98 Compare July 1, 2026 10:29
claude added 3 commits July 1, 2026 11:51
…S quirks

Advance the three "not migrated" items from MIGRATION.md §5.

JATS: port docling's `_walk_linear` so `<body>`/`<back>` now render tables
(`<table-wrap>` with caption), figures (`<fig>` caption + picture), bullet
lists, `<ref-list>` references with `<element-citation>`/`<mixed-citation>`
flattening, `<fn-group>` footnotes and `<disp-formula>` equations. All four
JATS fixtures now match live docling byte-for-byte (previously 115–220 diff lines).

DOCX: finish the advanced OMML + inline-equation spacing long tail. omml.rs now
does limit-label space escaping (`do_lim`), `\operatorname{argmax}` limit funcs,
and pylatexenc's exact two-space symbol padding under a single `"  "->" "`
collapse. docx.rs reproduces docling's inline-group serialization so inline
equations pick up the doubled space and list items keep their marker
(`_handle_equations_in_text`). equations.docx is now byte-exact.
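
To make the single-collapse point concrete, here is a minimal sketch of why one `"  "->" "` pass differs from collapsing whitespace to a fixed point. The function names are hypothetical and this is not the code in omml.rs; it only illustrates the behaviour of a single non-overlapping replace.

```rust
/// Collapse doubled spaces in one left-to-right pass.
/// Hypothetical helper, assumed to mirror a single `"  " -> " "` replace.
fn collapse_once(s: &str) -> String {
    // `str::replace` substitutes non-overlapping matches in a single scan,
    // so a run of four spaces becomes two spaces, not one.
    s.replace("  ", " ")
}

/// For contrast: repeat the replace until no doubled space is left.
fn collapse_fully(s: &str) -> String {
    let mut out = s.to_string();
    while out.contains("  ") {
        out = out.replace("  ", " ");
    }
    out
}
```

Under this sketch, `collapse_once("a    b")` keeps a doubled space while `collapse_fully("a    b")` does not, which is the kind of difference that matters when the target is byte-exact output.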

ODF: mixed-style list continuation + empty-list-item level collapse (a port of
`_add_odf_list` with `_OdfListState` threaded across sibling lists in both text
and presentation walks), and ODS sheet->table region detection (a strict
gap-tolerance flood fill splitting a sheet into its data regions, singletons
kept as 1x1 tables). odf_table_with_title_01.ods is now byte-exact.
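
The sheet->table region detection can be sketched as a breadth-first flood fill over non-empty cells. This is an illustration only: the `Region` type and `find_regions` function are hypothetical, and "strict" is assumed here to mean zero gap tolerance with 4-connected adjacency, which may differ from the actual port.

```rust
use std::collections::{HashSet, VecDeque};

/// Inclusive bounding box of one data region, in (row, col) sheet coordinates.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Region {
    min_row: usize,
    min_col: usize,
    max_row: usize,
    max_col: usize,
}

/// Split a sheet into data regions with a breadth-first flood fill.
/// Assumption: only directly adjacent (up/down/left/right) non-empty cells
/// join the same region, so any empty row or column separates two tables.
fn find_regions(sheet: &[Vec<Option<String>>]) -> Vec<Region> {
    // Out-of-range coordinates simply count as empty.
    let filled =
        |r: usize, c: usize| matches!(sheet.get(r).and_then(|row| row.get(c)), Some(Some(_)));
    let mut seen: HashSet<(usize, usize)> = HashSet::new();
    let mut regions = Vec::new();

    for (r, row) in sheet.iter().enumerate() {
        for c in 0..row.len() {
            // Skip empty cells and cells already claimed by an earlier region.
            if !filled(r, c) || !seen.insert((r, c)) {
                continue;
            }
            let mut region = Region { min_row: r, min_col: c, max_row: r, max_col: c };
            let mut queue = VecDeque::from([(r, c)]);
            while let Some((cr, cc)) = queue.pop_front() {
                region.min_row = region.min_row.min(cr);
                region.max_row = region.max_row.max(cr);
                region.min_col = region.min_col.min(cc);
                region.max_col = region.max_col.max(cc);
                // `wrapping_sub` turns 0 - 1 into usize::MAX, which `filled` rejects.
                let neighbours = [
                    (cr.wrapping_sub(1), cc),
                    (cr + 1, cc),
                    (cr, cc.wrapping_sub(1)),
                    (cr, cc + 1),
                ];
                for (nr, nc) in neighbours {
                    if filled(nr, nc) && seen.insert((nr, nc)) {
                        queue.push_back((nr, nc));
                    }
                }
            }
            // An isolated cell yields a 1x1 region rather than being dropped.
            regions.push(region);
        }
    }
    regions
}
```

Regions come out in row-major order of their first-visited cell, so the output order is deterministic for a given sheet.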

Shared: the Markdown table serializer now right-aligns numeric columns that use
thousands separators (`7,015`), matching tabulate's column typing; this improves
JATS/XLSX/XBRL tables.
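
A rough sketch of the column typing idea follows. It is not the serializer in markdown.rs: all function names are hypothetical, and the rule that thousands groups must be exactly three digits is an assumption made for the example rather than a statement about tabulate's behaviour.

```rust
/// True if `s` is non-empty and made only of ASCII digits.
fn all_digits(s: &str) -> bool {
    !s.is_empty() && s.bytes().all(|b| b.is_ascii_digit())
}

/// Return true if `cell` looks like a number, allowing `,` as a thousands
/// separator (e.g. `7,015`). Assumption: separated groups are three digits.
fn is_numeric_cell(cell: &str) -> bool {
    let s = cell.trim();
    let unsigned = s
        .strip_prefix('-')
        .or_else(|| s.strip_prefix('+'))
        .unwrap_or(s);
    let (int_part, frac_part) = match unsigned.split_once('.') {
        Some((i, f)) => (i, Some(f)),
        None => (unsigned, None),
    };
    if let Some(f) = frac_part {
        if !all_digits(f) {
            return false;
        }
    }
    let mut groups = int_part.split(',');
    let first = groups.next().unwrap_or("");
    // The leading group may be shorter than three digits when separators are used.
    let first_ok = all_digits(first) && (first.len() <= 3 || !int_part.contains(','));
    first_ok && groups.all(|g| g.len() == 3 && all_digits(g))
}

/// A column is numeric if every non-blank cell is numeric and at least one exists.
fn column_is_numeric(cells: &[&str]) -> bool {
    let mut seen_any = false;
    for cell in cells {
        if cell.trim().is_empty() {
            continue;
        }
        if !is_numeric_cell(cell) {
            return false;
        }
        seen_any = true;
    }
    seen_any
}

/// Pad a cell to `width`: numeric columns right-aligned, others left-aligned.
fn align_cell(cell: &str, width: usize, numeric: bool) -> String {
    if numeric {
        format!("{cell:>width$}")
    } else {
        format!("{cell:<width$}")
    }
}
```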

Fixtures regenerated; `cargo fmt`/`clippy -D warnings`/`cargo test` green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01DofkqhMuAJbL9arnnuVisL
Complete the ODF "rich table cells" long tail and clean up table handling.

- Rich vs plain cell split (docling's `_odf_cell_is_rich`/`_odf_cell_text`): a
  cell holding a list, nested table, image or multiple paragraphs renders its
  full block content on one line (formatting kept, lists as `- item`, nested
  tables flattened to space-joined cells); a plain cell renders unformatted
  text (subscripts inline, no markers).
- Table grid rewrite: covered (`table:covered-table-cell`) columns of a col/row
  span are left blank instead of duplicating the anchor's text, nested-table rows
  no longer leak into the outer table, and the grid is trimmed to its data
  bounds — matching docling's `_add_table_from_odf` / `_find_true_data_bounds`.
- Paragraph images emit a placeholder picture before the text (skipping
  `ObjectReplacements/` previews); run collection recurses into every child so an
  image's `<svg:desc>` text is picked up; paragraph text is stripped at both ends
  (a leading `<text:line-break/>` no longer doubles a space).
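
The first two items above can be sketched with a simplified cell model. Everything here is hypothetical (the `Block` and `SpanCell` types, the function names, and the `<!-- image -->` placeholder text are assumptions for illustration), so treat it as a picture of the rules rather than the code in odf.rs.

```rust
/// Hypothetical, simplified model of the block content inside one table cell.
enum Block {
    Paragraph(String),
    List(Vec<String>),
    Image,
    NestedTable(Vec<Vec<String>>),
}

/// A cell is treated as rich if it holds anything beyond a single paragraph.
fn cell_is_rich(blocks: &[Block]) -> bool {
    let paragraphs = blocks.iter().filter(|b| matches!(b, Block::Paragraph(_))).count();
    paragraphs > 1 || paragraphs < blocks.len()
}

/// Render a rich cell on one line: list items as `- item`, nested tables as
/// space-joined cell text. The image placeholder string is an assumption.
fn render_rich_cell(blocks: &[Block]) -> String {
    let parts: Vec<String> = blocks
        .iter()
        .map(|b| match b {
            Block::Paragraph(t) => t.trim().to_string(),
            Block::List(items) => items
                .iter()
                .map(|i| format!("- {}", i.trim()))
                .collect::<Vec<_>>()
                .join(" "),
            Block::Image => "<!-- image -->".to_string(),
            Block::NestedTable(rows) => rows
                .iter()
                .flatten()
                .map(|c| c.trim())
                .collect::<Vec<_>>()
                .join(" "),
        })
        .collect();
    parts.into_iter().filter(|p| !p.is_empty()).collect::<Vec<_>>().join(" ")
}

/// Hypothetical anchor cell with a column span.
struct SpanCell {
    text: String,
    col_span: usize,
}

/// Expand one row: the anchor keeps its text, covered columns stay blank.
fn expand_row(cells: &[SpanCell]) -> Vec<String> {
    let mut out = Vec::new();
    for cell in cells {
        out.push(cell.text.clone());
        out.extend(std::iter::repeat(String::new()).take(cell.col_span.saturating_sub(1)));
    }
    out
}

/// Trim a grid to the smallest rectangle that contains every non-blank cell.
fn trim_to_data_bounds(grid: &[Vec<String>]) -> Vec<Vec<String>> {
    let mut bounds: Option<(usize, usize, usize, usize)> = None;
    for (r, row) in grid.iter().enumerate() {
        for (c, cell) in row.iter().enumerate() {
            if cell.trim().is_empty() {
                continue;
            }
            bounds = Some(match bounds {
                None => (r, r, c, c),
                Some((r0, r1, c0, c1)) => (r0.min(r), r1.max(r), c0.min(c), c1.max(c)),
            });
        }
    }
    let Some((r0, r1, c0, c1)) = bounds else {
        return Vec::new(); // no data at all
    };
    grid[r0..=r1]
        .iter()
        // Short rows are padded with blanks so the result stays rectangular.
        .map(|row| (c0..=c1).map(|c| row.get(c).cloned().unwrap_or_default()).collect())
        .collect()
}
```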

odt/ods table fixtures now match live docling byte-for-byte (text_document_01
and _03 tables, odf_table_with_title, and the merged-cell presentation table);
`.odp` slide titles/shape-text/notes remain outstanding (tracked in
MIGRATION.md §5). Also adds a `.gitignore` for the local docling ground-truth venv.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01DofkqhMuAJbL9arnnuVisL
…fleischwolf-node crate

The Node.js/Bun N-API bindings (fleischwolf-node, published to npm) landed on
master — remove it from §6 Extensions' unbuilt-work list and add the crate to
the §1 architecture tree, matching README.md's Layout table.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01DofkqhMuAJbL9arnnuVisL
@artiz artiz force-pushed the feature/extend-docs-formats-support branch from d092d98 to 333bf77 Compare July 1, 2026 11:56
artiz and others added 2 commits July 1, 2026 14:05
Add support for MIME HTML (MHTML) archives, a saved-webpage format that docling
itself has no backend for. The format is specified by RFC 2557 (MIME
Encapsulation of Aggregate Documents), which mail-parser conforms to, so the new
backend parses the archive as a MIME message, extracts the root `text/html`
part, and routes it through the existing HTML backend for full Markdown
extraction, reusing block structure, tables, lists and links as-is.

Embedded resources (images, CSS) live as sibling MIME parts addressed by
`Content-Location` (the resource's original URL, which a saved page's
`<img src>` is rewritten to verbatim) or `Content-ID` (`cid:...` references,
used by some synthesized parts like inlined stylesheets). Images are resolved
against the archive's own parts via the existing `MapImageResolver` (the same
mechanism EPUB uses for its archive) and embedded by default — unlike
HTML/EPUB's opt-in `fetch_images` gate, there's no separate network/filesystem
read here, just the MIME bytes already parsed, so it matches how DOCX/PPTX
embed their blobs by default. Vector formats the `image` crate can't decode
(e.g. `image/svg+xml`) are left as unresolved placeholders rather than
embedded without pixel dimensions.
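
As an illustration of the lookup described above, the sketch below indexes image parts under both reference styles. The `Part` struct and `index_images` function are hypothetical stand-ins: this does not use mail-parser's types or the real `MapImageResolver`, and skipping `image/svg+xml` by content type is a simplification of "formats the `image` crate can't decode".

```rust
use std::collections::HashMap;

/// Hypothetical, already-parsed MIME part (not mail-parser's own type).
struct Part {
    content_type: String,
    content_location: Option<String>,
    content_id: Option<String>,
    body: Vec<u8>,
}

/// Index image parts under every reference an `<img src>` might use:
/// the original URL from `Content-Location`, and `cid:<id>` from `Content-ID`.
fn index_images(parts: &[Part]) -> HashMap<String, &[u8]> {
    let mut map = HashMap::new();
    for part in parts {
        let content_type = part.content_type.to_ascii_lowercase();
        // Assumption: SVG stands in for "cannot be decoded to pixels", so it is
        // left out of the map and stays an unresolved placeholder.
        if !content_type.starts_with("image/") || content_type == "image/svg+xml" {
            continue;
        }
        if let Some(location) = &part.content_location {
            map.insert(location.clone(), part.body.as_slice());
        }
        if let Some(id) = &part.content_id {
            // Content-ID header values are wrapped in angle brackets; the
            // matching `cid:` URL omits them.
            let id = id.trim_start_matches('<').trim_end_matches('>');
            map.insert(format!("cid:{id}"), part.body.as_slice());
        }
    }
    map
}
```

Because the map borrows the bytes already held by the parsed parts, resolving an image needs no network or filesystem access, which is the property the default-on embedding relies on.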

Wires `InputFormat::Mhtml` (`.mhtml`/`.mht`) through the converter and the
Node.js bindings' format list. Moves the two MHTML fixtures pushed to the
top-level conformance corpus into the crate-level regression corpus instead
(matching the `json_docling` precedent: docling has no MHTML backend, so
there's nothing to run `scripts/conformance.sh` against — regression fixtures
pinning our own output are the only applicable test).
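
The extension wiring might look roughly like the sketch below. The two-variant `InputFormat` enum is a cut-down, hypothetical stand-in for the real one, and both `format_from_path` and the case-insensitive match are assumptions made for the example.

```rust
use std::path::Path;

/// Cut-down, hypothetical stand-in for the converter's format enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InputFormat {
    Html,
    Mhtml,
}

/// Map a file extension to an input format.
/// Assumption: matching is case-insensitive, so `.MHT` is accepted too.
fn format_from_path(path: &Path) -> Option<InputFormat> {
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "html" | "htm" => Some(InputFormat::Html),
        "mhtml" | "mht" => Some(InputFormat::Mhtml),
        _ => None,
    }
}
```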

Updates MIGRATION.md (moves MHTML from the §6 Extensions roadmap into the §2
format table, now that it's implemented) and README.md (documents the new
format in Status and Image extraction).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01DofkqhMuAJbL9arnnuVisL
@artiz artiz merged commit 80894da into master Jul 1, 2026
3 checks passed
@artiz artiz deleted the feature/extend-docs-formats-support branch July 1, 2026 12:51