Enhance JATS, ODF, and DOCX backends with improved content extraction #16
Merged
30af77d to d092d98
…S quirks
Advance the three "not migrated" items from MIGRATION.md §5.
JATS: port docling's `_walk_linear` so `<body>`/`<back>` now render tables
(`<table-wrap>` with caption), figures (`<fig>` caption + picture), bullet
lists, `<ref-list>` references with `<element-citation>`/`<mixed-citation>`
flattening, `<fn-group>` footnotes and `<disp-formula>` equations. All four
JATS fixtures are now byte-for-byte with live docling (were 115–220 diff lines).
DOCX: finish the advanced OMML + inline-equation spacing long tail. omml.rs now
does limit-label space escaping (`do_lim`), `\operatorname{argmax}` limit funcs,
and pylatexenc's exact two-space symbol padding under a single `"  "->" "`
collapse. docx.rs reproduces docling's inline-group serialization so inline
equations pick up the doubled space and list items keep their marker
(`_handle_equations_in_text`). equations.docx is now byte-exact.
ODF: mixed-style list continuation + empty-list-item level collapse (a port of
`_add_odf_list` with `_OdfListState` threaded across sibling lists in both text
and presentation walks), and ODS sheet->table region detection (a strict
gap-tolerance flood fill splitting a sheet into its data regions, singletons
kept as 1x1 tables). odf_table_with_title_01.ods is now byte-exact.
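A minimal sketch of the sheet->table region split described above, assuming a boolean "cell has data" grid as input. The function name `find_regions`, the `Region` tuple, and the 4-neighbour connectivity rule are illustrative assumptions, not the code in odf.rs; `gap = 0` stands in for the "strict" tolerance.

```rust
use std::collections::{HashSet, VecDeque};

/// Inclusive bounding box of one data region: (min_row, min_col, max_row, max_col).
pub type Region = (usize, usize, usize, usize);

/// Splits a sheet into data regions with a breadth-first flood fill.
/// `filled[r][c]` is true when the cell holds data. `gap` is how many empty
/// cells may sit between two filled cells of the same row or column while
/// still joining them (0 = only directly adjacent cells join).
pub fn find_regions(filled: &[Vec<bool>], gap: usize) -> Vec<Region> {
    let mut seen: HashSet<(usize, usize)> = HashSet::new();
    let mut regions = Vec::new();
    for (r, row) in filled.iter().enumerate() {
        for (c, &has_data) in row.iter().enumerate() {
            // Start a new region only at an unvisited data cell.
            if !has_data || !seen.insert((r, c)) {
                continue;
            }
            let mut bounds = (r, c, r, c);
            let mut queue = VecDeque::from([(r, c)]);
            while let Some((cr, cc)) = queue.pop_front() {
                bounds = (
                    bounds.0.min(cr),
                    bounds.1.min(cc),
                    bounds.2.max(cr),
                    bounds.3.max(cc),
                );
                // Look up, down, left and right, up to `gap + 1` cells away.
                for step in 1..=gap + 1 {
                    let neighbours = [
                        (cr.checked_sub(step), Some(cc)),
                        (cr.checked_add(step), Some(cc)),
                        (Some(cr), cc.checked_sub(step)),
                        (Some(cr), cc.checked_add(step)),
                    ];
                    for (nr, nc) in neighbours {
                        let (Some(nr), Some(nc)) = (nr, nc) else { continue };
                        let is_filled = filled
                            .get(nr)
                            .and_then(|row| row.get(nc))
                            .copied()
                            .unwrap_or(false);
                        if is_filled && seen.insert((nr, nc)) {
                            queue.push_back((nr, nc));
                        }
                    }
                }
            }
            // A lone data cell ends up as a 1x1 region.
            regions.push(bounds);
        }
    }
    regions
}
```

With `gap = 0`, a 2x2 block and an isolated cell elsewhere on the sheet come back as two regions, the second being a 1x1 table.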
Shared: the Markdown table serializer now right-aligns numeric columns that use
thousands separators (`7,015`), matching tabulate's column typing — improves
JATS/XLSX/XBRL tables.
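As a rough illustration of the column typing above: a column counts as numeric when every non-empty cell parses as a number, with comma thousands separators allowed. `is_numeric_cell` and `align_column` are hypothetical names, and the grouping rule here is a simplification; tabulate's real typing rules are broader than this sketch.

```rust
/// Returns true when `cell` looks like a number, optionally written with
/// comma thousands separators ("7,015", "-1,234.5"). Hypothetical helper.
pub fn is_numeric_cell(cell: &str) -> bool {
    let text = cell.trim();
    let unsigned = text
        .strip_prefix(|c: char| c == '-' || c == '+')
        .unwrap_or(text);
    let (int_part, frac_part) = match unsigned.split_once('.') {
        Some((int, frac)) => (int, Some(frac)),
        None => (unsigned, None),
    };
    let all_digits = |s: &str| !s.is_empty() && s.bytes().all(|b| b.is_ascii_digit());
    let int_ok = if int_part.contains(',') {
        let mut groups = int_part.split(',');
        // First group has 1 to 3 digits, every later group exactly 3.
        groups.next().is_some_and(|g| all_digits(g) && g.len() <= 3)
            && groups.all(|g| all_digits(g) && g.len() == 3)
    } else {
        all_digits(int_part)
    };
    int_ok && frac_part.map_or(true, all_digits)
}

/// Pads every cell to the column width: right-aligned when all non-empty
/// cells are numeric, left-aligned otherwise.
pub fn align_column(cells: &[&str]) -> Vec<String> {
    let width = cells.iter().map(|c| c.chars().count()).max().unwrap_or(0);
    let numeric = cells.iter().any(|c| !c.trim().is_empty())
        && cells
            .iter()
            .filter(|c| !c.trim().is_empty())
            .all(|c| is_numeric_cell(c));
    cells
        .iter()
        .map(|c| {
            if numeric {
                format!("{c:>width$}")
            } else {
                format!("{c:<width$}")
            }
        })
        .collect()
}
```

A column holding `7,015` and `42` is padded on the left, while a column that mixes `abc` with `7,015` stays left-aligned.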
Fixtures regenerated; `cargo fmt`/`clippy -D warnings`/`cargo test` green.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01DofkqhMuAJbL9arnnuVisL
Complete the ODF "rich table cells" long tail and clean up table handling.
- Rich vs plain cell split (docling's `_odf_cell_is_rich`/`_odf_cell_text`): a cell holding a list, nested table, image or multiple paragraphs renders its full block content (formatting kept, lists as `- item`, nested tables flattened to space-joined cells), flattened to one line; a plain cell renders unformatted text (subscripts inline, no markers).
- Table grid rewrite: covered (`table:covered-table-cell`) columns of a col/row span are left blank instead of duplicating the anchor's text, nested-table rows no longer leak into the outer table, and the grid is trimmed to its data bounds, matching docling's `_add_table_from_odf` / `_find_true_data_bounds`.
- Paragraph images emit a placeholder picture before the text (skipping `ObjectReplacements/` previews); run collection recurses into every child so an image's `<svg:desc>` text is picked up; paragraph text is stripped at both ends (a leading `<text:line-break/>` no longer doubles a space).
odt/ods table fixtures are now byte-for-byte with live docling (text_document_01 and _03 tables, odf_table_with_title, and the merged-cell presentation table); `.odp` slide titles/shape-text/notes remain (tracked in MIGRATION.md §5).
Also adds a `.gitignore` for the local docling ground-truth venv.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01DofkqhMuAJbL9arnnuVisL
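The table grid rewrite in this commit (covered cells left blank, grid trimmed to its data bounds) could be sketched roughly as follows. The `Cell` enum and `build_grid` are hypothetical stand-ins for the ODF elements and for the port of `_add_table_from_odf` / `_find_true_data_bounds`; the real code handles spans, nested tables and rich cells that this sketch ignores.

```rust
/// A table cell as it appears in a row (hypothetical model of the ODF elements).
pub enum Cell {
    /// `table:table-cell` carrying text (the anchor of any span).
    Anchor(String),
    /// `table:covered-table-cell`: a position hidden under a col/row span.
    Covered,
}

/// Builds a rectangular text grid trimmed to the bounding box of its data.
pub fn build_grid(rows: &[Vec<Cell>]) -> Vec<Vec<String>> {
    // Covered positions become blanks rather than copies of the anchor's text.
    let grid: Vec<Vec<String>> = rows
        .iter()
        .map(|row| {
            row.iter()
                .map(|cell| match cell {
                    Cell::Anchor(text) => text.trim().to_string(),
                    Cell::Covered => String::new(),
                })
                .collect()
        })
        .collect();

    // Bounding box (min_row, min_col, max_row, max_col) of all non-empty cells.
    let mut bounds: Option<(usize, usize, usize, usize)> = None;
    for (r, row) in grid.iter().enumerate() {
        for (c, text) in row.iter().enumerate() {
            if text.is_empty() {
                continue;
            }
            bounds = Some(match bounds {
                None => (r, c, r, c),
                Some((r0, c0, r1, c1)) => (r0.min(r), c0.min(c), r1.max(r), c1.max(c)),
            });
        }
    }
    // A table with no data at all yields an empty grid.
    let Some((r0, c0, r1, c1)) = bounds else {
        return Vec::new();
    };

    // Copy the box, padding short rows with blanks so the result is rectangular.
    (r0..=r1)
        .map(|r| {
            (c0..=c1)
                .map(|c| grid[r].get(c).cloned().unwrap_or_default())
                .collect()
        })
        .collect()
}
```

A header that spans two columns keeps its text in the first column only, and trailing empty rows and columns are dropped.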
…fleischwolf-node crate
The Node.js/Bun N-API bindings (fleischwolf-node, published to npm) landed on master, so this removes the crate from §6 Extensions' unbuilt-work list and adds it to the §1 architecture tree, matching README.md's Layout table.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01DofkqhMuAJbL9arnnuVisL
d092d98 to 333bf77
Add support for MIME HTML archives, a saved-webpage format docling itself has no backend for. mail-parser conforms to RFC 2557 (MIME Encapsulation of Aggregate Documents / MHTML), so the new backend parses the archive as a MIME message, extracts the root `text/html` part, and routes it through the existing HTML backend for full Markdown extraction, reusing block structure, tables, lists and links as-is.
Embedded resources (images, CSS) live as sibling MIME parts addressed by `Content-Location` (the resource's original URL, which a saved page's `<img src>` is rewritten to verbatim) or `Content-ID` (`cid:...` references, used by some synthesized parts like inlined stylesheets). Images are resolved against the archive's own parts via the existing `MapImageResolver` (the same mechanism EPUB uses for its archive) and embedded by default. Unlike HTML/EPUB's opt-in `fetch_images` gate, there's no separate network/filesystem read here, just the MIME bytes already parsed, so it matches how DOCX/PPTX embed their blobs by default. Vector formats the `image` crate can't decode (e.g. `image/svg+xml`) are left as unresolved placeholders rather than embedded without pixel dimensions.
Wires `InputFormat::Mhtml` (`.mhtml`/`.mht`) through the converter and the Node.js bindings' format list.
Moves the two MHTML fixtures pushed to the top-level conformance corpus into the crate-level regression corpus instead (matching the `json_docling` precedent: docling has no MHTML backend, so there's nothing to run `scripts/conformance.sh` against, and regression fixtures pinning our own output are the only applicable test).
Updates MIGRATION.md (moves MHTML from the §6 Extensions roadmap into the §2 format table, now that it's implemented) and README.md (documents the new format in Status and Image extraction).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01DofkqhMuAJbL9arnnuVisL
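To illustrate the resource addressing described in this commit, here is a small sketch that indexes already-decoded MIME parts under both their `Content-Location` and their `cid:` form. The `Part` struct and `index_parts` are hypothetical; the sketch uses a plain `HashMap` and does not show mail-parser's API or the internals of `MapImageResolver`.

```rust
use std::collections::HashMap;

/// A decoded MIME part, reduced to the fields this sketch needs (hypothetical shape).
pub struct Part {
    pub content_location: Option<String>,
    pub content_id: Option<String>,
    pub body: Vec<u8>,
}

/// Indexes parts by every reference an `<img src>` might use.
pub fn index_parts(parts: &[Part]) -> HashMap<String, &[u8]> {
    let mut index = HashMap::new();
    for part in parts {
        if let Some(location) = &part.content_location {
            // A saved page's `src` is the original URL, copied verbatim.
            index.insert(location.trim().to_string(), part.body.as_slice());
        }
        if let Some(id) = &part.content_id {
            // Content-ID headers are written as `<id>`; HTML refers to them as `cid:id`.
            let bare = id.trim().trim_start_matches('<').trim_end_matches('>');
            index.insert(format!("cid:{bare}"), part.body.as_slice());
        }
    }
    index
}
```

Looking up an `<img src>` value in the returned map then yields the part's bytes with no network or filesystem read, which is the property the commit relies on for embedding by default.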
Summary
This PR improves content extraction in the JATS, ODF, and DOCX backends: JATS XML body and back-matter rendering, ODF table and list handling, and DOCX equation rendering.
Key Changes
JATS Backend (src/backend/jats.rs)
- `_walk_linear` algorithm for improved document structure traversal
- `<table-wrap>` tables with captions
- `<fig>` figures with captions
ODF Backend (src/backend/odf.rs)
- `HashSet` and `VecDeque` for improved data structure handling
DOCX Backend (src/backend/docx.rs)
OMML Backend (src/backend/omml.rs)
Markdown Output (fleischwolf-core/src/markdown.rs)
Test Updates
Updated expected outputs across multiple test fixtures:
Notable Implementation Details
https://claude.ai/code/session_01DofkqhMuAJbL9arnnuVisL