feat: serialize subscript and superscript in Markdown export #661
LucasArray wants to merge 1 commit
Signed-off-by: Lucas Araujo <29403436+LucasArray@users.noreply.github.com>
✅ DCO Check Passed: Thanks @LucasArray, all your commits are properly signed off. 🎉
Merge Protections: 🔴 1 of 2 protections blocking, waiting on 👀 reviews
🔴 Require two reviewers for test updates: this rule is failing. When test data is updated, we require two reviewers.
🟢 Enforce conventional commit: make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
btw, quick note on scope to set expectations. This is the Markdown counterpart to #319, which added subscript/superscript to the model and the HTML serializer. It covers the serialization step only: when a span is already tagged with Formatting(script=SUB/SUPER), Markdown now emits inline <sub>/<sup> instead of dropping it. Before this, HTML preserved these and Markdown did not.

It does not change detection. Whether a given PDF gets its sub/superscript tagged in the first place is a separate, upstream concern, so this alone will not resolve every case in #520 (for example, footnote markers that never get tagged as superscript, or a subscript that is routed through the formula path).

I linked #520 as the umbrella tracker for preserving sub/superscript on export, and used "Addresses" rather than "Closes" for that reason. Happy to adjust the approach or the issue reference if anyone would prefer.
Summary
The Markdown serializer drops subscript and superscript formatting. A text span marked Formatting(script=Script.SUB) or Script.SUPER is written out as plain text, so a subscript like the 2 in H2O or an exponent like E=mc2 becomes an ordinary digit with no way to recover it. The HTML serializer already preserves these as <sub>/<sup>; the Markdown serializer never implemented the hooks.

This adds them. #319 introduced the sub/superscript model and the HTML support but intentionally left Markdown out, since the Pandoc ~x~/^x^ syntax is uncommon and does not render in editors like VS Code. Emitting inline <sub>/<sup> avoids that: it renders on GitHub and in any CommonMark viewer, and it matches what the HTML serializer already produces, so Markdown and HTML stay in sync.

Addresses docling-project/docling#520.
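To make the syntax choice in the summary concrete, here is a small sketch that renders the same two strings in Pandoc-style markup and in inline HTML. The helper names are hypothetical and exist only for this illustration; they are not part of docling-core.

```python
# Hypothetical helpers, for illustration only (not docling-core API).

def pandoc_sub(text: str) -> str:
    """Pandoc-style subscript: wraps the text in tildes."""
    return f"~{text}~"


def pandoc_sup(text: str) -> str:
    """Pandoc-style superscript: wraps the text in carets."""
    return f"^{text}^"


def html_sub(text: str) -> str:
    """Inline HTML subscript, the form this PR emits."""
    return f"<sub>{text}</sub>"


def html_sup(text: str) -> str:
    """Inline HTML superscript, the form this PR emits."""
    return f"<sup>{text}</sup>"


water_pandoc = "H" + pandoc_sub("2") + "O"
water_html = "H" + html_sub("2") + "O"
energy_pandoc = "E=mc" + pandoc_sup("2")
energy_html = "E=mc" + html_sup("2")

print(water_pandoc, "|", water_html)
print(energy_pandoc, "|", energy_html)
```

A viewer without the Pandoc extensions shows the tildes and carets literally, while the HTML form relies only on raw inline HTML being passed through.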
Changes
- docling_core/transforms/serializer/markdown.py: implement serialize_subscript and serialize_superscript, returning <sub>{text}</sub> and <sup>{text}</sup>. They run in post_process after the content is HTML-escaped, like the existing bold/italic/strikethrough hooks, so only the wrapping tags are literal.
- docling_core/transforms/serializer/plain_text.py: PlainTextDocSerializer subclasses the Markdown serializer, so it would otherwise inherit the new tags. Override both hooks to return the text unchanged, consistent with how it already strips bold/italic/strikethrough.
- test/test_serialization.py: add test_md_subscript_formatting and test_md_superscript_formatting.
- test/data/doc/: the shared test fixture already contains sub/superscript spans, so their Markdown ground truth now renders <sub>/<sup> (one changed line per file). Note: this updates reference test data, which requires a double review per CONTRIBUTING.

Example
Before:
H2O

After:
<sub>H2O</sub>

I also checked this directly by exporting the same document with and without the change and rendering both outputs. The subscript and superscript text now displays as actual sub/superscript instead of flat text.
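The hook behaviour described under Changes can be sketched with a toy model. This is not the docling-core code: Span, ToyMarkdownSerializer, ToyPlainTextSerializer, the escape_text helper and the local Script enum are stand-ins invented for this illustration. Only the hook names serialize_subscript / serialize_superscript and the escape-then-wrap ordering come from the description above.

```python
from dataclasses import dataclass
from enum import Enum
from html import escape
from typing import Iterable


class Script(Enum):
    # Local stand-in for the model's script enum (assumed members).
    BASELINE = "baseline"
    SUB = "sub"
    SUPER = "super"


@dataclass
class Span:
    # Hypothetical text span carrying an optional script tag.
    text: str
    script: Script = Script.BASELINE


class ToyMarkdownSerializer:
    def escape_text(self, text: str) -> str:
        # Toy assumption: Markdown output HTML-escapes the content first.
        return escape(text, quote=False)

    def serialize_subscript(self, text: str) -> str:
        return f"<sub>{text}</sub>"

    def serialize_superscript(self, text: str) -> str:
        return f"<sup>{text}</sup>"

    def post_process(self, span: Span) -> str:
        # Escape first, then wrap, so only the wrapping tags stay literal.
        text = self.escape_text(span.text)
        if span.script is Script.SUB:
            return self.serialize_subscript(text)
        if span.script is Script.SUPER:
            return self.serialize_superscript(text)
        return text  # untagged spans pass through unchanged

    def serialize(self, spans: Iterable[Span]) -> str:
        return "".join(self.post_process(s) for s in spans)


class ToyPlainTextSerializer(ToyMarkdownSerializer):
    # Subclass would inherit the tags, so both hooks return the text as-is.
    def escape_text(self, text: str) -> str:
        return text  # toy assumption: plain text is not HTML-escaped

    def serialize_subscript(self, text: str) -> str:
        return text

    def serialize_superscript(self, text: str) -> str:
        return text


spans = [
    Span("H"), Span("2", Script.SUB), Span("O & E=mc"), Span("2", Script.SUPER),
]
md_out = ToyMarkdownSerializer().serialize(spans)
txt_out = ToyPlainTextSerializer().serialize(spans)
print(md_out)
print(txt_out)
```

Note that a span only gets wrapped if it already carries a script tag; an untagged "2" comes out as a plain digit in both serializers, which mirrors the point that detection is a separate, upstream concern.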
Testing
All run locally and passing:
- uv run pytest: 544 passed, 6 skipped
- uv run ruff check and uv run ruff format --check: clean
- uv run mypy docling_core test: clean
- The new tests check for <sub>/<sup> in the Markdown output; the plain-text serializer test confirms the tags are stripped there