feat: serialize subscript and superscript in Markdown export by LucasArray · Pull Request #661 · docling-project/docling-core

LucasArray · 2026-06-26T16:46:48Z

Summary

The Markdown serializer drops subscript and superscript formatting. A text span marked Formatting(script=Script.SUB) or Script.SUPER is written out as plain text, so a subscript like the 2 in H2O or an exponent like the 2 in E=mc2 becomes an ordinary digit with no way to recover it. The HTML serializer already preserves these as <sub>/<sup>; the Markdown serializer never implemented the hooks.

This adds them. #319 introduced the sub/superscript model and the HTML support but intentionally left Markdown out, since the Pandoc ~x~/^x^ syntax is uncommon and does not render in editors like VS Code. Emitting inline <sub>/<sup> avoids that: it renders on GitHub and in any CommonMark viewer, and it matches what the HTML serializer already produces, so Markdown and HTML stay in sync.

Addresses docling-project/docling#520.

Changes

docling_core/transforms/serializer/markdown.py: implement serialize_subscript and serialize_superscript, returning <sub>{text}</sub> and <sup>{text}</sup>. They run in post_process after the content is HTML-escaped, like the existing bold/italic/strikethrough hooks, so only the wrapping tags are literal.
docling_core/transforms/serializer/plain_text.py: PlainTextDocSerializer subclasses the Markdown serializer, so it would otherwise inherit the new tags. Override both hooks to return the text unchanged, consistent with how it already strips bold/italic/strikethrough.
test/test_serialization.py: add test_md_subscript_formatting and test_md_superscript_formatting.
Regenerated the Markdown reference data under test/data/doc/. The shared test fixture already contains sub/superscript spans, so their Markdown ground truth now renders <sub>/<sup> (one changed line per file). Note: this updates reference test data, which requires a double review per CONTRIBUTING.
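The hook ordering described above can be sketched roughly as follows. This is a minimal illustration, not the docling-core implementation: the classes MarkdownSketchSerializer and PlainTextSketchSerializer, the string-valued script argument, and the simplified post_process signature are all hypothetical stand-ins. Only the hook names serialize_subscript and serialize_superscript and the escape-then-wrap order come from the description.

```python
import html


class MarkdownSketchSerializer:
    """Hypothetical stand-in for the Markdown serializer (not the real class)."""

    def serialize_subscript(self, text: str) -> str:
        # Wrap already-escaped content in a literal <sub> tag.
        return f"<sub>{text}</sub>"

    def serialize_superscript(self, text: str) -> str:
        # Wrap already-escaped content in a literal <sup> tag.
        return f"<sup>{text}</sup>"

    def post_process(self, text: str, *, script: str = "baseline") -> str:
        # Assumed simplification: escape the content first, then apply the
        # formatting hook, so only the wrapping tags remain literal HTML.
        res = html.escape(text, quote=False)
        if script == "sub":
            res = self.serialize_subscript(res)
        elif script == "super":
            res = self.serialize_superscript(res)
        return res


class PlainTextSketchSerializer(MarkdownSketchSerializer):
    """Hypothetical plain-text subclass: overrides both hooks to drop the tags."""

    def serialize_subscript(self, text: str) -> str:
        return text

    def serialize_superscript(self, text: str) -> str:
        return text


md = MarkdownSketchSerializer()
plain = PlainTextSketchSerializer()
print(md.post_process("2", script="sub"))
print(md.post_process("a<b", script="super"))
print(plain.post_process("2", script="sub"))
```

Because the subclass inherits post_process, overriding the two hooks is enough to keep the tags out of plain-text output, which is the same reason the real PlainTextDocSerializer needs the overrides.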

Example

doc.add_text(label=DocItemLabel.TEXT, text="H2O", formatting=Formatting(script=Script.SUB))
doc.export_to_markdown()

Before: H2O
After: <sub>H2O</sub>

I also checked this directly by exporting the same document with and without the change and rendering both outputs. The subscript and superscript text now displays as actual sub/superscript instead of flat text.

Testing

All run locally and passing:

uv run pytest: 544 passed, 6 skipped
uv run ruff check and uv run ruff format --check: clean
uv run mypy docling_core test: clean
New unit tests assert <sub>/<sup> in the Markdown output; the plain-text serializer test confirms the tags are stripped there

Signed-off-by: Lucas Araujo <29403436+LucasArray@users.noreply.github.com>

github-actions · 2026-06-26T16:46:57Z

✅ DCO Check Passed

Thanks @LucasArray, all your commits are properly signed off. 🎉

mergify · 2026-06-26T16:47:27Z

Merge Protections

🔴 1 of 2 protections blocking · waiting on 👀 reviews

	Protection	Waiting on
🔴	Require two reviewer for test updates	👀 reviews
🟢	Enforce conventional commit	—

🔴 Require two reviewer for test updates

Waiting for

#approved-reviews-by >= 2

This rule is failing.

When test data is updated, we require two reviewers

#approved-reviews-by >= 2


🟢 Enforce conventional commit

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
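To see what this title rule accepts, the pattern can be tried locally. The snippet below is only a rough check: it assumes Python's re module treats the pattern the same way Mergify does, and the helper name is_conventional is hypothetical.

```python
import re

# Pattern copied verbatim from the rule above. Evaluating it with Python's
# re module is an assumption; Mergify runs its own matcher.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


print(is_conventional("feat: serialize subscript and superscript in Markdown export"))
print(is_conventional("Add subscript support to Markdown export"))
```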

LucasArray · 2026-06-29T15:18:59Z

btw, quick note on scope to set expectations.

This is the Markdown counterpart to #319, which added subscript/superscript to the model and the HTML serializer. It covers the serialization step only: when a span is already tagged with Formatting(script=SUB/SUPER), Markdown now emits inline <sub>/<sup> instead of dropping it. Before this, HTML preserved these and Markdown did not.

It does not change detection. Whether a given PDF gets its sub/superscript tagged in the first place is a separate, upstream concern, so this alone will not resolve every case in #520 (for example, footnote markers that never get tagged as superscript, or a subscript that is routed through the formula path). I linked #520 as the umbrella tracker for preserving sub/superscript on export, and used "Addresses" rather than "Closes" for that reason.

Happy to adjust the approach or the issue reference if anyone would prefer.

feat: serialize subscript and superscript in Markdown export

ac75c4c

Signed-off-by: Lucas Araujo <29403436+LucasArray@users.noreply.github.com>

LucasArray wants to merge 1 commit into docling-project:main from LucasArray:feat/markdown-sub-superscript