feat(markdown): add footnote serialization support#569
Conversation
|
✅ DCO Check Passed Thanks @ShrillHarrier, all your commits are properly signed off. 🎉 |
Merge Protections🔴 1 of 2 protections blocking · waiting on 👀 reviews
🔴 Require two reviewer for test updatesWaiting for
This rule is failing.When test data is updated, we require two reviewers
Show 1 satisfied protection🟢 Enforce conventional commitMake sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
|
Related Documentation 1 document(s) may need updating based on files changed in this PR: Docling What are the differences between
|
b849797 to
27e8b28
Compare
|
Signing off with personal email. |
483d902 to
f3f0876
Compare
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
ceberam
left a comment
There was a problem hiding this comment.
Thanks @ShrillHarrier for suggesting this PR. Please, see my comments.
In general:
- Adding new tests with a programmatic example is fine but it is much more illustrative to show the impact of the new feature in a serialization with a ground truth data file. This allows us to check how the output markdown file gets rendered in applications like Github or VSC. Please, check how this is done in other test modules.
- To keep the repository consistent, I would suggest that you add the tests in the module
test/test_serialization.py, together with the other tests of the markdown serialization, instead of creating a separate module. - There are some DoclingDocument files (
.json) that do not serialize as expected. Please, check some of them and ensure that the markdown serialization generates the right footnotes hook and text. For instance,test/data/doc/2408.09869v3_enriched.jsonhas a footnote. If you regenerate the ground truth files (by running the tests with env variableDOCLING_GEN_TEST_DATA=1), I would expect that the markdown serialization gets updated with the new footnote serialization.
| results = [] | ||
| for footnote in item.footnotes: | ||
| if isinstance(ftn := footnote.resolve(self.doc), TextItem): | ||
| parts = ftn.text.split(" ", 1) |
There was a problem hiding this comment.
The footnote parsing logic assumes a specific format (the identifier and the footnote text). This format is not clearly represented or documented. It would be good that the format is explicit, validated, and clearly documented. We should keep in mind that for the markdown footnote to work, identifiers can be numbers or words, but they can’t contain spaces or tabs..
In addition, I don't think that we keep the footnote references with the correct formatting (a caret and an identifier inside brackets, e.g., [^1]).
If I try to serialize a DoclingDocument from the test dataset, you'll see that the footnote 1 see huggingface.co/ds4sd/docling-models/ (doc item with reference #/texts/29) is not properly serialized as in the markdown specification.
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.base import ImageRefMode
with open("test/data/doc/2408.09869v3_enriched.json") as handler:
content = handler.read()
doc = DoclingDocument.model_validate_json(content)
doc.export_to_markdown(image_mode=ImageRefMode.PLACEHOLDER)There was a problem hiding this comment.
Hi, ah I see. Okay, I will refine the serialization and retest on the DoclingDocument files.
|
@ShrillHarrier please let me know when the PR is ready for another review. |
Signed-off-by: Matthew Panizza <username@users.noreply.github.com>
- Add special case handling for DocItemLabel.FOOTNOTE in _serialize_text_item - Standalone footnotes now serialize as [^id]: text instead of plain text - Fixes issue where footnotes not referenced by tables/pictures were not formatted correctly Note: Ground truth files will be updated in a follow-up commit
…thub.com> I, Matthew Panizza <username@users.noreply.github.com>, hereby add my Signed-off-by to this commit: dfca98f I, Matthew Panizza <username@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 030c0b4 I, Matthew Panizza <username@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 97de619 I, Matthew Panizza <username@users.noreply.github.com>, hereby add my Signed-off-by to this commit: a534839 I, Matthew Panizza <username@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 690c931 I, Matthew Panizza <username@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 4cfa21d Signed-off-by: Matthew Panizza <username@users.noreply.github.com>
Signed-off-by: Matthew Panizza <username@users.noreply.github.com>
Signed-off-by: Matthew Panizza <username@users.noreply.github.com>
Signed-off-by: Matthew Panizza <username@users.noreply.github.com>
Signed-off-by: Matthew Panizza <username@users.noreply.github.com>
3f71a9a to
c73f3a5
Compare
…thub.com> I, Matthew Panizza <username@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 4541deb I, Matthew Panizza <username@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 41e67c9 I, Matthew Panizza <username@users.noreply.github.com>, hereby add my Signed-off-by to this commit: e6cc4e3 I, Matthew Panizza <username@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 1ef1e75 I, Matthew Panizza <username@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 91af6cd I, Matthew Panizza <username@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 5fb30a4 Signed-off-by: Matthew Panizza <username@users.noreply.github.com>
Signed-off-by: Matthew Panizza <username@users.noreply.github.com>
|
Hello @ceberam, I have cleaned up the functionality and code in this PR. Since review I have completed the following:
|
|
@ShrillHarrier getting back to this after an absence...could you please resolve the conflicts with |
Signed-off-by: Matthew Panizza <username@users.noreply.github.com>
|
Hello @ceberam, I have merged with |
|
Hello @ceberam I have re-merged with |
ceberam
left a comment
There was a problem hiding this comment.
Thanks again @ShrillHarrier for all the efforts on this PR.
I left some minor technical comments.
However, my biggest concern is the proper rendering of those footnotes. The footnote itself looks well serialized according to the markdown specifications. However, this formatting (e.g., [^2]: Example) seems to require an anchor in the text. If there is no anchor, the footnote simply does not get displayed in some applications. Unfortunately, the DoclingDocument examples we have from PDFs, which were parsed from PDFs, do not keep a proper link between the anchored text (the reference) and the footnote itself. The changes from this PR would actually mask it. For instance, in test/data/doc/2206.01062.yaml.paged.md, the markdown serialization from this PR generates the text:
To ensure that future benchmarks in the document-layout analysis community can be easily compared, we have split up DocLayNet into pre-defined train-, test- and validation-sets. In this way, we can avoid spurious variations in the evaluation scores due to random splitting in train-, test- and validation-sets. We also ensured that less frequent labels are represented in train and test sets in equal proportions.
[^2]: e.g. AAPL from https://www.annualreports.com/
Table 1 shows the overall frequency and distribution of the labels among the different sets. Importantly, we ensure that subsets are only split on full-document boundaries. This avoids that pages of the same document are spread over train, test and validation set, which can give an undesired evaluation advantage to models and lead to overestimation of their prediction accuracy. We will show the impact of this decision in Section 5.
Github renders this section like this:
While before the changes, we could see this (not ideal but at least visible):
How to move forward
My idea would be to apply the new footnote formatting only if we see an anchor, so that we follow the Markdown syntax strictly and we are sure that Markdown editors will understand it.
The question is how the anchored text should be linked with the footnote. I would then apply the same pattern as the captions with the PictureItem objects.
- Check if a footnote is contained in a
FloatingItem(through itsfootnotesfield) - Use that floating item to store the anchor and render it through the Markdown specs: a caret and an identifier inside brackets (e.g.,
[^2]) - Render the footnote as you have already done: with a caret and the number inside brackets with a colon and text (
[^2]: My footnote.).
It could help to create a sample .json document that includes a reference (anchor) in a paragraph and the footnote with the right structure from above.
Later on, we could enhance the declarative backends in docling repository to follow this formatting (e.g., JATS, USPTO, Markdown,...) in a separate PR. Actually it would be helpful to have a round trip example with footnotes, parsing from and exporting to markdown (.md -> .json -> .md) to fully test this functionality.
Hope this helps and feel free to add any thoughts here.
Signed-off-by: Matthew Panizza <username@users.noreply.github.com>
Signed-off-by: Matthew Panizza <username@users.noreply.github.com>
Signed-off-by: Matthew Panizza <username@users.noreply.github.com>
Signed-off-by: Matthew Panizza <username@users.noreply.github.com>
Signed-off-by: Matthew Panizza <username@users.noreply.github.com>
Signed-off-by: Matthew Panizza <username@users.noreply.github.com>
Signed-off-by: Matthew Panizza <username@users.noreply.github.com>
…thub.com> I, Matthew Panizza <username@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 7e8f664 Signed-off-by: Matthew Panizza <username@users.noreply.github.com>
Signed-off-by: Matthew Panizza <username@users.noreply.github.com>
Signed-off-by: Matthew Panizza <username@users.noreply.github.com>
Signed-off-by: Matthew Panizza username@users.noreply.github.com><
Merge Protections🔴 1 of 2 protections blocking · waiting on 👀 reviews
🔴 Require two reviewer for test updatesWaiting for
This rule is failing.When test data is updated, we require two reviewers
Show 1 satisfied protection🟢 Enforce conventional commitMake sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
…thub.com> I, Matthew Panizza <username@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 155b189 Signed-off-by: Matthew Panizza <username@users.noreply.github.com>
Signed-off-by: Matthew Panizza username@users.noreply.github.com><
…thub.com> I, Matthew Panizza <username@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 9a54226 Signed-off-by: Matthew Panizza <username@users.noreply.github.com>
|
Hi @ceberam, I have adjusted the code to only show references if an anchor exists. We use the footnotes field in
With the current implementation, we only render the anchors in attached to |
|
This document demonstrates footnote anchoring across multiple element types. ObservationsOur analysis relies on a well-established methodology. Regional revenue data was collected across two consecutive quarters from a standardised reporting framework used by all participating business units. The figures presented in Table 1 summarise gross revenue by region before tax adjustments, allowing direct comparison between the North and South territories over the reference period. The North region consistently outperformed expectations driven by strong consumer demand and favourable currency movements, while the South region recorded more modest growth constrained by supply chain disruptions in the second quarter. Table 1: Quarterly revenue by region.
The revenue growth observed in both regions aligns with the broader market trends reported in the same period. Variance between Q1 and Q2 is within acceptable tolerance bounds and does not indicate systemic reporting issues. Pipeline ArchitectureThe pipeline architecture is illustrated below. The diagram captures the primary processing stages from document ingestion through to structured output generation. Figure 1: Pipeline overview diagram. The diagram above illustrates the end-to-end flow of documents through the system. Each stage operates independently, enabling modular replacement of individual components without disrupting the overall pipeline. Further details are available in the appendix. Footnotes |
Signed-off-by: Matthew Panizza <username@users.noreply.github.com>
This PR is related to the Improved Footnote Serialization in MarkdownDocSerializer.
It is a Feature Request submitted by simonschoe in docling-project.
The features added include serializing a footnote in the form
[^{Identifier}]: {Description}. This is done for Table and Picture items, as footnotes are linked to those.In general, footnotes in
.mdfiles should look like:Resolves docling-project/docling#3128
Tests Added: