
feat(boxnote): add a BoxNote document backend#3722

Open
pablopupo wants to merge 1 commit into docling-project:main from pablopupo:feat/481-boxnote-backend

Conversation

@pablopupo

Contributor

Description

Adds a backend for Box Notes (.boxnote), modeled on the existing csv and webvtt backends.

.boxnote files are JSON, but Box has two incompatible schemas. The older atext/pool model (what boxnotes2html parses) predates a 2022 change; the current one is a ProseMirror-style doc tree, which is what the app exports now. This handles the current format and maps its nodes onto a DoclingDocument: headings, paragraphs (with bold/italic/underline/strikethrough and links), lists, tables, code and images.
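As a rough illustration of the inline mapping described above, a paragraph node can be flattened into (text, styles, href) runs. This is a minimal sketch, not the backend's code: the mark names (`strong`, `em`, `underline`, `strikethrough`, `link`) follow ProseMirror's usual naming and are assumptions about what Box emits, and `paragraph_runs` is a hypothetical helper.

```python
from __future__ import annotations

from typing import Any

# Assumed mark names (ProseMirror-style); the names Box actually uses may differ.
_STYLE_MARKS = {
    "strong": "bold",
    "em": "italic",
    "underline": "underline",
    "strikethrough": "strikethrough",
}


def paragraph_runs(node: dict[str, Any]) -> list[tuple[str, frozenset[str], str | None]]:
    """Flatten a paragraph node into (text, styles, href) runs."""
    runs: list[tuple[str, frozenset[str], str | None]] = []
    for child in node.get("content", []):
        if child.get("type") != "text":
            continue  # inline non-text nodes are ignored in this sketch
        styles: set[str] = set()
        href: str | None = None
        for mark in child.get("marks", []):
            kind = mark.get("type")
            if kind in _STYLE_MARKS:
                styles.add(_STYLE_MARKS[kind])
            elif kind == "link":
                href = mark.get("attrs", {}).get("href")
        runs.append((child.get("text", ""), frozenset(styles), href))
    return runs


# Hypothetical sample node, hand-written for this example.
paragraph = {
    "type": "paragraph",
    "content": [
        {"type": "text", "text": "Hello "},
        {
            "type": "text",
            "text": "Box",
            "marks": [
                {"type": "strong"},
                {"type": "link", "attrs": {"href": "https://www.box.com"}},
            ],
        },
    ],
}
print(paragraph_runs(paragraph))
```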

The decode is pure Python, so there's no new dependency. A legacy atext/pool note is detected and rejected with a clear DocumentLoadError rather than mis-parsed. I left legacy out to keep this focused; it's an easy follow-up if you'd want it here.
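The detection step could look roughly like the sketch below. It assumes a legacy note can be recognized by a top-level `atext` key and a current one by a top-level `doc` object of type `doc`; both key names are assumptions drawn from the schema names above, and `UnsupportedBoxNoteError` is a stand-in for DocumentLoadError so the snippet runs with the standard library alone.

```python
from __future__ import annotations

import json
from typing import Any


class UnsupportedBoxNoteError(ValueError):
    """Stand-in for the DocumentLoadError raised by the real backend."""


def load_boxnote(raw: str | bytes) -> dict[str, Any]:
    """Return the doc tree of a current-format note, or raise a clear error."""
    try:
        payload = json.loads(raw)
    except ValueError as exc:  # covers JSONDecodeError and bad byte encodings
        raise UnsupportedBoxNoteError("not valid JSON") from exc
    if not isinstance(payload, dict):
        raise UnsupportedBoxNoteError("expected a JSON object at the top level")
    if "atext" in payload:  # assumed marker of the legacy atext/pool schema
        raise UnsupportedBoxNoteError("legacy atext/pool Box Notes are not supported")
    doc = payload.get("doc")
    if not isinstance(doc, dict) or doc.get("type") != "doc":
        raise UnsupportedBoxNoteError("no ProseMirror-style doc tree found")
    return doc
```

Rejecting early keeps a legacy note from being walked as if it were a doc tree, which is the mis-parse the description is guarding against.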

Changes

  • docling/backend/boxnote_backend.py: new BoxNoteDocumentBackend.
  • docling/datamodel/base_models.py: register InputFormat.BOXNOTE, the .boxnote extension, and a mime type.
  • docling/datamodel/document.py: resolve .boxnote to its mime in format detection.
  • docling/document_converter.py: BoxNoteFormatOption on the SimplePipeline, plus its default option.
  • docs/usage/supported_formats.md: list BoxNote.

Tests

tests/test_backend_boxnote.py covers a sample .boxnote end to end (Markdown, indented-text and JSON ground truth), format detection, the legacy-unsupported path, and empty input.

Resolves #481

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

@github-actions

Contributor

DCO Check Passed

Thanks @pablopupo, all your commits are properly signed off. 🎉

@mergify

mergify Bot commented Jun 29, 2026

Contributor

Merge Protections

🔴 1 of 2 protections blocking · waiting on 👀 reviews

| Protection | Waiting on |
| --- | --- |
| 🔴 Require two reviewer for test updates | 👀 reviews |
| 🟢 Enforce conventional commit | |

🔴 Require two reviewer for test updates

Waiting for

  • #approved-reviews-by >= 2
This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2


🟢 Enforce conventional commit

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
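To see why this PR's title satisfies the rule, the same pattern can be tried locally. This is a small sketch that assumes the `~=` operator behaves like a regular-expression search; `is_conventional` is a hypothetical helper, not part of Mergify.

```python
import re

# The pattern is copied from the rule above.
TITLE_RULE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """True when the title starts with a conventional-commit prefix."""
    return TITLE_RULE.search(title) is not None


print(is_conventional("feat(boxnote): add a BoxNote document backend"))
print(is_conventional("Add a BoxNote backend"))
```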

@PeterStaar-IBM PeterStaar-IBM left a comment

Member


Can you also update the README.md and docs/index.md?

Comment thread docling/backend/boxnote_backend.py Outdated
if len(runs) == 1:
    text, formatting, hyperlink = runs[0]
    doc.add_text(
        label=DocItemLabel.PARAGRAPH,

Member


DocItemLabel.PARAGRAPH is deprecated, please use DocItemLabel.TEXT

Comment thread docling/backend/boxnote_backend.py Outdated
col_span = attrs.get("colspan") or 1
cells.append(
    TableCell(
        text=self._cell_text(cell.get("content", [])),

Member


What happens if we have images or multiple text blocks in the table cell? In the DoclingDocument, we need to make a RichTableCell.

It would be good to have this covered.
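One way to decide when a cell needs the rich representation is sketched below, under the assumption that a cell's `content` is a list of block nodes and that a plain cell is a single paragraph holding only text nodes. The node type names (`paragraph`, `text`, `image`) are assumptions, and `needs_rich_cell` is a hypothetical helper.

```python
from __future__ import annotations

from typing import Any


def needs_rich_cell(blocks: list[dict[str, Any]]) -> bool:
    """True when a cell holds more than one block or any non-text content."""
    if not blocks:
        return False  # an empty cell stays a plain cell
    if len(blocks) > 1:
        return True  # multiple text blocks
    block = blocks[0]
    if block.get("type") != "paragraph":
        return True  # e.g. an image or a list as the only block
    # A lone paragraph is plain only if every inline child is a text node.
    return any(child.get("type") != "text" for child in block.get("content", []))
```

A test fixture with one cell of each kind (plain, multi-block, image) would cover the three branches.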

@PeterStaar-IBM

Member

@pablopupo Many thanks for this addition! Can you look into my comments?

@codecov

codecov Bot commented Jun 30, 2026


Codecov Report

❌ Patch coverage is 92.34043% with 18 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| docling/backend/boxnote_backend.py | 92.10% | 18 Missing ⚠️ |


@ceberam ceberam left a comment

Member


Thanks @pablopupo for enabling Docling with yet another file format 🚀

I left some comments and I agree with @PeterStaar-IBM on the gap with rich cell content.
I also share some recommendations for the style:

  • Instead of Optional[X] / Union style use the modern X | None style
  • Try to be more specific in method signatures, instead of using bare list / dict type hints (e.g., prefer list[dict] / dict[str, Any]). We have seen very often that these type hints help Mypy identify potential issues.
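For example, the two recommendations applied to one signature might look like this. `find_attr` is a hypothetical helper written for illustration, not a function from the PR; the old-style signature is kept as a comment for comparison.

```python
from __future__ import annotations

from typing import Any

# Before (older style, bare container type):
#     def find_attr(nodes: list, key: str) -> Optional[Union[str, int]]: ...


# After: X | None unions and parameterized containers.
def find_attr(nodes: list[dict[str, Any]], key: str) -> str | int | None:
    """Return the first non-null attrs[key] found in nodes, if any."""
    for node in nodes:
        value = node.get("attrs", {}).get(key)
        if value is not None:
            return value
    return None
```

With `list[dict[str, Any]]`, a call such as `find_attr(["a", "b"], "colspan")` is flagged by the type checker instead of failing at runtime on `.get`.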

Comment on lines +94 to +99
origin = DocumentOrigin(
    filename=self.file.name or "file.boxnote",
    # Box Notes are JSON; DocumentOrigin only accepts registered mime types.
    mimetype="application/json",
    binary_hash=self.document_hash,
)

Member


Indeed, DocumentOrigin with application/vnd.box.boxnote would raise a Pydantic validation error since it is not registered in the mimetypes library. However, the fix should rather be to register application/vnd.box.boxnote in DocumentOrigin._extra_mimetypes in docling-core, then use the correct MIME.
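The pattern being suggested is roughly "standard registry plus an explicit allow-list". The sketch below shows it with the standard library only; `EXTRA_MIMETYPES` and `is_accepted_mimetype` are hypothetical names, and the actual validator in docling-core may be implemented differently.

```python
import mimetypes

# Hypothetical allow-list standing in for DocumentOrigin._extra_mimetypes.
EXTRA_MIMETYPES = frozenset({"application/vnd.box.boxnote"})


def is_accepted_mimetype(mime: str, extra: frozenset = EXTRA_MIMETYPES) -> bool:
    """Accept a mime type known to the stdlib registry or listed in `extra`."""
    return mime in mimetypes.types_map.values() or mime in extra


print(is_accepted_mimetype("application/json"))
print(is_accepted_mimetype("application/vnd.box.boxnote"))
```

Once the vendor type is on the allow-list, the backend can report it directly instead of borrowing application/json.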

Comment thread docling/backend/boxnote_backend.py Outdated
Comment on lines +365 to +369
def _as_url(href: str) -> Optional[Union[AnyUrl, Path]]:
    try:
        return _HYPERLINK.validate_python(href)
    except ValueError:
        return None

Member


Just wondering if we shouldn't be more strict here, otherwise any random string becomes a URL:

BoxNoteDocumentBackend._as_url("javascript:alert(1)")
# AnyUrl('javascript:alert(1)')
BoxNoteDocumentBackend._as_url("hello")
# PosixPath('hello')
BoxNoteDocumentBackend._as_url("")
# PosixPath('.')
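A stricter variant could use a scheme allow-list, as in the sketch below. It relies only on `urllib.parse` and returns a plain string rather than an AnyUrl, so it is a simplified stand-in for `_as_url`; the allowed schemes (http, https, mailto) and the name `as_safe_url` are illustrative choices.

```python
from __future__ import annotations

from urllib.parse import urlsplit

_ALLOWED_SCHEMES = {"http", "https", "mailto"}


def as_safe_url(href: object) -> str | None:
    """Return href if it is an http(s) or mailto URL, otherwise None."""
    if not isinstance(href, str):
        return None  # non-string hrefs are dropped, not raised on
    candidate = href.strip()
    try:
        parts = urlsplit(candidate)
    except ValueError:
        return None  # malformed input such as an unbalanced IPv6 bracket
    if parts.scheme not in _ALLOWED_SCHEMES:
        return None  # javascript:, bare strings, relative paths, ...
    if parts.scheme == "mailto":
        return candidate if parts.path else None
    return candidate if parts.netloc else None


for sample in ("https://www.box.com/notes", "javascript:alert(1)", "hello", ""):
    print(repr(sample), "->", as_safe_url(sample))
```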

Comment thread docling/backend/boxnote_backend.py Outdated
Comment on lines +75 to +79
@override
def unload(self):
    if isinstance(self.path_or_stream, BytesIO):
        self.path_or_stream.close()
    self.path_or_stream = None

Member


AbstractDocumentBackend.unload() already does the same thing. The override should be removed.

Comment thread docling/backend/boxnote_backend.py Outdated
Comment on lines +260 to +261
if cell.get("type") != "table_cell":
    continue

Member


The Box Notes ProseMirror schema uses table_header as the type for header cells. With this if statement, the header cells are completely dropped and thus the table header rows vanish from the output.
Please, accept both types and set column_header appropriately in TableCell.
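A minimal sketch of the requested change follows, with a small dataclass standing in for docling's TableCell so it runs on its own. `Cell`, `row_cells` and `plain_text` are hypothetical names; only the `table_cell` / `table_header` type names and the `column_header` flag come from the comment above.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any

_CELL_TYPES = {"table_cell", "table_header"}


@dataclass
class Cell:
    """Stand-in for docling's TableCell, reduced to the fields used here."""

    text: str
    column_header: bool = False


def plain_text(node: dict[str, Any]) -> str:
    """Concatenate the text of a node and all of its descendants."""
    if node.get("type") == "text":
        return node.get("text", "")
    return "".join(plain_text(child) for child in node.get("content", []))


def row_cells(row: dict[str, Any]) -> list[Cell]:
    """Collect body and header cells of a row, flagging header cells."""
    cells: list[Cell] = []
    for node in row.get("content", []):
        kind = node.get("type")
        if kind not in _CELL_TYPES:
            continue  # skip anything that is not a cell
        cells.append(Cell(text=plain_text(node), column_header=(kind == "table_header")))
    return cells
```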

Box Notes (.boxnote) are JSON. This adds a declarative backend that reads
the current ProseMirror-style "doc" schema, the format the Box Notes app
exports today, and maps it onto a DoclingDocument: headings, paragraphs
with bold/italic/underline/strikethrough and links, bullet, ordered and
check lists, tables, code and images.

Decoding is pure-Python (stdlib json), so no new dependency is added.
Legacy notes (the pre-August-2022 atext/pool format that boxnotes2html
parses) are detected and reported as unsupported rather than mis-parsed,
and can follow in a later change.

Registers InputFormat.BOXNOTE with a .boxnote extension and mime, wired to
the SimplePipeline, with a unit test and a sample fixture.

Resolves docling-project#481

Signed-off-by: pablopupo <145598901+pablopupo@users.noreply.github.com>
@pablopupo

Contributor Author

I pushed a round of fixes that should cover what you both raised:

  • DocItemLabel.PARAGRAPH is now DocItemLabel.TEXT.
  • For tables, table_header cells are kept and marked as column_header, and any cell with images or multiple blocks now becomes a RichTableCell. A lone cell that carries a link or formatting does too, since otherwise the hyperlink would be dropped (a plain TableCell has no hyperlink field). The fixture exercises all three.
  • I removed the unload() override since it just duplicated the base class.
  • _as_url is much stricter, so only http/https/mailto turn into links and everything else (javascript:, bare strings, relative paths) is dropped rather than guessed at. A malformed href no longer takes the whole conversion down with it either.
  • The annotations use X | None and specific list[dict] / dict[str, Any] hints now.
  • README.md and docs/index.md both list Box Notes.

While I was in there I also tightened up a couple of malformed inputs, so a non-object JSON payload or a non-string href fails cleanly instead of raising.

On the mime type, I opened docling-project/docling-core#668 for it. This PR can't switch to application/vnd.box.boxnote until that's in a docling-core release though, since DocumentOrigin rejects the mime on the current floor (2.84.0), so I've kept application/json here with a comment and will flip it (and bump the floor) once it's released. I can hold this PR for that if you prefer!

pablopupo added a commit to pablopupo/docling-core that referenced this pull request Jun 30, 2026
DocumentOrigin rejects application/vnd.box.boxnote because it isn't in
Python's mimetypes registry, so the Box Note backend in docling
(docling-project/docling#3722) falls back to application/json. This
registers it in _extra_mimetypes so the backend can use the correct mime.

The list is also sorted alphabetically now.

Signed-off-by: pablopupo <145598901+pablopupo@users.noreply.github.com>

Development

Successfully merging this pull request may close these issues.

Add a BoxNote backend

3 participants