feat(boxnote): add a BoxNote document backend #3722
✅ DCO Check Passed: Thanks @pablopupo, all your commits are properly signed off. 🎉
Merge Protections: 🔴 1 of 2 protections blocking · waiting on 👀 reviews
🔴 Require two reviewer for test updates: this rule is failing. When test data is updated, we require two reviewers.
🟢 Enforce conventional commit: make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
PeterStaar-IBM left a comment
Can you also update the README.md and docs/index.md?
if len(runs) == 1:
    text, formatting, hyperlink = runs[0]
    doc.add_text(
        label=DocItemLabel.PARAGRAPH,
DocItemLabel.PARAGRAPH is deprecated, please use DocItemLabel.TEXT
col_span = attrs.get("colspan") or 1
cells.append(
    TableCell(
        text=self._cell_text(cell.get("content", [])),
What happens if a table cell contains images or multiple text blocks? In the DoclingDocument, such a cell needs to be a RichTableCell.
It would be good to have this covered.
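To make the gap concrete, here is a minimal sketch of how a cell could be classified as needing a rich representation. It works on plain dicts rather than docling types; `needs_rich_cell` is a hypothetical helper, and the node type names (`paragraph`, `text`, `image`) are assumptions about the Box Notes schema, not something taken from this PR.

```python
from typing import Any


def needs_rich_cell(cell: dict[str, Any]) -> bool:
    """Return True when a cell holds anything beyond one paragraph of plain text.

    Assumption: cells are ProseMirror-style dicts whose children live under
    "content"; the type names used here are illustrative.
    """
    blocks = cell.get("content") or []
    if len(blocks) != 1:
        # Empty cells are plain; several blocks cannot be flattened losslessly.
        return len(blocks) > 1
    block = blocks[0]
    if block.get("type") != "paragraph":
        # A lone list, code block, image, etc. is not plain text.
        return True
    # A single paragraph is plain only if every inline child is a text node.
    return any(child.get("type") != "text" for child in block.get("content") or [])
```

A backend could branch on such a check and only fall back to a flattened text cell when it returns False.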
@pablopupo Many thanks for this addition! Can you look into my comments?
ceberam left a comment
Thanks @pablopupo for enabling Docling with yet another file format 🚀
I left some comments and I agree with @PeterStaar-IBM on the gap with rich cell content.
I also share some recommendations for the style:
- Instead of the Optional[X]/Union style, use the modern X | None style.
- Try to be more specific in method signatures instead of using bare list/dict type hints (e.g., prefer list[dict]/dict[str, Any]). We have seen very often that these type hints help Mypy identify potential issues.
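As an illustration of both recommendations, a small sketch follows. `find_href` is a hypothetical helper, and the mark layout (`type`, `attrs.href`) is an assumption made for the example, not code from this PR.

```python
from __future__ import annotations  # keeps "X | None" annotations valid on older interpreters

from typing import Any


def find_href(marks: list[dict[str, Any]]) -> str | None:
    """Return the first link target among ProseMirror-style marks, if any."""
    for mark in marks:
        if mark.get("type") == "link":
            href = (mark.get("attrs") or {}).get("href")
            if isinstance(href, str):
                return href
    return None
```

With the parameter typed as list[dict[str, Any]] rather than a bare list, a checker such as Mypy can flag a caller that passes a list of strings.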
origin = DocumentOrigin(
    filename=self.file.name or "file.boxnote",
    # Box Notes are JSON; DocumentOrigin only accepts registered mime types.
    mimetype="application/json",
    binary_hash=self.document_hash,
)
Indeed, DocumentOrigin with application/vnd.box.boxnote would raise a Pydantic validation error since it is not registered in the mimetypes library. However, the fix should rather be to register application/vnd.box.boxnote in DocumentOrigin._extra_mimetypes in docling-core, then use the correct MIME.
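For readers unfamiliar with this kind of check, the idea can be sketched with the standard library alone. This is an assumption about the general shape of the validation, not docling-core's actual code; `EXTRA_MIMETYPES` and `is_known_mimetype` are hypothetical names standing in for the real allow-list.

```python
import mimetypes

# Hypothetical stand-in for an allow-list of extra MIME types; the real
# list lives in docling-core and is not reproduced here.
EXTRA_MIMETYPES: frozenset[str] = frozenset({"application/vnd.box.boxnote"})


def is_known_mimetype(mime: str, extra: frozenset[str] = EXTRA_MIMETYPES) -> bool:
    """Accept a MIME type if the stdlib registry or the extra set knows it."""
    return mime in mimetypes.types_map.values() or mime in extra
```

Under this model, adding the vendor type to the extra set is what lets the backend report the correct MIME instead of falling back to application/json.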
def _as_url(href: str) -> Optional[Union[AnyUrl, Path]]:
    try:
        return _HYPERLINK.validate_python(href)
    except ValueError:
        return None
Just wondering if we shouldn't be more strict here, otherwise any random string becomes a URL:
BoxNoteDocumentBackend._as_url("javascript:alert(1)")
# AnyUrl('javascript:alert(1)')
BoxNoteDocumentBackend._as_url("hello")
# PosixPath('hello')
BoxNoteDocumentBackend._as_url("")
# PosixPath('.')

@override
def unload(self):
    if isinstance(self.path_or_stream, BytesIO):
        self.path_or_stream.close()
    self.path_or_stream = None
AbstractDocumentBackend.unload() already does the same thing. The override should be removed.
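On the earlier `_as_url` question, one possible stricter variant is to accept only an explicit set of schemes before handing the value to any URL type. The sketch below uses only the standard library; `safe_href` and `ALLOWED_SCHEMES` are hypothetical names, and the choice of schemes is an assumption rather than something agreed in this review.

```python
from __future__ import annotations

from urllib.parse import urlsplit

# Assumed allow-list; adjust to whatever the project decides to support.
ALLOWED_SCHEMES = frozenset({"http", "https", "mailto"})


def safe_href(href: object) -> str | None:
    """Return the href only if it is a non-empty string with an allowed scheme."""
    if not isinstance(href, str):
        return None
    candidate = href.strip()
    if not candidate:
        return None
    try:
        scheme = urlsplit(candidate).scheme.lower()
    except ValueError:
        # urlsplit raises on some malformed inputs, e.g. a broken IPv6 host.
        return None
    return candidate if scheme in ALLOWED_SCHEMES else None
```

With this filter, the three inputs from the comment above (a javascript: URI, a bare word, and an empty string) would all be rejected instead of becoming an AnyUrl or a Path.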
if cell.get("type") != "table_cell":
    continue
The Box Notes ProseMirror schema uses table_header as the type for header cells. With this if statement, the header cells are completely dropped and thus the table header rows vanish from the output.
Please accept both types and set column_header appropriately in TableCell.
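A minimal sketch of the suggested handling, using plain dicts instead of TableCell so that it stays self-contained. `collect_cells` is a hypothetical helper; the returned flag is what a backend would pass on as column_header.

```python
from typing import Any

# Both node types named in the comment above count as cells.
CELL_TYPES = frozenset({"table_cell", "table_header"})


def collect_cells(row: dict[str, Any]) -> list[tuple[dict[str, Any], bool]]:
    """Pair every cell node in a row with a flag marking header cells."""
    cells: list[tuple[dict[str, Any], bool]] = []
    for node in row.get("content") or []:
        node_type = node.get("type")
        if node_type not in CELL_TYPES:
            continue  # skip anything that is not a cell
        cells.append((node, node_type == "table_header"))
    return cells
```

This way header rows survive the conversion and only genuinely unknown nodes are skipped.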
Box Notes (.boxnote) are JSON. This adds a declarative backend that reads the current ProseMirror-style "doc" schema, the format the Box Notes app exports today, and maps it onto a DoclingDocument: headings, paragraphs with bold/italic/underline/strikethrough and links, bullet, ordered and check lists, tables, code and images.

Decoding is pure-Python (stdlib json), so no new dependency is added. Legacy notes (the pre-August-2022 atext/pool format that boxnotes2html parses) are detected and reported as unsupported rather than mis-parsed, and can follow in a later change.

Registers InputFormat.BOXNOTE with a .boxnote extension and mime, wired to the SimplePipeline, with a unit test and a sample fixture.

Resolves docling-project#481

Signed-off-by: pablopupo <145598901+pablopupo@users.noreply.github.com>
0a8dc37 to eed5c2b
I pushed a round of fixes that should cover what you both raised:
While I was in there I also tightened up a couple of malformed inputs, so a non-object JSON payload or a non-string href fails cleanly instead of raising.
On the mime type, I opened docling-project/docling-core#668 for it. This PR can't switch to
DocumentOrigin rejects application/vnd.box.boxnote because it isn't in Python's mimetypes registry, so the Box Note backend in docling (docling-project/docling#3722) falls back to application/json. This registers it in _extra_mimetypes so the backend can use the correct mime. The list is also sorted alphabetically now. Signed-off-by: pablopupo <145598901+pablopupo@users.noreply.github.com>
Description
Adds a backend for Box Notes (.boxnote), modeled on the existing csv and webvtt backends.
.boxnote files are JSON, but Box has two incompatible schemas. The older atext/pool model (what boxnotes2html parses) predates a 2022 change; the current one is a ProseMirror-style doc tree, which is what the app exports now. This handles the current format and maps its nodes onto a DoclingDocument: headings, paragraphs (with bold/italic/underline/strikethrough and links), lists, tables, code and images.
The decode is pure Python, so there's no new dependency. A legacy atext/pool note is detected and rejected with a clear DocumentLoadError rather than mis-parsed. I left legacy out to keep this focused; it's an easy follow-up if you'd want it here.
Changes
- docling/backend/boxnote_backend.py: new BoxNoteDocumentBackend.
- docling/datamodel/base_models.py: register InputFormat.BOXNOTE, the .boxnote extension, and a mime type.
- docling/datamodel/document.py: resolve .boxnote to its mime in format detection.
- docling/document_converter.py: BoxNoteFormatOption on the SimplePipeline, plus its default option.
- docs/usage/supported_formats.md: list BoxNote.
Tests
tests/test_backend_boxnote.py covers a sample .boxnote end to end (Markdown, indented-text and JSON ground truth), format detection, the legacy-unsupported path, and empty input.
Resolves #481
Checklist: