fix(xlsx): preserve headers after section labels #3727
✅ DCO Check Passed. Thanks @Success6666, all your commits are properly signed off. 🎉
Merge Protections
🟢 Merge protection satisfied, ready to merge.
🟢 Enforce conventional commit: make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
12cb182 to f2a47be
Codecov Report❌ Patch coverage is
Signed-off-by: Success6666 <Success6666@users.noreply.github.com>
f2a47be to 68eb3e8
Pull request overview
This PR updates the Excel backend to split out a leading merged “section label” row that is directly attached to a real table header row, so the true column headers remain headers (both in the DoclingDocument table model and in exported HTML). It also adds regression tests to cover the reported issue scenario.
Changes:
- Split a leading merged section label out of detected Excel tables and emit it as a separate `TextItem`.
- Preserve the real column header row as `column_header=True` after the split (so it renders as `<th>` in HTML).
- Add regression tests covering both the `DoclingDocument` structure and HTML output.
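The split described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the `Cell` dataclass and `split_leading_section_label` function are hypothetical stand-ins, and the assumption that the first row contains only the merged label cell is mine. Only the four-part guard on the title cell is taken from the reviewed snippet.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Cell:
    """Hypothetical stand-in for a table cell; not docling's actual class."""

    text: str
    row: int
    col: int
    row_span: int = 1
    col_span: int = 1
    column_header: bool = False


def split_leading_section_label(cells, num_cols):
    """Return (label_text, remaining_cells).

    Assumption: a section label is a lone cell in row 0 that starts at
    column 0, is one row tall, and spans more than one but fewer than
    all columns. If row 0 does not look like that, nothing is split.
    """
    first_row = [c for c in cells if c.row == 0]
    if len(first_row) != 1:
        return None, list(cells)
    title = first_row[0]
    if (
        title.col != 0
        or title.row_span != 1
        or title.col_span <= 1
        or title.col_span >= num_cols
    ):
        return None, list(cells)

    remaining = []
    for c in cells:
        if c.row == 0:
            continue  # drop the label row from the table
        new_row = c.row - 1  # shift every remaining row up by one
        remaining.append(replace(c, row=new_row, column_header=(new_row == 0)))
    return title.text, remaining


cells = [
    Cell("Section A", row=0, col=0, col_span=2),
    Cell("Name", row=1, col=0),
    Cell("Qty", row=1, col=1),
    Cell("Unit", row=1, col=2),
    Cell("Bolt", row=2, col=0),
    Cell("4", row=2, col=1),
    Cell("pcs", row=2, col=2),
]
label, table_cells = split_leading_section_label(cells, num_cols=3)
```

In this sketch the label text is returned separately (where the PR emits a `TextItem`), and the row that used to sit under the label becomes row 0 with `column_header=True`.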
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `docling/backend/msexcel_backend.py` | Adds `_split_leading_section_label()` and integrates it into table detection to extract section labels as text and preserve column headers. |
| `tests/test_backend_msexcel.py` | Adds regression tests verifying the section label is not absorbed into the table header and that HTML uses `<th>` for real headers. |
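
The kind of HTML check the regression tests describe might look like the sketch below. `render_table_html` is a hypothetical toy renderer written for this illustration; it is not docling's HTML exporter, and the sample rows are made up.

```python
from html import escape


def render_table_html(rows, header_rows=1):
    """Render rows of strings to HTML, using <th> for the first `header_rows` rows."""
    parts = ["<table>"]
    for i, row in enumerate(rows):
        tag = "th" if i < header_rows else "td"
        cells = "".join(f"<{tag}>{escape(text)}</{tag}>" for text in row)
        parts.append(f"<tr>{cells}</tr>")
    parts.append("</table>")
    return "".join(parts)


# After the split, the section label is no longer part of the table,
# so the real header row is the first row handed to the renderer.
html = render_table_html([["Name", "Qty"], ["Bolt", "4"]])
```

A test in this style asserts that the real column names appear inside `<th>` elements and that the section label text does not appear in the table markup at all.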
    if (
        title_cell.col != 0
        or title_cell.row_span != 1
        or title_cell.col_span <= 1
        or title_cell.col_span >= table.num_cols
    ):
Summary
Closes #3687
Tests
- `uv run --python 3.12 --extra convert-core --extra format-office --extra format-pdf --extra format-latex --extra models-local pytest tests\test_backend_msexcel.py::test_table_with_title tests\test_backend_msexcel.py::test_edge_cases_merging tests\test_backend_msexcel.py::test_gap_tolerance_comparison tests\test_backend_msexcel.py::test_merged_section_label_above_table_preserves_column_headers tests\test_backend_msexcel.py::test_split_leading_section_label_helper -q`
- `uv run --python 3.12 --extra convert-core --extra format-office --extra format-pdf --extra format-latex --extra models-local pytest tests\test_backend_msexcel.py::test_table_with_title tests\test_backend_msexcel.py::test_edge_cases_merging tests\test_backend_msexcel.py::test_gap_tolerance_comparison tests\test_backend_msexcel.py::test_merged_section_label_above_table_preserves_column_headers tests\test_backend_msexcel.py::test_split_leading_section_label_helper -q --cov=docling.backend.msexcel_backend --cov-report=term-missing`
- `uv run --python 3.12 ruff check docling\backend\msexcel_backend.py tests\test_backend_msexcel.py`
- `uv run --python 3.12 ruff format --check docling\backend\msexcel_backend.py tests\test_backend_msexcel.py`
- `git diff --check -- docling/backend/msexcel_backend.py tests/test_backend_msexcel.py`