fix(xlsx): preserve headers after section labels#3727

Open
Success6666 wants to merge 1 commit into docling-project:main from Success6666:fix-xlsx-section-label-header

Success6666 commented Jun 30, 2026

Summary

  • split a leading merged section label out of an XLSX table when it is directly attached to the real header row
  • keep the real column header row as table headers after the split
  • add regression coverage for the generated DoclingDocument, exported HTML, and section-label split helper boundaries
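The intended effect of the split can be illustrated with a toy sketch. This is not the PR's code: `render` and its row heuristic are hypothetical, and rows are modelled as plain lists of strings rather than docling's table cells. It only shows the outcome described above, where the label leaves the table and the real header row is rendered as `<th>`.

```python
from html import escape


def render(rows: list[list[str]]) -> str:
    """Toy renderer (hypothetical): split a leading label row, then emit HTML."""
    label = None
    # Assumed heuristic for illustration only: the first row has text in its
    # first cell and nowhere else, and the row below it is fully populated.
    if len(rows) >= 2 and rows[0][0] and not any(rows[0][1:]) and all(rows[1]):
        label, rows = rows[0][0], rows[1:]

    parts = []
    if label is not None:
        # The label becomes standalone text instead of a table header.
        parts.append(f"<p>{escape(label)}</p>")
    parts.append("<table>")
    for i, row in enumerate(rows):
        # Only the first remaining row (the real header) uses <th>.
        tag = "th" if i == 0 else "td"
        cells = "".join(f"<{tag}>{escape(c)}</{tag}>" for c in row)
        parts.append(f"<tr>{cells}</tr>")
    parts.append("</table>")
    return "\n".join(parts)


sheet = [
    ["Q1 Results", "", ""],
    ["Region", "Units", "Revenue"],
    ["North", "10", "100"],
]
html_out = render(sheet)
print(html_out)
```

In this sketch the label is emitted as a paragraph and never appears inside a `<th>` element, which mirrors what the regression tests are described as checking.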

Closes #3687

Tests

  • uv run --python 3.12 --extra convert-core --extra format-office --extra format-pdf --extra format-latex --extra models-local pytest tests\test_backend_msexcel.py::test_table_with_title tests\test_backend_msexcel.py::test_edge_cases_merging tests\test_backend_msexcel.py::test_gap_tolerance_comparison tests\test_backend_msexcel.py::test_merged_section_label_above_table_preserves_column_headers tests\test_backend_msexcel.py::test_split_leading_section_label_helper -q
  • uv run --python 3.12 --extra convert-core --extra format-office --extra format-pdf --extra format-latex --extra models-local pytest tests\test_backend_msexcel.py::test_table_with_title tests\test_backend_msexcel.py::test_edge_cases_merging tests\test_backend_msexcel.py::test_gap_tolerance_comparison tests\test_backend_msexcel.py::test_merged_section_label_above_table_preserves_column_headers tests\test_backend_msexcel.py::test_split_leading_section_label_helper -q --cov=docling.backend.msexcel_backend --cov-report=term-missing
  • uv run --python 3.12 ruff check docling\backend\msexcel_backend.py tests\test_backend_msexcel.py
  • uv run --python 3.12 ruff format --check docling\backend\msexcel_backend.py tests\test_backend_msexcel.py
  • git diff --check -- docling/backend/msexcel_backend.py tests/test_backend_msexcel.py

github-actions Bot commented Jun 30, 2026

DCO Check Passed

Thanks @Success6666, all your commits are properly signed off. 🎉

mergify Bot commented Jun 30, 2026

Merge Protections

🟢 Merge protection satisfied — ready to merge.

🟢 Enforce conventional commit

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
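To check a PR title against this rule locally, a minimal sketch with Python's `re` module might look like the following. The pattern is copied from the rule above; treating the `~=` operator as an unanchored regex search is an assumption, and `is_conventional` is a hypothetical helper name.

```python
import re

# Pattern copied verbatim from the merge protection rule.
TITLE_RE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    # Assumption: `~=` behaves like re.search; the leading ^ anchors the start.
    return TITLE_RE.search(title) is not None


print(is_conventional("fix(xlsx): preserve headers after section labels"))
print(is_conventional("Preserve headers after section labels"))
```

The title of this PR, `fix(xlsx): preserve headers after section labels`, matches because it begins with an allowed type, an optional parenthesised scope, and a colon.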

Success6666 force-pushed the fix-xlsx-section-label-header branch from 12cb182 to f2a47be on June 30, 2026 19:27

codecov Bot commented Jul 1, 2026
Codecov Report

❌ Patch coverage is 90.00000% with 2 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| docling/backend/msexcel_backend.py | 90.00% | 2 Missing ⚠️ |

Signed-off-by: Success6666 <Success6666@users.noreply.github.com>
Copilot AI review requested due to automatic review settings July 1, 2026 10:16
Success6666 force-pushed the fix-xlsx-section-label-header branch from f2a47be to 68eb3e8 on July 1, 2026 10:16

Copilot AI left a comment

Pull request overview

This PR updates the Excel backend to split out a leading merged “section label” row that is directly attached to a real table header row, so the true column headers remain headers (both in the DoclingDocument table model and in exported HTML). It also adds regression tests to cover the reported issue scenario.

Changes:

  • Split a leading merged section label out of detected Excel tables and emit it as a separate TextItem.
  • Preserve the real column header row as column_header=True after the split (so it renders as <th> in HTML).
  • Add regression tests covering both the DoclingDocument structure and HTML output.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| docling/backend/msexcel_backend.py | Adds _split_leading_section_label() and integrates it into table detection to extract section labels as text and preserve column headers. |
| tests/test_backend_msexcel.py | Adds regression tests verifying the section label is not absorbed into the table header and that HTML uses <th> for real headers. |

Comment on lines +681 to +686
    if (
        title_cell.col != 0
        or title_cell.row_span != 1
        or title_cell.col_span <= 1
        or title_cell.col_span >= table.num_cols
    ):
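The condition under review can be read as a set of rejection tests on the candidate label cell. The sketch below restates it on a hypothetical `Cell` dataclass, which is a stand-in and not docling's cell type. It also assumes that the guarded branch means "do not split", since the body of the `if` is not shown in the excerpt.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Hypothetical stand-in for a table cell; not docling's actual type."""
    col: int
    row_span: int
    col_span: int


def looks_like_section_label(title_cell: Cell, num_cols: int) -> bool:
    """Negation of the guard quoted above (assumed to mean 'skip the split')."""
    rejected = (
        title_cell.col != 0                  # must start in the first column
        or title_cell.row_span != 1          # must be exactly one row high
        or title_cell.col_span <= 1          # must be a merged cell
        or title_cell.col_span >= num_cols   # must not span the full table width
    )
    return not rejected


print(looks_like_section_label(Cell(col=0, row_span=1, col_span=2), num_cols=4))
print(looks_like_section_label(Cell(col=0, row_span=1, col_span=4), num_cols=4))
```

Under this reading, a merged cell that spans the full table width is not treated as a section label, while a narrower merged cell anchored at column 0 is.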

Development

Successfully merging this pull request may close these issues.

[Bee] Excel: section label row adjacent to table is absorbed as table header

2 participants