Summary
Entities detected across different modalities (text NER + image OCR on the same page) currently don't fuse because Location::Text and Location::Image never overlap.
Proposed approach
Add a Location::overlaps_cross_modality() method that maps image and text locations to a common coordinate space (page number + normalized position) and tests for overlap there. When enabled via GroupingCriteria::Widening, entities with the same (kind, value) on the same page would be fusion candidates regardless of modality.
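A minimal sketch of what the check could look like, not the actual implementation. It assumes each location can report a page number, a bounding box in its own units, and the size of the surface it was measured on. It also assumes the OCR image covers the full page. The Rect type and the fields on both variants are hypothetical and stand in for whatever the shared page-coordinate model ends up providing.

```rust
/// Hypothetical axis-aligned box; units depend on the modality.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Rect {
    pub x0: f64,
    pub y0: f64,
    pub x1: f64,
    pub y1: f64,
}

/// Simplified stand-in for the real Location enum (fields are assumptions).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Location {
    /// Text span with a layout box in points and the page size in points.
    Text { page: u32, bbox_pt: Rect, page_size_pt: (f64, f64) },
    /// OCR hit with a pixel box and the size of the full-page image.
    Image { page: u32, bbox_px: Rect, image_size_px: (f64, f64) },
}

impl Location {
    /// Map either variant to (page, box scaled to the 0..1 range).
    fn normalized(&self) -> (u32, Rect) {
        let (page, r, (w, h)) = match *self {
            Location::Text { page, bbox_pt, page_size_pt } => (page, bbox_pt, page_size_pt),
            Location::Image { page, bbox_px, image_size_px } => (page, bbox_px, image_size_px),
        };
        let scaled = Rect { x0: r.x0 / w, y0: r.y0 / h, x1: r.x1 / w, y1: r.y1 / h };
        (page, scaled)
    }

    /// True when both locations are on the same page and their
    /// normalized boxes intersect, regardless of modality.
    pub fn overlaps_cross_modality(&self, other: &Location) -> bool {
        let (page_a, a) = self.normalized();
        let (page_b, b) = other.normalized();
        page_a == page_b && a.x0 < b.x1 && b.x0 < a.x1 && a.y0 < b.y1 && b.y0 < a.y1
    }
}
```

Normalizing by page or image size keeps the comparison independent of scan resolution. A real version would likely need a tolerance, since OCR boxes and text layout boxes rarely line up exactly.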
Blocked by
- Requires a shared page-coordinate model across text and image locations
- Needs test data with multi-modal detections on the same document
Related
GroupingCriteria::Widening already ignores location overlap. This issue is specifically about making the overlap check work across modalities rather than bypassing it entirely.