fix(anonymize): stop honorific person spans at sentence boundaries#176
Conversation
A person span seeded by a full-word honorific over-extended across a
sentence-ending period: "...Assistant to the President. The Employee..."
was tagged as one person ("President. The Employee"). The HONORIFIC_PERSON
regex allowed an optional "." after any honorific, which is correct for
abbreviations ("Mr." Smith) but wrong for full words ("President").
Partition HONORIFICS into an explicit, maintained abbreviation tier
(Mr, Mrs, Sr., Dott., ...) and the full-word remainder (President, Lord,
Judge, ...). Only the abbreviation tier keeps the optional dot; full words
require whitespace before the name, so a trailing period is not consumed
and the span stops at the sentence boundary.
The partition is explicit (not derived from "ends in a dot", which would
mis-class "Mr" and "Lord"). All abbreviation and full-word + name forms
keep matching (Mr. Smith, Sr. García, Lord Davidson, President Washington,
Messrs. Smith). DFA-safe: two alternations, sorted longest-first, no
lookaround.
Not handled here: a dotted abbreviation at a true sentence end followed by
a capitalized sentence-opener ("Mr. Then the parties agreed") — locally
undecidable by shape, needs a post-match boundary check; left as a follow-up.
There was a problem hiding this comment.
Code Review
This pull request introduces a mechanism to prevent full-word honorifics (such as "President" or "Lord") from consuming a trailing period into a person's name span, thereby respecting sentence boundaries. It achieves this by defining a set of abbreviation honorifics and splitting the regex pattern into abbreviation-based honorifics (which allow an optional trailing dot) and full-word honorifics (which do not). The review feedback suggests a robust improvement to strip trailing dots from the abbreviation definitions when building the regex pattern, ensuring that both dotted and dotless forms (e.g., "Sr." and "Sr") are matched consistently, along with adding corresponding test cases.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| const buildHonorificAlt = (entries: readonly string[]): string => | ||
| [...entries] | ||
| .toSorted((a, b) => b.length - a.length) | ||
| .map((h) => { | ||
| const escaped = escapeRegex(h); | ||
| return HONORIFIC_BOUNDARY.has(h) ? `\\b${escaped}` : escaped; | ||
| }) | ||
| .join("|"); | ||
|
|
||
| // Abbreviation honorifics ("Mr", "Sr.") may be followed by an | ||
| // abbreviation dot; full-word honorifics ("President", "Lord") | ||
| // may not, so a sentence-ending period after them is not consumed | ||
| // and the person span stops at the sentence boundary. | ||
| const HONORIFIC_ABBREV_ALT = buildHonorificAlt( | ||
| HONORIFICS.filter((h) => HONORIFIC_ABBREVIATION.has(h)), | ||
| ); | ||
| const HONORIFIC_FULLWORD_ALT = buildHonorificAlt( | ||
| HONORIFICS.filter((h) => !HONORIFIC_ABBREVIATION.has(h)), | ||
| ); |
There was a problem hiding this comment.
Currently, abbreviation honorifics that are defined with a trailing dot in HONORIFICS (such as Sr., Sra., Avv., Dott., Sig.) will only match when written with a dot in the input text. However, in real-world usage, these abbreviations are very frequently written without a dot (e.g., Sr Alfonso García).
By stripping the trailing dot from abbreviation honorifics when building HONORIFIC_ABBREV_ALT and allowing the regex pattern to match them with an optional dot (\.?), we can robustly support both dotted and dotless forms of all abbreviation honorifics. We also update buildHonorificAlt to correctly apply word boundaries (\b) to the dotless forms if their dotted counterparts are in HONORIFIC_BOUNDARY.
const buildHonorificAlt = (entries: readonly string[]): string =>
[...entries]
.toSorted((a, b) => b.length - a.length)
.map((h) => {
const escaped = escapeRegex(h);
const hasBoundary = HONORIFIC_BOUNDARY.has(h) || HONORIFIC_BOUNDARY.has(h + ".");
return hasBoundary ? `\b${escaped}` : escaped;
})
.join("|");
// Abbreviation honorifics ("Mr", "Sr.") may be followed by an
// abbreviation dot; full-word honorifics ("President", "Lord")
// may not, so a sentence-ending period after them is not consumed
// and the person span stops at the sentence boundary.
// Trailing dots are stripped from abbreviations to allow matching
// both dotted and dotless forms (e.g., "Sr." and "Sr") consistently.
const HONORIFIC_ABBREV_ALT = buildHonorificAlt(
[...new Set(
HONORIFICS
.filter((h) => HONORIFIC_ABBREVIATION.has(h))
.map((h) => h.endsWith(".") ? h.slice(0, -1) : h)
)]
);
const HONORIFIC_FULLWORD_ALT = buildHonorificAlt(
HONORIFICS.filter((h) => !HONORIFIC_ABBREVIATION.has(h)),
);| test("an abbreviation honorific keeps its dot before the name", async () => { | ||
| expect(persons(await run("We met Mr. John Smith yesterday."))).toContain( | ||
| "Mr. John Smith", | ||
| ); | ||
| }); | ||
|
|
||
| test("a Spanish abbreviation honorific (Sr.) keeps its dot", async () => { | ||
| expect(persons(await run("Sr. Alfonso García signed the deed."))).toContain( | ||
| "Sr. Alfonso García", | ||
| ); | ||
| }); |
There was a problem hiding this comment.
Add test cases to verify that abbreviation honorifics match correctly both with and without a trailing dot (e.g., Mr./Mr and Sr./Sr).
test("an abbreviation honorific keeps its dot before the name", async () => {
expect(persons(await run("We met Mr. John Smith yesterday."))).toContain(
"Mr. John Smith",
);
});
test("an abbreviation honorific without a dot still matches", async () => {
expect(persons(await run("We met Mr John Smith yesterday."))).toContain(
"Mr John Smith",
);
});
test("a Spanish abbreviation honorific (Sr.) keeps its dot", async () => {
expect(persons(await run("Sr. Alfonso García signed the deed."))).toContain(
"Sr. Alfonso García",
);
});
test("a Spanish abbreviation honorific without a dot still matches", async () => {
expect(persons(await run("Sr Alfonso García signed the deed."))).toContain(
"Sr Alfonso García",
);
});|
Thanks @gemini-code-assist. Dotted-only abbreviations (
I've filed it mentally as a separate recall change rather than expanding this boundary-focused PR. Tests for both forms: added in c2b6ccb — CC on behalf of @jan-kubica |
|
Thanks for the clarification, @jan-kubica. That makes perfect sense. Decoupling the boundary logic from the dot-presence in the honorific list is a robust approach, and I agree that handling dotted abbreviations like |
What
A person span seeded by a full-word honorific over-extended across a sentence-ending period:
was tagged as a single
person(President. The Employee).HONORIFIC_PERSONallowed an optional.after any honorific — correct for abbreviations (Mr.Smith), wrong for full words (President).How
Partition
HONORIFICSinto an explicit, maintained abbreviation tier (Mr,Mrs,Sr.,Dott.,Mlle,Messrs, …) and the full-word remainder (President,Lord,Judge,Señor, …). TheHONORIFIC_PERSONpattern becomes:Only the abbreviation tier keeps the optional dot; full words require whitespace before the name, so a sentence-ending period is not consumed and the span stops.
The partition is explicit (
HONORIFIC_ABBREVIATIONinconfig/titles.ts), not derived from "ends in a dot" — that would mis-classMr(abbreviation, no dot in the list) andLord(full word). DFA-safe: two alternations, sorted longest-first,\bboundaries preserved, no lookaround.Tests
src/__test__/honorific-sentence-boundary.test.ts: the boundary fix plus the must-keep forms (Mr. John Smith,Sr. Alfonso García,President George Washington,Lord Peter Davidson). Verified the 27 existing person/name/title/snapshot suites (439 tests) still pass.Follow-up (not in this PR)
A dotted abbreviation at a genuine sentence end followed by a capitalized sentence-opener (
Mr. Then the parties agreed) is locally undecidable by shape and would over-extend; it needs a post-match boundary check (reusingisSentenceStart) and is left for a separate PR.CC on behalf of @jan-kubica