fix(anonymize): stop honorific person spans at sentence boundaries by jan-kubica · Pull Request #176 · stella/anonymize

jan-kubica · 2026-06-04T11:16:29Z

What

A person span seeded by a full-word honorific over-extended across a sentence-ending period:

…appoints Employee as Assistant to the President. The Employee's employment…

was tagged as a single person (President. The Employee). HONORIFIC_PERSON allowed an optional . after any honorific — correct for abbreviations (Mr. Smith), wrong for full words (President).

How

Partition HONORIFICS into an explicit, maintained abbreviation tier (Mr, Mrs, Sr., Dott., Mlle, Messrs, …) and the full-word remainder (President, Lord, Judge, Señor, …). The HONORIFIC_PERSON pattern becomes:

(?:(?:ABBREV_ALT)\.?|(?:FULLWORD_ALT)) SP+ NAME_WORD …

Only the abbreviation tier keeps the optional dot; full words require whitespace before the name, so a sentence-ending period is not consumed and the span stops.

The partition is explicit (HONORIFIC_ABBREVIATION in config/titles.ts), not derived from "ends in a dot" — that would mis-class Mr (abbreviation, no dot in the list) and Lord (full word). DFA-safe: two alternations, sorted longest-first, \b boundaries preserved, no lookaround.

Tests

src/__test__/honorific-sentence-boundary.test.ts: the boundary fix plus the must-keep forms (Mr. John Smith, Sr. Alfonso García, President George Washington, Lord Peter Davidson). Verified the 27 existing person/name/title/snapshot suites (439 tests) still pass.

Follow-up (not in this PR)

A dotted abbreviation at a genuine sentence end followed by a capitalized sentence-opener (Mr. Then the parties agreed) is locally undecidable by shape and would over-extend; it needs a post-match boundary check (reusing isSentenceStart) and is left for a separate PR.

CC on behalf of @jan-kubica

A person span seeded by a full-word honorific over-extended across a sentence-ending period: "...Assistant to the President. The Employee..." was tagged as one person ("President. The Employee"). The HONORIFIC_PERSON regex allowed an optional "." after any honorific, which is correct for abbreviations ("Mr." Smith) but wrong for full words ("President"). Partition HONORIFICS into an explicit, maintained abbreviation tier (Mr, Mrs, Sr., Dott., ...) and the full-word remainder (President, Lord, Judge, ...). Only the abbreviation tier keeps the optional dot; full words require whitespace before the name, so a trailing period is not consumed and the span stops at the sentence boundary. The partition is explicit (not derived from "ends in a dot", which would mis-class "Mr" and "Lord"). All abbreviation and full-word + name forms keep matching (Mr. Smith, Sr. García, Lord Davidson, President Washington, Messrs. Smith). DFA-safe: two alternations, sorted longest-first, no lookaround. Not handled here: a dotted abbreviation at a true sentence end followed by a capitalized sentence-opener ("Mr. Then the parties agreed") — locally undecidable by shape, needs a post-match boundary check; left as a follow-up.

gemini-code-assist

Code Review

This pull request introduces a mechanism to prevent full-word honorifics (such as "President" or "Lord") from consuming a trailing period into a person's name span, thereby respecting sentence boundaries. It achieves this by defining a set of abbreviation honorifics and splitting the regex pattern into abbreviation-based honorifics (which allow an optional trailing dot) and full-word honorifics (which do not). The review feedback suggests a robust improvement to strip trailing dots from the abbreviation definitions when building the regex pattern, ensuring that both dotted and dotless forms (e.g., "Sr." and "Sr") are matched consistently, along with adding corresponding test cases.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-04T11:18:18Z

+const buildHonorificAlt = (entries: readonly string[]): string =>
+  [...entries]
+    .toSorted((a, b) => b.length - a.length)
+    .map((h) => {
+      const escaped = escapeRegex(h);
+      return HONORIFIC_BOUNDARY.has(h) ? `\\b${escaped}` : escaped;
+    })
+    .join("|");
+
+// Abbreviation honorifics ("Mr", "Sr.") may be followed by an
+// abbreviation dot; full-word honorifics ("President", "Lord")
+// may not, so a sentence-ending period after them is not consumed
+// and the person span stops at the sentence boundary.
+const HONORIFIC_ABBREV_ALT = buildHonorificAlt(
+  HONORIFICS.filter((h) => HONORIFIC_ABBREVIATION.has(h)),
+);
+const HONORIFIC_FULLWORD_ALT = buildHonorificAlt(
+  HONORIFICS.filter((h) => !HONORIFIC_ABBREVIATION.has(h)),
+);


Currently, abbreviation honorifics that are defined with a trailing dot in HONORIFICS (such as Sr., Sra., Avv., Dott., Sig.) will only match when written with a dot in the input text. However, in real-world usage, these abbreviations are very frequently written without a dot (e.g., Sr Alfonso García).

By stripping the trailing dot from abbreviation honorifics when building HONORIFIC_ABBREV_ALT and allowing the regex pattern to match them with an optional dot (\.?), we can robustly support both dotted and dotless forms of all abbreviation honorifics. We also update buildHonorificAlt to correctly apply word boundaries (\b) to the dotless forms if their dotted counterparts are in HONORIFIC_BOUNDARY.

const buildHonorificAlt = (entries: readonly string[]): string => [...entries] .toSorted((a, b) => b.length - a.length) .map((h) => { const escaped = escapeRegex(h); const hasBoundary = HONORIFIC_BOUNDARY.has(h) || HONORIFIC_BOUNDARY.has(h + "."); return hasBoundary ? `\b${escaped}` : escaped; }) .join("|"); // Abbreviation honorifics ("Mr", "Sr.") may be followed by an // abbreviation dot; full-word honorifics ("President", "Lord") // may not, so a sentence-ending period after them is not consumed // and the person span stops at the sentence boundary. // Trailing dots are stripped from abbreviations to allow matching // both dotted and dotless forms (e.g., "Sr." and "Sr") consistently. const HONORIFIC_ABBREV_ALT = buildHonorificAlt( [...new Set( HONORIFICS .filter((h) => HONORIFIC_ABBREVIATION.has(h)) .map((h) => h.endsWith(".") ? h.slice(0, -1) : h) )] ); const HONORIFIC_FULLWORD_ALT = buildHonorificAlt( HONORIFICS.filter((h) => !HONORIFIC_ABBREVIATION.has(h)), );

gemini-code-assist · 2026-06-04T11:18:18Z

+  test("an abbreviation honorific keeps its dot before the name", async () => {
+    expect(persons(await run("We met Mr. John Smith yesterday."))).toContain(
+      "Mr. John Smith",
+    );
+  });
+
+  test("a Spanish abbreviation honorific (Sr.) keeps its dot", async () => {
+    expect(persons(await run("Sr. Alfonso García signed the deed."))).toContain(
+      "Sr. Alfonso García",
+    );
+  });


Add test cases to verify that abbreviation honorifics match correctly both with and without a trailing dot (e.g., Mr./Mr and Sr./Sr).

test("an abbreviation honorific keeps its dot before the name", async () => { expect(persons(await run("We met Mr. John Smith yesterday."))).toContain( "Mr. John Smith", ); }); test("an abbreviation honorific without a dot still matches", async () => { expect(persons(await run("We met Mr John Smith yesterday."))).toContain( "Mr John Smith", ); }); test("a Spanish abbreviation honorific (Sr.) keeps its dot", async () => { expect(persons(await run("Sr. Alfonso García signed the deed."))).toContain( "Sr. Alfonso García", ); }); test("a Spanish abbreviation honorific without a dot still matches", async () => { expect(persons(await run("Sr Alfonso García signed the deed."))).toContain( "Sr Alfonso García", ); });

jan-kubica · 2026-06-04T11:22:15Z

Thanks @gemini-code-assist.

Dotted-only abbreviations (Sr., Avv., Dott., Sig.): good observation, but this is pre-existing and orthogonal to this PR — those entries have carried the literal dot in HONORIFICS all along; this change only partitions the separator (\.? vs none), it doesn't touch which entries include a dot. Two reasons I'm not folding a fix in here:

It isn't uniformly safe to normalize them to dotless + \.?: the single-letter M. (French Monsieur) can't drop its dot without over-matching any M <Name> (which is exactly why it's in HONORIFIC_BOUNDARY). So it needs per-entry care, not a blanket change.
Even today, when the honorific is skipped (Sr Alfonso García), the name is still detected by the corpus detector (Alfonso García) — so no PII leaks; it's a span-completeness nuance, not a miss.

I've filed it mentally as a separate recall change rather than expanding this boundary-focused PR.

Tests for both forms: added in c2b6ccb — Mr John Smith (dotless British style) is now covered alongside Mr. John Smith. Both work because the Mr entry is dotless and the \.? makes the dot optional; the dotted-only entries are the separate follow-up above.

CC on behalf of @jan-kubica

gemini-code-assist · 2026-06-04T11:22:40Z

Thanks for the clarification, @jan-kubica. That makes perfect sense. Decoupling the boundary logic from the dot-presence in the honorific list is a robust approach, and I agree that handling dotted abbreviations like M. requires more granular, per-entry logic to avoid false positives. Keeping the current scope focused on the boundary fix is the right call. The added tests in c2b6ccb correctly address the ambiguity between British (dotless) and standard (dotted) abbreviations. I have no further concerns.

gemini-code-assist Bot reviewed Jun 4, 2026

View reviewed changes

test(anonymize): cover dotless abbreviation honorific (Mr John Smith)

c2b6ccb

jan-kubica merged commit f5f9f03 into main Jun 4, 2026
2 checks passed

jan-kubica deleted the feat/person-sentence-boundary branch June 4, 2026 11:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(anonymize): stop honorific person spans at sentence boundaries#176

fix(anonymize): stop honorific person spans at sentence boundaries#176
jan-kubica merged 2 commits into
mainfrom
feat/person-sentence-boundary

jan-kubica commented Jun 4, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 4, 2026

Uh oh!

gemini-code-assist Bot Jun 4, 2026

Uh oh!

jan-kubica commented Jun 4, 2026

Uh oh!

gemini-code-assist Bot commented Jun 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jan-kubica commented Jun 4, 2026

What

How

Tests

Follow-up (not in this PR)

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

jan-kubica commented Jun 4, 2026

Uh oh!

gemini-code-assist Bot commented Jun 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant