Skip to content

fix(anonymize): stop honorific person spans at sentence boundaries#176

Merged
jan-kubica merged 2 commits into
mainfrom
feat/person-sentence-boundary
Jun 4, 2026
Merged

fix(anonymize): stop honorific person spans at sentence boundaries#176
jan-kubica merged 2 commits into
mainfrom
feat/person-sentence-boundary

Conversation

@jan-kubica
Copy link
Copy Markdown
Contributor

What

A person span seeded by a full-word honorific over-extended across a sentence-ending period:

…appoints Employee as Assistant to the President. The Employee's employment…

was tagged as a single person (President. The Employee). HONORIFIC_PERSON allowed an optional . after any honorific — correct for abbreviations (Mr. Smith), wrong for full words (President).

How

Partition HONORIFICS into an explicit, maintained abbreviation tier (Mr, Mrs, Sr., Dott., Mlle, Messrs, …) and the full-word remainder (President, Lord, Judge, Señor, …). The HONORIFIC_PERSON pattern becomes:

(?:(?:ABBREV_ALT)\.?|(?:FULLWORD_ALT)) SP+ NAME_WORD …

Only the abbreviation tier keeps the optional dot; full words require whitespace before the name, so a sentence-ending period is not consumed and the span stops.

The partition is explicit (HONORIFIC_ABBREVIATION in config/titles.ts), not derived from "ends in a dot" — that would mis-class Mr (abbreviation, no dot in the list) and Lord (full word). DFA-safe: two alternations, sorted longest-first, \b boundaries preserved, no lookaround.

Tests

src/__test__/honorific-sentence-boundary.test.ts: the boundary fix plus the must-keep forms (Mr. John Smith, Sr. Alfonso García, President George Washington, Lord Peter Davidson). Verified the 27 existing person/name/title/snapshot suites (439 tests) still pass.

Follow-up (not in this PR)

A dotted abbreviation at a genuine sentence end followed by a capitalized sentence-opener (Mr. Then the parties agreed) is locally undecidable by shape and would over-extend; it needs a post-match boundary check (reusing isSentenceStart) and is left for a separate PR.

CC on behalf of @jan-kubica

A person span seeded by a full-word honorific over-extended across a
sentence-ending period: "...Assistant to the President. The Employee..."
was tagged as one person ("President. The Employee"). The HONORIFIC_PERSON
regex allowed an optional "." after any honorific, which is correct for
abbreviations ("Mr." Smith) but wrong for full words ("President").

Partition HONORIFICS into an explicit, maintained abbreviation tier
(Mr, Mrs, Sr., Dott., ...) and the full-word remainder (President, Lord,
Judge, ...). Only the abbreviation tier keeps the optional dot; full words
require whitespace before the name, so a trailing period is not consumed
and the span stops at the sentence boundary.

The partition is explicit (not derived from "ends in a dot", which would
mis-class "Mr" and "Lord"). All abbreviation and full-word + name forms
keep matching (Mr. Smith, Sr. García, Lord Davidson, President Washington,
Messrs. Smith). DFA-safe: two alternations, sorted longest-first, no
lookaround.

Not handled here: a dotted abbreviation at a true sentence end followed by
a capitalized sentence-opener ("Mr. Then the parties agreed") — locally
undecidable by shape, needs a post-match boundary check; left as a follow-up.
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a mechanism to prevent full-word honorifics (such as "President" or "Lord") from consuming a trailing period into a person's name span, thereby respecting sentence boundaries. It achieves this by defining a set of abbreviation honorifics and splitting the regex pattern into abbreviation-based honorifics (which allow an optional trailing dot) and full-word honorifics (which do not). The review feedback suggests a robust improvement to strip trailing dots from the abbreviation definitions when building the regex pattern, ensuring that both dotted and dotless forms (e.g., "Sr." and "Sr") are matched consistently, along with adding corresponding test cases.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +114 to +132
const buildHonorificAlt = (entries: readonly string[]): string =>
[...entries]
.toSorted((a, b) => b.length - a.length)
.map((h) => {
const escaped = escapeRegex(h);
return HONORIFIC_BOUNDARY.has(h) ? `\\b${escaped}` : escaped;
})
.join("|");

// Abbreviation honorifics ("Mr", "Sr.") may be followed by an
// abbreviation dot; full-word honorifics ("President", "Lord")
// may not, so a sentence-ending period after them is not consumed
// and the person span stops at the sentence boundary.
const HONORIFIC_ABBREV_ALT = buildHonorificAlt(
HONORIFICS.filter((h) => HONORIFIC_ABBREVIATION.has(h)),
);
const HONORIFIC_FULLWORD_ALT = buildHonorificAlt(
HONORIFICS.filter((h) => !HONORIFIC_ABBREVIATION.has(h)),
);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Currently, abbreviation honorifics that are defined with a trailing dot in HONORIFICS (such as Sr., Sra., Avv., Dott., Sig.) will only match when written with a dot in the input text. However, in real-world usage, these abbreviations are very frequently written without a dot (e.g., Sr Alfonso García).

By stripping the trailing dot from abbreviation honorifics when building HONORIFIC_ABBREV_ALT and allowing the regex pattern to match them with an optional dot (\.?), we can robustly support both dotted and dotless forms of all abbreviation honorifics. We also update buildHonorificAlt to correctly apply word boundaries (\b) to the dotless forms if their dotted counterparts are in HONORIFIC_BOUNDARY.

const buildHonorificAlt = (entries: readonly string[]): string =>
  [...entries]
    .toSorted((a, b) => b.length - a.length)
    .map((h) => {
      const escaped = escapeRegex(h);
      const hasBoundary = HONORIFIC_BOUNDARY.has(h) || HONORIFIC_BOUNDARY.has(h + ".");
      return hasBoundary ? `\b${escaped}` : escaped;
    })
    .join("|");

// Abbreviation honorifics ("Mr", "Sr.") may be followed by an
// abbreviation dot; full-word honorifics ("President", "Lord")
// may not, so a sentence-ending period after them is not consumed
// and the person span stops at the sentence boundary.
// Trailing dots are stripped from abbreviations to allow matching
// both dotted and dotless forms (e.g., "Sr." and "Sr") consistently.
const HONORIFIC_ABBREV_ALT = buildHonorificAlt(
  [...new Set(
    HONORIFICS
      .filter((h) => HONORIFIC_ABBREVIATION.has(h))
      .map((h) => h.endsWith(".") ? h.slice(0, -1) : h)
  )]
);
const HONORIFIC_FULLWORD_ALT = buildHonorificAlt(
  HONORIFICS.filter((h) => !HONORIFIC_ABBREVIATION.has(h)),
);

Comment on lines +60 to +70
test("an abbreviation honorific keeps its dot before the name", async () => {
expect(persons(await run("We met Mr. John Smith yesterday."))).toContain(
"Mr. John Smith",
);
});

test("a Spanish abbreviation honorific (Sr.) keeps its dot", async () => {
expect(persons(await run("Sr. Alfonso García signed the deed."))).toContain(
"Sr. Alfonso García",
);
});
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Add test cases to verify that abbreviation honorifics match correctly both with and without a trailing dot (e.g., Mr./Mr and Sr./Sr).

  test("an abbreviation honorific keeps its dot before the name", async () => {
    expect(persons(await run("We met Mr. John Smith yesterday."))).toContain(
      "Mr. John Smith",
    );
  });

  test("an abbreviation honorific without a dot still matches", async () => {
    expect(persons(await run("We met Mr John Smith yesterday."))).toContain(
      "Mr John Smith",
    );
  });

  test("a Spanish abbreviation honorific (Sr.) keeps its dot", async () => {
    expect(persons(await run("Sr. Alfonso García signed the deed."))).toContain(
      "Sr. Alfonso García",
    );
  });

  test("a Spanish abbreviation honorific without a dot still matches", async () => {
    expect(persons(await run("Sr Alfonso García signed the deed."))).toContain(
      "Sr Alfonso García",
    );
  });

@jan-kubica
Copy link
Copy Markdown
Contributor Author

Thanks @gemini-code-assist.

Dotted-only abbreviations (Sr., Avv., Dott., Sig.): good observation, but this is pre-existing and orthogonal to this PR — those entries have carried the literal dot in HONORIFICS all along; this change only partitions the separator (\.? vs none), it doesn't touch which entries include a dot. Two reasons I'm not folding a fix in here:

  • It isn't uniformly safe to normalize them to dotless + \.?: the single-letter M. (French Monsieur) can't drop its dot without over-matching any M <Name> (which is exactly why it's in HONORIFIC_BOUNDARY). So it needs per-entry care, not a blanket change.
  • Even today, when the honorific is skipped (Sr Alfonso García), the name is still detected by the corpus detector (Alfonso García) — so no PII leaks; it's a span-completeness nuance, not a miss.

I've filed it mentally as a separate recall change rather than expanding this boundary-focused PR.

Tests for both forms: added in c2b6ccbMr John Smith (dotless British style) is now covered alongside Mr. John Smith. Both work because the Mr entry is dotless and the \.? makes the dot optional; the dotted-only entries are the separate follow-up above.

CC on behalf of @jan-kubica

@gemini-code-assist
Copy link
Copy Markdown

Thanks for the clarification, @jan-kubica. That makes perfect sense. Decoupling the boundary logic from the dot-presence in the honorific list is a robust approach, and I agree that handling dotted abbreviations like M. requires more granular, per-entry logic to avoid false positives. Keeping the current scope focused on the boundary fix is the right call. The added tests in c2b6ccb correctly address the ambiguity between British (dotless) and standard (dotted) abbreviations. I have no further concerns.

@jan-kubica jan-kubica merged commit f5f9f03 into main Jun 4, 2026
2 checks passed
@jan-kubica jan-kubica deleted the feat/person-sentence-boundary branch June 4, 2026 11:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant