Skip to content

feat(anonymize): detect labeled fax / telecopy / phone numbers#175

Merged
jan-kubica merged 3 commits into
mainfrom
feat/intl-and-labeled-phone
Jun 4, 2026
Merged

feat(anonymize): detect labeled fax / telecopy / phone numbers#175
jan-kubica merged 3 commits into
mainfrom
feat/intl-and-labeled-phone

Conversation

@jan-kubica
Copy link
Copy Markdown
Contributor

What

Contract notice blocks label phone/fax numbers with cues — Telecopy:, Fax, Facsimile:, Contact phone number: — and write them in forms the existing detectors miss:

  • US numbers without parentheses: Fax 702-657-1411
  • international +CC grouping: +86-138-2319-3216

The current regex tier covers (NNN) NNN-NNNN, tel:-prefixed, and +-anchored fixed groupings, so these labeled/international numbers survived into anonymized output.

How

A language-agnostic phone trigger group in triggers.global.json (mirrored to packages/data/config/), modeled on the existing fr-phone group:

  • cues: fax, telecopy, telecopier, facsimile, telephone, phone number, contact phone number
  • match-pattern strategy with a flexible multi-group number pattern
  • a (?:\d[^\d]*){6,} (≥6 digit) validation

No regex/DFA changes. Generality lives in the cue + shape:

  • the cue is mandatory, so number-shaped ranges in prose (2018-2019, sections 100-200) are not matched;
  • a cue with no following number does not emit;
  • the value extractor strips the cue, so the captured entity is just the number.

Tests

src/__test__/labeled-phone.test.ts — Telecopy/Fax/Facsimile/+CC detection plus the two mandatory-cue invariants. Also verified the 16 existing phone/trigger/snapshot test files (285 tests) still pass.

Follow-up (not in this PR)

A No./# filler between the cue and the number (Facsimile No.: ...) is currently a clean miss: the shared trigger value-extractor strips leading whitespace/colons but not a number-word token. Handling it cleanly needs a small extractValue enhancement (strip an optional No./Number/# filler), which would also benefit Account No.: / Routing No.:. Left for a separate PR to keep this one data-only.

CC on behalf of @jan-kubica

Contract notice blocks label phone and fax numbers with cues ("Telecopy:",
"Fax", "Facsimile:", "Contact phone number:"). US numbers are often written
without parentheses ("Fax 702-657-1411") and international numbers use "+CC"
grouping ("+86-138-2319-3216"). None of these were detected (the existing
regex tier covers `(NNN) NNN-NNNN`, `tel:`-prefixed, and `+`-anchored fixed
groupings), so they survived into anonymized output.

Add a language-agnostic phone trigger group (triggers.global.json, mirrored
to packages/data/config) that pairs a telephony cue with a flexible
multi-group number pattern and a ">=6 digits" validation. Generality lives in
the cue + shape: the cue is mandatory, so number-shaped ranges in prose
("2018-2019", "sections 100-200") are not matched, and a cue with no following
number does not emit.

Not handled here: a "No."/"#" filler between the cue and the number
("Facsimile No.: ..."). The shared trigger value-extractor strips leading
whitespace/colons but not a number-word token, so that form needs a separate
extractValue enhancement and is left for a follow-up.
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new trigger configuration, intl-labeled-phone, to detect international labeled phone, fax, and telecopy numbers, along with comprehensive unit tests to verify its behavior. However, the regular expression pattern used for matching these phone numbers contains nested quantifiers and overlapping character classes, which poses a risk of catastrophic backtracking (ReDoS) when evaluated against complex or long strings. It is recommended to simplify the pattern to ensure linear-time matching.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

"label": "phone number",
"strategy": {
"type": "match-pattern",
"pattern": "(?:\\+?\\d{1,3}[\\s.\\-]?)?(?:\\(?\\d{1,4}\\)?[\\s.\\-]?){2,5}\\d{2,4}"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The regular expression pattern (?:\\+?\\d{1,3}[\\s.\\-]?)?(?:\\(?\\d{1,4}\\)?[\\s.\\-]?){2,5}\\d{2,4} contains nested quantifiers and overlapping character classes (such as \\d and [\\s.\\-] which are adjacent, and multiple groups of optional separators). When evaluated against long, partially matching strings, this can lead to catastrophic backtracking (ReDoS). Since this pattern is evaluated using a standard NFA-based regex engine in JavaScript, consider simplifying the pattern or reducing the complexity of the optional groups to ensure linear-time matching.

"label": "phone number",
"strategy": {
"type": "match-pattern",
"pattern": "(?:\\+?\\d{1,3}[\\s.\\-]?)?(?:\\(?\\d{1,4}\\)?[\\s.\\-]?){2,5}\\d{2,4}"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The regular expression pattern (?:\\+?\\d{1,3}[\\s.\\-]?)?(?:\\(?\\d{1,4}\\)?[\\s.\\-]?){2,5}\\d{2,4} contains nested quantifiers and overlapping character classes (such as \\d and [\\s.\\-] which are adjacent, and multiple groups of optional separators). When evaluated against long, partially matching strings, this can lead to catastrophic backtracking (ReDoS). Since this pattern is evaluated using a standard NFA-based regex engine in JavaScript, consider simplifying the pattern or reducing the complexity of the optional groups to ensure linear-time matching.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5e1891e40b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

{
"id": "intl-labeled-phone",
"triggers": [
"fax",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Require punctuation for ambiguous fax cues

With fax added as a bare trigger, prose that uses it as a verb followed by a numeric range is now emitted as a phone number; for example, Please fax 2018-2019 tax returns or fax 100-200 pages satisfies the added pattern and the ≥6 digit validation. This can redact ordinary dates or section/page ranges whenever they immediately follow the verb, so the ambiguous cue should require label punctuation/spacing that distinguishes a field label from prose, or the phone shape should be tightened for bare cues.

Useful? React with 👍 / 👎.

Comment on lines +57 to +58
"type": "match-pattern",
"pattern": "(?:\\+?\\d{1,3}[\\s.\\-]?)?(?:\\(?\\d{1,4}\\)?[\\s.\\-]?){2,5}\\d{2,4}"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Stop match-pattern phone labels at line breaks

For this new match-pattern phone group, a label with no inline value can still steal the first number on the next line because extractValue strips leading \s before the match-pattern newline guard runs. In a flattened notice like Fax\n2018-2019 annual report, the new group emits 2018-2019 as a phone number even though the cue line is blank; the group needs a same-line guard or a trigger shape that does not allow newline-crossing extraction.

Useful? React with 👍 / 👎.

Address review: require a separator between digit groups so the pattern has
no overlapping optional-separator ambiguity (provably linear, no backtracking
concern). Same matches as before for all labeled forms; also accepts the
parenthesised "(NNN) NNN-NNNN" shape.
@jan-kubica
Copy link
Copy Markdown
Contributor Author

Thanks @gemini-code-assist. I checked this empirically: the original pattern's quantifiers are both bounded ({2,5} outer, {1,4} inner), so it can't backtrack catastrophically — it ran in ~0.14ms against a 2,000-char pathological input, and matching only ever runs anchored (^) on the post-cue, newline-bounded value window.

That said, the optional-separator ambiguity you flagged is a fair readability point, so I tightened it in e590549: a separator between digit groups is now mandatory (\(?\d{1,4}\)?(?:[\s.\-]\d{1,4}){1,4}), which removes the ambiguity entirely (provably linear), measures faster (~0.08ms), and matches the same labeled forms — plus the parenthesised (NNN) NNN-NNNN shape as a bonus. All 6 new tests and the 16 existing phone/trigger/snapshot suites (291 tests) still pass.

CC on behalf of @jan-kubica

@gemini-code-assist
Copy link
Copy Markdown

Thanks for the update, @jan-kubica. The refined regex pattern \(?\d{1,4}\)?(?:[\s.\-]\d{1,4}){1,4} with mandatory separators is a significant improvement for both performance and precision. It effectively eliminates the ambiguity while maintaining the necessary flexibility for the identified phone number formats. Given that the test suite remains green and the logic is now provably linear, this looks solid.

Address review: a real phone number has at least three digit groups, so a
two-group numeric range following a cue — especially a verb-used cue
("Please fax 2018-2019", "fax 100-200 pages") — is no longer matched. This
keeps year/section/page ranges from being redacted as phone numbers, and
also resolves the cross-line case where a blank "Fax" line preceded a range.
All labeled forms (415-796-1165, 702-657-1411, +86-138-2319-3216, etc.) keep
matching; bare numeric ranges do not.
@jan-kubica
Copy link
Copy Markdown
Contributor Author

Thanks @chatgpt-codex-connector — both real catches. Fixed in 027d208 by requiring the phone shape to have at least 3 digit groups (\(?\d{1,4}\)?(?:[\s.\-]\d{1,4}){2,4}):

  1. Verb-used cue + range: Please fax 2018-2019 tax returns and fax 100-200 pages are two-group ranges, so they no longer match. Added both as regression tests. All real labeled numbers (which are 3+ groups: 415-796-1165, 702-657-1411, +86-138-2319-3216) still match.

  2. Cross-line steal: the same ≥3-group shape removes the Fax\n2018-2019 … example (a 2-group range). The residual cross-line case only fires when the next line genuinely starts with a 3+ group phone shape — which, right after a Fax/phone label, is the actual number and correct to redact for anonymization. I deliberately did not add a strict same-line guard, since that would drop legitimate notice-block fax numbers written on the line below their label. The deeper mechanism you identified (the shared extractValue strip consuming the newline before the match-pattern newline guard) is pre-existing and affects every match-pattern phone group (fr/de too); if we want match-pattern to be strictly same-line, that's a good separate hardening with its own test sweep rather than bundling it into this data group.

292 tests pass (the new FP invariants + the 16 existing phone/trigger/snapshot suites).

CC on behalf of @jan-kubica

@chatgpt-codex-connector
Copy link
Copy Markdown

To use Codex here, create an environment for this repo.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 027d208b9c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

"telecopy",
"telecopier",
"facsimile",
"telephone",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Accept 'Telephone number' labels

When a notice uses the common label Telephone number: 702-657-1500, the added bare telephone trigger fires first, but extractValue only strips whitespace/colons/semicolons and the match-pattern strategy is anchored to the start of the remaining value, so it sees number: ... and rejects the phone. Since phone number cannot match inside telephone number due to the left word-boundary check, this leaves a labeled phone form in the same family as the new phone number/contact phone number cues undetected; add telephone number as a trigger or teach the value extractor to skip that filler.

Useful? React with 👍 / 👎.

"telecopier",
"facsimile",
"telephone",
"phone number",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Cover short Tel and Phone labels

For common US notice labels like Tel: 702-657-1500 or Phone: 702-657-1500, this new trigger group still does not run because it only includes telephone and phone number; the existing TEL_PREFIX_PHONE regex is not a fallback for these hyphenated 10-digit numbers, since it expects three 3-digit groups unless the digits are compact or country-code shaped. The new test input even contains Tel: 702-657-1500 but only asserts the following fax, so these ordinary labeled phone numbers still leak unless tel/phone variants are added with the same shape.

Useful? React with 👍 / 👎.

@jan-kubica jan-kubica merged commit 3ca1a52 into main Jun 4, 2026
2 checks passed
@jan-kubica jan-kubica deleted the feat/intl-and-labeled-phone branch June 4, 2026 11:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant