feat(anonymize): detect labeled fax / telecopy / phone numbers#175
Conversation
Contract notice blocks label phone and fax numbers with cues ("Telecopy:",
"Fax", "Facsimile:", "Contact phone number:"). US numbers are often written
without parentheses ("Fax 702-657-1411") and international numbers use "+CC"
grouping ("+86-138-2319-3216"). None of these were detected (the existing
regex tier covers `(NNN) NNN-NNNN`, `tel:`-prefixed, and `+`-anchored fixed
groupings), so they survived into anonymized output.
Add a language-agnostic phone trigger group (triggers.global.json, mirrored
to packages/data/config) that pairs a telephony cue with a flexible
multi-group number pattern and a ">=6 digits" validation. Generality lives in
the cue + shape: the cue is mandatory, so number-shaped ranges in prose
("2018-2019", "sections 100-200") are not matched, and a cue with no following
number does not emit.
Not handled here: a "No."/"#" filler between the cue and the number
("Facsimile No.: ..."). The shared trigger value-extractor strips leading
whitespace/colons but not a number-word token, so that form needs a separate
extractValue enhancement and is left for a follow-up.
There was a problem hiding this comment.
Code Review
This pull request introduces a new trigger configuration, intl-labeled-phone, to detect international labeled phone, fax, and telecopy numbers, along with comprehensive unit tests to verify its behavior. However, the regular expression pattern used for matching these phone numbers contains nested quantifiers and overlapping character classes, which poses a risk of catastrophic backtracking (ReDoS) when evaluated against complex or long strings. It is recommended to simplify the pattern to ensure linear-time matching.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| "label": "phone number", | ||
| "strategy": { | ||
| "type": "match-pattern", | ||
| "pattern": "(?:\\+?\\d{1,3}[\\s.\\-]?)?(?:\\(?\\d{1,4}\\)?[\\s.\\-]?){2,5}\\d{2,4}" |
There was a problem hiding this comment.
The regular expression pattern (?:\\+?\\d{1,3}[\\s.\\-]?)?(?:\\(?\\d{1,4}\\)?[\\s.\\-]?){2,5}\\d{2,4} contains nested quantifiers and overlapping character classes (such as \\d and [\\s.\\-] which are adjacent, and multiple groups of optional separators). When evaluated against long, partially matching strings, this can lead to catastrophic backtracking (ReDoS). Since this pattern is evaluated using a standard NFA-based regex engine in JavaScript, consider simplifying the pattern or reducing the complexity of the optional groups to ensure linear-time matching.
| "label": "phone number", | ||
| "strategy": { | ||
| "type": "match-pattern", | ||
| "pattern": "(?:\\+?\\d{1,3}[\\s.\\-]?)?(?:\\(?\\d{1,4}\\)?[\\s.\\-]?){2,5}\\d{2,4}" |
There was a problem hiding this comment.
The regular expression pattern (?:\\+?\\d{1,3}[\\s.\\-]?)?(?:\\(?\\d{1,4}\\)?[\\s.\\-]?){2,5}\\d{2,4} contains nested quantifiers and overlapping character classes (such as \\d and [\\s.\\-] which are adjacent, and multiple groups of optional separators). When evaluated against long, partially matching strings, this can lead to catastrophic backtracking (ReDoS). Since this pattern is evaluated using a standard NFA-based regex engine in JavaScript, consider simplifying the pattern or reducing the complexity of the optional groups to ensure linear-time matching.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5e1891e40b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| { | ||
| "id": "intl-labeled-phone", | ||
| "triggers": [ | ||
| "fax", |
There was a problem hiding this comment.
Require punctuation for ambiguous fax cues
With fax added as a bare trigger, prose that uses it as a verb followed by a numeric range is now emitted as a phone number; for example, Please fax 2018-2019 tax returns or fax 100-200 pages satisfies the added pattern and the ≥6 digit validation. This can redact ordinary dates or section/page ranges whenever they immediately follow the verb, so the ambiguous cue should require label punctuation/spacing that distinguishes a field label from prose, or the phone shape should be tightened for bare cues.
Useful? React with 👍 / 👎.
| "type": "match-pattern", | ||
| "pattern": "(?:\\+?\\d{1,3}[\\s.\\-]?)?(?:\\(?\\d{1,4}\\)?[\\s.\\-]?){2,5}\\d{2,4}" |
There was a problem hiding this comment.
Stop match-pattern phone labels at line breaks
For this new match-pattern phone group, a label with no inline value can still steal the first number on the next line because extractValue strips leading \s before the match-pattern newline guard runs. In a flattened notice like Fax\n2018-2019 annual report, the new group emits 2018-2019 as a phone number even though the cue line is blank; the group needs a same-line guard or a trigger shape that does not allow newline-crossing extraction.
Useful? React with 👍 / 👎.
Address review: require a separator between digit groups so the pattern has no overlapping optional-separator ambiguity (provably linear, no backtracking concern). Same matches as before for all labeled forms; also accepts the parenthesised "(NNN) NNN-NNNN" shape.
|
Thanks @gemini-code-assist. I checked this empirically: the original pattern's quantifiers are both bounded ( That said, the optional-separator ambiguity you flagged is a fair readability point, so I tightened it in e590549: a separator between digit groups is now mandatory ( CC on behalf of @jan-kubica |
|
Thanks for the update, @jan-kubica. The refined regex pattern |
Address review: a real phone number has at least three digit groups, so a
two-group numeric range following a cue — especially a verb-used cue
("Please fax 2018-2019", "fax 100-200 pages") — is no longer matched. This
keeps year/section/page ranges from being redacted as phone numbers, and
also resolves the cross-line case where a blank "Fax" line preceded a range.
All labeled forms (415-796-1165, 702-657-1411, +86-138-2319-3216, etc.) keep
matching; bare numeric ranges do not.
|
Thanks @chatgpt-codex-connector — both real catches. Fixed in 027d208 by requiring the phone shape to have at least 3 digit groups (
292 tests pass (the new FP invariants + the 16 existing phone/trigger/snapshot suites). CC on behalf of @jan-kubica |
|
To use Codex here, create an environment for this repo. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 027d208b9c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| "telecopy", | ||
| "telecopier", | ||
| "facsimile", | ||
| "telephone", |
There was a problem hiding this comment.
Accept 'Telephone number' labels
When a notice uses the common label Telephone number: 702-657-1500, the added bare telephone trigger fires first, but extractValue only strips whitespace/colons/semicolons and the match-pattern strategy is anchored to the start of the remaining value, so it sees number: ... and rejects the phone. Since phone number cannot match inside telephone number due to the left word-boundary check, this leaves a labeled phone form in the same family as the new phone number/contact phone number cues undetected; add telephone number as a trigger or teach the value extractor to skip that filler.
Useful? React with 👍 / 👎.
| "telecopier", | ||
| "facsimile", | ||
| "telephone", | ||
| "phone number", |
There was a problem hiding this comment.
Cover short Tel and Phone labels
For common US notice labels like Tel: 702-657-1500 or Phone: 702-657-1500, this new trigger group still does not run because it only includes telephone and phone number; the existing TEL_PREFIX_PHONE regex is not a fallback for these hyphenated 10-digit numbers, since it expects three 3-digit groups unless the digits are compact or country-code shaped. The new test input even contains Tel: 702-657-1500 but only asserts the following fax, so these ordinary labeled phone numbers still leak unless tel/phone variants are added with the same shape.
Useful? React with 👍 / 👎.
What
Contract notice blocks label phone/fax numbers with cues —
Telecopy:,Fax,Facsimile:,Contact phone number:— and write them in forms the existing detectors miss:Fax 702-657-1411+CCgrouping:+86-138-2319-3216The current regex tier covers
(NNN) NNN-NNNN,tel:-prefixed, and+-anchored fixed groupings, so these labeled/international numbers survived into anonymized output.How
A language-agnostic phone trigger group in
triggers.global.json(mirrored topackages/data/config/), modeled on the existingfr-phonegroup:fax,telecopy,telecopier,facsimile,telephone,phone number,contact phone numbermatch-patternstrategy with a flexible multi-group number pattern(?:\d[^\d]*){6,}(≥6 digit) validationNo regex/DFA changes. Generality lives in the cue + shape:
2018-2019,sections 100-200) are not matched;Tests
src/__test__/labeled-phone.test.ts— Telecopy/Fax/Facsimile/+CCdetection plus the two mandatory-cue invariants. Also verified the 16 existing phone/trigger/snapshot test files (285 tests) still pass.Follow-up (not in this PR)
A
No./#filler between the cue and the number (Facsimile No.: ...) is currently a clean miss: the shared trigger value-extractor strips leading whitespace/colons but not a number-word token. Handling it cleanly needs a smallextractValueenhancement (strip an optionalNo./Number/#filler), which would also benefitAccount No.:/Routing No.:. Left for a separate PR to keep this one data-only.CC on behalf of @jan-kubica