fix(en): improve normalization after voxpopuli results#27
Conversation
Align ref/hyp WER gaps for corpus aliases, possessives, article refs, elided percentages, and hundred-scale numbers misheard as trailing zero.
|
Warning Review limit reached
Next review available in: 9 minutes Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available. How can I continue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews. How do review limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please refer docs for additional details. Review details⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (7)
📝 WalkthroughWalkthroughAdds several English normalization fixes targeting VoxPopuli parliamentary corpus: possessive token cleanup in ChangesEnglish VoxPopuli Normalization
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
tests/unit/languages/english_voxpopuli_normalization_test.py (1)
26-38: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAdd a positive digit-
ofpercentage regression.The stack context calls out a new digit-
ofrewrite, but this table only exercises%, a spelled-out-number positive case, and a digit negative case. A bug in the literal-digit branch would still pass here. Please add something like("15 of latvia population", "15 percent of latvia population").🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/languages/english_voxpopuli_normalization_test.py` around lines 26 - 38, The percent normalization test coverage in test_percent_of_patterns is missing a positive digit-`of` case, so a bug in the literal-digit rewrite could slip through. Extend the parametrized cases in this test to include a digit-based input like “15 of latvia population” with the expected “15 percent of latvia population”, so the pipeline.normalize behavior is verified for the digit-`of` branch as well.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@normalization/languages/english/operators.py`:
- Line 195: The percent-insertion regex in operators.py is too broad and
rewrites legitimate “<number> of …” phrases, including cases that should remain
as the prior rule/article normalization handled them. Tighten the matching logic
around the existing text rewrite near the percent conversion so it only fires in
the intended spoken-percentage contexts (for example, by constraining it with
surrounding words or possessive/proper-noun patterns), and ensure the
`rule`→`article` replacement and the `re.sub(...)` percent conversion in the
same normalization flow no longer conflict on phrases like “rule 142 of the
agenda” or “5 of the members.”
In `@normalization/languages/english/replacements.py`:
- Around line 1746-1751: The VoxPopuli/parliamentary aliases in
ENGLISH_REPLACEMENTS are too broad and are rewriting valid English tokens like
puttering in all normalization paths. Move these entries out of the shared
replacements map and into a VoxPopuli-specific preset or normalization step,
keeping the general English replacement map free of corpus-specific aliases; use
the replacements module and the ENGLISH_REPLACEMENTS symbol to locate the shared
mapping.
In `@normalization/steps/text/remove_trailing_apostrophe_space.py`:
- Around line 19-20: The current remove_trailing_apostrophe_space text
normalization is over-aggressive because it collapses any 2+ letter "<word> s"
pattern, not just possessives. Update remove_trailing_apostrophe_space.py so the
trailing-s handling happens earlier in the apostrophe-stripping flow, or
restrict it to a narrower corpus-specific transform instead of the shared
text_post substitution; use the existing remove_trailing_apostrophe_space
function as the place to remove the global re.sub on "\b([a-z]{2,}) s\b" and
preserve real tokens like "letter s" and "model s".
---
Nitpick comments:
In `@tests/unit/languages/english_voxpopuli_normalization_test.py`:
- Around line 26-38: The percent normalization test coverage in
test_percent_of_patterns is missing a positive digit-`of` case, so a bug in the
literal-digit rewrite could slip through. Extend the parametrized cases in this
test to include a digit-based input like “15 of latvia population” with the
expected “15 percent of latvia population”, so the pipeline.normalize behavior
is verified for the digit-`of` branch as well.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 47c2e6b0-ea95-4762-b791-a1d1a813bc60
📒 Files selected for processing (7)
docs/steps.mdnormalization/languages/english/number_normalizer.pynormalization/languages/english/operators.pynormalization/languages/english/replacements.pynormalization/steps/text/remove_spaces_between_adjacent_digits.pynormalization/steps/text/remove_trailing_apostrophe_space.pytests/unit/languages/english_voxpopuli_normalization_test.py
Align ref/hyp WER gaps for corpus aliases, possessives, article refs, elided percentages, and hundred-scale numbers misheard as trailing zero.
What does this PR do?
Align ref/hyp WER gaps for corpus aliases, possessives, article refs,
elided percentages, and hundred-scale numbers misheard as trailing zero.
Type of change
Checklist
Only fill in the section(s) that match your change — delete the rest.
Edit existing language
replacements.py, not inline inoperators.pyNone: the step reading it still handlesNonegracefullyHow was this tested?
Summary by CodeRabbit
New Features
Bug Fixes
Documentation