Hi TN date class accuracy improvement #418
Signed-off-by: Shreyas Pawar <shrpawar@nvidia.com>
for more information, see https://pre-commit.ci
…ion of more test cases Signed-off-by: Shreyas Pawar <shrpawar@nvidia.com>
…mm-yyyy, dd-m-yyyy and mm-yyyy Signed-off-by: Shreyas Pawar <shrpawar@nvidia.com>
# Create union of suffixes and prefixes
suffix_union = pynini.union(*suffixes_list)
prefix_union = pynini.union(*prefixes_list)
Can l36 through here be replaced by a string file (pynini.string_file())?
Thanks for the suggestion! I tried pynini.string_file() for both unions. It works for suffix_union if we add identity columns to suffixes.tsv: the suffix comes after graph_year in the concatenation, so it does not affect which year graph path is selected. For prefix_union, however, string_file() made graph_year_thousands incorrectly win over graph_year_hundreds_as_thousands for years like 1999, 1920 and 1971, even with identity columns. prefix_union comes before graph_year in the concatenation, and the identity transducer from string_file() alters the weight landscape at that point.
I considered adjusting weights to compensate, but graph_year_thousands and graph_year_hundreds_as_thousands are designed to be mutually exclusive, and it was unclear why string_file() broke that exclusivity, which made weight tuning risky. To keep both unions consistent and avoid that risk, I now use pynini.string_map() for both. It eliminates the intermediate list variables and the verbose open() + pynini.union() block, handles single-column entries correctly, and preserves the expected FST behavior.
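For readers following along, here is a minimal sketch of the two construction styles discussed above. It is not the code from this PR: the entries in PREFIX_LINES and the helper names are hypothetical, and the import guard is there only so the snippet loads where pynini is not installed. It compares the languages accepted by the two unions and does not attempt to reproduce the weight interaction with graph_year.

```python
try:
    import pynini
except ImportError:  # lets the sketch load without pynini installed
    pynini = None

# Hypothetical single-column entries, standing in for the lines of a TSV file.
PREFIX_LINES = ["सन\n", "ईस्वी\n", "\n"]


def read_entries(lines):
    """Strip whitespace and drop blank lines, as an open() loop would."""
    return [line.strip() for line in lines if line.strip()]


def union_verbose(entries):
    """Verbose style: collect a list, then union its members."""
    return pynini.union(*entries).optimize()


def union_string_map(entries):
    """Compact style: one call; a bare string maps to itself."""
    return pynini.string_map(entries).optimize()


entries = read_entries(PREFIX_LINES)
if pynini is not None:
    # Both are optimized acceptors, so they can be compared directly.
    same_language = pynini.equivalent(union_verbose(entries), union_string_map(entries))
```

If same_language holds, the two styles accept the same strings, which matches the claim that string_map() handles single-column entries without identity columns.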
०४-०३~चार मार्च
25-03-2020~पच्चीस मार्च दो हज़ार बीस
३०-०५-२०७०~तीस मई दो हज़ार सत्तर
12-07-1970~बारह जुलाई उन्नीस सौ सत्तर
This test case has been restored.
12-07-1970~बारह जुलाई उन्नीस सौ सत्तर
०९-१२-२१०१~नौ दिसंबर इक्कीस सौ एक
23-08-2024~तेईस अगस्त दो हज़ार चौबीस
१०-२९-२०००~अक्टूबर उनतीस दो हज़ार
१०-२९-२०००, 11-14-1100 (MM-DD-YYYY cases):
These were removed following your earlier feedback on MM-DD: "if we can't normalize all, does it really make sense to cover for some?" The same applies to MM-DD-YYYY: since we can only handle cases where the day is unambiguously > 12, the coverage is partial. To stay consistent with that decision, we removed MM-DD-YYYY support as well.
०३-२०१०, 11-2024 (MM-YYYY cases):
These were removed because MM-YYYY is not a standard date format and is highly ambiguous; it cannot be reliably distinguished from DD-YYYY. For example, 03-2010 could equally be read as day 3 of the year 2010, or as a range from three to two thousand ten. Since we cannot confidently resolve this ambiguity, keeping these cases would risk incorrect normalizations.
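To illustrate why MM-DD-YYYY coverage can only ever be partial, here is a small plain-Python sketch. It is not part of the grammar: the function name is hypothetical, and it assumes days run 1 to 31 and months 1 to 12 without checking month lengths. It lists which day/month orders are valid for a three-field date.

```python
def day_month_readings(text):
    """Return the day/month orders that are valid for an 'NN-NN-YYYY' string.

    Assumption: days are 1-31 and months are 1-12; month lengths are ignored.
    int() accepts Devanagari digits, so '१०-२९-२०००' parses like '10-29-2000'.
    """
    first, second, _year = (int(part) for part in text.split("-"))
    readings = []
    if 1 <= first <= 31 and 1 <= second <= 12:
        readings.append("DD-MM-YYYY")
    if 1 <= first <= 12 and 1 <= second <= 31:
        readings.append("MM-DD-YYYY")
    return readings


print(day_month_readings("25-03-2020"))   # ['DD-MM-YYYY']
print(day_month_readings("१०-२९-२०००"))  # ['MM-DD-YYYY']
print(day_month_readings("05-03-2020"))   # ['DD-MM-YYYY', 'MM-DD-YYYY']
```

Only inputs whose second field exceeds 12 come back as MM-DD-YYYY alone; anything like 05-03-2020 stays ambiguous, which is the partial coverage described above.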
What does this PR do?
Improved Date class accuracy from ~87% to ~99% by adding graph coverage for cases that previously failed.
Before your PR is "Ready for review"
Pre checks:
- git commit -s to sign.
- pytest or (if your machine does not have GPU) pytest --cpu from the root folder (given you marked your test cases accordingly @pytest.mark.run_only_on('CPU')).
- bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
- pytest and Sparrowhawk here.
- __init__.py for every folder and subfolder, including data folder which has .TSV files?
- Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. to all newly added Python files?
- Copyright 2015 and onwards Google, Inc.. See an example here.
- (try import: ... except: ...) if not already done.
PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.