feat(wikidata-lexemes): forms, content-POS gap escape, quality filters #2
Add alternative forms, English content-POS gap coverage, and aggressive
quality filtering to the Wikidata Lexemes extension. Top-1000 EN coverage
goes from 84% (omw-en alone) to 99.2% with the extension.
Major changes:
- Emit `<Form>` elements from each lexeme's `forms[]`, so lookups for
inflected forms like `mine`/`should`/`could` resolve to their base
lemma's entry. Filter out forms with negation grammaticalFeature
`Q1478451` (`shouldn't`, `shan't`) and apostrophe-leading contractions
(`'ll`, `'d`).
- English content-POS gap escape: a SKIP_POS-classified lemma whose
(lemma, POS) isn't in omw-en under any morphological lookup is now
included. Pulls in ~7k missing function/content words.
- Modal-verb override (`P31 = Q560570`): always keep `can`/`will`/`shall`
so their forms `could`/`would`/`should` surface. Bypasses the archaic-
Wiktionary filter for these.
- Wiktionary REST fallback for 0-sense lexemes, with quality filters:
reference-only definitions, onomatopoeia / dialectal / archaic
category filters, capitalised-lemma filter, empty-claims filter.
- Filter dedup + multi-word-capital + empty-claims moved into
`filter_lexemes` so `kept_lang_senses` matches what we actually write
— otherwise sense-relation targets dangle when the merge step loads
the XML.
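A minimal sketch of the form filter described in the first bullet, assuming the usual Wikidata lexeme JSON shape (`forms[].representations[lang].value` and `forms[].grammaticalFeatures`). The names `keep_form` and `surface_forms` are hypothetical and are not taken from `create_extensions.py`.

```python
NEGATION_FEATURE = "Q1478451"  # negation feature named in the description above
LEADING_APOSTROPHES = "'\u2019\u02bc\u2018"  # ASCII, curly right, modifier letter, curly left


def keep_form(form: dict, lang: str = "en") -> bool:
    """Return False for negated forms and apostrophe-leading clitics."""
    if NEGATION_FEATURE in form.get("grammaticalFeatures", []):
        return False
    value = form.get("representations", {}).get(lang, {}).get("value", "")
    return bool(value) and value[0] not in LEADING_APOSTROPHES


def surface_forms(lexeme: dict, lang: str = "en") -> list[str]:
    """Collect distinct written forms, excluding the lemma itself."""
    lemma = lexeme.get("lemmas", {}).get(lang, {}).get("value")
    out: list[str] = []
    for form in lexeme.get("forms", []):
        if not keep_form(form, lang):
            continue
        value = form["representations"][lang]["value"]
        if value != lemma and value not in out:
            out.append(value)
    return out
```

With a `shall` lexeme carrying the forms `should`, `shouldn't` (tagged with the negation feature) and `'ll`, only `should` would survive this filter.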
Module refactor:
- New `_wikidata.py`: cached `fetch_wikidata_entity`/`get_label`/
`get_language_iso` + `safe_filename` + `cached_json_fetch`.
- New `_omw_en.py`: cached `omw_en_pos()` returning `{lemma: frozenset
of POSes}`.
- New `_wiktionary.py`: REST + Action API fetchers with per-thread
`requests.Session`, batched/paginated category fetches, HTML entity
decoding, archaic-or-sound filter scoped to English.
- `_pos_map.py`: derived `CONTENT_POS_MAP` (label → WN POS code) for
the gap-escape check.
- `create_extensions.py`: `xml.sax.saxutils.escape` replaces hand-rolled
XML escaping; `NormalizedSense` namedtuple replaces dict-shape
smuggling for Wiktionary fallback senses; in-script caches under
`extras/{wikidata,wiktionary,wiktionary-cats}/`.
EN extension: ~7,900 entries, 130 language files.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
📝 Walkthrough
Adds helper modules for cached Wikidata/Wiktionary access and POS/content filtering, refactors the extension generator to use them, documents caching and LANG_FILTER, and regenerates many per-language XML files with new Forms and some removals.
Changes
Lexeme Generation Pipeline and Data Refresh
Sequence Diagram(s)
sequenceDiagram
participant CLI as Generator CLI (create_extensions.py)
participant WD as Wikidata API
participant WKT as Wiktionary REST/API
participant Cache as Disk Cache
participant FS as XML Output
CLI->>Cache: try read entity/defs/categories
alt miss
CLI->>WD: GET EntityData (Q-code)
WD-->>CLI: entity JSON
CLI->>WKT: GET definitions/categories (en)
WKT-->>CLI: cleaned def + examples
CLI->>Cache: persist JSON
end
CLI-->>CLI: normalize senses, forms, ILI
CLI->>FS: write per-language XML entries/synsets
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~75 minutes
        return True
    if not any(any(f in c for f in _OLD_FRAGMENTS) for c in relevant):
        return False
    return lemma.lower() not in omw_en_pos()
Category substring false positives
Medium Severity
_is_archaic_en treats any English Wiktionary category whose title contains fragments like archaic or dialectal as a whole-word tag. That also matches categories such as English terms with archaic senses, which the same module documents as unsafe for hard filtering. Modern lemmas with only a secondary archaic sense can be dropped from the zero-sense Wiktionary fallback and weaken the content-POS gap fill.
Reviewed by Cursor Bugbot for commit fd417f2. Configure here.
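To illustrate the finding, here is a small sketch contrasting substring matching with exact category-title matching. The fragment tuple and the allow-list below are assumptions made for the example; they are not the actual contents of `_OLD_FRAGMENTS` or of `_wiktionary.py`.

```python
# Hypothetical stand-ins for the module's real constants.
FRAGMENTS = ("archaic", "dialectal")
HARD_FILTER_TITLES = frozenset({
    "English archaic terms",
    "English dialectal terms",
})


def substring_flags(categories: list[str]) -> bool:
    """Flag a lemma if any category title merely contains a fragment."""
    return any(any(frag in cat for frag in FRAGMENTS) for cat in categories)


def exact_flags(categories: list[str]) -> bool:
    """Flag a lemma only if a category title is on the allow-list."""
    return any(cat in HARD_FILTER_TITLES for cat in categories)


cats = ["English terms with archaic senses", "English nouns"]
print(substring_flags(cats), exact_flags(cats))
```

Under these assumptions the substring check flags a lemma whose only match is `English terms with archaic senses`, while the exact-title check leaves it alone.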
- Refactor `filter_lexemes`, `_fetch_one_batch`, `wiktionary_definition` below ruff's complexity ceiling by extracting helper functions (`_skip_pos_passes`, `_try_keep`, `_http_get_json`, `_parse_batch_response`, `_pick_definition`).
- Wrap long XML-emit and dict-comprehension lines.
- Mark `_LEADING_APOS` literal with `# noqa: RUF001` (the ambiguous Unicode chars are intentional; Wikidata uses them for clitic forms).
- Underscore-prefix the unused `lemma`/`pos_code` from `build_xml_entry`'s tuple now that dedup moved to `filter_lexemes`.
Also add `CLAUDE.md` requiring `hatch fmt --linter --check`, `hatch run mypy:check`, `hatch build`, and `hatch test` to all pass before opening or pushing to a PR.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Actionable comments posted: 14
🧹 Nitpick comments (2)
extensions/wikidata-lexemes/_omw_en.py (1)
8-12: ⚡ Quick win
Narrow the exception handling to specific exception types.
Catching `Exception` broadly can hide programming errors and makes debugging harder. Consider catching specific exceptions that are expected when `omw-en` is unavailable.
♻️ Proposed refinement
 def omw_en_pos() -> dict[str, frozenset[str]]:
     """Return {lemma_lower: frozenset of WN POSes}. Empty if omw-en unavailable."""
     try:
         import wn
         en = wn.Wordnet(lexicon="omw-en")
-    except Exception:
+    except (ImportError, LookupError, OSError):
         return {}
This catches:
- `ImportError` when `wn` is not installed
- `LookupError` when the lexicon is not available
- `OSError` for file system issues
As per static analysis, catching blind `Exception` should be avoided.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@extensions/wikidata-lexemes/_omw_en.py` around lines 8 - 12, The current broad except in the try block that imports wn and constructs en = wn.Wordnet(lexicon="omw-en") should be narrowed: catch ImportError for missing the wn package, LookupError for the lexicon not found, and OSError for file/IO issues instead of catching Exception; update the except clause to handle these specific exceptions and return {} for those cases so unexpected errors still propagate.
extensions/wikidata-lexemes/_wiktionary.py (1)
83-116: ⚖️ Poor tradeoff
Consider adding file-write locking for cache writes.
The `@cache` decorator memoizes in-memory, but if multiple threads call this function with the same `(lemma, lang_iso)` tuple before the first call completes, both may attempt to write to the same cache file (lines 112-115). While JSON writes are typically atomic for small files, explicit locking would guarantee correctness.
🔒 Optional: Add file locking
Consider using `fcntl.flock()` (Unix) or `msvcrt.locking()` (Windows) around the cache write, or a `threading.Lock` keyed by path.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@extensions/wikidata-lexemes/_wiktionary.py` around lines 83 - 116, The cache write in fetch_wiktionary (which uses _def_cache_path to determine path) can race if multiple threads/processes call the same (lemma, lang_iso) concurrently; add an explicit lock around the block that creates the parent directory and writes the JSON (the section that uses path.parent.mkdir and open(path, "w")/json.dump) to prevent concurrent writes: either a process-safe file lock (fcntl/msvcrt) or a per-path threading.Lock keyed by str(path) is acceptable; ensure the lock is acquired before writing and released after the file is flushed/closed so the function still returns data or None as before.
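One way the per-path `threading.Lock` option could look, as a rough sketch. The helper names `lock_for` and `write_cache` are hypothetical, and this only serializes threads within one process; it does nothing for concurrent processes.

```python
import json
import threading
from pathlib import Path

_registry_guard = threading.Lock()
_path_locks: dict[str, threading.Lock] = {}


def lock_for(path: Path) -> threading.Lock:
    """Return the lock dedicated to this path, creating it on first use."""
    with _registry_guard:
        return _path_locks.setdefault(str(path), threading.Lock())


def write_cache(path: Path, data: dict) -> None:
    """Write JSON to path while holding that path's lock."""
    with lock_for(path):
        path.parent.mkdir(parents=True, exist_ok=True)
        with open(path, "w", encoding="utf-8") as f:
            json.dump(data, f)
```

The registry itself needs a guard because `setdefault` on a shared dict from several threads should not be relied on to hand out exactly one lock per key.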
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@extensions/wikidata-lexemes/_wikidata.py`:
- Around line 22-31: The cache loader currently crashes on corrupted JSON and
leaves partial files from interrupted writes; modify the logic around
json.load/open and the post-fetch write to (1) catch json.JSONDecodeError
alongside FileNotFoundError and treat either as a cache miss so you call
fetch(), and (2) write the fetched data atomically by writing to a temporary
file in the same directory (use tempfile.NamedTemporaryFile or
path.with_suffix(".tmp")), flush and os.fsync the file, close it, then
os.replace(temp_path, path) to atomically replace the cache; ensure
path.parent.mkdir(...) runs before creating the temp file and clean up the temp
file on errors. Reference the existing identifiers: path, fetch(), json.load,
json.dump.
- Around line 45-47: The code accesses entities = data["entities"] and does
return entities.get(q_code) or next(iter(entities.values())), but if entities is
empty next(iter(...)) raises StopIteration; fix by checking for an empty
entities dict before using the fallback: after setting entities, if not entities
raise a clear exception or return None (or a sentinel) with a helpful message,
otherwise return entities.get(q_code) or next(iter(entities.values())); update
callers if you change the return contract.
In `@extensions/wikidata-lexemes/create_extensions.py`:
- Line 506: The tuple returned by build_xml_entry is being unpacked into entry,
synsets, lemma, pos_code but lemma and pos_code are never used; update the
unpacking at that call site to only capture the used values (e.g., assign entry,
synsets = result) or replace the unused names with throwaway underscores (entry,
synsets, _, _) so the code no longer declares unused variables; locate the
unpacking statement where result is assigned and change it accordingly.
- Line 306: The _LEADING_APOS string literal contains ambiguous apostrophe-like
glyphs; replace the literal characters in the _LEADING_APOS constant with
explicit Unicode escapes (ASCII apostrophe -> \u0027, curly right -> \u2019,
modifier-letter -> \u02BC, curly left -> \u2018) so the value becomes the
equivalent escaped sequence and avoids RUF001 lint warnings while preserving
behavior.
In `@extensions/wikidata-lexemes/output/es.xml`:
- Around line 1152-1153: The Spanish output contains an incorrect English form
because the exporter is emitting representations directly from
lexeme["forms"][].representations[lang_iso].value without validating language or
source; update the export logic that writes <Form writtenForm="..."/> to (1)
ensure it only uses representations where lang_iso == "es" (or the target export
language) and the representation's language metadata matches, (2) ignore or
log/skip values that clearly mismatch the lemma's language (e.g., ASCII/English
tokens like "location") or come from a flagged local dump, and (3) ideally
fallback to live Wikidata EntityData for LexicalEntry id "L1550831" when the
local representation is suspicious; locate the writer that produces Form entries
and change the emission to validate the representation's language tag and
sanitize/skip bad values before writing.
In `@extensions/wikidata-lexemes/output/et.xml`:
- Around line 357-363: The review flags that plural/polite paradigm forms (e.g.,
Form writtenForm="teie", "teid", "meie", "nemad") are being attached to singular
lemma entries like sina, mina, tema; update the form-to-lemma attachment logic
(the code path that maps paradigms to lemma entries—look for functions named
mapFormsToLemma, attachParadigmToLemma or similar) to verify person/number (and
politeness) compatibility before attaching: if a Form has number="plural" or a
politeness attribute, only attach it to a lemma whose lemmaEntry (e.g., 'sina',
'mina', 'tema') has matching person/number/politeness metadata, otherwise create
or attach to the appropriate plural/polite lemma entry (e.g., a separate lemma
for 'teie'/'meie' or mark the paradigm as plural). Ensure the XML generation
preserves correct number/person attributes for Forms so downstream consumers get
the correct singular vs. plural mappings.
In `@extensions/wikidata-lexemes/output/ig.xml`:
- Around line 327-329: Remove the non-lexical placeholder Form element that uses
writtenForm="Verb" from the lexeme entry with Sense id "L747365-S1": locate the
Lemma with writtenForm="gbā ākā" and the sibling Form element that contains the
POS label, and delete that Form node (or replace it with a valid Igbo surface
form if one exists) so only actual lexical surface forms remain in the entry.
- Around line 551-553: Entry L1019088 contains an English gloss incorrectly
emitted as a morphological Form ("Form writtenForm=\"Empty handed\""); remove
that Form element from the Lemma/L1019088 record and instead place the gloss
into the Sense record (Sense id="L1019088-S1") using the appropriate
gloss/definition field for English (e.g., a <Definition> or gloss attribute on
the Sense) so the gloss is not treated as an Igbo Form.
- Around line 635-637: The Form element with writtenForm="Ig" under the Lemma
"nwanne m nwokè" / Sense id "L708447-S1" is an abbreviation artifact and should
be removed; locate the Form node that has writtenForm="Ig" (associated with
Sense id "L708447-S1" and Lemma "nwanne m nwokè") and delete that Form entry so
only valid lexical variants remain.
In `@extensions/wikidata-lexemes/output/nl.xml`:
- Line 169: The Form element currently uses the lexical entry ID ("L1222827") as
the writtenForm value (<Form writtenForm="L1222827"/>); locate the generator
that emits Form elements for lexeme L1222827 (the routine that builds
Form/writtenForm entries) and replace the ID with the actual Dutch inflected
word string (e.g., "elven" or the correct form from the lexeme's
forms/representations), ensuring the writtenForm is drawn from the lexeme's
forms[] or representations[] field rather than the lexeme ID.
In `@extensions/wikidata-lexemes/output/pa.xml`:
- Line 14: Remove any <Form> elements whose writtenForm attribute is empty or
contains only invisible whitespace (e.g., U+200E/U+200F); specifically locate
the <Form> tags with writtenForm values that are whitespace-only and delete
those elements (or replace the attribute value with the actual lexical form if a
real form was intended). Ensure you target the <Form> element and its
writtenForm attribute when making the change so no surrounding XML structure is
broken.
- Line 968: The Form element writtenForm contains invisible directional marks
(zero-width characters) — locate the <Form writtenForm="ਚ"/> occurrence and
remove any U+200E/U+200F or other zero-width characters from the attribute value
so it becomes <Form writtenForm="ਚ"/>; also scan other Form writtenForm
attributes in the same file for similar invisible characters and strip them to
avoid rendering and matching issues.
In `@extensions/wikidata-lexemes/output/ug.xml`:
- Around line 36-40: The Form elements' writtenForm attributes in the UG lexeme
output contain trailing invisible bidi/control characters causing lookup
failures; update the serializer/generator that produces these Form writtenForm
values (or sanitize the ug.xml content) to strip Unicode control and bidi marks
(e.g., U+200E, U+200F, U+202A–U+202E, U+FEFF, ZWJ/ZWSP as appropriate) from
string values before writing them: locate where Form writtenForm attributes are
set/serialized and apply a trim/filter that removes these invisible characters
so the writtenForm attributes contain only the visible text (e.g., for the Form
element/writtenForm assignment logic).
In `@extensions/wikidata-lexemes/output/yi.xml`:
- Around line 230-231: Normalize and deduplicate Form writtenForm values by
stripping Unicode bidi control characters (e.g., U+200E, U+200F, U+202A–U+202E)
before emitting the <Form writtenForm="..."> attributes; when creating or
collecting forms (the code path that produces <Form writtenForm="..."> entries),
perform a normalization step that removes these invisible directionality marks
and then collapse duplicates so only one <Form> with the normalized writtenForm
value is emitted.
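Several of the inline comments above (pa.xml, ug.xml, yi.xml) come down to the same normalization step. A minimal sketch, assuming forms arrive as plain strings; `clean_written_forms` is a hypothetical helper, and the character set is limited to the code points the comments name (ZWJ/ZWNJ are deliberately left alone because some scripts need them).

```python
# Code points to delete: LRM, RLM, BOM, and the U+202A..U+202E embedding controls.
_BIDI_CONTROLS = {0x200E, 0x200F, 0xFEFF, *range(0x202A, 0x202F)}
_STRIP_TABLE = dict.fromkeys(_BIDI_CONTROLS)  # None values tell str.translate to delete


def clean_written_forms(forms: list[str]) -> list[str]:
    """Strip invisible marks, drop empty results, and dedupe in first-seen order."""
    seen: set[str] = set()
    out: list[str] = []
    for raw in forms:
        value = raw.translate(_STRIP_TABLE).strip()
        if value and value not in seen:
            seen.add(value)
            out.append(value)
    return out
```

Applying this before emitting `<Form writtenForm="..."/>` would cover the empty-form, trailing-mark, and duplicate cases in one pass.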
---
Nitpick comments:
In `@extensions/wikidata-lexemes/_omw_en.py`:
- Around line 8-12: The current broad except in the try block that imports wn
and constructs en = wn.Wordnet(lexicon="omw-en") should be narrowed: catch
ImportError for missing the wn package, LookupError for the lexicon not found,
and OSError for file/IO issues instead of catching Exception; update the except
clause to handle these specific exceptions and return {} for those cases so
unexpected errors still propagate.
In `@extensions/wikidata-lexemes/_wiktionary.py`:
- Around line 83-116: The cache write in fetch_wiktionary (which uses
_def_cache_path to determine path) can race if multiple threads/processes call
the same (lemma, lang_iso) concurrently; add an explicit lock around the block
that creates the parent directory and writes the JSON (the section that uses
path.parent.mkdir and open(path, "w")/json.dump) to prevent concurrent writes —
either a process-safe file lock (fcntl/msvcrt) or a per-path threading.Lock
keyed by str(path) is acceptable; ensure the lock is acquired before writing and
released after the file is flushed/closed so the function still returns data or
None as before.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f0f68ac9-03e3-4da8-aaf3-e7fcd171bde3
📒 Files selected for processing (82)
- extensions/wikidata-lexemes/README.md
- extensions/wikidata-lexemes/_omw_en.py
- extensions/wikidata-lexemes/_pos_map.py
- extensions/wikidata-lexemes/_wikidata.py
- extensions/wikidata-lexemes/_wiktionary.py
- extensions/wikidata-lexemes/create_extensions.py
- extensions/wikidata-lexemes/output/ae.xml
- extensions/wikidata-lexemes/output/af.xml
- extensions/wikidata-lexemes/output/ar.xml
- extensions/wikidata-lexemes/output/az.xml
- extensions/wikidata-lexemes/output/bn.xml
- extensions/wikidata-lexemes/output/br.xml
- extensions/wikidata-lexemes/output/ca.xml
- extensions/wikidata-lexemes/output/cs.xml
- extensions/wikidata-lexemes/output/cy.xml
- extensions/wikidata-lexemes/output/da.xml
- extensions/wikidata-lexemes/output/de.xml
- extensions/wikidata-lexemes/output/dz.xml
- extensions/wikidata-lexemes/output/el.xml
- extensions/wikidata-lexemes/output/en.xml
- extensions/wikidata-lexemes/output/eo.xml
- extensions/wikidata-lexemes/output/es.xml
- extensions/wikidata-lexemes/output/et.xml
- extensions/wikidata-lexemes/output/eu.xml
- extensions/wikidata-lexemes/output/fa.xml
- extensions/wikidata-lexemes/output/ff.xml
- extensions/wikidata-lexemes/output/fi.xml
- extensions/wikidata-lexemes/output/fo.xml
- extensions/wikidata-lexemes/output/fr.xml
- extensions/wikidata-lexemes/output/ga.xml
- extensions/wikidata-lexemes/output/gu.xml
- extensions/wikidata-lexemes/output/ha.xml
- extensions/wikidata-lexemes/output/he.xml
- extensions/wikidata-lexemes/output/hr.xml
- extensions/wikidata-lexemes/output/id.xml
- extensions/wikidata-lexemes/output/ig.xml
- extensions/wikidata-lexemes/output/is.xml
- extensions/wikidata-lexemes/output/it.xml
- extensions/wikidata-lexemes/output/ja.xml
- extensions/wikidata-lexemes/output/ka.xml
- extensions/wikidata-lexemes/output/kl.xml
- extensions/wikidata-lexemes/output/ko.xml
- extensions/wikidata-lexemes/output/ks.xml
- extensions/wikidata-lexemes/output/la.xml
- extensions/wikidata-lexemes/output/lb.xml
- extensions/wikidata-lexemes/output/lv.xml
- extensions/wikidata-lexemes/output/mi.xml
- extensions/wikidata-lexemes/output/ml.xml
- extensions/wikidata-lexemes/output/mn.xml
- extensions/wikidata-lexemes/output/ms.xml
- extensions/wikidata-lexemes/output/mt.xml
- extensions/wikidata-lexemes/output/nb.xml
- extensions/wikidata-lexemes/output/nl.xml
- extensions/wikidata-lexemes/output/nn.xml
- extensions/wikidata-lexemes/output/oc.xml
- extensions/wikidata-lexemes/output/oj.xml
- extensions/wikidata-lexemes/output/pa.xml
- extensions/wikidata-lexemes/output/pl.xml
- extensions/wikidata-lexemes/output/ps.xml
- extensions/wikidata-lexemes/output/pt.xml
- extensions/wikidata-lexemes/output/rn.xml
- extensions/wikidata-lexemes/output/ro.xml
- extensions/wikidata-lexemes/output/ru.xml
- extensions/wikidata-lexemes/output/sa.xml
- extensions/wikidata-lexemes/output/sd.xml
- extensions/wikidata-lexemes/output/sk.xml
- extensions/wikidata-lexemes/output/sq.xml
- extensions/wikidata-lexemes/output/sr.xml
- extensions/wikidata-lexemes/output/sv.xml
- extensions/wikidata-lexemes/output/sw.xml
- extensions/wikidata-lexemes/output/ta.xml
- extensions/wikidata-lexemes/output/tl.xml
- extensions/wikidata-lexemes/output/tr.xml
- extensions/wikidata-lexemes/output/tw.xml
- extensions/wikidata-lexemes/output/ug.xml
- extensions/wikidata-lexemes/output/uk.xml
- extensions/wikidata-lexemes/output/uz.xml
- extensions/wikidata-lexemes/output/vi.xml
- extensions/wikidata-lexemes/output/wo.xml
- extensions/wikidata-lexemes/output/yi.xml
- extensions/wikidata-lexemes/output/za.xml
- extensions/wikidata-lexemes/output/zu.xml
💤 Files with no reviewable changes (2)
- extensions/wikidata-lexemes/output/za.xml
- extensions/wikidata-lexemes/output/rn.xml
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        pass
    data = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2)
    return data
Make cache writes atomic and recover from corrupted cache JSON.
An interrupted write can leave a partial file; next run then crashes on JSON parsing and won’t self-heal.
Proposed fix
 def cached_json_fetch(path: Path, fetch: Callable[[], dict]) -> dict:
     try:
-        with open(path, encoding="utf-8") as f:
+        with open(path, encoding="utf-8") as f:
             return json.load(f)
-    except FileNotFoundError:
+    except (FileNotFoundError, json.JSONDecodeError):
         pass
     data = fetch()
     path.parent.mkdir(parents=True, exist_ok=True)
-    with open(path, "w", encoding="utf-8") as f:
+    tmp_path = path.with_suffix(path.suffix + ".tmp")
+    with open(tmp_path, "w", encoding="utf-8") as f:
         json.dump(data, f, indent=2)
+    tmp_path.replace(path)
     return data
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@extensions/wikidata-lexemes/_wikidata.py` around lines 22 - 31, The cache
loader currently crashes on corrupted JSON and leaves partial files from
interrupted writes; modify the logic around json.load/open and the post-fetch
write to (1) catch json.JSONDecodeError alongside FileNotFoundError and treat
either as a cache miss so you call fetch(), and (2) write the fetched data
atomically by writing to a temporary file in the same directory (use
tempfile.NamedTemporaryFile or path.with_suffix(".tmp")), flush and os.fsync the
file, close it, then os.replace(temp_path, path) to atomically replace the
cache; ensure path.parent.mkdir(...) runs before creating the temp file and
clean up the temp file on errors. Reference the existing identifiers: path,
fetch(), json.load, json.dump.
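The prompt above asks for more than the proposed diff shows (flush, `os.fsync`, temp-file cleanup). A sketch of that fuller variant, under the assumption that the cache is a single JSON file per key; the name `cached_json_fetch_atomic` is hypothetical and chosen so it is not mistaken for the function in `_wikidata.py`.

```python
import contextlib
import json
import os
import tempfile
from pathlib import Path
from typing import Callable


def cached_json_fetch_atomic(path: Path, fetch: Callable[[], dict]) -> dict:
    """Read JSON from path; on a miss or corrupt file, fetch and replace atomically."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        pass  # treat both as a cache miss
    data = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)
    # Temp file lives in the same directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_name, path)
    except BaseException:
        # Remove the partial temp file, then let the original error propagate.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise
    return data
```

Using `tempfile.mkstemp` rather than a fixed `.tmp` suffix also avoids two writers clobbering each other's temp file.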
    entities = data["entities"]
    return entities.get(q_code) or next(iter(entities.values()))
Guard against empty entities payloads before fallback selection.
If entities is empty, next(iter(...)) raises StopIteration and obscures the root cause.
Proposed fix
     data = cached_json_fetch(EXTRAS_DIR / f"{q_code}.json", _fetch)
     entities = data["entities"]
-    return entities.get(q_code) or next(iter(entities.values()))
+    if not entities:
+        raise ValueError(f"No entities returned for {q_code}")
+    return entities.get(q_code) or next(iter(entities.values()))
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
-    entities = data["entities"]
-    return entities.get(q_code) or next(iter(entities.values()))
+    entities = data["entities"]
+    if not entities:
+        raise ValueError(f"No entities returned for {q_code}")
+    return entities.get(q_code) or next(iter(entities.values()))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@extensions/wikidata-lexemes/_wikidata.py` around lines 45 - 47, The code
accesses entities = data["entities"] and does return entities.get(q_code) or
next(iter(entities.values())), but if entities is empty next(iter(...)) raises
StopIteration; fix by checking for an empty entities dict before using the
fallback: after setting entities, if not entities raise a clear exception or
return None (or a sentinel) with a helpful message, otherwise return
entities.get(q_code) or next(iter(entities.values())); update callers if you
change the return contract.
    return relations


_LEADING_APOS = "'’ʼ‘"  # ASCII, curly right, modifier-letter, curly left
Use explicit Unicode escapes for _LEADING_APOS literals.
Line 306 contains ambiguous apostrophe-like characters (RUF001). Switching to escapes keeps behavior and avoids ambiguous-source lint warnings.
Proposed fix
-_LEADING_APOS = "'’ʼ‘"  # ASCII, curly right, modifier-letter, curly left
+_LEADING_APOS = "'\u2019\u02BC\u2018"  # ASCII, curly right, modifier-letter, curly left
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
-_LEADING_APOS = "'’ʼ‘"  # ASCII, curly right, modifier-letter, curly left
+_LEADING_APOS = "'\u2019\u02BC\u2018"  # ASCII, curly right, modifier-letter, curly left
🧰 Tools
🪛 Ruff (0.15.13)
[warning] 306-306: String contains ambiguous ’ (RIGHT SINGLE QUOTATION MARK). Did you mean ` (GRAVE ACCENT)? (RUF001)
[warning] 306-306: String contains ambiguous ʼ (MODIFIER LETTER APOSTROPHE). Did you mean ` (GRAVE ACCENT)? (RUF001)
[warning] 306-306: String contains ambiguous ‘ (LEFT SINGLE QUOTATION MARK). Did you mean ` (GRAVE ACCENT)? (RUF001)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@extensions/wikidata-lexemes/create_extensions.py` at line 306, The
_LEADING_APOS string literal contains ambiguous apostrophe-like glyphs; replace
the literal characters in the _LEADING_APOS constant with explicit Unicode
escapes (ASCII apostrophe -> \u0027, curly right -> \u2019, modifier-letter ->
\u02BC, curly left -> \u2018) so the value becomes the equivalent escaped
sequence and avoids RUF001 lint warnings while preserving behavior.
<Lemma writtenForm="alcazar" partOfSpeech="n"/>
<Form writtenForm="location"/>
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check if other lexemes have similar cross-language form issues
rg -n '<Form writtenForm="[a-z]+"' extensions/wikidata-lexemes/output/es.xml | \
  grep -i 'location\|place\|castle' | head -20
Repository: sign/wn
Length of output: 93
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# 1) Inspect the full lexeme block around the reported lines
sed -n '1135,1175p' extensions/wikidata-lexemes/output/es.xml
# 2) Find where "location" appears in the repo/codegen
rg -n --hidden --no-ignore-vcs '"location"' extensions/wikidata-lexemes || true
rg -n --hidden --no-ignore-vcs 'writtenForm="location"' . || true
# 3) Check for other English-like single-token forms in the Spanish output
# (limited to a small curated list to avoid huge output)
rg -n '<Form writtenForm="(place|castle|location|home|city|country|house|river|mountain|street|road)"/>' extensions/wikidata-lexemes/output/es.xml || true
Repository: sign/wn
Length of output: 2318
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Trace how writtenForm values are generated from Wikidata lexemes
rg -n 'writtenForm=' extensions/wikidata-lexemes -S
rg -n 'L1550831' extensions/wikidata-lexemes -S || true
rg -n 'location' extensions/wikidata-lexemes -SRepository: sign/wn
Length of output: 50365
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# 1) Inspect generator logic around where <Form writtenForm="..."/> is emitted
sed -n '380,480p' extensions/wikidata-lexemes/create_extensions.py
# 2) Inspect POS map around the "location": NOUN entry
sed -n '50,110p' extensions/wikidata-lexemes/_pos_map.py
# 3) Find references to the specific lexeme id in repo (narrow search)
rg -n 'L1550831|1550831' extensions/wikidata-lexemes || true
rg -n 'alcazar' extensions/wikidata-lexemes || true
# 4) Locate where input lexeme data is loaded (to understand what "alt_forms" is)
rg -n 'alt_forms|forms|lexeme' extensions/wikidata-lexemes/create_extensions.py
Repository: sign/wn
Length of output: 8637
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect the alternative-form extraction logic
sed -n '300,380p' extensions/wikidata-lexemes/create_extensions.py
# Fetch Wikidata Lexeme JSON for L1550831 and look for Spanish-form writtenRep values.
# Use the official Special:EntityData endpoint.
curl -sS 'https://www.wikidata.org/wiki/Special:EntityData/L1550831.json' \
  | python3 - <<'PY'
import json,sys
data=json.load(sys.stdin)
ent=data.get('entities',{}).get('L1550831',{})
# Try to locate forms for Spanish (language entity for Spanish is "Q36533").
# Wikidata lexeme structure can vary; print any writtenRep values containing "location".
forms=[]
for k,v in ent.items():
    pass
# Walk recursively and collect any "writtenRep" strings
def walk(x):
    if isinstance(x, dict):
        for kk,vv in x.items():
            if kk=='value' and isinstance(vv,str):
                pass
        for kk,vv in x.items():
            # writtenRep usually appears under keys like 'P' or as nested structure; just search strings
            return walk(list(x.items()))
    if isinstance(x, list):
        for i in x:
            walk(i)
    else:
        return
# Better: just collect all strings equal to "location" anywhere in JSON
def collect_strings(x, out):
    if isinstance(x, str):
        if x == 'location':
            out.append(x)
    elif isinstance(x, dict):
        for vv in x.values():
            collect_strings(vv,out)
    elif isinstance(x, list):
        for vv in x:
            collect_strings(vv,out)
out=[]
collect_strings(data,out)
print("count_exact_string_location:", len(out))
PY
Repository: sign/wn
Length of output: 3109
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# 1) Fetch Wikidata lexeme data and find any form where a representation value is exactly "location"
curl -sS 'https://www.wikidata.org/wiki/Special:EntityData/L1550831.json' | python3 - <<'PY'
import json, sys
data=json.load(sys.stdin)
ent = data.get("entities", {}).get("L1550831", {})
forms = ent.get("forms", [])
def collect():
    hits=[]
    for i,f in enumerate(forms):
        reps = f.get("representations", {}) or {}
        # reps is usually keyed by language code with { "language":..., "value":... }
        for lang, obj in reps.items():
            val = obj.get("value") if isinstance(obj, dict) else None
            if val == "location":
                hits.append((i, lang, reps))
    return hits
hits = collect()
print("num_form_hits_exact_location:", len(hits))
for i,lang,reps in hits[:20]:
    # show all rep values for that form so we can see if "location" is only present in en/other languages
    rep_vals = {k:(v.get("value") if isinstance(v,dict) else None) for k,v in (reps or {}).items()}
    print("hit_form_index:", i, "lang:", lang, "rep_values_for_form:", rep_vals)
PY
# 2) Inspect how languages are iterated for output files (to ensure lang_iso is correct)
rg -n "write_all_extensions|for lang_iso|lang_iso" extensions/wikidata-lexemes/create_extensions.py
Repository: sign/wn
Length of output: 803
🏁 Script executed:
#!/bin/bash
set -euo pipefail
URL='https://www.wikidata.org/wiki/Special:EntityData/L1550831.json'
echo "== curl headers =="
curl -sS -L -D /tmp/curl_headers.txt -o /tmp/curl_body.txt "$URL" || true
cat /tmp/curl_headers.txt | head -30
echo
echo "== curl body prefix (first 300 chars) =="
python3 - <<'PY'
p="/tmp/curl_body.txt"
with open(p,'rb') as f:
    b=f.read()
print("body_bytes:", len(b))
print(b[:300].decode('utf-8','replace'))
PY
# If body looks like JSON, extract "location" representations
python3 - <<'PY'
import json
p="/tmp/curl_body.txt"
with open(p,'r',encoding='utf-8',errors='ignore') as f:
    txt=f.read()
if not txt.strip().startswith('{'):
    print("body_not_json; skipping parse")
    raise SystemExit(0)
data=json.loads(txt)
ent=data.get("entities",{}).get("L1550831",{})
hits=[]
for i,f in enumerate(ent.get("forms",[]) or []):
    reps=f.get("representations",{}) or {}
    for lang,obj in reps.items():
        val=obj.get("value") if isinstance(obj,dict) else None
        if val=="location":
            hits.append((i,lang))
print("hits_exact_location_forms:", hits[:50])
print("num_hits:", len(hits))
PY
# Inspect how lang_iso is set/iterated for this generator
sed -n '450,590p' extensions/wikidata-lexemes/create_extensions.py
Repository: sign/wn
Length of output: 5981
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# 1) Inspect get_language_iso + language filtering to rule out mapping bugs
rg -n "def get_language_iso|LANG_FILTER|ENGLISH_LANG_Q" extensions/wikidata-lexemes/create_extensions.py
sed -n '1,120p' extensions/wikidata-lexemes/create_extensions.py
sed -n '120,260p' extensions/wikidata-lexemes/create_extensions.py
# 2) Fetch Wikidata entity data for L1550831 and check for any occurrence of "location"
URL='https://www.wikidata.org/wiki/Special:EntityData/L1550831.json'
curl -sS -L "$URL" > /tmp/L1550831.json
python3 - <<'PY'
import json
txt=open("/tmp/L1550831.json","r",encoding="utf-8",errors="ignore").read()
print("raw_contains_location_substring:", "location" in txt.lower())
data=json.loads(txt)
ent=data.get("entities",{}).get("L1550831",{})
print("entity_keys:", sorted(ent.keys()))
forms=ent.get("forms", None)
print("forms_type:", type(forms).__name__, "forms_len:", (len(forms) if isinstance(forms,list) else None))
# Collect any string values exactly equal to "location" (case-sensitive) anywhere in forms->representations
hits=[]
def walk(x):
    if isinstance(x,str):
        if x=="location":
            hits.append(x)
    elif isinstance(x,dict):
        for v in x.values():
            walk(v)
    elif isinstance(x,list):
        for v in x:
            walk(v)
walk(ent.get("forms",{}))
print("hits_exact_string_location_within_forms:", len(hits))
PY
# 3) See whether this issue is isolated to this lexeme in Spanish output
rg -n 'writtenForm="location"' extensions/wikidata-lexemes/output/es.xml
Repository: sign/wn
Length of output: 9642
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# 1) Confirm mapping of Wikidata language Q -> lang_iso
sed -n '1,200p' extensions/wikidata-lexemes/_wikidata.py
# 2) Inspect the actual pipeline input dump for lexeme L1550831
ls -lh extensions/wikidata-lexemes/latest-lexemes.json.bz2
python3 - <<'PY'
import bz2, json, ijson
from pathlib import Path
dump_path = Path("extensions/wikidata-lexemes/latest-lexemes.json.bz2")
target_id = "L1550831"
found = False
with bz2.open(dump_path, "rb") as f:
    # file is expected to be a stream of "item" objects (generator uses ijson.items(f, "item"))
    for lex in ijson.items(f, "item"):
        if not isinstance(lex, dict):
            continue
        if lex.get("id") != target_id:
            continue
        found = True
        lang_iso = "es"
        main_lemma = lex.get("lemmas", {}).get(lang_iso, {}).get("value")
        print("found_lexeme_id:", lex.get("id"))
        print("lemma_es:", main_lemma)
        forms = lex.get("forms", []) or []
        hits=[]
        for idx, form in enumerate(forms):
            reps = form.get("representations", {}) or {}
            val = reps.get(lang_iso, {}).get("value") if isinstance(reps.get(lang_iso), dict) else None
            if val == "location":
                hits.append(idx)
        print("num_forms_total:", len(forms))
        print("num_alt_form_hits_location_in_lang_es:", len(hits))
        if hits:
            print("hit_form_indices:", hits[:50])
            for idx in hits[:10]:
                form = forms[idx]
                reps = form.get("representations", {}) or {}
                # print all representations for that form to see where "location" appears
                rep_map = {}
                if isinstance(reps, dict):
                    for k,v in reps.items():
                        if isinstance(v, dict) and "value" in v:
                            rep_map[k] = v["value"]
                print(f"form_index_{idx}_rep_map:", rep_map)
        # also print any forms for other languages where value == location
        other_hits=[]
        for idx, form in enumerate(forms):
            reps = form.get("representations", {}) or {}
            for k,v in reps.items() if isinstance(reps, dict) else []:
                if isinstance(v, dict) and v.get("value") == "location":
                    other_hits.append((idx,k))
        print("num_forms_hits_location_any_lang:", len(other_hits))
        if other_hits:
            print("first_other_hits:", other_hits[:20])
        break
if not found:
    print("lexeme_not_found_in_dump:", target_id)
PY
Repository: sign/wn
Length of output: 2148
🏁 Script executed:
#!/bin/bash
set -euo pipefail
URL='https://www.wikidata.org/wiki/Special:EntityData/L1550831.json'
curl -sS -L "$URL" > /tmp/L1550831.json
python3 - <<'PY'
import json
p="/tmp/L1550831.json"
data=json.load(open(p,'r',encoding='utf-8',errors='ignore'))
ents=data.get("entities",{})
keys=list(ents.keys())
print("entities_keys:", keys[:20], "count:", len(keys))
# If the response is not for L1550831, show what it is and grab es lemma if present
for k in keys[:5]:
    ent=ents.get(k, {})
    lem=ent.get("lemmas",{})
    if "es" in lem:
        print("entity:", k, "lemmas.es.value:", lem["es"].get("value"))
    # search for forms representations where lang key is exactly "es" and value is exactly "location"
    hits=[]
    for idx, form in enumerate(ent.get("forms",[]) or []):
        rep= (form.get("representations",{}) or {}).get("es",{})
        val= rep.get("value") if isinstance(rep,dict) else None
        if val=="location":
            hits.append(idx)
    print("entity:", k, "num_forms_with_rep_es_location:", len(hits), "hits:", hits[:10])
PY
Repository: sign/wn
Length of output: 181
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# 1) Find the actual lexeme dump file(s) present in the repo
fd -a -t f "latest-lexemes*.json.bz2" extensions/wikidata-lexemes || true
fd -a -t f "*.json.bz2" extensions/wikidata-lexemes || true
fd -a -t f "*lexemes*".json.bz2 extensions/wikidata-lexemes || true
# 2) Inspect cached Wikidata entity fetches for L1550831 or L2461
ls -lah extensions/wikidata-lexemes/extras/wikidata 2>/dev/null || true
for q in L1550831 L2461; do
  f="extensions/wikidata-lexemes/extras/wikidata/${q}.json"
  echo "== $f ==";
  if [ -f "$f" ]; then
    rg -n '"value"\s*:\s*"location"|\"location\"' "$f" || true
    python3 - <<PY
import json
p="$f"
data=json.load(open(p,'r',encoding='utf-8',errors='ignore'))
ents=data.get("entities",{})
ent=ents.get("${q}") or next(iter(ents.values()),{})
# find Spanish form reps equal to location
hits=[]
for idx,form in enumerate(ent.get("forms",[]) or []):
    rep=(form.get("representations",{}) or {}).get("es",{}) or {}
    val=rep.get("value") if isinstance(rep,dict) else None
    if val=="location":
        hits.append(idx)
print("es_form_reps_equal_location_count:", len(hits), "hits:", hits[:10])
PY
  else
    echo "missing"
  fi
done
Repository: sign/wn
Length of output: 798
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n 'writtenForm="location"' extensions/wikidata-lexemes/output/es.xml
sed -n '1148,1162p' extensions/wikidata-lexemes/output/es.xml
Repository: sign/wn
Length of output: 706
Fix: English “location” emitted as a Spanish form for lexeme L1550831
- `extensions/wikidata-lexemes/output/es.xml` line 1153 contains `<Form writtenForm="location"/>` under `<LexicalEntry id="L1550831">`, whose lemma is `alcazar` (Spanish), but `location` is English, not Spanish; `<Form>` values are emitted verbatim from `lexeme["forms"][].representations[lang_iso].value`.
- The live Wikidata EntityData for `L1550831` appears to return the corresponding entity `L2461` with lemma `alcázar`, and it does not show a Spanish (`es`) form representation equal to `"location"`, so this is likely coming from the locally used lexeme dump/cache (or a bad extraction input), not current Wikidata.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@extensions/wikidata-lexemes/output/es.xml` around lines 1152 - 1153, The
Spanish output contains an incorrect English form because the exporter is
emitting representations directly from
lexeme["forms"][].representations[lang_iso].value without validating language or
source; update the export logic that writes <Form writtenForm="..."/> to (1)
ensure it only uses representations where lang_iso == "es" (or the target export
language) and the representation's language metadata matches, (2) ignore or
log/skip values that clearly mismatch the lemma's language (e.g., ASCII/English
tokens like "location") or come from a flagged local dump, and (3) ideally
fallback to live Wikidata EntityData for LexicalEntry id "L1550831" when the
local representation is suspicious; locate the writer that produces Form entries
and change the emission to validate the representation's language tag and
sanitize/skip bad values before writing.
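As a rough illustration of point (1) in the prompt above, here is a minimal sketch of a language-checked form extractor. The function name `forms_for_language` is hypothetical and not taken from `create_extensions.py`; the sketch assumes the usual Wikidata shape where each representation is stored as `{"language": ..., "value": ...}` under its language key. Note that it only guards against a key/tag mismatch: a value that was wrongly entered under the `es` key in the dump itself would still pass.

```python
def forms_for_language(lexeme: dict, lang_iso: str) -> list[str]:
    """Collect form values whose representation key AND inner language tag match lang_iso.

    Hypothetical helper for illustration; not the generator's actual code.
    """
    values = []
    for form in lexeme.get("forms") or []:
        rep = (form.get("representations") or {}).get(lang_iso)
        if not isinstance(rep, dict):
            continue  # no representation for the target language
        # Skip entries whose inner language tag disagrees with the key.
        if rep.get("language", lang_iso) != lang_iso:
            continue
        value = rep.get("value")
        if isinstance(value, str) and value.strip():
            values.append(value)
    return values


# Made-up sample data, shaped like a Wikidata lexeme.
sample = {
    "forms": [
        {"representations": {"es": {"language": "es", "value": "alcázares"}}},
        {"representations": {"en": {"language": "en", "value": "location"}}},
        {"representations": {"es": {"language": "en", "value": "location"}}},
    ]
}
print(forms_for_language(sample, "es"))
```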
| </LexicalEntry> | ||
| <LexicalEntry id="L1222827"> | ||
| <Lemma writtenForm="elf" partOfSpeech="m"/> | ||
| <Form writtenForm="L1222827"/> |
Invalid form value: lexical entry ID used instead of word form.
Line 169 contains <Form writtenForm="L1222827"/> where the form value is the lexical entry ID itself rather than an actual inflected word form. This appears to be a data generation error. The form should contain a valid Dutch word (e.g., "elven" or another appropriate inflected form of "elf"), not the entry identifier.
🔍 Verification script
#!/bin/bash
# Check for any Form elements using IDs as writtenForm values across all output files
rg -n 'writtenForm="L\d+' extensions/wikidata-lexemes/output/
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@extensions/wikidata-lexemes/output/nl.xml` at line 169, The Form element
currently uses the lexical entry ID ("L1222827") as the writtenForm value (<Form
writtenForm="L1222827"/>); locate the generator that emits Form elements for
lexeme L1222827 (the routine that builds Form/writtenForm entries) and replace
the ID with the actual Dutch inflected word string (e.g., "elven" or the correct
form from the lexeme's forms/representations), ensuring the writtenForm is drawn
from the lexeme's forms[] or representations[] field rather than the lexeme ID.
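A generator-side guard could complement the `rg` check above. The following is a minimal sketch under the assumption that leaked identifiers look like Wikidata lexeme IDs (`L` followed by digits) or form IDs (for example `L1222827-F1`); `looks_like_entity_id` is a hypothetical name, and the check would also reject a genuine word of that exact shape, should one exist.

```python
import re

# Lexeme IDs ("L123") and form IDs ("L123-F4") as used by Wikidata.
_ENTITY_ID_RE = re.compile(r"L\d+(-F\d+)?")


def looks_like_entity_id(written_form: str) -> bool:
    """Return True when a would-be writtenForm is really a lexeme/form identifier."""
    return _ENTITY_ID_RE.fullmatch(written_form.strip()) is not None


for candidate in ["L1222827", "L1222827-F2", "elf", "elven"]:
    print(candidate, looks_like_entity_id(candidate))
```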
| <LexicalEntry id="L679444"> | ||
| <Lemma writtenForm="ਮੈਂ" partOfSpeech="h"/> | ||
| <Form writtenForm="ਮੈਨੂੰ"/> | ||
| <Form writtenForm=""/> |
Remove empty or whitespace-only Form elements.
Several <Form> elements have empty or whitespace-only writtenForm attributes (containing only zero-width characters like U+200E or U+200F). These appear at lines 14, 79, 340, 858, and 1052. Empty forms serve no linguistic purpose and degrade data quality.
🗑️ Recommended fix: Remove these empty Form elements
Search for and remove all instances matching this pattern:
- <Form writtenForm=""/>
Or if the intent was to represent a specific morphological form, replace with the actual written form.
Also applies to: 79-79, 340-340, 858-858, 1052-1052
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@extensions/wikidata-lexemes/output/pa.xml` at line 14, Remove any <Form>
elements whose writtenForm attribute is empty or contains only invisible
whitespace (e.g., U+200E/U+200F); specifically locate the <Form> tags with
writtenForm values that are whitespace-only and delete those elements (or
replace the attribute value with the actual lexical form if a real form was
intended). Ensure you target the <Form> element and its writtenForm attribute
when making the change so no surrounding XML structure is broken.
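One way to catch these at generation time rather than by editing the XML is sketched below. It treats a value as empty when nothing remains after dropping whitespace and Unicode format characters (general category `Cf`, which includes U+200E and U+200F). The name `is_effectively_empty` is hypothetical.

```python
import unicodedata


def is_effectively_empty(written_form: str) -> bool:
    """True when the string has no visible characters.

    Whitespace and Unicode format characters (category "Cf", e.g. U+200E,
    U+200F, U+FEFF) are ignored when deciding.
    """
    return not any(
        not ch.isspace() and unicodedata.category(ch) != "Cf"
        for ch in written_form
    )


# Values use escapes so the invisible characters are explicit.
for value in ["", "\u200e", " \u200f\t", "\u0a1a"]:
    print(repr(value), is_effectively_empty(value))
```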
| <Form writtenForm="ਵਿੱਚ"/> | ||
| <Form writtenForm="ਵਿਚੋਂ"/> | ||
| <Form writtenForm="ਵਿੱਚੋਂ"/> | ||
| <Form writtenForm="ਚ"/> |
Clean up zero-width characters in Form element.
Line 968 contains a Form with embedded zero-width characters: <Form writtenForm="ਚ"/>. These invisible characters (likely U+200E or U+200F marks) may cause text rendering issues or prevent proper string matching.
🧹 Recommended fix: Remove zero-width characters
- <Form writtenForm="ਚ"/>
+ <Form writtenForm="ਚ"/>
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@extensions/wikidata-lexemes/output/pa.xml` at line 968, The Form element
writtenForm contains invisible directional marks (zero-width characters) —
locate the <Form writtenForm="ਚ"/> occurrence and remove any U+200E/U+200F or
other zero-width characters from the attribute value so it becomes <Form
writtenForm="ਚ"/>; also scan other Form writtenForm attributes in the same file
for similar invisible characters and strip them to avoid rendering and matching
issues.
| <Form writtenForm="مېنىڭ"/> | ||
| <Form writtenForm="ماڭا"/> | ||
| <Form writtenForm="مېنى"/> | ||
| <Form writtenForm="مەندە"/> | ||
| <Form writtenForm="مەندىن"/> |
Remove hidden bidi/control marks from writtenForm values.
These form strings include trailing invisible marks, which can cause lookup misses and duplicate-like entries.
Proposed fix
- <Form writtenForm="مېنىڭ"/>
- <Form writtenForm="ماڭا"/>
- <Form writtenForm="مېنى"/>
- <Form writtenForm="مەندە"/>
- <Form writtenForm="مەندىن"/>
+ <Form writtenForm="مېنىڭ"/>
+ <Form writtenForm="ماڭا"/>
+ <Form writtenForm="مېنى"/>
+ <Form writtenForm="مەندە"/>
+ <Form writtenForm="مەندىن"/>
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@extensions/wikidata-lexemes/output/ug.xml` around lines 36 - 40, The Form
elements' writtenForm attributes in the UG lexeme output contain trailing
invisible bidi/control characters causing lookup failures; update the
serializer/generator that produces these Form writtenForm values (or sanitize
the ug.xml content) to strip Unicode control and bidi marks (e.g., U+200E,
U+200F, U+202A–U+202E, U+FEFF, ZWJ/ZWSP as appropriate) from string values
before writing them: locate where Form writtenForm attributes are set/serialized
and apply a trim/filter that removes these invisible characters so the
writtenForm attributes contain only the visible text (e.g., for the Form
element/writtenForm assignment logic).
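The sanitizing step described in this prompt could look roughly like the sketch below. It deletes only directional marks, embeddings, isolates and the BOM via `str.translate`; ZWJ and ZWNJ are deliberately left alone because they can be orthographically meaningful in some scripts. The name `strip_bidi_controls` and the exact code-point set are assumptions, not the project's code.

```python
# Code points to delete: LRM, RLM, ALM, embeddings/overrides (U+202A..U+202E),
# isolates (U+2066..U+2069), and the BOM / zero-width no-break space.
_BIDI_CONTROLS = dict.fromkeys(
    [0x200E, 0x200F, 0x061C, *range(0x202A, 0x202F), *range(0x2066, 0x206A), 0xFEFF]
)


def strip_bidi_controls(text: str) -> str:
    """Remove invisible directionality marks, then trim surrounding whitespace."""
    return text.translate(_BIDI_CONTROLS).strip()


dirty = "\u0645\u0627\u06ad\u0627\u200f"  # Arabic-script letters plus a trailing RLM
clean = strip_bidi_controls(dirty)
print(len(dirty), len(clean))
```

Applying this before the value reaches the XML writer would keep lookups and deduplication working on the visible text only.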
| <Form writtenForm="אײַך"/> | ||
| <Form writtenForm="אײַך"/> |
Normalize and deduplicate bidi-control variants in forms.
Line 230 and Line 231 are effectively the same form (אײַך) with/without an invisible directionality mark. This can create duplicate lookup keys and inconsistent matching downstream. Normalize/strip bidi control chars before emitting <Form> values.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@extensions/wikidata-lexemes/output/yi.xml` around lines 230 - 231, Normalize
and deduplicate Form writtenForm values by stripping Unicode bidi control
characters (e.g., U+200E, U+200F, U+202A–U+202E) before emitting the <Form
writtenForm="..."> attributes; when creating or collecting forms (the code path
that produces <Form writtenForm="..."> entries), perform a normalization step
that removes these invisible directionality marks and then collapse duplicates
so only one <Form> with the normalized writtenForm value is emitted.
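The normalize-then-collapse step might be sketched as follows, assuming forms arrive as a plain list of strings. `dedupe_forms` is a hypothetical name; first-seen order is preserved so regenerated output stays stable between runs.

```python
import re

# Directional marks, embeddings/overrides, and isolates.
_BIDI_RE = re.compile("[\u200e\u200f\u202a-\u202e\u2066-\u2069]")


def dedupe_forms(forms: list[str]) -> list[str]:
    """Strip bidi controls from each form and drop duplicates, keeping first-seen order."""
    seen: dict[str, None] = {}
    for form in forms:
        cleaned = _BIDI_RE.sub("", form)
        if cleaned:
            seen.setdefault(cleaned, None)
    return list(seen)


# Two variants of the same visible string collapse into one entry.
print(dedupe_forms(["abc", "abc\u200f", "xyz"]))  # ['abc', 'xyz']
```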
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit 8d754b8. Configure here.
| data = {} | ||
| else: | ||
| # Transient (429, 5xx, ...) — don't cache; retry next run. | ||
| return None |
Transient Wiktionary errors cached
Medium Severity
fetch_wiktionary is wrapped in functools.cache, but it returns None on network errors and on transient HTTP statuses (429/5xx) without writing disk. That None is memoized for the rest of the run, so later calls never hit the on-disk cache or retry the API even though the comment says transient failures should be retried.
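One possible shape for a fix is a memoizing decorator that refuses to store `None`, so a transient failure is retried on the next call while successful results stay cached. This is a minimal sketch, not the project's code: `cache_unless_none` and the `flaky` stand-in are hypothetical, and the wrapper is unsynchronized, so concurrent callers may fetch the same key twice.

```python
import functools


def cache_unless_none(func):
    """Memoize positional-argument calls, but never remember a None result."""
    memo = {}

    @functools.wraps(func)
    def wrapper(*args):
        if args in memo:
            return memo[args]
        result = func(*args)
        if result is not None:  # only successful fetches are cached
            memo[args] = result
        return result

    return wrapper


calls = []


@cache_unless_none
def flaky(word):
    """Stand-in for a fetcher: fails on the first call, succeeds afterwards."""
    calls.append(word)
    return None if len(calls) == 1 else {"word": word}


print(flaky("mine"), flaky("mine"), flaky("mine"), len(calls))
```

This changes behaviour for permanent misses as well: anything that should be remembered as "known empty" has to be returned as a non-`None` value such as `{}`.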


Summary
- Emit `<Form>` elements from each Wikidata lexeme's `forms[]`, so lookups for inflected forms like `mine`/`should`/`could` resolve to the base lemma's entry. Filter out forms with negation grammaticalFeature `Q1478451` (`shouldn't`, `shan't`) and apostrophe-leading contractions (`'ll`, `'d`).
- English content-POS gap escape: a `SKIP_POS`-classified lemma whose `(lemma, POS)` isn't already in omw-en (under any morphological lookup) is now kept. Pulls in ~7k missing function/content words.
- Modal-verb override (`P31 = Q560570`): always keep `can`/`will`/`shall` so their forms `could`/`would`/`should` surface; this bypasses the archaic-Wiktionary filter for these.
- Filter dedup + multi-word-capital + empty-claims moved into `filter_lexemes` so `kept_lang_senses` matches what's actually written; without this, sense-relation targets dangle and the merge step rejects the XML.
Module refactor
- `_wikidata.py`: cached `fetch_wikidata_entity`/`get_label`/`get_language_iso` + `safe_filename` + `cached_json_fetch`.
- `_omw_en.py`: cached `omw_en_pos()` returning `{lemma: frozenset of POSes}`.
- `_wiktionary.py`: REST + Action API fetchers with per-thread `requests.Session`, batched/paginated category fetches, HTML entity decoding, archaic-or-sound filter scoped to English.
- `_pos_map.py`: derived `CONTENT_POS_MAP` (Wikidata POS label → WN POS code) for the gap-escape check.
- `create_extensions.py`: `xml.sax.saxutils.escape` replaces hand-rolled XML escaping; `NormalizedSense` namedtuple replaces dict-shape smuggling; caches now under `extras/{wikidata,wiktionary,wiktionary-cats}/`.
Coverage
Verified end-to-end via the Docker image. EN extension: ~7,900 entries, 130 language files. Top-1000 most-common English words covered: omw-en alone 94.1% → with this extension 99.2%. Remaining 8 (`microsoft`, `sony`, `ebay`, `et`, `rss`, `st`, `jul`, `jun`) are brand names / abbreviations / Latin loanwords, which are defensible exclusions.
Test plan
- `docker build --network=host --build-arg NO_PROXY= --build-arg no_proxy= --build-arg HTTP_PROXY= --build-arg HTTPS_PROXY= -t wn-test .` succeeds (merge step loads all 130 XMLs without dangling-relation errors).
- `docker run --rm wn-test python -c "import wn, wn.morphy; en = wn.Wordnet(lexicon='omw-en', lemmatizer=wn.morphy.Morphy()); print([(w.lemma(), w.pos, w.synsets()[0].definition()) for w in en.words('mine')])"` returns the `my` pronoun entry with definition `(first-person singular possessive) belonging to me`.
- `should` returns `shall (v, "Used before a verb to indicate the simple future tense …")`.
- `python create_extensions.py` regenerates from the dump deterministically (no network calls if the Wiktionary cache is warm).
🤖 Generated with Claude Code
Note
Medium Risk
Changes the lexeme filtering and XML generation pipeline (including new network-backed Wiktionary/Wikidata caching), which can materially alter extension contents and introduce flaky behavior if caching/filters are wrong, but is scoped to the extension generator.
Overview
Expands the `wikidata-lexemes` extension generator to emit `<Form>` variants from Wikidata `forms[]` (skipping negation features and apostrophe-leading contractions) so inflected lookups resolve to the base lemma.
Adjusts lexeme selection to fill English content-POS gaps when `omw-en` lacks a `(lemma, POS)` (with a modal-verb keep override), dedupes by `(lang, lemma, pos)` to prevent dangling sense relations, and adds quality filters to drop likely encyclopedic/proper-noun-derived or unlinked English lexemes.
Adds cached Wikidata entity resolution and a Wiktionary REST/Action-API fallback to generate definitions/examples for 0-sense lexemes (with reference/onomatopoeia/dialectal/archaic filters), plus `LANG_FILTER` support and updated docs; regenerated `output/*.xml` reflects the new `<Form>` emissions.
Reviewed by Cursor Bugbot for commit 8d754b8. Bugbot is set up for automated code reviews on this repo.
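For readers skimming the overview, the form-skipping rule it describes can be sketched as below. This is an illustration under assumptions, not the generator's actual code: `keep_form` is a hypothetical name, the form dict follows the Wikidata shape with a `grammaticalFeatures` list of Q-ids, and treating a leading typographic apostrophe (U+2019) like a straight one is an added assumption.

```python
NEGATION_FEATURE = "Q1478451"  # negation grammatical feature named in the PR description


def keep_form(form: dict, lang_iso: str) -> bool:
    """Return False for negated forms and apostrophe-leading contractions."""
    if NEGATION_FEATURE in (form.get("grammaticalFeatures") or []):
        return False
    rep = (form.get("representations") or {}).get(lang_iso) or {}
    value = rep.get("value", "")
    # Assumption: a leading curly apostrophe is treated like a straight one.
    return bool(value) and not value.startswith(("'", "\u2019"))


# Made-up sample forms in the Wikidata shape.
sample_forms = [
    {"grammaticalFeatures": [], "representations": {"en": {"value": "should"}}},
    {"grammaticalFeatures": ["Q1478451"], "representations": {"en": {"value": "shouldn't"}}},
    {"grammaticalFeatures": [], "representations": {"en": {"value": "'ll"}}},
]
print([f["representations"]["en"]["value"] for f in sample_forms if keep_form(f, "en")])
```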