
feat(wikidata-lexemes): forms, content-POS gap escape, quality filters#2

Merged
AmitMY merged 2 commits into main from feat/wikidata-extension-coverage
May 22, 2026

Conversation

@AmitMY

@AmitMY AmitMY commented May 22, 2026

Summary

  • Emit <Form> elements from each Wikidata lexeme's forms[], so lookups for inflected forms like mine/should/could resolve to the base lemma's entry. Filter out forms with negation grammaticalFeature Q1478451 (shouldn't, shan't) and apostrophe-leading contractions ('ll, 'd).
  • English content-POS gap escape: a SKIP_POS-classified lemma whose (lemma, POS) isn't already in omw-en (under any morphological lookup) is now kept. Pulls in ~7k missing function/content words.
  • Modal-verb override (P31 = Q560570): always keep can/will/shall so their forms could/would/should surface — bypasses the archaic-Wiktionary filter for these.
  • Wiktionary REST fallback for 0-sense lexemes, with quality filters: reference-only definitions, onomatopoeia / dialectal / archaic category filters, capitalised-lemma filter, empty-claims filter.
  • Quality filters (dedup + multi-word-capital + empty-claims) moved into filter_lexemes so kept_lang_senses matches what's actually written — without this, sense-relation targets dangle and the merge step rejects the XML.
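The form filter described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `usable_forms` and the constant names are hypothetical, and the input is assumed to follow the Wikidata lexeme JSON layout (`forms[].representations[lang].value` and `forms[].grammaticalFeatures`).

```python
# Hypothetical sketch of the <Form> filter; names are illustrative only.
NEGATION_FEATURE = "Q1478451"  # negation grammaticalFeature named in the summary
# ASCII apostrophe, curly right, modifier letter, curly left (written as escapes)
LEADING_APOS = "\u0027\u2019\u02bc\u2018"


def usable_forms(lexeme: dict, lang: str) -> list[str]:
    """Return form strings worth emitting as <Form> elements."""
    kept = []
    for form in lexeme.get("forms", []):
        # Skip negated forms such as "shouldn't".
        if NEGATION_FEATURE in form.get("grammaticalFeatures", []):
            continue
        rep = form.get("representations", {}).get(lang)
        if rep is None:
            continue
        value = rep["value"]
        # Skip empty values and apostrophe-leading contractions such as 'll.
        if not value or value[0] in LEADING_APOS:
            continue
        kept.append(value)
    return kept


shall = {
    "forms": [
        {"representations": {"en": {"language": "en", "value": "should"}},
         "grammaticalFeatures": []},
        {"representations": {"en": {"language": "en", "value": "shouldn't"}},
         "grammaticalFeatures": ["Q1478451"]},
        {"representations": {"en": {"language": "en", "value": "\u2019ll"}},
         "grammaticalFeatures": []},
    ]
}
print(usable_forms(shall, "en"))  # ['should']
```
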

Module refactor

  • New _wikidata.py: cached fetch_wikidata_entity / get_label / get_language_iso + safe_filename + cached_json_fetch.
  • New _omw_en.py: cached omw_en_pos() returning {lemma: frozenset of POSes}.
  • New _wiktionary.py: REST + Action API fetchers with per-thread requests.Session, batched/paginated category fetches, HTML entity decoding, archaic-or-sound filter scoped to English.
  • _pos_map.py: derived CONTENT_POS_MAP (Wikidata POS label → WN POS code) for the gap-escape check.
  • create_extensions.py: xml.sax.saxutils.escape replaces hand-rolled XML escaping; NormalizedSense namedtuple replaces dict-shape smuggling; caches now under extras/{wikidata,wiktionary,wiktionary-cats}/.
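As a rough illustration of the escaping change in the last bullet, the standard library already covers both text and attribute escaping. The helper `form_element` below is hypothetical, and using `quoteattr` for the attribute value is an assumption; the PR only states that `xml.sax.saxutils.escape` replaces the hand-rolled escaping.

```python
from xml.sax.saxutils import escape, quoteattr


def form_element(written_form: str) -> str:
    """Hypothetical helper: serialize one <Form> element.

    quoteattr() escapes the value and adds the surrounding quotes.
    """
    return f"<Form writtenForm={quoteattr(written_form)}/>"


print(escape("a < b & c"))          # a &lt; b &amp; c
print(form_element("rock & roll"))  # <Form writtenForm="rock &amp; roll"/>
```
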

Coverage

Verified end-to-end via the Docker image. EN extension: ~7,900 entries, 130 language files. Top-1000 most-common English words covered: omw-en alone 94.1% → with this extension 99.2%. Remaining 8 (microsoft, sony, ebay, et, rss, st, jul, jun) are brand names / abbreviations / Latin loanwords — defensible exclusions.

Test plan

  • docker build --network=host --build-arg NO_PROXY= --build-arg no_proxy= --build-arg HTTP_PROXY= --build-arg HTTPS_PROXY= -t wn-test . succeeds (merge step loads all 130 XMLs without dangling-relation errors).
  • docker run --rm wn-test python -c "import wn, wn.morphy; en = wn.Wordnet(lexicon='omw-en', lemmatizer=wn.morphy.Morphy()); print([(w.lemma(), w.pos, w.synsets()[0].definition()) for w in en.words('mine')])" returns the my pronoun entry with definition (first-person singular possessive) belonging to me.
  • Same lookup for should returns shall (v, "Used before a verb to indicate the simple future tense …").
  • python create_extensions.py regenerates from the dump deterministically (no network calls if the Wiktionary cache is warm).

🤖 Generated with Claude Code


Note

Medium Risk
Changes the lexeme filtering and XML generation pipeline (including new network-backed Wiktionary/Wikidata caching), which can materially alter extension contents and introduce flaky behavior if caching/filters are wrong, but is scoped to the extension generator.

Overview
Expands the wikidata-lexemes extension generator to emit <Form> variants from Wikidata forms[] (skipping negation features and apostrophe-leading contractions) so inflected lookups resolve to the base lemma.

Adjusts lexeme selection to fill English content-POS gaps when omw-en lacks a (lemma, POS) (with a modal-verb keep override), dedupes by (lang, lemma, pos) to prevent dangling sense relations, and adds quality filters to drop likely encyclopedic/proper-noun-derived or unlinked English lexemes.

Adds cached Wikidata entity resolution and a Wiktionary REST/Action-API fallback to generate definitions/examples for 0-sense lexemes (with reference/onomatopoeia/dialectal/archaic filters), plus LANG_FILTER support and updated docs; regenerated output/*.xml reflects the new <Form> emissions.

Reviewed by Cursor Bugbot for commit 8d754b8. Bugbot is set up for automated code reviews on this repo. Configure here.

Summary by CodeRabbit

  • Documentation

    • Updated README with caching details and LANG_FILTER environment variable documentation.
  • New Features

    • Added Wiktionary REST API fallback for lexemes lacking Wikidata senses.
    • Implemented interlingual index linking between sense languages.
    • Added LANG_FILTER support for single-language generation during iteration.
  • Improvements

    • Enhanced lexeme coverage with additional grammatical forms across 60+ language outputs.
    • Improved content-gap handling for English lemmas via WordNet integration.
  • Chores

    • Refactored extension pipeline into modular helper components.
    • Implemented on-disk JSON caching for Wikidata and Wiktionary requests.

Review Change Stack

Add alternative forms, English content-POS gap coverage, and aggressive
quality filtering to the Wikidata Lexemes extension. Top-1000 EN coverage
goes from 84% (omw-en alone) to 99.2% with the extension.

Major changes:
- Emit `<Form>` elements from each lexeme's `forms[]`, so lookups for
  inflected forms like `mine`/`should`/`could` resolve to their base
  lemma's entry. Filter out forms with negation grammaticalFeature
  `Q1478451` (`shouldn't`, `shan't`) and apostrophe-leading contractions
  (`'ll`, `'d`).
- English content-POS gap escape: a SKIP_POS-classified lemma whose
  (lemma, POS) isn't in omw-en under any morphological lookup is now
  included. Pulls in ~7k missing function/content words.
- Modal-verb override (`P31 = Q560570`): always keep `can`/`will`/`shall`
  so their forms `could`/`would`/`should` surface. Bypasses the archaic-
  Wiktionary filter for these.
- Wiktionary REST fallback for 0-sense lexemes, with quality filters:
  reference-only definitions, onomatopoeia / dialectal / archaic
  category filters, capitalised-lemma filter, empty-claims filter.
- Filter dedup + multi-word-capital + empty-claims moved into
  `filter_lexemes` so `kept_lang_senses` matches what we actually write
  — otherwise sense-relation targets dangle when the merge step loads
  the XML.

Module refactor:
- New `_wikidata.py`: cached `fetch_wikidata_entity`/`get_label`/
  `get_language_iso` + `safe_filename` + `cached_json_fetch`.
- New `_omw_en.py`: cached `omw_en_pos()` returning `{lemma: frozenset
  of POSes}`.
- New `_wiktionary.py`: REST + Action API fetchers with per-thread
  `requests.Session`, batched/paginated category fetches, HTML entity
  decoding, archaic-or-sound filter scoped to English.
- `_pos_map.py`: derived `CONTENT_POS_MAP` (label → WN POS code) for
  the gap-escape check.
- `create_extensions.py`: `xml.sax.saxutils.escape` replaces hand-rolled
  XML escaping; `NormalizedSense` namedtuple replaces dict-shape
  smuggling for Wiktionary fallback senses; in-script caches under
  `extras/{wikidata,wiktionary,wiktionary-cats}/`.

EN extension: ~7,900 entries, 130 language files.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented May 22, 2026

Warning

Rate limit exceeded

@AmitMY has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 48 minutes and 21 seconds before requesting another review.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5d2b5a6e-10fe-4af9-baa2-5920da8ce838

📥 Commits

Reviewing files that changed from the base of the PR and between fd417f2 and 8d754b8.

📒 Files selected for processing (3)
  • CLAUDE.md
  • extensions/wikidata-lexemes/_wiktionary.py
  • extensions/wikidata-lexemes/create_extensions.py
📝 Walkthrough

Adds helper modules for cached Wikidata/Wiktionary access and POS/content filtering, refactors the extension generator to use them, documents caching and LANG_FILTER, and regenerates many per-language XML files with new Forms and some removals.

Changes

Lexeme Generation Pipeline and Data Refresh

| Layer / File(s) | Summary |
| --- | --- |
| Core helpers and pipeline wiring: extensions/wikidata-lexemes/_wikidata.py, .../_wiktionary.py, .../_omw_en.py, .../_pos_map.py, .../create_extensions.py, README.md | Adds cached fetchers, POS/content maps, OMW-En POS baseline, and refactors generator to normalize senses, fall back to Wiktionary, prefetch/cache, and write per-language XML. |

Sequence Diagram(s)

sequenceDiagram
  participant CLI as Generator CLI (create_extensions.py)
  participant WD as Wikidata API
  participant WKT as Wiktionary REST/API
  participant Cache as Disk Cache
  participant FS as XML Output

  CLI->>Cache: try read entity/defs/categories
  alt miss
    CLI->>WD: GET EntityData (Q-code)
    WD-->>CLI: entity JSON
    CLI->>WKT: GET definitions/categories (en)
    WKT-->>CLI: cleaned def + examples
    CLI->>Cache: persist JSON
  end
  CLI-->>CLI: normalize senses, forms, ILI
  CLI->>FS: write per-language XML entries/synsets

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

  • sign/wn#1 — Earlier POS mapping changes align with this PR’s shared POS map and XML partOfSpeech emission.

Poem

In burrows of bytes I hop and cache,
A lexeme’s trail I smartly match;
When senses sleep, I ask Wikt’ry, “please?”
Then stitch the forms like fallen leaves.
ILI threads through fields of code—
A rabbit’s map for every node. 🐇✨

        return True
    if not any(any(f in c for f in _OLD_FRAGMENTS) for c in relevant):
        return False
    return lemma.lower() not in omw_en_pos()

Category substring false positives

Medium Severity

_is_archaic_en treats any English Wiktionary category whose title contains fragments like archaic or dialectal as a whole-word tag. That also matches categories such as English terms with archaic senses, which the same module documents as unsafe for hard filtering. Modern lemmas with only a secondary archaic sense can be dropped from the zero-sense Wiktionary fallback and weaken the content-POS gap fill.
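The false positive can be reproduced with a small sketch. The fragment tuple and the exact-title set below are hypothetical stand-ins (the real `_OLD_FRAGMENTS` values are not shown in this review), and matching on whole category titles is only one possible remedy.

```python
# Hypothetical values; the module's real _OLD_FRAGMENTS is not shown here.
_OLD_FRAGMENTS = ("archaic", "dialectal")

# Hypothetical allow-list of categories that tag the whole lemma as archaic.
_HARD_FILTER_TITLES = frozenset({"English archaic terms", "English dialectal terms"})


def matches_by_substring(categories: list[str]) -> bool:
    """Substring check, as in the reviewed code: any fragment inside any title."""
    return any(any(f in c for f in _OLD_FRAGMENTS) for c in categories)


def matches_by_title(categories: list[str]) -> bool:
    """Stricter check: only whole category titles count."""
    return any(c in _HARD_FILTER_TITLES for c in categories)


# A modern lemma that merely has one archaic sense.
cats = ["English lemmas", "English terms with archaic senses"]
print(matches_by_substring(cats))  # True  (would be dropped)
print(matches_by_title(cats))      # False (kept)
```
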

Reviewed by Cursor Bugbot for commit fd417f2. Configure here.

- Refactor `filter_lexemes`, `_fetch_one_batch`, `wiktionary_definition`
  below ruff's complexity ceiling by extracting helper functions
  (`_skip_pos_passes`, `_try_keep`, `_http_get_json`,
  `_parse_batch_response`, `_pick_definition`).
- Wrap long XML-emit and dict-comprehension lines.
- Mark `_LEADING_APOS` literal with `# noqa: RUF001` (the ambiguous
  Unicode chars are intentional — Wikidata uses them for clitic forms).
- Underscore-prefix the unused `lemma`/`pos_code` from
  `build_xml_entry`'s tuple now that dedup moved to `filter_lexemes`.

Also add `CLAUDE.md` requiring `hatch fmt --linter --check`,
`hatch run mypy:check`, `hatch build`, and `hatch test` to all pass
before opening or pushing to a PR.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 14

🧹 Nitpick comments (2)
extensions/wikidata-lexemes/_omw_en.py (1)

8-12: ⚡ Quick win

Narrow the exception handling to specific exception types.

Catching Exception broadly can hide programming errors and makes debugging harder. Consider catching specific exceptions that are expected when omw-en is unavailable.

♻️ Proposed refinement
 def omw_en_pos() -> dict[str, frozenset[str]]:
     """Return {lemma_lower: frozenset of WN POSes}. Empty if omw-en unavailable."""
     try:
         import wn
         en = wn.Wordnet(lexicon="omw-en")
-    except Exception:
+    except (ImportError, LookupError, OSError):
         return {}

This catches:

  • ImportError when wn is not installed
  • LookupError when the lexicon is not available
  • OSError for file system issues

As per static analysis, catching blind Exception should be avoided.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@extensions/wikidata-lexemes/_omw_en.py` around lines 8 - 12, The current
broad except in the try block that imports wn and constructs en =
wn.Wordnet(lexicon="omw-en") should be narrowed: catch ImportError for missing
the wn package, LookupError for the lexicon not found, and OSError for file/IO
issues instead of catching Exception; update the except clause to handle these
specific exceptions and return {} for those cases so unexpected errors still
propagate.
extensions/wikidata-lexemes/_wiktionary.py (1)

83-116: ⚖️ Poor tradeoff

Consider adding file-write locking for cache writes.

The @cache decorator memoizes in-memory, but if multiple threads call this function with the same (lemma, lang_iso) tuple before the first call completes, both may attempt to write to the same cache file (lines 112-115). While JSON writes are typically atomic for small files, explicit locking would guarantee correctness.

🔒 Optional: Add file locking

Consider using fcntl.flock() (Unix) or msvcrt.locking() (Windows) around the cache write, or a threading.Lock keyed by path.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@extensions/wikidata-lexemes/_wiktionary.py` around lines 83 - 116, The cache
write in fetch_wiktionary (which uses _def_cache_path to determine path) can
race if multiple threads/processes call the same (lemma, lang_iso) concurrently;
add an explicit lock around the block that creates the parent directory and
writes the JSON (the section that uses path.parent.mkdir and open(path,
"w")/json.dump) to prevent concurrent writes — either a process-safe file lock
(fcntl/msvcrt) or a per-path threading.Lock keyed by str(path) is acceptable;
ensure the lock is acquired before writing and released after the file is
flushed/closed so the function still returns data or None as before.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@extensions/wikidata-lexemes/_wikidata.py`:
- Around line 22-31: The cache loader currently crashes on corrupted JSON and
leaves partial files from interrupted writes; modify the logic around
json.load/open and the post-fetch write to (1) catch json.JSONDecodeError
alongside FileNotFoundError and treat either as a cache miss so you call
fetch(), and (2) write the fetched data atomically by writing to a temporary
file in the same directory (use tempfile.NamedTemporaryFile or
path.with_suffix(".tmp")), flush and os.fsync the file, close it, then
os.replace(temp_path, path) to atomically replace the cache; ensure
path.parent.mkdir(...) runs before creating the temp file and clean up the temp
file on errors. Reference the existing identifiers: path, fetch(), json.load,
json.dump.
- Around line 45-47: The code accesses entities = data["entities"] and does
return entities.get(q_code) or next(iter(entities.values())), but if entities is
empty next(iter(...)) raises StopIteration; fix by checking for an empty
entities dict before using the fallback: after setting entities, if not entities
raise a clear exception or return None (or a sentinel) with a helpful message,
otherwise return entities.get(q_code) or next(iter(entities.values())); update
callers if you change the return contract.

In `@extensions/wikidata-lexemes/create_extensions.py`:
- Line 506: The tuple returned by build_xml_entry is being unpacked into entry,
synsets, lemma, pos_code but lemma and pos_code are never used; update the
unpacking at that call site to only capture the used values (e.g., assign entry,
synsets = result) or replace the unused names with throwaway underscores (entry,
synsets, _, _) so the code no longer declares unused variables; locate the
unpacking statement where result is assigned and change it accordingly.
- Line 306: The _LEADING_APOS string literal contains ambiguous apostrophe-like
glyphs; replace the literal characters in the _LEADING_APOS constant with
explicit Unicode escapes (ASCII apostrophe -> \u0027, curly right -> \u2019,
modifier-letter -> \u02BC, curly left -> \u2018) so the value becomes the
equivalent escaped sequence and avoids RUF001 lint warnings while preserving
behavior.

In `@extensions/wikidata-lexemes/output/es.xml`:
- Around line 1152-1153: The Spanish output contains an incorrect English form
because the exporter is emitting representations directly from
lexeme["forms"][].representations[lang_iso].value without validating language or
source; update the export logic that writes <Form writtenForm="..."/> to (1)
ensure it only uses representations where lang_iso == "es" (or the target export
language) and the representation's language metadata matches, (2) ignore or
log/skip values that clearly mismatch the lemma's language (e.g., ASCII/English
tokens like "location") or come from a flagged local dump, and (3) ideally
fallback to live Wikidata EntityData for LexicalEntry id "L1550831" when the
local representation is suspicious; locate the writer that produces Form entries
and change the emission to validate the representation's language tag and
sanitize/skip bad values before writing.

In `@extensions/wikidata-lexemes/output/et.xml`:
- Around line 357-363: The review flags that plural/polite paradigm forms (e.g.,
Form writtenForm="teie", "teid", "meie", "nemad") are being attached to singular
lemma entries like sina, mina, tema; update the form-to-lemma attachment logic
(the code path that maps paradigms to lemma entries—look for functions named
mapFormsToLemma, attachParadigmToLemma or similar) to verify person/number (and
politeness) compatibility before attaching: if a Form has number="plural" or a
politeness attribute, only attach it to a lemma whose lemmaEntry (e.g., 'sina',
'mina', 'tema') has matching person/number/politeness metadata, otherwise create
or attach to the appropriate plural/polite lemma entry (e.g., a separate lemma
for 'teie'/'meie' or mark the paradigm as plural). Ensure the XML generation
preserves correct number/person attributes for Forms so downstream consumers get
the correct singular vs. plural mappings.

In `@extensions/wikidata-lexemes/output/ig.xml`:
- Around line 327-329: Remove the non-lexical placeholder Form element that uses
writtenForm="Verb" from the lexeme entry with Sense id "L747365-S1": locate the
Lemma with writtenForm="gbā ākā" and the sibling Form element that contains the
POS label, and delete that Form node (or replace it with a valid Igbo surface
form if one exists) so only actual lexical surface forms remain in the entry.
- Around line 551-553: Entry L1019088 contains an English gloss incorrectly
emitted as a morphological Form ("Form writtenForm=\"Empty handed\""); remove
that Form element from the Lemma/L1019088 record and instead place the gloss
into the Sense record (Sense id="L1019088-S1") using the appropriate
gloss/definition field for English (e.g., a <Definition> or gloss attribute on
the Sense) so the gloss is not treated as an Igbo Form.
- Around line 635-637: The Form element with writtenForm="Ig" under the Lemma
"nwanne m nwokè" / Sense id "L708447-S1" is an abbreviation artifact and should
be removed; locate the Form node that has writtenForm="Ig" (associated with
Sense id "L708447-S1" and Lemma "nwanne m nwokè") and delete that Form entry so
only valid lexical variants remain.

In `@extensions/wikidata-lexemes/output/nl.xml`:
- Line 169: The Form element currently uses the lexical entry ID ("L1222827") as
the writtenForm value (<Form writtenForm="L1222827"/>); locate the generator
that emits Form elements for lexeme L1222827 (the routine that builds
Form/writtenForm entries) and replace the ID with the actual Dutch inflected
word string (e.g., "elven" or the correct form from the lexeme's
forms/representations), ensuring the writtenForm is drawn from the lexeme's
forms[] or representations[] field rather than the lexeme ID.

In `@extensions/wikidata-lexemes/output/pa.xml`:
- Line 14: Remove any <Form> elements whose writtenForm attribute is empty or
contains only invisible whitespace (e.g., U+200E/U+200F); specifically locate
the <Form> tags with writtenForm values that are whitespace-only and delete
those elements (or replace the attribute value with the actual lexical form if a
real form was intended). Ensure you target the <Form> element and its
writtenForm attribute when making the change so no surrounding XML structure is
broken.
- Line 968: The Form element writtenForm contains invisible directional marks
(zero-width characters) — locate the <Form writtenForm="‎‎ਚ"/> occurrence and
remove any U+200E/U+200F or other zero-width characters from the attribute value
so it becomes <Form writtenForm="ਚ"/>; also scan other Form writtenForm
attributes in the same file for similar invisible characters and strip them to
avoid rendering and matching issues.

In `@extensions/wikidata-lexemes/output/ug.xml`:
- Around line 36-40: The Form elements' writtenForm attributes in the UG lexeme
output contain trailing invisible bidi/control characters causing lookup
failures; update the serializer/generator that produces these Form writtenForm
values (or sanitize the ug.xml content) to strip Unicode control and bidi marks
(e.g., U+200E, U+200F, U+202A–U+202E, U+FEFF, ZWJ/ZWSP as appropriate) from
string values before writing them: locate where Form writtenForm attributes are
set/serialized and apply a trim/filter that removes these invisible characters
so the writtenForm attributes contain only the visible text (e.g., for the Form
element/writtenForm assignment logic).

In `@extensions/wikidata-lexemes/output/yi.xml`:
- Around line 230-231: Normalize and deduplicate Form writtenForm values by
stripping Unicode bidi control characters (e.g., U+200E, U+200F, U+202A–U+202E)
before emitting the <Form writtenForm="..."> attributes; when creating or
collecting forms (the code path that produces <Form writtenForm="..."> entries),
perform a normalization step that removes these invisible directionality marks
and then collapse duplicates so only one <Form> with the normalized writtenForm
value is emitted.
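Several of the per-language findings above (pa.xml, ug.xml, yi.xml) come down to the same normalization step. A minimal sketch, assuming forms arrive as a list of strings; `clean_forms` is a hypothetical name, and the set of stripped code points is an assumption limited to the bidi marks named in the comments plus U+FEFF. ZWJ and ZWNJ are deliberately left alone because they can be meaningful in some scripts.

```python
# Code points mapped to None are deleted by str.translate().
_INVISIBLE = dict.fromkeys(
    [0x200E, 0x200F, 0xFEFF, *range(0x202A, 0x202F)]  # LRM, RLM, BOM, U+202A..U+202E
)


def clean_forms(forms: list[str]) -> list[str]:
    """Strip invisible marks, drop empty values, and dedupe preserving order."""
    seen: set[str] = set()
    cleaned = []
    for raw in forms:
        value = raw.translate(_INVISIBLE).strip()
        if value and value not in seen:
            seen.add(value)
            cleaned.append(value)
    return cleaned


print(clean_forms(["\u200e\u200eਚ", "ਚ", "\u200e", ""]))  # ['ਚ']
```
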

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f0f68ac9-03e3-4da8-aaf3-e7fcd171bde3

📥 Commits

Reviewing files that changed from the base of the PR and between 41d3227 and fd417f2.

📒 Files selected for processing (82)
  • extensions/wikidata-lexemes/README.md
  • extensions/wikidata-lexemes/_omw_en.py
  • extensions/wikidata-lexemes/_pos_map.py
  • extensions/wikidata-lexemes/_wikidata.py
  • extensions/wikidata-lexemes/_wiktionary.py
  • extensions/wikidata-lexemes/create_extensions.py
  • extensions/wikidata-lexemes/output/ae.xml
  • extensions/wikidata-lexemes/output/af.xml
  • extensions/wikidata-lexemes/output/ar.xml
  • extensions/wikidata-lexemes/output/az.xml
  • extensions/wikidata-lexemes/output/bn.xml
  • extensions/wikidata-lexemes/output/br.xml
  • extensions/wikidata-lexemes/output/ca.xml
  • extensions/wikidata-lexemes/output/cs.xml
  • extensions/wikidata-lexemes/output/cy.xml
  • extensions/wikidata-lexemes/output/da.xml
  • extensions/wikidata-lexemes/output/de.xml
  • extensions/wikidata-lexemes/output/dz.xml
  • extensions/wikidata-lexemes/output/el.xml
  • extensions/wikidata-lexemes/output/en.xml
  • extensions/wikidata-lexemes/output/eo.xml
  • extensions/wikidata-lexemes/output/es.xml
  • extensions/wikidata-lexemes/output/et.xml
  • extensions/wikidata-lexemes/output/eu.xml
  • extensions/wikidata-lexemes/output/fa.xml
  • extensions/wikidata-lexemes/output/ff.xml
  • extensions/wikidata-lexemes/output/fi.xml
  • extensions/wikidata-lexemes/output/fo.xml
  • extensions/wikidata-lexemes/output/fr.xml
  • extensions/wikidata-lexemes/output/ga.xml
  • extensions/wikidata-lexemes/output/gu.xml
  • extensions/wikidata-lexemes/output/ha.xml
  • extensions/wikidata-lexemes/output/he.xml
  • extensions/wikidata-lexemes/output/hr.xml
  • extensions/wikidata-lexemes/output/id.xml
  • extensions/wikidata-lexemes/output/ig.xml
  • extensions/wikidata-lexemes/output/is.xml
  • extensions/wikidata-lexemes/output/it.xml
  • extensions/wikidata-lexemes/output/ja.xml
  • extensions/wikidata-lexemes/output/ka.xml
  • extensions/wikidata-lexemes/output/kl.xml
  • extensions/wikidata-lexemes/output/ko.xml
  • extensions/wikidata-lexemes/output/ks.xml
  • extensions/wikidata-lexemes/output/la.xml
  • extensions/wikidata-lexemes/output/lb.xml
  • extensions/wikidata-lexemes/output/lv.xml
  • extensions/wikidata-lexemes/output/mi.xml
  • extensions/wikidata-lexemes/output/ml.xml
  • extensions/wikidata-lexemes/output/mn.xml
  • extensions/wikidata-lexemes/output/ms.xml
  • extensions/wikidata-lexemes/output/mt.xml
  • extensions/wikidata-lexemes/output/nb.xml
  • extensions/wikidata-lexemes/output/nl.xml
  • extensions/wikidata-lexemes/output/nn.xml
  • extensions/wikidata-lexemes/output/oc.xml
  • extensions/wikidata-lexemes/output/oj.xml
  • extensions/wikidata-lexemes/output/pa.xml
  • extensions/wikidata-lexemes/output/pl.xml
  • extensions/wikidata-lexemes/output/ps.xml
  • extensions/wikidata-lexemes/output/pt.xml
  • extensions/wikidata-lexemes/output/rn.xml
  • extensions/wikidata-lexemes/output/ro.xml
  • extensions/wikidata-lexemes/output/ru.xml
  • extensions/wikidata-lexemes/output/sa.xml
  • extensions/wikidata-lexemes/output/sd.xml
  • extensions/wikidata-lexemes/output/sk.xml
  • extensions/wikidata-lexemes/output/sq.xml
  • extensions/wikidata-lexemes/output/sr.xml
  • extensions/wikidata-lexemes/output/sv.xml
  • extensions/wikidata-lexemes/output/sw.xml
  • extensions/wikidata-lexemes/output/ta.xml
  • extensions/wikidata-lexemes/output/tl.xml
  • extensions/wikidata-lexemes/output/tr.xml
  • extensions/wikidata-lexemes/output/tw.xml
  • extensions/wikidata-lexemes/output/ug.xml
  • extensions/wikidata-lexemes/output/uk.xml
  • extensions/wikidata-lexemes/output/uz.xml
  • extensions/wikidata-lexemes/output/vi.xml
  • extensions/wikidata-lexemes/output/wo.xml
  • extensions/wikidata-lexemes/output/yi.xml
  • extensions/wikidata-lexemes/output/za.xml
  • extensions/wikidata-lexemes/output/zu.xml
💤 Files with no reviewable changes (2)
  • extensions/wikidata-lexemes/output/za.xml
  • extensions/wikidata-lexemes/output/rn.xml

Comment on lines +22 to +31
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        pass
    data = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2)
    return data

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Make cache writes atomic and recover from corrupted cache JSON.

An interrupted write can leave a partial file; next run then crashes on JSON parsing and won’t self-heal.

Proposed fix
 def cached_json_fetch(path: Path, fetch: Callable[[], dict]) -> dict:
     try:
-        with open(path, encoding="utf-8") as f:
+        with open(path, encoding="utf-8") as f:
             return json.load(f)
-    except FileNotFoundError:
+    except (FileNotFoundError, json.JSONDecodeError):
         pass
     data = fetch()
     path.parent.mkdir(parents=True, exist_ok=True)
-    with open(path, "w", encoding="utf-8") as f:
+    tmp_path = path.with_suffix(path.suffix + ".tmp")
+    with open(tmp_path, "w", encoding="utf-8") as f:
         json.dump(data, f, indent=2)
+    tmp_path.replace(path)
     return data
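A somewhat fuller variant of the same idea is sketched below, adding an `fsync` before the rename and cleanup of the temporary file on failure. It is a sketch under assumptions rather than the code that was merged: the function signature is taken from the diff above, while the use of `tempfile.mkstemp` is a choice made here for illustration.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable


def cached_json_fetch(path: Path, fetch: Callable[[], dict]) -> dict:
    """Read JSON from the cache, or fetch and store it atomically."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        pass  # missing or corrupted cache: treat as a miss and refetch

    data = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)
    # Temp file in the same directory so os.replace() stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_name, path)  # atomic swap into place
    except BaseException:
        # Do not leave a stray temp file behind if anything went wrong.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
    return data
```

Readers either see the old file, no file, or the complete new file; a partially written cache entry never appears under the final name.
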

Comment on lines +45 to +47
entities = data["entities"]
return entities.get(q_code) or next(iter(entities.values()))

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Guard against empty entities payloads before fallback selection.

If entities is empty, next(iter(...)) raises StopIteration and obscures the root cause.

Proposed fix
     data = cached_json_fetch(EXTRAS_DIR / f"{q_code}.json", _fetch)
     entities = data["entities"]
-    return entities.get(q_code) or next(iter(entities.values()))
+    if not entities:
+        raise ValueError(f"No entities returned for {q_code}")
+    return entities.get(q_code) or next(iter(entities.values()))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@extensions/wikidata-lexemes/_wikidata.py` around lines 45 - 47, The code
accesses entities = data["entities"] and does return entities.get(q_code) or
next(iter(entities.values())), but if entities is empty next(iter(...)) raises
StopIteration; fix by checking for an empty entities dict before using the
fallback: after setting entities, if not entities raise a clear exception or
return None (or a sentinel) with a helpful message, otherwise return
entities.get(q_code) or next(iter(entities.values())); update callers if you
change the return contract.
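
A small sketch of the guarded selection described above, pulled out into a standalone function so it can be exercised without network access. The name `select_entity` is hypothetical; the fallback to the first returned entity mirrors the existing code and is presumably there for IDs that come back keyed under a different entity:

```python
from typing import Any, Mapping


def select_entity(data: Mapping[str, Any], q_code: str) -> Mapping[str, Any]:
    """Pick the entity for q_code, falling back to the first returned entity."""
    entities = data.get("entities") or {}
    if not entities:
        # Fail with the requested ID in the message instead of StopIteration.
        raise ValueError(f"No entities returned for {q_code}")
    return entities.get(q_code) or next(iter(entities.values()))
```

Raising `ValueError` keeps the return contract unchanged for callers; returning `None` instead would require every caller to handle it.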

return relations


_LEADING_APOS = "'’ʼ‘" # ASCII, curly right, modifier-letter, curly left


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use explicit Unicode escapes for _LEADING_APOS literals.

Line 306 contains ambiguous apostrophe-like characters (RUF001). Switching to escapes keeps behavior and avoids ambiguous-source lint warnings.

Proposed fix
-_LEADING_APOS = "'’ʼ‘"  # ASCII, curly right, modifier-letter, curly left
+_LEADING_APOS = "'\u2019\u02BC\u2018"  # ASCII, curly right, modifier-letter, curly left
🧰 Tools
🪛 Ruff (0.15.13)

[warning] 306-306: String contains ambiguous `’` (RIGHT SINGLE QUOTATION MARK). Did you mean `` ` `` (GRAVE ACCENT)?

(RUF001)


[warning] 306-306: String contains ambiguous `ʼ` (MODIFIER LETTER APOSTROPHE). Did you mean `` ` `` (GRAVE ACCENT)?

(RUF001)


[warning] 306-306: String contains ambiguous `‘` (LEFT SINGLE QUOTATION MARK). Did you mean `` ` `` (GRAVE ACCENT)?

(RUF001)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@extensions/wikidata-lexemes/create_extensions.py` at line 306, The
_LEADING_APOS string literal contains ambiguous apostrophe-like glyphs; replace
the literal characters in the _LEADING_APOS constant with explicit Unicode
escapes (ASCII apostrophe -> \u0027, curly right -> \u2019, modifier-letter ->
\u02BC, curly left -> \u2018) so the value becomes the equivalent escaped
sequence and avoids RUF001 lint warnings while preserving behavior.

Comment thread extensions/wikidata-lexemes/create_extensions.py Outdated
Comment on lines 1152 to +1153
<Lemma writtenForm="alcazar" partOfSpeech="n"/>
<Form writtenForm="location"/>


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if other lexemes have similar cross-language form issues
rg -n '<Form writtenForm="[a-z]+"' extensions/wikidata-lexemes/output/es.xml | \
  grep -i 'location\|place\|castle' | head -20

Repository: sign/wn

Length of output: 93


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 1) Inspect the full lexeme block around the reported lines
sed -n '1135,1175p' extensions/wikidata-lexemes/output/es.xml

# 2) Find where "location" appears in the repo/codegen
rg -n --hidden --no-ignore-vcs '"location"' extensions/wikidata-lexemes || true
rg -n --hidden --no-ignore-vcs 'writtenForm="location"' . || true

# 3) Check for other English-like single-token forms in the Spanish output
#    (limited to a small curated list to avoid huge output)
rg -n '<Form writtenForm="(place|castle|location|home|city|country|house|river|mountain|street|road)"/>' extensions/wikidata-lexemes/output/es.xml || true

Repository: sign/wn

Length of output: 2318


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Trace how writtenForm values are generated from Wikidata lexemes
rg -n 'writtenForm=' extensions/wikidata-lexemes -S
rg -n 'L1550831' extensions/wikidata-lexemes -S || true
rg -n 'location' extensions/wikidata-lexemes -S

Repository: sign/wn

Length of output: 50365


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 1) Inspect generator logic around where <Form writtenForm="..."/> is emitted
sed -n '380,480p' extensions/wikidata-lexemes/create_extensions.py

# 2) Inspect POS map around the "location": NOUN entry
sed -n '50,110p' extensions/wikidata-lexemes/_pos_map.py

# 3) Find references to the specific lexeme id in repo (narrow search)
rg -n 'L1550831|1550831' extensions/wikidata-lexemes || true
rg -n 'alcazar' extensions/wikidata-lexemes || true

# 4) Locate where input lexeme data is loaded (to understand what "alt_forms" is)
rg -n 'alt_forms|forms|lexeme' extensions/wikidata-lexemes/create_extensions.py

Repository: sign/wn

Length of output: 8637


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the alternative-form extraction logic
sed -n '300,380p' extensions/wikidata-lexemes/create_extensions.py

# Fetch Wikidata Lexeme JSON for L1550831 and look for Spanish-form writtenRep values.
# Use the official Special:EntityData endpoint.
curl -sS 'https://www.wikidata.org/wiki/Special:EntityData/L1550831.json' \
  | python3 - <<'PY'
import json,sys
data=json.load(sys.stdin)
# Collect every string exactly equal to "location" anywhere in the JSON
def collect_strings(x, out):
    if isinstance(x, str):
        if x == 'location':
            out.append(x)
    elif isinstance(x, dict):
        for vv in x.values():
            collect_strings(vv,out)
    elif isinstance(x, list):
        for vv in x:
            collect_strings(vv,out)
out=[]
collect_strings(data,out)
print("count_exact_string_location:", len(out))
PY

Repository: sign/wn

Length of output: 3109


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 1) Fetch Wikidata lexeme data and find any form where a representation value is exactly "location"
curl -sS 'https://www.wikidata.org/wiki/Special:EntityData/L1550831.json' | python3 - <<'PY'
import json, sys

data=json.load(sys.stdin)
ent = data.get("entities", {}).get("L1550831", {})
forms = ent.get("forms", [])

def collect():
    hits=[]
    for i,f in enumerate(forms):
        reps = f.get("representations", {}) or {}
        # reps is usually keyed by language code with { "language":..., "value":... }
        for lang, obj in reps.items():
            val = obj.get("value") if isinstance(obj, dict) else None
            if val == "location":
                hits.append((i, lang, reps))
    return hits

hits = collect()
print("num_form_hits_exact_location:", len(hits))
for i,lang,reps in hits[:20]:
    # show all rep values for that form so we can see if "location" is only present in en/other languages
    rep_vals = {k:(v.get("value") if isinstance(v,dict) else None) for k,v in (reps or {}).items()}
    print("hit_form_index:", i, "lang:", lang, "rep_values_for_form:", rep_vals)
PY

# 2) Inspect how languages are iterated for output files (to ensure lang_iso is correct)
rg -n "write_all_extensions|for lang_iso|lang_iso" extensions/wikidata-lexemes/create_extensions.py

Repository: sign/wn

Length of output: 803


🏁 Script executed:

#!/bin/bash
set -euo pipefail

URL='https://www.wikidata.org/wiki/Special:EntityData/L1550831.json'

echo "== curl headers =="
curl -sS -L -D /tmp/curl_headers.txt -o /tmp/curl_body.txt "$URL" || true
cat /tmp/curl_headers.txt | head -30
echo
echo "== curl body prefix (first 300 chars) =="
python3 - <<'PY'
p="/tmp/curl_body.txt"
with open(p,'rb') as f:
    b=f.read()
print("body_bytes:", len(b))
print(b[:300].decode('utf-8','replace'))
PY

# If body looks like JSON, extract "location" representations
python3 - <<'PY'
import json
p="/tmp/curl_body.txt"
with open(p,'r',encoding='utf-8',errors='ignore') as f:
    txt=f.read()
if not txt.strip().startswith('{'):
    print("body_not_json; skipping parse")
    raise SystemExit(0)
data=json.loads(txt)
ent=data.get("entities",{}).get("L1550831",{})
hits=[]
for i,f in enumerate(ent.get("forms",[]) or []):
    reps=f.get("representations",{}) or {}
    for lang,obj in reps.items():
        val=obj.get("value") if isinstance(obj,dict) else None
        if val=="location":
            hits.append((i,lang))
print("hits_exact_location_forms:", hits[:50])
print("num_hits:", len(hits))
PY

# Inspect how lang_iso is set/iterated for this generator
sed -n '450,590p' extensions/wikidata-lexemes/create_extensions.py

Repository: sign/wn

Length of output: 5981


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 1) Inspect get_language_iso + language filtering to rule out mapping bugs
rg -n "def get_language_iso|LANG_FILTER|ENGLISH_LANG_Q" extensions/wikidata-lexemes/create_extensions.py
sed -n '1,120p' extensions/wikidata-lexemes/create_extensions.py
sed -n '120,260p' extensions/wikidata-lexemes/create_extensions.py

# 2) Fetch Wikidata entity data for L1550831 and check for any occurrence of "location"
URL='https://www.wikidata.org/wiki/Special:EntityData/L1550831.json'
curl -sS -L "$URL" > /tmp/L1550831.json
python3 - <<'PY'
import json
txt=open("/tmp/L1550831.json","r",encoding="utf-8",errors="ignore").read()
print("raw_contains_location_substring:", "location" in txt.lower())
data=json.loads(txt)
ent=data.get("entities",{}).get("L1550831",{})
print("entity_keys:", sorted(ent.keys()))
forms=ent.get("forms", None)
print("forms_type:", type(forms).__name__, "forms_len:", (len(forms) if isinstance(forms,list) else None))
# Collect any string values exactly equal to "location" (case-sensitive) anywhere in forms->representations
hits=[]
def walk(x):
    if isinstance(x,str):
        if x=="location":
            hits.append(x)
    elif isinstance(x,dict):
        for v in x.values():
            walk(v)
    elif isinstance(x,list):
        for v in x:
            walk(v)
walk(ent.get("forms",{}))
print("hits_exact_string_location_within_forms:", len(hits))
PY

# 3) See whether this issue is isolated to this lexeme in Spanish output
rg -n 'writtenForm="location"' extensions/wikidata-lexemes/output/es.xml

Repository: sign/wn

Length of output: 9642


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 1) Confirm mapping of Wikidata language Q -> lang_iso
sed -n '1,200p' extensions/wikidata-lexemes/_wikidata.py

# 2) Inspect the actual pipeline input dump for lexeme L1550831
ls -lh extensions/wikidata-lexemes/latest-lexemes.json.bz2

python3 - <<'PY'
import bz2, json, ijson
from pathlib import Path

dump_path = Path("extensions/wikidata-lexemes/latest-lexemes.json.bz2")

target_id = "L1550831"
found = False

with bz2.open(dump_path, "rb") as f:
    # file is expected to be a stream of "item" objects (generator uses ijson.items(f, "item"))
    for lex in ijson.items(f, "item"):
        if not isinstance(lex, dict):
            continue
        if lex.get("id") != target_id:
            continue
        found = True

        lang_iso = "es"
        main_lemma = lex.get("lemmas", {}).get(lang_iso, {}).get("value")
        print("found_lexeme_id:", lex.get("id"))
        print("lemma_es:", main_lemma)

        forms = lex.get("forms", []) or []
        hits=[]
        for idx, form in enumerate(forms):
            reps = form.get("representations", {}) or {}
            val = reps.get(lang_iso, {}).get("value") if isinstance(reps.get(lang_iso), dict) else None
            if val == "location":
                hits.append(idx)

        print("num_forms_total:", len(forms))
        print("num_alt_form_hits_location_in_lang_es:", len(hits))
        if hits:
            print("hit_form_indices:", hits[:50])
            for idx in hits[:10]:
                form = forms[idx]
                reps = form.get("representations", {}) or {}
                # print all representations for that form to see where "location" appears
                rep_map = {}
                if isinstance(reps, dict):
                    for k,v in reps.items():
                        if isinstance(v, dict) and "value" in v:
                            rep_map[k] = v["value"]
                print(f"form_index_{idx}_rep_map:", rep_map)

        # also print any forms for other languages where value == location
        other_hits=[]
        for idx, form in enumerate(forms):
            reps = form.get("representations", {}) or {}
            for k,v in reps.items() if isinstance(reps, dict) else []:
                if isinstance(v, dict) and v.get("value") == "location":
                    other_hits.append((idx,k))
        print("num_forms_hits_location_any_lang:", len(other_hits))
        if other_hits:
            print("first_other_hits:", other_hits[:20])
        break

if not found:
    print("lexeme_not_found_in_dump:", target_id)
PY

Repository: sign/wn

Length of output: 2148


🏁 Script executed:

#!/bin/bash
set -euo pipefail

URL='https://www.wikidata.org/wiki/Special:EntityData/L1550831.json'
curl -sS -L "$URL" > /tmp/L1550831.json

python3 - <<'PY'
import json
p="/tmp/L1550831.json"
data=json.load(open(p,'r',encoding='utf-8',errors='ignore'))
ents=data.get("entities",{})
keys=list(ents.keys())
print("entities_keys:", keys[:20], "count:", len(keys))

# If the response is not for L1550831, show what it is and grab es lemma if present
for k in keys[:5]:
    ent=ents.get(k, {})
    lem=ent.get("lemmas",{})
    if "es" in lem:
        print("entity:", k, "lemmas.es.value:", lem["es"].get("value"))
        # search for forms representations where lang key is exactly "es" and value is exactly "location"
        hits=[]
        for idx, form in enumerate(ent.get("forms",[]) or []):
            rep= (form.get("representations",{}) or {}).get("es",{})
            val= rep.get("value") if isinstance(rep,dict) else None
            if val=="location":
                hits.append(idx)
        print("entity:", k, "num_forms_with_rep_es_location:", len(hits), "hits:", hits[:10])
PY

Repository: sign/wn

Length of output: 181


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 1) Find the actual lexeme dump file(s) present in the repo
fd -a -t f -g 'latest-lexemes*.json.bz2' extensions/wikidata-lexemes || true
fd -a -t f -g '*.json.bz2' extensions/wikidata-lexemes || true
fd -a -t f -g '*lexemes*.json.bz2' extensions/wikidata-lexemes || true

# 2) Inspect cached Wikidata entity fetches for L1550831 or L2461
ls -lah extensions/wikidata-lexemes/extras/wikidata 2>/dev/null || true

for q in L1550831 L2461; do
  f="extensions/wikidata-lexemes/extras/wikidata/${q}.json"
  echo "== $f ==";
  if [ -f "$f" ]; then
    rg -n '"value"\s*:\s*"location"|\"location\"' "$f" || true
    python3 - <<PY
import json
p="$f"
data=json.load(open(p,'r',encoding='utf-8',errors='ignore'))
ents=data.get("entities",{})
ent=ents.get("${q}") or next(iter(ents.values()),{})
# find Spanish form reps equal to location
hits=[]
for idx,form in enumerate(ent.get("forms",[]) or []):
    rep=(form.get("representations",{}) or {}).get("es",{}) or {}
    val=rep.get("value") if isinstance(rep,dict) else None
    if val=="location":
        hits.append(idx)
print("es_form_reps_equal_location_count:", len(hits), "hits:", hits[:10])
PY
  else
    echo "missing"
  fi
done

Repository: sign/wn

Length of output: 798


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n 'writtenForm="location"' extensions/wikidata-lexemes/output/es.xml

sed -n '1148,1162p' extensions/wikidata-lexemes/output/es.xml

Repository: sign/wn

Length of output: 706


Fix: English “location” emitted as a Spanish form for lexeme L1550831

  • extensions/wikidata-lexemes/output/es.xml line 1153 contains <Form writtenForm="location"/> under <LexicalEntry id="L1550831"> whose lemma is alcazar (Spanish), but location is English/non-Spanish; <Form> values are emitted verbatim from lexeme["forms"][].representations[lang_iso].value.
  • The live Wikidata EntityData for L1550831 appears to return the corresponding entity L2461 with lemma alcázar, and it does not show a Spanish (es) form representation equal to "location", so this is likely coming from the locally used lexeme dump/cache (or a bad extraction input), not current Wikidata.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@extensions/wikidata-lexemes/output/es.xml` around lines 1152 - 1153, The
Spanish output contains an incorrect English form because the exporter is
emitting representations directly from
lexeme["forms"][].representations[lang_iso].value without validating language or
source; update the export logic that writes <Form writtenForm="..."/> to (1)
ensure it only uses representations where lang_iso == "es" (or the target export
language) and the representation's language metadata matches, (2) ignore or
log/skip values that clearly mismatch the lemma's language (e.g., ASCII/English
tokens like "location") or come from a flagged local dump, and (3) ideally
fallback to live Wikidata EntityData for LexicalEntry id "L1550831" when the
local representation is suspicious; locate the writer that produces Form entries
and change the emission to validate the representation's language tag and
sanitize/skip bad values before writing.
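
A sketch of the language-tag validation suggested in point (1), assuming the usual Wikidata lexeme JSON shape in which each form's `representations` is keyed by language code and each entry carries its own `language` and `value`. The function name `forms_for_language` is hypothetical:

```python
from typing import Any


def forms_for_language(lexeme: dict[str, Any], lang_iso: str) -> list[str]:
    """Collect form values whose representation is keyed and tagged as lang_iso."""
    values = []
    for form in lexeme.get("forms") or []:
        rep = (form.get("representations") or {}).get(lang_iso)
        if not isinstance(rep, dict):
            continue
        # The nested "language" field should agree with the dictionary key.
        if rep.get("language", lang_iso) != lang_iso:
            continue
        value = rep.get("value")
        if isinstance(value, str) and value:
            values.append(value)
    return values
```

This only rules out key/tag mismatches. A value such as "location" that is stored under the `es` key in a stale dump would still pass, so it does not replace checking the input data itself.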

</LexicalEntry>
<LexicalEntry id="L1222827">
<Lemma writtenForm="elf" partOfSpeech="m"/>
<Form writtenForm="L1222827"/>


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Invalid form value: lexical entry ID used instead of word form.

Line 169 contains <Form writtenForm="L1222827"/> where the form value is the lexical entry ID itself rather than an actual inflected word form. This appears to be a data generation error. The form should contain a valid Dutch word (e.g., "elven" or another appropriate inflected form of "elf"), not the entry identifier.

🔍 Verification script
#!/bin/bash
# Check for any Form elements using IDs as writtenForm values across all output files
rg -n 'writtenForm="L\d+' extensions/wikidata-lexemes/output/
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@extensions/wikidata-lexemes/output/nl.xml` at line 169, The Form element
currently uses the lexical entry ID ("L1222827") as the writtenForm value (<Form
writtenForm="L1222827"/>); locate the generator that emits Form elements for
lexeme L1222827 (the routine that builds Form/writtenForm entries) and replace
the ID with the actual Dutch inflected word string (e.g., "elven" or the correct
form from the lexeme's forms/representations), ensuring the writtenForm is drawn
from the lexeme's forms[] or representations[] field rather than the lexeme ID.
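
Until the root cause is found, a guard along these lines could keep identifier-shaped strings out of `writtenForm`. This is a sketch only; the helper name and the exact pattern (lexeme IDs, optionally with a form or sense suffix such as `-F2`) are assumptions, not code from the PR:

```python
import re

# Wikidata lexeme IDs look like L123; form and sense IDs add -F1 or -S1.
_LEXEME_ID = re.compile(r"L\d+(?:-[FS]\d+)?")


def looks_like_lexeme_id(written_form: str) -> bool:
    """True when the whole value is shaped like a lexeme, form, or sense ID."""
    return _LEXEME_ID.fullmatch(written_form.strip()) is not None
```

Using `fullmatch` means ordinary words that merely start with a capital L are not rejected.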

<LexicalEntry id="L679444">
<Lemma writtenForm="ਮੈਂ" partOfSpeech="h"/>
<Form writtenForm="ਮੈਨੂੰ"/>
<Form writtenForm="‎"/>


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Remove empty or whitespace-only Form elements.

Several <Form> elements have empty or whitespace-only writtenForm attributes (containing only zero-width characters like U+200E or U+200F). These appear at lines 14, 79, 340, 858, and 1052. Empty forms serve no linguistic purpose and degrade data quality.

🗑️ Recommended fix: Remove these empty Form elements

Search for and remove all instances matching this pattern:

-      <Form writtenForm=""/>

Or if the intent was to represent a specific morphological form, replace with the actual written form.

Also applies to: 79-79, 340-340, 858-858, 1052-1052

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@extensions/wikidata-lexemes/output/pa.xml` at line 14, Remove any <Form>
elements whose writtenForm attribute is empty or contains only invisible
whitespace (e.g., U+200E/U+200F); specifically locate the <Form> tags with
writtenForm values that are whitespace-only and delete those elements (or
replace the attribute value with the actual lexical form if a real form was
intended). Ensure you target the <Form> element and its writtenForm attribute
when making the change so no surrounding XML structure is broken.
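
Rather than hand-editing the generated XML, the generator could skip such values before writing. A minimal sketch of the check, with a hypothetical helper name; it treats a value as blank when every character is whitespace or belongs to the Unicode format or control categories (`Cf`, `Cc`), which covers U+200E and U+200F:

```python
import unicodedata


def is_blank_form(written_form: str) -> bool:
    """True when the value contains no visible characters at all."""
    return all(
        ch.isspace() or unicodedata.category(ch) in {"Cf", "Cc"}
        for ch in written_form
    )
```

Values that mix invisible marks with real letters are not blank under this check and need stripping instead of removal.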

<Form writtenForm="ਵਿੱਚ"/>
<Form writtenForm="ਵਿਚੋਂ"/>
<Form writtenForm="ਵਿੱਚੋਂ"/>
<Form writtenForm="‎‎ਚ"/>


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clean up zero-width characters in Form element.

Line 968 contains a Form with embedded zero-width characters: <Form writtenForm="‎‎ਚ"/>. These invisible characters (likely U+200E or U+200F marks) may cause text rendering issues or prevent proper string matching.

🧹 Recommended fix: Remove zero-width characters
-      <Form writtenForm="‎‎ਚ"/>
+      <Form writtenForm="ਚ"/>
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@extensions/wikidata-lexemes/output/pa.xml` at line 968, The Form element
writtenForm contains invisible directional marks (zero-width characters) —
locate the <Form writtenForm="‎‎ਚ"/> occurrence and remove any U+200E/U+200F or
other zero-width characters from the attribute value so it becomes <Form
writtenForm="ਚ"/>; also scan other Form writtenForm attributes in the same file
for similar invisible characters and strip them to avoid rendering and matching
issues.

Comment on lines +36 to +40
<Form writtenForm="مېنىڭ‎"/>
<Form writtenForm="ماڭا‎"/>
<Form writtenForm="مېنى‎"/>
<Form writtenForm="مەندە‎"/>
<Form writtenForm="مەندىن‎"/>


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Remove hidden bidi/control marks from writtenForm values.

These form strings include trailing invisible marks, which can cause lookup misses and duplicate-like entries.

Proposed fix
-      <Form writtenForm="مېنىڭ‎"/>
-      <Form writtenForm="ماڭا‎"/>
-      <Form writtenForm="مېنى‎"/>
-      <Form writtenForm="مەندە‎"/>
-      <Form writtenForm="مەندىن‎"/>
+      <Form writtenForm="مېنىڭ"/>
+      <Form writtenForm="ماڭا"/>
+      <Form writtenForm="مېنى"/>
+      <Form writtenForm="مەندە"/>
+      <Form writtenForm="مەندىن"/>
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@extensions/wikidata-lexemes/output/ug.xml` around lines 36 - 40, The Form
elements' writtenForm attributes in the UG lexeme output contain trailing
invisible bidi/control characters causing lookup failures; update the
serializer/generator that produces these Form writtenForm values (or sanitize
the ug.xml content) to strip Unicode control and bidi marks (e.g., U+200E,
U+200F, U+202A–U+202E, U+FEFF, ZWJ/ZWSP as appropriate) from string values
before writing them: locate where Form writtenForm attributes are set/serialized
and apply a trim/filter that removes these invisible characters so the
writtenForm attributes contain only the visible text (e.g., for the Form
element/writtenForm assignment logic).
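
One way to do the stripping at generation time is `str.translate` with an explicit deletion table. The sketch below is an assumption about how this could look, not the merged code. It removes the bidi marks, embeddings, overrides, isolates, and the BOM, and deliberately leaves ZWJ and ZWNJ in place because some scripts use them to control letter shaping:

```python
# Code points to delete: LRM, RLM, BOM, U+202A..U+202E, U+2066..U+2069.
_BIDI_CONTROLS = dict.fromkeys(
    [0x200E, 0x200F, 0xFEFF, *range(0x202A, 0x202F), *range(0x2066, 0x206A)]
)


def strip_bidi_controls(text: str) -> str:
    """Delete invisible directionality controls; all other characters are kept."""
    return text.translate(_BIDI_CONTROLS)
```

Applying this to the lemma as well as the forms would keep lookups consistent on both sides.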

Comment on lines +230 to +231
<Form writtenForm="אײַך‎"/>
<Form writtenForm="אײַך"/>


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Normalize and deduplicate bidi-control variants in forms.

Line 230 and Line 231 are effectively the same form (אײַך) with/without an invisible directionality mark. This can create duplicate lookup keys and inconsistent matching downstream. Normalize/strip bidi control chars before emitting <Form> values.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@extensions/wikidata-lexemes/output/yi.xml` around lines 230 - 231, Normalize
and deduplicate Form writtenForm values by stripping Unicode bidi control
characters (e.g., U+200E, U+200F, U+202A–U+202E) before emitting the <Form
writtenForm="..."> attributes; when creating or collecting forms (the code path
that produces <Form writtenForm="..."> entries), perform a normalization step
that removes these invisible directionality marks and then collapse duplicates
so only one <Form> with the normalized writtenForm value is emitted.
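
The normalize-then-collapse step might be sketched as follows. `dedupe_forms` is a hypothetical name; it strips the directionality marks listed above, drops values that become empty, and keeps the first occurrence of each remaining form in its original order:

```python
from typing import Iterable

# Deletion table for LRM, RLM, and the embedding/override controls.
_MARKS = str.maketrans("", "", "\u200e\u200f\u202a\u202b\u202c\u202d\u202e")


def dedupe_forms(forms: Iterable[str]) -> list[str]:
    """Strip directionality marks, drop empties, keep first-seen order."""
    seen: dict[str, None] = {}
    for form in forms:
        cleaned = form.translate(_MARKS)
        if cleaned:
            seen.setdefault(cleaned, None)
    return list(seen)
```

A dict is used instead of a set so the emitted `<Form>` order stays deterministic across runs.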

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit 8d754b8.

data = {}
else:
# Transient (429, 5xx, ...) — don't cache; retry next run.
return None


Transient Wiktionary errors cached

Medium Severity

fetch_wiktionary is wrapped in functools.cache, but it returns None on network errors and on transient HTTP statuses (429/5xx) without writing to disk. That None is memoized for the rest of the run, so later calls for the same argument never consult the on-disk cache or retry the API, even though the comment says transient failures should be retried.


@AmitMY AmitMY merged commit 7b6b2a8 into main May 22, 2026
13 checks passed
@AmitMY AmitMY deleted the feat/wikidata-extension-coverage branch May 22, 2026 20:30