feat: merge Wikidata extensions into base lexicons + add function-word POS #1
…function-word POS

Previously the Dockerfile loaded each `extensions/wikidata-lexemes/output/*.xml` as a separate lexicon, so queries against e.g. `omw-en:1.4` couldn't see the new function-word entries. This change introduces a merge utility that calls `wn.add` and then rewrites `lexicon_rowid` across all content tables so the extension's data lives under the base lexicon, leaving one lexicon per language (see goodmami#304).

Also adds six new POS codes (h pronoun, d determiner, m numeral, i interjection, q interrogative, y particle) to `wn.constants.PARTS_OF_SPEECH` and updates the Wikidata extension generator to emit grouped short codes instead of verbose labels like "personal pronoun" or "definite article".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Caution: Review failed. The pull request is closed.
Files selected for processing: 137
Walkthrough
This PR implements a complete workflow for merging Wikidata lexicon extensions into base lexicons. It introduces a POS code mapping infrastructure, adds a merge utility that rewrites database rows, integrates with Docker, and regenerates all Wikidata language outputs with standardized part-of-speech codes.
Changes: Lexicon Extension Merging and POS Standardization
Sequence Diagram
sequenceDiagram
participant Gen as create_extensions.py
participant Merge as merge_extension.py
participant DB as WordNet Database
participant XML as lexicon.xml
Gen->>XML: Generate extension XML<br/>with standardized POS codes
Merge->>XML: Read extension XML
Merge->>Merge: Detect extends element
Merge->>DB: Load extension as new lexicon<br/>with wn.add()
Merge->>DB: Find base & extension<br/>lexicon rowids
Merge->>DB: Rewrite content rows<br/>lexicon_rowid: extension→base
Merge->>DB: Delete extension<br/>metadata rows
Merge->>DB: Complete: extension merged<br/>into base lexicon
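The rewrite and cleanup steps in the diagram can be sketched with the standard-library `sqlite3` module. This is a minimal illustration under stated assumptions, not the PR's `merge_extension.py`: the simplified schema, the `CONTENT_TABLES` list, the `merge_rowids` helper, and the sample ids and versions are all hypothetical, and the `wn.add()` load step is replaced by plain inserts. The PR description says the temporary indexes are created once in `main()` around the bulk merge; the sketch creates them per table only to stay short.

```python
import sqlite3

# Hypothetical, simplified table list; the real wn database has more tables.
CONTENT_TABLES = ["entries", "synsets"]


def merge_rowids(conn: sqlite3.Connection, base_rowid: int, ext_rowid: int) -> None:
    """Move every content row from the extension lexicon to the base lexicon."""
    cur = conn.cursor()
    for table in CONTENT_TABLES:
        # Temporary index so the WHERE clause does not scan the whole table.
        cur.execute(f"CREATE INDEX tmp_{table}_lex ON {table} (lexicon_rowid)")
        cur.execute(
            f"UPDATE {table} SET lexicon_rowid = ? WHERE lexicon_rowid = ?",
            (base_rowid, ext_rowid),
        )
        cur.execute(f"DROP INDEX tmp_{table}_lex")
    # The extension's own metadata row is no longer needed after the rewrite.
    cur.execute("DELETE FROM lexicons WHERE rowid = ?", (ext_rowid,))
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE lexicons (id TEXT, version TEXT);
    CREATE TABLE entries (id TEXT, lexicon_rowid INTEGER, UNIQUE (id, lexicon_rowid));
    CREATE TABLE synsets (id TEXT, lexicon_rowid INTEGER, UNIQUE (id, lexicon_rowid));
    INSERT INTO lexicons VALUES ('omw-en', '1.4');       -- rowid 1 (base)
    INSERT INTO lexicons VALUES ('wikidata-en', '1.0');  -- rowid 2 (made-up extension)
    INSERT INTO entries VALUES ('omw-en-dog-n', 1);
    INSERT INTO entries VALUES ('wikidata-en-L482', 2);
    INSERT INTO synsets VALUES ('wikidata-en-L482-S1', 2);
    """
)
merge_rowids(conn, base_rowid=1, ext_rowid=2)

lexicon_count = conn.execute("SELECT COUNT(*) FROM lexicons").fetchone()[0]
entry_owners = {row[0] for row in conn.execute("SELECT lexicon_rowid FROM entries")}
```

After the call, only the base lexicon row remains and every content row points at it, which is the "one lexicon per language" outcome the walkthrough describes.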
Estimated code review effort: 4 (Complex), ~45 minutes
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit d856879.
    cur.execute(
        f"UPDATE {table} SET lexicon_rowid = ? WHERE lexicon_rowid = ?",
        (base_rowid, ext_rowid),
    )
Re-merge duplicates base entries
High Severity
After a successful merge the extension lexicon row is deleted, so a later `merge_extension` run calls `wn.add` again and re-inserts the same entry ids under a new extension lexicon. Updating `lexicon_rowid` to the base then collides with rows already merged, violating the uniqueness constraint on `(id, lexicon_rowid)` in `entries`.
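The collision described in this comment can be reproduced on a toy table. The sketch below assumes only what the comment states, a uniqueness constraint on `(id, lexicon_rowid)`; the table layout, the rowid constants, the sample id, and the `already_merged` guard are hypothetical. The guard is one possible fix (checking whether an extension entry id already exists under the base lexicon), not necessarily how the PR resolves it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries (id TEXT, lexicon_rowid INTEGER, UNIQUE (id, lexicon_rowid))"
)
BASE, EXT_FIRST, EXT_SECOND = 1, 2, 3  # hypothetical lexicon rowids


def rewrite(ext_rowid: int) -> None:
    """Reassign the extension's entries to the base lexicon."""
    conn.execute(
        "UPDATE entries SET lexicon_rowid = ? WHERE lexicon_rowid = ?",
        (BASE, ext_rowid),
    )


def already_merged(entry_id: str) -> bool:
    """Return True if this entry id already lives under the base lexicon."""
    row = conn.execute(
        "SELECT 1 FROM entries WHERE id = ? AND lexicon_rowid = ?",
        (entry_id, BASE),
    ).fetchone()
    return row is not None


# First merge: the extension entry moves under the base lexicon.
conn.execute("INSERT INTO entries VALUES ('wikidata-en-L482', ?)", (EXT_FIRST,))
rewrite(EXT_FIRST)

# Second run: the same id is re-added under a fresh extension rowid.
conn.execute("INSERT INTO entries VALUES ('wikidata-en-L482', ?)", (EXT_SECOND,))
try:
    rewrite(EXT_SECOND)
    collided = False
except sqlite3.IntegrityError:
    collided = True

# A guard evaluated before re-adding would have skipped the second merge.
guard_fires = already_merged("wikidata-en-L482")
```

Running the guard before the load step, rather than catching the error afterwards, avoids leaving a half-added extension lexicon behind.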
@@ -2759,1687 +2759,1687 @@
 <Definition>(دبلوم، دبلومة)</Definition>
 </Sense>
 </LexicalEntry>
-<Synset id="wikidata-ar-L5004-S1" ili="wikidata-ar-L5004-S1">
+<Synset id="wikidata-ar-L5004-S1" ili="wikidata-ar-L5004-S1" partOfSpeech="root">
Unmapped POS stays verbose
Medium Severity
Regenerated extension XML sets `partOfSpeech` on synsets (and many lemmas) to full Wikidata labels such as `root` instead of the `OTHER` code `x` that `create_extensions.py` applies via `POS_MAP.get(pos_name, OTHER)`, so stored POS values are outside `PARTS_OF_SPEECH`.
The previous regeneration of extension XMLs left ~150 verbose Wikidata
POS labels (`abbreviation`, `abstract noun`, `auxiliary verb`, ...) in
place. That contradicts `create_extensions.py`'s contract of
`POS_MAP.get(pos_name, OTHER)` and caused 8,193 entries/synsets across
66 languages to be stored with POS values outside
`wn.constants.PARTS_OF_SPEECH`.
After this change every `partOfSpeech` attribute is one of
`{h, d, n, v, a, s, r, m, p, c, q, i, y, t, x, u}`. Anything Wikidata
didn't classify into our function-word groupings collapses to `x`,
matching what a fresh `create_extensions.py` run would produce.
Reported by Cursor Bugbot on PR #1.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
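
The `POS_MAP.get(pos_name, OTHER)` contract referred to in this commit message amounts to a dictionary lookup with a fallback. A minimal sketch follows; the label-to-code pairs are illustrative assumptions rather than the PR's actual mapping table, and the helper names `to_pos_code` and `invalid_codes` are hypothetical. The 16-code set is the one listed in the commit message.

```python
OTHER = "x"

# Illustrative subset only; the actual mapping in the PR may differ.
POS_MAP = {
    "noun": "n",
    "verb": "v",
    "personal pronoun": "h",  # pronoun group
    "definite article": "d",  # determiner group (assumed grouping)
    "numeral": "m",
    "interjection": "i",
}

# The codes the commit message says every partOfSpeech attribute now uses.
ALLOWED_CODES = set("hdnvasrmpcqiytxu")


def to_pos_code(pos_name: str) -> str:
    """Map a verbose Wikidata label to a short code, falling back to OTHER."""
    return POS_MAP.get(pos_name, OTHER)


def invalid_codes(codes):
    """Return, sorted, any values that fall outside the allowed set."""
    return sorted(set(codes) - ALLOWED_CODES)


# "root" is not in the map, so it collapses to the OTHER code.
mapped = [to_pos_code(name) for name in ("personal pronoun", "root", "verb")]

# A verbose label that slipped through unmapped is flagged by the check.
leftover = invalid_codes(mapped + ["abbreviation"])
```

A check like `invalid_codes` run over every `partOfSpeech` attribute in the regenerated XML would catch the kind of leftover verbose labels this commit fixes.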


Summary
- `extensions/wikidata-lexemes/merge_extension.py` that calls `wn.add` and then rewrites `lexicon_rowid` across the content tables so an extension's entries, senses and synsets live under the base lexicon, yielding one lexicon per language. Closes "Extending an existing lexicon" (goodmami/wn#304) for our use case.
- `wn.constants.PARTS_OF_SPEECH`: `h` Pronoun, `d` Determiner, `m` Numeral, `i` Interjection, `q` Interrogative, `y` Particle.
- `create_extensions.py` to emit short codes (`h`, `d`, …) via a shared `_pos_map.py` instead of verbose Wikidata labels (`personal pronoun`, `definite article`, …); also writes `partOfSpeech` on `<Synset>` elements.
- `wn.add`, with temporary `lexicon_rowid` indexes created in `main()` over the bulk merge.

Test plan
- `pytest tests/` (142 passed)
- `ruff check`
- `docker build .` succeeds and produces a single English lexicon:
  - `curl /lexicons?lang=en` → `['omw-en:1.4']`
  - `curl /lexicons/omw-en:1.4/words?form=you` returns `L482` with `pos=h` and three Wikidata synsets
- `LexicalEntry id` detects whether the extension is already merged, so re-runs are safe

🤖 Generated with Claude Code
Note
Medium Risk
Medium risk because it introduces a new script that rewrites `lexicon_rowid` across multiple SQLite tables during builds, which could corrupt or mis-associate data if the schema or specifiers don’t match; it also changes POS encoding consumed by downstream tooling.
Overview
Wikidata lexeme extensions are now merged into their base OMW lexicons at load time so each language ends up with a single lexicon instead of a separate extension lexicon.
Adds `extensions/wikidata-lexemes/merge_extension.py`, which loads extension XMLs via `wn.add()`, bulk-updates `lexicon_rowid` across all lexicon-owned tables, removes the extension lexicon/dependency rows, and includes idempotency + temporary indexes for faster merges.
Parts-of-speech support is expanded for function words by adding new short POS codes (`h`/`d`/`m`/`i`/`q`/`y`) to `wn.constants` and documentation, updating the Wikidata extension generator to emit these codes (and to set `partOfSpeech` on `<Synset>`), and regenerating the extension XML outputs accordingly. The Docker image build now runs the merge utility instead of adding extensions directly.