feat: merge Wikidata extensions into base lexicons + add function-word POS#1

Merged
AmitMY merged 1 commit into main from feat/merge-wikidata-extensions
May 20, 2026

Conversation

@AmitMY AmitMY commented May 20, 2026

Summary

  • Adds extensions/wikidata-lexemes/merge_extension.py, which calls wn.add and then rewrites lexicon_rowid across the content tables so that an extension's entries, senses and synsets live under the base lexicon, yielding one lexicon per language. Closes goodmami/wn#304 (Extending an existing lexicon) for our use case.
  • Adds six grouped POS codes to wn.constants.PARTS_OF_SPEECH: h Pronoun, d Determiner, m Numeral, i Interjection, q Interrogative, y Particle.
  • Updates create_extensions.py to emit short codes (h, d, …) via a shared _pos_map.py instead of verbose Wikidata labels (personal pronoun, definite article, …); also writes partOfSpeech on <Synset> elements.
  • Re-runs the existing 130 output XMLs to use the new short codes.
  • Dockerfile now invokes the merge utility instead of bare wn.add; main() creates temporary lexicon_rowid indexes for the duration of the bulk merge.
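
The core of the merge step described in the summary can be sketched on a toy SQLite database. This is only an illustration under stated assumptions: the table names, the `CONTENT_TABLES` tuple, the `merge_lexicon_rows` helper and the `wikidata-en` lexicon id are hypothetical, and the real wn schema has more tables and columns than shown here.

```python
import sqlite3

# Hypothetical subset of lexicon-owned tables; the real wn database has more.
CONTENT_TABLES = ("entries", "synsets")


def merge_lexicon_rows(conn: sqlite3.Connection, base_rowid: int, ext_rowid: int) -> None:
    """Re-point every content row from the extension lexicon to the base lexicon."""
    with conn:  # single transaction: commit on success, roll back on error
        for table in CONTENT_TABLES:  # names come from a fixed tuple, never user input
            conn.execute(
                f"UPDATE {table} SET lexicon_rowid = ? WHERE lexicon_rowid = ?",
                (base_rowid, ext_rowid),
            )
        # Drop the now-empty extension lexicon row.
        conn.execute("DELETE FROM lexicons WHERE rowid = ?", (ext_rowid,))


# Toy data: one base lexicon (rowid 1) and one extension lexicon (rowid 2).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE lexicons (id TEXT);
    CREATE TABLE entries (id TEXT, lexicon_rowid INTEGER);
    CREATE TABLE synsets (id TEXT, lexicon_rowid INTEGER);
    INSERT INTO lexicons (rowid, id) VALUES (1, 'omw-en'), (2, 'wikidata-en');
    INSERT INTO entries VALUES ('omw-en-dog-n', 1), ('wikidata-en-L482', 2);
    INSERT INTO synsets VALUES ('wikidata-en-L482-S1', 2);
    """
)
merge_lexicon_rows(conn, base_rowid=1, ext_rowid=2)

remaining_lexicons = [row[0] for row in conn.execute("SELECT id FROM lexicons")]
entry_owners = {row[0] for row in conn.execute("SELECT lexicon_rowid FROM entries")}
```

Running the updates and the delete inside one transaction means a failure part-way through leaves the database as it was before the merge started.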

Test plan

  • pytest tests/ (142 passed)
  • ruff check
  • docker build . succeeds and produces a single English lexicon: curl /lexicons?lang=en returns ['omw-en:1.4']
  • curl /lexicons/omw-en:1.4/words?form=you returns L482 with pos=h and three Wikidata synsets
  • Idempotency: a stream peek at the first LexicalEntry id detects whether the extension is already merged, so re-runs are safe
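
One way the "stream peek" in the idempotency item could look is sketched below. It is an assumption about the approach, not the code from merge_extension.py: the `first_entry_id` helper is hypothetical, and the sample XML is a stripped-down stand-in for a real extension file.

```python
import io
import xml.etree.ElementTree as ET
from typing import Optional


def first_entry_id(xml_source) -> Optional[str]:
    """Return the id of the first <LexicalEntry>, reading only as far as needed."""
    # "start" events fire as soon as the opening tag (with its attributes) is parsed,
    # so the loop can stop long before the end of a large file.
    for _event, elem in ET.iterparse(xml_source, events=("start",)):
        if elem.tag == "LexicalEntry":
            return elem.get("id")
    return None


# Minimal stand-in document; ids are made up for the example.
sample = io.BytesIO(
    b"<LexicalResource><Lexicon id='wikidata-en'>"
    b"<LexicalEntry id='wikidata-en-L482'/>"
    b"<LexicalEntry id='wikidata-en-L483'/>"
    b"</Lexicon></LexicalResource>"
)
peeked = first_entry_id(sample)
```

A caller would then look the returned id up under the base lexicon and skip the merge if it is already present.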

🤖 Generated with Claude Code


Note

Medium Risk
Medium risk because it introduces a new script that rewrites lexicon_rowid across multiple SQLite tables during builds, which could corrupt or mis-associate data if the schema or specifiers don’t match; it also changes POS encoding consumed by downstream tooling.

Overview
Wikidata lexeme extensions are now merged into their base OMW lexicons at load time so each language ends up with a single lexicon instead of a separate extension lexicon.

Adds extensions/wikidata-lexemes/merge_extension.py, which loads extension XMLs via wn.add(), bulk-updates lexicon_rowid across all lexicon-owned tables, removes the extension lexicon/dependency rows, and includes idempotency + temporary indexes for faster merges.
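
The temporary-index idea mentioned above might be sketched as a context manager that creates an index before the bulk update and drops it afterwards. The `temporary_index` helper, the index naming scheme and the toy `entries` table are assumptions made for this example; the PR's actual index handling is not shown in this thread.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def temporary_index(conn: sqlite3.Connection, table: str, column: str):
    """Create an index for the duration of the block, then drop it."""
    name = f"tmp_{table}_{column}"  # identifiers are trusted constants here
    conn.execute(f"CREATE INDEX IF NOT EXISTS {name} ON {table} ({column})")
    try:
        yield name
    finally:
        conn.execute(f"DROP INDEX IF EXISTS {name}")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (id TEXT, lexicon_rowid INTEGER)")
# Half of the rows belong to lexicon 1, half to lexicon 2.
conn.executemany(
    "INSERT INTO entries VALUES (?, ?)",
    [(f"e{i}", 1 + i % 2) for i in range(1000)],
)

with temporary_index(conn, "entries", "lexicon_rowid"):
    conn.execute("UPDATE entries SET lexicon_rowid = 1 WHERE lexicon_rowid = 2")
conn.commit()

indexes_left = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index'"
).fetchall()
rows_still_on_extension = conn.execute(
    "SELECT COUNT(*) FROM entries WHERE lexicon_rowid = 2"
).fetchone()[0]
```

Dropping the index in a `finally` clause keeps the schema unchanged even if the update raises.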

Parts-of-speech support is expanded for function words by adding new short POS codes (h/d/m/i/q/y) to wn.constants and documentation, updating the Wikidata extension generator to emit these codes (and to set partOfSpeech on <Synset>), and regenerating the extension XML outputs accordingly. The Docker image build now runs the merge utility instead of adding extensions directly.

Reviewed by Cursor Bugbot for commit d856879. Bugbot is set up for automated code reviews on this repo. Configure here.

Summary by CodeRabbit

  • New Features

    • Added support for six additional parts of speech: pronouns, determiners, numerals, interjections, interrogatives, and particles.
    • Implemented standardized part-of-speech code system across language datasets.
  • Documentation

    • Expanded API documentation with new POS constant entries and aliases.
    • Added guide for loading and merging language extensions into the base lexicon.
  • Chores

    • Updated language datasets across multiple languages with improved metadata formatting.

…function-word POS

Previously the Dockerfile loaded each `extensions/wikidata-lexemes/output/*.xml`
as a separate lexicon, so queries against e.g. `omw-en:1.4` couldn't see
the new function-word entries. This change introduces a merge utility that
calls `wn.add` and then rewrites `lexicon_rowid` across all content tables
so the extension's data lives under the base lexicon, leaving one lexicon
per language (see goodmami#304).

Also adds six new POS codes (h pronoun, d determiner, m numeral, i interjection,
q interrogative, y particle) to `wn.constants.PARTS_OF_SPEECH` and updates
the Wikidata extension generator to emit grouped short codes instead of
verbose labels like "personal pronoun" or "definite article".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented May 20, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ea7eb444-d20b-46cb-9e89-11c75cac3556

📥 Commits

Reviewing files that changed from the base of the PR and between e8090b8 and d856879.

📒 Files selected for processing (137)
  • Dockerfile
  • docs/api/wn.constants.rst
  • extensions/wikidata-lexemes/README.md
  • extensions/wikidata-lexemes/_pos_map.py
  • extensions/wikidata-lexemes/create_extensions.py
  • extensions/wikidata-lexemes/merge_extension.py
  • extensions/wikidata-lexemes/output/ab.xml
  • extensions/wikidata-lexemes/output/ae.xml
  • extensions/wikidata-lexemes/output/af.xml
  • extensions/wikidata-lexemes/output/am.xml
  • extensions/wikidata-lexemes/output/an.xml
  • extensions/wikidata-lexemes/output/ar.xml
  • extensions/wikidata-lexemes/output/av.xml
  • extensions/wikidata-lexemes/output/az.xml
  • extensions/wikidata-lexemes/output/ba.xml
  • extensions/wikidata-lexemes/output/be.xml
  • extensions/wikidata-lexemes/output/bg.xml
  • extensions/wikidata-lexemes/output/bi.xml
  • extensions/wikidata-lexemes/output/bn.xml
  • extensions/wikidata-lexemes/output/bo.xml
  • extensions/wikidata-lexemes/output/br.xml
  • extensions/wikidata-lexemes/output/ca.xml
  • extensions/wikidata-lexemes/output/ce.xml
  • extensions/wikidata-lexemes/output/cs.xml
  • extensions/wikidata-lexemes/output/cv.xml
  • extensions/wikidata-lexemes/output/cy.xml
  • extensions/wikidata-lexemes/output/da.xml
  • extensions/wikidata-lexemes/output/de.xml
  • extensions/wikidata-lexemes/output/dz.xml
  • extensions/wikidata-lexemes/output/el.xml
  • extensions/wikidata-lexemes/output/en.xml
  • extensions/wikidata-lexemes/output/eo.xml
  • extensions/wikidata-lexemes/output/es.xml
  • extensions/wikidata-lexemes/output/et.xml
  • extensions/wikidata-lexemes/output/eu.xml
  • extensions/wikidata-lexemes/output/fa.xml
  • extensions/wikidata-lexemes/output/ff.xml
  • extensions/wikidata-lexemes/output/fi.xml
  • extensions/wikidata-lexemes/output/fj.xml
  • extensions/wikidata-lexemes/output/fo.xml
  • extensions/wikidata-lexemes/output/fr.xml
  • extensions/wikidata-lexemes/output/fy.xml
  • extensions/wikidata-lexemes/output/ga.xml
  • extensions/wikidata-lexemes/output/gd.xml
  • extensions/wikidata-lexemes/output/gu.xml
  • extensions/wikidata-lexemes/output/gv.xml
  • extensions/wikidata-lexemes/output/ha.xml
  • extensions/wikidata-lexemes/output/he.xml
  • extensions/wikidata-lexemes/output/hr.xml
  • extensions/wikidata-lexemes/output/ht.xml
  • extensions/wikidata-lexemes/output/hu.xml
  • extensions/wikidata-lexemes/output/hy.xml
  • extensions/wikidata-lexemes/output/ia.xml
  • extensions/wikidata-lexemes/output/id.xml
  • extensions/wikidata-lexemes/output/ie.xml
  • extensions/wikidata-lexemes/output/ig.xml
  • extensions/wikidata-lexemes/output/ii.xml
  • extensions/wikidata-lexemes/output/ik.xml
  • extensions/wikidata-lexemes/output/io.xml
  • extensions/wikidata-lexemes/output/is.xml
  • extensions/wikidata-lexemes/output/it.xml
  • extensions/wikidata-lexemes/output/iu.xml
  • extensions/wikidata-lexemes/output/ja.xml
  • extensions/wikidata-lexemes/output/jv.xml
  • extensions/wikidata-lexemes/output/ka.xml
  • extensions/wikidata-lexemes/output/kl.xml
  • extensions/wikidata-lexemes/output/km.xml
  • extensions/wikidata-lexemes/output/ko.xml
  • extensions/wikidata-lexemes/output/ks.xml
  • extensions/wikidata-lexemes/output/kv.xml
  • extensions/wikidata-lexemes/output/kw.xml
  • extensions/wikidata-lexemes/output/ky.xml
  • extensions/wikidata-lexemes/output/la.xml
  • extensions/wikidata-lexemes/output/lb.xml
  • extensions/wikidata-lexemes/output/ln.xml
  • extensions/wikidata-lexemes/output/lo.xml
  • extensions/wikidata-lexemes/output/lt.xml
  • extensions/wikidata-lexemes/output/lv.xml
  • extensions/wikidata-lexemes/output/mh.xml
  • extensions/wikidata-lexemes/output/mi.xml
  • extensions/wikidata-lexemes/output/mk.xml
  • extensions/wikidata-lexemes/output/ml.xml
  • extensions/wikidata-lexemes/output/mn.xml
  • extensions/wikidata-lexemes/output/ms.xml
  • extensions/wikidata-lexemes/output/mt.xml
  • extensions/wikidata-lexemes/output/my.xml
  • extensions/wikidata-lexemes/output/nb.xml
  • extensions/wikidata-lexemes/output/nl.xml
  • extensions/wikidata-lexemes/output/nn.xml
  • extensions/wikidata-lexemes/output/nv.xml
  • extensions/wikidata-lexemes/output/oc.xml
  • extensions/wikidata-lexemes/output/oj.xml
  • extensions/wikidata-lexemes/output/pa.xml
  • extensions/wikidata-lexemes/output/pl.xml
  • extensions/wikidata-lexemes/output/ps.xml
  • extensions/wikidata-lexemes/output/pt.xml
  • extensions/wikidata-lexemes/output/qu.xml
  • extensions/wikidata-lexemes/output/rn.xml
  • extensions/wikidata-lexemes/output/ro.xml
  • extensions/wikidata-lexemes/output/ru.xml
  • extensions/wikidata-lexemes/output/sa.xml
  • extensions/wikidata-lexemes/output/sd.xml
  • extensions/wikidata-lexemes/output/se.xml
  • extensions/wikidata-lexemes/output/si.xml
  • extensions/wikidata-lexemes/output/sk.xml
  • extensions/wikidata-lexemes/output/sl.xml
  • extensions/wikidata-lexemes/output/so.xml
  • extensions/wikidata-lexemes/output/sq.xml
  • extensions/wikidata-lexemes/output/sr.xml
  • extensions/wikidata-lexemes/output/ss.xml
  • extensions/wikidata-lexemes/output/st.xml
  • extensions/wikidata-lexemes/output/su.xml
  • extensions/wikidata-lexemes/output/sv.xml
  • extensions/wikidata-lexemes/output/sw.xml
  • extensions/wikidata-lexemes/output/ta.xml
  • extensions/wikidata-lexemes/output/te.xml
  • extensions/wikidata-lexemes/output/th.xml
  • extensions/wikidata-lexemes/output/tk.xml
  • extensions/wikidata-lexemes/output/tl.xml
  • extensions/wikidata-lexemes/output/tn.xml
  • extensions/wikidata-lexemes/output/to.xml
  • extensions/wikidata-lexemes/output/tr.xml
  • extensions/wikidata-lexemes/output/ts.xml
  • extensions/wikidata-lexemes/output/tw.xml
  • extensions/wikidata-lexemes/output/ug.xml
  • extensions/wikidata-lexemes/output/uk.xml
  • extensions/wikidata-lexemes/output/uz.xml
  • extensions/wikidata-lexemes/output/ve.xml
  • extensions/wikidata-lexemes/output/vi.xml
  • extensions/wikidata-lexemes/output/wa.xml
  • extensions/wikidata-lexemes/output/wo.xml
  • extensions/wikidata-lexemes/output/xh.xml
  • extensions/wikidata-lexemes/output/yi.xml
  • extensions/wikidata-lexemes/output/yo.xml
  • extensions/wikidata-lexemes/output/za.xml
  • extensions/wikidata-lexemes/output/zu.xml
  • wn/constants.py

📝 Walkthrough

Walkthrough

This PR implements a complete workflow for merging Wikidata lexicon extensions into base lexicons. It introduces a POS code mapping infrastructure, adds a merge utility that rewrites database rows, integrates with Docker, and regenerates all Wikidata language outputs with standardized part-of-speech codes.

Changes

Lexicon Extension Merging and POS Standardization

| Layer | File(s) | Summary |
|---|---|---|
| POS Code Mapping and XML Generation | extensions/wikidata-lexemes/_pos_map.py, extensions/wikidata-lexemes/create_extensions.py, docs/api/wn.constants.rst | Introduces a POS_MAP constant mapping verbose Wikidata labels to single-letter WN-LMF codes (e.g., pronoun→h, numeral→m, determiner→d), documents the new codes, and updates XML generation to emit standardized partOfSpeech attributes on both Lemma and Synset elements. |
| Extension Merge Utility | extensions/wikidata-lexemes/merge_extension.py | Adds a CLI utility that detects lexicon extensions via the extends element, loads them into the database, rewrites all owned table rows from the extension lexicon_rowid to the base lexicon_rowid, removes extension metadata rows, and uses temporary indexes for performance during bulk updates. |
| System Integration and Documentation | Dockerfile, extensions/wikidata-lexemes/README.md | Integrates merge_extension.py into the Docker build workflow to merge each language extension into the base lexicon, documents the output format and loading process, and explains the extension merge workflow for end users. |
| Regenerated Wikidata Lexeme Outputs | extensions/wikidata-lexemes/output/*.xml | Updates the 130 language-specific XML output files to use standardized partOfSpeech codes on both lemmas and synsets, replacing verbose descriptive labels with single-letter codes aligned with the new POS_MAP normalization. |

Sequence Diagram

sequenceDiagram
  participant Gen as create_extensions.py
  participant Merge as merge_extension.py
  participant DB as WordNet Database
  participant XML as lexicon.xml

  Gen->>XML: Generate extension XML<br/>with standardized POS codes
  Merge->>XML: Read extension XML
  Merge->>Merge: Detect extends element
  Merge->>DB: Load extension as new lexicon<br/>with wn.add()
  Merge->>DB: Find base & extension<br/>lexicon rowids
  Merge->>DB: Rewrite content rows<br/>lexicon_rowid: extension→base
  Merge->>DB: Delete extension<br/>metadata rows
  Merge->>DB: Complete: extension merged<br/>into base lexicon

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


🐰 A lexicon grows stronger,
Extensions merge as one voice,
Standards unite all words. 🌍✨

@AmitMY AmitMY merged commit 4d32f41 into main May 20, 2026
11 of 13 checks passed
@AmitMY AmitMY deleted the feat/merge-wikidata-extensions branch May 20, 2026 06:30

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

cur.execute(
    f"UPDATE {table} SET lexicon_rowid = ? WHERE lexicon_rowid = ?",
    (base_rowid, ext_rowid),
)

Re-merge duplicates base entries

High Severity

After a successful merge the extension lexicon row is deleted, so a later merge_extension run calls wn.add again and re-inserts the same entry ids under a new extension lexicon. Updating lexicon_rowid to the base then collides with rows already merged, violating entries uniqueness on (id, lexicon_rowid).
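
The failure mode described here can be reproduced on a toy table. This is a sketch under assumptions, not the project's schema: only the `(id, lexicon_rowid)` uniqueness comes from the comment above, while the table layout, the rowids and the `already_merged` guard are hypothetical. Whether the real utility reaches this state depends on the idempotency check in the PR.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries (id TEXT, lexicon_rowid INTEGER, UNIQUE (id, lexicon_rowid))"
)
# State after a first successful merge: the entry already sits under base rowid 1.
conn.execute("INSERT INTO entries VALUES ('wikidata-en-L482', 1)")
# A second run re-adds the same entry id under a fresh extension rowid (3 here).
conn.execute("INSERT INTO entries VALUES ('wikidata-en-L482', 3)")

try:
    # Re-pointing the new row at the base lexicon duplicates (id, lexicon_rowid).
    conn.execute(
        "UPDATE entries SET lexicon_rowid = ? WHERE lexicon_rowid = ?", (1, 3)
    )
    collided = False
except sqlite3.IntegrityError:
    collided = True


def already_merged(conn: sqlite3.Connection, base_rowid: int, entry_id: str) -> bool:
    """Hypothetical guard: is this entry id already stored under the base lexicon?"""
    row = conn.execute(
        "SELECT 1 FROM entries WHERE id = ? AND lexicon_rowid = ?",
        (entry_id, base_rowid),
    ).fetchone()
    return row is not None
```

Calling a guard like `already_merged` before `wn.add` would let a re-run exit early instead of hitting the constraint.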

@@ -2759,1687 +2759,1687 @@
 <Definition>(دبلوم، دبلومة)</Definition>
 </Sense>
 </LexicalEntry>
-<Synset id="wikidata-ar-L5004-S1" ili="wikidata-ar-L5004-S1">
+<Synset id="wikidata-ar-L5004-S1" ili="wikidata-ar-L5004-S1" partOfSpeech="root">

Unmapped POS stays verbose

Medium Severity

Regenerated extension XML sets partOfSpeech on synsets (and many lemmas) to full Wikidata labels such as root instead of the OTHER code x that create_extensions.py applies via POS_MAP.get(pos_name, OTHER), so stored POS values are outside PARTS_OF_SPEECH.
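
The fallback this comment refers to, `POS_MAP.get(pos_name, OTHER)`, behaves as in the sketch below. The three mapping entries are the ones named in the CodeRabbit walkthrough; the real `_pos_map.py` is not shown in this thread, so the dictionary here is only an excerpt-style assumption, and `short_pos` is a hypothetical wrapper.

```python
OTHER = "x"  # catch-all code for labels without a dedicated grouping

# Assumed excerpt of the mapping; the full table lives in _pos_map.py.
POS_MAP = {
    "pronoun": "h",
    "determiner": "d",
    "numeral": "m",
}


def short_pos(pos_name: str) -> str:
    """Map a verbose Wikidata label to a short code, defaulting to OTHER."""
    return POS_MAP.get(pos_name, OTHER)


mapped = short_pos("pronoun")
unmapped = short_pos("root")  # not in the map, so it collapses to OTHER
```

With this contract a label such as `root` should never reach the XML verbatim, which is the mismatch the comment points out.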

AmitMY added a commit that referenced this pull request May 20, 2026
The previous regeneration of extension XMLs left ~150 verbose Wikidata
POS labels (`abbreviation`, `abstract noun`, `auxiliary verb`, ...) in
place. That contradicts `create_extensions.py`'s contract of
`POS_MAP.get(pos_name, OTHER)` and caused 8,193 entries/synsets across
66 languages to be stored with POS values outside
`wn.constants.PARTS_OF_SPEECH`.

After this change every `partOfSpeech` attribute is one of
`{h, d, n, v, a, s, r, m, p, c, q, i, y, t, x, u}`. Anything Wikidata
didn't classify into our function-word groupings collapses to `x`,
matching what a fresh `create_extensions.py` run would produce.

Reported by Cursor Bugbot on PR #1.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
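
A check for the invariant stated in this commit message could look like the sketch below. The allowed set is copied from the message; the `invalid_pos_values` helper and the sample XML are hypothetical and only illustrate the idea.

```python
import io
import xml.etree.ElementTree as ET

# The sixteen codes listed in the commit message.
ALLOWED_POS = set("hdnvasrmpcqiytxu")


def invalid_pos_values(xml_source) -> set:
    """Collect every partOfSpeech value that is not one of the allowed codes."""
    bad = set()
    for _event, elem in ET.iterparse(xml_source, events=("start",)):
        pos = elem.get("partOfSpeech")
        if pos is not None and pos not in ALLOWED_POS:
            bad.add(pos)
    return bad


# Stand-in document with one valid and one verbose value.
sample = io.BytesIO(
    b"<Lexicon>"
    b"<LexicalEntry id='e1'><Lemma writtenForm='you' partOfSpeech='h'/></LexicalEntry>"
    b"<Synset id='s1' partOfSpeech='root'/>"
    b"</Lexicon>"
)
found = invalid_pos_values(sample)
```

Run over each file in the output directory, an empty result for every file would confirm the regeneration left no verbose labels behind. For very large files the loop could also clear finished elements to limit memory use.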
Development

Successfully merging this pull request may close these issues.

Extending an existing lexicon

1 participant