OpenNER

This is the data repository for OpenNER, a project of the Broadening Linguistic Technologies Lab at Brandeis University.

OpenNER is described in the paper "OpenNER 1.0: Standardized Open-Access Named Entity Recognition Datasets in 50+ Languages".

OpenNER contains data for the following languages: Akan/Twi, Algerian Arabic, Amharic, Arabic, Bambara, Basque, Bavarian German, Catalan, Chichewa, chiShona, Croatian, Danish, Dutch, English, Éwé, Finnish, Fon, Galician, German, Ghomálá', Greek, Hausa, Hebrew, Hindi, Igbo, isiXhosa, Italian, Japanese, Kazakh, Kinyarwanda, Kiswahili, Luganda, Luo, Mandarin Chinese, Marathi, Mossi, Naija, Nepali, Norwegian, Persian Farsi, Portuguese, Romanian, Setswana, Slovak, Slovenian, Spanish, Swedish, Thai, Wolof, Yoruba, and Zulu.

Data releases

OpenNER is distributed in two forms:

Standardized: all datasets with their originally annotated entity types, but with type names standardized across datasets (e.g. in every dataset that annotates persons, the type is PER).
Core types: all datasets, restricted to location (LOC), organization (ORG), and person (PER) annotations. The originally annotated types have been either mapped to these three types or removed.
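
To illustrate how the core types form relates to the standardized form, here is a minimal sketch of the reduction, not the project's actual conversion code. It assumes labels are BIO-style strings such as B-PER. The mapping table EXAMPLE_MAPPING and the function names are hypothetical; the real per-dataset mappings are not reproduced here.

```python
CORE_TYPES = {"LOC", "ORG", "PER"}

# Hypothetical mapping for illustration only; the actual mappings used
# for each dataset are an assumption and are not listed in this README.
EXAMPLE_MAPPING = {"GPE": "LOC"}


def to_core_label(label, mapping):
    """Map one BIO label to a core type, or to "O" if its type is removed."""
    if label == "O":
        return "O"
    # Split "B-PER" into the BIO prefix ("B") and the entity type ("PER").
    prefix, _, entity_type = label.partition("-")
    # Rename the type if the mapping covers it; otherwise keep it unchanged.
    entity_type = mapping.get(entity_type, entity_type)
    if entity_type in CORE_TYPES:
        return f"{prefix}-{entity_type}"
    # Types outside LOC/ORG/PER are dropped, so the token becomes "O".
    return "O"


def to_core_labels(labels, mapping):
    """Apply to_core_label to every label of a sentence."""
    return [to_core_label(label, mapping) for label in labels]
```

With this sketch, a sentence labeled B-PER, I-PER, O, B-GPE, B-DATE keeps its person annotation, has GPE renamed to LOC, and loses the DATE annotation.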

The first release of OpenNER is available via:

GitHub releases: BIO-formatted files
Hugging Face releases: standardized and core types
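
As a starting point for working with the BIO files, the sketch below reads sentences from a file-like sequence of lines. It assumes a CoNLL-style layout that this README does not confirm: one token per line, the token in the first whitespace-separated column, the BIO label in the last column, and a blank line between sentences. Check the released files before relying on it. The function name read_bio is hypothetical.

```python
def read_bio(lines):
    """Group (token, label) pairs into sentences.

    Assumed layout (not confirmed by the README): token first, label last,
    columns separated by whitespace, blank line between sentences.
    """
    sentences = []
    current = []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if there is one.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split()
        current.append((fields[0], fields[-1]))
    # Keep the final sentence when the input does not end with a blank line.
    if current:
        sentences.append(current)
    return sentences
```

An open file can be passed directly, for example `with open(path, encoding="utf-8") as f: sentences = read_bio(f)`, where path points to one of the downloaded files.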

License

The license for the OpenNER data collection (the specific compilation of the data) is CC-BY 4.0. However, individual datasets have their own licenses, which users must abide by. Some datasets restrict commercial usage.

Citation

The citation for OpenNER is as follows:

@inproceedings{palen-michel-etal-2025-openner,
    title = "{O}pen{NER} 1.0: Standardized Open-Access Named Entity Recognition Datasets in 50+ Languages",
    author = {Palen-Michel, Chester  and
      Pickering, Maxwell  and
      Kruse, Maya  and
      S{\"a}lev{\"a}, Jonne  and
      Lignos, Constantine},
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-main.1708/",
    doi = "10.18653/v1/2025.emnlp-main.1708",
    pages = "33649--33674",
    ISBN = "979-8-89176-332-6",
    abstract = "We present OpenNER 1.0, a standardized collection of openly-available named entity recognition (NER) datasets. OpenNER contains 36 NER corpora that span 52 languages, human-annotated in varying named entity ontologies. We correct annotation format issues, standardize the original datasets into a uniform representation with consistent entity type names across corpora, and provide the collection in a structure that enables research in multilingual and multi-ontology NER. We provide baseline results using three pretrained multilingual language models and two large language models to compare the performance of recent models and facilitate future research in NER. We find that no single model is best in all languages and that significant work remains to obtain high performance from LLMs on the NER task. OpenNER is released at https://github.com/bltlab/open-ner."
}
