This is the data repository for OpenNER, a project of the Broadening Linguistic Technologies Lab at Brandeis University.
OpenNER is described in the paper "OpenNER 1.0: Standardized Open-Access Named Entity Recognition Datasets in 50+ Languages."
OpenNER contains data for the following languages: Akan/Twi, Algerian Arabic, Amharic, Arabic, Bambara, Basque, Bavarian German, Catalan, Chichewa, chiShona, Croatian, Danish, Dutch, English, Éwé, Finnish, Fon, Galician, German, Ghomálá', Greek, Hausa, Hebrew, Hindi, Igbo, isiXhosa, Italian, Japanese, Kazakh, Kinyarwanda, Kiswahili, Luganda, Luo, Mandarin Chinese, Marathi, Mossi, Naija, Nepali, Norwegian, Persian Farsi, Portuguese, Romanian, Setswana, Slovak, Slovenian, Spanish, Swedish, Thai, Wolof, Yoruba, and Zulu.
OpenNER is distributed in two forms:
- Standardized: all datasets, keeping the originally annotated types but with standardized type names (e.g. in every dataset that annotates persons, the type is PER).
- Core types: all datasets, but containing only location (LOC), organization (ORG), and person (PER) annotations. The originally annotated types have been either mapped to these three types or removed.
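To illustrate how the core types form relates to the standardized form, here is a minimal sketch that reduces a sequence of BIO tags to LOC, ORG, and PER. It is not the conversion code used to build OpenNER: the function name `to_core_tags` and the `EXAMPLE_TYPE_MAP` table (including the GPE to LOC entry) are hypothetical, and the actual per-dataset mapping decisions may differ.

```python
# The three entity types kept in the core types distribution.
CORE_TYPES = {"LOC", "ORG", "PER"}

# Hypothetical mapping for illustration only; the real OpenNER
# mappings are dataset-specific and may differ.
EXAMPLE_TYPE_MAP = {"GPE": "LOC"}


def to_core_tags(tags, type_map=EXAMPLE_TYPE_MAP):
    """Map BIO tags to core types, replacing all other types with O."""
    core = []
    for tag in tags:
        if tag == "O":
            core.append("O")
            continue
        # Split "B-PER" into the BIO prefix ("B") and the type ("PER").
        prefix, _, entity_type = tag.partition("-")
        entity_type = type_map.get(entity_type, entity_type)
        if entity_type in CORE_TYPES:
            core.append(f"{prefix}-{entity_type}")
        else:
            # Types outside the core set are removed.
            core.append("O")
    return core


standardized = ["B-PER", "I-PER", "O", "B-DATE", "I-DATE", "B-GPE"]
print(to_core_tags(standardized))
```

In this sketch, a removed mention becomes a run of O tags, so the token sequence itself is unchanged between the two forms.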
The first release of OpenNER is available via:
- GitHub releases: BIO-formatted files
- Hugging Face releases: standardized and core types
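As a starting point for working with the BIO files, here is a minimal reader sketch. It assumes a CoNLL-style layout (one token per line, the tag in the last whitespace-separated column, and a blank line between sentences). That layout is an assumption, so check it against the released files before relying on it. The function name `read_bio` is hypothetical.

```python
def read_bio(lines):
    """Parse CoNLL-style BIO lines into (tokens, tags) pairs, one per sentence.

    Assumed layout: token first, tag last, blank line between sentences.
    """
    sentences = []
    tokens, tags = [], []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if any.
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        fields = line.split()
        tokens.append(fields[0])
        tags.append(fields[-1])
    # Flush the final sentence when the input lacks a trailing blank line.
    if tokens:
        sentences.append((tokens, tags))
    return sentences


sample = "Ada B-PER\nLovelace I-PER\nvisited O\nParis B-LOC\n\nHello O\n"
for sentence_tokens, sentence_tags in read_bio(sample.splitlines()):
    print(list(zip(sentence_tokens, sentence_tags)))
```

To read from disk, pass an open file handle instead of `sample.splitlines()`, since the function only iterates over lines.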
The license for the OpenNER data collection (the specific compilation of the data) is CC-BY 4.0. However, individual datasets have their own licenses, which users must abide by. Some datasets restrict commercial usage.
The citation for OpenNER is as follows:
@inproceedings{palen-michel-etal-2025-openner,
title = "{O}pen{NER} 1.0: Standardized Open-Access Named Entity Recognition Datasets in 50+ Languages",
author = {Palen-Michel, Chester and
Pickering, Maxwell and
Kruse, Maya and
S{\"a}lev{\"a}, Jonne and
Lignos, Constantine},
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.emnlp-main.1708/",
doi = "10.18653/v1/2025.emnlp-main.1708",
pages = "33649--33674",
ISBN = "979-8-89176-332-6",
abstract = "We present OpenNER 1.0, a standardized collection of openly-available named entity recognition (NER) datasets. OpenNER contains 36 NER corpora that span 52 languages, human-annotated in varying named entity ontologies. We correct annotation format issues, standardize the original datasets into a uniform representation with consistent entity type names across corpora, and provide the collection in a structure that enables research in multilingual and multi-ontology NER. We provide baseline results using three pretrained multilingual language models and two large language models to compare the performance of recent models and facilitate future research in NER. We find that no single model is best in all languages and that significant work remains to obtain high performance from LLMs on the NER task. OpenNER is released at https://github.com/bltlab/open-ner."
}