bltlab/open-ner

OpenNER

This is the data repository for OpenNER, a project of the Broadening Linguistic Technologies Lab at Brandeis University.

OpenNER is described in the paper OpenNER 1.0: Standardized Open-Access Named Entity Recognition Datasets in 50+ Languages.

OpenNER contains data for the following languages: Akan/Twi, Algerian Arabic, Amharic, Arabic, Bambara, Basque, Bavarian German, Catalan, Chichewa, chiShona, Croatian, Danish, Dutch, English, Éwé, Finnish, Fon, Galician, German, Ghomálá', Greek, Hausa, Hebrew, Hindi, Igbo, isiXhosa, Italian, Japanese, Kazakh, Kinyarwanda, Kiswahili, Luganda, Luo, Mandarin Chinese, Marathi, Mossi, Naija, Nepali, Norwegian, Persian Farsi, Portuguese, Romanian, Setswana, Slovak, Slovenian, Spanish, Swedish, Thai, Wolof, Yoruba, and Zulu.

Data releases

OpenNER is distributed in two forms:

  1. Standardized: all datasets with their originally annotated entity types, but with type names standardized across datasets (e.g. in every dataset that annotates persons, the type is PER).
  2. Core types: all datasets, restricted to location (LOC), organization (ORG), and person (PER) annotations. The originally annotated types have been either mapped to these three types or removed.
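
To illustrate the "mapped or removed" idea behind the core types release, here is a minimal sketch in Python. It is not the code used to build OpenNER. It assumes BIO-style tags such as B-PER and I-PER, and the mapping table, including the GPE and DATE entries, is hypothetical; the actual per-dataset mappings are not reproduced here.

```python
from typing import Optional

# Hypothetical mapping for illustration only (an assumption, not OpenNER's
# actual per-dataset mapping). None means the annotation is removed.
CORE_TYPE_MAP: dict[str, Optional[str]] = {
    "PER": "PER",
    "LOC": "LOC",
    "ORG": "ORG",
    "GPE": "LOC",   # assumed: a non-core type mapped to a core type
    "DATE": None,   # assumed: a non-core type that is removed
}


def to_core_types(tags: list[str]) -> list[str]:
    """Map BIO tags to core types; tags whose type has no mapping become O."""
    result = []
    for tag in tags:
        if tag == "O":
            result.append("O")
            continue
        # Split "B-PER" into the BIO prefix "B" and the entity type "PER".
        prefix, _, entity_type = tag.partition("-")
        core_type = CORE_TYPE_MAP.get(entity_type)
        result.append(f"{prefix}-{core_type}" if core_type else "O")
    return result


core_tags = to_core_types(["B-PER", "I-PER", "O", "B-DATE", "I-DATE", "B-GPE"])
```

In this sketch, removed annotations are rewritten to O rather than dropping tokens, so the tag sequence stays aligned with the original tokens.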

The first release of OpenNER is available via:

License

The license for the OpenNER data collection (the specific compilation of the data) is CC-BY 4.0. However, individual datasets have their own licenses, which users must abide by. Some datasets restrict commercial usage.

Citation

The citation for OpenNER is as follows:

@inproceedings{palen-michel-etal-2025-openner,
    title = "{O}pen{NER} 1.0: Standardized Open-Access Named Entity Recognition Datasets in 50+ Languages",
    author = {Palen-Michel, Chester  and
      Pickering, Maxwell  and
      Kruse, Maya  and
      S{\"a}lev{\"a}, Jonne  and
      Lignos, Constantine},
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-main.1708/",
    doi = "10.18653/v1/2025.emnlp-main.1708",
    pages = "33649--33674",
    ISBN = "979-8-89176-332-6",
    abstract = "We present OpenNER 1.0, a standardized collection of openly-available named entity recognition (NER) datasets. OpenNER contains 36 NER corpora that span 52 languages, human-annotated in varying named entity ontologies. We correct annotation format issues, standardize the original datasets into a uniform representation with consistent entity type names across corpora, and provide the collection in a structure that enables research in multilingual and multi-ontology NER. We provide baseline results using three pretrained multilingual language models and two large language models to compare the performance of recent models and facilitate future research in NER. We find that no single model is best in all languages and that significant work remains to obtain high performance from LLMs on the NER task. OpenNER is released at https://github.com/bltlab/open-ner."
}
