The dataset is available on HuggingFace and Zenodo.
This project presents an updated and extended version of the unarXive dataset, a large-scale full-text scholarly corpus derived from arXiv.org. We process and structure over 2.28 million papers, preserving the full document content and enriching the metadata. Our pipeline improves section-level grouping while remaining compatible with existing formats.
The dataset consists of structured JSONL files, each representing a parsed scholarly document from arXiv. Each document includes:
- Full text grouped by section
- Metadata (title, authors, abstract, date, language, cited_by_count, etc.)
- Citation information (bib entries and reference entries)
- Structural annotations such as cite_spans and ref_spans
- Licensing and category labels
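
As a rough illustration of how such a file could be consumed, the sketch below reads JSONL records one line at a time and counts sections, citation spans, and bibliography entries per document. The key names `metadata`, `sections`, and `bib_entries`, and the placement of `cite_spans` and `ref_spans` inside each section, are assumptions made for this example, and the sample record is invented; check a real record for the actual schema before relying on it.

```python
import io
import json
from typing import Dict, Iterator, TextIO


def iter_documents(stream: TextIO) -> Iterator[Dict]:
    """Yield one parsed document per non-empty JSONL line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def summarize(doc: Dict) -> Dict:
    """Collect a few per-document counts, tolerating missing keys."""
    metadata = doc.get("metadata", {})   # assumed key name
    sections = doc.get("sections", [])   # assumed key name
    return {
        "title": metadata.get("title"),
        "n_sections": len(sections),
        # Assumption: span annotations are stored per section.
        "n_cite_spans": sum(len(s.get("cite_spans", [])) for s in sections),
        "n_ref_spans": sum(len(s.get("ref_spans", [])) for s in sections),
        "n_bib_entries": len(doc.get("bib_entries", {})),  # assumed key name
    }


# Invented record standing in for one line of a real JSONL file.
sample = {
    "metadata": {"title": "An Example Paper", "authors": ["A. Author"]},
    "sections": [
        {
            "heading": "Introduction",
            "text": "Prior work [1] motivates this study.",
            "cite_spans": [{"start": 11, "end": 14, "ref_id": "b0"}],
            "ref_spans": [],
        },
        {
            "heading": "Method",
            "text": "See Figure 1.",
            "cite_spans": [],
            "ref_spans": [{"start": 4, "end": 12, "ref_id": "fig1"}],
        },
    ],
    "bib_entries": {"b0": {"title": "A Cited Work"}},
}

# An in-memory stream keeps the example self-contained; for a real file,
# pass the handle returned by open(path, encoding="utf-8") instead.
stream = io.StringIO(json.dumps(sample) + "\n")
summaries = [summarize(doc) for doc in iter_documents(stream)]
print(summaries[0])
```

Reading line by line keeps memory use flat, which matters for a corpus of this size.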
The total number of papers in our dataset is 2,338,911. Among these:
- Physics: 1,146,066 papers (49.12%)
- Mathematics: 584,727 papers (25.28%)
- Computer Science: 608,118 papers (25.6%)

Languages: Predominantly English, with several hundred documents in other languages.
The dataset can be used for:
- Retrieval-augmented generation (RAG)
- Citation recommendation and analysis
- Scientific question answering
- Training and evaluation of domain-specific language models (e.g., SciBERT, SciNCL)
