UnarXive-2024

The dataset is available on Hugging Face and on Zenodo.

This project presents an updated and extended version of the UnarXive dataset, a large-scale full-text scholarly corpus derived from arXiv.org. We process and structure over 2.28 million papers, preserving rich document content and enriching metadata. Our pipeline enhances section-level grouping while maintaining compatibility with existing formats.

Dataset Overview

The dataset consists of structured JSONL files, each representing a parsed scholarly document from arXiv. Each document includes:

Full-text grouped by sections
Metadata (title, authors, abstract, date, language, cited_by_count, etc.)
Citation information (bib entries and reference entries)
Structural annotations like cite_spans and ref_spans
Licensing and category labels
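
As a minimal sketch of how such a file might be read, the snippet below streams a JSONL file line by line and pulls out a few of the metadata fields listed above. The nesting under a `metadata` key and the helper names `iter_papers` and `summarize` are assumptions for illustration, not the confirmed schema; check them against the actual files before relying on them.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator


def iter_papers(path: Path) -> Iterator[Dict[str, Any]]:
    """Yield one parsed document per non-empty line of a JSONL file."""
    with path.open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines between records
                yield json.loads(line)


def summarize(paper: Dict[str, Any]) -> Dict[str, Any]:
    """Collect a few metadata fields, tolerating missing keys.

    Assumption: metadata lives under a "metadata" key; the real
    layout may differ.
    """
    meta = paper.get("metadata", {})
    return {
        "title": meta.get("title"),
        "language": meta.get("language"),
        "cited_by_count": meta.get("cited_by_count"),
    }
```

Calling `summarize(paper)` for each `paper` yielded by `iter_papers(Path(...))` would give one small dictionary per document. Streaming keeps memory use flat, which matters for a corpus of this size.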

Pipeline

Key Statistics

The total number of papers in our dataset is 2,338,911. Among these:

Physics: 1,146,066 papers (49.12%)
Mathematics: 584,727 papers (25.28%)
Computer Science: 608,118 papers (25.6%)

Languages: Predominantly English, with several hundred documents in other languages.

Usage

The dataset can be used for:

Retrieval-augmented generation (RAG)
Citation recommendation and analysis
Scientific question answering
Training and evaluation of domain-specific language models (e.g., SciBERT, SciNCL)
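
For citation recommendation and analysis, a common first step is to collect the text surrounding each citation marker. The sketch below shows one way this could look. It assumes each paragraph is a dictionary with a `text` string and a `cite_spans` list whose entries carry integer `start` and `end` character offsets plus a `ref_id`; that layout and the function name `citation_contexts` are hypothetical and should be verified against the data.

```python
from typing import Any, Dict, List


def citation_contexts(paragraph: Dict[str, Any], window: int = 100) -> List[Dict[str, Any]]:
    """Return the text around each citation marker in one paragraph.

    Assumption: every entry in "cite_spans" has integer "start"/"end"
    offsets into "text" and a "ref_id" pointing at a bib entry.
    """
    text = paragraph.get("text", "")
    contexts = []
    for span in paragraph.get("cite_spans", []):
        start, end = span["start"], span["end"]
        # Clamp the window to the paragraph boundaries.
        left = max(0, start - window)
        right = min(len(text), end + window)
        contexts.append({
            "ref_id": span.get("ref_id"),
            "marker": text[start:end],
            "context": text[left:right],
        })
    return contexts
```

The resulting `(context, ref_id)` pairs could serve as training examples for a citation recommender, or as retrieval units in a RAG setup.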
