This document is a reference for the core AthletiStat classes, all of which can be imported and used directly in Python.
from athletistat.core.scraper import Scraper
# Scrape performance data for the current season
scraper = Scraper(mode="seasons")
scraper.run(max_workers=10)
# Scrape a specific historical year
scraper = Scraper(mode="seasons")
scraper.run(year=2022, max_workers=12)
# Scrape all-time records
scraper = Scraper(mode="all-time")
scraper.run(max_workers=10)

Key behaviors:
- Uses `ThreadPoolExecutor` to scrape multiple events concurrently.
- Automatically paginates through all available result pages per event.
- Persists job queues to disk; failed or interrupted jobs remain in the queue and are resumed on the next run.
- Historical years (not the current year) are cached in `completed_seasons.json` and skipped if already fully scraped.
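The queue-and-resume behavior listed above can be pictured with a minimal sketch. It is not AthletiStat's actual implementation: the helper name `run_jobs`, the `fetch` callable, and the queue file layout (a JSON list of job identifiers) are all assumptions made for illustration. Only the use of `ThreadPoolExecutor` comes from the list above.

```python
import json
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def run_jobs(queue_path, fetch, max_workers=10):
    """Run every queued job concurrently; failed jobs stay queued for the next run.

    Hypothetical helper: the queue file is assumed to hold a JSON list of job ids.
    """
    queue_path = Path(queue_path)
    pending = json.loads(queue_path.read_text())
    remaining = list(pending)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its job so results can be matched up.
        futures = {pool.submit(fetch, job): job for job in pending}
        for future in as_completed(futures):
            job = futures[future]
            try:
                future.result()
            except Exception:
                continue  # the failed job is left in the queue
            remaining.remove(job)
            # Persist after every success so an interruption loses no progress.
            queue_path.write_text(json.dumps(remaining))

    return remaining
```

Calling `run_jobs` a second time with the same queue file would retry only the jobs that failed or never finished, which is the resume behavior described above.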
from athletistat.core.preprocessing import Preprocessor
# Preprocess seasons data
preprocessor = Preprocessor(mode="seasons")
preprocessor.run()
# Preprocess all-time data
preprocessor = Preprocessor(mode="all-time")
preprocessor.run()
# Preprocess both
preprocessor = Preprocessor(mode="both")
preprocessor.run()

Key transformations applied:
- Discipline slug normalization (e.g., `decathlon-u20` → `decathlon`).
- Performance mark parsing: `MM:SS.ss` / `H:MM:SS.ss` → total seconds as a float.
- Event classification: `track_field` column set to `track`, `field`, or `mixed`.
- Country code resolution to full country names (`nat_full`, `venue_country`).
- Date parsing and `age_at_event` calculation from `dob` and `date`.
- `season` column extracted from the performance date year.
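Three of these transformations are simple enough to sketch in plain Python. The function names, the set of age-group suffixes, and the choice to express age in whole years are assumptions made for illustration; the actual `Preprocessor` logic may differ.

```python
import re
from datetime import date

# Assumed age-group suffixes; the real set handled by Preprocessor is not documented here.
_AGE_GROUP_SUFFIX = re.compile(r"-u(18|20|23)$")


def normalize_discipline(slug):
    """Strip a trailing age-group suffix, e.g. 'decathlon-u20' -> 'decathlon'."""
    return _AGE_GROUP_SUFFIX.sub("", slug)


def mark_to_seconds(mark):
    """Convert 'MM:SS.ss' or 'H:MM:SS.ss' (or plain 'SS.ss') to total seconds."""
    total = 0.0
    for part in mark.split(":"):
        # Each colon-separated field is worth 60x the field to its right.
        total = total * 60 + float(part)
    return total


def age_at_event(dob, event_date):
    """Age in whole years on the event date (assumed definition)."""
    had_birthday = (event_date.month, event_date.day) >= (dob.month, dob.day)
    return event_date.year - dob.year - (0 if had_birthday else 1)


print(normalize_discipline("decathlon-u20"))               # decathlon
print(mark_to_seconds("3:26.00"))                          # 206.0
print(mark_to_seconds("2:00:35"))                          # 7235.0
print(age_at_event(date(2000, 6, 15), date(2022, 6, 14)))  # 21
```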
from athletistat.core.generator import DatasetGenerator
# Generate per-year season datasets
generator = DatasetGenerator(mode="seasons")
generator.run()
# Generate all-time dataset
generator = DatasetGenerator(mode="all-time")
generator.run()
# Generate season datasets and combine into a single multi-year file
generator = DatasetGenerator(mode="seasons")
generator.run(combine=True)

from athletistat.core.generator import DatasetSplitter
# Split seasons dataset
splitter = DatasetSplitter(mode="seasons")
splitter.run()
# Split all-time dataset
splitter = DatasetSplitter(mode="all-time")
splitter.run()

Split outputs produced:
- `split_global/` - Full individual events and relay events as separate files.
- `split_by_type/{gender}/` - One CSV per event type (e.g., `sprints`, `jumps`, `hurdles`).
- `split_by_discipline/{gender}/` - One CSV per normalized discipline (e.g., `100-metres`, `long-jump`).
- `{gender}/relays/` - Relay events by discipline (excludes `dob` and `age_at_event`).
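To make the `split_by_discipline/{gender}/` layout concrete, here is a rough sketch that groups rows and writes one CSV per discipline using only the standard library. The function name and the `gender` and `discipline` column names are assumptions; `DatasetSplitter` itself may be implemented quite differently.

```python
import csv
from collections import defaultdict
from pathlib import Path


def split_by_discipline(rows, out_dir):
    """Write one CSV per (gender, discipline) under split_by_discipline/{gender}/.

    `rows` is an iterable of dicts; the "gender" and "discipline" keys are assumed names.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[(row["gender"], row["discipline"])].append(row)

    written = []
    for (gender, discipline), members in sorted(groups.items()):
        target = Path(out_dir) / "split_by_discipline" / gender / f"{discipline}.csv"
        target.parent.mkdir(parents=True, exist_ok=True)
        with target.open("w", newline="") as handle:
            # Use the first row's keys as the header for the whole file.
            writer = csv.DictWriter(handle, fieldnames=list(members[0].keys()))
            writer.writeheader()
            writer.writerows(members)
        written.append(target)
    return written
```

The function returns the paths it wrote, so a caller can check that, for example, `split_by_discipline/men/100-metres.csv` exists after the run.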