Formula 1 Race Podium Prediction

A comprehensive machine learning pipeline to predict F1 Race Winners and Podium finishes using historical telemetry, driver error rates, and qualifying data.

This project aggregates historical F1 data (from 1950 to present) to build predictive models. It features custom features as calculating "Driver Error Rates" and handling historic team name changes (e.g., Toro Rosso → AlphaTauri → RB)—and compares Random Forest vs. Logistic Regression performance.

Project Overview

The goal is to predict race outcomes based on pre-race factors like Qualifying position (q3), historical performance (standings), and driver reliability (error_rates).

Features

Custom Feature:
- Driver/Constructor Error Rates: Calculates the probability of a driver crashing or having a mechanical failure based on historical statusId data.
- Home Advantage: Flags if a driver is racing in their home country.
- Legacy Team Mapping: Merges historical team IDs (e.g., Force India $\rightarrow$ Aston Martin) to maintain data continuity.
Dimensionality Reduction: Uses PCA (Principal Component Analysis) to analyze feature variance.
Model Comparison: Benchmarks Random Forest (with Hyperparameter Tuning) against Logistic Regression (CV).

The project is divided into Data Preparation, Feature Engineering, and Analysis.

1. Data Preparation & Cleaning

create_training_dataset.py:
- Merges raw CSVs (results, drivers, constructors, qualifying, etc.).
- Standardizes Qualifying times (converts 1:20.142 to seconds).
- Creates the master training file: current_driver_dataset.csv.
fix_dataset.py:
- Cleans specific datasets (maria.tsv) and merges them with the latest race results.
- Generates the target variables: podium.csv (Top 3) and winner.csv (P1).

2. Feature Engineering

calc_driver_error.py:
- Analyzes status.csv to identify DNF causes (Accidents, Collisions, etc.).
- Computes and plots error rates for current drivers (e.g., VER, HAM, LEC).
- Output: driver_error_rates.csv, constructor_error_rates.csv.

3. Analysis & Modeling

pca_new.py:
- Standardizes data and runs PCA to visualize explained variance.
- Generates Heatmaps and Feature Importance bar charts to understand which variables drive performance.
Analysis.py:
- The Core Logic: Trains models to predict Winners and Podiums.
- Random Forest: Uses RandomizedSearchCV for optimization and Permutation Importance for feature selection.
- Logistic Regression: Uses LogisticRegressionCV and calculates Odds Ratios.

Prerequisites

You will need Python 3 and the standard data science stack:

pip install pandas numpy matplotlib seaborn scikit-learn scipy

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
Analysis.py		Analysis.py
F1.pdf		F1.pdf
LICENSE		LICENSE
README.md		README.md
calc_driver_error.py		calc_driver_error.py
create_training_dataset.py		create_training_dataset.py
fix_dataset.py		fix_dataset.py
pca_new.py		pca_new.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Formula 1 Race Podium Prediction

Project Overview

Features

1. Data Preparation & Cleaning

2. Feature Engineering

3. Analysis & Modeling

Prerequisites

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Formula 1 Race Podium Prediction

Project Overview

Features

1. Data Preparation & Cleaning

2. Feature Engineering

3. Analysis & Modeling

Prerequisites

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages