A comprehensive machine learning pipeline to predict F1 Race Winners and Podium finishes using historical telemetry, driver error rates, and qualifying data.
This project aggregates historical F1 data (from 1950 to present) to build predictive models. It includes custom feature engineering, such as calculating "Driver Error Rates" and handling historic team name changes (e.g., Toro Rosso → AlphaTauri → RB), and compares Random Forest and Logistic Regression performance.
The goal is to predict race outcomes based on pre-race factors like Qualifying position (q3), historical performance (standings), and driver reliability (error_rates).
- Custom Features:
  - Driver/Constructor Error Rates: Calculates the probability of a driver crashing or having a mechanical failure based on historical `statusId` data.
  - Home Advantage: Flags if a driver is racing in their home country.
  - Legacy Team Mapping: Merges historical team IDs (e.g., Force India → Aston Martin) to maintain data continuity.
- Dimensionality Reduction: Uses PCA (Principal Component Analysis) to analyze feature variance.
- Model Comparison: Benchmarks Random Forest (with Hyperparameter Tuning) against Logistic Regression (CV).
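
The Legacy Team Mapping and Home Advantage features could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names (`constructorRef`, `nationality`, `circuit_country`), the lookup tables, and the function `add_team_and_home_features` are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical lineage table: each historical team name maps to its current name.
LEGACY_TEAMS = {
    "toro_rosso": "rb",
    "alphatauri": "rb",
    "force_india": "aston_martin",
    "racing_point": "aston_martin",
}

# Hypothetical lookup from driver nationality to the country name used for circuits.
NATIONALITY_TO_COUNTRY = {
    "Dutch": "Netherlands",
    "British": "UK",
    "Monegasque": "Monaco",
}


def add_team_and_home_features(df):
    """Return a copy of df with a unified team column and a home-race flag."""
    out = df.copy()
    # Teams that are not in the lineage table keep their original name.
    out["team"] = out["constructorRef"].replace(LEGACY_TEAMS)
    # Unknown nationalities map to NaN, which never equals a circuit country.
    home_country = out["nationality"].map(NATIONALITY_TO_COUNTRY)
    out["home_race"] = (home_country == out["circuit_country"]).astype(int)
    return out


# Illustrative rows only, not real race data.
races = pd.DataFrame(
    {
        "constructorRef": ["toro_rosso", "force_india", "ferrari"],
        "nationality": ["Dutch", "British", "Monegasque"],
        "circuit_country": ["Netherlands", "Italy", "Monaco"],
    }
)
features = add_team_and_home_features(races)
```

Mapping every historical name directly to the current one avoids having to chain renames (Toro Rosso → AlphaTauri → RB) at lookup time.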
The project is divided into Data Preparation, Feature Engineering, and Analysis.
create_training_dataset.py:
- Merges raw CSVs (`results`, `drivers`, `constructors`, `qualifying`, etc.).
- Standardizes Qualifying times (converts `1:20.142` to seconds).
- Creates the master training file: `current_driver_dataset.csv`.
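
The qualifying-time standardization might look something like the sketch below. The helper name `lap_time_to_seconds` is hypothetical, and treating `\N` or an empty string as a missing time is an assumption about how the raw CSVs mark missing values.

```python
def lap_time_to_seconds(lap_time):
    """Convert a lap time such as '1:20.142' into seconds as a float.

    Returns None for missing values. The '\\N' marker is an assumed
    convention for missing times, not something confirmed by the project.
    """
    if lap_time is None:
        return None
    text = str(lap_time).strip()
    if text in {"", "\\N", "nan"}:
        return None
    # Split on the last colon so that both 'M:SS.mmm' and 'SS.mmm' are handled.
    minutes, sep, seconds = text.rpartition(":")
    total = (int(minutes) if sep else 0) * 60 + float(seconds)
    return round(total, 3)


q3_seconds = lap_time_to_seconds("1:20.142")  # 60 + 20.142 seconds
no_time = lap_time_to_seconds("\\N")          # missing time stays missing
```

With pandas, the same helper could be applied to a whole column via `df["q3"].map(lap_time_to_seconds)`, assuming the column is named `q3`.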
fix_dataset.py:
- Cleans specific datasets (`maria.tsv`) and merges them with the latest race results.
- Generates the target variables: `podium.csv` (Top 3) and `winner.csv` (P1).
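
Deriving the two target variables from a finishing position could be done along these lines. This is a sketch under assumptions: the column name `positionOrder` and the function `add_targets` are illustrative and may not match what `fix_dataset.py` actually uses.

```python
import pandas as pd


def add_targets(results, position_col="positionOrder"):
    """Add binary winner (P1) and podium (top 3) columns to a results table."""
    out = results.copy()
    # Non-numeric positions (e.g. retirements) become NaN and count as 0.
    position = pd.to_numeric(out[position_col], errors="coerce")
    out["winner"] = (position == 1).astype(int)
    out["podium"] = (position <= 3).astype(int)
    return out


# Illustrative rows only, not real race data.
results = pd.DataFrame(
    {"driverId": [1, 2, 3, 4, 5], "positionOrder": [1, 2, 3, 4, "R"]}
)
labeled = add_targets(results)
```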
calc_driver_error.py:
- Analyzes `status.csv` to identify DNF causes (Accidents, Collisions, etc.).
- Computes and plots error rates for current drivers (e.g., VER, HAM, LEC).
- Output: `driver_error_rates.csv`, `constructor_error_rates.csv`.
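
One way to compute a per-driver error rate from status data is sketched below. The set of statuses counted as driver errors, the column names (`driverId`, `statusId`, `status`), and the function `driver_error_rates` are assumptions for illustration; the project's own definition of an error may differ.

```python
import pandas as pd

# Assumed set of statuses that count as a driver error.
DRIVER_ERROR_STATUSES = {"Accident", "Collision", "Spun off"}


def driver_error_rates(results, status):
    """Return starts, errors and error_rate (errors / starts) per driver."""
    merged = results.merge(status, on="statusId", how="left")
    merged["is_error"] = merged["status"].isin(DRIVER_ERROR_STATUSES)
    rates = merged.groupby("driverId")["is_error"].agg(starts="count", errors="sum")
    rates["error_rate"] = rates["errors"] / rates["starts"]
    return rates.reset_index()


# Illustrative tables only, not real race data.
status = pd.DataFrame(
    {
        "statusId": [1, 3, 4, 11],
        "status": ["Finished", "Accident", "Collision", "+1 Lap"],
    }
)
results = pd.DataFrame(
    {"driverId": [1, 1, 1, 1, 2, 2], "statusId": [1, 3, 1, 11, 4, 1]}
)
rates = driver_error_rates(results, status)
```

A constructor error rate could follow the same pattern by grouping on a constructor column and counting mechanical statuses instead.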
pca_new.py:
- Standardizes data and runs PCA to visualize explained variance.
- Generates Heatmaps and Feature Importance bar charts to understand which variables drive performance.
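
The standardize-then-PCA step could be sketched as follows. The data here is synthetic and the three features are hypothetical stand-ins; only the use of scikit-learn's `StandardScaler` and `PCA` reflects the approach described above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in features (assumed, not the project's real columns):
# a grid position, a strongly correlated copy of it, and unrelated noise.
rng = np.random.default_rng(0)
grid = rng.integers(1, 21, size=200).astype(float)
X = np.column_stack(
    [grid, grid + rng.normal(0, 1, size=200), rng.normal(0, 1, size=200)]
)

# PCA is scale-sensitive, so each feature is standardized first.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)

explained = pca.explained_variance_ratio_  # share of variance per component
cumulative = np.cumsum(explained)          # running total, ends at 1.0
```

Plotting `cumulative` against the component index gives the usual explained-variance curve for deciding how many components to keep.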
Analysis.py:
- The Core Logic: Trains models to predict Winners and Podiums.
- Random Forest: Uses `RandomizedSearchCV` for optimization and Permutation Importance for feature selection.
- Logistic Regression: Uses `LogisticRegressionCV` and calculates Odds Ratios.
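
The two modeling approaches above can be illustrated with a small, self-contained sketch. Everything about the data is an assumption: the features (`grid`, `error_rate`) and the podium labels are synthetic, and the hyperparameter ranges are arbitrary. It is not the project's actual configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data: drivers starting near the front with low error rates
# are more likely to be labeled as podium finishers.
rng = np.random.default_rng(42)
n = 400
grid = rng.integers(1, 21, size=n)
error_rate = rng.uniform(0.0, 0.3, size=n)
prob = 1 / (1 + np.exp(0.5 * (grid - 4) + 5 * error_rate))
y = (rng.uniform(size=n) < prob).astype(int)
X = np.column_stack([grid, error_rate])
feature_names = ["grid", "error_rate"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Random Forest with a small randomized hyperparameter search.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": [50, 100],
        "max_depth": [3, 5, None],
        "min_samples_leaf": [1, 5],
    },
    n_iter=4,
    cv=3,
    scoring="roc_auc",
    random_state=0,
)
search.fit(X_train, y_train)

# Permutation importance: score drop when one feature is shuffled.
perm = permutation_importance(
    search.best_estimator_, X_test, y_test, n_repeats=5, random_state=0
)
importances = dict(zip(feature_names, perm.importances_mean))

# Logistic Regression with built-in cross-validation over regularization.
scaler = StandardScaler().fit(X_train)
logit = LogisticRegressionCV(cv=3, max_iter=1000)
logit.fit(scaler.transform(X_train), y_train)

# exp(coefficient) is the odds ratio per one standard deviation increase.
odds_ratios = dict(zip(feature_names, np.exp(logit.coef_[0])))
```

An odds ratio below 1 for `grid` would mean that starting further back lowers the odds of a podium, which is how the synthetic labels were constructed. Because the features are standardized, each ratio refers to a one standard deviation change rather than a one-unit change.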
You will need Python 3 and the standard data science stack:
pip install pandas numpy matplotlib seaborn scikit-learn scipy