This project presents an end-to-end machine learning pipeline for tabular student-performance data. It covers the full data science lifecycle: exploratory data analysis (EDA), data preprocessing with standard and custom scikit-learn pipelines, and training of several regression and classification models. Students are also clustered with K-Means, and PCA is used for dimensionality reduction so the clusters can be visualized interactively.
- Regression: Predicting the student's final grade (target variable `G3`).
- Classification: Determining the likelihood of exam success (binary pass/fail classification).
- Clustering: Segmenting students into behavioral/academic groups and visualizing the resulting clusters.
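
The clustering task can be sketched roughly as follows. This is a minimal illustration, not the code in `students.py`: the data is synthetic, and the number of clusters and the feature count are arbitrary assumptions chosen only to make the example runnable.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for numeric student features (NOT the real dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))

# Scale first: K-Means is distance-based, so feature scales matter.
X_scaled = StandardScaler().fit_transform(X)

# n_clusters=3 is an arbitrary illustrative choice.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

# Project to 2 components for plotting only; clustering used all features.
coords = PCA(n_components=2, random_state=0).fit_transform(X_scaled)
```

The `coords` array and `labels` can then be passed to any plotting library (for example, a Plotly scatter plot colored by cluster label).
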
- Language: Python
- Data Processing: Pandas, NumPy
- Machine Learning: Scikit-learn (including Custom Transformers: `RareMergeEncoder`, `BackwardEliminationSelector`, `ImportanceFeatureSelector`), Statsmodels
- Visualization: Matplotlib, Seaborn, Plotly
- `/data`: The original dataset (`student-mat.csv`).
- `/src`: Source code containing the EDA, custom preprocessing classes, and ML pipelines (`students.py`).
- `/docs`: Detailed analytical report (in Russian) outlining the workflow, feature engineering, and interpretation of the visualizations.
- Custom Scikit-Learn Transformers: Implemented custom classes for rare category merging and feature selection (Backward Elimination and Feature Importance) directly integrated into the pipeline.
- Hyperparameter Tuning: Automated grid search with cross-validation (`GridSearchCV`) for optimal model performance.
- Handling Imbalanced Data: Applied stratification and class weights to address target variable imbalance in the classification task.
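
The sketch below shows how these pieces might fit together. It is an illustration under stated assumptions, not the project's implementation: `RareCategoryMerger` is a hypothetical, simplified stand-in for `RareMergeEncoder`, and the data, column names, model choice, and parameter grid are all made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


class RareCategoryMerger(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: merge categories rarer than min_freq into one label."""

    def __init__(self, min_freq=0.05, other_label="other"):
        self.min_freq = min_freq
        self.other_label = other_label

    def fit(self, X, y=None):
        X = pd.DataFrame(X)
        # Remember, per column, which categories are frequent enough to keep.
        self.frequent_ = {}
        for col in X.columns:
            freq = X[col].value_counts(normalize=True)
            self.frequent_[col] = set(freq[freq >= self.min_freq].index)
        return self

    def transform(self, X):
        X = pd.DataFrame(X).copy()
        for col, keep in self.frequent_.items():
            # Keep frequent categories, replace everything else.
            X[col] = X[col].where(X[col].isin(keep), self.other_label)
        return X


# Synthetic categorical data with an imbalanced binary target (NOT the real dataset).
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "job": rng.choice(["a", "b", "c", "d"], size=n, p=[0.6, 0.34, 0.04, 0.02]),
    "school": rng.choice(["x", "y"], size=n, p=[0.8, 0.2]),
})
y = (rng.random(n) < 0.25).astype(int)

# Stratify the split so both parts keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

pipe = Pipeline([
    ("merge", RareCategoryMerger()),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
    # class_weight="balanced" reweights classes inversely to their frequency.
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# The custom step's parameters are tuned together with the model's.
param_grid = {"merge__min_freq": [0.02, 0.1], "clf__C": [0.1, 1.0]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="balanced_accuracy")
search.fit(X_train, y_train)
```

Because the transformer lives inside the pipeline, it is refit on each training fold during cross-validation, so category frequencies from the validation fold do not leak into preprocessing.
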