Student Performance: ML Pipeline & Clustering

Project Overview

This project implements an end-to-end machine learning pipeline on tabular student performance data. It covers the full data science lifecycle: Exploratory Data Analysis (EDA), data preprocessing with standard and custom scikit-learn pipelines, and the training of several regression and classification models. Students are also clustered with K-Means, and PCA reduces the dimensionality so that the clusters can be visualized interactively.

Key Objectives

Regression: Predicting the student's final grade (target variable G3).
Classification: Predicting whether a student passes or fails the exam (binary classification).
Clustering: Segmenting students into behavioral/academic groups and visualizing the resulting clusters.
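
The clustering objective can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the synthetic feature matrix, the choice of three clusters, and all variable names are assumptions made for the example, whereas the real pipeline works on student-mat.csv.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for numeric student features (assumption: the real
# project would use columns from student-mat.csv instead).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 6))

# Scale features so that no single column dominates the distance metric.
X_scaled = StandardScaler().fit_transform(X)

# Assign each row to one of k clusters (k=3 is an arbitrary choice here).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

# Project onto two principal components to get 2D plotting coordinates.
coords = PCA(n_components=2).fit_transform(X_scaled)
```

The `coords` array and the `labels` vector are then enough to draw a scatter plot colored by cluster with any of the plotting libraries listed in the tech stack.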

Tech Stack

Language: Python
Data Processing: Pandas, NumPy
Machine Learning: Scikit-learn (including Custom Transformers: RareMergeEncoder, BackwardEliminationSelector, ImportanceFeatureSelector), Statsmodels
Visualization: Matplotlib, Seaborn, Plotly

Repository Structure

/data — The original dataset (student-mat.csv).
/src — Source code containing the EDA, custom preprocessing classes, and ML pipelines (students.py).
/docs — Detailed analytical report (in Russian) outlining the workflow, feature engineering, and interpretation of the visualizations.

Key Features

Custom Scikit-Learn Transformers: Custom classes for rare-category merging and feature selection (backward elimination and importance-based selection) that plug directly into the pipeline.
Hyperparameter Tuning: Automated grid search with cross-validation (GridSearchCV) to select model hyperparameters.
Handling Imbalanced Data: Applied stratification and class weights to address target variable imbalance in the classification task.
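
The features above might fit together along these lines. This is a hedged sketch under stated assumptions, not the repository's implementation: `RareCategoryMerger` is a hypothetical stand-in for the project's RareMergeEncoder, and the toy data, column names, logistic regression model, and parameter grid are all invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


class RareCategoryMerger(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: replace infrequent categories with one label."""

    def __init__(self, columns=None, min_freq=0.05, other_label="other"):
        self.columns = columns
        self.min_freq = min_freq
        self.other_label = other_label

    def fit(self, X, y=None):
        # Remember, per column, the categories frequent enough to keep.
        self.frequent_ = {}
        for col in self.columns:
            freq = X[col].value_counts(normalize=True)
            self.frequent_[col] = freq[freq >= self.min_freq].index.tolist()
        return self

    def transform(self, X):
        X = X.copy()
        for col, keep in self.frequent_.items():
            X[col] = X[col].where(X[col].isin(keep), self.other_label)
        return X


# Toy data (assumption): one categorical column with two rare levels,
# one numeric column, and an imbalanced binary target (30% positives).
rng = np.random.default_rng(0)
jobs = ["teacher"] * 90 + ["services"] * 70 + ["health"] * 32 + ["rare_a"] * 4 + ["rare_b"] * 4
X = pd.DataFrame({
    "job": rng.permutation(jobs),
    "studytime": rng.integers(1, 5, size=200),
})
y = rng.permutation(np.array([1] * 60 + [0] * 140))

# Stratified split keeps the class ratio the same in train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

pipeline = Pipeline([
    ("merge_rare", RareCategoryMerger(columns=["job"])),
    ("encode", ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), ["job"])],
        remainder="passthrough",
    )),
    # class_weight="balanced" reweights classes inversely to their frequency.
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

search = GridSearchCV(
    pipeline,
    param_grid={"clf__C": [0.1, 1.0, 10.0], "merge_rare__min_freq": [0.01, 0.05]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="f1",
)
search.fit(X_train, y_train)
print(search.best_params_)
```

Because the merger is a pipeline step, its `min_freq` threshold can be tuned in the same grid as the model's own hyperparameters, and it is refit on each training fold so that no information leaks from the validation folds.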

