gorop51-2/Student-Performance-ML

Student Performance: ML Pipeline & Clustering

Project Overview

This project presents an end-to-end machine learning pipeline built on tabular student performance data. It covers the full data science lifecycle: Exploratory Data Analysis (EDA), data preprocessing with standard and custom scikit-learn pipelines, and the training of several regression and classification models. Students are also clustered with K-Means, and the clusters are shown in an interactive visualization after dimensionality reduction with PCA.

Key Objectives

  • Regression: Predicting the student's final grade (target variable G3).
  • Classification: Determining the likelihood of exam success (binary pass/fail classification).
  • Clustering: Segmenting students into behavioral/academic groups and visualizing the resulting clusters.
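
As a rough illustration of the clustering objective above, the following minimal sketch scales features, fits K-Means, and projects the result to two PCA components for plotting. It is not the repository's code: the data is synthetic, and the cluster count and feature count are arbitrary assumptions chosen for the toy example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for numeric student features (NOT the real dataset).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(50, 5)),
    rng.normal(loc=4.0, scale=1.0, size=(50, 5)),
])

# Scale the features, then assign each row to a cluster.
# n_clusters=2 is an assumption that suits this toy data only.
clusterer = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
labels = clusterer.fit_predict(X)

# Reduce the scaled features to 2 components so clusters can be plotted.
X_scaled = clusterer.named_steps["standardscaler"].transform(X)
coords = PCA(n_components=2).fit_transform(X_scaled)

print(labels.shape, coords.shape)
```

The `coords` array and `labels` vector can then be passed to any plotting library, for example as the x, y, and color arguments of a scatter plot.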

Tech Stack

  • Language: Python
  • Data Processing: Pandas, NumPy
  • Machine Learning: Scikit-learn (including Custom Transformers: RareMergeEncoder, BackwardEliminationSelector, ImportanceFeatureSelector), Statsmodels
  • Visualization: Matplotlib, Seaborn, Plotly

Repository Structure

  • /data — The original dataset (student-mat.csv).
  • /src — Source code containing the EDA, custom preprocessing classes, and ML pipelines (students.py).
  • /docs — Detailed analytical report (in Russian) outlining the workflow, feature engineering, and interpretation of the visualizations.

Key Features

  • Custom Scikit-Learn Transformers: Implemented custom classes for rare category merging and feature selection (Backward Elimination and Feature Importance) directly integrated into the pipeline.
  • Hyperparameter Tuning: Automated grid search with cross-validation (GridSearchCV) for optimal model performance.
  • Handling Imbalanced Data: Applied stratification and class weights to address target variable imbalance in the classification task.
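
The features listed above can be sketched together in one small example: a custom transformer inside a pipeline, tuned with GridSearchCV over stratified folds, with class weights on the classifier. This is a hedged illustration under stated assumptions, not the repository's implementation. `RareCategoryMerger` is a hypothetical class written for this sketch (the project's own `RareMergeEncoder` may work differently), and the column name, thresholds, parameter grid, and data are all made up.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


class RareCategoryMerger(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: merge infrequent categories into one label."""

    def __init__(self, min_freq=0.05, other_label="Other"):
        self.min_freq = min_freq
        self.other_label = other_label

    def fit(self, X, y=None):
        X = pd.DataFrame(X)
        # Remember, per column, which categories are frequent enough to keep.
        self.frequent_ = {}
        for col in X.columns:
            freq = X[col].value_counts(normalize=True)
            self.frequent_[col] = list(freq[freq >= self.min_freq].index)
        return self

    def transform(self, X):
        X = pd.DataFrame(X).copy()
        for col, keep in self.frequent_.items():
            # Keep frequent values; replace everything else with other_label.
            X[col] = X[col].where(X[col].isin(keep), self.other_label)
        return X


# Made-up data: one categorical column and an imbalanced binary target.
rng = np.random.default_rng(0)
jobs = (["teacher"] * 90 + ["services"] * 70 + ["health"] * 32
        + ["rare_a"] * 4 + ["rare_b"] * 4)
X = pd.DataFrame({"job": rng.permutation(np.array(jobs, dtype=object))})
y = rng.permutation(np.array([1] * 50 + [0] * 150))

pipe = Pipeline([
    ("merge", RareCategoryMerger()),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
    # class_weight="balanced" reweights classes inversely to their frequency.
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Illustrative grid; the values are assumptions, not the project's settings.
param_grid = {"merge__min_freq": [0.01, 0.05], "clf__C": [0.1, 1.0]}

# Stratified folds preserve the class ratio in every train/validation split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="balanced_accuracy")
search.fit(X, y)

merged = RareCategoryMerger(min_freq=0.05).fit_transform(X)
print(search.best_params_)
```

Because the transformer stores its constructor arguments unchanged and learns its state only in `fit`, GridSearchCV can clone it and tune `min_freq` like any built-in hyperparameter.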
