Deployed on Streamlit: View Live App
- Project Overview
- Dataset
- Data Exploration & Cleaning
- Data Preprocessing
- Model Selection
- Hyperparameter Tuning
- Final Model Performance
- Predictions
- Key Takeaways
- Technologies Used
Predict whether a bank customer will churn (leave the bank) using historical customer data. The main goal is maximizing recall, since missing potential churners is costlier than giving false alarms.
- Source: Kaggle Playground Series S4E1
- Files:
  - `train.csv` – 165,034 rows, 14 columns
  - `test.csv` – Test data for predictions
  - `sample_submission.csv` – Example submission format
- Dropped irrelevant columns: `id`, `CustomerId`, `Surname`
- Categorical columns: `Geography`, `Gender`, `HasCrCard`, `IsActiveMember`
- Numerical columns: `CreditScore`, `Age`, `Tenure`, `Balance`, `NumOfProducts`, `EstimatedSalary`
- Removed outliers: `Age > 85`
- Converted categorical features:
  - `Gender`: Male = 1, Female = 0
  - `HasCrCard` & `IsActiveMember` → int
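The cleaning steps above can be sketched with pandas roughly as follows. This is a minimal illustration, not the project's actual code: the `clean` helper and the toy frame are hypothetical, and the float dtype of `HasCrCard` and `IsActiveMember` in the raw data is an assumption.

```python
import pandas as pd


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the listed cleaning steps to a raw customer frame (hypothetical helper)."""
    # Drop identifier columns that carry no predictive signal.
    out = df.drop(columns=["id", "CustomerId", "Surname"], errors="ignore")
    # Remove outliers: keep only customers aged 85 or younger.
    out = out[out["Age"] <= 85].copy()
    # Encode Gender as Male = 1, Female = 0.
    out["Gender"] = out["Gender"].map({"Male": 1, "Female": 0})
    # Cast the binary flags to int (assumed here to arrive as floats).
    for col in ["HasCrCard", "IsActiveMember"]:
        out[col] = out[col].astype(int)
    return out


# Tiny hand-made frame standing in for train.csv; values are illustrative only.
raw = pd.DataFrame({
    "id": [0, 1, 2],
    "CustomerId": [111, 222, 333],
    "Surname": ["A", "B", "C"],
    "Geography": ["France", "Spain", "Germany"],
    "Gender": ["Male", "Female", "Male"],
    "Age": [33.0, 90.0, 41.0],
    "HasCrCard": [1.0, 0.0, 1.0],
    "IsActiveMember": [0.0, 1.0, 1.0],
})
cleaned = clean(raw)
print(cleaned)
```

The second row is dropped by the age filter, and the three identifier columns are gone from the result.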
- One-Hot Encoding for categorical variables
- StandardScaler for numerical features
- Train and test datasets aligned for consistent features
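One way to implement these three steps is sketched below with `pandas.get_dummies`, `DataFrame.reindex`, and `StandardScaler`. The toy frames are made up, and fitting the scaler on the training data only is an assumption about the workflow rather than something the project states.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frames standing in for the cleaned train and test sets (illustrative values).
train = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Age": [30.0, 40.0, 50.0, 60.0],
    "Balance": [0.0, 1000.0, 2000.0, 3000.0],
})
test = pd.DataFrame({
    "Geography": ["France", "Spain"],  # no "Germany" row in this toy test set
    "Age": [35.0, 45.0],
    "Balance": [500.0, 1500.0],
})
num_cols = ["Age", "Balance"]

# One-hot encode the categorical column in each frame.
train_enc = pd.get_dummies(train, columns=["Geography"], dtype=int)
test_enc = pd.get_dummies(test, columns=["Geography"], dtype=int)

# Align: give test exactly the training columns, filling absent dummies with 0.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

# Fit the scaler on training data only, then reuse the same transform on test.
scaler = StandardScaler()
train_enc[num_cols] = scaler.fit_transform(train_enc[num_cols])
test_enc[num_cols] = scaler.transform(test_enc[num_cols])

print(list(test_enc.columns))
```

Without the `reindex` step, the test frame here would lack the `Geography_Germany` column and the model would receive a different feature layout at prediction time.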
- Models evaluated: Logistic Regression, Decision Tree, Random Forest, AdaBoost, Gradient Boost, XGBoost, KNN
- Metrics: Accuracy, Precision, Recall, F1-Score, ROC-AUC
- Top models: Gradient Boost and XGBoost
- Recall prioritized → XGBoost selected
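A comparison along these lines could look like the sketch below. It uses synthetic data from `make_classification` and only three scikit-learn models to stay short and dependency-light; XGBoost and the remaining models are left out here, and the numbers it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the churn data (roughly 20% positives).
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.8, 0.2], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient Boost": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_val)               # hard labels for threshold metrics
    proba = model.predict_proba(X_val)[:, 1]  # positive-class scores for ROC-AUC
    results[name] = {
        "accuracy": accuracy_score(y_val, pred),
        "precision": precision_score(y_val, pred, zero_division=0),
        "recall": recall_score(y_val, pred),
        "f1": f1_score(y_val, pred),
        "roc_auc": roc_auc_score(y_val, proba),
    }

# Rank by recall, the metric the project prioritizes.
ranked = sorted(results, key=lambda n: results[n]["recall"], reverse=True)
for name in ranked:
    print(name, {k: round(v, 3) for k, v in results[name].items()})
```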
- Gradient Boosting: `n_estimators`, `learning_rate`, `max_depth`, `subsample`
- XGBoost: `n_estimators`, `learning_rate`, `max_depth`, `subsample`, `colsample_bytree`
- Used GridSearchCV with StratifiedKFold (5 splits)
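The search setup can be sketched as below, shown for scikit-learn's `GradientBoostingClassifier` on synthetic data. The grid values are deliberately tiny placeholders, and `scoring="recall"` and the shuffled folds are assumptions, since the source states neither the grid nor the scoring metric.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Small synthetic, imbalanced dataset so the sketch runs in a few seconds.
X, y = make_classification(n_samples=400, n_features=8,
                           weights=[0.8, 0.2], random_state=0)

# Placeholder grid over the tuned parameters; the original values are not given.
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [3],
    "subsample": [0.7, 1.0],
}

# Five stratified folds keep the churn ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="recall",  # assumption: chosen to match the recall-first goal
    cv=cv,
    n_jobs=1,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The same pattern would apply to `XGBClassifier` by swapping the estimator and adding `colsample_bytree` to the grid.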
Optimal XGBoost parameters:

    XGBClassifier(
        colsample_bytree=1,
        learning_rate=0.05,
        max_depth=5,
        n_estimators=200,
        subsample=0.7
    )

Sample predictions (submission format):

    id,Exited
    165034,0.036570
    165035,0.144758
    165036,0.083989

| Metric | XGBoost |
|---|---|
| Accuracy | 0.8663 |
| Precision | 0.739 |
| Recall | 0.568 |
| F1-Score | 0.642 |
| ROC-AUC | 0.8899 |
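
As a quick consistency check on the table, the F1-Score can be recomputed from the reported precision and recall. The snippet also converts recall into the share of actual churners the model misses; only the two table values are used as inputs.

```python
precision = 0.739
recall = 0.568

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Share of actual churners that go undetected (false negative rate).
miss_rate = 1 - recall

print(f"F1 = {f1:.3f}")                       # F1 = 0.642
print(f"Missed churners = {miss_rate:.1%}")   # Missed churners = 43.2%
```

The recomputed F1 matches the table's 0.642.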
- Recall is the business priority: a missed churner costs more than a false alarm
- XGBoost outperformed the other evaluated models on this classification task
- Consistent preprocessing (OHE + StandardScaler, with train and test columns aligned) gives the model the same features at training and prediction time
- Hyperparameter tuning improves model metrics (especially ROC-AUC and recall)
- Deployment on Streamlit allows interactive, real-time predictions
Programming Language: Python 3.10
Libraries:
- pandas – data manipulation and analysis
- numpy – numerical computations
- scikit-learn – preprocessing, model selection, and evaluation
- xgboost – XGBoost classifier for modeling
- matplotlib & seaborn – data visualization
- streamlit – web app deployment