Bank Churn Prediction – End-to-End ML Project with Deployment

Deployed on Streamlit: View Live App

Python Scikit-Learn XGBoost


Table of Contents

  1. Project Overview
  2. Dataset
  3. Data Exploration & Cleaning
  4. Data Preprocessing
  5. Model Selection
  6. Hyperparameter Tuning
  7. Final Model Performance
  8. Predictions
  9. Key Takeaways
  10. Technologies Used

Project Overview

Predict whether a bank customer will churn (leave the bank) using historical customer data. The main goal is maximizing recall, since missing potential churners is costlier than giving false alarms.


Dataset

  • Source: Kaggle Playground Series S4E1
  • Files:
    • train.csv – 165,034 rows, 14 columns
    • test.csv – Test data for predictions
    • sample_submission.csv – Example submission format

Data Exploration & Cleaning

  • Dropped irrelevant columns: id, CustomerId, Surname
  • Categorical columns: Geography, Gender, HasCrCard, IsActiveMember
  • Numerical columns: CreditScore, Age, Tenure, Balance, NumOfProducts, EstimatedSalary
  • Removed outliers: Age > 85
  • Converted categorical features:
    • Gender: Male = 1, Female = 0
    • HasCrCard & IsActiveMember → int
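The cleaning steps above can be sketched with pandas roughly as follows. This is a minimal illustration, not the project's actual code: the helper name `clean_churn_frame` and the three-row frame are made up for the example, and only the column names come from the list above.

```python
import pandas as pd

def clean_churn_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the listed cleaning steps to a raw churn DataFrame (hypothetical helper)."""
    # Drop identifier columns that carry no predictive signal
    out = df.drop(columns=["id", "CustomerId", "Surname"], errors="ignore")
    # Remove outliers: keep only customers aged 85 or younger
    out = out[out["Age"] <= 85].copy()
    # Encode Gender as Male = 1, Female = 0
    out["Gender"] = out["Gender"].map({"Male": 1, "Female": 0})
    # Cast the binary flags to int
    for col in ["HasCrCard", "IsActiveMember"]:
        out[col] = out[col].astype(int)
    return out.reset_index(drop=True)

# Tiny made-up frame, used only to show the call
raw = pd.DataFrame({
    "id": [0, 1, 2],
    "CustomerId": [101, 102, 103],
    "Surname": ["A", "B", "C"],
    "Gender": ["Male", "Female", "Male"],
    "Age": [33.0, 92.0, 41.0],
    "HasCrCard": [1.0, 0.0, 1.0],
    "IsActiveMember": [0.0, 1.0, 1.0],
})
clean = clean_churn_frame(raw)
```

In this toy frame the 92-year-old customer is dropped, so `clean` keeps two rows.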

Data Preprocessing

  • One-Hot Encoding for categorical variables
  • StandardScaler for numerical features
  • Train and test datasets aligned for consistent features
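One way to implement these three steps is sketched below. It assumes `pd.get_dummies` for the encoding and `reindex` for the alignment; the README does not say which calls were actually used, and the function name `preprocess` and the small frames are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def preprocess(train, test, categorical, numeric):
    """One-hot encode, align test to train, and scale numeric columns (hypothetical helper)."""
    train_enc = pd.get_dummies(train, columns=categorical, dtype=int)
    test_enc = pd.get_dummies(test, columns=categorical, dtype=int)
    # Align: give test exactly the train columns; categories absent from test become 0
    test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
    # Fit the scaler on train only, then apply the same transform to test
    scaler = StandardScaler()
    train_enc[numeric] = scaler.fit_transform(train_enc[numeric])
    test_enc[numeric] = scaler.transform(test_enc[numeric])
    return train_enc, test_enc

# Made-up data: "Germany" appears in train but not in test
train = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Age": [30.0, 40.0, 50.0, 60.0],
    "Balance": [0.0, 100.0, 200.0, 300.0],
})
test = pd.DataFrame({
    "Geography": ["France", "Spain"],
    "Age": [45.0, 35.0],
    "Balance": [150.0, 50.0],
})
train_p, test_p = preprocess(train, test, ["Geography"], ["Age", "Balance"])
```

Fitting the scaler on the training data only keeps test-set statistics out of the model, and the alignment step guarantees that both frames expose the same columns in the same order.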

Model Selection

  • Models evaluated:
    • Logistic Regression, Decision Tree, Random Forest, AdaBoost, Gradient Boosting, XGBoost, KNN
  • Metrics: Accuracy, Precision, Recall, F1-Score, ROC-AUC
  • Top models: Gradient Boosting and XGBoost
  • Recall prioritized → XGBoost selected
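A comparison like the one described above might look like the sketch below. It uses synthetic data from `make_classification` as a stand-in for the churn set and only three scikit-learn models to keep it short; the split size, seeds, and model settings are assumptions, not the project's values. An `XGBClassifier` could be added to the same dictionary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the churn data (about 20% positives)
X, y = make_classification(n_samples=600, n_features=8,
                           weights=[0.8, 0.2], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
    "GradientBoosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_va)               # hard labels at the default 0.5 threshold
    proba = model.predict_proba(X_va)[:, 1]  # probability of the positive (churn) class
    results[name] = {
        "accuracy": accuracy_score(y_va, pred),
        "precision": precision_score(y_va, pred, zero_division=0),
        "recall": recall_score(y_va, pred),
        "f1": f1_score(y_va, pred),
        "roc_auc": roc_auc_score(y_va, proba),
    }

# Pick the model with the highest recall, mirroring the selection rule above
best_by_recall = max(results, key=lambda n: results[n]["recall"])
```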

Hyperparameter Tuning

  • Gradient Boosting: n_estimators, learning_rate, max_depth, subsample
  • XGBoost: n_estimators, learning_rate, max_depth, subsample, colsample_bytree
  • Used GridSearchCV with StratifiedKFold (5 splits)
  • Optimal XGBoost parameters:
from xgboost import XGBClassifier

# Best parameter combination found by the grid search
best_xgb = XGBClassifier(
    colsample_bytree=1,
    learning_rate=0.05,
    max_depth=5,
    n_estimators=200,
    subsample=0.7
)
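
The search setup described above (GridSearchCV with a 5-split StratifiedKFold) can be sketched as follows. To stay within scikit-learn, the example tunes `GradientBoostingClassifier` on synthetic data; the grid values, the `scoring="recall"` choice, and the shuffle seed are assumptions, since the README lists only the parameter names. For XGBoost, the same pattern would apply with `XGBClassifier` and an extra `colsample_bytree` entry in the grid.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in for the churn data
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.8, 0.2], random_state=0)

# Illustrative grid over the parameters named above (values are assumptions)
param_grid = {
    "n_estimators": [20, 40],
    "learning_rate": [0.05, 0.1],
    "max_depth": [3],
    "subsample": [0.7, 1.0],
}

# Stratified folds keep the churn rate similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="recall",  # assumed, to match the project's recall priority
    cv=cv,
)
search.fit(X, y)
best_params = search.best_params_
```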

Sample predictions for test.csv (same format as sample_submission.csv):
id,Exited
165034,0.036570
165035,0.144758
165036,0.083989
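
A file in this `id,Exited` layout could be produced along the lines below. The helper name `build_submission` is hypothetical, and the probabilities are simply the three values shown above; in practice they would come from something like `model.predict_proba(X_test)[:, 1]`.

```python
import io

import numpy as np
import pandas as pd

def build_submission(ids, churn_proba):
    """Pair each test id with its predicted churn probability (hypothetical helper)."""
    return pd.DataFrame({"id": ids, "Exited": churn_proba})

ids = [165034, 165035, 165036]
proba = np.array([0.036570, 0.144758, 0.083989])  # values copied from the rows above

submission = build_submission(ids, proba)

# Write to an in-memory buffer here; pass a file path instead to save to disk
buf = io.StringIO()
submission.to_csv(buf, index=False, float_format="%.6f")
csv_text = buf.getvalue()
```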

Final Model Performance (Test Split)

Metric      XGBoost
Accuracy    0.8663
Precision   0.739
Recall      0.568
F1-Score    0.642
ROC-AUC     0.8899
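
As a quick consistency check on the table, the F1-Score is the harmonic mean of precision and recall, so it can be recomputed from the two reported values:

```python
precision = 0.739
recall = 0.568

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.642
```

The result agrees with the reported F1-Score. A recall of 0.568 means the model flags roughly 57 of every 100 actual churners in the test split.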

Key Takeaways

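  1. Recall is the business priority because a missed churner costs more than a false alarm; the final model reaches a recall of 0.568 on the test split
  2. XGBoost and Gradient Boosting were the top models; XGBoost was selected because recall was prioritized
  3. Preprocessing (OHE + StandardScaler), with train and test columns aligned, keeps features consistent between training and prediction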
  4. Hyperparameter tuning improves model metrics (especially ROC-AUC and recall)
  5. Deployment on Streamlit allows interactive, real-time predictions

Technologies & Libraries Used

Programming Language: Python 3.10

Libraries:

  • pandas – data manipulation and analysis
  • numpy – numerical computations
  • scikit-learn – preprocessing, model selection, and evaluation
  • xgboost – XGBoost classifier for modeling
  • matplotlib & seaborn – data visualization
  • streamlit – web app deployment