Binary classification model to predict customer creditworthiness, built for a fictional bank (Pro National Bank) to support credit card application decisions.
Given anonymized customer data, the model predicts whether a customer has a high credit rating (TARGET=1) or not (TARGET=0). The task presents a significant class imbalance: only 8.78% of observations belong to the positive class.
The project evaluates 8 base classifiers across 6 feature selection strategies, resulting in 291 model configurations compared systematically.
~338,000 customer records with 19 features including demographics, employment, income, and contact information.
https://proai-datasets.s3.eu-west-3.amazonaws.com/credit_scoring.csv
Key preprocessing decisions:
- `DAYS_BIRTH` converted to positive values (`DAYS_BIRTH_POSITIVE`)
- `FLAG_MOBIL` dropped - 100% of values identical, zero predictive value
- `DAYS_EMPLOYED` anomalies handled - values exceeding human lifespan (~1,000 years) identified as Pensioners and treated separately
- `OCCUPATION_TYPE` NaN values (13.44% of data) replaced with `"unknown"` - neither deletion nor imputation was justified after systematic analysis across all other features
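The preprocessing decisions above could be sketched in pandas roughly as follows. This is an illustration, not the notebook's code: the function name, the `DAYS_EMPLOYED_ANOMALY` flag column, the 100-year cutoff, and the sample values are all assumptions, and the notebook may treat the pensioner rows differently.

```python
import pandas as pd

# Assumption: anything implying more than ~100 years of employment is anomalous.
MAX_PLAUSIBLE_DAYS = 100 * 365


def clean_credit_data(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the four preprocessing decisions to a copy of df (hypothetical helper)."""
    out = df.copy()
    # Store the age in days as a positive count.
    out["DAYS_BIRTH_POSITIVE"] = out["DAYS_BIRTH"].abs()
    # FLAG_MOBIL is constant, so it carries no information (dropping the
    # original DAYS_BIRTH as well is an assumption of this sketch).
    out = out.drop(columns=["DAYS_BIRTH", "FLAG_MOBIL"])
    # Flag implausible employment durations so they can be handled separately.
    out["DAYS_EMPLOYED_ANOMALY"] = out["DAYS_EMPLOYED"].abs() > MAX_PLAUSIBLE_DAYS
    # Missing occupations become their own category.
    out["OCCUPATION_TYPE"] = out["OCCUPATION_TYPE"].fillna("unknown")
    return out


# Illustrative rows only; these are not taken from the dataset.
sample = pd.DataFrame(
    {
        "DAYS_BIRTH": [-12000, -20000, -15000],
        "FLAG_MOBIL": [1, 1, 1],
        "DAYS_EMPLOYED": [-3000, 365243, -100],
        "OCCUPATION_TYPE": ["Laborers", None, "Managers"],
    }
)
cleaned = clean_credit_data(sample)
print(cleaned)
```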
Preprocessing (ColumnTransformer):
- Numerical features → StandardScaler
- Nominal categorical → OneHotEncoder
- Ordinal categorical (`NAME_EDUCATION_TYPE`) → OrdinalEncoder with explicit category ordering
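A minimal sketch of such a ColumnTransformer is shown below. The column groupings, the `AMT_INCOME_TOTAL` column, and the education category labels and their order are assumptions made for illustration; the notebook defines the actual lists.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Hypothetical column groupings and category order (lowest to highest).
numeric_cols = ["AMT_INCOME_TOTAL", "DAYS_BIRTH_POSITIVE"]
nominal_cols = ["OCCUPATION_TYPE"]
ordinal_cols = ["NAME_EDUCATION_TYPE"]
education_order = [
    "Lower secondary",
    "Secondary / secondary special",
    "Incomplete higher",
    "Higher education",
    "Academic degree",
]

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_cols),
        ("nom", OneHotEncoder(handle_unknown="ignore"), nominal_cols),
        ("ord", OrdinalEncoder(categories=[education_order]), ordinal_cols),
    ],
    sparse_threshold=0,  # always return a dense array, for easy inspection
)

# Illustrative rows only.
sample = pd.DataFrame(
    {
        "AMT_INCOME_TOTAL": [50000.0, 120000.0, 80000.0],
        "DAYS_BIRTH_POSITIVE": [12000, 20000, 15000],
        "OCCUPATION_TYPE": ["Laborers", "unknown", "Managers"],
        "NAME_EDUCATION_TYPE": [
            "Secondary / secondary special",
            "Higher education",
            "Lower secondary",
        ],
    }
)
X = preprocessor.fit_transform(sample)
print(X.shape)  # 2 scaled + 3 one-hot + 1 ordinal column
```

Passing the category list explicitly matters because OrdinalEncoder otherwise orders categories alphabetically, which would not reflect the education ranking.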
Feature selection strategies evaluated:
- Base (all features)
- VarianceThreshold
- Correlation Filtering
- VIF (Variance Inflation Factor)
- SelectKBest (k = 5, 10, 15, 20, 25, 30, 35, 40)
- RFE - Recursive Feature Elimination (k = 5–40)
- Embedded Methods - hyperparameter tuning with RandomizedSearchCV (regularization, tree depth, class weighting)
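Several of these strategies map directly onto scikit-learn classes. The sketch below runs them on synthetic imbalanced data rather than the project dataset; the score function, the RFE base estimator, the search space, and every parameter value are assumptions chosen to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, VarianceThreshold, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in with roughly 9% positives, mimicking the class imbalance.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5,
    weights=[0.91, 0.09], random_state=0,
)

# Filter: drop features whose variance is zero.
X_var = VarianceThreshold(threshold=0.0).fit_transform(X)

# Filter: keep the k features with the highest ANOVA F-score.
X_kbest = SelectKBest(score_func=f_classif, k=10).fit_transform(X, y)

# Wrapper: recursively eliminate features using a linear model's coefficients.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)
X_rfe = rfe.fit_transform(X, y)

# Embedded: tune tree depth and class weighting, optimizing for recall.
search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_distributions={
        "max_depth": [3, 5, None],
        "class_weight": [None, "balanced"],
    },
    n_iter=3, scoring="recall", cv=3, random_state=0,
)
search.fit(X, y)
print(X_kbest.shape, X_rfe.shape, search.best_params_)
```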
291 configurations evaluated (8 classifiers × 6 feature selection strategies + hyperparameter tuning). Final ranking filtered by Accuracy > 0.90 and sorted by Recall to minimize false negatives - failing to identify a creditworthy customer is the costliest error in this context.
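The ranking rule (keep Accuracy > 0.90, then sort by Recall) can be expressed in a few lines of pandas. The helper name and the toy rows here are hypothetical; only the threshold and sort key come from the description above.

```python
import pandas as pd


def rank_models(results: pd.DataFrame, min_accuracy: float = 0.90) -> pd.DataFrame:
    """Keep rows above the accuracy floor, best Recall first (hypothetical helper)."""
    kept = results[results["Accuracy"] > min_accuracy]
    return kept.sort_values("Recall", ascending=False).reset_index(drop=True)


# Toy values for illustration, not results from the project.
toy = pd.DataFrame(
    {
        "Model": ["A", "B", "C"],
        "Accuracy": [0.96, 0.83, 0.95],
        "Recall": [0.97, 0.99, 0.98],
    }
)
ranked = rank_models(toy)
print(ranked)
```

Model B has the highest Recall in the toy data but is excluded by the accuracy floor, which is the same effect the filter has on low-accuracy configurations.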
| Model | Accuracy | F1 | Precision | Recall | ROC-AUC | FN |
|---|---|---|---|---|---|---|
| Random Forest emb | 0.96 | 0.823 | 0.699 | 1.000 | 0.979 | 0 |
| Gradient Boosting rfe k:35 | 0.96 | 0.823 | 0.699 | 0.999 | 0.979 | 5 |
| Gradient Boosting CorrelationFiltering | 0.96 | 0.823 | 0.699 | 0.999 | 0.979 | 5 |
| Gradient Boosting rfe k:30 | 0.96 | 0.823 | 0.699 | 0.999 | 0.979 | 5 |
| Gradient Boosting rfe k:20 | 0.96 | 0.823 | 0.699 | 0.999 | 0.979 | 5 |
Top 5 of 291 configurations. Full results in sorted_results_df.csv.
Best model: Random Forest (Embedded) - the only configuration achieving perfect Recall (1.000, 0 false negatives) with F1 = 0.823. Total training time: ~226 minutes.
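As a quick consistency check, the reported F1 follows from the Precision and Recall in the table via the standard harmonic-mean formula. The function name is an illustrative choice, and the inputs are the rounded table values, so only three decimals are meaningful.

```python
def f1_from(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded values from the results table for the Random Forest (Embedded) row.
f1 = f1_from(0.699, 1.000)
print(round(f1, 3))
```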
- Tree-based ensemble methods (GBM, Random Forest, XGBoost) consistently outperform linear models on this dataset
- Feature selection strategies did not significantly improve performance over the base configuration - the full feature set was already informative
- Logistic Regression reaches high Recall only at the cost of Accuracy (0.828), making it unsuitable for production despite its interpretability
Python 3.9 · pandas 2.2.2 · NumPy 2.0.0 · scikit-learn 1.5.1 · XGBoost 2.1.0 · SciPy 1.13.1 · statsmodels 0.14.2 · Matplotlib 3.9.1 · Seaborn 0.13.2 · tqdm 4.66.4
The notebook runs in a local Anaconda environment.
conda env create -f environment.yml
conda activate creditworthiness_project
jupyter notebook

Download the dataset manually and place it in the project folder:
https://proai-datasets.s3.eu-west-3.amazonaws.com/credit_scoring.csv
Open Prediction_of_creditworthiness_for_issuing_a_credit_card.ipynb and run cells sequentially.
creditworthiness/
├── Prediction_of_creditworthiness_for_issuing_a_credit_card.ipynb # Main notebook
├── environment.yml # Conda environment
├── results_df.csv # All 401 model results
└── sorted_results_df.csv # Final ranked results