Creditworthiness Prediction for Credit Card Issuance

A binary classification model that predicts customer creditworthiness, built for a fictional bank (Pro National Bank) to support credit card application decisions.

Overview

Given anonymized customer data, the model predicts whether a customer has a high credit rating (TARGET=1) or not (TARGET=0). The task presents a significant class imbalance: only 8.78% of observations belong to the positive class.
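
To put the imbalance in perspective, the short calculation below derives two numbers from the 8.78% figure: the accuracy of a trivial model that always predicts TARGET=0, and the per-class weights that scikit-learn's class_weight="balanced" heuristic (n_samples / (n_classes * class_count)) would assign. This is an illustrative sketch, not code from the notebook; the variable names are made up for this example.

```python
# Share of positive observations reported in the Overview.
positive_rate = 0.0878
negative_rate = 1.0 - positive_rate

# A model that always predicts TARGET=0 is right on every negative row.
majority_baseline_accuracy = negative_rate

# "balanced" class weights, expressed with class shares instead of counts:
# n_samples / (n_classes * class_count) == 1 / (n_classes * class_share)
n_classes = 2
weight_positive = 1.0 / (n_classes * positive_rate)
weight_negative = 1.0 / (n_classes * negative_rate)

print(f"baseline accuracy: {majority_baseline_accuracy:.4f}")
print(f"balanced weights: positive={weight_positive:.2f}, negative={weight_negative:.2f}")
```

Because the trivial baseline already exceeds 0.91 accuracy, accuracy alone says little on this dataset, which is consistent with ranking the models by Recall as well.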

The project evaluates 8 base classifiers across 6 feature selection strategies, plus hyperparameter-tuned variants, for a total of 291 model configurations compared systematically.

Dataset

~338,000 customer records with 19 features including demographics, employment, income, and contact information.

https://proai-datasets.s3.eu-west-3.amazonaws.com/credit_scoring.csv

Key preprocessing decisions:

DAYS_BIRTH converted to positive values (DAYS_BIRTH_POSITIVE)
FLAG_MOBIL dropped - 100% of values identical, zero predictive value
DAYS_EMPLOYED anomalies handled - values implying roughly 1,000 years of employment, far beyond a human lifespan, were identified as Pensioners and treated separately
OCCUPATION_TYPE NaN values (13.44% of data) replaced with "unknown" - neither deletion nor imputation was justified after systematic analysis across all other features
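
The decisions above could be sketched in pandas roughly as follows. This is a minimal illustration on made-up rows, not the notebook's code: the function name preprocess, the IS_PENSIONER flag column, and the 100-year plausibility cutoff are assumptions introduced here, and the notebook may treat the pensioner rows differently.

```python
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the four preprocessing decisions to a copy of df."""
    out = df.copy()

    # DAYS_BIRTH is assumed to be stored as negative day counts.
    out["DAYS_BIRTH_POSITIVE"] = out["DAYS_BIRTH"].abs()

    # Drop FLAG_MOBIL only if it really is constant.
    if out["FLAG_MOBIL"].nunique(dropna=False) == 1:
        out = out.drop(columns=["FLAG_MOBIL"])

    # Flag implausible employment durations (hypothetical cutoff: 100 years).
    max_plausible_days = 100 * 365
    out["IS_PENSIONER"] = (out["DAYS_EMPLOYED"].abs() > max_plausible_days).astype(int)

    # Keep missing occupations as their own category.
    out["OCCUPATION_TYPE"] = out["OCCUPATION_TYPE"].fillna("unknown")
    return out


# Toy rows for illustration only; the values are invented.
toy = pd.DataFrame({
    "DAYS_BIRTH": [-12000, -20000, -23000],
    "FLAG_MOBIL": [1, 1, 1],
    "DAYS_EMPLOYED": [-1500, -300, 365000],
    "OCCUPATION_TYPE": ["Drivers", None, None],
})
clean = preprocess(toy)
print(clean)
```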

Pipeline

Preprocessing (ColumnTransformer):

Numerical features → StandardScaler
Nominal categorical → OneHotEncoder
Ordinal categorical (NAME_EDUCATION_TYPE) → OrdinalEncoder with explicit category ordering
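
A ColumnTransformer wired up along these lines might look like the sketch below. Only NAME_EDUCATION_TYPE, OCCUPATION_TYPE and DAYS_BIRTH_POSITIVE are names taken from this README; AMT_INCOME_TOTAL, the education category labels and their order, and the toy rows are assumptions made for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

numeric_cols = ["AMT_INCOME_TOTAL", "DAYS_BIRTH_POSITIVE"]  # first name is hypothetical
nominal_cols = ["OCCUPATION_TYPE"]
ordinal_cols = ["NAME_EDUCATION_TYPE"]

# Assumed ordering, lowest to highest; the real labels may differ.
education_order = [["Lower secondary", "Secondary", "Higher education"]]

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_cols),
        ("nom", OneHotEncoder(handle_unknown="ignore"), nominal_cols),
        ("ord", OrdinalEncoder(categories=education_order), ordinal_cols),
    ],
    sparse_threshold=0,  # always return a dense array for easy inspection
)

# Invented rows, for illustration only.
toy = pd.DataFrame({
    "AMT_INCOME_TOTAL": [100000.0, 250000.0, 80000.0, 120000.0],
    "DAYS_BIRTH_POSITIVE": [12000, 20000, 23000, 15000],
    "OCCUPATION_TYPE": ["Drivers", "unknown", "Drivers", "Managers"],
    "NAME_EDUCATION_TYPE": ["Secondary", "Higher education", "Lower secondary", "Secondary"],
})

X = preprocessor.fit_transform(toy)
print(X.shape)  # 2 scaled + 3 one-hot + 1 ordinal column
```

Passing the category order explicitly keeps the ordinal codes stable and meaningful, instead of relying on alphabetical order.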

Feature selection strategies evaluated:

Base (all features)
VarianceThreshold
Correlation Filtering
VIF (Variance Inflation Factor)
SelectKBest (k = 5, 10, 15, 20, 25, 30, 35, 40)
RFE - Recursive Feature Elimination (k = 5–40)
Embedded Methods - hyperparameter tuning with RandomizedSearchCV (regularization, tree depth, class weighting)
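
As a rough illustration of how two of these strategies plug into scikit-learn, the sketch below runs SelectKBest for a few values of k and RFE for one target size on synthetic, imbalanced data. The data, the f_classif score function, the Logistic Regression estimator, and the RFE step size are all assumptions chosen for the example; the README does not state which ones the notebook uses.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the real data: 40 features, about 10% positives.
X, y = make_classification(
    n_samples=500, n_features=40, n_informative=8,
    weights=[0.9, 0.1], random_state=0,
)

selected_counts = {}

# SelectKBest: keep the k features with the highest univariate scores.
for k in (5, 10, 15):
    pipe = Pipeline([
        ("select", SelectKBest(score_func=f_classif, k=k)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    pipe.fit(X, y)
    selected_counts[f"kbest_{k}"] = int(pipe.named_steps["select"].get_support().sum())

# RFE: repeatedly drop the weakest features until 10 remain.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10, step=5)
rfe.fit(X, y)
selected_counts["rfe_10"] = int(rfe.support_.sum())

print(selected_counts)
```

Placing the selector inside the Pipeline means it is refit on each training fold during cross-validation, which avoids leaking information from held-out data into the feature choice.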

Models

291 configurations evaluated (8 classifiers × 6 feature selection strategies + hyperparameter tuning). Final ranking filtered by Accuracy > 0.90 and sorted by Recall to minimize false negatives - failing to identify a creditworthy customer is the costliest error in this context.

Model	Accuracy	F1	Precision	Recall	ROC-AUC	FN
Random Forest emb	0.96	0.823	0.699	1.000	0.979	0
Gradient Boosting rfe k:35	0.96	0.823	0.699	0.999	0.979	5
Gradient Boosting CorrelationFiltering	0.96	0.823	0.699	0.999	0.979	5
Gradient Boosting rfe k:30	0.96	0.823	0.699	0.999	0.979	5
Gradient Boosting rfe k:20	0.96	0.823	0.699	0.999	0.979	5

Top 5 of 291 configurations. Full results in sorted_results_df.csv.
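
The ranking rule described above (keep configurations with Accuracy > 0.90, then sort by Recall) can be expressed in a few lines of pandas. The rows below are invented placeholders, not project results, and the column names are assumed to match the table header.

```python
import pandas as pd

# Placeholder results; model names and scores are made up for illustration.
results = pd.DataFrame({
    "Model": ["model_a", "model_b", "model_c", "model_d"],
    "Accuracy": [0.95, 0.83, 0.92, 0.97],
    "Recall": [0.91, 0.99, 0.97, 0.88],
})

ranked = (
    results[results["Accuracy"] > 0.90]      # discard low-accuracy configurations
    .sort_values("Recall", ascending=False)  # highest Recall first
    .reset_index(drop=True)
)
print(ranked)
```

In this toy example model_b has the highest Recall but is filtered out by the accuracy threshold, which mirrors the Logistic Regression case noted under Key Findings.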

Best model: Random Forest (Embedded) - the only configuration achieving perfect Recall (1.000, 0 false negatives) with F1 = 0.823. Total training time: ~226 minutes.
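
As a quick consistency check, the reported F1 can be recomputed from the Precision and Recall in the table using F1 = 2PR / (P + R). The snippet is a worked calculation on the published numbers only; nothing here is taken from the notebook.

```python
# Precision and Recall of the best model, as reported in the results table.
precision = 0.699
recall = 1.000

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Share of predicted positives that are actually negative.
false_discovery_rate = 1 - precision

print(round(f1, 3))                    # 0.823
print(round(false_discovery_rate, 3))  # 0.301
```

The second number is the trade-off behind the perfect Recall: roughly 30% of the customers flagged as creditworthy are false positives.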

Key Findings

Tree-based ensemble methods (GBM, Random Forest, XGBoost) consistently outperform linear models on this dataset
Feature selection strategies did not significantly improve performance over the base configuration - the full feature set was already informative
Logistic Regression achieves high Recall only at the cost of Accuracy, which drops to 0.828, making it unsuitable for production despite its interpretability

Tech Stack

Python 3.9 · pandas 2.2.2 · NumPy 2.0.0 · scikit-learn 1.5.1 · XGBoost 2.1.0 · SciPy 1.13.1 · statsmodels 0.14.2 · Matplotlib 3.9.1 · Seaborn 0.13.2 · tqdm 4.66.4

How to Run

The notebook runs in a local Anaconda environment.

conda env create -f environment.yml
conda activate creditworthiness_project
jupyter notebook

Download the dataset manually and place it in the project folder:

https://proai-datasets.s3.eu-west-3.amazonaws.com/credit_scoring.csv

Open Prediction_of_creditworthiness_for_issuing_a_credit_card.ipynb and run cells sequentially.
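
Before running the notebook, a quick sanity check of the downloaded file can catch a truncated or misplaced CSV. The helper below is a hypothetical convenience, not part of the repository; it assumes the file was saved as credit_scoring.csv in the project folder and that the label column is named TARGET.

```python
from pathlib import Path

import pandas as pd


def summarize_dataset(csv_path, target_col="TARGET"):
    """Return row count, column count, and share of positive labels."""
    df = pd.read_csv(csv_path)
    return {
        "rows": len(df),
        "columns": df.shape[1],
        "positive_rate": float(df[target_col].mean()),
    }


# Only runs if the dataset is present in the current directory.
if Path("credit_scoring.csv").exists():
    print(summarize_dataset("credit_scoring.csv"))
```

If the download is intact, the summary should be in line with the Dataset and Overview sections: roughly 338,000 rows and a positive rate near 8.78%.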

Project Structure

creditworthiness/
├── Prediction_of_creditworthiness_for_issuing_a_credit_card.ipynb   # Main notebook
├── environment.yml                                                    # Conda environment
├── results_df.csv                                                     # All 401 model results
└── sorted_results_df.csv                                              # Final ranked results
