This project shifts the focus from predicting continuous values to solving classification problems. I explored how to teach algorithms to categorize data, distinguish "lemon" cars from good ones, and evaluate these decisions using industry-standard metrics.
- Classification Models: Implementing Logistic Regression, Naive Bayes (NB), and K-Nearest Neighbors (KNN).
- Probabilistic Framework: Understanding Maximum Likelihood Estimation (MLE) and Sigmoid activation.
- Non-linear Hacks: Transforming features via binning, polynomial expansion, and group-based mapping to enhance linear models.
- Evaluation Metrics: Mastering Confusion Matrices, Precision-Recall trade-offs, and F-beta scores.
- Global Metrics: Implementing ROC AUC and Gini Coefficient from scratch.
I implemented and compared three distinct classification philosophies:
- Logistic Regression: Derived the Negative Log-Likelihood (NLL) gradient and built a custom SGD-based solver.
- Naive Bayes: Leveraged Bayesian inference and feature independence assumptions for rapid, count-based classification.
- KNN: Explored memory-based learning and the "curse of dimensionality," emphasizing the importance of feature scaling.
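As a minimal sketch of the first philosophy above, here is the NLL-gradient SGD update for logistic regression. The function names and hyperparameters are illustrative, not the project's exact solver:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg_sgd(X, y, lr=0.1, epochs=100, seed=0):
    """Minimize the NLL with per-sample SGD; w includes a bias term."""
    rng = np.random.default_rng(seed)
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            p = sigmoid(Xb[i] @ w)
            # d(NLL)/dw for one sample is simply (p - y) * x
            w -= lr * (p - y[i]) * Xb[i]
    return w

def predict_proba(X, w):
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return sigmoid(Xb @ w)
```

The compact gradient form `(p - y) * x` falls out of differentiating the NLL through the sigmoid, which is what makes this update so cheap per sample.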
Using the Don’t Get Kicked Kaggle dataset, I implemented a strict chronological split (Train < Valid < Test). This ensured the model was tested on a genuine "future" timeframe, preventing temporal data leakage.
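The chronological split can be sketched as follows; the column name `PurchDate` comes from the Don't Get Kicked data, while the helper name and quantile cut points are assumptions for illustration:

```python
import pandas as pd

def chrono_split(df, date_col="PurchDate", q_train=0.7, q_valid=0.85):
    """Split by time so validation and test lie strictly after training."""
    df = df.sort_values(date_col)
    t1, t2 = df[date_col].quantile([q_train, q_valid])
    train = df[df[date_col] < t1]
    valid = df[(df[date_col] >= t1) & (df[date_col] < t2)]
    test = df[df[date_col] >= t2]
    return train, valid, test
```

Because every validation and test row is dated after the last training row, the model can never peek at future information during fitting.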
I practiced the "Credit Scorecard" approach: transforming continuous variables into categorical bins. This non-linear transformation allows simple logistic models to capture complex dependencies while remaining 100% explainable to stakeholders.
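A scorecard-style binning step might look like this sketch: a continuous feature becomes one-hot bin indicators that a linear model can weight independently (the feature name and bin count are illustrative):

```python
import numpy as np
import pandas as pd

def bin_feature(series, n_bins=5):
    """Quantile-bin a continuous column and one-hot encode the bins."""
    binned = pd.qcut(series, q=n_bins, duplicates="drop")
    return pd.get_dummies(binned, prefix=series.name)

# Example: odometer readings split into 5 equal-frequency bins
odometer = pd.Series(np.arange(0, 100000, 1000), name="VehOdo")
dummies = bin_feature(odometer)
```

Each bin gets its own coefficient in the logistic model, so the fitted weights read directly as "risk per mileage band" — the property that keeps scorecards explainable to stakeholders.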
I moved beyond simple Accuracy to more nuanced metrics:
- Precision vs. Recall: Analyzed which error (False Positive vs. False Negative) is more costly for "lemon" car detection.
- Gini Coefficient: Manually implemented the Gini score ($2 \cdot AUC - 1$) and validated it against sklearn benchmarks, achieving the project's target of >0.15.
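A from-scratch version of these global metrics can be written via the rank interpretation of AUC, i.e. the probability that a random positive outscores a random negative; this pairwise sketch is quadratic in memory, so it is for illustration rather than large datasets:

```python
import numpy as np

def roc_auc(y_true, scores):
    """AUC as P(score_pos > score_neg), counting ties as 1/2."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def gini(y_true, scores):
    """Gini coefficient derived from AUC."""
    return 2.0 * roc_auc(y_true, scores) - 1.0
```

On small inputs this matches sklearn's `roc_auc_score` exactly, which is a convenient sanity check for the custom implementation.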
By comparing my custom implementations with sklearn models, I achieved consistent results across all datasets. The project concluded with a hyperparameter tuning phase where I optimized the balance between model complexity (via L1 regularization) and predictive power.
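The tuning loop described above can be sketched with sklearn's L1-penalized logistic regression; the C grid and the Gini-based selection criterion here are illustrative assumptions, not the project's exact search:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def tune_l1(X_tr, y_tr, X_va, y_va, Cs=(0.01, 0.1, 1.0, 10.0)):
    """Pick the L1 strength (via C) that maximizes validation Gini."""
    best_C, best_gini = None, -1.0
    for C in Cs:
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        clf.fit(X_tr, y_tr)
        g = 2 * roc_auc_score(y_va, clf.predict_proba(X_va)[:, 1]) - 1
        if g > best_gini:
            best_C, best_gini = C, g
    return best_C, best_gini
```

Smaller C means stronger L1 regularization, which zeroes out weak coefficients — this is the complexity-versus-predictive-power dial the tuning phase turns.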
This module proved that the choice of metric is just as important as the choice of algorithm. I now understand how to align machine learning outputs with specific business goals, whether the priority is exhaustive detection (Recall) or pinpoint accuracy (Precision).
- Clone the repository:
  ```bash
  git clone https://github.com/knight99rus/ML4_Classification_problems.git
  cd ML4_Classification_problems
  ```
- Create and activate a virtual environment (recommended):
  ```bash
  python -m venv venv
  source venv/bin/activate  # For Windows: venv\Scripts\activate
  ```
- Install dependencies:
  ```bash
  pip install jupyter pandas numpy scikit-learn
  ```
- Download data:
  - Read the task on the Kaggle competition page.
  - Download the `training.csv` file.
- Launch Jupyter Notebook:
  ```bash
  jupyter notebook
  ```
  Open and execute the cells in the `project04.ipynb` file.