This project shifts the focus from predicting continuous values to solving classification problems. I explored how to teach algorithms to categorize data, distinguish "lemon" cars from good ones, and evaluate these decisions using industry-standard metrics.
- Classification Models: Implementing Logistic Regression, Naive Bayes (NB), and K-Nearest Neighbors (KNN).
- Probabilistic Framework: Understanding Maximum Likelihood Estimation (MLE) and Sigmoid activation.
- Non-linear Hacks: Transforming features via binning, polynomial expansion, and group-based mapping to enhance linear models.
- Evaluation Metrics: Mastering Confusion Matrices, Precision-Recall trade-offs, and F-beta scores.
- Global Metrics: Implementing ROC AUC and Gini Coefficient from scratch.
I implemented and compared three distinct classification philosophies:
- Logistic Regression: Derived the Negative Log-Likelihood (NLL) gradient and built a custom SGD-based solver.
- Naive Bayes: Leveraged Bayesian inference and feature independence assumptions for rapid, count-based classification.
- KNN: Explored memory-based learning and the "curse of dimensionality," emphasizing the importance of feature scaling.
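As a minimal sketch of the first philosophy above, here is the NLL-gradient SGD update for logistic regression. The function names and hyperparameters are illustrative, not the project's exact solver:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg_sgd(X, y, lr=0.1, epochs=100, seed=0):
    """Minimize the NLL with per-sample SGD; w includes a bias term."""
    rng = np.random.default_rng(seed)
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            p = sigmoid(Xb[i] @ w)
            # d(NLL)/dw for one sample is simply (p - y) * x
            w -= lr * (p - y[i]) * Xb[i]
    return w

def predict_proba(X, w):
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return sigmoid(Xb @ w)
```

The compact gradient form `(p - y) * x` falls out of differentiating the NLL through the sigmoid, which is what makes this update so cheap per sample.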
Using the Don’t Get Kicked Kaggle dataset, I implemented a strict chronological split (Train < Valid < Test). This ensured the model was tested on a genuine "future" timeframe, preventing temporal data leakage.
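The chronological split can be sketched as follows; the column name `PurchDate` comes from the Don't Get Kicked data, while the helper name and quantile cut points are assumptions for illustration:

```python
import pandas as pd

def chrono_split(df, date_col="PurchDate", q_train=0.7, q_valid=0.85):
    """Split by time so validation and test lie strictly after training."""
    df = df.sort_values(date_col)
    t1, t2 = df[date_col].quantile([q_train, q_valid])
    train = df[df[date_col] < t1]
    valid = df[(df[date_col] >= t1) & (df[date_col] < t2)]
    test = df[df[date_col] >= t2]
    return train, valid, test
```

Because every validation and test row is dated after the last training row, the model can never peek at future information during fitting.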
I practiced the "Credit Scorecard" approach: transforming continuous variables into categorical bins. This non-linear transformation allows simple logistic models to capture complex dependencies while remaining 100% explainable to stakeholders.
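A scorecard-style binning step might look like this sketch: a continuous feature becomes one-hot bin indicators that a linear model can weight independently (the feature name and bin count are illustrative):

```python
import numpy as np
import pandas as pd

def bin_feature(series, n_bins=5):
    """Quantile-bin a continuous column and one-hot encode the bins."""
    binned = pd.qcut(series, q=n_bins, duplicates="drop")
    return pd.get_dummies(binned, prefix=series.name)

# Example: odometer readings split into 5 equal-frequency bins
odometer = pd.Series(np.arange(0, 100000, 1000), name="VehOdo")
dummies = bin_feature(odometer)
```

Each bin gets its own coefficient in the logistic model, so the fitted weights read directly as "risk per mileage band" — the property that keeps scorecards explainable to stakeholders.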
I moved beyond simple Accuracy to more nuanced metrics:
- Precision vs. Recall: Analyzed which error (False Positive vs. False Negative) is more costly for "lemon" car detection.
- Gini Coefficient: Manually implemented the Gini score ($2 \cdot AUC - 1$) and validated it against sklearn benchmarks, achieving the project's target of >0.15.
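A from-scratch version of these global metrics can be written via the rank interpretation of AUC, i.e. the probability that a random positive outscores a random negative; this pairwise sketch is quadratic in memory, so it is for illustration rather than large datasets:

```python
import numpy as np

def roc_auc(y_true, scores):
    """AUC as P(score_pos > score_neg), counting ties as 1/2."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def gini(y_true, scores):
    """Gini coefficient derived from AUC."""
    return 2.0 * roc_auc(y_true, scores) - 1.0
```

On small inputs this matches sklearn's `roc_auc_score` exactly, which is a convenient sanity check for the custom implementation.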
By comparing my custom implementations with sklearn models, I achieved consistent results across all datasets. The project concluded with a hyperparameter tuning phase where I optimized the balance between model complexity (via L1 regularization) and predictive power.
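The tuning loop described above can be sketched with sklearn's L1-penalized logistic regression; the C grid and the Gini-based selection criterion here are illustrative assumptions, not the project's exact search:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def tune_l1(X_tr, y_tr, X_va, y_va, Cs=(0.01, 0.1, 1.0, 10.0)):
    """Pick the L1 strength (via C) that maximizes validation Gini."""
    best_C, best_gini = None, -1.0
    for C in Cs:
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        clf.fit(X_tr, y_tr)
        g = 2 * roc_auc_score(y_va, clf.predict_proba(X_va)[:, 1]) - 1
        if g > best_gini:
            best_C, best_gini = C, g
    return best_C, best_gini
```

Smaller C means stronger L1 regularization, which zeroes out weak coefficients — this is the complexity-versus-predictive-power dial the tuning phase turns.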
This module proved that the choice of metric is just as important as the choice of algorithm. I now understand how to align machine learning outputs with specific business goals, whether the priority is exhaustive detection (Recall) or pinpoint accuracy (Precision).
- Clone the repository:
  ```bash
  git clone https://github.com/knight99rus/ML4_Classification_problems.git
  cd ML4_Classification_problems
  ```
- Create and activate a virtual environment (recommended):
  ```bash
  python -m venv venv
  source venv/bin/activate  # For Windows: venv\Scripts\activate
  ```
- Install dependencies:
  ```bash
  pip install jupyter pandas numpy scikit-learn
  ```
- Download data:
  - Read the task on the Kaggle competition page.
  - Download the `training.csv` file.
- Launch Jupyter Notebook:
  ```bash
  jupyter notebook
  ```
  Open and execute the cells in the `project04.ipynb` file.