ML4: Classification Strategies

This project shifts the focus from predicting continuous values to solving classification problems. I explored how to teach algorithms to categorize data, distinguish "lemon" cars from good ones, and evaluate these decisions using industry-standard metrics.

Topics

  • Classification Models: Implementing Logistic Regression, Naive Bayes (NB), and K-Nearest Neighbors (KNN).
  • Probabilistic Framework: Understanding Maximum Likelihood Estimation (MLE) and Sigmoid activation.
  • Non-linear Hacks: Transforming features via binning, polynomial expansion, and group-based mapping to enhance linear models.
  • Evaluation Metrics: Mastering Confusion Matrices, Precision-Recall trade-offs, and F-beta scores.
  • Global Metrics: Implementing ROC AUC and Gini Coefficient from scratch.

Roadmap

1. The Algorithmic Triple Threat

I implemented and compared three distinct classification philosophies:

  • Logistic Regression: Derived the Negative Log-Likelihood (NLL) gradient and built a custom SGD-based solver.
  • Naive Bayes: Leveraged Bayesian inference and feature independence assumptions for rapid, count-based classification.
  • KNN: Explored memory-based learning and the "curse of dimensionality," emphasizing the importance of feature scaling.
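As a rough illustration of the first item, here is a minimal from-scratch logistic regression trained with SGD on the NLL gradient, for one sample ∂NLL/∂w = (σ(w·x) − y)·x. This is only a sketch in NumPy, not the repository's actual solver; function and parameter names are my own.

```python
import numpy as np

def sigmoid(z):
    # Clip to avoid overflow in exp for large |z|
    z = np.clip(z, -500, 500)
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg_sgd(X, y, lr=0.1, epochs=200, seed=0):
    """Minimize the Negative Log-Likelihood with plain SGD.

    Per-sample gradient of the NLL: (sigmoid(w @ x) - y) * x.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            p = sigmoid(X[i] @ w)
            w -= lr * (p - y[i]) * X[i]
    return w
```

The first column of `X` is assumed to be a constant 1 (the bias term), so no separate intercept update is needed.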

2. Temporal Validation Strategy

Using the Don’t Get Kicked Kaggle dataset, I implemented a strict chronological split (Train < Valid < Test). This ensured the model was tested on a genuine "future" timeframe, preventing temporal data leakage.
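A chronological split like the one described can be sketched as follows with pandas; the fraction sizes and the date-column name are illustrative assumptions, not the notebook's actual values.

```python
import pandas as pd

def chronological_split(df, date_col, train_frac=0.6, valid_frac=0.2):
    """Split rows by time: earliest -> train, middle -> valid, latest -> test.

    Sorting by date before slicing guarantees Train < Valid < Test in time,
    so the model is always evaluated on a "future" it has never seen.
    """
    df = df.sort_values(date_col).reset_index(drop=True)
    n = len(df)
    n_train = int(n * train_frac)
    n_valid = int(n * valid_frac)
    train = df.iloc[:n_train]
    valid = df.iloc[n_train:n_train + n_valid]
    test = df.iloc[n_train + n_valid:]
    return train, valid, test
```

Contrast this with a random `train_test_split`, which would let the model peek at future purchase patterns and inflate validation scores.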

3. Engineering Interpretability

I practiced the "Credit Scorecard" approach: transforming continuous variables into categorical bins. This non-linear transformation allows simple logistic models to capture complex dependencies while remaining 100% explainable to stakeholders.
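The binning idea can be sketched like this: learn quantile edges on the training split only, then one-hot encode the bins so the logistic model assigns an interpretable weight to each value range. This is a hedged illustration in pandas, not the notebook's exact code.

```python
import pandas as pd

def bin_feature(train_col, valid_col, n_bins=5):
    """Learn quantile bin edges on train, apply them to both splits,
    then one-hot encode so a linear model gets one weight per bin."""
    _, edges = pd.qcut(train_col, q=n_bins, retbins=True, duplicates="drop")
    # Extend the outer edges so unseen extreme values in valid still land in a bin
    edges[0], edges[-1] = -float("inf"), float("inf")
    binned_train = pd.cut(train_col, bins=edges)
    binned_valid = pd.cut(valid_col, bins=edges)
    return pd.get_dummies(binned_train), pd.get_dummies(binned_valid)
```

Because both splits share the same interval categories, the dummy columns line up, and each bin's learned coefficient reads directly as "cars in this price range are X more likely to be lemons".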

4. Deep Dive into Metrics

I moved beyond simple Accuracy to more nuanced metrics:

  • Precision vs. Recall: Analyzed which error (False Positive vs. False Negative) is more costly for "lemon" car detection.
  • Gini Coefficient: Manually implemented the Gini score ($2 \cdot AUC - 1$) and validated it against sklearn benchmarks, achieving the project's target of >0.15.
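A from-scratch Gini can be sketched via the rank-statistic view of AUC (the probability a random positive is scored above a random negative, ties counting half). This O(n²) pairwise version trades speed for clarity and is my own illustration, not the repository's implementation.

```python
import numpy as np

def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

def gini(y_true, y_score):
    # Gini rescales AUC from [0.5, 1] (random..perfect) to [0, 1]
    return 2 * roc_auc(y_true, y_score) - 1
```

On the same inputs this agrees with `sklearn.metrics.roc_auc_score`, which is exactly the validation described above.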

Results

By comparing my custom implementations with sklearn models, I achieved consistent results across all datasets. The project concluded with a hyperparameter tuning phase where I optimized the balance between model complexity (via L1 regularization) and predictive power.
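A tuning loop of the kind described might look like the sketch below, sweeping the inverse-regularization strength `C` of an L1-penalized scikit-learn model and selecting by validation AUC. The candidate grid is an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def tune_l1(X_train, y_train, X_valid, y_valid, Cs=(0.01, 0.1, 1.0, 10.0)):
    """Pick the C that maximizes validation AUC.

    Smaller C = stronger L1 penalty = sparser, simpler model;
    larger C = weaker penalty = more predictive power (and overfit risk).
    """
    best_c, best_auc = None, -1.0
    for c in Cs:
        model = LogisticRegression(penalty="l1", solver="liblinear", C=c)
        model.fit(X_train, y_train)
        auc = roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1])
        if auc > best_auc:
            best_c, best_auc = c, auc
    return best_c, best_auc
```

Note that `liblinear` is one of the few scikit-learn solvers that supports the L1 penalty, which is why it is hard-coded here.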

This module proved that the choice of metric is just as important as the choice of algorithm. I now understand how to align machine learning outputs with specific business goals, whether the priority is exhaustive detection (Recall) or pinpoint accuracy (Precision).

How to Run the Project

  1. Clone the repository:

    git clone https://github.com/knight99rus/ML4_Classification_problems.git
    cd ML4_Classification_problems
  2. Create and activate a virtual environment (recommended):

    python -m venv venv
    source venv/bin/activate  # For Windows: venv\Scripts\activate
  3. Install dependencies:

    pip install jupyter pandas numpy scikit-learn
  4. Download the data: the notebook uses the Don’t Get Kicked dataset from Kaggle.

  5. Launch Jupyter Notebook:

    jupyter notebook

    Open and execute the cells in the project04.ipynb file.
