A custom Python implementation of the Naive Bayes algorithm, featuring Laplace smoothing and hyperparameter analysis.
This repository contains a manual implementation of the Naive Bayes classifier. It is designed to handle categorical datasets using discrete probability distributions and includes a workflow that averages accuracy over multiple random splits.
- Custom Implementation: The training and prediction sections are built entirely with `NumPy` and `Pandas`, without relying on pre-built classifier libraries.
- Laplace Smoothing: Implements smoothing (controlled by the parameter $L$) to handle the zero-frequency problem in categorical data.
- Evaluation: The `iter_NBC` function runs the model 100 times with different random train/test splits to calculate an average accuracy.
- Hyperparameter Tuning: Includes an experiment (`L_effect`) to analyze how different smoothing values ($0.01, 1, 10, 100, \dots$) impact model performance.
- Scikit-Learn Comparison: Contains a wrapper to compare the custom model against sklearn's `GaussianNB`.
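The smoothing step can be illustrated with a minimal, standard-library sketch of a categorical Naive Bayes classifier. This is not the repository's code: the names `train_nb` and `predict_nb` and the toy weather data are hypothetical, and the sketch assumes the common additive form $P(x_j = v \mid c) = (\text{count} + L) / (n_c + L \cdot k_j)$, where $k_j$ is the number of distinct values of feature $j$.

```python
import math
from collections import Counter


def train_nb(rows, labels, L=1.0):
    """Count class and feature-value frequencies (hypothetical helper)."""
    n_features = len(rows[0])
    class_counts = Counter(labels)
    # value_counts[c][j][v] = number of class-c rows whose feature j equals v
    value_counts = {c: [Counter() for _ in range(n_features)] for c in class_counts}
    # Distinct values seen per feature, needed for the smoothing denominator
    feature_values = [set() for _ in range(n_features)]
    for row, c in zip(rows, labels):
        for j, v in enumerate(row):
            value_counts[c][j][v] += 1
            feature_values[j].add(v)
    return {
        "L": L,
        "n": len(labels),
        "class_counts": class_counts,
        "value_counts": value_counts,
        "feature_values": feature_values,
    }


def predict_nb(model, row):
    """Return the class with the highest smoothed log-posterior."""
    L = model["L"]
    best_class, best_score = None, -math.inf
    for c, n_c in model["class_counts"].items():
        score = math.log(n_c / model["n"])  # log prior
        for j, v in enumerate(row):
            count = model["value_counts"][c][j][v]  # 0 if never seen with class c
            k = len(model["feature_values"][j])
            # Laplace smoothing keeps the probability strictly positive for L > 0
            score += math.log((count + L) / (n_c + L * k))
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Toy data (made up for illustration): (outlook, temperature) -> play?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train_nb(rows, labels, L=1.0)
print(predict_nb(model, ("rainy", "hot")))
```

Without smoothing ($L = 0$), the pair `("rainy", "hot")` would receive zero probability under both classes, because each class has one feature value it never saw during training.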
## Prerequisites
To run this script, you will need Python 3.x and the following data science libraries:

pip install numpy pandas scikit-learn

This script is a comparative benchmark of three classification algorithms (Naive Bayes, Logistic Regression, and a Trivial Baseline) across three types of datasets (Categorical, Continuous, and Mixed). It specifically analyzes how training set size affects model performance.
This project evaluates classification algorithms on Categorical, Continuous, and Mixed datasets. It implements a rigorous testing pipeline that analyzes how model accuracy changes as the size of the available training data increases (from 50% to 100%).
Handles three distinct data types:
* Categorical: Uses CategoricalNB and One-Hot Encoding.
* Continuous: Uses GaussianNB.
* Mixed: Uses a custom MixedNB implementation for hybrid data.
- Algorithm Comparison:
  - Naive Bayes: The primary generative model.
  - Logistic Regression: The primary discriminative baseline.
  - Trivial Classifier: A majority-class baseline to establish the minimum acceptable performance.
- Learning Curve Analysis: Tests the models on $K\%$ of the training data ($K \in \{50, 60, \dots, 100\}$) to visualize how the amount of training data impacts performance.
- Statistical Accuracy Testing: Every experiment is repeated 100 times with random splits, and the average accuracy is reported so that results do not depend on a single split.
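The repeated-split learning curve can be sketched as follows, using only the trivial majority-class baseline so the example stays dependency-free. The function names, the 20% test fraction, and the synthetic 70/30 label list are assumptions made for illustration, not details taken from the script; any classifier could be substituted for the baseline.

```python
import random
from collections import Counter


def majority_class_accuracy(train_labels, test_labels):
    """Trivial baseline: always predict the most frequent training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(y == majority for y in test_labels) / len(test_labels)


def learning_curve(labels, fractions=(0.5, 0.6, 0.7, 0.8, 0.9, 1.0),
                   n_repeats=100, test_size=0.2, seed=0):
    """Average accuracy over random splits for each training fraction K."""
    rng = random.Random(seed)
    n_test = int(len(labels) * test_size)  # assumed hold-out size
    totals = {k: 0.0 for k in fractions}
    for _ in range(n_repeats):
        shuffled = labels[:]
        rng.shuffle(shuffled)  # a fresh random split on every repeat
        test, train = shuffled[:n_test], shuffled[n_test:]
        for k in fractions:
            # Keep only the first K fraction of the training portion
            subset = train[:max(1, int(len(train) * k))]
            totals[k] += majority_class_accuracy(subset, test)
    return {k: totals[k] / n_repeats for k in fractions}


# Synthetic labels: 70% class "a", 30% class "b"
labels = ["a"] * 70 + ["b"] * 30
curve = learning_curve(labels)
for k, acc in curve.items():
    print(f"K = {int(k * 100):3d}%  mean accuracy = {acc:.3f}")
```

On this synthetic data the baseline should hover near the majority-class share, which gives a floor against which Naive Bayes and Logistic Regression can be judged.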
The script requires Python 3.x and the following libraries.
Note: This script relies on a local module named mixed_naive_bayes. Ensure mixed_naive_bayes.py is present in your root directory.
pip install numpy pandas matplotlib scikit-learn

Script: question_3.py
This script demonstrates the impact of L1 Regularization (Lasso) on Logistic Regression weights. It shows visually how increasing the regularization strength ($\lambda$) changes the learned weights.
The script trains four different Logistic Regression models on the same binary classification task (Iris Setosa vs. Others):
- No Regularization (approximated with $C = 10^{10}$): The baseline weights.
- Lasso ($\lambda = 0.5$): Mild regularization.
- Lasso ($\lambda = 10$): Strong regularization.
- Lasso ($\lambda = 100$): Very strong regularization.
It then plots the magnitude of the learned weights.
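Lasso adds a penalty term to the loss function equal to the sum of the absolute values of the coefficients, scaled by the regularization strength $\lambda$:

$$\text{Loss} = \text{Loss}_{\text{original}} + \lambda \sum_{j} |w_j|$$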
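A rough sketch of the four-model comparison is shown below. It assumes the $\lambda$ values are passed to scikit-learn as $C = 1/\lambda$ (scikit-learn's `C` is the inverse of the regularization strength) and that the `liblinear` solver is used for the L1 penalty; the actual script may differ on both points, and the plotting step is omitted. Recent scikit-learn releases may emit a deprecation warning for the `penalty` argument.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
y_binary = (y == 0).astype(int)  # Setosa (class 0) vs. the other two species

# Assumed mapping: C = 1 / lambda; a huge C approximates "no regularization".
settings = {
    "none": 1e10,
    "lambda=0.5": 1 / 0.5,
    "lambda=10": 1 / 10,
    "lambda=100": 1 / 100,
}

l1_norms = {}
for name, C in settings.items():
    clf = LogisticRegression(penalty="l1", C=C, solver="liblinear")
    clf.fit(X, y_binary)
    weights = clf.coef_.ravel()
    # Sum of absolute weights: the quantity the L1 penalty shrinks
    l1_norms[name] = float(np.abs(weights).sum())
    print(f"{name:>10}: weights = {np.round(weights, 3)}")
```

Comparing the printed weight vectors across settings is a quick way to check whether stronger regularization shrinks the weights and sets some of them to exactly zero.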
This script implements a Random Forest Classifier from scratch (using Bagging and Feature Randomness) and compares it against Scikit-Learn's implementation.
This project implements a Random Forest classifier using Bagging (Bootstrap Aggregating) logic wrapped around standard Decision Trees. It investigates the stability and accuracy of ensemble methods by visualizing how the Forest outperforms individual Decision Trees.
- Custom Implementation: Implements `TrainRF` and `PredictRF` functions that handle:
  - Bootstrapping: Randomly sampling data with replacement to create diverse training sets.
  - Feature Subsetting: Restricting each tree to $\sqrt{N}$ features to decorrelate the trees.
  - Majority Voting: Aggregating predictions from all trees to determine the final class.
- Hyperparameter Analysis: Compares performance with different `min_samples_leaf` values (1 vs. 10) to observe the trade-off between overfitting and generalization.
- Statistical Analysis: Calculates the exact probability that a single random tree could outperform the entire forest.
- Benchmarking: Includes a direct comparison against `sklearn.ensemble.RandomForestClassifier` to validate the custom implementation's accuracy.
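The three building blocks named above (bootstrapping, feature subsetting, majority voting) can be sketched independently of any tree learner. The helper names below are hypothetical and are not the `TrainRF`/`PredictRF` functions themselves; the sketch assumes $\sqrt{N}$ is rounded down and that ties in the vote go to whichever label was counted first.

```python
import math
import random
from collections import Counter


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def feature_subset(n_features, rng):
    """Pick floor(sqrt(n_features)) distinct feature indices for one tree."""
    k = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), k))


def majority_vote(predictions_per_tree):
    """Combine per-tree predictions (one list per tree) into one label per sample."""
    votes_per_sample = zip(*predictions_per_tree)
    return [Counter(votes).most_common(1)[0][0] for votes in votes_per_sample]


rng = random.Random(0)
rows = bootstrap_indices(10, rng)   # indices may repeat; some rows are left out
cols = feature_subset(16, rng)      # 4 of 16 features for this tree

# Three trees, three samples: each inner list is one tree's predictions
final = majority_vote([
    ["a", "b", "a"],
    ["a", "b", "b"],
    ["b", "b", "a"],
])
print(final)
```

In a full forest, each tree would be fit on `X[rows][:, cols]` and would predict using the same `cols`, with `majority_vote` applied to the stacked predictions.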
You will need Python 3 and the following libraries:
pip install numpy pandas matplotlib scikit-learn