
marievang/ML_exercises


Machine Learning Exercises using ML models/algorithms

Assignment 2: Implementation of the Naive Bayes classifier from scratch

A custom Python implementation of the Naive Bayes algorithm, featuring Laplace smoothing and hyperparameter analysis.

This repository contains a manual implementation of the Naive Bayes classifier. It is designed to handle categorical datasets using discrete probability distributions and includes a workflow that averages accuracy over multiple random splits.

Features

  • Custom Implementation: The Training & Prediction sections are built entirely with NumPy and Pandas, without relying on pre-built classifier libraries.
  • Laplace Smoothing: Implements smoothing (controlled by parameter $L$) to handle the zero-frequency problem in categorical data.
  • Evaluation: The iter_NBC function runs the model 100 times with different random train/test splits to calculate an average accuracy.
  • Hyperparameter Tuning: Includes an experiment (L_effect) to analyze how different smoothing values ($0.01, 1, 10, 100...$) impact model performance.
  • Scikit-Learn Comparison: Contains a wrapper to compare the custom model against sklearn's GaussianNB.
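
As a rough illustration of the smoothing and evaluation steps listed above, the following is a minimal sketch and not the repository's code. The names `train_nbc`, `predict_nbc` and `average_accuracy` are hypothetical, features are assumed to be integer-coded categories, and the 25% test fraction is an assumption. Each conditional probability is estimated as (count + L) / (class count + L × number of values), so a value never seen with a class still gets a non-zero probability.

```python
import numpy as np

def train_nbc(X, y, L=1.0, n_values=None):
    """Fit categorical Naive Bayes with Laplace smoothing L (hypothetical helper).

    X holds integer-coded categories, y holds class labels.
    n_values gives the number of distinct values per feature (inferred if omitted).
    """
    X, y = np.asarray(X), np.asarray(y)
    classes = np.unique(y)
    if n_values is None:
        n_values = X.max(axis=0) + 1
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    log_likelihood = []  # one (n_classes, n_values[j]) table per feature
    for j, k in enumerate(n_values):
        counts = np.array([np.bincount(X[y == c, j], minlength=k) for c in classes])
        probs = (counts + L) / (counts.sum(axis=1, keepdims=True) + L * k)
        log_likelihood.append(np.log(probs))
    return classes, log_prior, log_likelihood

def predict_nbc(model, X):
    """Return the class with the highest log posterior for each row of X."""
    classes, log_prior, log_likelihood = model
    X = np.asarray(X)
    scores = np.tile(log_prior, (X.shape[0], 1))
    for j, table in enumerate(log_likelihood):
        scores += table[:, X[:, j]].T  # add log P(x_j | class) for every sample
    return classes[scores.argmax(axis=1)]

def average_accuracy(X, y, L=1.0, n_runs=100, test_fraction=0.25, seed=0):
    """Average test accuracy over n_runs random train/test splits."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    n_values = X.max(axis=0) + 1
    n_test = int(round(test_fraction * len(y)))
    accuracies = []
    for _ in range(n_runs):
        order = rng.permutation(len(y))
        test, train = order[:n_test], order[n_test:]
        model = train_nbc(X[train], y[train], L, n_values)
        accuracies.append((predict_nbc(model, X[test]) == y[test]).mean())
    return float(np.mean(accuracies))
```

Calling `average_accuracy` in a loop over several values of `L` would give a comparison in the spirit of the L_effect experiment.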

Prerequisites

To run this script you will need Python 3.x and the following data science libraries:

pip install numpy pandas scikit-learn

Assignment 3

This script is a comparative benchmark of three different classification algorithms (Naive Bayes, Logistic Regression, and a Trivial Baseline) across three different types of datasets (Categorical, Continuous, and Mixed). It specifically analyzes how training set size affects model performance.

This project evaluates classification algorithms on Categorical, Continuous, and Mixed datasets. It implements a rigorous testing pipeline that analyzes how model accuracy changes as the size of the available training data increases (from 50% to 100%).

Features

Handles three distinct data types:

  • Categorical: Uses CategoricalNB and One-Hot Encoding.
  • Continuous: Uses GaussianNB.
  • Mixed: Uses a custom MixedNB implementation for hybrid data.

  • Algorithm Comparison:
    1. Naive Bayes: The primary generative model.
    2. Logistic Regression: The primary discriminative baseline.
    3. Trivial Classifier: A majority-class baseline to establish the minimum acceptable performance.
  • Learning Curve Analysis: Trains the models on $K\%$ of the training data ($K \in \{50, 60, \ldots, 100\}$) to visualize how the amount of training data affects performance.
  • Statistical Accuracy Testing: Every experiment is repeated 100 times with random splits and the average accuracy is reported, so that the results do not depend on a single split.
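
The learning-curve procedure described above could be sketched roughly as follows. This is an illustrative outline rather than the script itself: the function name `learning_curve_accuracy`, the `MODELS` dictionary, the 25% test fraction and the use of the Iris data as a stand-in continuous dataset are all assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Factories so that every fit starts from a fresh, unfitted model.
MODELS = {
    "naive_bayes": lambda: GaussianNB(),
    "logistic_regression": lambda: LogisticRegression(max_iter=1000),
    "trivial": lambda: DummyClassifier(strategy="most_frequent"),
}

def learning_curve_accuracy(X, y, make_models, percentages=(50, 60, 70, 80, 90, 100),
                            n_runs=100, test_size=0.25, seed=0):
    """Return {model name: [mean test accuracy for each K in percentages]}."""
    rng = np.random.RandomState(seed)
    totals = {name: np.zeros(len(percentages)) for name in make_models}
    for _ in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=rng)
        for i, k in enumerate(percentages):
            # The split is shuffled, so the first K% of rows is a random subset.
            n_keep = max(1, int(len(y_tr) * k / 100))
            for name, factory in make_models.items():
                model = factory().fit(X_tr[:n_keep], y_tr[:n_keep])
                totals[name][i] += model.score(X_te, y_te)
    return {name: (total / n_runs).tolist() for name, total in totals.items()}
```

For example, `learning_curve_accuracy(*load_iris(return_X_y=True), MODELS)` returns one list of six averaged accuracies per model, which can then be plotted against $K$ with matplotlib.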

Prerequisites

The script requires Python 3.x and the following libraries.

Note: This script relies on a local module named mixed_naive_bayes. Ensure mixed_naive_bayes.py is present in your root directory.

pip install numpy pandas matplotlib scikit-learn

Script: question_3.py

This script demonstrates the impact of L1 Regularization (Lasso) on Logistic Regression weights. It shows visually how increasing the regularization strength ($\lambda$) pushes model coefficients toward zero, effectively performing feature selection.

The script trains four different Logistic Regression models on the same binary classification task (Iris Setosa vs. Others):

  1. No Regularization (approximated with $C = 10^{10}$): The baseline weights.
  2. Lasso ($\lambda = 0.5$): Mild regularization.
  3. Lasso ($\lambda = 10$): Strong regularization.
  4. Lasso ($\lambda = 100$): Very strong regularization.

It then plots the magnitude of the learned weights ($w_0, w_1, w_2, w_3$) for each model.

L1 Regularization (Lasso)

Lasso adds a penalty term to the loss function proportional to the sum of the absolute values of the coefficients:

$$Loss = \text{Likelihood Error} + \lambda \sum_{j=1}^{p} |w_j|$$
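
To make the shrinking effect concrete, here is a small sketch of the experiment, not the contents of question_3.py. It assumes that $\lambda$ is mapped to scikit-learn's inverse regularization strength as $C = 1/\lambda$ and that the `liblinear` solver is used; the helper name `fit_l1_weights` is hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
y_setosa = (y == 0).astype(int)  # Iris setosa vs. the other two species

def fit_l1_weights(lam):
    """Fit L1-penalized logistic regression and return the four feature weights."""
    # Assumption: lambda maps to scikit-learn's C parameter as C = 1 / lambda.
    model = LogisticRegression(penalty="l1", C=1.0 / lam,
                               solver="liblinear", max_iter=1000)
    model.fit(X, y_setosa)
    return model.coef_.ravel()

for lam in (0.5, 10, 100):
    weights = fit_l1_weights(lam)
    print(f"lambda={lam}: weights={np.round(weights, 3)}, "
          f"sum of |w|={np.abs(weights).sum():.3f}")
```

The printed sum of absolute weights should decrease as $\lambda$ grows, and a bar plot of `np.abs(weights)` per model gives the kind of figure described above.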

Assignment 4

This script implements a Random Forest Classifier from scratch (using Bagging and Feature Randomness) and compares it against Scikit-Learn's implementation.

This project implements a Random Forest classifier using Bagging (Bootstrap Aggregating) logic wrapped around standard Decision Trees. It investigates the stability and accuracy of ensemble methods by visualizing how the Forest outperforms individual Decision Trees.

Features

  • Custom Implementation: Implements TrainRF and PredictRF functions that handle:
    • Bootstrapping: Randomly sampling data with replacement to create diverse training sets.
    • Feature Subsetting: Restricting each tree to $\sqrt{N}$ of the $N$ available features to decorrelate the trees.
    • Majority Voting: Aggregating predictions from all trees to determine the final class.
  • Hyperparameter Analysis: Compares performance with different min_samples_leaf values (1 vs. 10) to observe the trade-off between overfitting and generalization.
  • Statistical Analysis: Calculates the exact probability that a single random tree could outperform the entire forest.
  • Benchmarking: Includes a direct comparison against sklearn.ensemble.RandomForestClassifier to validate the custom implementation's accuracy.
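
The three ingredients listed above (bootstrapping, feature subsetting and majority voting) can be sketched as follows. This is a hedged illustration, not the repository's TrainRF and PredictRF: the names `train_rf` and `predict_rf`, their signatures, and the choice to draw one random feature subset per tree are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_rf(X, y, n_trees=100, min_samples_leaf=1, seed=0):
    """Train a list of (tree, feature indices) pairs on bootstrap samples."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    n_samples, n_features = X.shape
    n_sub = max(1, int(np.sqrt(n_features)))  # sqrt(N) features per tree
    forest = []
    for _ in range(n_trees):
        rows = rng.integers(0, n_samples, size=n_samples)  # sample with replacement
        cols = rng.choice(n_features, size=n_sub, replace=False)
        tree = DecisionTreeClassifier(min_samples_leaf=min_samples_leaf,
                                      random_state=int(rng.integers(2**31 - 1)))
        tree.fit(X[rows][:, cols], y[rows])
        forest.append((tree, cols))
    return forest

def predict_rf(forest, X):
    """Predict by majority vote over all trees in the forest."""
    X = np.asarray(X)
    # votes has shape (n_trees, n_samples)
    votes = np.stack([tree.predict(X[:, cols]) for tree, cols in forest])
    predictions = []
    for sample_votes in votes.T:
        labels, counts = np.unique(sample_votes, return_counts=True)
        predictions.append(labels[counts.argmax()])
    return np.array(predictions)
```

Comparing `predict_rf` accuracy for `min_samples_leaf=1` and `min_samples_leaf=10`, and against each individual tree's accuracy, would mirror the analyses described in the feature list.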

Prerequisites

You will need Python 3 and the following libraries:

pip install numpy pandas matplotlib scikit-learn
