Swapnil4646/Stellar-Object-Classification-Using-ML

🌌 Stellar Object Classification using Machine Learning

Classifying galaxies, quasars, and stars from SDSS photometric data using 7 different ML models

Python scikit-learn XGBoost Pandas Seaborn


📌 Overview

This project classifies celestial objects (galaxies, quasars, and stars) using real photometric and spectroscopic data from the Sloan Digital Sky Survey (SDSS). Rather than relying on a single algorithm, it trains 7 classification models and compares them head to head, from simple decision trees to gradient-boosted ensembles, to find out which approach best separates the three object classes.

The best-performing model, Random Forest, reaches 97.66% test accuracy on a real 100,000-row astronomical dataset.


✨ Highlights

| 🎯 What | 📋 Detail |
| --- | --- |
| Best Model | Random Forest |
| Best Accuracy | 97.66% |
| Dataset Size | 100,000 stellar objects |
| Classes | Galaxy, Quasar, Star (3 classes) |
| Models Compared | 7 |
| Features Used | 15 (sky coordinates, photometric magnitudes, redshift, and survey metadata) |

🧠 How It Works

flowchart LR
    A[📂 Load star_classification.csv] --> B[🧹 Drop ID columns]
    B --> C[🏷️ Label encode class column]
    C --> D[⚖️ Standard scale features]
    D --> E[✂️ Train test split 70/30]
    E --> F[🤖 Train 7 classification models]
    F --> G[📊 Compare accuracy across models]
    G --> H[🎛️ Hyperparameter tune Gradient Boosting]
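The scaling and splitting steps in the flowchart can be sketched roughly as follows. This is a minimal illustration on a synthetic stand-in for the feature matrix, not the author's script; the stratified split and `random_state=42` are assumptions. The sketch keeps the flowchart's order (scale, then split). Fitting the scaler on the training portion only is a common alternative that keeps test-set statistics out of preprocessing.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 1,000 objects, 15 features, 3 encoded classes.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(1000, 15))
y = rng.integers(0, 3, size=1000)

# Standard-scale every feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# 70/30 train/test split; stratify and random_state are assumptions.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.30, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
```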

🗂️ Dataset

The dataset (star_classification.csv) comes from the Sloan Digital Sky Survey, containing real photometric measurements for 100,000 space objects.

  • 🧾 100,000 rows, 18 original columns
  • 🔭 Features: alpha and delta (sky coordinates), photometric magnitudes u, g, r, i, z (SDSS filter bands), redshift, plus survey metadata (run_ID, rerun_ID, cam_col, field_ID, plate, MJD, fiber_ID)
  • 🆔 obj_ID and spec_obj_ID dropped, since they are just identifiers with no predictive value
  • 🎯 Target: class, encoded as GALAXY = 0, QSO = 1, STAR = 2
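A minimal sketch of the ID drop and target encoding described above, assuming scikit-learn's `LabelEncoder` is the encoder (the README only says the column is label encoded). The four-row frame is a hand-made stand-in for `star_classification.csv` with illustrative values and only a few of the real columns.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny hand-made stand-in for star_classification.csv (values are illustrative).
df = pd.DataFrame({
    "obj_ID": [1.0e18, 1.1e18, 1.2e18, 1.3e18],
    "spec_obj_ID": [6.5e18, 6.6e18, 6.7e18, 6.8e18],
    "u": [23.9, 24.8, 19.4, 22.1],
    "g": [22.3, 22.8, 18.0, 21.6],
    "redshift": [0.63, 0.78, 0.0001, 1.42],
    "class": ["GALAXY", "GALAXY", "STAR", "QSO"],
})

# Drop the identifier columns; they carry no predictive signal.
df = df.drop(columns=["obj_ID", "spec_obj_ID"])

# LabelEncoder assigns integer codes in sorted label order.
encoder = LabelEncoder()
df["class"] = encoder.fit_transform(df["class"])

# Recover the label -> code mapping for reference.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
```

Because the codes follow sorted label order, the mapping matches the GALAXY = 0, QSO = 1, STAR = 2 encoding listed above.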

Class distribution:

| Class | Count | Share |
| --- | --- | --- |
| 🌌 Galaxy | 59,445 | 59.4% |
| ⭐ Star | 21,594 | 21.6% |
| 🌀 Quasar (QSO) | 18,961 | 19.0% |
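The shares follow directly from the counts, and the same arithmetic gives a useful reference point: a classifier that always predicts the largest class would already score about 59.4% on data with this distribution. A short sketch of that calculation (the variable names are illustrative):

```python
# Class counts as listed in the class distribution table.
counts = {"GALAXY": 59_445, "STAR": 21_594, "QSO": 18_961}

total = sum(counts.values())
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

# Accuracy of a classifier that always predicts the largest class.
majority_baseline = max(counts.values()) / total

print(total, shares, majority_baseline)
```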

📊 Exploratory Analysis

The script runs a thorough exploratory pass before any modeling:

  • 📦 Boxplots of every photometric feature broken down by class
  • 🌊 KDE distribution plots comparing how each class spreads across feature values
  • 🔗 Pairplots to visualize interdependence between coordinates, redshift, and class
  • 🗺️ Mollweide sky projections plotting the actual celestial coordinates (alpha, delta) of every object, color coded by class
  • 🌈 Spectral band histograms comparing brightness across the u (ultraviolet), g (green), r (red), i (near infrared), and z (infrared) bands for each class
  • 📉 Redshift distribution analysis, since redshift is one of the strongest signals separating galaxies from quasars
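For the Mollweide sky projection, one detail is easy to get wrong: Matplotlib's `mollweide` axes expect longitude in radians within [-π, π] and latitude in radians within [-π/2, π/2], while `alpha` and `delta` are given in degrees. The sketch below shows one way to do the conversion and plot. The helper names `to_mollweide_radians` and `plot_sky` are hypothetical and not taken from the author's script.

```python
import numpy as np

def to_mollweide_radians(alpha_deg, delta_deg):
    """Convert RA/Dec in degrees to the radians Matplotlib's Mollweide axes expect."""
    alpha = np.asarray(alpha_deg, dtype=float)
    delta = np.asarray(delta_deg, dtype=float)
    # Wrap right ascension from [0, 360) into [-180, 180) before converting.
    lon = np.radians(((alpha + 180.0) % 360.0) - 180.0)
    lat = np.radians(delta)
    return lon, lat

def plot_sky(alpha_deg, delta_deg, labels):
    """Scatter objects on a Mollweide projection, one color per class."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    lon, lat = to_mollweide_radians(alpha_deg, delta_deg)
    labels = np.asarray(labels)
    fig, ax = plt.subplots(subplot_kw={"projection": "mollweide"})
    for name in np.unique(labels):
        mask = labels == name
        ax.scatter(lon[mask], lat[mask], s=2, label=str(name))
    ax.grid(True)
    ax.legend(loc="lower right")
    return fig

# Quick check of the conversion on three illustrative coordinates.
lon, lat = to_mollweide_radians([0.0, 90.0, 270.0], [0.0, 45.0, -30.0])
print(np.degrees(lon), np.degrees(lat))
```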

🏗️ Models Compared

Seven classifiers were trained on the same scaled, split data for a fair comparison:

| Model | Test Accuracy |
| --- | --- |
| 🌲 Random Forest | 97.66% |
| 🚀 Gradient Boosting | 97.61% |
| XGBoost | 97.56% |
| 📈 Logistic Regression | 95.71% |
| 🌳 Decision Tree | 95.12% |
| 🔥 AdaBoost | 92.81% |
| 📍 K-Nearest Neighbors | 89.99% |
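A comparison like this is typically a loop over a dictionary of estimators that all see the same split. The sketch below illustrates the pattern on a small synthetic 3-class dataset, so its accuracies will not match the table. The hyperparameters shown are assumptions, since the README does not list the settings used, and XGBoost is added only when the package is installed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class stand-in for the scaled SDSS features.
X, y = make_classification(
    n_samples=600, n_features=15, n_informative=6, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

# Hyperparameters here are illustrative assumptions.
models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}

try:  # XGBoost is a separate package; include it only when installed.
    from xgboost import XGBClassifier
    models["XGBoost"] = XGBClassifier(n_estimators=50, random_state=0)
except ImportError:
    pass

# Fit every model on the same split and record its test accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

# Print a leaderboard, best model first.
for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<20} {acc:.4f}")
```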

📝 Support Vector Machine was part of the original modeling sweep but is computationally expensive at this dataset size. To include it yourself, train it on a subsample of the training set or use a linear kernel.
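Both workarounds can be sketched as below, on synthetic data standing in for the scaled training split. The subsample size of 1,000 and `max_iter=5000` are arbitrary assumptions, and `LinearSVC` is used here as one way to get a linear-kernel SVM that scales better with row count than `SVC`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC

# Synthetic stand-in for the scaled training split.
X_train, y_train = make_classification(
    n_samples=5000, n_features=15, n_informative=6, n_classes=3, random_state=0
)

# Option 1: kernel SVM on a random subsample (size is an arbitrary choice).
rng = np.random.default_rng(0)
subset = rng.choice(len(X_train), size=1000, replace=False)
rbf_svm = SVC(kernel="rbf").fit(X_train[subset], y_train[subset])

# Option 2: linear SVM on the full training set.
linear_svm = LinearSVC(max_iter=5000).fit(X_train, y_train)

# Training-set accuracy only, as a sanity check that both models fit.
rbf_acc = rbf_svm.score(X_train, y_train)
linear_acc = linear_svm.score(X_train, y_train)
print(rbf_acc, linear_acc)
```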


🎛️ Hyperparameter Tuning

A RandomizedSearchCV pass is run on the Gradient Boosting Classifier, searching across:

  • n_estimators: 100, 150, 200
  • learning_rate: 0.01, 0.05, 0.1
  • max_depth: 3, 4, 5
  • subsample: 0.8, 1.0
  • max_features: sqrt, log2

This tests whether a tuned Gradient Boosting model can match or beat the untuned Random Forest baseline (97.66%), which it trails by 0.05 percentage points before tuning.
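The search space above contains 3 × 3 × 3 × 2 × 2 = 108 combinations, and `RandomizedSearchCV` samples only a subset of them. A minimal sketch of such a search follows. The values of `n_iter`, `cv`, and `random_state` are assumptions, since the README does not state them, and the small synthetic dataset is there only so the example runs in seconds.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Search space as listed in the README.
param_distributions = {
    "n_estimators": [100, 150, 200],
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [3, 4, 5],
    "subsample": [0.8, 1.0],
    "max_features": ["sqrt", "log2"],
}

# Small synthetic stand-in so the sketch runs quickly.
X, y = make_classification(
    n_samples=300, n_features=15, n_informative=6, n_classes=3, random_state=0
)

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=4,            # assumption: the README does not state n_iter
    cv=3,                # assumption: the fold count is not stated either
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 4))
```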


✅ Results

🏆 Best Model: Random Forest — 97.66% Accuracy

Across all 7 models tested, the top three ensemble methods (Random Forest, Gradient Boosting, and XGBoost) clearly outperform the remaining four: Logistic Regression, Decision Tree, AdaBoost, and KNN. The gap between the top three is small (0.10 percentage points, from 97.66% to 97.56%), suggesting the dataset's signal is strong enough that several solid algorithms converge to similar performance.


🛠️ Tech Stack

| Category | Tools Used |
| --- | --- |
| 🐍 Language | Python 3 |
| 🤖 Machine Learning | scikit-learn, XGBoost |
| 📊 Data Handling | Pandas, NumPy |
| 📈 Visualization | Matplotlib, Seaborn |
| ☁️ Environment | Google Colab |

📁 Project Structure

stellar-object-classification/
│
├── stellar_obj_classification.py   # Main script: EDA, modeling, evaluation, tuning
├── star_classification.csv          # SDSS dataset
└── README.md                        # You are here

▶️ Getting Started

  1. Clone the repository

    git clone https://github.com/sarthakNaikare/stellar-object-classification.git
    cd stellar-object-classification
  2. Install dependencies

    pip install pandas numpy scikit-learn xgboost seaborn matplotlib
  3. Run the script

    python stellar_obj_classification.py

    The script loads the data, runs the full exploratory analysis, trains all 7 models, prints a classification report for each, and finishes with a hyperparameter-tuned Gradient Boosting model.


🔮 Future Improvements

  • 🧮 Add a confusion matrix per model to see exactly where Galaxy, Quasar, and Star predictions get confused
  • ⚡ Run SVM on a subsampled training set or with a linear kernel for a fair, faster comparison
  • 🎯 Apply the same RandomizedSearchCV tuning to Random Forest and XGBoost, not just Gradient Boosting
  • 🧬 Try feature selection to see if dropping survey metadata columns improves or hurts performance
  • 📊 Add SHAP values to explain which features drive each classifier's decisions
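As a starting point for the first item, a per-model confusion matrix takes only a few lines with scikit-learn. The sketch below uses a Random Forest on synthetic 3-class data, so the numbers it prints say nothing about the real SDSS results. The class names are attached on the assumption that the GALAXY = 0, QSO = 1, STAR = 2 encoding is in use.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

class_names = ["GALAXY", "QSO", "STAR"]  # assumed to match the 0/1/2 encoding

# Synthetic 3-class stand-in for the scaled SDSS features.
X, y = make_classification(
    n_samples=600, n_features=15, n_informative=6, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, model.predict(X_test), labels=[0, 1, 2])
for name, row in zip(class_names, cm):
    print(f"{name:<7}", row)
```

Off-diagonal cells show which pairs of classes get confused, which the single accuracy figure hides.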

👤 Author

Swapnil Yadav

⭐ If you found this project useful or interesting, consider giving it a star!
