🌌 Stellar Object Classification using Machine Learning

Classifying galaxies, quasars, and stars from SDSS photometric data using 7 different ML models

📌 Overview

This project classifies celestial objects (galaxies, quasars, and stars) using real photometric and spectroscopic data from the Sloan Digital Sky Survey (SDSS). Rather than relying on a single algorithm, it trains 7 classification models and compares them head to head, from a single decision tree to gradient boosted ensembles, to find out which approach separates the three classes best.

The best-performing model, Random Forest, reaches 97.66% test accuracy on the 100,000-row SDSS dataset.

✨ Highlights

| 🎯 What | 📋 Detail |
| --- | --- |
| Best Model | Random Forest |
| Best Accuracy | 97.66% (test set) |
| Dataset Size | 100,000 stellar objects |
| Classes | Galaxy, Quasar, Star (3 classes) |
| Models Compared | 7 |
| Features Used | 15 (sky coordinates, photometric magnitudes, redshift, survey metadata) |

🧠 How It Works

flowchart LR
    A[📂 Load star_classification.csv] --> B[🧹 Drop ID columns]
    B --> C[🏷️ Label encode class column]
    C --> D[⚖️ Standard scale features]
    D --> E[✂️ Train test split 70/30]
    E --> F[🤖 Train 7 classification models]
    F --> G[📊 Compare accuracy across models]
    G --> H[🎛️ Hyperparameter tune Gradient Boosting]
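
The preprocessing steps in the flowchart can be sketched roughly as below. This is a minimal illustration, not the project's actual code: the `prepare` helper, the `random_state` value, and the small synthetic frame standing in for star_classification.csv are all assumptions, and only a handful of the real feature columns are included.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler


def prepare(df, test_size=0.30, random_state=42):
    """Hypothetical helper: drop IDs, encode the target, scale, then split 70/30."""
    df = df.drop(columns=["obj_ID", "spec_obj_ID"])
    y = LabelEncoder().fit_transform(df["class"])
    # Scaling happens before the split here because that is the order in the flowchart.
    X = StandardScaler().fit_transform(df.drop(columns=["class"]))
    return train_test_split(X, y, test_size=test_size, random_state=random_state)


# Synthetic stand-in for the real CSV (assumption: random values, reduced column set).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "obj_ID": np.arange(n),
    "spec_obj_ID": np.arange(n) + 1000,
    "alpha": rng.uniform(0, 360, n),
    "delta": rng.uniform(-90, 90, n),
    "u": rng.normal(22, 2, n),
    "redshift": rng.exponential(0.5, n),
    "class": rng.choice(["GALAXY", "QSO", "STAR"], n),
})

X_train, X_test, y_train, y_test = prepare(demo)
print(X_train.shape, X_test.shape)
```

One design note: fitting the scaler on the training split only is the more common practice, since scaling before the split lets test-set statistics influence the transform. The sketch keeps the flowchart's order so it mirrors the description above.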

🗂️ Dataset

The dataset (star_classification.csv) comes from the Sloan Digital Sky Survey, containing real photometric measurements for 100,000 space objects.

🧾 100,000 rows, 18 original columns
🔭 Features: alpha and delta (sky coordinates), photometric magnitudes u, g, r, i, z (SDSS filter bands), redshift, plus survey metadata (run_ID, cam_col, field_ID, plate, MJD, fiber_ID)
🆔 obj_ID and spec_obj_ID dropped, since they are just identifiers with no predictive value
🎯 Target: class, encoded as GALAXY = 0, QSO = 1, STAR = 2
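
The target encoding listed above matches what scikit-learn's `LabelEncoder` produces, because it assigns integer codes in sorted label order. The sketch below assumes that is the encoder in use; the README only says the column is label encoded.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
# A few example labels in arbitrary order; the real column has 100,000 entries.
codes = encoder.fit_transform(["STAR", "GALAXY", "QSO", "GALAXY"])

# classes_ holds the labels in sorted order, so the index of each label is its code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes.tolist())
```

Keeping the fitted encoder around is handy later: `encoder.inverse_transform` turns predicted integers back into class names for reports.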

Class distribution:

| Class | Count | Share |
| --- | --- | --- |
| 🌌 Galaxy | 59,445 | 59.4% |
| ⭐ Star | 21,594 | 21.6% |
| 🌀 Quasar (QSO) | 18,961 | 19.0% |
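
The shares in the table follow directly from the counts, and they also give a useful reference point: a model that always predicts Galaxy would already be right about 59.4% of the time. A small sketch of that arithmetic, using only the counts from the table:

```python
# Class counts copied from the table above.
counts = {"GALAXY": 59_445, "STAR": 21_594, "QSO": 18_961}

total = sum(counts.values())
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

# Accuracy of a trivial classifier that always predicts the largest class.
majority_baseline = max(counts.values()) / total

print(total, shares)
print(f"majority-class baseline accuracy: {majority_baseline:.4f}")
```

With the CSV loaded into a pandas DataFrame, `df["class"].value_counts(normalize=True)` would give the same shares.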

📊 Exploratory Analysis

The notebook includes a thorough exploratory pass before any modeling:

📦 Boxplots of every photometric feature broken down by class
🌊 KDE distribution plots comparing how each class spreads across feature values
🔗 Pairplots to visualize interdependence between coordinates, redshift, and class
🗺️ Mollweide sky projections plotting the actual celestial coordinates (alpha, delta) of every object, color coded by class
🌈 Spectral band histograms comparing brightness across the u, g, r, i, and z bands (ultraviolet, green, red, near-infrared, and infrared) for each class
📉 Redshift distribution analysis, since redshift is one of the strongest signals separating galaxies from quasars
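
For the Mollweide sky projection, one detail is easy to miss: Matplotlib's `mollweide` projection expects radians, with longitude in [-π, π] and latitude in [-π/2, π/2], while alpha and delta are given in degrees with alpha running from 0 to 360. The sketch below shows one way to handle that. Both function names are hypothetical, and the notebook's actual plotting code may differ.

```python
import numpy as np


def to_mollweide(alpha_deg, delta_deg):
    """Convert alpha/delta in degrees to the radian ranges the projection expects."""
    alpha = np.asarray(alpha_deg, dtype=float)
    # Wrap 0..360 degrees into -180..180 before converting to radians.
    lon = np.radians(np.where(alpha > 180.0, alpha - 360.0, alpha))
    lat = np.radians(np.asarray(delta_deg, dtype=float))
    return lon, lat


def plot_sky(alpha_deg, delta_deg, labels):
    """Scatter objects on a Mollweide projection, one colour per class."""
    import matplotlib.pyplot as plt  # imported here so the conversion works without it

    lon, lat = to_mollweide(alpha_deg, delta_deg)
    labels = np.asarray(labels)
    fig = plt.figure(figsize=(10, 5))
    ax = fig.add_subplot(111, projection="mollweide")
    for name in np.unique(labels):
        mask = labels == name
        ax.scatter(lon[mask], lat[mask], s=2, label=str(name))
    ax.grid(True)
    ax.legend(loc="lower right")
    return fig
```

Calling `plot_sky(df["alpha"], df["delta"], df["class"])` on the loaded dataset would then produce a figure of the kind described above.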

🏗️ Models Compared

Seven classifiers were trained on the same scaled, split data for a fair comparison:

| Model | Test Accuracy |
| --- | --- |
| 🌲 Random Forest | 97.66% |
| 🚀 Gradient Boosting | 97.61% |
| ⚡ XGBoost | 97.56% |
| 📈 Logistic Regression | 95.71% |
| 🌳 Decision Tree | 95.12% |
| 🔥 AdaBoost | 92.81% |
| 📍 K-Nearest Neighbors | 89.99% |
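
A comparison like the one in the table can be run with a simple loop over a dictionary of models. The sketch below is illustrative only. It uses synthetic data from `make_classification` in place of the scaled SDSS features, and the hyperparameters and seeds are assumptions. XGBoost is left out so that the example depends only on scikit-learn (`xgboost.XGBClassifier` would slot into the dictionary the same way). The accuracies it prints have no relation to the numbers reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for the 15 scaled features (assumption).
X, y = make_classification(n_samples=600, n_features=15, n_informative=6,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}

# Every model sees the same train/test split, which keeps the comparison fair.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, model.predict(X_test))

for name, acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<22s} {acc:.4f}")
```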

📝 Support Vector Machine was part of the original modeling sweep but is left out of the table because it is computationally expensive at this dataset size. To include it yourself, train it on a subsampled training set or use a linear kernel.
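
As a rough sketch of the SVM workaround, the example below combines both suggestions: it draws a random subsample and fits a linear SVM with `LinearSVC`. Kernel SVMs in scikit-learn scale at least quadratically with the number of samples, which is why 70,000 training rows become slow. The subsample size, `C`, and the synthetic data are all assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic stand-in for the scaled training features (assumption).
X, y = make_classification(n_samples=2000, n_features=15, n_informative=6,
                           n_classes=3, random_state=0)

# Draw a random subsample of row indices without replacement.
rng = np.random.default_rng(0)
subset = rng.choice(len(X), size=500, replace=False)

svm = LinearSVC(C=1.0, max_iter=5000)
svm.fit(X[subset], y[subset])

# Score on the rows that were not used for fitting.
held_out = np.setdiff1d(np.arange(len(X)), subset)
score = svm.score(X[held_out], y[held_out])
print(f"accuracy on rows outside the subsample: {score:.3f}")
```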

🎛️ Hyperparameter Tuning

A RandomizedSearchCV pass is run on the Gradient Boosting Classifier, searching across:

n_estimators: 100, 150, 200
learning_rate: 0.01, 0.05, 0.1
max_depth: 3, 4, 5
subsample: 0.8, 1.0
max_features: sqrt, log2

This explores whether a tuned Gradient Boosting model can close the gap with, or beat, the untuned Random Forest baseline.
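
The search space above translates directly into a `param_distributions` dictionary. The sketch below shows the general shape of such a search. The values of `n_iter`, `cv`, the scoring metric, and the seeds are assumptions, since the README does not state them, and synthetic data replaces the real training set so the example runs in seconds.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Search space copied from the list above.
param_distributions = {
    "n_estimators": [100, 150, 200],
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [3, 4, 5],
    "subsample": [0.8, 1.0],
    "max_features": ["sqrt", "log2"],
}

# Small synthetic 3-class problem standing in for the SDSS training data (assumption).
X, y = make_classification(n_samples=300, n_features=15, n_informative=6,
                           n_classes=3, random_state=0)

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=3,            # assumption: kept tiny so the sketch runs quickly
    cv=3,                # assumption: 3-fold cross-validation
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.4f}")
```

The full grid has 3 × 3 × 3 × 2 × 2 = 108 combinations, so a randomized search that samples only some of them trades exhaustiveness for run time.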

✅ Results

🏆 Best Model: Random Forest — 97.66% Accuracy

Across all 7 models tested, the three strongest ensembles (Random Forest, Gradient Boosting, and XGBoost) clearly outperform Logistic Regression, Decision Tree, AdaBoost, and KNN, leading the next best model, Logistic Regression, by at least 1.85 percentage points. The gap among the top three is only 0.10 percentage points (97.66% vs. 97.56%), suggesting the dataset's signal is strong enough that several solid algorithms converge to similar performance.
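
To put the small gap among the top models in context, it helps to estimate how much an accuracy figure can move from sampling noise alone. The sketch below uses the binomial standard error, assuming the 30% test split holds 30,000 rows and treating test rows as independent. Since all models share one test set, a paired comparison would be more precise, so this is only a rough guide.

```python
import math

# 30% of 100,000 rows, per the 70/30 split described earlier.
n_test = 100_000 * 30 // 100


def accuracy_standard_error(accuracy, n):
    """Binomial standard error of an accuracy estimated from n test rows."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


random_forest, xgboost_acc = 0.9766, 0.9756
se = accuracy_standard_error(random_forest, n_test)
gap = random_forest - xgboost_acc

print(f"standard error: {100 * se:.3f} percentage points")
print(f"gap between first and third: {100 * gap:.2f} percentage points")
```

The standard error comes out near 0.09 percentage points, about the same size as the 0.10 point gap, so the ordering of the top three models should be read as approximate.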

🛠️ Tech Stack

| Category | Tools Used |
| --- | --- |
| 🐍 Language | Python 3 |
| 🤖 Machine Learning | scikit-learn, XGBoost |
| 📊 Data Handling | Pandas, NumPy |
| 📈 Visualization | Matplotlib, Seaborn |
| ☁️ Environment | Google Colab |
☁️ Environment	Google Colab

📁 Project Structure

stellar-object-classification/
│
├── stellar_obj_classification.py   # Main script: EDA, modeling, evaluation, tuning
├── star_classification.csv          # SDSS dataset
└── README.md                        # You are here

▶️ Getting Started

Clone the repository

git clone https://github.com/sarthakNaikare/stellar-object-classification.git
cd stellar-object-classification

Install dependencies

pip install pandas numpy scikit-learn xgboost seaborn matplotlib

Run the script
```
python stellar_obj_classification.py
```
The script loads the data, runs the full exploratory analysis, trains all 7 models, prints classification reports for each, and finishes with a hyperparameter tuned Gradient Boosting model.

🔮 Future Improvements

🧮 Add a confusion matrix per model to see exactly where Galaxy, Quasar, and Star predictions get confused
⚡ Run SVM on a subsampled training set or with a linear kernel for a fair, faster comparison
🎯 Apply the same RandomizedSearchCV tuning to Random Forest and XGBoost, not just Gradient Boosting
🧬 Try feature selection to see if dropping survey metadata columns improves or hurts performance
📊 Add SHAP values to explain which features drive each classifier's decisions
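
As a starting point for the confusion matrix idea, scikit-learn's `confusion_matrix` can be called on any model's predictions. The sketch below uses a handful of made-up labels purely to show the layout (rows are true classes, columns are predicted classes). None of these values come from the project's results.

```python
from sklearn.metrics import confusion_matrix

labels = ["GALAXY", "QSO", "STAR"]

# Made-up labels for illustration only.
y_true = ["GALAXY", "GALAXY", "QSO", "QSO", "STAR", "STAR", "GALAXY"]
y_pred = ["GALAXY", "QSO", "QSO", "GALAXY", "STAR", "STAR", "GALAXY"]

# Passing labels fixes the row/column order to GALAXY, QSO, STAR.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Per-class recall: correct predictions on the diagonal over the row totals.
recall = cm.diagonal() / cm.sum(axis=1)

print(cm)
print(dict(zip(labels, recall.round(3))))
```

In the real script, `y_true` would be the encoded test labels and `y_pred` the output of each model's `predict`, giving one matrix per model.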

👤 Author

Swapnil Yadav

⭐ If you found this project useful or interesting, consider giving it a star!
