This project classifies celestial objects (galaxies, quasars, and stars) using real photometric and spectroscopic data from the Sloan Digital Sky Survey (SDSS). Rather than relying on a single algorithm, it trains 7 classification models and compares them head to head, from simple decision trees to gradient-boosted ensembles, to find out which approach separates these three object classes best.
The best performing model, Random Forest, reaches 97.66% test accuracy on a real 100,000 row astronomical dataset.
| 🎯 What | 📋 Detail |
|---|---|
| Best Model | Random Forest |
| Best Accuracy | 97.66% |
| Dataset Size | 100,000 stellar objects |
| Classes | Galaxy, Quasar, Star (3 classes) |
| Models Compared | 7 |
| Features Used | 15 photometric and positional features |
flowchart LR
A[📂 Load star_classification.csv] --> B[🧹 Drop ID columns]
B --> C[🏷️ Label encode class column]
C --> D[⚖️ Standard scale features]
D --> E[✂️ Train test split 70/30]
E --> F[🤖 Train 7 classification models]
F --> G[📊 Compare accuracy across models]
G --> H[🎛️ Hyperparameter tune Gradient Boosting]
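The preprocessing steps in the flowchart can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the helper name `prepare_data`, the `random_state` value, and the tiny synthetic frame standing in for `star_classification.csv` are all assumptions. It follows the flowchart's order (scale, then split); fitting the scaler on the training rows only would be the more conservative choice.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler


def prepare_data(df, test_size=0.3, random_state=42):
    """Drop ID columns, encode the target, scale features, then split 70/30."""
    # Identifier columns carry no predictive signal.
    df = df.drop(columns=["obj_ID", "spec_obj_ID"])

    # LabelEncoder assigns integers in sorted order: GALAXY=0, QSO=1, STAR=2.
    encoder = LabelEncoder()
    y = encoder.fit_transform(df["class"])

    # Standardize every remaining feature to zero mean and unit variance.
    X = StandardScaler().fit_transform(df.drop(columns=["class"]))

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    return X_train, X_test, y_train, y_test, encoder


# Tiny synthetic stand-in (assumption); with the real data use
# pd.read_csv("star_classification.csv") instead.
rng = np.random.default_rng(0)
n = 20
demo = pd.DataFrame({
    "obj_ID": np.arange(n),
    "spec_obj_ID": np.arange(n),
    "alpha": rng.uniform(0, 360, n),
    "delta": rng.uniform(-90, 90, n),
    "redshift": rng.uniform(0, 3, n),
    "class": ["GALAXY", "QSO", "STAR", "GALAXY"] * 5,
})

X_train, X_test, y_train, y_test, encoder = prepare_data(demo)
print(X_train.shape, X_test.shape)
print(list(encoder.classes_))
```

Returning the fitted encoder makes it possible to map integer predictions back to class names with `encoder.inverse_transform`.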
The dataset (star_classification.csv) comes from the Sloan Digital Sky Survey, containing real photometric measurements for 100,000 space objects.
- 🧾 100,000 rows, 18 original columns
- 🔭 Features: `alpha` and `delta` (sky coordinates), photometric magnitudes `u`, `g`, `r`, `i`, `z` (SDSS filter bands), `redshift`, plus survey metadata (`run_ID`, `cam_col`, `field_ID`, `plate`, `MJD`, `fiber_ID`)
- 🆔 `obj_ID` and `spec_obj_ID` dropped, since they are just identifiers with no predictive value
- 🎯 Target: `class`, encoded as GALAXY = 0, QSO = 1, STAR = 2
Class distribution:
| Class | Count | Share |
|---|---|---|
| 🌌 Galaxy | 59,445 | 59.4% |
| ⭐ Star | 21,594 | 21.6% |
| 🌀 Quasar (QSO) | 18,961 | 19.0% |
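The shares in the table follow directly from the counts. The short check below reproduces them, and also computes the accuracy of a trivial "always predict Galaxy" classifier. That baseline is an added illustration, not something the notebook reports, but it gives the model accuracies some context given the class imbalance.

```python
# Class counts as reported in the table above.
counts = {"GALAXY": 59_445, "STAR": 21_594, "QSO": 18_961}

total = sum(counts.values())
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

# Accuracy of a trivial classifier that always predicts the largest class.
majority_baseline = max(counts.values()) / total

print(total)
print(shares)
print(f"majority baseline: {majority_baseline:.4f}")
```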
The notebook includes a thorough exploratory pass before any modeling:
- 📦 Boxplots of every photometric feature broken down by class
- 🌊 KDE distribution plots comparing how each class spreads across feature values
- 🔗 Pairplots to visualize interdependence between coordinates, redshift, and class
- 🗺️ Mollweide sky projections plotting the actual celestial coordinates (`alpha`, `delta`) of every object, color coded by class
- 🌈 Spectral band histograms comparing brightness across UV, green, red, near infrared, and infrared bands for each class
- 📉 Redshift distribution analysis, since redshift is one of the strongest signals separating galaxies from quasars
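As one example from the list above, a Mollweide sky plot needs the coordinates converted first: Matplotlib's `mollweide` projection expects longitude and latitude in radians, with longitude in the range -π to π, while `alpha` is given in degrees from 0 to 360. The sketch below shows one way to do this. The function names `to_mollweide_radians` and `plot_sky` are hypothetical, and the notebook's own plotting code may differ.

```python
import numpy as np


def to_mollweide_radians(alpha_deg, delta_deg):
    """Convert sky coordinates in degrees to the radians that Matplotlib's
    'mollweide' projection expects (longitude wrapped into [-pi, pi))."""
    alpha_deg = np.asarray(alpha_deg, dtype=float)
    delta_deg = np.asarray(delta_deg, dtype=float)
    # Wrap 0..360 degrees into -180..180 before converting to radians.
    lon = np.radians((alpha_deg + 180.0) % 360.0 - 180.0)
    lat = np.radians(delta_deg)
    return lon, lat


def plot_sky(alpha_deg, delta_deg, labels):
    """Scatter objects on a Mollweide projection, one color per class label."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    lon, lat = to_mollweide_radians(alpha_deg, delta_deg)
    labels = np.asarray(labels)
    fig, ax = plt.subplots(subplot_kw={"projection": "mollweide"})
    for name in np.unique(labels):
        mask = labels == name
        ax.scatter(lon[mask], lat[mask], s=2, label=str(name))
    ax.grid(True)
    ax.legend(loc="lower right")
    return fig


# Three sample points: alpha = 270 degrees wraps around to -90 degrees.
lon, lat = to_mollweide_radians([0.0, 90.0, 270.0], [0.0, 45.0, -45.0])
print(lon, lat)
```

With the real data, `plot_sky(df["alpha"], df["delta"], df["class"])` would produce the class-colored map, assuming `df` holds the loaded CSV.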
Seven classifiers were trained on the same scaled, split data for a fair comparison:
| Model | Test Accuracy |
|---|---|
| 🌲 Random Forest | 97.66% |
| 🚀 Gradient Boosting | 97.61% |
| ⚡ XGBoost | 97.56% |
| 📈 Logistic Regression | 95.71% |
| 🌳 Decision Tree | 95.12% |
| 🔥 AdaBoost | 92.81% |
| 📍 K-Nearest Neighbors | 89.99% |
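A comparison like the one in the table can be set up as a simple loop over a dictionary of models, all fitted and scored on the same split. The sketch below is illustrative only: it uses synthetic three-class data from `make_classification` in place of the SDSS features, default hyperparameters, and arbitrary seeds, so its accuracies will not match the table. XGBoost is left out to keep the example to scikit-learn; `xgboost.XGBClassifier` can be added to the dictionary in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for the scaled SDSS features (assumption).
X, y = make_classification(
    n_samples=600, n_features=15, n_informative=6, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}

# Fit every model on the same split and score it on the same held-out rows.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, model.predict(X_test))

# Print the leaderboard, best model first.
for name, acc in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<20} {acc:.4f}")
```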
📝 Support Vector Machine was part of the original modeling sweep but is computationally expensive at this dataset size. If you want to include it yourself, run it on a subsampled training set or with a linear kernel.
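One way to try the SVM suggestion is sketched below, combining both ideas: a random subsample of the training rows and scikit-learn's `LinearSVC`. The subsample size, seeds, and synthetic data are assumptions chosen for illustration, not settings taken from the project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic 3-class data standing in for the scaled SDSS features (assumption).
X, y = make_classification(
    n_samples=2000, n_features=15, n_informative=6, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Keep a random subset of the training rows to bound SVM training cost.
subsample_size = 500  # hypothetical value; adjust to your time budget
rng = np.random.default_rng(0)
idx = rng.choice(len(X_train), size=subsample_size, replace=False)

# A linear SVM scales much better with row count than a kernelized SVC.
svm = LinearSVC(max_iter=5000)
svm.fit(X_train[idx], y_train[idx])

# Evaluate on the full test set so the score stays comparable to other models.
svm_accuracy = accuracy_score(y_test, svm.predict(X_test))
print(f"LinearSVC on {subsample_size} rows: {svm_accuracy:.4f}")
```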
A RandomizedSearchCV pass is run on the Gradient Boosting Classifier, searching across:
- `n_estimators`: 100, 150, 200
- `learning_rate`: 0.01, 0.05, 0.1
- `max_depth`: 3, 4, 5
- `subsample`: 0.8, 1.0
- `max_features`: sqrt, log2
This explores whether a tuned Gradient Boosting model can close the gap with, or beat, the untuned Random Forest baseline.
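The search space listed above translates into a `RandomizedSearchCV` call along these lines. The parameter values come from the list; `n_iter`, `cv`, the seeds, and the synthetic data are assumptions made so the sketch runs in a few seconds, and the script's actual settings may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Search space as listed above.
param_distributions = {
    "n_estimators": [100, 150, 200],
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [3, 4, 5],
    "subsample": [0.8, 1.0],
    "max_features": ["sqrt", "log2"],
}

# Small synthetic 3-class problem standing in for the SDSS data (assumption).
X, y = make_classification(
    n_samples=300, n_features=15, n_informative=6, n_classes=3, random_state=0
)

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=4,  # assumed; the full grid has 3 * 3 * 3 * 2 * 2 = 108 combinations
    cv=3,      # assumed fold count
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(f"best CV accuracy: {search.best_score_:.4f}")
```

Because the search samples only `n_iter` of the 108 combinations, raising `n_iter` trades run time for a more thorough search.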
Across all 7 models tested, three of the ensemble methods (Random Forest, Gradient Boosting, and XGBoost) clearly outperform the rest: Logistic Regression, Decision Tree, AdaBoost, and KNN. AdaBoost, although also an ensemble, lands below both Logistic Regression and the single Decision Tree. The top 3 models are separated by only 0.10 percentage points (97.66% vs. 97.56%), suggesting the dataset's signal is strong enough that several solid algorithms converge to similar performance.
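The size of these gaps can be checked directly from the reported accuracies. The snippet below is just arithmetic on the numbers in the comparison table; nothing is retrained.

```python
# Test accuracies (%) as reported in the comparison table.
accuracy = {
    "Random Forest": 97.66,
    "Gradient Boosting": 97.61,
    "XGBoost": 97.56,
    "Logistic Regression": 95.71,
    "Decision Tree": 95.12,
    "AdaBoost": 92.81,
    "K-Nearest Neighbors": 89.99,
}

ranked = sorted(accuracy.items(), key=lambda item: item[1], reverse=True)
top3 = ranked[:3]

# Spread within the top three, and the drop to the fourth-ranked model.
top3_spread = round(top3[0][1] - top3[-1][1], 2)
gap_to_fourth = round(top3[-1][1] - ranked[3][1], 2)

print(f"top-3 spread: {top3_spread} points")
print(f"gap to 4th ({ranked[3][0]}): {gap_to_fourth} points")
```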
| Category | Tools Used |
|---|---|
| 🐍 Language | Python 3 |
| 🤖 Machine Learning | scikit-learn, XGBoost |
| 📊 Data Handling | Pandas, NumPy |
| 📈 Visualization | Matplotlib, Seaborn |
| ☁️ Environment | Google Colab |
stellar-object-classification/
│
├── stellar_obj_classification.py # Main script: EDA, modeling, evaluation, tuning
├── star_classification.csv # SDSS dataset
└── README.md # You are here
1. Clone the repository

       git clone https://github.com/sarthakNaikare/stellar-object-classification.git
       cd stellar-object-classification

2. Install dependencies

       pip install pandas numpy scikit-learn xgboost seaborn matplotlib

3. Run the script

       python stellar_obj_classification.py

The script loads the data, runs the full exploratory analysis, trains all 7 models, prints classification reports for each, and finishes with a hyperparameter tuned Gradient Boosting model.
- 🧮 Add a confusion matrix per model to see exactly where Galaxy, Quasar, and Star predictions get confused
- ⚡ Run SVM on a subsampled training set or with a linear kernel for a fair, faster comparison
- 🎯 Apply the same `RandomizedSearchCV` tuning to Random Forest and XGBoost, not just Gradient Boosting
- 🧬 Try feature selection to see if dropping survey metadata columns improves or hurts performance
- 📊 Add SHAP values to explain which features drive each classifier's decisions
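For the confusion matrix item, a starting point could look like the sketch below, which uses scikit-learn's `confusion_matrix`. The label arrays are made up for illustration; in the script they would be the test labels and one model's predictions.

```python
from sklearn.metrics import confusion_matrix

class_names = ["GALAXY", "QSO", "STAR"]  # encoded as 0, 1, 2

# Hypothetical true and predicted labels, for illustration only.
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 0, 2, 2, 2]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])

for name, row in zip(class_names, cm):
    print(f"{name:<7} {row}")
```

Off-diagonal entries show which pairs of classes get confused; in this toy data the only errors are between GALAXY and QSO.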