This repository explores pulsar candidate data using unsupervised learning techniques. The dataset contains examples of real pulsars and of noise caused by radio frequency interference (RFI). The project focuses on:
- Extensive Exploratory Data Analysis (EDA)
- Dimensionality reduction using UMAP, PCA, and t-SNE
- Clustering and performance evaluation in reduced dimensions
The dataset consists of 17,898 examples:
- 16,259 noise examples (class 0)
- 1,639 pulsar examples (class 1)
Each example is described by 8 continuous features:
- Mean of the integrated profile
- Standard deviation of the integrated profile
- Excess kurtosis of the integrated profile
- Skewness of the integrated profile
- Mean of the DM-SNR curve
- Standard deviation of the DM-SNR curve
- Excess kurtosis of the DM-SNR curve
- Skewness of the DM-SNR curve
The dataset is highly imbalanced: pulsars make up only about 9.2% of the examples, or roughly one pulsar for every ten noise examples.
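
The imbalance can be quantified directly from the class counts listed above. This is a minimal sketch using only the Python standard library; the variable names are illustrative and are not taken from the notebooks.

```python
# Class counts as stated in this README.
counts = {0: 16_259, 1: 1_639}  # 0 = noise (RFI), 1 = pulsar

total = sum(counts.values())
pulsar_share = counts[1] / total          # fraction of examples that are pulsars
noise_per_pulsar = counts[0] / counts[1]  # imbalance ratio

print(f"total examples:   {total}")
print(f"pulsar share:     {pulsar_share:.1%}")
print(f"noise per pulsar: {noise_per_pulsar:.1f}")
```

A model that labels every candidate as noise would already be about 90.8% accurate, so accuracy alone says little about how well pulsars are separated.
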
The repository contains the following components:
- EDA Notebook: Extensive exploratory analysis of the dataset.
- Dimensionality Reduction Notebook: Implementation of UMAP, PCA, and t-SNE for reducing dimensions.
- Clustering Notebook: Clustering in reduced dimensions using K-Means and DBSCAN.
- Performance Evaluation: Comparison of dimensionality reduction techniques based on visualizations and clustering metrics.
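
The workflow these components describe (reduce, cluster, evaluate) might look roughly like the sketch below. It is not the notebooks' actual code: it assumes scikit-learn and NumPy are installed, uses synthetic data as a stand-in for the real dataset, and shows only PCA with K-Means. UMAP (from the separate `umap-learn` package) or t-SNE could be swapped in for the reduction step, since both expose `fit_transform`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 8 features, roughly 10:1 class imbalance.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(1000, 8)),  # "noise" (class 0)
    rng.normal(loc=4.0, scale=1.0, size=(100, 8)),   # "pulsar" (class 1)
])
y = np.array([0] * 1000 + [1] * 100)

# Standardize so that no single feature dominates the projection.
X_scaled = StandardScaler().fit_transform(X)

# Reduce to 2 dimensions.
X_2d = PCA(n_components=2, random_state=0).fit_transform(X_scaled)

# Cluster in the reduced space.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_2d)

# Internal metric (no labels needed) and external metric
# (true labels are used for evaluation only, never for fitting).
sil = silhouette_score(X_2d, labels)
ari = adjusted_rand_score(y, labels)
print(f"silhouette: {sil:.3f}, adjusted Rand index: {ari:.3f}")
```

The silhouette score measures cluster cohesion without labels, while the adjusted Rand index compares cluster assignments against the known classes. Reporting both helps separate "the embedding forms tight clusters" from "the clusters correspond to pulsars and noise".
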
To run this project locally:
- Clone this repository: