Skip to content

gohft/MALLORN_Astronomical_Binary_Classification

Repository files navigation

Problem

Dataset is from the MALLORN Astronomical Classification Challenge.

The 10-Year Legacy Survey of Space and Time (LSST) is expected to detect more astronomical events. The task is to develop a classifier to identify Tidal Disruption Events (TDE) among the vast amount of light curve data (multi wavelength time series that record the variation in light intensity over time) expected from the LSST. TDE is a rare phenomemon when a star gets ripped apart by a supermassive black hole when it passes too close to the black hole, generating an exceptionally luminous radiation. TDE are useful to detect black holes and could help physicist better understand general relativity and extreme gravity.

Summary

Large datasets were combined as a parquet file to reduce storage space for easier processing. Two methods of feature extraction from multi-variate time series was employed: MiniRocket (1D CNN) and light_curve (time series feature extraction for astrophysics). Resampling of irregular time series to regularly sampled time series across objects and across variables was done using Gaussian process regressor with intrinsic coregionalization model (multi-output Gaussian process ) to ensure features extracted from MiniRocket is equivalent across objects. Two types of binary classifiers were hyperparameter tuned (stratified kfold cross validation) and evaluated: Logistic regression (SGDClassifer with binary log loss) and LightGBM (boosting algorithm). The best model was the one that used features extracted from the original irregularly sampled data using the light_curve feature extractor and passing it to the LightGBM, resulting in a F1 score of 0.519 for the test set. From feature importance plot, the skewness of the time series in r wavelength was the most important feature in the best model.

Note: Claude AI was used to help with the coding.

EDA

There are two datasets:

  1. Log data: Contains the redshift data, the extinction coefficient data and classification for each object.
  2. Light curve data: Multi-wavelength flux (value and error) across time per object.

Only the train dataset is used as it had the classification class necessary for evaluation.

Dataset

  1. Log data
  • The total number of object is 3043 and there are no duplicates, missing data nor wrong labels.
  • All redshift and extinct coefficient values are non-negative and within expected range of values, hence there is no anomalous data.
  • The dataset is highly imbalanced with roughly 5% (148/3043) TDE events.
  • Redshift causes time dilation (observed interval is stretched) and longer wavelengths (wavelength observed by the filter is actually longer than the actual wavelength).
  • Applying the redshift correction on flux data will cause the wavelength range to be inconsistent across objects making it difficult to compare and extract the features across all objects (each object has different redshift value), therefore no redshift correction is done on the light curve but the redshift value will be added as one of the features to help in classification.
  1. Light curve data
  • Observation had 6 wavelengths: g, i, r, u, y, z.
  • There is missing data in flux values and those rows are dropped.
  • There are 21 objects (less than 1% of overall data) with missing data for at least one wavelength, and they are removed as it does not affect the overall target distribution.
  • Objects IDs with too few number of data points (less than 5) for a wavelength are filtered out since no meaningful features can be extracted from short time series.
  • After filtering, there is 2868 objects left.
  • From plotting some flux data, the multi-wavelength light curve are irregularly sampled, with inconsistent length and time period across all objects and across the different wavelength for a single object. Hence, this property has to be considered when doing feature extraction. Scatter plot of multi-wavelength light curve for an object
  • Extinction coefficient from log data represents the extent of the dimming of the light curve due to scattering/absorption as it cross space to reach the observation sensor. Flux correction is applied on all wavelengths to get back the actual flux and flux error values of the light curve following the code provided along with the dataset which uses the Fitzpatrick99 extinction law from the extinction library. After correction, the flux values changes slightly but the time stamps remain unchanged.

Classifier

The trained classifier is obtained by fitting to the train set using the best hyperparameters obtained from stratified Kfold cross validation hyperparameter tuning process. There are two types of binary classifier to evaluate. Both models include the early stopping mechanism by monitoring the validation set loss to prevent overfitting and accelerate the hyperparameter tuning process as well as weighted binary log loss (weights are the inverse frequency of class size) to handle the class imbalance.

1) Logistic Regression

Logistic regression is one of the fast and effective binary classifier with high interpretability (from feature weights). The SGDClassifier model from sklearn is used instead of the LogisticRegression model since it allows for per epoch loss/metric monitoring which is required for early stopping mechanism. Elastic net regularization is included to prevent overfitting. The logistic regression class can be found in src/train/logreg_trainer.py.

2) LightGBM (Light Gradient Boosting Machine)

The LGBMClassifier model from lightgbm library is used for classification. Compared to another high performing boosting algorithm, XGBoost, LightGBM allows for faster training through histogram based splitting, Exclusive Feature Bundling and Gradient-based One-Side Sampling which is suitable for large dataset. Regularization and subsampling of features/data points are applied to prevent overfitting. The LighGBM class can be found in src/train/lgbm_trainer.py.

Metrics

Due to the high class imbalance, accuracy score is misleading. Binary F1 score is used as the metric for evaluation as it focuses on the minority class (TDE class) and provide a balance between precision and recall.

Hyperparameter tuning

The original train set is split to a train and test set. Hyperparameter tuning is done on the new train set using the Optuna library with multivariate Tree-structured Parzen Estimator (TPE) as the Bayesian search algorithm that can handle hyperparameters with correlation and median pruner as the pruner to stop unpromising trials (trial whose performance is poorer than the median score of completed trials are pruned before fitting/evaluating on all folds). A pre-determined number of random trials is done to set the initial search space before TPE is applied.

Stratified Kfold cross validation is used to select the hyperparameters to handle the class imbalance and also ensure robust evaluation on validation set. The average binary F1 score across all the k fold validation set (obtained by early stopping/reach maximum iterations) is the metric to optimize in the Bayesian search as the binary classifier aims to have the optimal F1 score.

Each set of hyperparamters and their corresponding loss and metric on the kfold train and validation set is recorded in MLflow for tracking and logging purpose. The hyperparameter search space for each classifier, the optuna search parameters and the training parameters can be modified from the training configuration file config/train/default.yaml or updated using Hydra override when executing the training script.

Feature Extractor

Feature extractor aims to extract features from the original irregularly sampled multi-wavelength light curve or the resampled regularly sampled multi-wavelength light curve and subsequently input them into the classifier for classification. After extracting the features from the light curve data, it appends the redshift data from the log data as one of the features.

Since the number of features extracted can be large, feature selection is applied to reduce the number of features.

  • Variance thresholding: Remove features with low variance (less than 0.01) since features that are almost constant would not be useful in distinguishing between classes. This is the default feature selection.
  • Absolute Spearman's pairwise correlation thresholding: Use Spearman's correlation instead of Pearson's correlation since some features might not follow a normal distribution and there could be some outliers in the data. For a pair of features that have strong absolute correlation (above 0.95), remove the feature that has lower correlation to the target class. This is applied after variance thresholding if the number of features extracted is still too large.

For classifiers that are either distance-based or uses stochastic gradient descent, standardization (mean of 0 and variance of 1) is applied on the features after feature selection (fitted on the train set before applying transformation on the whole data) to ensure that all features are comparable and for fast convergence.

1) MiniRocket

1D convolutional kernels with different kernel and stride sizes can be applied along the time dimension of multi-variate time series to extract local temporal patterns across the different variables. However, there some issues with the light curve data.

  • Objects have time series of different lengths: The resulting time series output from the 1D CNN kernels will have different lengths across objects and hence not comparable across different objects. Hence, there is a need to do some aggregation to transform the resulting time series output to a single feature to be comparable across all objects.

  • Objects can have different sampling rates and lengths across the different variables for a time series: 1D convolutional kernels requires multi-variate time series to have the same length across all variables for it to function.

  • Objects have different sampling rates for time series: Features extracted from each 1D CNN kernel will not be equivalent across different objects. For example, given the same 1D CNN filter with kernel and stride size of 1 that passes through all the time series, it will extract daily features from an object with daily sampling rate, while, it will extract monthly features from an object with monthly sampling rate. If these non-equivalent features are group together as the same feature and input into a classifier, it could confuse the classifier, leading to poor performance. Therefore, there is a need to resample the objects to have the same sampling rate across all objects and all variables.

Therefore, there is need to resample the light curve data such that all objects and all variables have the same sampling rate before using convolution kernels to extract features from each multi-variate light curve. The resampling process will be elaborated later in the subsequent chapter.

Assuming that the light curve has be transformed to be regularly sampled, MiniRocketMultivariateVariable (MINImally RandOm Convolutional KErnel Transform (MiniRocket) multivariate) from sktime library is used extract features from each light curve. Rocket (RandOm Convolutional KErnel Transform) models does train the CNN kernels but instead use a fixed set of large number of random kernels to extract large number of features in hope that a few of the features extracted can be useful for the model. Since large number of features is expected to be extracted, variance thresholding and correlation thresholding is applied after the features are extracted.

MiniRocketMultivariateVariable algorithm is chosen for the following reasons:

  1. Able to handle multi-variate time series.
  2. Does aggregation to transform the resulting time series from each kernel to a single feature (Proportion of Positive Values).
  3. Able to handle objects with different time series lengths.
  4. Ablel to handle objects with different lengths across the variables in the multi-variate time series by padding shorter time series with default values to match the longer time series.

There exists alternative 1D CNN feature extractors like Rocket and MultiRocket from the same sktime library. Even though the alternative algorithm extract more features per kernel and that MultiRocket also calculate features from the first order differenced time series (useful for time series where the rate change is important), they are not chosen as they are unable to handle the different lengths across the variables in a multi-variate time series.

The MiniRocket feature extractor class can be found in src/data_processing/minirocket_feature_extractor.py and its corresponding configurations can be updated from the minirocket section in the data processing configuration file config/data_processing/default.yaml.

2) Light curve extractor

Features like variability, amplitude, shape, trends and periodicity can be extracted from time series. The light_curve extractor from the light_curve library is used to do that. The is another open source python library feets (feATURE eXTRACTOR FOR tIME SERIES) that is widely used in astronomy to extract features from light curve data, but light_curve is chosen as it uses Rust which is accelerates the computational process.

Light curve extractor independently extracts features for each of the wavelength in the multi-wavelength light curve using both the flux and flux error values. The extractor is able to handle irregularly sampled time series and since each wavelength is processed independently, there is no need to resample light curve data to the same sampling rate across all objects and across all wavelenghts. Besides the standard time series features, light curve extractor is able to do light curves specific single band parametric curve fit and extract the parametric values as features (fitting algorithm is deterministic for reproducibility). There are three types of parametric curve fits: VillarFit, BazinFit and LinexpFit, and each make some assumption about the nature of the growth and decay of the light curves. The parametric curve fit to apply depends on the provided light curve data, since some objects in the original data has too few data points required for the fitting. One limitation of this light curve extractir is that it does not extract patterns/features across the different wavelengths, missing out on multi-variate features.

The light_curve feature extractor class can be found in src/data_processing/lightcurve_feature_extractor.py and its corresponding configurations can be updated from the lightcurve section in the data processing configuration file config/data_processing/default.yaml.

Resampling using Gaussian Process Regressor

Resampling the multi-wavelength light curve data to regular sampling rate across objects and across wavelengths is necessary in order to use convolutional kernels to extract relevant features. The Gaussian process regressor from the gpflow library is used to fit to the multi-wavelength light curve data for each object and than resampled to get the flux values following the specified samping rate and time period. The resampling for each object is done independently of one another. Gaussian process regression is suitable for this resampling process because it provides uncertainty estimates which can be converted to flux errors, necessary for light curve feature extractors and it is very effective for sparse data which is important since some objects have short time series.

Gaussian process regression used to resample the light curves includes a Intrinsic Coregionalization Model (ICM) and uses the Matern 3/2 kernel.

  • ICM: This model allows the Gaussian process regressor to learn relationship between the different wavelengths in the multi-wavelength light curve and allow it to predict values across the multiple wavelengths simultaneously. This is very useful as the objects in the data have irregular sampling rates across the wavelengths and ICM can use other wavelengths' flux values to predict fluc values for those missing wavelength. ICM is a simple model that uses one coregionalization matrix to capture covariance between the different wavelengths in the multi-wavelength light curve. Using only one matrix is appropriate for this data since the flux values for the different wavelengths of an object should be influenced by the same underlying physical process.
  • Matern 3/2 kernel: The kernel used should be general so as to fit to as many types of astronomical phenomenons as possible, since the resampling process is done before the classification is known. Matern 3/2 kernel restricts the fitting curves to be once differentiable smooth functions, which is appropriate for general astronomical phenomenons since some events has sharp spikes and hence should not be overly smooth.

The Gaussian process regression fitted to use both the wavelength and time step to output the flux value. The fitting depends on the initialization values and may not converge the first time, hence reattempts are included to increase the chance of convergence. Seeding and setting of the environment variables for the deterministic optimizer ensures reproducible resampled light curves.

The Gaussian process regressor class can be found in src/data_processing/gpregressor.py and the corrected filtered light curve data is processed to get a regularly sampled data parquet with daily samplng rate before any feature extraction is done.

Scatter plot of corrected vs resampled multi-wavelength light curve for object: mithril_dam_agarwaen Scatter plot of corrected vs resampled multi-wavelength light curve for object: nidh_nim_firen

Comparison between corrected and resampled multi-wavelength light curve

Resampled data might contain inaccurate/unrealistic data which will affect the reliability of the features extracted from resampled data.

  • When there are gaps in corrected data as observed in first plot for object mithril_dam_agarwaen, the resampled data assumed that it has value of 0 across the gaps which might not be true, thus creating artificial values.
  • As observed in the second plot in object nidh_nim_firen, the resampled flux data may have less extreme points as the corrected flux data (though the trend follows the corrected data), causing a mismatch between the resampled data and the corrected data, thus the resampled data might not be representative of the original data.

Results

The provided train dataset was split to a train and test set (30%). Hyperparameter tuning and training the model using the best hyperparameters on the train set were done on all possible pairs of feature extractor (applied in original or resampled light curve data) and binary classifer. MiniRocket feature extractor can only be applied on the resampled data while Light curve extractor could be applied to both original and resampled data. For training with the best hyperparameters, the maximum iteration (number of epochs for logistic regression and number of estimators for LightGBM) to used was the average maximum iteration used across all kfolds that correspond to the best hyperparameters.

The final trained model was evaluated on the test set to get the best model. All the training and evaluation results (f1 score, binary log loss, confusion matrix and feature importance plot) as well as the trained model weights are logged to MLflow (local storage) for reference and future use.

Table 1: Evaluation of the Test Set for the Different Pairs of Feature Extractor and Binary Classifier

Model MiniRocket + LR MiniRocket + LightGBM Light curve (original data) + LR Light curve (original data) + LightGBM Light curve (resampled data) + LR Light curve (resampled data) + LightGBM
Number of Features 2531 2531 219 219 306 306
Best Hyperparameters 'alpha': 0.00244
'l1_ratio': 0.0582
'learning_rate': 'constant'
'eta0': 0.000398
'num_leaves': 47
'max_depth': 12
'learning_rate': 0.174
'min_child_samples': 96
'subsample': 0.649
'colsample_bytree': 0.619
'reg_alpha': 4.19e-08
'reg_lambda': 2.13e-07
'alpha': 0.0107
'l1_ratio': 0.459
'learning_rate': 'optimal'
'eta0': 0.000105
'num_leaves': 153
'max_depth': 11
'learning_rate': 0.120
'min_child_samples': 54
'subsample': 0.737
'colsample_bytree': 0.859
'reg_alpha': 0.00996
'reg_lambda': 4.32e-05
'alpha': 0.000313
'l1_ratio': 0.870
'learning_rate': 'optimal'
'eta0': 0.0162
'num_leaves': 163
'max_depth': 12
'learning_rate': 0.197
'min_child_samples': 80
'subsample': 0.542
'colsample_bytree': 0.599
'reg_alpha': 0.000213
'reg_lambda': 0.868
F1 score (train set) 0.838 1.00 0.412 0.990 0.699 1.0
F1 score (test set) 0.362 0.444 0.284 0.519 0.290 0.465
Precision (test set) 0.311 0.486 0.183 0.568 0.225 0.476
Recall (test set) 0.432 0.409 0.636 0.477 0.409 0.455

Plot 1: Confusion Matrix of Test Set for Best Model (Light curve (original data) + LightGBM)

Confusion matrix of test set for best model

Plot 2: Top 10 Features for Best Model (Light curve (original data) + LightGBM)

Top 10 features for best model

  • The model with the best F1 score on the test set is the one with light curve feature extractor applied on the original data followed by the LightGBM binary classifier. There is slight difference in the precision and recall score hence the best model does not favor precision nor recall.
  • For all 3 LightGBM models, the train set has a F1 score of almost 1 which is perfect score, indicating overfitting to train set, despite regularization and subsampling included to prevent overfitting, indicating that the maximum depth of trees and number of leaves could be too large.
  • Logistic regression models which are simpler models performed poorer than the corresponding LightGBM models with the same features extracted.
  • Applying light curve feature extractor on resampled data led to slightly more features (306 vs 219) but did not improve the final fitted model.
  • The three most important features (largest gains) of the best model are obtained from the r wavelength and the most important feature which has more than twice the gain than other feature is the skewness of the time series in r wavelength.

Further Exploration

  1. Physics informed features: Features extracted using MiniRocket are temporal features but the aggregation could mask the finer details sequence of data. Features extracted from light curve extractor is also mainly temporal features with only the parametric fit values that are specific to light curves. The quality of features extracted affects the quality of the final trained model, hence more research can be done to derive useful physics informed light curve features.
  2. Explore other binary classifiers: Feature extractors produce tabular data so other tree-based binary classifiers like Random Forest or XGBoost can be explored.

About

ML pipeline for classifying Tidal Disruption Events from LSST astronomical light curves — including multi-output Gaussian process resampling of time series, using feature extractors like MiniRocket and light_curve, and binary classifiers like LightGBM and logistic regression.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors