Rishit Kar1, Ved Ambular1, Sargam Nagar1, Varun Shenai1
1 Department of Computer Engineering, DJ Sanghvi College of Engineering
OMNIGATE is a deep learning framework designed for robust multi-modal cancer subtype classification. Unlike traditional fusion methods that simply concatenate features, OMNIGATE utilizes a dynamic context gating mechanism that learns to weigh the importance of specific omics layers (mRNA, miRNA, CNV, Methylation) on a per-sample basis.
The dataset used in this study is extracted from the MLOmics dataset, a publicly available multi-omics benchmark designed for cancer subtype classification tasks. It integrates heterogeneous molecular data collected from large-scale cancer genomics projects.
The core model learns a latent representation for each omics modality and then applies a context-aware gate to each latent block before final classification. This lets the network dynamically emphasize the most informative modality for each sample instead of relying on static concatenation alone.
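The gating idea can be illustrated with a small NumPy sketch (the shapes, the single linear gate head, and the softmax over modalities are illustrative assumptions; the real encoders and gate network live in `src/models.py`):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy latent vectors: 4 modalities, 2 samples, latent dim 8
latents = {m: rng.normal(size=(2, 8)) for m in ["mRNA", "miRNA", "CNV", "Methy"]}

# Global context = concatenation of all modality latents (per sample)
context = np.concatenate([latents[m] for m in latents], axis=1)  # (2, 32)

# A linear "gate head" maps context to one score per modality;
# softmax turns the scores into per-sample modality weights
W_gate = rng.normal(size=(context.shape[1], len(latents)))
gates = softmax(context @ W_gate, axis=1)                        # (2, 4)

# Gated fusion: scale each latent block by its gate, then concatenate
fused = np.concatenate(
    [gates[:, [i]] * latents[m] for i, m in enumerate(latents)], axis=1
)
print(fused.shape)
```

Because the gates depend on the per-sample context vector, two samples can weight the same modality differently, which is exactly what static concatenation cannot do.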
The current `src/` pipeline supports:
- Multi-cancer training across `GS-BRCA`, `GS-LGG`, `GS-OV`, `GS-COAD`, and `GS-GBM`
- Gated multi-omics neural fusion with focal loss and regularization terms
- Dynamic fold selection based on the minimum class count
- Classifier-head ablation with `Base_MLP`, `SVM`, `XGBoost`, and `Deeper_MLP`
- Aggregated gate-importance plots and Top-20 feature sensitivity plots
- Fold-wise and global CSV export for downstream analysis
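The dynamic fold selection listed above can be sketched as follows (the cap of 5 folds and the floor of 2 are illustrative assumptions, not the shipped defaults):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def pick_n_splits(labels, max_folds=5):
    """Choose the fold count from the rarest class so every fold
    can contain at least one sample of each class."""
    min_class_count = int(np.bincount(labels).min())
    return max(2, min(max_folds, min_class_count))

labels = np.array([0] * 30 + [1] * 20 + [2] * 3)  # rarest class: 3 samples
n_splits = pick_n_splits(labels)                  # capped at 3 by the rare class
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
print(n_splits)
```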
```
OMNIGATE/
├── preprocessing/
│   └── processed_multicancer/
│       └── GS-*/             # Per-cancer processed arrays and feature-name files
├── results_aggregated/       # Generated outputs after training
├── src/
│   ├── config.py             # Global settings, paths, runtime configuration
│   ├── data.py               # Dataset and feature-name loading
│   ├── models.py             # Losses and gated fusion network
│   ├── training.py           # Fold training and classifier ablation
│   ├── reporting.py          # Plotting and CSV export
│   └── main.py               # End-to-end training entrypoint
├── docs/
│   └── assets/               # README figures
├── final_ablation_summary_all_cancers.csv
└── requirements.txt
```
Each cancer directory under `preprocessing/processed_multicancer/` is expected to contain:
- `mRNA_processed.npy`
- `miRNA_processed.npy`
- `CNV_processed.npy`
- `Methy_processed.npy`
- `labels.npy`
- `mRNA_features.json`
- `miRNA_features.json`
- `CNV_features.json`
- `Methy_features.json`
Example:
```
preprocessing/processed_multicancer/GS-BRCA/
├── mRNA_processed.npy
├── miRNA_processed.npy
├── CNV_processed.npy
├── Methy_processed.npy
├── labels.npy
├── mRNA_features.json
├── miRNA_features.json
├── CNV_features.json
└── Methy_features.json
```
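A minimal loader for this layout might look like the sketch below (the function name and the synthetic example directory are illustrative; the actual loading logic lives in `src/data.py`):

```python
import json
import tempfile
from pathlib import Path

import numpy as np

OMICS = ["mRNA", "miRNA", "CNV", "Methy"]

def load_cancer_dir(cancer_dir):
    """Load per-modality arrays, feature names, and labels for one cancer."""
    cancer_dir = Path(cancer_dir)
    data = {m: np.load(cancer_dir / f"{m}_processed.npy") for m in OMICS}
    features = {
        m: json.loads((cancer_dir / f"{m}_features.json").read_text())
        for m in OMICS
    }
    labels = np.load(cancer_dir / "labels.npy")
    return data, features, labels

# Demonstrate with a tiny synthetic directory matching the expected layout
with tempfile.TemporaryDirectory() as tmp:
    d = Path(tmp) / "GS-BRCA"
    d.mkdir()
    for m in OMICS:
        np.save(d / f"{m}_processed.npy", np.zeros((4, 3)))
        (d / f"{m}_features.json").write_text(
            json.dumps([f"{m}_f{i}" for i in range(3)])
        )
    np.save(d / "labels.npy", np.array([0, 1, 0, 1]))
    data, features, labels = load_cancer_dir(d)
    print(data["mRNA"].shape, len(features["CNV"]), labels.shape)
```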
The easiest way to run this project reproducibly is with Docker. The container installs all dependencies, includes the training code, and runs the pipeline in a headless environment that is already configured for matplotlib.
```bash
docker build -t omnigate .
docker run --rm -it omnigate
```

This starts the full multi-cancer training pipeline and writes outputs inside the container at `/app/results_aggregated`.
If you want the generated plots and CSV files to appear directly in your local project folder, mount the results directory when running the container:
```bash
docker run --rm -it \
  -v "$(pwd)/results_aggregated:/app/results_aggregated" \
  omnigate
```

This is the most practical way to work with the project because the exported files remain available on your host system after the container exits.
Before running the container, make sure the processed dataset is already present in `preprocessing/processed_multicancer/`.
Each cancer directory should include:
- `mRNA_processed.npy`
- `miRNA_processed.npy`
- `CNV_processed.npy`
- `Methy_processed.npy`
- `labels.npy`
- `mRNA_features.json`
- `miRNA_features.json`
- `CNV_features.json`
- `Methy_features.json`
For most users, Docker is the recommended training path:
```bash
docker run --rm -it \
  -v "$(pwd)/results_aggregated:/app/results_aggregated" \
  omnigate
```

This command will:
- Load each cancer dataset from `preprocessing/processed_multicancer/`
- Train the gated fusion model with stratified cross-validation
- Evaluate alternative classifier heads on the learned fused representation
- Compute sensitivity-based Top-20 feature rankings for each omics modality
- Save all aggregated figures and CSV summaries to `results_aggregated/`
If you prefer not to use Docker, you can still run the code locally with a Python environment and `requirements.txt`. Docker remains the recommended path for the most reproducible setup.
After training, the pipeline writes outputs such as:
- `results_aggregated/final_ablation_summary_all_cancers.csv`
- `results_aggregated/<CANCER>/detailed_ablation_results.csv`
- `results_aggregated/<CANCER>/aggregated_gate_importance.png`
- `results_aggregated/<CANCER>/aggregated_top20_mRNA.csv`
- `results_aggregated/<CANCER>/aggregated_top20_mRNA.png`
Equivalent Top-20 feature files are also produced for miRNA, CNV, and Methy.
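For downstream analysis, the summary CSV can be inspected with pandas. A sketch is below; the column names (`cancer`, `head`, `accuracy`, `macro_f1`) and values are hypothetical stand-ins for the real schema:

```python
import io

import pandas as pd

# Hypothetical rows mirroring final_ablation_summary_all_cancers.csv;
# the real column names and metrics may differ.
csv_text = """cancer,head,accuracy,macro_f1
GS-BRCA,Base_MLP,0.88,0.85
GS-BRCA,SVM,0.86,0.83
GS-LGG,Base_MLP,0.91,0.90
"""
df = pd.read_csv(io.StringIO(csv_text))

# Best classifier head per cancer, ranked by macro F1
best = df.loc[df.groupby("cancer")["macro_f1"].idxmax(),
              ["cancer", "head", "macro_f1"]]
print(best)
```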
Main runtime settings are defined in src/config.py, including:
`MAX_EPOCHS`, `MIN_EPOCHS`, `PATIENCE`, `LR`, `WEIGHT_DECAY`, `ALIGN_W`, `ORTHO_W`, `GATE_ENT_W`, `SPARSITY_W`, `OMICS_DROPOUT_P`, and `LATENT_DIM`.
If you want to adapt the pipeline for new experiments, this is the first file to modify.
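As a rough sketch, the tunables in `src/config.py` follow this shape (all values below are placeholders for illustration, not the shipped defaults):

```python
# Illustrative excerpt of src/config.py; values are placeholders
MAX_EPOCHS = 200        # hard cap on training epochs per fold
MIN_EPOCHS = 20         # minimum epochs before early stopping may trigger
PATIENCE = 15           # early-stopping patience on validation loss
LR = 1e-3               # optimizer learning rate
WEIGHT_DECAY = 1e-4     # L2 regularization strength
ALIGN_W = 0.1           # weight of the cross-modality alignment term
ORTHO_W = 0.1           # weight of the latent orthogonality term
GATE_ENT_W = 0.01       # weight of the gate-entropy penalty
SPARSITY_W = 0.01       # weight of the gate-sparsity penalty
OMICS_DROPOUT_P = 0.2   # probability of dropping a whole modality in training
LATENT_DIM = 64         # per-modality latent size
```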
The model trains one encoder per modality, concatenates latent vectors to build global context, and then predicts modality-specific gates from that context. The gated latent vectors are fused and passed into a classifier head. Training combines focal loss with alignment, orthogonality, gate-entropy, and sparsity terms to improve robustness and reduce redundant modality usage.
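The focal-loss component of that objective can be sketched in NumPy (the γ value is illustrative; with γ = 0 the expression reduces to plain cross-entropy):

```python
import numpy as np

def focal_loss(probs, targets, gamma=2.0, eps=1e-12):
    """Mean focal loss: -(1 - p_t)^gamma * log(p_t), where p_t is the
    predicted probability of each sample's true class."""
    p_t = probs[np.arange(len(targets)), targets]
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t + eps)))

probs = np.array([[0.9, 0.1],
                  [0.4, 0.6],
                  [0.2, 0.8]])
targets = np.array([0, 1, 1])

# Confident predictions are down-weighted relative to cross-entropy
print(focal_loss(probs, targets, gamma=2.0))
print(focal_loss(probs, targets, gamma=0.0))  # plain cross-entropy
```

The `(1 - p_t)^gamma` factor shrinks the contribution of already well-classified samples, so training focuses on the hard (often minority-class) cases.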
The ablation workflow reuses learned fused embeddings and compares:
- Neural baseline head
- SVM head
- XGBoost head
- Deeper MLP head
This design makes it easier to test whether performance gains come from the representation itself, the classifier head, or both.
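The head comparison can be sketched with scikit-learn on a stand-in fused embedding (synthetic data; the `XGBoost` head is omitted here but follows the same fit/score pattern):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stand-in for learned fused embeddings: two well-separated clusters
X = np.vstack([rng.normal(0, 1, (60, 16)), rng.normal(2, 1, (60, 16))])
y = np.array([0] * 60 + [1] * 60)

# Swappable classifier heads evaluated on the same representation
heads = {
    "SVM": SVC(kernel="rbf", C=1.0),
    "Deeper_MLP": MLPClassifier(hidden_layer_sizes=(64, 32),
                                max_iter=500, random_state=0),
}
scores = {name: cross_val_score(clf, X, y, cv=3).mean()
          for name, clf in heads.items()}
for name, acc in sorted(scores.items()):
    print(f"{name}: {acc:.3f}")
```

If all heads score similarly on the frozen embedding, the gains are attributable to the representation; large gaps between heads point at the classifier instead.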
- Random seeds are fixed in `src/config.py`
- Fold generation uses `StratifiedKFold`
- CUDA is used automatically when available
- Output directories are created automatically on startup
This codebase is structured for research experimentation, internal benchmarking, and figure generation around multi-omics cancer subtype classification. For production or clinical deployment, additional dataset validation, calibration, uncertainty estimation, and external evaluation would be required.
