A hybrid implementation combining multi-task learning with Boltz-2 datasets for comprehensive protein property prediction.
APAS-SB extends the PEARL (Protein structure prediction) model with:
- Multi-task learning for diverse biochemical properties
- Boltz-2 datasets for state-of-the-art binding affinity prediction
- ΔΔG prediction for protein mutations
- Uncertainty quantification for confidence estimation
- Density-aware training for improved structure quality
- Binding Affinity: Protein-ligand interactions (ChEMBL, BindingDB, PDBbind)
- Protein-Protein ΔΔG: Interaction energy changes (SKEMPI 2.0)
- Enzyme Catalysis: kcat predictions (BRENDA)
- Fitness Scores: Deep mutational scanning (ProteinGym)
- 7.25M training examples across 11 datasets
- Original multi-task datasets: 2.6M examples
- Boltz-2 datasets: 4.6M examples
- 200 GB processed data with embeddings
- Uncertainty-aware losses with confidence estimation
- Density-aware training with electron density maps
- Multi-task learning with shared representations
- Three-phase training (structure → confidence → affinity)
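As a rough illustration of what "uncertainty-aware" and "multi-task" can mean in practice, the sketch below uses a Gaussian negative log-likelihood in which the model predicts both a mean and a log-variance, and combines per-task losses with a weighted sum. This is an assumption about the general technique, not the code in `pearl/training/losses.py`; the function names, task names, and weights are hypothetical.

```python
import math


def gaussian_nll(pred_mean, pred_log_var, target):
    """Negative log-likelihood of `target` under N(pred_mean, exp(pred_log_var)),
    dropping the constant term. A large predicted variance softens the penalty
    for a wrong mean but is itself penalized by the log-variance term."""
    return 0.5 * (pred_log_var + (target - pred_mean) ** 2 / math.exp(pred_log_var))


def multitask_loss(per_task_losses, task_weights):
    """Weighted sum over the tasks present in this batch (hypothetical weighting)."""
    return sum(task_weights[name] * loss for name, loss in per_task_losses.items())


# Illustrative values only: the same error costs more when the model is confident.
confident_wrong = gaussian_nll(pred_mean=0.0, pred_log_var=-2.0, target=2.0)
uncertain_wrong = gaussian_nll(pred_mean=0.0, pred_log_var=1.0, target=2.0)

# Tasks missing from a batch (here "kcat") simply contribute nothing.
total = multitask_loss(
    {"affinity": 2.0, "ddg": 1.0},
    {"affinity": 1.0, "ddg": 0.5, "kcat": 1.0},
)
```

Predicting a log-variance rather than a variance keeps the variance positive without extra constraints.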
| Dataset | Examples | Task | Source |
|---|---|---|---|
| ChEMBL | 600K | Binding affinity | Boltz-2 |
| BindingDB | 600K | Binding affinity | Boltz-2 |
| PubChem HTS | 2.0M | Binary classification | Boltz-2 |
| ProteinGym | 2.5M | Fitness scores | Original |
| BRENDA | 100K | Enzyme kcat | Original |
| PDBbind | 20K | Binding affinity | Original |
| SKEMPI 2.0 | 8K | Protein-protein ΔΔG | Original |
| Others | 1.4M | Various | Boltz-2 |
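The per-source totals quoted earlier can be checked against this table with a short tally. The counts below are copied from the (rounded) table entries, so the result is only expected to match the headline figures approximately.

```python
from collections import defaultdict

# (dataset, examples, source), copied from the table above; values are rounded.
DATASETS = [
    ("ChEMBL", 600_000, "Boltz-2"),
    ("BindingDB", 600_000, "Boltz-2"),
    ("PubChem HTS", 2_000_000, "Boltz-2"),
    ("ProteinGym", 2_500_000, "Original"),
    ("BRENDA", 100_000, "Original"),
    ("PDBbind", 20_000, "Original"),
    ("SKEMPI 2.0", 8_000, "Original"),
    ("Others", 1_400_000, "Boltz-2"),
]

totals = defaultdict(int)
for _name, examples, source in DATASETS:
    totals[source] += examples

grand_total = sum(totals.values())

for source, count in sorted(totals.items()):
    print(f"{source}: {count / 1e6:.2f}M")
print(f"Total: {grand_total / 1e6:.2f}M")
```

The tally gives 4.60M Boltz-2 examples and about 2.63M original examples, roughly 7.23M in total, which is consistent with the stated 4.6M, 2.6M, and 7.25M once rounding of the table entries is taken into account.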
# Clone the repository
git clone https://github.com/acadev/APAS-SB.git
cd APAS-SB
# Install dependencies
pip install -r pearl/requirements.txt
# Run the test suites
python scripts/test_all_boltz2_datasets.py
python scripts/test_boltz2_losses.py
python scripts/test_md_loaders.py
# Download datasets
python scripts/download_datasets.py --datasets mdcath atlas chembl bindingdb
# Phase 2A: Baseline Training (48 GPUs, 13 days)
torchrun --nproc_per_node=8 --nnodes=6 \
scripts/train_oracle_cloud.py --config scripts/oracle_cloud_config.yaml --phase 2a
# Phase 2B: Multi-task Training (56 GPUs, 17 days)
torchrun --nproc_per_node=8 --nnodes=7 \
scripts/train_oracle_cloud.py --config scripts/oracle_cloud_config.yaml --phase 2b
# Phase 2C: Uncertainty-Aware Training (64 GPUs, 20 days)
torchrun --nproc_per_node=8 --nnodes=8 \
scripts/train_oracle_cloud.py --config scripts/oracle_cloud_config.yaml --phase 2c
- FEP+ (OpenFE): 0.64-0.66 Pearson R (Boltz-2: 0.62)
- CASP16: 0.66-0.68 Pearson R (Boltz-2: 0.65)
- MF-PCBA: 0.026-0.028 AP (Boltz-2: 0.0248)
- Protein-Protein ΔΔG: 0.55-0.60 Pearson R
- Enzyme kcat: 0.45-0.50 Pearson R
- Fitness Scores: 0.50-0.55 Spearman ρ
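For readers unfamiliar with the two correlation metrics quoted above, the following is a minimal standard-library sketch: Pearson R measures linear agreement between predictions and measurements, while Spearman ρ is Pearson R computed on ranks and so only rewards correct ordering. The helper names are hypothetical and this is not the project's evaluation code; in practice a library such as SciPy would typically be used.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson R applied to the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


# A monotonic but non-linear relationship: perfect ranking, imperfect linear fit.
predicted = [1.0, 2.0, 3.0, 4.0]
measured = [1.0, 4.0, 9.0, 16.0]
rho = spearman_rho(predicted, measured)
r = pearson_r(predicted, measured)
```

Here `rho` equals 1 because the ordering is preserved exactly, while `r` falls slightly below 1 because the relationship is not linear.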
| Phase | Time | Cost | Description |
|---|---|---|---|
| Data Preparation | 40 days | $8K | Download + processing |
| Training (Pretrained) | 24 days | $20M | 32× A100 GPUs |
| Training (From Scratch) | 75 days | $87M | 64× A100 GPUs |
| Storage | - | $10/month | 200 GB processed data |
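To compare the two training rows on a common scale, the durations and GPU counts can be converted to GPU-hours. The sketch below assumes every GPU is occupied around the clock for the full duration, which the table does not state; the helper name is hypothetical.

```python
def gpu_hours(num_gpus, days):
    """GPU-hours consumed, assuming all GPUs run 24 hours a day (an assumption)."""
    return num_gpus * days * 24


pretrained = gpu_hours(num_gpus=32, days=24)
from_scratch = gpu_hours(num_gpus=64, days=75)

print(f"Pretrained:   {pretrained:,} GPU-hours")
print(f"From scratch: {from_scratch:,} GPU-hours")
print(f"Ratio:        {from_scratch / pretrained:.2f}x")
```

Under that assumption the pretrained route uses 18,432 GPU-hours and training from scratch uses 115,200, a ratio of 6.25. Dividing a row's cost by its GPU-hours gives the implied hourly rate, which is a quick sanity check on the cost column.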
- Implementation Complete Summary - Latest implementation status (Steps 1-5)
- Development Roadmap - 85-day training plan for Oracle Cloud (64 H100 GPUs)
- Documentation Index - Complete documentation navigation
- Quick Start Guide - Getting started
- Deployment Guide - HPC deployment
- Architecture Overview - Technical design documents
- Cost Analysis - Training cost estimates
APAS-SB/
├── pearl/                              # Core library
│   ├── models/                         # Model architectures
│   │   ├── pearl.py                    # Base PEARL model
│   │   ├── multitask_pearl.py          # Multi-task extension
│   │   ├── ddg_predictor.py            # ΔΔG prediction
│   │   └── ...
│   ├── training/                       # Training utilities
│   │   ├── losses.py                   # Loss functions
│   │   ├── boltz2_losses.py            # Boltz-2 loss functions (NEW)
│   │   └── ...
│   └── data/                           # Data loaders
│       ├── multitask_datasets.py       # 11 dataset loaders (NEW)
│       ├── mdcath_loader.py            # mdCATH MD trajectories (NEW)
│       ├── atlas_loader.py             # ATLAS MD trajectories (NEW)
│       └── ...
├── scripts/                            # Training & testing scripts
│   ├── train_oracle_cloud.py           # Oracle Cloud training (NEW)
│   ├── download_datasets.py            # Dataset downloader (NEW)
│   ├── test_all_boltz2_datasets.py     # Dataset tests (NEW)
│   ├── test_boltz2_losses.py           # Loss function tests (NEW)
│   ├── test_md_loaders.py              # MD loader tests (NEW)
│   └── ...
└── docs/                               # Organized documentation
    ├── guides/                         # User guides
    ├── architecture/                   # Technical architecture
    ├── summaries/                      # Cost & scaling analysis
    └── archive/                        # Historical documents
- Drug Discovery: Binding affinity prediction, hit-to-lead optimization
- Antibody Design: Protein-protein interaction engineering
- Metabolic Engineering: Enzyme catalysis optimization
- Protein Design: Fitness-guided directed evolution
If you use this code in your research, please cite:
@software{apas_sb_2024,
title={APAS-SB: Advanced Protein Analysis System with Structure-Based Learning},
author={acadev},
year={2024},
url={https://github.com/acadev/APAS-SB}
}
[Add your license here]
Contributions are welcome! Please feel free to submit a Pull Request.
For questions or issues, please open an issue on GitHub.
Current Status: ✅ Steps 1-5 Complete - Ready for Oracle Cloud deployment
- ✅ 7 Boltz-2 datasets implemented (ChEMBL, BindingDB, PubChem, CeMM, MIDAS, Decoys)
- ✅ 3 Boltz-2 loss functions (Huber, Pairwise Ranking, Focal)
- ✅ Download infrastructure for all datasets (mdCATH, ATLAS, etc.)
- ✅ MD trajectory loaders (mdCATH: 135K trajectories, ATLAS: 4K trajectories)
- ✅ Oracle Cloud training scripts (64 H100 GPUs, 3-phase training)
- ✅ All components tested with synthetic data
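The three loss families named in the list above (Huber, pairwise ranking, focal) can be sketched in scalar form as follows. These are textbook definitions written with the standard library, offered as an assumption about what each name refers to; they are not taken from `pearl/training/boltz2_losses.py`, and the default values for `delta`, `gamma`, and `alpha` are illustrative.

```python
import math


def huber(pred, target, delta=1.0):
    """Quadratic for small errors, linear for large ones (robust to outliers)."""
    err = abs(pred - target)
    if err <= delta:
        return 0.5 * err ** 2
    return delta * (err - 0.5 * delta)


def pairwise_ranking(pred_hi, pred_lo):
    """Logistic loss on a pair whose first item has the higher measured value.
    Small when the model scores the first item above the second."""
    return math.log1p(math.exp(-(pred_hi - pred_lo)))


def focal(prob, label, gamma=2.0, alpha=0.25):
    """Binary focal loss: cross-entropy down-weighted for easy examples,
    useful when positives are rare, as in screening-style binary labels."""
    p_t = prob if label == 1 else 1.0 - prob
    alpha_t = alpha if label == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


small_error = huber(0.5, 0.0)          # quadratic regime
large_error = huber(3.0, 0.0)          # linear regime
right_order = pairwise_ranking(2.0, 0.0)
wrong_order = pairwise_ranking(0.0, 2.0)
easy_positive = focal(0.9, 1)
hard_positive = focal(0.1, 1)
```

In this sketch, a correctly ordered pair and a confidently classified positive both incur a much smaller loss than their mis-ordered or misclassified counterparts.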
- Download real datasets (Days 1-12 of roadmap)
- Test with real data
- Launch Phase 2A training on Oracle Cloud (48 GPUs, 13 days)
- Scale to Phase 2B and 2C (56-64 GPUs)
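The GPU counts quoted for Phases 2A to 2C follow from the `torchrun` arguments shown in the training commands: the total number of worker processes is `--nproc_per_node` multiplied by `--nnodes`. The sketch below recomputes them along with GPU-days per phase; the dictionary layout is hypothetical, and it assumes one process per GPU.

```python
# Values taken from the torchrun commands and phase comments in this README.
PHASES = {
    "2a": {"nproc_per_node": 8, "nnodes": 6, "days": 13},
    "2b": {"nproc_per_node": 8, "nnodes": 7, "days": 17},
    "2c": {"nproc_per_node": 8, "nnodes": 8, "days": 20},
}


def world_size(cfg):
    """Total worker processes launched by torchrun (one per GPU, by assumption)."""
    return cfg["nproc_per_node"] * cfg["nnodes"]


summary = {
    name: {"gpus": world_size(cfg), "gpu_days": world_size(cfg) * cfg["days"]}
    for name, cfg in PHASES.items()
}

for name, row in summary.items():
    print(f"Phase {name}: {row['gpus']} GPUs, {row['gpu_days']} GPU-days")
```

This gives 48, 56, and 64 GPUs for the three phases, matching the figures in the phase descriptions.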
Last Updated: December 2024