Skip to content

acadev/APAS-SB

Repository files navigation

APAS-SB: Advanced Protein Analysis System with Structure-Based Learning

A hybrid implementation combining multi-task learning with Boltz-2 datasets for comprehensive protein property prediction.

🎯 Overview

APAS-SB extends the PEARL (Protein structure prediction) model with:

  • Multi-task learning for diverse biochemical properties
  • Boltz-2 datasets for state-of-the-art binding affinity prediction
  • ΔΔG prediction for protein mutations
  • Uncertainty quantification for confidence estimation
  • Density-aware training for improved structure quality

πŸš€ Key Features

1. Multi-Task Predictions

  • Binding Affinity: Protein-ligand interactions (ChEMBL, BindingDB, PDBbind)
  • Protein-Protein ΔΔG: Interaction energy changes (SKEMPI 2.0)
  • Enzyme Catalysis: kcat predictions (BRENDA)
  • Fitness Scores: Deep mutational scanning (ProteinGym)

2. Hybrid Dataset Integration

  • 7.25M training examples across 11 datasets
  • Original multi-task datasets: 2.6M examples
  • Boltz-2 datasets: 4.6M examples
  • 200 GB processed data with embeddings

3. Advanced Training Strategies

  • Uncertainty-aware losses with confidence estimation
  • Density-aware training with electron density maps
  • Multi-task learning with shared representations
  • Three-phase training (structure β†’ confidence β†’ affinity)

πŸ“Š Dataset Summary

Dataset Examples Task Source
ChEMBL 600K Binding affinity Boltz-2
BindingDB 600K Binding affinity Boltz-2
PubChem HTS 2.0M Binary classification Boltz-2
ProteinGym 2.5M Fitness scores Original
BRENDA 100K Enzyme kcat Original
PDBbind 20K Binding affinity Original
SKEMPI 2.0 8K Protein-protein ΔΔG Original
Others 1.4M Various Boltz-2

πŸ› οΈ Installation

# Clone the repository
git clone https://github.com/acadev/APAS-SB.git
cd APAS-SB

# Install dependencies
pip install -r pearl/requirements.txt

πŸŽ“ Quick Start

1. Test All Boltz-2 Datasets

python scripts/test_all_boltz2_datasets.py

2. Test Boltz-2 Loss Functions

python scripts/test_boltz2_losses.py

3. Test MD Trajectory Loaders

python scripts/test_md_loaders.py

4. Download Datasets

python scripts/download_datasets.py --datasets mdcath atlas chembl bindingdb

5. Train on Oracle Cloud (64 H100 GPUs)

# Phase 2A: Baseline Training (48 GPUs, 13 days)
torchrun --nproc_per_node=8 --nnodes=6 \
    scripts/train_oracle_cloud.py --config scripts/oracle_cloud_config.yaml --phase 2a

# Phase 2B: Multi-task Training (56 GPUs, 17 days)
torchrun --nproc_per_node=8 --nnodes=7 \
    scripts/train_oracle_cloud.py --config scripts/oracle_cloud_config.yaml --phase 2b

# Phase 2C: Uncertainty-Aware Training (64 GPUs, 20 days)
torchrun --nproc_per_node=8 --nnodes=8 \
    scripts/train_oracle_cloud.py --config scripts/oracle_cloud_config.yaml --phase 2c

πŸ“ˆ Expected Performance

Binding Affinity (vs. Boltz-2)

  • FEP+ (OpenFE): 0.64-0.66 Pearson R (Boltz-2: 0.62)
  • CASP16: 0.66-0.68 Pearson R (Boltz-2: 0.65)
  • MF-PCBA: 0.026-0.028 AP (Boltz-2: 0.0248)

Multi-Task (New Capabilities)

  • Protein-Protein ΔΔG: 0.55-0.60 Pearson R
  • Enzyme kcat: 0.45-0.50 Pearson R
  • Fitness Scores: 0.50-0.55 Spearman ρ

πŸ’° Cost Estimates

Phase Time Cost Description
Data Preparation 40 days $8K Download + processing
Training (Pretrained) 24 days $20M 32Γ— A100 GPUs
Training (From Scratch) 75 days $87M 64Γ— A100 GPUs
Storage - $10/month 200 GB processed data

πŸ“š Documentation

Essential Documents

Quick Links

πŸ—οΈ Architecture

APAS-SB/
β”œβ”€β”€ pearl/                          # Core library
β”‚   β”œβ”€β”€ models/                     # Model architectures
β”‚   β”‚   β”œβ”€β”€ pearl.py               # Base PEARL model
β”‚   β”‚   β”œβ”€β”€ multitask_pearl.py     # Multi-task extension
β”‚   β”‚   β”œβ”€β”€ ddg_predictor.py       # ΔΔG prediction
β”‚   β”‚   └── ...
β”‚   β”œβ”€β”€ training/                  # Training utilities
β”‚   β”‚   β”œβ”€β”€ losses.py              # Loss functions
β”‚   β”‚   β”œβ”€β”€ boltz2_losses.py       # Boltz-2 loss functions (NEW)
β”‚   β”‚   └── ...
β”‚   └── data/                      # Data loaders
β”‚       β”œβ”€β”€ multitask_datasets.py  # 11 dataset loaders (NEW)
β”‚       β”œβ”€β”€ mdcath_loader.py       # mdCATH MD trajectories (NEW)
β”‚       β”œβ”€β”€ atlas_loader.py        # ATLAS MD trajectories (NEW)
β”‚       └── ...
β”œβ”€β”€ scripts/                       # Training & testing scripts
β”‚   β”œβ”€β”€ train_oracle_cloud.py      # Oracle Cloud training (NEW)
β”‚   β”œβ”€β”€ download_datasets.py       # Dataset downloader (NEW)
β”‚   β”œβ”€β”€ test_all_boltz2_datasets.py # Dataset tests (NEW)
β”‚   β”œβ”€β”€ test_boltz2_losses.py      # Loss function tests (NEW)
β”‚   β”œβ”€β”€ test_md_loaders.py         # MD loader tests (NEW)
β”‚   └── ...
└── docs/                          # Organized documentation
    β”œβ”€β”€ guides/                    # User guides
    β”œβ”€β”€ architecture/              # Technical architecture
    β”œβ”€β”€ summaries/                 # Cost & scaling analysis
    └── archive/                   # Historical documents

πŸ”¬ Research Applications

  1. Drug Discovery: Binding affinity prediction, hit-to-lead optimization
  2. Antibody Design: Protein-protein interaction engineering
  3. Metabolic Engineering: Enzyme catalysis optimization
  4. Protein Design: Fitness-guided directed evolution

πŸ“ Citation

If you use this code in your research, please cite:

@software{apas_sb_2024,
  title={APAS-SB: Advanced Protein Analysis System with Structure-Based Learning},
  author={acadev},
  year={2024},
  url={https://github.com/acadev/APAS-SB}
}

πŸ“„ License

[Add your license here]

🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

πŸ“§ Contact

For questions or issues, please open an issue on GitHub.


βœ… Implementation Status

Current Status: βœ… Steps 1-5 Complete - Ready for Oracle Cloud deployment

Completed Features

  • βœ… 7 Boltz-2 datasets implemented (ChEMBL, BindingDB, PubChem, CeMM, MIDAS, Decoys)
  • βœ… 3 Boltz-2 loss functions (Huber, Pairwise Ranking, Focal)
  • βœ… Download infrastructure for all datasets (mdCATH, ATLAS, etc.)
  • βœ… MD trajectory loaders (mdCATH: 135K trajectories, ATLAS: 4K trajectories)
  • βœ… Oracle Cloud training scripts (64 H100 GPUs, 3-phase training)
  • βœ… All components tested with synthetic data

Next Steps

  1. Download real datasets (Days 1-12 of roadmap)
  2. Test with real data
  3. Launch Phase 2A training on Oracle Cloud (48 GPUs, 13 days)
  4. Scale to Phase 2B and 2C (56-64 GPUs)

Last Updated: December 2024

About

Advanced Protein Analysis System with Structure-Based Learning - Hybrid multi-task model combining PEARL with Boltz-2 datasets

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors