
ESMDynamic


This is the code repository for ESMDynamic: Fast and Accurate Prediction of Protein Dynamic Contact Maps from Single Sequences. This repository is based on Evolutionary Scale Modeling, which has been archived.



Usage

Quick Start

If you only need predictions for a small number of sequences, we recommend simply using our Google Colab Notebook with manual sequence entry.

Otherwise, building a Docker image from the Dockerfile is the simplest way to get started. Within the container, run_esmdynamic can predict sequences in batches from a FASTA or CSV file via the --fasta or --csv flags.

Installation

We recommend using the Dockerfile method to create an image with all required packages. Due to package deprecations, it may be difficult to install all requirements in a plain Python (e.g., Conda) environment; in addition, the Docker setup conveniently downloads the model weights. The only downside is that the Docker image requires relatively more disk space (~20 GB).

Docker

First, make sure you have installed Docker.

Since a GPU is recommended to run the model, you should have installed the NVIDIA Container Toolkit as well.

Next, follow the commands:

git clone https://github.com/ShuklaGroup/esmdynamic.git # Clone repo
cd esmdynamic
docker build -t esmdynamic .
docker run --rm -it --gpus all -v "$PWD":/workspace esmdynamic # Run container in current dir w/GPU access
run_esmdynamic -h # Print help for prediction script 

Conda

Install Conda if it is not available. Create an environment and install the packages (tested with Python 3.11, CUDA 12.9, and torch 2.8.0):

conda create -n esmdynamic python=3.11.13
conda activate esmdynamic
conda install -c nvidia cuda-nvcc=12.9.86 cuda-toolkit=12.9.1
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu129
pip3 install mdtraj scipy omegaconf pytorch_lightning biopython ml_collections einops py3Dmol modelcif matplotlib plotly[express] dm-tree tensorboard
pip3 install git+https://github.com/NVIDIA/dllogger.git
pip3 install --no-build-isolation 'git+https://github.com/sokrypton/openfold.git' # Use the ColabFold fork!
pip install git+https://github.com/ShuklaGroup/esmdynamic.git

You can then run the run_esmdynamic script for inference:

run_esmdynamic -h # Print docs, will download weights when needed

Bulk Prediction

The predict.py script implements the run_esmdynamic executable. Its usage is:

usage: run_esmdynamic [-h] (--sequence SEQUENCE | --fasta FASTA | --csv CSV) [--batch_size BATCH_SIZE] [--chunk_size CHUNK_SIZE] [--device {cpu,cuda}] [--output_dir OUTPUT_DIR]
                      [--chain_ids CHAIN_IDS] [--low_memory] [--save_html] [--save_png] [--save_txt] [--save_raw_pt] [--num_recycles NUM_RECYCLES]

Predict dynamic contacts, frequency, and kinetics using ESMDynamic.

options:
  -h, --help            show this help message and exit
  --sequence SEQUENCE   Single sequence string.
  --fasta FASTA         Path to FASTA file with sequences.
  --csv CSV             CSV file with sequences (first column ID, second column sequence).
  --batch_size BATCH_SIZE
                        Batch size.
  --chunk_size CHUNK_SIZE
                        Model chunk size.
  --device {cpu,cuda}   Device to use.
  --output_dir OUTPUT_DIR
                        Directory where outputs will be written.
  --chain_ids CHAIN_IDS
                        Chain IDs to use for labels (e.g. ABCDEF). Default: A-Z.
  --low_memory          Use low-memory inference mode.
  --save_html           Also save interactive HTML heatmaps.
  --save_png            Save PNG heatmaps/plots.
  --save_txt            Save text/CSV outputs.
  --save_raw_pt         Save a .pt bundle with all cropped outputs for each sequence.
  --num_recycles NUM_RECYCLES
                        Optional number of recycles to pass to the model.

With FASTA input, the headers are used as protein IDs. With CSV input, the first row contains column headers, the first column contains protein IDs, and the second column contains the protein sequences.
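As a reference for the expected FASTA input, here is a minimal reader (a hypothetical helper for illustration, not part of the repository) that recovers the (ID, sequence) pairs described above:

```python
# Minimal FASTA reader: each ">" header becomes a protein ID, and the
# following lines are concatenated into its sequence.
def read_fasta(path):
    records = []
    header, chunks = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```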

Use : to separate chains in multi-chain sequences (in the Colab Notebook, use / instead).
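Scripting the CSV input is straightforward: one header row, then (ID, sequence) rows, with chains joined by ":". The file name, IDs, and sequences below are placeholders, not examples from the repository:

```python
import csv

# Build an input CSV for run_esmdynamic: header row, then
# (protein ID, sequence) rows; chains are joined with ":".
records = [
    ("monomer_example", "MKTAYIAKQRQISFVK"),
    ("dimer_example", ":".join(["MKTAYIAKQRQISFVK", "GHMASLE"])),
]
with open("input.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["id", "sequence"])
    writer.writerows(records)
```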

To recreate the dynamic contact maps in our publication, use either of the files in the examples directory:

run_esmdynamic --csv example.csv --output_dir example

To interpret the output, please see the next section.

Depending on your system's memory, you can adjust the defaults for --batch_size or --chunk_size to trade speed against VRAM. You can also try the --low_memory flag, which runs each head sequentially instead of in parallel, at a considerable cost in speed.

Output Interpretation

For a detailed breakdown of model outputs, please read our accompanying documentation: ESMDynamic Output Interpretation

Visualization

If you use the run_esmdynamic script or the Colab Notebook, you will obtain interactive HTML files that make visualization easier. Open them in a browser; functionality includes zooming and taking screen captures.


Available Models and Datasets

Pretrained Model

The ESMDynamic model weights are available at the Illinois Data Bank under DOI:10.13012/B2IDB-3773897_V2. Note that you must still obtain the ESMFold weights to run the model. A simple way to download the weights is:

import esm
model = esm.pretrained.esmdynamic()

Weights will be found in the path given by torch.hub.get_dir().
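If you prefer not to import torch just to locate that directory, its documented defaults can be reproduced with the standard library (a sketch that mirrors torch.hub.get_dir() under default settings: $TORCH_HOME/hub if set, otherwise ~/.cache/torch/hub):

```python
import os

def default_torch_hub_dir():
    # Mirrors torch.hub.get_dir() defaults: $TORCH_HOME/hub if set,
    # otherwise $XDG_CACHE_HOME/torch/hub, falling back to
    # ~/.cache/torch/hub.
    cache_root = os.environ.get("XDG_CACHE_HOME", os.path.expanduser("~/.cache"))
    torch_home = os.environ.get("TORCH_HOME", os.path.join(cache_root, "torch"))
    return os.path.join(torch_home, "hub")

print(default_torch_hub_dir())
```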

Datasets

Three datasets are available at DOI:10.13012/B2IDB-3773897_V2. Follow the instructions in the README at the Data Bank to convert the files to the format needed for training. Each directory contains information about the data splits (list of identifiers in CSV format).

Dataset Name     | Original Data Source | Related Publication
ATLAS (Test Set) | ATLAS Database       | ATLAS
mdCATH           | mdCATH Dataset       | mdCATH
RCSB Clusters    | RCSB                 | RCSB
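The split files mentioned above are plain CSV lists of identifiers; a loading sketch (whether a given file has a header row is dataset-specific, so this is an assumption to check against the Data Bank README):

```python
import csv

def load_split_ids(path):
    # Read a data-split file: each row's first field is taken as an
    # identifier. If the file has a header row, drop the first entry.
    with open(path) as fh:
        return [row[0] for row in csv.reader(fh) if row]
```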

Warning: datasets expand into large directories (>20 GB).

Human Proteome

You can access predictions for most proteins in the human proteome (UniProt Proteome ID UP000005640) in the data repository. See this table to find which archive fragment contains the predictions you need.

Training

These instructions apply to training on mdCATH. First, download and convert the required dataset from DOI:10.13012/B2IDB-3773897_V2, following the README from the Data Bank. Then use the train.py script from this repository. You will need to write a file with training parameters, e.g., train_params.txt; the example below fits the kinetics heads only:

--loss_heads=kinetic_logits,kinetic_confidence
--kin_class_weights=kinetic_weights.pt # Class weights, bundled with dataset
--train_identifiers_file=train.csv
--val_identifiers_file=val.csv
--data_dir=./mdcath/
--outpath=./training_data_kinetics
--batch_size=4
--batch_accum=16
--epochs=100
--train_samples_per_epoch=1000
--val_samples_per_epoch=100
--alpha=0.85
--gamma=2
--device=cuda
--lr=0.001
--pretrained=previous_weights_kinetics.pt # Path to a full state dict

Then, training can be run with:

python esm/esmdynamic/training/train.py @train_params.txt
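The @train_params.txt syntax suggests argparse's standard arguments-from-file mechanism, so the parameter file can also be generated programmatically. The values below simply mirror the example above; the writer itself is a convenience sketch, not part of the repository:

```python
# Generate train_params.txt from a dict of training options,
# one "--key=value" flag per line.
params = {
    "loss_heads": "kinetic_logits,kinetic_confidence",
    "train_identifiers_file": "train.csv",
    "val_identifiers_file": "val.csv",
    "data_dir": "./mdcath/",
    "outpath": "./training_data_kinetics",
    "batch_size": 4,
    "batch_accum": 16,
    "epochs": 100,
    "device": "cuda",
    "lr": 0.001,
}
with open("train_params.txt", "w") as fh:
    for key, value in params.items():
        fh.write(f"--{key}={value}\n")
```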

Citations

If you use this code or its related datasets, please cite:

@article {Kleiman2025.08.20.671365,
	author = {Kleiman, Diego E and Feng, Jiangyan and Xue, Zhengyuan and Shukla, Diwakar},
	title = {ESMDynamic: A Fast and Accurate Prediction of Protein Dynamic Contact Maps from Single Sequences},
	elocation-id = {2025.08.20.671365},
	year = {2025},
	doi = {10.1101/2025.08.20.671365},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2025/08/24/2025.08.20.671365},
	eprint = {https://www.biorxiv.org/content/early/2025/08/24/2025.08.20.671365.full.pdf},
	journal = {bioRxiv}
}

You should also cite the related publications (listed in the Datasets table above) where appropriate.

License

Code is shared under the MIT License.

Code from ESM is also shared under the MIT License (see THIRD_PARTY_NOTICES.txt).
