FRIdata


Installation and activation

  1. Download the repo
git clone https://github.com/Tomasz-Lab/FRIdata.git
cd FRIdata
  2. Install miniconda

  3. Install mamba

## prioritize 'conda-forge' channel
conda config --add channels conda-forge

## update existing packages to use 'conda-forge' channel
conda update -n base --all

## install 'mamba'
conda install -n base mamba
  4. Create a mamba environment
mamba create -f toolbox_env_conda.yml
  5. Activate mamba shell hook
# Choose your shell type. Could be one of these: {bash,cmd.exe,dash,fish,nu,posix,powershell,tcsh,xonsh,zsh}
eval "$(mamba shell hook --shell <replace with shell type>)"
  6. Activate the mamba environment
mamba activate tbe
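
If activation worked, the environment's interpreter should be the one on PATH. A minimal sanity check (assuming the `tbe` environment name used above):

mamba env list    # the active environment (tbe) is marked with an asterisk
which python3     # should point into the tbe environment
python3 --version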

Running tests

pytest ./tests
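
During development it can be handy to run only part of the suite; pytest's standard selection flags work as usual (the keyword expression below is a placeholder, not a test name from this repo):

pytest ./tests -x -q                    # stop at the first failure, quieter output
pytest ./tests -k "<keyword expression>"  # run only tests matching the expression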

Running on AFDB structures locally

Requires a directory with AFDB structures and a text file containing a list of AFDB IDs, one ID per line (newline-delimited).
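
A minimal sketch of what the IDs file could look like (the file name and IDs below are only illustrative, not entries shipped with the repo):

# afdb_ids.txt - one ID per line, newline-delimited
A0A2K6V5L6
P69905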

#
# Assuming all steps from `Installation and activation` succeeded
#
FRIDATA_PATH="<repository path>"
AFDB_PATH="<AFDB structures directory path>"
IDS_PATH="<AFDB IDs file path>"

cd ${FRIDATA_PATH}

EMBEDDER_TYPE=esm2_t33_650M_UR50D

# (MACOS only) Fix for OpenMP multiple runtime error
export KMP_DUPLICATE_LIB_OK=TRUE

PYTHONPATH='.' python3 -u ${FRIDATA_PATH}/fridata.py \
 generate_data \
 -t sequences,coordinates,distograms,embeddings \
 -d AFDB \
 -c subset \
 --version test  \
 -i ${IDS_PATH} \
 --input-path ${AFDB_PATH} \
 -e ${EMBEDDER_TYPE}

For subset runs with --input-path, new datasets store canonical keys as {line_from_ids_file}_{chain} (for example A0A2K6V5L6_A), not the full AlphaFold CIF filename stem. The dataset’s input_structures.idx maps each canonical key to the source structure filename. Older datasets created before this convention may still use long AF-style keys.
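
As a small illustration of the key convention above (using the example ID from this section; chain `A` is just an example):

# canonical key = {line_from_ids_file}_{chain}
ID="A0A2K6V5L6"        # a line taken from the AFDB IDs file
CHAIN="A"
echo "${ID}_${CHAIN}"  # prints A0A2K6V5L6_A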

Running as a CLI tool

Assuming all Installation and activation steps succeeded.

  1. Go into FRIdata directory
cd <path to FRIdata>
  2. Install as a CLI tool
python3 -m pip install -e .
  3. Now FRIdata can be run as a CLI tool
$ fridata <...>
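
For example, the local AFDB run shown earlier could be expressed through the installed entry point like this (assuming the `fridata` entry point accepts the same arguments as `fridata.py` above; paths and the embedder type are the same placeholders and values as in that section):

fridata generate_data \
 -t sequences,coordinates,distograms,embeddings \
 -d AFDB \
 -c subset \
 --version test \
 -i <AFDB IDs file path> \
 --input-path <AFDB structures directory path> \
 -e esm2_t33_650M_UR50D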

Running on HPC

Running FRIdata on HPC differs between CPU and GPU nodes. These instructions are valid for HPC systems hosted in the PLGrid infrastructure; running on other infrastructures may require additional adjustments.

Prerequisites:
- An active computing grant valid on the HPC
- All mandatory ENV vars set (ideally in `.bashrc`; see the example snippet after this list):
    - `DEEPFRI_PATH`: should always refer to the parent directory of this repo
    - `IDS_PATH`: path to a text file listing AFDB IDs
    - `AFDB_PATH`: path to AFDB structures (can be an empty directory; structures will be fetched into it)
    - `DATA_PATH`: path to the parent directory of all generated output data
    - Optional ENV vars with default values:
        - `COMMON_SLURM_PATH`: path to common_slurm.sh, defaults to `$DEEPFRI_PATH/FRIdata/scripts/hpc/common_slurm.sh`
        - `LAUNCH_WORKER_SLURM_PATH`: path to launch_workers_slurm.sh, defaults to `$DEEPFRI_PATH/FRIdata/scripts/hpc/launch_workers_slurm.sh`
        - `MEMORY_LIMIT`: memory limit per Dask worker, defaults to `288GiB`
        - `IP_INTERFACE`: network interface that Dask workers bind to, defaults to `ens1f0`
        - `CONDA_ENV_PATH`: path to the conda environment, defaults to `$DEEPFRI_PATH/conda_dev`
- The `miniconda3` module installed
- The `gcc` module installed
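
A minimal `.bashrc` sketch covering the mandatory variables (all paths are placeholders to replace with your own; the optional variables are shown with their documented defaults):

# mandatory ENV vars for FRIdata on HPC (paths are placeholders)
export DEEPFRI_PATH="<parent directory of the FRIdata repo>"
export IDS_PATH="<path to text file listing AFDB IDs>"
export AFDB_PATH="<path to AFDB structures directory>"
export DATA_PATH="<parent directory for generated output data>"

# optional overrides (shown with their defaults)
export MEMORY_LIMIT="288GiB"
export IP_INTERFACE="ens1f0"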

Steps:

1. Download the repo

git clone https://github.com/Tomasz-Lab/FRIdata.git
cd FRIdata


2. Set execute permissions on the HPC scripts

chmod -R u+x scripts/hpc/cpu


3. Run `initialize_slurm.sh`. Pass as an argument the path to the directory where the `.conda` directory should be installed, and add the `--cpu` flag if the script is run on a CPU cluster.

./scripts/hpc/initialize_slurm.sh <path to .conda> [--cpu]
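
For example, on a CPU cluster with the `.conda` directory kept under `$DEEPFRI_PATH` (that location is only a suggestion, not a requirement):

./scripts/hpc/initialize_slurm.sh ${DEEPFRI_PATH} --cpu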


4. Submit the sbatch script to the HPC with all the args specified. Available operations are: `sequences`, `coordinates`, `embeddings`. A filled-in example is shown after the templates below.


For CPU:

sbatch --cpus-per-task=<CPUs per task> --time=HH:MM:SS --nodes=<node count> --account=<grant account> scripts/hpc/run_slurm.sh sequences,coordinates


For GPU:

sbatch --gres=gpu[:gpu-number] --time=HH:MM:SS --account=<grant account> --nodes=1 --partition=<GPU partition> --cpus-per-task=<CPUs per task> scripts/hpc/run_slurm.sh embeddings
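
A filled-in CPU submission could look like this (the resource values are illustrative and the grant account is a placeholder; adjust both to your allocation):

# example CPU submission; adjust resources and account to your grant
sbatch --cpus-per-task=48 --time=12:00:00 --nodes=1 --account=<grant account> \
  scripts/hpc/run_slurm.sh sequences,coordinates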
