FRIdata


Installation and activation

  1. Download the repo
git clone https://github.com/Tomasz-Lab/FRIdata.git
cd FRIdata
  2. Install miniconda

  3. Install mamba

## prioritize 'conda-forge' channel
conda config --add channels conda-forge

## update existing packages to use 'conda-forge' channel
conda update -n base --all

## install 'mamba'
conda install -n base mamba
  4. Create a mamba environment
mamba create -f toolbox_env_conda.yml
  5. Activate mamba shell hook
# Choose your shell type. Could be one of these: {bash,cmd.exe,dash,fish,nu,posix,powershell,tcsh,xonsh,zsh}
eval "$(mamba shell hook --shell <replace with shell type>)"
  6. Activate the mamba environment
mamba activate tbe
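
If activation worked, the environment's interpreter should be the one on PATH. A minimal sanity check (assuming the `tbe` environment name used above):

mamba env list    # the active environment (tbe) is marked with an asterisk
which python3     # should point into the tbe environment
python3 --version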

Running tests

pytest ./tests
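
During development it can be handy to run only part of the suite; pytest's standard selection flags work as usual (the keyword expression below is a placeholder, not a test name from this repo):

pytest ./tests -x -q                    # stop at the first failure, quieter output
pytest ./tests -k "<keyword expression>"  # run only tests matching the expression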

Running on AFDB structures locally

Requires a directory with AFDB structures and a text file containing a list of AFDB IDs, one ID per line (newline-delimited).
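
A minimal sketch of what the IDs file could look like (the file name and IDs below are only illustrative, not entries shipped with the repo):

# afdb_ids.txt - one ID per line, newline-delimited
A0A2K6V5L6
P69905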

#
# Assuming all steps from `Installation and activation` succeeded
#
FRIDATA_PATH="<repository path>"
AFDB_PATH="<AFDB structures directory path>"
IDS_PATH="<AFDB IDs file path>"

cd ${FRIDATA_PATH}

EMBEDDER_TYPE=esm2_t33_650M_UR50D

# (MACOS only) Fix for OpenMP multiple runtime error
export KMP_DUPLICATE_LIB_OK=TRUE

PYTHONPATH='.' python3 -u ${FRIDATA_PATH}/fridata.py \
 generate_data \
 -t sequences,coordinates,distograms,embeddings \
 -d AFDB \
 -c subset \
 --version test  \
 -i ${IDS_PATH} \
 --input-path ${AFDB_PATH} \
 -e ${EMBEDDER_TYPE}

For subset runs with --input-path, new datasets store canonical keys as {line_from_ids_file}_{chain} (for example A0A2K6V5L6_A), not the full AlphaFold CIF filename stem. The dataset’s input_structures.idx maps each canonical key to the source structure filename. Older datasets created before this convention may still use long AF-style keys.
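
As a small illustration of the key convention above (using the example ID from this section; chain `A` is just an example):

# canonical key = {line_from_ids_file}_{chain}
ID="A0A2K6V5L6"        # a line taken from the AFDB IDs file
CHAIN="A"
echo "${ID}_${CHAIN}"  # prints A0A2K6V5L6_A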

Running as a CLI tool

Assuming all Installation and activation steps succeeded.

  1. Go into FRIdata directory
cd <path to FRIdata>
  2. Install as a CLI tool
python3 -m pip install -e .
  3. Now FRIdata can be run as a CLI tool
$ fridata <...>
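
For example, the local AFDB run shown earlier could be expressed through the installed entry point like this (assuming the `fridata` entry point accepts the same arguments as `fridata.py` above; paths and the embedder type are the same placeholders and values as in that section):

fridata generate_data \
 -t sequences,coordinates,distograms,embeddings \
 -d AFDB \
 -c subset \
 --version test \
 -i <AFDB IDs file path> \
 --input-path <AFDB structures directory path> \
 -e esm2_t33_650M_UR50D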

Running on HPC

Running FRIdata on HPC differs between CPU and GPU nodes. These instructions are valid for HPC systems hosted in the PLGrid infrastructure; running on other infrastructures may require additional adjustments.

Prerequisites:
- An active computing grant valid on the HPC
- All mandatory ENV vars set (ideally in `.bashrc`; see the example snippet after this list):
    - `DEEPFRI_PATH`: should always refer to the parent directory of this repo
    - `IDS_PATH`: path to a text file listing AFDB IDs
    - `AFDB_PATH`: path to AFDB structures (can be an empty directory; structures will be fetched into it)
    - `DATA_PATH`: path to the parent directory of all generated output data
    - Optional ENV vars with default values:
        - `COMMON_SLURM_PATH`: path to common_slurm.sh, defaults to `$DEEPFRI_PATH/FRIdata/scripts/hpc/common_slurm.sh`
        - `LAUNCH_WORKER_SLURM_PATH`: path to launch_workers_slurm.sh, defaults to `$DEEPFRI_PATH/FRIdata/scripts/hpc/launch_workers_slurm.sh`
        - `MEMORY_LIMIT`: memory limit per Dask worker, defaults to `288GiB`
        - `IP_INTERFACE`: network interface that Dask workers bind to, defaults to `ens1f0`
        - `CONDA_ENV_PATH`: path to the conda environment, defaults to `$DEEPFRI_PATH/conda_dev`
- The `miniconda3` module installed
- The `gcc` module installed
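
A minimal `.bashrc` sketch covering the mandatory variables (all paths are placeholders to replace with your own; the optional variables are shown with their documented defaults):

# mandatory ENV vars for FRIdata on HPC (paths are placeholders)
export DEEPFRI_PATH="<parent directory of the FRIdata repo>"
export IDS_PATH="<path to text file listing AFDB IDs>"
export AFDB_PATH="<path to AFDB structures directory>"
export DATA_PATH="<parent directory for generated output data>"

# optional overrides (shown with their defaults)
export MEMORY_LIMIT="288GiB"
export IP_INTERFACE="ens1f0"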

Steps:

1. Download the repo

git clone https://github.com/Tomasz-Lab/FRIdata.git
cd FRIdata


2. Set execute permissions on the HPC scripts

chmod -R u+x scripts/hpc/cpu


3. Run `initialize_slurm.sh`. Pass as an argument the path to the directory where the `.conda` directory should be installed, and add the `--cpu` flag if the script is run on a CPU cluster.

./scripts/hpc/initialize_slurm.sh <path to .conda> [--cpu]
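
For example, on a CPU cluster with the `.conda` directory kept under `$DEEPFRI_PATH` (that location is only a suggestion, not a requirement):

./scripts/hpc/initialize_slurm.sh ${DEEPFRI_PATH} --cpu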


4. Submit the sbatch script to the HPC with all the args specified. Available operations are: `sequences`, `coordinates`, `embeddings`. A filled-in example is shown after the templates below.


For CPU:

sbatch --cpus-per-task=<CPUs per task> --time=HH:MM:SS --nodes=<node count> --account=<grant account> scripts/hpc/run_slurm.sh sequences,coordinates


For GPU:

sbatch --gres=gpu[:gpu-number] --time=HH:MM:SS --account=<grant account> --nodes=1 --partition=<GPU partition> --cpus-per-task=<CPUs per task> scripts/hpc/run_slurm.sh embeddings
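
A filled-in CPU submission could look like this (the resource values are illustrative and the grant account is a placeholder; adjust both to your allocation):

# example CPU submission; adjust resources and account to your grant
sbatch --cpus-per-task=48 --time=12:00:00 --nodes=1 --account=<grant account> \
  scripts/hpc/run_slurm.sh sequences,coordinates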
