Merged
21 changes: 13 additions & 8 deletions README.md
@@ -90,8 +90,6 @@ PYTHONPATH='.' python3 -u ${FRIDATA_PATH}/fridata.py \

Running FRIdata on HPC differs between CPU and GPU nodes. These instructions are valid for HPC hosted in the PLGrid infrastructure; running on other infrastructures may require additional adjustments.

### CPU

Prerequisites:
- Having an active grant valid on the HPC
- Having all mandatory ENV vars set (ideally in `.bashrc`):
@@ -100,8 +98,8 @@ Prerequisites:
- `AFDB_PATH`: path to AFDB structures (can be empty directory - structures will be fetched there)
- `DATA_PATH`: path to the parent directory of all generated output data
- Optional ENV vars with default values:
- `COMMON_SLURM_PATH`: path to common_slurn_cpu.sh, defaults to `$DEEPFRI_PATH/FRIdata/scripts/hpc/cpu/common_slurm_cpu.sh`
- `LAUNCH_WORKER_SLURM_PATH`: path to launch_worker_slurm_cpu.sh, defaults to `$DEEPFRI_PATH/FRIdata/scripts/hpc/cpu/launch_workers_slurm_cpu.sh`
- `COMMON_SLURM_PATH`: path to common_slurm.sh, defaults to `$DEEPFRI_PATH/FRIdata/scripts/hpc/common_slurm.sh`
- `LAUNCH_WORKER_SLURM_PATH`: path to launch_worker_slurm.sh, defaults to `$DEEPFRI_PATH/FRIdata/scripts/hpc/launch_workers_slurm.sh`
- `MEMORY_LIMIT`: memory limit per Dask worker, defaults to `288GiB`
- `IP_INTERFACE`: network interface that the Dask workers bind to; defaults to `ens1f0`
- `CONDA_ENV_PATH`: path to conda environment, defaults to `$DEEPFRI_PATH/conda_dev`
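A minimal sketch of these exports for `.bashrc` follows; every path below is a hypothetical example, not a required location:

```shell
# Mandatory ENV vars (example paths only -- point them at your own storage)
export DEEPFRI_PATH="$HOME/DeepFRI"
export AFDB_PATH="$HOME/afdb_structures"   # structures are fetched here if missing
export DATA_PATH="$HOME/fridata_output"    # parent directory for generated data

# Optional ENV vars, shown with their documented defaults
export MEMORY_LIMIT="288GiB"
export IP_INTERFACE="ens1f0"
export CONDA_ENV_PATH="$DEEPFRI_PATH/conda_dev"
```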
@@ -123,14 +121,21 @@ cd FRIdata
chmod -R u+x scripts/hpc
```

3. Run `initialize_slurm_cpu.sh`. As an argument put the path into directory, where `.conda` directory should be installed and specify `--cpu` flag
3. Run `initialize_slurm.sh`. Pass as an argument the path to the directory where the `.conda` directory should be installed, and add the `--cpu` flag if the script is run on a CPU cluster.

```
./scripts/hpc/cpu/initialize_slurm_cpu.sh <path to .conda> --cpu
./scripts/hpc/initialize_slurm.sh <path to .conda> [--cpu]
```
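For instance, on a PLGrid CPU cluster the call might look like this (the group-storage path is a hypothetical example):

```shell
./scripts/hpc/initialize_slurm.sh $PLG_GROUPS_STORAGE/plggexample --cpu
```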

4. Schedule SBatch script into the HPC with all the args specified
4. Submit the sbatch script to the HPC with all the arguments specified. Available operations are: `sequences`, `coordinates`, `embeddings`


For CPU:
```
sbatch --cpus-per-task=<cpus> --time=<HH:MM:SS> --nodes=<nodes> --account=<grant name> scripts/hpc/run_slurm.sh sequences,coordinates
```

For GPU:
```
sbatch --cpus-per-task=<cpus> --time=<HH:MM:SS> --nodes=<nodes> --account=<grant name> scripts/hpc/cpu/run_slurm_cpu.sh
sbatch --gres=gpu[:gpu-number] --time=<HH:MM:SS> --account=<grant name> --nodes=1 --partition=<partition name> --cpus-per-task=<cpus> scripts/hpc/run_slurm.sh embeddings
```
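As a concrete illustration (every value here is hypothetical; match them to your grant and node sizes), a CPU submission plus basic monitoring could look like:

```shell
# Hypothetical example: 3 nodes, 48 CPUs per task, 12 h wall time
sbatch --cpus-per-task=48 --time=12:00:00 --nodes=3 --account=plgexamplegrant-cpu \
    scripts/hpc/run_slurm.sh sequences,coordinates

# Inspect the queue; after the job finishes, check its resource usage
squeue -u $USER
sacct -j <jobid> --format=JobID,State,Elapsed,MaxRSS
```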
@@ -21,8 +21,39 @@ start_computation() {

cd $DEEPFRI_PATH

module load gcc
module load miniconda3
# Robustly try to load GCC and a Conda/Miniconda module (handle varied names)
LOADED_GCC=false
LOADED_CONDA=false
if command -v module >/dev/null 2>&1; then
GCC_CANDIDATES=(gcc GCC)
for MOD in "${GCC_CANDIDATES[@]}"; do
if module load "$MOD" >/dev/null 2>&1; then
echo "Loaded module: $MOD"
LOADED_GCC=true
break
fi
done

CONDA_CANDIDATES=(miniconda3 Miniconda3 miniconda Anaconda3 anaconda3)
for MOD in "${CONDA_CANDIDATES[@]}"; do
if module load "$MOD" >/dev/null 2>&1; then
echo "Loaded module: $MOD"
LOADED_CONDA=true
break
fi
done
fi

if [ "$LOADED_GCC" = false ]; then
echo "Error: Could not load a GCC module."
exit 1
fi

if [ "$LOADED_CONDA" = false ]; then
echo "Error: Could not load a Conda module."
exit 1
fi

eval "$(conda shell.bash hook)"
conda activate $CONDA_ENV_PATH

@@ -45,20 +76,20 @@ start_computation() {
done

if [[ ! -v LAUNCH_WORKER_SLURM_PATH ]]; then
LAUNCH_WORKER_SLURM_PATH="$DEEPFRI_PATH/FRIdata/scripts/hpc/cpu/launch_workers_slurm_cpu.sh"
LAUNCH_WORKER_SLURM_PATH="$DEEPFRI_PATH/FRIdata/scripts/hpc/launch_workers_slurm.sh"
fi
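The `[[ ! -v … ]]` guard above fires only when the variable is unset (bash 4.2+). A sketch of the equivalent one-line parameter-expansion idiom, using the same default path:

```shell
# ${VAR=default} (no colon) assigns only when VAR is unset, matching the
# [[ ! -v VAR ]] test; ${VAR:=default} would also overwrite empty values.
: "${LAUNCH_WORKER_SLURM_PATH=$DEEPFRI_PATH/FRIdata/scripts/hpc/launch_workers_slurm.sh}"
echo "$LAUNCH_WORKER_SLURM_PATH"
```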

chmod +x $LAUNCH_WORKER_SLURM_PATH/launch_workers_slurm_cpu.sh
chmod +x $LAUNCH_WORKER_SLURM_PATH

$LAUNCH_WORKER_SLURM_PATH/launch_workers_slurm_cpu.sh $SLURM_CPUS_PER_TASK ${nodes_array[0]} &
$LAUNCH_WORKER_SLURM_PATH $SLURM_CPUS_PER_TASK ${nodes_array[0]} &

echo "Head node workers"

worker_num=$((SLURM_JOB_NUM_NODES - 1))

for ((i = 1; i <= worker_num; i++)); do
node_i=${nodes_array[$i]}
srun -w "$node_i" -c $SLURM_CPUS_PER_TASK $LAUNCH_WORKER_SLURM_PATH/launch_workers_slurm_cpu.sh $SLURM_CPUS_PER_TASK $node_i &
srun -w "$node_i" -c $SLURM_CPUS_PER_TASK $LAUNCH_WORKER_SLURM_PATH $SLURM_CPUS_PER_TASK $node_i &
echo "$node_i started srun workers"
done

13 changes: 0 additions & 13 deletions scripts/hpc/cpu/run_slurm_cpu.sh

This file was deleted.

@@ -37,15 +37,33 @@ fi

CONDA_DIR="$GROUP_DIR/.conda"

module load miniconda3
# Try loading a Conda/Miniconda module in a robust way (handle varied names)
LOADED_MODULE=false
if command -v module >/dev/null 2>&1; then
MODULE_CANDIDATES=(miniconda3 Miniconda3 miniconda Anaconda3 anaconda3)
for MOD in "${MODULE_CANDIDATES[@]}"; do
if module load "$MOD" >/dev/null 2>&1; then
echo "Loaded module: $MOD"
LOADED_MODULE=true
break
fi
done
fi

if [ "$LOADED_MODULE" = false ]; then
echo "Error: Could not load a Conda module."
exit 1
fi

conda config --add pkgs_dirs "$CONDA_DIR"

# Create environment from base YAML (without PyTorch)
conda env create --prefix $CONDA_ENV_PATH --file "$DEEPFRI_PATH/FRIdata/toolbox_env_conda.yml"

conda config --set auto_activate_base false

source activate $CONDA_ENV_PATH
eval "$(conda shell.bash hook)"
conda activate $CONDA_ENV_PATH

# Install PyTorch based on mode
if [ "$CPU_ONLY" = true ]; then
13 changes: 13 additions & 0 deletions scripts/hpc/run_slurm.sh
@@ -0,0 +1,13 @@
#!/bin/bash

EMBEDDER_TYPE=esm2_t33_650M_UR50D

if [[ ! -v COMMON_SLURM_PATH ]]; then
COMMON_SLURM_PATH="$DEEPFRI_PATH/FRIdata/scripts/hpc/common_slurm.sh"
fi

source $COMMON_SLURM_PATH

PYTHON_COMMAND="PYTHONPATH='.' python3 -u ${DEEPFRI_PATH}/FRIdata/fridata.py input_generation -t $1 -d AFDB -c subset --overwrite --version 1_test_dask -i ${IDS_PATH} --input-path ${AFDB_PATH} -e ${EMBEDDER_TYPE} --slurm --verbose"

start_computation "$PYTHON_COMMAND"
6 changes: 5 additions & 1 deletion toolbox/worker_setup.py
Original file line number Diff line number Diff line change
@@ -10,6 +10,10 @@

dotenv.load_dotenv()
data_path = os.getenv("DATA_PATH")
data_path = pathlib.Path(data_path).parent / "fridata"

sys.path.append(str(data_path))

here = pathlib.Path(__file__).resolve()
repo_root = here.parents[1]
if str(repo_root) not in sys.path:
sys.path.insert(0, str(repo_root))