cvd-biopathnet is a local-plus-HPC workflow for converting a cardiovascular guideline knowledge graph into BioPathNet datasets, then running training, prediction, and path visualization either directly on a laptop or through Slurm on HPC.
The project is designed around the first prediction task:
- head type: Gene
- relation: ASSOCIATED_WITH_CONDITION
- tail type: Condition
All other relations are exported into train1.txt as the BioPathNet background regulatory graph (BRG). Training positives for the target task are exported into train2.txt, with deterministic valid.txt and test.txt splits.
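The deterministic split can be sketched as follows. This is a minimal illustration only, assuming a shuffle-then-slice approach keyed on the `--seed` value; the exact fractions and helper names used by the real exporter are not specified here.

```python
import random

def split_triples(triples, seed=42, valid_frac=0.1, test_frac=0.1):
    """Deterministically split target triples into train2/valid/test.

    Sketch only: the real exporter may use different fractions or logic.
    Sorting before shuffling makes the result independent of input order.
    """
    rng = random.Random(seed)
    shuffled = sorted(triples)
    rng.shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_frac)
    n_test = int(len(shuffled) * test_frac)
    valid = shuffled[:n_valid]
    test = shuffled[n_valid:n_valid + n_test]
    train = shuffled[n_valid + n_test:]
    return train, valid, test

triples = [(f"GENE:G{i}", "ASSOCIATED_WITH_CONDITION", f"CONDITION:C{i}")
           for i in range(100)]
train, valid, test = split_triples(triples, seed=42)
```

Running the splitter twice with the same seed yields identical splits, which is what makes valid.txt and test.txt reproducible across machines.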
- A format detector and raw graph loader that supports node-link JSON, GraphML, paired JSON / JSONL files, and paired CSV / TSV node+edge tables.
- A normalization pipeline for entity IDs, relation names, deduplication, and task extraction.
- BioPathNet dataset export with train1, train2, valid, test, test_vis, entity_types, entity_names, node_colors_dict, and metadata files.
- Validation tooling for duplicate checks, split overlap checks, exported entity map checks, and target-schema checks.
- Rendered BioPathNet config templates in configs/cvd_assoc_run.yaml and configs/cvd_assoc_vis.yaml.
- Local run scripts, Slurm batch scripts, laptop-to-HPC sync scripts, and a small prediction-inspection notebook.
- Vendored upstream BioPathNet code in third_party/BioPathNet.
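The format detector mentioned above can be sketched like this. The file-name patterns and the returned format labels are assumptions for illustration; the real detector in cvd_biopathnet may use different names and stricter checks.

```python
import json
from pathlib import Path

def detect_raw_format(input_dir):
    """Guess the raw-graph format from the files in input_dir.

    Illustrative sketch: checks for GraphML files, networkx node-link
    JSON (top-level "nodes" plus "links"/"edges"), and paired CSV tables.
    """
    p = Path(input_dir)
    if any(p.glob("*.graphml")):
        return "graphml"
    for f in p.glob("*.json"):
        data = json.loads(f.read_text())
        if isinstance(data, dict) and "nodes" in data and ("links" in data or "edges" in data):
            return "node-link-json"
    if any(p.glob("*nodes*.csv")) and any(p.glob("*edges*.csv")):
        return "csv-tables"
    raise ValueError(f"no supported graph format found in {input_dir}")
```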
Install the project package first so the CLI and helper modules are available:
python3 -m pip install -e .

Inspect the raw graph:
python3 -m cvd_biopathnet.cli inspect-raw --input data/raw/guidelines_graph

Convert the graph into a pilot BioPathNet dataset:
python3 -m cvd_biopathnet.cli convert \
--input data/raw/guidelines_graph \
--output data/processed/cvd_guidelines_assoc \
--reports-dir data/reports \
--target-relation ASSOCIATED_WITH_CONDITION \
--seed 42 \
--mode pilot \
--pilot-target-limit 256 \
--pilot-background-limit 2000

Validate the exported dataset:
python3 -m cvd_biopathnet.cli validate-dataset --dataset-dir data/processed/cvd_guidelines_assoc

Render the BioPathNet config templates:
python3 -m cvd_biopathnet.cli render-configs \
--dataset-dir data/processed/cvd_guidelines_assoc \
--output-dir configs

Run local training, prediction, and visualization:
bash scripts/run_local_train.sh
bash scripts/run_local_predict.sh
bash scripts/run_local_visualize.sh

The local run scripts default to CPU mode with --gpus null. Set BIOPATHNET_GPUS="[0]" in your environment if you want to target a local GPU.
Copy the committed environment template first:
cp .env.example .env

The project scripts load .env first and then configs/paths.env if it exists, so you can either keep everything in .env or use configs/paths.env for machine-specific overrides. A second template also exists at configs/paths.example.env.
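The layered precedence (.env first, then configs/paths.env overriding it) can be sketched in a few lines. This is an illustration of the documented merge order only; the real scripts source these files in bash, with full shell semantics.

```python
from pathlib import Path

def load_env_layers(*paths):
    """Merge simple KEY=VALUE env files; later files override earlier ones.

    Sketch of the documented precedence: missing files are skipped,
    comments and blank lines are ignored.
    """
    merged = {}
    for path in paths:
        p = Path(path)
        if not p.exists():
            continue
        for line in p.read_text().splitlines():
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            merged[key.strip()] = value.strip()
    return merged
```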
Recommended IU HPC variables:
PROJECT_ROOT_HPC=/N/u/kvand/BigRed200/HCG-Path-Prediction
RAW_GRAPH_DIR_HPC=/N/u/kvand/BigRed200/HCG-Path-Prediction/data/raw/guidelines_graph
DATASET_DIR_HPC=/N/u/kvand/BigRed200/HCG-Path-Prediction/data/processed/cvd_guidelines_assoc
OUTPUT_DIR_HPC=/N/u/kvand/BigRed200/HCG-Path-Prediction/outputs/cvd_assoc/checkpoints
PREDICTION_OUTPUT_DIR_HPC=/N/u/kvand/BigRed200/HCG-Path-Prediction/outputs/cvd_assoc/predictions
VIS_OUTPUT_DIR_HPC=/N/u/kvand/BigRed200/HCG-Path-Prediction/outputs/cvd_assoc/visualizations
LOG_DIR_HPC=/N/scratch/kvand/hbp/logs/datahub
ENV_MANAGER=venv
VENV_PATH=/N/u/kvand/BigRed200/HCG-Path-Prediction/.venv
BIOPATHNET_INSTALL_MODE=cpu
BIOPATHNET_GPUS=null
BIOPATHNET_BATCH_SIZE=4
BIOPATHNET_NUM_EPOCHS=5
BIOPATHNET_SEED=1024
BIOPATHNET_VIS_BATCH_SIZE=1
BIOPATHNET_VIS_TEST_LIMIT=3
BIOPATHNET_VISUALIZE_TEXT=0
BIOPATHNET_VISUALIZE_GRAPH=1
OMP_NUM_THREADS=1
MKL_NUM_THREADS=1
TORCH_EXTENSIONS_DIR=/N/scratch/kvand/hbp/torch_extensions
BIOPATHNET_CLEAR_TORCH_EXTENSIONS=0
SBATCH_ACCOUNT=r01806
SBATCH_CPUS_PER_TASK=4
SBATCH_MEM=48G
SBATCH_TIME=12:00:00
SBATCH_OUTPUT=/N/scratch/kvand/hbp/logs/datahub/cvd_train_%j.out
SBATCH_ERROR=/N/scratch/kvand/hbp/logs/datahub/cvd_train_%j.err

For GPU submission, opt in explicitly:
SBATCH_GRES=gpu:1
BIOPATHNET_GPUS='[0]'

Sync the repo to HPC:
bash scripts/sync_to_hpc.sh

On the cluster:
cd <PROJECT_ROOT_HPC>
bash scripts/setup_runtime_env.sh
bash scripts/run_train.sh

Submit the full dependent pipeline:
bash scripts/run_pipeline.sh

The canonical run_* wrappers default to sbatch on HPC and can pass through account, CPU, memory, time, output, error, partition, QOS, and GPU flags from .env or configs/paths.env. The Slurm job scripts now also honor ENV_MANAGER=venv with VENV_PATH, so they no longer assume Conda is installed on the compute nodes. The pipeline wrapper can optionally preprocess first, then submit train, predict, and visualization jobs using Slurm dependencies.
For CPU smoke runs, the runtime activation layer now also honors:
- OMP_NUM_THREADS
- MKL_NUM_THREADS
- TORCH_EXTENSIONS_DIR
- BIOPATHNET_CLEAR_TORCH_EXTENSIONS
Visualization is intentionally limited by default because BioPathNet path explanation uses beam search over the graph and can be much slower than prediction. The wrapper skips the text-only explanation pass unless BIOPATHNET_VISUALIZE_TEXT=1, runs the HTML/JSON graph pass by default with BIOPATHNET_VISUALIZE_GRAPH=1, uses BIOPATHNET_VIS_BATCH_SIZE=1, and limits test_vis.txt to the first BIOPATHNET_VIS_TEST_LIMIT=3 rows. Set BIOPATHNET_VIS_TEST_LIMIT=0 (or all) only if you really want every visualization query.
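The limit semantics above (a positive count truncates, 0 or all keeps everything) can be sketched as follows; this is an illustration of the documented behavior, not the wrapper's actual implementation.

```python
def limit_vis_rows(rows, limit):
    """Apply BIOPATHNET_VIS_TEST_LIMIT semantics to test_vis rows.

    Sketch only: 0 or "all" keeps every row; any other value is treated
    as a row count and truncates the list.
    """
    if str(limit) in ("0", "all"):
        return rows
    return rows[: int(limit)]
```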
- train1.txt: all non-target relations, used as the BRG / fact graph.
- train2.txt: training positives only for Gene --ASSOCIATED_WITH_CONDITION--> Condition.
- valid.txt: validation positives for the same target relation.
- test.txt: test positives for the same target relation.
- test_vis.txt: a small subset of test triplets used by the default predict / visualize workflow.
- entity_types.txt: entity ID to integer type ID.
- entity_names.txt: entity ID to readable display name.
- node_colors_dict.txt: integer type ID to visualization color, plus the query highlight color.
- metadata.json: source files, format, counts, split seed, target relation, and export metadata.
The exported entity IDs are namespaced and type-aware, for example GENE:SCN5A and CONDITION:c4a2be2e451c52d3.
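The second ID form above looks like a truncated hash. A hypothetical way to build such namespaced IDs is shown below; the hashing scheme, digest length, and the choice of which types get hashed are all assumptions for illustration, not the exporter's confirmed behavior.

```python
import hashlib

def namespaced_id(entity_type, name, hash_types=("CONDITION",)):
    """Build a type-namespaced entity ID.

    Hypothetical sketch: short identifiers (e.g. gene symbols) are kept
    verbatim, while long free-text labels are replaced by a 16-hex-char
    SHA-256 prefix so IDs stay short and deterministic.
    """
    if entity_type.upper() in hash_types:
        digest = hashlib.sha256(name.encode("utf-8")).hexdigest()[:16]
        return f"{entity_type.upper()}:{digest}"
    return f"{entity_type.upper()}:{name}"
```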
- --mode pilot keeps the workflow lightweight by sampling a deterministic subset of target edges and a focused background graph.
- --mode full exports all target positives and all non-target BRG edges.
Pilot mode is useful for laptop testing and for smoke-checking the HPC pipeline before a full run.
The provided networkx_graph.json was converted successfully with the pilot command above. The resulting validated pilot dataset contains:
- train1.txt: 2,000 BRG triplets
- train2.txt: 204 training target triplets
- valid.txt: 26 validation triplets
- test.txt: 26 test triplets
- test_vis.txt: 25 visualization triplets
Validation output is written to data/processed/cvd_guidelines_assoc/validation_summary.json, and graph / split summaries are written under data/reports.
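The split-level checks behind that summary can be sketched as below. This covers only the duplicate and overlap checks; the real validator also verifies entity maps and the target schema, and the key names here are illustrative.

```python
def validate_splits(train2, valid, test):
    """Count duplicates within splits and overlap across splits.

    Sketch of two of the documented checks; a clean dataset should
    produce all-zero counts.
    """
    issues = {}
    for name, rows in (("train2", train2), ("valid", valid), ("test", test)):
        issues[f"{name}_duplicates"] = len(rows) - len(set(rows))
    sets = {name: set(rows) for name, rows in
            (("train2", train2), ("valid", valid), ("test", test))}
    issues["train2_valid_overlap"] = len(sets["train2"] & sets["valid"])
    issues["train2_test_overlap"] = len(sets["train2"] & sets["test"])
    issues["valid_test_overlap"] = len(sets["valid"] & sets["test"])
    return issues
```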
Committed templates are provided in:
You can also render fresh templates programmatically:
python3 -m cvd_biopathnet.cli build-slurm --job train --output scripts/submit_train.sbatch
python3 -m cvd_biopathnet.cli build-slurm --job predict --output scripts/submit_predict.sbatch
python3 -m cvd_biopathnet.cli build-slurm --job visualize --output scripts/submit_visualize.sbatch

- ModuleNotFoundError: cvd_biopathnet: install the project with python3 -m pip install -e ., or run from the repo root after installation.
- No checkpoint found: run training first, or set CHECKPOINT_PATH before prediction / visualization.
- torchdrug / PyG install failures on HPC: use scripts/setup_runtime_env.sh and adjust HPC_MODULES, CUDA wheel URLs, or BIOPATHNET_INSTALL_MODE=cpu.
- Slurm account or partition errors: uncomment and edit the optional #SBATCH lines in the batch templates for your cluster.
- Training loads the dataset but stays at Epoch 0 begin: this is a known NBFNet symptom of a broken Torch JIT cache. Set BIOPATHNET_CLEAR_TORCH_EXTENSIONS=1, keep TORCH_EXTENSIONS_DIR on scratch if possible, and rerun a 1-epoch smoke test.
- Visualization runs for hours with only log.txt: cancel it, pull the latest repo, and rerun with the default graph-only limited settings. The text-only path pass can be much slower than the HTML/JSON graph output and is now opt-in via BIOPATHNET_VISUALIZE_TEXT=1.
- Raw-format detection errors: ensure the input directory contains a supported schema with node IDs, node types, edge sources, edge targets, and edge relations.
For HPC .venv setup, use:
ENV_MANAGER=venv BIOPATHNET_INSTALL_MODE=cpu bash scripts/setup_runtime_env.sh

Use BIOPATHNET_INSTALL_MODE=gpu for CUDA installs. This workflow also pins setuptools<82 because torch==2.0.1 still imports pkg_resources.
If a job appears stuck immediately after Epoch 0 begin, clear the Torch JIT cache before retrying:
rm -rf "${TORCH_EXTENSIONS_DIR:-$HOME/.cache/torch_extensions}"

Then submit a tiny smoke job with:
export BIOPATHNET_CLEAR_TORCH_EXTENSIONS=1
export BIOPATHNET_NUM_EPOCHS=1
export BIOPATHNET_BATCH_SIZE=1
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
bash scripts/run_train.sh

The canonical run wrappers are now:
bash scripts/run_train.sh
bash scripts/run_predict.sh
bash scripts/run_visualize.sh
bash scripts/run_pipeline.sh

When sbatch is available, these wrappers default to Slurm submission instead of running directly on the login node. To force the older direct-run path for smoke tests, use:
RUN_MODE=local bash scripts/run_train.sh

On IU HPC, set your RT project account before using the sbatch-first wrappers:
export SBATCH_ACCOUNT=r01806

You can also store that value in configs/paths.env.
The submitted Slurm templates now default to CPU-compatible runs. To match a CPU-style submission similar to your previous jobs, you can also set:
export SBATCH_CPUS_PER_TASK=4
export SBATCH_MEM=48G
export SBATCH_TIME=12:00:00
export BIOPATHNET_GPUS=null

If you want a GPU job instead, opt in explicitly with:
export SBATCH_GRES=gpu:1
export BIOPATHNET_GPUS='[0]'