diff --git a/.claude/commands/init-cluster.md b/.claude/commands/init-cluster.md index 2246ecb..0e2b5cf 100644 --- a/.claude/commands/init-cluster.md +++ b/.claude/commands/init-cluster.md @@ -44,19 +44,28 @@ which apptainer 2>/dev/null && apptainer --version which docker 2>/dev/null && docker --version 2>/dev/null docker info 2>/dev/null | grep -i "amd\|runtime" -# 8. Common scratch / project directories +# 8. Common scratch / project directories (shared storage) for d in /scratch/$USER /scratch /lustre /gpfs /tmp/$USER; do [[ -d "$d" ]] && echo "scratch: $d" done -# 9. Internet connectivity from this node +# 9. Node-local fast storage (for MIOpen cache, tmp writes during training) +for d in /scratch /local /nvme /tmp; do + [[ -d "$d" && -w "$d" ]] && echo "scratch_local: $d" && break +done + +# 10. RCCL / network interface discovery (for multi-node) +ip -o link show up 2>/dev/null | awk -F': ' '{print $2}' | grep -v lo +ibstat 2>/dev/null | grep "CA '" | awk -F"'" '{print $2}' + +# 11. Internet connectivity from this node curl -s --connect-timeout 5 -o /dev/null -w "%{http_code}" https://huggingface.co 2>/dev/null -# 10. Proxy settings +# 12. Proxy settings echo "HTTP_PROXY=${HTTP_PROXY:-}" echo "HTTPS_PROXY=${HTTPS_PROXY:-}" -# 11. Existing SIF files +# 13. Existing SIF files find "$HOME" /scratch /projects /opt -maxdepth 4 -name "*.sif" 2>/dev/null | head -20 ``` @@ -131,6 +140,7 @@ slurm: paths: home: "<$HOME value>" scratch: "" # also export as AI4S_SHARED_DIR + scratch_local: "" # for MIOpen cache, tmp writes (e.g. 
/scratch, /local, /tmp) projects: "/models" # per-model artifact root sif_cache: "/images" # shared base SIF files @@ -147,6 +157,10 @@ network: internet_access: proxy: "" +rccl: + socket_ifname: "" # discover: ip link show up + ib_hca: "" # discover: ibstat | grep CA + discovered_sifs: # SIF files found during init (informational — confirm they are under sif_cache) # - /images/pytorch_rocm7.2.2_ubuntu24.04_py3.12_pytorch_release_2.10.0.sif diff --git a/.cluster-config.example.yaml b/.cluster-config.example.yaml index 978c2c6..6068434 100644 --- a/.cluster-config.example.yaml +++ b/.cluster-config.example.yaml @@ -36,3 +36,7 @@ gpu: network: internet_access: true # can compute nodes reach the internet? proxy: "" # HTTP proxy if required (e.g. "http://proxy:8080") + mgmt_iface: "" # Management NIC for MPI/OOB (discover: ip -o link show up) + ib_hca: "" # IB HCA device list for RCCL (discover: ibstat | grep 'CA ') + rccl_anp_plugin: "" # Path to librccl-anp.so on host (default: /opt/rocm/lib/librccl-anp.so) + libionic_path: "" # Path to libionic.so.1 on host (default: /usr/lib/x86_64-linux-gnu/libionic.so.1) diff --git a/.cursor/skills/ai4science-material-science/SKILL.md b/.cursor/skills/ai4science-material-science/SKILL.md index fbca556..2648676 100755 --- a/.cursor/skills/ai4science-material-science/SKILL.md +++ b/.cursor/skills/ai4science-material-science/SKILL.md @@ -21,8 +21,29 @@ The `material_science/` domain covers **materials**, **chemistry**, and related - Keep recipes **explicit** about input representations (graphs, crystals, SMILES, etc.) and any **unit conventions** (including **energy** vs **force** targets and graph vs node outputs where relevant). - Prefer citing **benchmarks** and **baseline** numbers from literature or model cards when adding evaluation snippets. - Do not commit large **proprietary structure databases**; link to public sources or describe how users supply their own data. 
-- **Institutional AMD** clusters and **staging** large artifacts (**Hugging Face CLI**, **Globus** / Constellation mirrors, OLCF copy-out when applicable) belong in `data-access.md`–style runbooks for **ADIOS** or similar scientific I/O when models use them. +- **Institutional AMD** clusters and **staging** large artifacts (**Hugging Face CLI**, **Globus** / Constellation mirrors, OLCF copy-out when applicable) belong in `data-access.md`-style runbooks for **ADIOS** or similar scientific I/O when models use them. + +## HydraGNN multi-node training pattern + +HydraGNN uses MPI-based distributed training with ADIOS datasets. Key patterns for AMD clusters: + +- **Launch:** `srun --mpi=pmix apptainer exec --rocm --overlay :ro $SIF bash `. The `--mpi=pmix` is required because `mpi4py` calls `MPI_Init` inside the container. +- **Non-DDStore path (phase 1):** Use `--multi --multi_model_list=` without `--ddstore`. Each rank opens ADIOS files directly via `AdiosMultiDataset`. Simpler and sufficient for moderate scales (1-8 nodes). +- **DDStore path (phase 2):** Add `--ddstore` flag plus env vars `HYDRAGNN_AGGR_BACKEND=mpi`, `HYDRAGNN_DDSTORE_METHOD=1`, `HYDRAGNN_CUSTOM_DATALOADER=1`. Required for large-scale (32+ nodes) where per-rank ADIOS I/O becomes a bottleneck. +- **Dataset convention:** ADIOS datasets are expected at `./dataset/-v2.bp` relative to the training script. Symlink from shared storage rather than copying. +- **Config file:** The upstream `gfm_mlip.json` is hardware-agnostic and works on any cluster. Use CLI flags (`--batch_size`, `--num_epoch`, `--precision`) to override parameters at runtime without modifying the file. +- **Env vars inside container:** Set `OMP_NUM_THREADS` to match `--cpus-per-task`, `MIOPEN_DISABLE_CACHE=1`, `MIOPEN_USER_DB_PATH=$SCRATCH_LOCAL//miopen` (node-local fast storage from `.cluster-config.yaml`), `HYDRAGNN_USE_VARIABLE_GRAPH_SIZE=1`. 
+- **Multi-node MPI transport (ob1/tcp):** On Pensando/ionic fabrics, data NICs use `/31` subnets that don't route between nodes — IB verbs cannot work for MPI. Use `OMPI_MCA_pml=ob1`, `OMPI_MCA_btl=tcp,self`, `OMPI_MCA_btl_tcp_if_include=$MGMT_NIC`, `MPI4PY_RC_THREADS=false`. RCCL uses ANP plugin (`librccl-anp.so`) over ionic native transport (RoCEv2/GDRDMA) for GPU allreduce, independent of MPI. The ANP plugin and `libionic.so.1` must be bind-mounted from the host into the container. +- **Convergence capture:** Run with `HYDRAGNN_VALTEST=1` to get epoch-level loss reporting. Parse with `examples/parse_convergence.py --log `. + +## HydraGNN inference pattern + +GFM 2024 checkpoints use an older model architecture (branch `Predictive_GFM_2024`) that is structurally incompatible with the training-pinned SHA. The inference script auto-clones the correct branch and uses `sys.path.insert(0, ...)` to load matching code. Key points: + +- **Code version mismatch:** Never load old GFM checkpoints with new HydraGNN code — state_dict keys differ (e.g. `heads_NN.0.0.weight` vs `heads_NN.0.energy.0.weight`). +- **Config format:** Old configs use `model_type`; new code expects `mpnn_type`. Old `output_heads` is a flat dict; new expects list-of-dicts. +- **Inference repo:** Set `HG_INFER_REPO` or let `run_inference.sh` clone `Predictive_GFM_2024` to `$HG_OUTPUT_DIR/HydraGNN-infer` on first run. ## Safety and licensing -- Chemistry and materials models may have **use-based** restrictions; mirror the model card’s limitations in the local `README.md`. +- Chemistry and materials models may have **use-based** restrictions; mirror the model card's limitations in the local `README.md`. 
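Reviewer sketch for the "HydraGNN inference pattern" notes in the SKILL.md hunk above: a minimal pre-flight check that flags an old-format GFM checkpoint or config before it is loaded with new HydraGNN code. The key shapes are taken only from the two example keys quoted there (`heads_NN.0.0.weight` vs `heads_NN.0.energy.0.weight`). The function names are hypothetical and not part of HydraGNN, and the caller is assumed to pass whichever config mapping actually holds `model_type`/`mpnn_type` and `output_heads`.

```python
import re

# Key shapes are inferred from the two example keys quoted in the notes above;
# any other layout is reported as "unknown" rather than guessed.
_OLD_HEAD = re.compile(r"(?:^|\.)heads_NN\.\d+\.\d+\.")               # heads_NN.0.0.weight
_NEW_HEAD = re.compile(r"(?:^|\.)heads_NN\.\d+\.[A-Za-z_]\w*\.\d+\.")  # heads_NN.0.energy.0.weight


def classify_checkpoint_keys(keys):
    """Return 'old', 'new', 'mixed', or 'unknown' for an iterable of state_dict key names."""
    keys = list(keys)  # allow generators; we scan the names twice
    has_old = any(_OLD_HEAD.search(k) for k in keys)
    has_new = any(_NEW_HEAD.search(k) for k in keys)
    if has_old and has_new:
        return "mixed"
    if has_old:
        return "old"
    if has_new:
        return "new"
    return "unknown"


def config_format_issues(arch):
    """List old-format markers found in a config mapping (hypothetical helper).

    `arch` is assumed to be the mapping that directly contains the
    model_type/mpnn_type and output_heads entries.
    """
    issues = []
    if "model_type" in arch and "mpnn_type" not in arch:
        issues.append("uses model_type; new code expects mpnn_type")
    if isinstance(arch.get("output_heads"), dict):
        issues.append("output_heads is a flat dict; new code expects a list of dicts")
    return issues


if __name__ == "__main__":
    print(classify_checkpoint_keys(["heads_NN.0.0.weight", "heads_NN.0.0.bias"]))
    print(config_format_issues({"model_type": "EGNN", "output_heads": {"graph": {}}}))
```

In practice the key names would come from `state_dict.keys()` of the loaded checkpoint. A result of `"old"` suggests routing the checkpoint through the `Predictive_GFM_2024` inference code path instead of the training-pinned SHA.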
diff --git a/.cursor/skills/ai4science-studio/SKILL.md b/.cursor/skills/ai4science-studio/SKILL.md index 78c986b..7355375 100755 --- a/.cursor/skills/ai4science-studio/SKILL.md +++ b/.cursor/skills/ai4science-studio/SKILL.md @@ -315,6 +315,40 @@ srun --mpi=pmix apptainer exec \ **Multi-node scaling:** Change `--nodes=N --ntasks=N*` in the SBATCH header; the `srun` command is unchanged. +## GPU visibility: `--gpus-per-node` NOT `--gpus-per-task` (MI355X / RCCL) + +**Use `--gpus-per-node=8`, NOT `--gpus-per-task=1`** for multi-GPU distributed training with RCCL/NCCL backend on AMD Instinct MI355X. + +**Why:** With `--gpus-per-task=1`, SLURM's cgroup restricts each rank to seeing only 1 GPU (`torch.cuda.device_count()=1`). RCCL still tries to read the full KFD topology (`/sys/class/kfd/kfd/topology/nodes/*/io_links/*/properties`) to discover XGMI links, but cannot reconcile 8-GPU topology info with 1-GPU access. This produces: +``` +NCCL WARN Could not read node # N +ncclUnhandledCudaError: Call to CUDA function failed. +``` + +**Fix:** Use `--gpus-per-node=8` (all GPUs visible to all ranks). Frameworks like HydraGNN, DeepSpeed, and plain PyTorch DDP use `SLURM_LOCALID` to select the correct device per rank when `device_count() > 1`: +```python +local_rank = int(os.environ["SLURM_LOCALID"]) +torch.cuda.set_device(local_rank) # rank 0→GPU 0, rank 1→GPU 1, ... +``` + +**MI355X RCCL env vars (from AMD documentation):** +- Single-node: only `HSA_NO_SCRATCH_RECLAIM=1` needed +- Multi-node: full set (`NCCL_NET_PLUGIN`, `NCCL_IB_HCA`, `NCCL_IB_GID_INDEX`, etc.) — see `sbatch_train_amd.sh` in HydraGNN examples + +**Multi-node MPI transport (ob1/tcp — correct for Pensando/ionic fabric):** +- Pensando ionic data NICs use `/31` point-to-point subnets; nodes cannot route to each other on data interfaces. Standard IB verbs do NOT work for MPI inter-node traffic. 
+- MPI uses ob1 PML with TCP BTL over the management NIC: `OMPI_MCA_pml=ob1`, `OMPI_MCA_btl=tcp,self`, `OMPI_MCA_btl_tcp_if_include=`, `MPI4PY_RC_THREADS=false`. +- RCCL uses ANP plugin (`librccl-anp.so`) + `libionic.so.1` (bind-mounted from host) over ionic native transport (RoCEv2/GDRDMA) for GPU allreduce — independent of MPI/UCX. +- Without the ANP bind-mounts, RCCL falls back to socket transport over the management NIC (works but much slower). + +## PMIx shared memory fix for Apptainer + +When using `srun --mpi=pmix` with Apptainer, ranks may segfault with `PMIx ERROR: UNREACHABLE` if `/dev/shm` or `/tmp` is shared from the host. Fix: +```bash +--env PMIX_MCA_gds=hash +--env PMIX_MCA_psec=native +``` + --- # Docker-specific patterns diff --git a/material_science/models/HydraGNN/examples/build_overlay_amd.sh b/material_science/models/HydraGNN/examples/build_overlay_amd.sh index 1fac739..d005b34 100755 --- a/material_science/models/HydraGNN/examples/build_overlay_amd.sh +++ b/material_science/models/HydraGNN/examples/build_overlay_amd.sh @@ -4,13 +4,16 @@ # skip the pip install phase on every submission. # # ── Quick start ────────────────────────────────────────────────────────────── -# export HG_SIF=${AI4S_SHARED_DIR:-/your/shared/dir}/images/pytorch_rocm7.2.2_ubuntu24.04_py3.12_pytorch_release_2.10.0.sif -# sbatch build_overlay_amd.sh -# # then in your inference job: -# export HG_OVERLAY=/path/to/hydragnn-overlay.img -# sbatch sbatch_infer_amd.sh +# # From an interactive compute node (has /opt/ompi with mpicc): +# export AI4S_SHARED_DIR=/path/to/shared # set via /init-cluster +# bash build_overlay_amd.sh # -# ── torch-scatter / torch-sparse / torch-cluster / torch-spline-conv ───────── +# ── MPI-enabled adios2 ────────────────────────────────────────────────────── +# The compute nodes have /opt/ompi (OpenMPI 5.x with mpicc). We bind-mount +# it into the container during the build so pip can compile adios2 from +# source with ADIOS2_USE_MPI=ON. 
This is required for parallel dataset I/O. +# +# ── torch-scatter / torch-sparse / torch-cluster / torch-spline-conv ──────── # No official ROCm wheels exist at data.pyg.org. We use pre-built ROCm # wheels from https://github.com/Looong01/pyg-rocm-build (release 15): # - PyTorch 2.10.x + ROCm 7.1 wheels run fine on ROCm 7.2.x (ABI compat) @@ -19,187 +22,224 @@ # # ── Why staging is needed ──────────────────────────────────────────────────── # pip install --target does not see the SIF's /opt/venv, so torch gets -# re-downloaded (~6 GB) and fills the overlay. We stage into a writable -# scratch dir, strip torch/nvidia/triton, then copy ~1-2 GB into the overlay. -# -# ── Why the overlay is built on $TMPDIR ────────────────────────────────────── -# The overlay image is an ext3 filesystem served via FUSE. Writing thousands -# of small Python package files through FUSE into an image that lives on NFS -# is extremely slow (ext3-on-NFS via FUSE). We instead: -# 1. Create and populate the ext3 image on $TMPDIR (node-local fast disk). -# 2. Copy the finished image file (one large sequential write) to the NFS -# destination ($HG_OVERLAY). -# This makes the file-intensive install step fast and leaves a single large -# file on NFS. +# re-downloaded (~6 GB) and fills the overlay. We stage under $SCRATCH_LOCAL +# (node-local disk, default /scratch), strip torch/nvidia/triton, then copy into the overlay. 
# -# ── Key environment variables ───────────────────────────────────────────────── -# HG_SIF Path to the Apptainer SIF image (required) +# ── Key environment variables ──────────────────────────────────────────────── +# AI4S_SHARED_DIR Shared filesystem root (required) +# HG_SIF Path to the Apptainer SIF image # HG_OVERLAY Output overlay path -# (default: /hydragnn-overlay.img) -# HG_STAGE_DIR Writable scratch dir for staging install (~6 GB) -# (default: /hydragnn-stage) # HG_OVERLAY_SIZE_MB Size of the ext3 overlay in MB (default: 4096) +# HG_HYDRAGNN_SHA Git SHA to pin HydraGNN at (default: current main) # PYG_ROCM_RELEASE Looong01 release tag to use (default: 15) # PYG_PYTHON_TAG Python tag in zip filename (default: py312) -#SBATCH --job-name=hydragnn-overlay -#SBATCH --partition=YOUR_PARTITION_HERE -#SBATCH --account=YOUR_ACCOUNT_HERE -#SBATCH --nodes=1 -#SBATCH --ntasks=1 -#SBATCH --gres=gpu:1 -#SBATCH --cpus-per-task=8 -#SBATCH --time=03:00:00 -#SBATCH --output=hydragnn-overlay-build-%j.out -#SBATCH --error=hydragnn-overlay-build-%j.out - set -euo pipefail -HG_BASE="${AI4S_SHARED_DIR:-/your/shared/dir}/models/HydraGNN" -HG_SIF="${HG_SIF:-${HG_BASE}/images/pytorch_rocm7.2.2_ubuntu24.04_py3.12_pytorch_release_2.10.0.sif}" +HG_BASE="${AI4S_SHARED_DIR:?AI4S_SHARED_DIR must be set}/models/HydraGNN" +HG_SIF="${HG_SIF:-${AI4S_SHARED_DIR}/images/pytorch_rocm7.2.2_ubuntu24.04_py3.12_pytorch_release_2.10.0.sif}" OVERLAY="${HG_OVERLAY:-${HG_BASE}/overlays/hydragnn-overlay.img}" -STAGE_DIR="${HG_STAGE_DIR:-${HG_BASE}/stage/hydragnn-stage}" OVERLAY_SIZE_MB="${HG_OVERLAY_SIZE_MB:-4096}" +HG_HYDRAGNN_SHA="${HG_HYDRAGNN_SHA:-6c45f1682783e66dc89e9e23009f61716186432b}" PYG_ROCM_RELEASE="${PYG_ROCM_RELEASE:-15}" PYG_PYTHON_TAG="${PYG_PYTHON_TAG:-py312}" PYG_ZIP_NAME="torch-2.10.0-rocm-7.1-${PYG_PYTHON_TAG}-linux_x86_64.zip" PYG_ZIP_URL="https://github.com/Looong01/pyg-rocm-build/releases/download/${PYG_ROCM_RELEASE}/${PYG_ZIP_NAME}" 
-PYG_WHEELS_DIR="${STAGE_DIR}/pyg-rocm-wheels" -# Build the overlay on local disk to avoid ext3-on-NFS small-file I/O penalty; -# copy the finished image file to NFS destination as one sequential write. -LOCAL_OVERLAY="${TMPDIR:-/tmp}/hydragnn-overlay-${SLURM_JOB_ID:-$$}.img" +# Use node-local fast storage for build I/O (SCRATCH_LOCAL from .cluster-config.yaml) +LOCAL_TMP="${SCRATCH_LOCAL:-/scratch}/${USER:?}/hydragnn-overlay-build-$$" +LOCAL_OVERLAY="${LOCAL_TMP}/hydragnn-overlay.img" +STAGE="${LOCAL_TMP}/stage" +PYG_WHEELS_DIR="${LOCAL_TMP}/pyg-rocm-wheels" echo "=== HydraGNN overlay build ===" echo " SIF : $HG_SIF" -echo " Overlay (NFS): $OVERLAY (${OVERLAY_SIZE_MB} MB)" -echo " Build on : $LOCAL_OVERLAY (node-local $TMPDIR)" -echo " Stage dir : $STAGE_DIR" +echo " Overlay dest : $OVERLAY (${OVERLAY_SIZE_MB} MB)" +echo " Build tmp : $LOCAL_TMP" +echo " HydraGNN SHA : $HG_HYDRAGNN_SHA" echo " PyG wheels : Looong01 release ${PYG_ROCM_RELEASE} (torch 2.10, ROCm 7.1, ${PYG_PYTHON_TAG})" echo " Node : $(hostname) Date: $(date)" echo "" if [[ -f "$OVERLAY" ]]; then - echo "Overlay already exists: $OVERLAY" - echo "Delete it to rebuild: rm $OVERLAY" - exit 0 + echo "WARNING: Overlay already exists: $OVERLAY" + echo " Delete it to rebuild: rm $OVERLAY" + echo " Proceeding with rebuild (will overwrite)..." + rm -f "$OVERLAY" fi -mkdir -p "$STAGE_DIR" "$PYG_WHEELS_DIR" - # --------------------------------------------------------------------------- -# Download and extract Looong01 PyG ROCm wheels (on the host, before container) +# Ensure MPI is available on this compute node # --------------------------------------------------------------------------- -PYG_ZIP="${STAGE_DIR}/${PYG_ZIP_NAME}" -if [[ ! -f "$PYG_ZIP" ]]; then - echo "--- Downloading PyG ROCm wheels from Looong01 release ${PYG_ROCM_RELEASE} ---" - curl -fL "$PYG_ZIP_URL" -o "$PYG_ZIP" -else - echo "--- PyG ROCm zip already downloaded: $PYG_ZIP ---" +OMPI_DIR="/opt/ompi" +if [[ ! 
-x "${OMPI_DIR}/bin/mpicc" ]]; then + echo "ERROR: /opt/ompi/bin/mpicc not found. Run this script on a compute node." >&2 + exit 1 fi +echo " MPI: ${OMPI_DIR}/bin/mpicc ($(${OMPI_DIR}/bin/ompi_info --version 2>&1 | head -1))" +echo "" -echo "--- Extracting wheels ---" +mkdir -p "$LOCAL_TMP" "$STAGE" "$PYG_WHEELS_DIR" + +# --------------------------------------------------------------------------- +# Download and extract Looong01 PyG ROCm wheels +# --------------------------------------------------------------------------- +PYG_ZIP="${LOCAL_TMP}/${PYG_ZIP_NAME}" +echo "--- Downloading PyG ROCm wheels ---" +curl -fL "$PYG_ZIP_URL" -o "$PYG_ZIP" unzip -q -o "$PYG_ZIP" -d "$PYG_WHEELS_DIR" -echo "Wheels extracted:" -ls "$PYG_WHEELS_DIR"/*.whl 2>/dev/null || ls "$PYG_WHEELS_DIR" +echo " Wheels: $(ls "$PYG_WHEELS_DIR"/*.whl 2>/dev/null | wc -l) files" # --------------------------------------------------------------------------- # Write the install script that runs inside the container # --------------------------------------------------------------------------- -SCRIPTS_DIR="${TMPDIR:-/tmp}/hydragnn-overlay-scripts-${SLURM_JOB_ID:-$$}" -mkdir -p "$SCRIPTS_DIR" - -cat > "$SCRIPTS_DIR/overlay_install.sh" << INNEREOF +cat > "${LOCAL_TMP}/overlay_install.sh" << 'INNEREOF' #!/usr/bin/env bash set -euo pipefail source /opt/venv/bin/activate PKG=/opt/hydragnn-pkgs -STAGE="${STAGE_DIR}/pkgs" -PYG_WHEELS="${PYG_WHEELS_DIR}" +STAGE="$BUILD_STAGE" +PYG_WHEELS="$BUILD_PYG_WHEELS" +OMPI="/opt/ompi" -rm -rf "\$STAGE" && mkdir -p "\$STAGE" -echo "--- torch: \$(python3 -c 'import torch; print(torch.__version__)') ---" -echo "--- Installing to staging dir: \$STAGE ---" +rm -rf "$STAGE" && mkdir -p "$STAGE" -echo "--- Step 1: PyG ROCm wheels (torch-scatter, torch-sparse, torch-cluster, etc.) 
---" -# --no-deps: these wheels depend on torch, which is already in the SIF venv -pip install -q --no-cache-dir --no-deps --target "\$STAGE" "\$PYG_WHEELS"/*.whl -echo " installed: \$(ls \$STAGE | grep torch | tr '\n' ' ')" +echo "--- torch version: $(python3 -c 'import torch; print(torch.__version__)') ---" +echo "--- Installing to staging dir: $STAGE ---" + +echo "--- Step 1: PyG ROCm wheels (torch-scatter, torch-sparse, torch-cluster) ---" +pip install -q --no-cache-dir --no-deps --target "$STAGE" "$PYG_WHEELS"/*.whl +echo " installed: $(ls $STAGE | grep torch | tr '\n' ' ')" echo "--- Step 2: torch-geometric (pure Python) ---" -pip install -q --no-cache-dir --target "\$STAGE" torch-geometric 2>&1 | tail -3 +pip install -q --no-cache-dir --target "$STAGE" torch-geometric 2>&1 | tail -3 echo "--- Step 3: HydraGNN runtime deps ---" -pip install -q --no-cache-dir --target "\$STAGE" \ +pip install -q --no-cache-dir --target "$STAGE" \ mpi4py tqdm pyyaml tensorboard matplotlib \ - scikit-learn scipy h5py ase 2>&1 | tail -5 - -echo "--- Step 4: HydraGNN package (clone + install, no deps) ---" + scikit-learn scipy h5py ase vesin e3nn 2>&1 | tail -5 + +echo "--- Step 4a: install build deps for adios2 (cmake, nanobind, mpi4py) ---" +export PATH="${OMPI}/bin:${PATH}" +export LD_LIBRARY_PATH="${OMPI}/lib:${LD_LIBRARY_PATH:-}" +echo " mpicc: $(which mpicc) ($(mpicc --showme:version 2>&1 | head -1))" +pip install -q --no-cache-dir cmake nanobind mpi4py 2>&1 | tail -3 + +echo "--- Step 4b: adios2 with MPI (cmake from source) ---" +pip download --no-deps --no-binary adios2 -d /tmp/adios2-dl adios2 2>&1 | tail -2 +cd /tmp/adios2-dl && tar xf adios2*.tar.gz +ADIOS2_SRC=$(ls -d /tmp/adios2-dl/adios2-*/) +ADIOS2_BUILD=/tmp/adios2-build +ADIOS2_INST=/tmp/adios2-install +mkdir -p "$ADIOS2_BUILD" +cmake "$ADIOS2_SRC" -B "$ADIOS2_BUILD" \ + -DCMAKE_INSTALL_PREFIX="$ADIOS2_INST" \ + -DCMAKE_C_COMPILER="${OMPI}/bin/mpicc" \ + -DCMAKE_CXX_COMPILER="${OMPI}/bin/mpicxx" \ + 
-DADIOS2_USE_MPI=ON \ + -DADIOS2_USE_Python=ON \ + -DADIOS2_USE_SST=OFF -DADIOS2_USE_OpenSSL=OFF -DADIOS2_USE_CURL=OFF \ + -DADIOS2_USE_Sodium=OFF -DADIOS2_USE_DAOS=OFF -DADIOS2_USE_IME=OFF \ + -DADIOS2_USE_MGARD=OFF -DADIOS2_USE_Fortran=OFF -DADIOS2_USE_Table=OFF \ + -DADIOS2_USE_Profiling=OFF -DBUILD_TESTING=OFF \ + 2>&1 | grep -E "MPI|Python|nanobind|Configuring done" +cmake --build "$ADIOS2_BUILD" -j$(nproc) 2>&1 | tail -3 +cmake --install "$ADIOS2_BUILD" 2>&1 | tail -3 +# Copy Python package + libs to staging +cp -r "$ADIOS2_INST"/lib/python3.12/site-packages/adios2 "$STAGE/adios2" +cp "$ADIOS2_INST"/lib/libadios2*.so* "$STAGE/adios2/" 2>/dev/null || true +echo " adios2 installed (MPI + Python bindings)" +cd / + +echo "--- Step 5: HydraGNN source (main branch) ---" if [[ ! -d /tmp/HydraGNN-src ]]; then - git clone -q --depth=1 --branch Predictive_GFM_2024 \ - https://github.com/ORNL/HydraGNN.git /tmp/HydraGNN-src + git clone -q --depth=1 https://github.com/ORNL/HydraGNN.git /tmp/HydraGNN-src + cd /tmp/HydraGNN-src && git fetch --depth=1 origin "$BUILD_SHA" && git checkout "$BUILD_SHA" + cd / fi -pip install -q --no-cache-dir --target "\$STAGE" --no-deps /tmp/HydraGNN-src 2>&1 | tail -3 +# Copy the source tree directly rather than pip-installing (avoids partial package issues) +cp -r /tmp/HydraGNN-src/hydragnn "$STAGE/hydragnn" +echo " HydraGNN source copied (SHA: $BUILD_SHA)" -echo "--- Step 5: strip torch / nvidia / triton from staging dir ---" -for pkg in torch torchvision torchaudio torchgen functorch nvidia triton; do - rm -rf "\${STAGE}/\${pkg}" "\${STAGE}/\${pkg}"-*.dist-info 2>/dev/null || true +echo "--- Step 6: strip torch / nvidia / triton / cuda from staging dir ---" +for pkg in torch torchvision torchaudio torchgen functorch nvidia triton cuda sympy; do + rm -rf "${STAGE}/${pkg}" "${STAGE}/${pkg}"-*.dist-info 2>/dev/null || true done -echo " Staging size after strip: \$(du -sh \$STAGE | cut -f1)" - -echo "--- Step 6: copy stripped packages → 
overlay (local disk, fast) ---" -mkdir -p "\$PKG" -cp -r "\$STAGE/." "\$PKG/" -# Final safety strip -for pkg in torch torchvision torchaudio torchgen functorch nvidia triton; do - rm -rf "\${PKG}/\${pkg}" "\${PKG}/\${pkg}"-*.dist-info 2>/dev/null || true +rm -rf "${STAGE}"/nvidia_*.dist-info "${STAGE}"/cuda_*.dist-info 2>/dev/null || true +echo " Staging size after strip: $(du -sh $STAGE | cut -f1)" + +echo "--- Step 7: copy stripped packages → overlay ---" +mkdir -p "$PKG" +cp -r "$STAGE/." "$PKG/" +for pkg in torch torchvision torchaudio torchgen functorch nvidia triton cuda sympy; do + rm -rf "${PKG}/${pkg}" "${PKG}/${pkg}"-*.dist-info 2>/dev/null || true done -echo " Overlay size: \$(du -sh \$PKG | cut -f1)" - -echo "--- Step 7: verify ---" -PYTHONPATH="\$PKG" python3 -c " -import torch, torch_scatter, torch_geometric, hydragnn +rm -rf "${PKG}"/nvidia_*.dist-info "${PKG}"/cuda_*.dist-info 2>/dev/null || true +# Keep adios2 .so libs accessible +echo " Overlay contents: $(du -sh $PKG | cut -f1)" +echo " adios2 libs: $(ls $PKG/adios2/libadios2_core_mpi.so 2>/dev/null && echo 'present' || echo 'MISSING')" + +echo "--- Step 8: verify ---" +export PYTHONPATH="$PKG" +export LD_LIBRARY_PATH="$PKG/adios2:${LD_LIBRARY_PATH:-}" +python3 -c " +import torch, torch_scatter, torch_geometric print('torch :', torch.__version__) print('torch_scatter : ok') print('torch_geometric:', torch_geometric.__version__) -print('hydragnn : ok') assert torch.cuda.is_available() or 'rocm' in torch.__version__, 'ROCm torch expected' -print('Verification passed.') +print('ROCm torch : ok') +" + +python3 -c " +import adios2 +from mpi4py import MPI +a = adios2.Adios(MPI.COMM_SELF) +print('adios2 :', adios2.__version__, '(MPI: OK)') " + +python3 -c "import vesin; print('vesin : ok')" +python3 -c "import e3nn; print('e3nn :', e3nn.__version__)" +python3 -c "import hydragnn; print('hydragnn : ok')" + echo "=== Overlay install complete ===" INNEREOF -chmod +x "$SCRIPTS_DIR/overlay_install.sh" +chmod 
+x "${LOCAL_TMP}/overlay_install.sh" # --------------------------------------------------------------------------- -# Create overlay on local disk, run install, then copy image to NFS +# Create overlay on local disk, run install, then copy to NFS # --------------------------------------------------------------------------- -echo "Creating ${OVERLAY_SIZE_MB} MB ext3 overlay on local disk: $LOCAL_OVERLAY ..." +echo "Creating ${OVERLAY_SIZE_MB} MB ext3 overlay: $LOCAL_OVERLAY ..." apptainer overlay create --size "$OVERLAY_SIZE_MB" "$LOCAL_OVERLAY" -echo "Running install inside container ..." +echo "Running install inside container (binding /opt/ompi for MPI) ..." apptainer exec \ --rocm \ --overlay "${LOCAL_OVERLAY}:rw" \ - --bind "$SCRIPTS_DIR":/scripts \ - --bind "${STAGE_DIR}:${STAGE_DIR}" \ - --bind "${PYG_WHEELS_DIR}:${PYG_WHEELS_DIR}" \ + --bind "${OMPI_DIR}:${OMPI_DIR}:ro" \ + --bind "${LOCAL_TMP}:${LOCAL_TMP}" \ + --env BUILD_STAGE="$STAGE" \ + --env BUILD_PYG_WHEELS="$PYG_WHEELS_DIR" \ + --env BUILD_SHA="$HG_HYDRAGNN_SHA" \ "$HG_SIF" \ - bash /scripts/overlay_install.sh + bash "${LOCAL_TMP}/overlay_install.sh" echo "" -echo "--- Copying finished overlay to NFS: $OVERLAY ---" +echo "--- Copying finished overlay to shared storage: $OVERLAY ---" +mkdir -p "$(dirname "$OVERLAY")" echo " Image size: $(du -sh "$LOCAL_OVERLAY" | cut -f1)" cp "$LOCAL_OVERLAY" "$OVERLAY" -rm -f "$LOCAL_OVERLAY" echo " Done." 
+# Cleanup local tmp +rm -rf "$LOCAL_TMP" + echo "" echo "=== Overlay ready ===" ls -lh "$OVERLAY" echo "" -echo "To use in inference jobs:" -echo " export HG_OVERLAY=$OVERLAY" -echo " sbatch sbatch_infer_amd.sh" +echo "SHA pinned: $HG_HYDRAGNN_SHA" +echo "To use: export HG_OVERLAY=$OVERLAY" diff --git a/material_science/models/HydraGNN/examples/parse_convergence.py b/material_science/models/HydraGNN/examples/parse_convergence.py new file mode 100755 index 0000000..f5fadaa --- /dev/null +++ b/material_science/models/HydraGNN/examples/parse_convergence.py @@ -0,0 +1,360 @@ +#!/usr/bin/env python3 +"""parse_convergence.py — Extract training metrics from HydraGNN SLURM logs or TensorBoard event files. + +Usage: + # From SLURM stdout log: + python parse_convergence.py --log hydragnn-train-6294.out --output convergence.csv + + # From TensorBoard log directory: + python parse_convergence.py --tbdir logs/hydragnn-train-6294-N1/ --output convergence.csv + + # Generate a plot: + python parse_convergence.py --log hydragnn-train-6294.out --plot convergence.png + +Does NOT modify HydraGNN source code. Parses whatever HydraGNN already emits. +""" + +import argparse +import csv +import json +import os +import re +import sys +from pathlib import Path + + +def parse_slurm_log(log_path: str) -> list[dict]: + """Parse HydraGNN training metrics from SLURM stdout or run.log. 
+ + HydraGNN (pinned SHA 6c45f168) emits these patterns: + - Epoch summary (requires HYDRAGNN_VALTEST=1): + 0: Epoch: 01, Train Loss: 0.12345678, Val Loss: 0.23456789, Test Loss: 0.34567890 + - Per-task losses: + 0: Tasks Train Loss: [0.123, 0.456] + - Timing: + 0: Process 0 - Local timer: train_validate_test : 183.3 + - tqdm progress (always emitted): + Train: 100%|...|50/50 [03:03<00:00, 3.66s/it] + """ + records = [] + + # Epoch summary: "Epoch: XX, Train Loss: X.XX, Val Loss: X.XX, Test Loss: X.XX" + epoch_loss_pattern = re.compile( + r"^0:\s*Epoch:\s*(\d+),\s*Train Loss:\s*([\d.eE+-]+),\s*" + r"Val Loss:\s*([\d.eE+-]+),\s*Test Loss:\s*([\d.eE+-]+)", + ) + + # Per-task losses: "Tasks Train Loss: [0.123, 0.456]" + tasks_loss_pattern = re.compile( + r"^0:\s*Tasks\s+(Train|Val|Test)\s+Loss:\s*\[(.*?)\]", + ) + + # Timer lines: "0: Process 0 - Local timer: train_validate_test : 183.3" + timer_pattern = re.compile( + r"^0:\s*(?:Process 0 - )?(?:Local )?timer:\s*([\w]+)\s*:\s*([\d.]+)", + ) + + # tqdm progress bar: "Train: 100%|...|50/50 [03:03<00:00, 3.66s/it]" + tqdm_pattern = re.compile( + r"Train:\s*(\d+)%\|.*?\|\s*(\d+)/(\d+)\s*\[([^\]]+)\]", + ) + + # Memory line (per-batch indicator): "0: Max memory allocated after optimizer step: X.XX GB Y.YY GB" + mem_pattern = re.compile( + r"^0:\s*Max memory allocated after optimizer step:\s*([\d.]+)\s*GB\s*([\d.]+)\s*GB", + ) + + current_epoch = 0 + batch_count = 0 + last_tasks_epoch = None + + with open(log_path) as f: + for line in f: + line = line.rstrip() + + m = epoch_loss_pattern.match(line) + if m: + rec = { + "epoch": int(m.group(1)), + "batch": None, + "train_loss": float(m.group(2)), + "val_loss": float(m.group(3)), + "test_loss": float(m.group(4)), + "tasks_train": None, + "tasks_val": None, + "wall_time_s": None, + "source": "epoch_summary", + } + records.append(rec) + last_tasks_epoch = rec + continue + + m = tasks_loss_pattern.match(line) + if m and last_tasks_epoch: + split_name = m.group(1).lower() + 
values = [float(x.strip()) for x in m.group(2).split(",")] + key = f"tasks_{split_name}" + last_tasks_epoch[key] = values + continue + + m = tqdm_pattern.search(line) + if m: + pct = int(m.group(1)) + done = int(m.group(2)) + total = int(m.group(3)) + timing_str = m.group(4) + rate_match = re.search(r"([\d.]+)(s|ms)/it", timing_str) + rate_s = None + if rate_match: + rate_s = float(rate_match.group(1)) + if rate_match.group(2) == "ms": + rate_s /= 1000.0 + if pct == 100: + rec = { + "epoch": current_epoch, + "batch": done, + "train_loss": None, + "val_loss": None, + "test_loss": None, + "tasks_train": None, + "tasks_val": None, + "wall_time_s": rate_s, + "source": "tqdm_final", + } + records.append(rec) + current_epoch += 1 + continue + + m = mem_pattern.match(line) + if m: + batch_count += 1 + continue + + m = timer_pattern.match(line) + if m: + timer_name = m.group(1) + timer_val = float(m.group(2)) + if timer_name == "train_validate_test" and records: + for rec in reversed(records): + if rec.get("source") in ("epoch_summary", "tqdm_final"): + rec["wall_time_s"] = timer_val + break + continue + + # Add batch count metadata if available + if batch_count > 0 and records: + records[0].setdefault("total_batches_observed", batch_count) + + return records + + +def parse_tensorboard(tb_dir: str) -> list[dict]: + """Parse TensorBoard event files for scalar metrics.""" + try: + from tensorboard.backend.event_processing.event_accumulator import EventAccumulator + except ImportError: + print("ERROR: tensorboard package required for --tbdir mode.", file=sys.stderr) + print(" pip install tensorboard", file=sys.stderr) + sys.exit(1) + + ea = EventAccumulator(tb_dir) + ea.Reload() + + tags = ea.Tags().get("scalars", []) + if not tags: + print(f"WARNING: No scalar tags found in {tb_dir}", file=sys.stderr) + return [] + + records = [] + for tag in tags: + events = ea.Scalars(tag) + for ev in events: + records.append({ + "tag": tag, + "step": ev.step, + "value": ev.value, + 
"wall_time_s": ev.wall_time, + }) + + records.sort(key=lambda r: (r["step"], r["tag"])) + return records + + +def write_csv(records: list[dict], output_path: str, mode: str = "slurm"): + """Write records to CSV.""" + if not records: + print("WARNING: No records to write.", file=sys.stderr) + return + + if mode == "slurm": + fieldnames = [ + "epoch", "batch", "train_loss", "val_loss", "test_loss", + "tasks_train", "tasks_val", "wall_time_s", "source", + ] + else: + fieldnames = ["step", "tag", "value", "wall_time_s"] + + with open(output_path, "w", newline="") as f: + writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore") + writer.writeheader() + for rec in records: + row = {k: rec.get(k) for k in fieldnames} + if row.get("tasks_train") and isinstance(row["tasks_train"], list): + row["tasks_train"] = ";".join(f"{v:.8f}" for v in row["tasks_train"]) + if row.get("tasks_val") and isinstance(row["tasks_val"], list): + row["tasks_val"] = ";".join(f"{v:.8f}" for v in row["tasks_val"]) + writer.writerow(row) + + print(f"Wrote {len(records)} records to {output_path}") + + +def plot_convergence(records: list[dict], plot_path: str, mode: str = "slurm"): + """Generate a convergence plot.""" + try: + import matplotlib + matplotlib.use("Agg") + import matplotlib.pyplot as plt + except ImportError: + print("ERROR: matplotlib required for --plot mode.", file=sys.stderr) + print(" pip install matplotlib", file=sys.stderr) + sys.exit(1) + + if mode == "slurm": + epoch_records = [r for r in records if r.get("source") == "epoch_summary"] + if not epoch_records: + epoch_records = [r for r in records if r.get("train_loss") is not None] + + if not epoch_records: + print("WARNING: No training loss values found for plotting.", file=sys.stderr) + return + + fig, ax = plt.subplots(figsize=(10, 6)) + epochs = [r["epoch"] for r in epoch_records] + train_loss = [r["train_loss"] for r in epoch_records] + ax.plot(epochs, train_loss, "b-o", label="Train Loss", markersize=4) + + 
val_data = [(r["epoch"], r["val_loss"]) for r in epoch_records + if r.get("val_loss") is not None] + if val_data: + val_x, val_y = zip(*val_data) + ax.plot(val_x, val_y, "r-s", label="Val Loss", markersize=4) + + test_data = [(r["epoch"], r["test_loss"]) for r in epoch_records + if r.get("test_loss") is not None] + if test_data: + test_x, test_y = zip(*test_data) + ax.plot(test_x, test_y, "g-^", label="Test Loss", markersize=4) + + ax.set_xlabel("Epoch") + ax.set_ylabel("Loss (MAE)") + ax.set_title("HydraGNN Training Convergence") + ax.legend() + ax.grid(True, alpha=0.3) + if all(v > 0 for v in train_loss): + ax.set_yscale("log") + + else: + tags = sorted(set(r["tag"] for r in records)) + fig, ax = plt.subplots(figsize=(10, 6)) + for tag in tags: + tag_data = [(r["step"], r["value"]) for r in records if r["tag"] == tag] + if tag_data: + steps, values = zip(*tag_data) + ax.plot(steps, values, label=tag, markersize=2) + ax.set_xlabel("Step") + ax.set_ylabel("Value") + ax.set_title("HydraGNN Training Convergence (TensorBoard)") + ax.legend(fontsize=8) + ax.grid(True, alpha=0.3) + + plt.tight_layout() + plt.savefig(plot_path, dpi=150) + print(f"Plot saved to {plot_path}") + + +def main(): + parser = argparse.ArgumentParser( + description="Extract convergence metrics from HydraGNN training runs." + ) + parser.add_argument( + "--log", type=str, help="Path to SLURM stdout log file (hydragnn-train-*.out)" + ) + parser.add_argument( + "--tbdir", type=str, help="Path to TensorBoard log directory" + ) + parser.add_argument( + "--output", "-o", type=str, default="convergence.csv", + help="Output CSV file path (default: convergence.csv)" + ) + parser.add_argument( + "--plot", type=str, default=None, + help="If set, generate a convergence plot at this path (e.g. 
convergence.png)" + ) + parser.add_argument( + "--json", action="store_true", + help="Also write a JSON summary alongside the CSV" + ) + + args = parser.parse_args() + + if not args.log and not args.tbdir: + parser.error("Must specify either --log or --tbdir") + + if args.log and args.tbdir: + parser.error("Specify only one of --log or --tbdir") + + if args.log: + if not os.path.isfile(args.log): + print(f"ERROR: Log file not found: {args.log}", file=sys.stderr) + sys.exit(1) + records = parse_slurm_log(args.log) + mode = "slurm" + else: + if not os.path.isdir(args.tbdir): + print(f"ERROR: TensorBoard dir not found: {args.tbdir}", file=sys.stderr) + sys.exit(1) + records = parse_tensorboard(args.tbdir) + mode = "tensorboard" + + if not records: + print("WARNING: No metrics found. The log may not contain training output.", + file=sys.stderr) + print(" (Jobs that were cancelled before training started won't have metrics.)", + file=sys.stderr) + sys.exit(0) + + write_csv(records, args.output, mode=mode) + + if args.json: + json_path = Path(args.output).with_suffix(".json") + with open(json_path, "w") as f: + json.dump(records, f, indent=2, default=str) + print(f"JSON written to {json_path}") + + if args.plot: + plot_convergence(records, args.plot, mode=mode) + + # Print summary + if mode == "slurm" and records: + train_values = [r["train_loss"] for r in records + if r.get("train_loss") is not None] + if train_values: + print(f"\nSummary:") + print(f" Total records: {len(records)}") + print(f" Epochs with loss data: {len(train_values)}") + print(f" Train Loss range: {min(train_values):.8f} -> {max(train_values):.8f}") + if len(train_values) > 1: + reduction = (train_values[0] - train_values[-1]) / train_values[0] * 100 + print(f" Reduction: {reduction:.1f}%") + else: + tqdm_recs = [r for r in records if r.get("source") == "tqdm_final"] + if tqdm_recs: + print(f"\nSummary (no epoch-level loss — run with HYDRAGNN_VALTEST=1):") + print(f" Epochs completed: 
{len(tqdm_recs)}") + rates = [r["wall_time_s"] for r in tqdm_recs if r.get("wall_time_s")] + if rates: + print(f" Avg time per batch: {sum(rates)/len(rates):.2f} s/it") + + +if __name__ == "__main__": + main() diff --git a/material_science/models/HydraGNN/examples/run_inference.sh b/material_science/models/HydraGNN/examples/run_inference.sh index b73bacd..043d5a5 100755 --- a/material_science/models/HydraGNN/examples/run_inference.sh +++ b/material_science/models/HydraGNN/examples/run_inference.sh @@ -1,24 +1,26 @@ #!/usr/bin/env bash # run_inference.sh — Load a HydraGNN Predictive GFM 2024 checkpoint and run predictions # -# Run inside the container launched by docker_run.sh: +# Run inside the container (via sbatch_infer_amd.sh or docker_run.sh): # docker exec -it hydragnn bash /examples/run_inference.sh # # Prerequisites: -# - HydraGNN installed at /workspace/HydraGNN (docker_run.sh handles this) # - A checkpoint downloaded from HuggingFace mlupopa/HydraGNN_Predictive_GFM_2024 # Set HG_CHECKPOINT and HG_CONFIG to matching .pk and config.json files. # -# OPTIONAL EDITS: +# Environment variables: # HG_CHECKPOINT — path to a .pk checkpoint file (required) # HG_CONFIG — path to the matching config.json (required; same Hub subfolder) # HG_OUTPUT_DIR — where to write prediction outputs (default: /workspace/results) +# HG_INFER_REPO — HydraGNN clone (Predictive_GFM_2024 branch) for inference code. +# If not set, clones to $HG_OUTPUT_DIR/HydraGNN-infer on first run. set -euo pipefail HG_CHECKPOINT="${HG_CHECKPOINT:-}" HG_CONFIG="${HG_CONFIG:-}" HG_OUTPUT_DIR="${HG_OUTPUT_DIR:-/workspace/results}" +HG_INFER_REPO="${HG_INFER_REPO:-${HG_OUTPUT_DIR}/HydraGNN-infer}" if [[ -z "${HG_CHECKPOINT}" || -z "${HG_CONFIG}" ]]; then echo "ERROR: HG_CHECKPOINT and HG_CONFIG must both be set." @@ -48,16 +50,27 @@ fi mkdir -p "$HG_OUTPUT_DIR" +# Clone the Predictive_GFM_2024 branch (matches checkpoint architecture) +if [[ ! 
-d "${HG_INFER_REPO}/hydragnn" ]]; then + echo "--- Cloning HydraGNN (Predictive_GFM_2024 branch) for inference ---" + git clone --depth=1 --branch Predictive_GFM_2024 \ + https://github.com/ORNL/HydraGNN.git "$HG_INFER_REPO" +fi + echo "=== HydraGNN Predictive GFM 2024 — Inference ===" echo " Checkpoint : $HG_CHECKPOINT" echo " Config : $HG_CONFIG" +echo " Infer code : $HG_INFER_REPO" echo " Output : $HG_OUTPUT_DIR" echo "" python - <-v2.bp dirs (default: /data) +# HG_BATCH_SIZE Override batch size from config +# HG_NUM_EPOCH Override epoch count from config +# HG_PRECISION fp32 | fp64 | bf16 (default: fp64) +# HG_CONFIG JSON config filename (default: gfm_mlip.json) set -euo pipefail -HG_EXAMPLE="${HG_EXAMPLE:-multidataset_hpo}" -HG_CONFIG="${HG_CONFIG:-}" +HG_REPO_DIR="${HG_REPO_DIR:-/workspace/HydraGNN}" +HG_DATASETS="${HG_DATASETS:-ANI1x,Alexandria}" HG_DATA_DIR="${HG_DATA_DIR:-/data}" -HG_OUTPUT_DIR="${HG_OUTPUT_DIR:-/workspace/checkpoints}" +HG_BATCH_SIZE="${HG_BATCH_SIZE:-}" +HG_NUM_EPOCH="${HG_NUM_EPOCH:-}" +HG_PRECISION="${HG_PRECISION:-fp64}" +HG_CONFIG="${HG_CONFIG:-gfm_mlip.json}" -mkdir -p "$HG_DATA_DIR" "$HG_OUTPUT_DIR" -cd /workspace/HydraGNN - -echo "=== HydraGNN — Training ===" -echo " Example : $HG_EXAMPLE" -echo " Data dir : $HG_DATA_DIR" -echo " Output dir : $HG_OUTPUT_DIR" -echo "" -echo "NOTE: Full GFM 2024 pretraining requires staged ADIOS data from Constellation" -echo " and large-scale compute (see recipes/train/README.md)." -echo " For local experiments, use one of the smaller datasets bundled in the repo." -echo "" +EXAMPLE_DIR="${HG_REPO_DIR}/examples/multidataset_hpo_sc26" -if [[ -z "${HG_CONFIG}" ]]; then - echo "ERROR: HG_CONFIG is not set." 
- echo " Point it at a JSON config file for your chosen example, e.g.:" - echo " HG_CONFIG=/workspace/HydraGNN/examples/${HG_EXAMPLE}/config.json \\" - echo " bash /examples/run_train.sh" - exit 1 +# --------------------------------------------------------------------------- +# Clone HydraGNN if not present +# --------------------------------------------------------------------------- +if [[ ! -d "$EXAMPLE_DIR" ]]; then + echo "--- Cloning HydraGNN ---" + git clone --depth=1 https://github.com/ORNL/HydraGNN.git "$HG_REPO_DIR" fi -echo "Using config: $HG_CONFIG" +# --------------------------------------------------------------------------- +# Symlink datasets into expected location +# --------------------------------------------------------------------------- +DATASET_DIR="${EXAMPLE_DIR}/dataset" +mkdir -p "$DATASET_DIR" + +IFS=',' read -ra DATASET_ARRAY <<< "$HG_DATASETS" +for ds in "${DATASET_ARRAY[@]}"; do + target="${DATASET_DIR}/${ds}-v2.bp" + source="${HG_DATA_DIR}/${ds}-v2.bp" + if [[ ! 
-e "$target" ]] && [[ -d "$source" ]]; then + ln -sfn "$source" "$target" + fi +done + +# --------------------------------------------------------------------------- +# Print configuration +# --------------------------------------------------------------------------- +echo "=== HydraGNN Training ===" +echo " Repo dir : $HG_REPO_DIR" +echo " Example : $EXAMPLE_DIR" +echo " Config : $HG_CONFIG" +echo " Datasets : $HG_DATASETS" +echo " Precision : $HG_PRECISION" +echo " Batch size : ${HG_BATCH_SIZE:-config default}" +echo " Num epochs : ${HG_NUM_EPOCH:-config default}" echo "" -python -u "examples/${HG_EXAMPLE}/train.py" \ - --config "$HG_CONFIG" \ - --log_dir "$HG_OUTPUT_DIR" +cd "$EXAMPLE_DIR" + +# --------------------------------------------------------------------------- +# Build training command +# --------------------------------------------------------------------------- +TRAIN_ARGS=( + python -u gfm_mlip_all_mpnn.py + --inputfile="$HG_CONFIG" + --multi + --multi_model_list="$HG_DATASETS" + --precision="$HG_PRECISION" + --log="hydragnn-train-$(date +%Y%m%d-%H%M%S)" +) + +[[ -n "$HG_BATCH_SIZE" ]] && TRAIN_ARGS+=(--batch_size="$HG_BATCH_SIZE") +[[ -n "$HG_NUM_EPOCH" ]] && TRAIN_ARGS+=(--num_epoch="$HG_NUM_EPOCH") + +echo "Running: ${TRAIN_ARGS[*]}" +echo "" +exec "${TRAIN_ARGS[@]}" diff --git a/material_science/models/HydraGNN/examples/sbatch_infer_amd.sh b/material_science/models/HydraGNN/examples/sbatch_infer_amd.sh index 79d09a7..16c7218 100755 --- a/material_science/models/HydraGNN/examples/sbatch_infer_amd.sh +++ b/material_science/models/HydraGNN/examples/sbatch_infer_amd.sh @@ -101,15 +101,18 @@ echo " Config : $HG_CONFIG" echo " Output : $HG_OUTPUT_DIR" echo "" +HG_INFER_REPO="${HG_INFER_REPO:-${HG_OUTPUT_DIR}/HydraGNN-infer}" + apptainer exec --rocm \ --overlay "${HG_OVERLAY}:ro" \ --bind "$(dirname "$HG_CHECKPOINT"):$(dirname "$HG_CHECKPOINT"):ro" \ --bind "$(dirname "$HG_CONFIG"):$(dirname "$HG_CONFIG"):ro" \ --bind "${HG_OUTPUT_DIR}:${HG_OUTPUT_DIR}" 
\ --bind "$SCRIPT_DIR":/examples:ro \ - --env PYTHONPATH="/opt/hydragnn-pkgs:${PYTHONPATH:-}" \ --env HG_CHECKPOINT="$HG_CHECKPOINT" \ --env HG_CONFIG="$HG_CONFIG" \ --env HG_OUTPUT_DIR="$HG_OUTPUT_DIR" \ + --env HG_INFER_REPO="$HG_INFER_REPO" \ + --env PYTHONPATH="/opt/hydragnn-pkgs" \ "$HG_SIF" \ bash -c "source /opt/venv/bin/activate && bash /examples/run_inference.sh" diff --git a/material_science/models/HydraGNN/examples/sbatch_train_amd.sh b/material_science/models/HydraGNN/examples/sbatch_train_amd.sh new file mode 100755 index 0000000..20dfe2c --- /dev/null +++ b/material_science/models/HydraGNN/examples/sbatch_train_amd.sh @@ -0,0 +1,342 @@ +#!/usr/bin/env bash +# HydraGNN multi-dataset training on AMD Instinct MI355X via SLURM (Apptainer). +# +# Emulates the upstream HydraGNN-scaling-test.sh workflow: +# https://github.com/ORNL/HydraGNN/blob/main/run-scripts/HydraGNN-scaling-test.sh +# +# Uses the repo-bundled gfm_mlip.json config with ANI1x and Alexandria ADIOS +# datasets via MPI-enabled AdiosMultiDataset (no DDStore). +# +# Prerequisites: +# 1. Apptainer SIF with PyTorch ROCm (default: rocm7.2.2 + py3.12) +# 2. Pre-built overlay with hydragnn, torch-geometric, mpi4py, adios2 (MPI) +# Built via: build_overlay_amd.sh +# 3. 
ADIOS datasets staged on shared filesystem +# +# Quick-start (1 node, 8 GPUs): +# export AI4S_SHARED_DIR=/path/to/shared # set via /init-cluster +# sbatch sbatch_train_amd.sh +# +# Multi-node (4 nodes, 32 GPUs): +# export AI4S_SHARED_DIR=/path/to/shared +# sbatch --nodes=4 sbatch_train_amd.sh +# +# Key environment variables: +# AI4S_SHARED_DIR Base path for shared assets (required, no default) +# HG_SIF Path to Apptainer SIF image +# HG_OVERLAY Path to pre-built ext3 overlay +# HG_DATASETS Comma-separated dataset names (default: ANI1x,Alexandria) +# HG_DATA_DIR Directory containing -v2.bp dirs +# HG_BATCH_SIZE Per-rank batch size (default: 200) +# HG_NUM_EPOCH Number of training epochs (default: 1) +# HG_PRECISION fp32 | fp64 | bf16 (default: fp64) +# HG_OUTPUT_DIR Working directory for logs/checkpoints +# HG_REPO_DIR HydraGNN source clone location +# SCRATCH_LOCAL Node-local fast storage root (default: /scratch) +# HYDRAGNN_MAX_NUM_BATCH Cap batches per epoch (sanity test: 20-50, full: unset) +# HYDRAGNN_VALTEST Run val/test during training (default: 0 = skip) +# +# HydraGNN pinned SHA: 6c45f1682783e66dc89e9e23009f61716186432b (main) + +#SBATCH --job-name=hydragnn-train +#SBATCH --partition=YOUR_GPU_PARTITION +#SBATCH --account=YOUR_ACCOUNT +#SBATCH --nodes=1 +#SBATCH --ntasks-per-node=8 +#SBATCH --gpus-per-node=8 +#SBATCH --cpus-per-task=16 +#SBATCH --time=02:00:00 +#SBATCH --output=hydragnn-train-%j.out +#SBATCH --error=hydragnn-train-%j.out + +set -euo pipefail + +# --------------------------------------------------------------------------- +# SCRIPT_DIR — works both at submit time and inside the SLURM spool +# --------------------------------------------------------------------------- +if [[ -n "${SLURM_JOB_ID:-}" ]]; then + _ORIG_CMD=$(scontrol show job "$SLURM_JOB_ID" | sed -n 's/.*Command=\(\S\+\).*/\1/p') + SCRIPT_DIR=$(cd "$(dirname "$_ORIG_CMD")" && pwd) +else + SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd) +fi + +# 
--------------------------------------------------------------------------- +# Paths and configuration +# --------------------------------------------------------------------------- +HG_BASE="${AI4S_SHARED_DIR:?AI4S_SHARED_DIR must be set}/models/HydraGNN" +HG_SIF="${HG_SIF:-${AI4S_SHARED_DIR}/images/pytorch_rocm7.2.2_ubuntu24.04_py3.12_pytorch_release_2.10.0.sif}" +HG_OVERLAY="${HG_OVERLAY:-${HG_BASE}/overlays/hydragnn-overlay.img}" +HG_DATASETS="${HG_DATASETS:-ANI1x,Alexandria}" +HG_DATA_DIR="${HG_DATA_DIR:-${HG_BASE}/weights}" +HG_BATCH_SIZE="${HG_BATCH_SIZE:-200}" +HG_NUM_EPOCH="${HG_NUM_EPOCH:-1}" +HG_PRECISION="${HG_PRECISION:-fp64}" +HG_OUTPUT_DIR="${HG_OUTPUT_DIR:-${HG_BASE}/outputs}" +HG_REPO_DIR="${HG_REPO_DIR:-${HG_BASE}/code/HydraGNN}" + +NODES="${SLURM_JOB_NUM_NODES:-1}" +GPUS_PER_NODE=8 +TOTAL_RANKS=$((NODES * GPUS_PER_NODE)) + +# --------------------------------------------------------------------------- +# Validate required files +# --------------------------------------------------------------------------- +for var in HG_SIF HG_OVERLAY; do + if [[ ! -f "${!var}" ]]; then + echo "ERROR: $var not found: ${!var}" >&2 + exit 2 + fi +done + +IFS=',' read -ra DATASET_ARRAY <<< "$HG_DATASETS" +for ds in "${DATASET_ARRAY[@]}"; do + ds_path="${HG_DATA_DIR}/${ds}-v2.bp" + if [[ ! -d "$ds_path" ]]; then + echo "ERROR: Dataset not found: $ds_path" >&2 + exit 2 + fi +done + +# --------------------------------------------------------------------------- +# Clone HydraGNN source if not present +# --------------------------------------------------------------------------- +HG_HYDRAGNN_SHA="${HG_HYDRAGNN_SHA:-6c45f1682783e66dc89e9e23009f61716186432b}" +if [[ ! 
-d "${HG_REPO_DIR}/examples/multidataset_hpo_sc26" ]]; then + echo "--- Cloning HydraGNN (pinned SHA: ${HG_HYDRAGNN_SHA}) ---" + git clone https://github.com/ORNL/HydraGNN.git "$HG_REPO_DIR" + git -C "$HG_REPO_DIR" checkout "$HG_HYDRAGNN_SHA" +fi + +# --------------------------------------------------------------------------- +# Symlink datasets into the expected location +# --------------------------------------------------------------------------- +EXAMPLE_DIR="${HG_REPO_DIR}/examples/multidataset_hpo_sc26" +DATASET_DIR="${EXAMPLE_DIR}/dataset" +mkdir -p "$DATASET_DIR" + +for ds in "${DATASET_ARRAY[@]}"; do + target="${DATASET_DIR}/${ds}-v2.bp" + source="${HG_DATA_DIR}/${ds}-v2.bp" + if [[ ! -e "$target" ]]; then + ln -sfn "$source" "$target" + echo " Linked: $target -> $source" + fi +done + +# --------------------------------------------------------------------------- +# Prepare output directory +# --------------------------------------------------------------------------- +mkdir -p "$HG_OUTPUT_DIR" + +# --------------------------------------------------------------------------- +# Print job configuration +# --------------------------------------------------------------------------- +echo "=== HydraGNN Training (MPI-enabled ADIOS, no DDStore) ===" +echo " Nodes : $NODES" +echo " GPUs/node : $GPUS_PER_NODE" +echo " Total ranks : $TOTAL_RANKS" +echo " SIF : $HG_SIF" +echo " Overlay : $HG_OVERLAY" +echo " Datasets : $HG_DATASETS" +echo " Data dir : $HG_DATA_DIR" +echo " Precision : $HG_PRECISION" +echo " Batch size : ${HG_BATCH_SIZE:-200}" +echo " Num epochs : ${HG_NUM_EPOCH:-1}" +echo " Max batches : ${HYDRAGNN_MAX_NUM_BATCH:-unlimited}" +echo " Output dir : $HG_OUTPUT_DIR" +echo " Repo dir : $HG_REPO_DIR" +echo " Config : ${EXAMPLE_DIR}/gfm_mlip.json" +echo " Node(s) : ${SLURM_NODELIST:-$(hostname)}" +echo " Date : $(date)" +echo "" + +# --------------------------------------------------------------------------- +# Write rank script (runs inside the container on 
every rank) +# --------------------------------------------------------------------------- +RANK_SCRIPT="${HG_OUTPUT_DIR}/hydragnn-rank-${SLURM_JOB_ID:-$$}.sh" +cat > "$RANK_SCRIPT" << 'RANKEOF' +#!/usr/bin/env bash +set -euo pipefail +source /opt/venv/bin/activate + +export PYTHONPATH="/opt/hydragnn-pkgs:${PYTHONPATH:-}" +export LD_LIBRARY_PATH="/opt/hydragnn-pkgs/adios2:/opt/ompi/lib:${LD_LIBRARY_PATH:-}" + +# Use node-local fast storage for per-rank temp writes (MIOpen cache, etc.) +# SCRATCH_LOCAL comes from .cluster-config.yaml paths.scratch_local (e.g. /scratch, /local, /tmp) +SCRATCH_RANK="${SCRATCH_LOCAL:-/scratch}/${USER:?}/hydragnn-${SLURM_JOB_ID:-$$}/${SLURM_PROCID:-0}" +mkdir -p "$SCRATCH_RANK" +export TMPDIR="$SCRATCH_RANK" + +export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-16}" +export MIOPEN_DISABLE_CACHE=1 +export MIOPEN_USER_DB_PATH="${SCRATCH_RANK}/miopen" +mkdir -p "$MIOPEN_USER_DB_PATH" +export PYTHONNOUSERSITE=1 + +# HydraGNN runtime env vars (from upstream HydraGNN-scaling-test.sh) +export HYDRAGNN_USE_VARIABLE_GRAPH_SIZE=1 +export HYDRAGNN_AGGR_BACKEND=mpi +export HYDRAGNN_VALTEST="${HYDRAGNN_VALTEST:-0}" +export HYDRAGNN_USE_FSDP=0 +export HYDRAGNN_TRACE_LEVEL="${HYDRAGNN_TRACE_LEVEL:-1}" + +# Required for RCCL stability on MI355X +export HSA_NO_SCRATCH_RECLAIM=1 + +cd "$HG_OUTPUT_DIR" + +# Monkey-patch AdiosMultiDataset to inject avg_num_neighbors, avoiding an +# expensive full-dataset scan of neighbour counts at init. 
+export HYDRAGNN_AVG_NUM_NEIGHBORS="${HYDRAGNN_AVG_NUM_NEIGHBORS:-13.735293601560318}" + +exec python -u -c " +import sys, os, runpy + +avg_nn = float(os.environ.get('HYDRAGNN_AVG_NUM_NEIGHBORS', '0')) +if avg_nn > 0: + import hydragnn.utils.datasets.adiosdataset as adm + _orig_init = adm.AdiosMultiDataset.__init__ + def _patched_init(self, *a, **kw): + _orig_init(self, *a, **kw) + self.avg_num_neighbors = avg_nn + adm.AdiosMultiDataset.__init__ = _patched_init + +batch_size = int(os.environ.get('HG_BATCH_SIZE') or 200) +max_num_batch = int(os.environ.get('HYDRAGNN_MAX_NUM_BATCH') or 0) +num_samples_val = max_num_batch * batch_size if max_num_batch > 0 else 0 + +sys.argv = [ + os.environ['HG_EXAMPLE_DIR'] + '/gfm_mlip_all_mpnn.py', + '--inputfile=' + os.environ['HG_EXAMPLE_DIR'] + '/gfm_mlip.json', + '--multi', + '--everyone', + '--multi_model_list=' + os.environ['HG_DATASETS'], + '--precision=' + os.environ['HG_PRECISION'], + '--batch_size=' + str(batch_size), + '--num_epoch=' + os.environ.get('HG_NUM_EPOCH', '1'), + '--startfrom=none', + '--log=hydragnn-train-' + os.environ.get('SLURM_JOB_ID','0') + '-N' + os.environ.get('SLURM_JOB_NUM_NODES','1'), +] + +if num_samples_val > 0: + sys.argv += [ + '--num_samples=' + str(num_samples_val), + '--oversampling', + '--oversampling_num_samples=' + str(num_samples_val), + ] + +runpy.run_path(sys.argv[0], run_name='__main__') +" +RANKEOF +chmod +x "$RANK_SCRIPT" + +# --------------------------------------------------------------------------- +# Multi-node network environment (required for >1 node) +# +# All cluster-specific values are read from .cluster-config.yaml (gitignored) +# or overridden via env vars: +# NCCL_IB_HCA — IB HCA device list (discover: ibstat | grep 'CA ') +# NCCL_SOCKET_IFNAME — Management NIC for OOB (discover: ip -o link show up) +# RCCL_ANP_PLUGIN — Path to librccl-anp.so on host +# LIBIONIC_PATH — Path to libionic.so.1 on host +# 
--------------------------------------------------------------------------- +RCCL_MULTINODE_ENVS=() +MPI_MULTINODE_ENVS=() +if [[ "$NODES" -gt 1 ]]; then + # Read cluster-specific network settings from .cluster-config.yaml if not + # already set via environment. Requires yq or falls back to grep parsing. + _CLUSTER_CFG="${SCRIPT_DIR}/../../../../.cluster-config.yaml" + if [[ -f "$_CLUSTER_CFG" ]]; then + _yaml_get() { grep "^ $1:" "$_CLUSTER_CFG" 2>/dev/null | sed 's/.*: *"\?\([^"]*\)"\?.*/\1/' | grep -v '^$'; } + : "${NCCL_IB_HCA:=$(_yaml_get ib_hca)}" + : "${NCCL_SOCKET_IFNAME:=$(_yaml_get mgmt_iface)}" + : "${RCCL_ANP_PLUGIN:=$(_yaml_get rccl_anp_plugin)}" + : "${LIBIONIC_PATH:=$(_yaml_get libionic_path)}" + fi + + # Validate required settings + : "${NCCL_IB_HCA:?Set NCCL_IB_HCA or configure network.ib_hca in .cluster-config.yaml}" + : "${NCCL_SOCKET_IFNAME:?Set NCCL_SOCKET_IFNAME or configure network.mgmt_iface in .cluster-config.yaml}" + : "${RCCL_ANP_PLUGIN:?Set RCCL_ANP_PLUGIN or configure network.rccl_anp_plugin in .cluster-config.yaml}" + : "${LIBIONIC_PATH:?Set LIBIONIC_PATH or configure network.libionic_path in .cluster-config.yaml}" + + # MPI transport: ob1/tcp over management NIC. + # Pensando/ionic data NICs use /31 point-to-point subnets that don't route + # between nodes, so IB verbs cannot work for MPI. Only RCCL (via ANP plugin) + # bypasses IP routing on the data fabric. MPI uses TCP on the management NIC. + MPI_MULTINODE_ENVS=( + --env OMPI_MCA_pml=ob1 + --env OMPI_MCA_btl=tcp,self + --env OMPI_MCA_btl_tcp_if_include="$NCCL_SOCKET_IFNAME" + --env MPI4PY_RC_THREADS=false + ) + + # RCCL/NCCL env vars for ANP over ionic (from AMD MI355X documentation). + # The ANP plugin and libionic must be bind-mounted into the container since + # Apptainer --rocm does not expose them automatically. 
+ RCCL_MULTINODE_ENVS=( + --bind "${RCCL_ANP_PLUGIN}:${RCCL_ANP_PLUGIN}:ro" + --bind "${LIBIONIC_PATH}:${LIBIONIC_PATH}:ro" + --env NCCL_NET_PLUGIN="$RCCL_ANP_PLUGIN" + --env NCCL_IB_HCA="$NCCL_IB_HCA" + --env NCCL_IB_GID_INDEX=1 + --env NCCL_GDR_FLUSH_DISABLE=1 + --env RCCL_GDR_FLUSH_GPU_MEM_NO_RELAXED_ORDERING=0 + --env NCCL_GDRCOPY_ENABLE=0 + --env NCCL_IB_QPS_PER_CONNECTION=1 + --env HSA_NO_SCRATCH_RECLAIM=1 + --env NCCL_IB_TC=96 + --env NCCL_IB_FIFO_TC=192 + --env NCCL_IGNORE_CPU_AFFINITY=1 + --env NCCL_PXN_DISABLE=0 + --env NET_OPTIONAL_RECV_COMPLETION=1 + --env NCCL_IB_USE_INLINE=1 + --env NCCL_SOCKET_IFNAME="$NCCL_SOCKET_IFNAME" + --env RCCL_LL128_FORCE_ENABLE=1 + --env NCCL_IB_PCI_RELAXED_ORDERING=1 + --env NCCL_DMABUF_ENABLE=1 + --env NCCL_DEBUG="${NCCL_DEBUG:-WARN}" + ) + echo " Multi-node MPI transport: ob1/tcp (btl_tcp_if_include=$NCCL_SOCKET_IFNAME)" + echo " Multi-node RCCL env vars enabled" + echo " NCCL_IB_HCA=$NCCL_IB_HCA" + echo " NCCL_SOCKET_IFNAME=$NCCL_SOCKET_IFNAME" +fi + +# --------------------------------------------------------------------------- +# Launch distributed training via srun + apptainer +# --------------------------------------------------------------------------- +echo "--- Launching training: $TOTAL_RANKS ranks across $NODES nodes ---" +echo "" + +srun --mpi=pmix \ + apptainer exec \ + --rocm \ + --overlay "${HG_OVERLAY}:ro" \ + --bind "/opt/ompi:/opt/ompi:ro" \ + --bind "${SCRATCH_LOCAL:-/scratch}:${SCRATCH_LOCAL:-/scratch}" \ + --bind "${HG_REPO_DIR}:${HG_REPO_DIR}:ro" \ + --bind "${HG_DATA_DIR}:${HG_DATA_DIR}:ro" \ + --bind "${HG_OUTPUT_DIR}:${HG_OUTPUT_DIR}" \ + --env PMIX_MCA_gds=hash \ + --env PMIX_MCA_psec=native \ + "${MPI_MULTINODE_ENVS[@]}" \ + --env SCRATCH_LOCAL="${SCRATCH_LOCAL:-/scratch}" \ + --env HG_DATASETS="$HG_DATASETS" \ + --env HG_PRECISION="$HG_PRECISION" \ + --env HG_BATCH_SIZE="${HG_BATCH_SIZE:-200}" \ + --env HG_NUM_EPOCH="${HG_NUM_EPOCH:-1}" \ + --env 
HYDRAGNN_MAX_NUM_BATCH="${HYDRAGNN_MAX_NUM_BATCH:-}" \ + --env HYDRAGNN_VALTEST="${HYDRAGNN_VALTEST:-0}" \ + --env HYDRAGNN_TRACE_LEVEL="${HYDRAGNN_TRACE_LEVEL:-1}" \ + --env HG_EXAMPLE_DIR="$EXAMPLE_DIR" \ + --env HG_OUTPUT_DIR="$HG_OUTPUT_DIR" \ + "${RCCL_MULTINODE_ENVS[@]}" \ + "$HG_SIF" \ + bash "$RANK_SCRIPT" + +echo "" +echo "=== HydraGNN training complete ===" +echo " Logs: ${HG_OUTPUT_DIR}/" diff --git a/material_science/models/HydraGNN/model.yaml b/material_science/models/HydraGNN/model.yaml index 983e1bb..9f10928 100644 --- a/material_science/models/HydraGNN/model.yaml +++ b/material_science/models/HydraGNN/model.yaml @@ -6,9 +6,10 @@ domain: material_science upstream_code: https://github.com/ORNL/HydraGNN upstream_branch: Predictive_GFM_2024 paper: null -container_image: rocm/pytorch:rocm7.0_ubuntu22.04_py3.10_pytorch_release_2.7.1 +container_image: rocm/pytorch:rocm7.2.2_ubuntu24.04_py3.12_pytorch_release_2.10.0 vram_gb: null -validated_hardware: [] +validated_hardware: + - MI355X (gfx950) data_format: ADIOS (HDF5-based) data_source: https://doi.ccs.ornl.gov/dataset/3a49c8df-83f7-5d32-84be-f81d289e7cdd @@ -18,29 +19,59 @@ recipes: description: Load checkpoints and run predictions with the GFM ensemble recipe_path: recipes/inference/ script: examples/run_inference.sh + sbatch_script: examples/sbatch_infer_amd.sh - task: train - description: Frontier-scale pretraining and HPO + description: Multi-dataset GFM training with ANI1x and Alexandria (multi-node) recipe_path: recipes/train/ script: examples/run_train.sh + sbatch_script: examples/sbatch_train_amd.sh env_vars: + # Required for HPC scripts (sbatch_*_amd.sh, build_overlay_amd.sh) + AI4S_SHARED_DIR: + description: Base path for shared assets (SIF, overlays, datasets, code) + required: true + default: null + # Inference task HG_CHECKPOINT: - description: Path to .pk checkpoint file + description: Path to .pk checkpoint file (inference) required: true default: null HG_CONFIG: - description: Matching 
config.json for the checkpoint + description: Matching config.json for the checkpoint (inference) required: true default: null + # Training task + HG_DATASETS: + description: Comma-separated dataset names for training + required: false + default: ANI1x,Alexandria + HG_DATA_DIR: + description: Directory containing -v2.bp ADIOS dataset dirs + required: false + default: null + HG_BATCH_SIZE: + description: Per-rank batch size (overrides config) + required: false + default: null + HG_NUM_EPOCH: + description: Number of training epochs (overrides config) + required: false + default: null + HG_PRECISION: + description: Training precision (fp32, fp64, bf16) + required: false + default: fp64 + # Shared HG_OUTPUT_DIR: - description: Prediction output directory + description: Output/working directory required: false - default: /workspace/results - HG_EXAMPLE: - description: Upstream example folder name (train task) + default: null + HG_REPO_DIR: + description: HydraGNN source clone location (training, pinned SHA) required: false - default: multidataset_hpo - HG_DATA_DIR: - description: Dataset staging directory (train task) + default: null + HG_INFER_REPO: + description: HydraGNN Predictive_GFM_2024 branch clone (inference only) required: false - default: /data + default: null diff --git a/material_science/models/HydraGNN/recipes/inference/README.md b/material_science/models/HydraGNN/recipes/inference/README.md index 50ec64d..52c1e48 100644 --- a/material_science/models/HydraGNN/recipes/inference/README.md +++ b/material_science/models/HydraGNN/recipes/inference/README.md @@ -6,8 +6,8 @@ This is the usual **AI4Science Studio** entry point: use **published checkpoints ## Prerequisites -- Container: `rocm/pytorch:rocm7.0_ubuntu22.04_py3.10_pytorch_release_2.7.1` -- Code: clone [`ORNL/HydraGNN`](https://github.com/ORNL/HydraGNN) branch **`Predictive_GFM_2024`**; install with `pip install -e .` +- Container: `rocm/pytorch:rocm7.2.2_ubuntu24.04_py3.12_pytorch_release_2.10.0` 
(with hydragnn overlay) +- Code: **must use branch `Predictive_GFM_2024`** — the inference script auto-clones this at runtime because the GFM 2024 checkpoints are incompatible with the newer training SHA (different model architecture keys and state_dict structure) - Weights: download from [`mlupopa/HydraGNN_Predictive_GFM_2024`](https://huggingface.co/mlupopa/HydraGNN_Predictive_GFM_2024) -- under `Ensemble_of_models/`, each trial folder has a `config.json` and `.pk` checkpoint ## Environment Variables diff --git a/material_science/models/HydraGNN/recipes/train/README.md b/material_science/models/HydraGNN/recipes/train/README.md index 7ad0bd9..a607177 100644 --- a/material_science/models/HydraGNN/recipes/train/README.md +++ b/material_science/models/HydraGNN/recipes/train/README.md @@ -1,31 +1,268 @@ -# HydraGNN: large-scale training on HPC (Frontier-oriented) +# HydraGNN: Distributed Training on AMD Instinct MI355X -> **Ready-to-run scripts:** see [`../../examples/`](../../examples/) for `docker_run.sh` and `run_train.sh`. +> **Ready-to-run scripts:** see [`../../examples/`](../../examples/) for `sbatch_train_amd.sh` and `run_train.sh`. -This recipe summarizes the **exascale-oriented** workflow described on the [Hugging Face model card](https://huggingface.co/mlupopa/HydraGNN_Predictive_GFM_2024) and implemented in [`ORNL/HydraGNN`](https://github.com/ORNL/HydraGNN) branch **`Predictive_GFM_2024`**. It is **not** something most users can run without a major HPC allocation and staged **ADIOS** datasets. +This recipe documents multi-node GFM (Graph Foundation Model) training using HydraGNN on AMD Instinct MI355X GPUs with Pensando/ionic interconnect. 
-## Environment Variables +--- -| Variable | Required | Default | Description | -|----------|----------|---------|-------------| -| `HG_EXAMPLE` | No | `multidataset_hpo` | Upstream example folder name | -| `HG_CONFIG` | Yes | -- | JSON config path | -| `HG_DATA_DIR` | No | `/data` | Dataset staging directory | -| `HG_OUTPUT_DIR` | No | `/workspace/checkpoints` | Checkpoints and logs | +## 1. Environment & Dependencies -## Who can run this +No module loads or conda environments are needed. Everything runs inside an Apptainer container with a pre-built overlay: -- **OLCF** users with an approved project and compute on **Frontier** (or comparable **AMD** systems with a supported **PyTorch + ROCm** stack and sufficient scale). -- **General users:** prefer Hub checkpoints and [../inference/README.md](../inference/README.md), or smaller experiments on hardware you control—see [../compute/README.md](../compute/README.md). For **institutional AMD clusters**, see the cluster notes in [../compute/README.md](../compute/README.md). +| Component | Path | +|-----------|------| +| SIF image | `$AI4S_SHARED_DIR/images/pytorch_rocm7.2.2_ubuntu24.04_py3.12_pytorch_release_2.10.0.sif` | +| Overlay | `$AI4S_SHARED_DIR/models/HydraGNN/overlays/hydragnn-overlay.img` | -## High-level steps (from upstream) +**Overlay contents:** PyTorch 2.10.0+rocm7.2.2, torch-geometric, torch-scatter/sparse/cluster (ROCm wheels), mpi4py, adios2, hydragnn, and runtime deps (tqdm, pyyaml, tensorboard, scikit-learn, scipy, h5py, ase). -1. **Environment** — Follow HydraGNN installation instructions on branch **`Predictive_GFM_2024`** for your target system (modules, PyTorch ROCm build, and optional distributed dependencies). -2. **Data** — Preprocessed training data in **ADIOS** form are described under `ADIOS_files/` on the Hub and in the Constellation release (see [../data/README.md](../data/README.md)). Paths in your configs must point at **your** staged files. -3. 
**HPO and pretraining** — The model card describes **hyperparameter optimization** with [DeepHyper](https://github.com/deephyper/deephyper), short HPO training epochs with early stopping, selection of fifteen trials, and **continued pretraining** up to a capped epoch count. Entry points live under **`examples/multidataset_hpo`** on the **`Predictive_GFM_2024`** branch. -4. **Distributed training** — Pretraining was performed with **distributed data parallelism (DDP)** at very large node counts (see the model card). Reproduce at smaller scale by reducing world size and batch settings per upstream guidance. -5. **Energy accounting** — The publication references **omnistat** for training energy measurement on large runs; use that only if your site supports it. +**Hardware:** AMD Instinct MI355X (gfx950), 8 GPUs per node, 288 GB VRAM per GPU, Pensando ionic interconnect. -## Disclaimer +## 2. Build / Setup Instructions -AI4Science Studio does **not** grant OLCF access, storage, or software support. For policies, queues, and project requests, use **official OLCF documentation** and the user assistance process linked from [../compute/README.md](../compute/README.md). +The overlay is already built. The only setup step is cloning the HydraGNN source (done automatically by the sbatch script on first run): + +```bash +# Manual clone (optional, script does this automatically): +git clone --depth=1 https://github.com/ORNL/HydraGNN.git \ + $AI4S_SHARED_DIR/models/HydraGNN/code/HydraGNN +``` + +The script also symlinks the staged ADIOS datasets into the expected location (`examples/multidataset_hpo_sc26/dataset/`). 
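The symlink step described above can be sketched as a small shell function. This is an illustration only, not the actual contents of `sbatch_train_amd.sh`: the function name `link_datasets` is hypothetical, and the assumption that each link keeps the `<name>-v2.bp` file name inside `examples/multidataset_hpo_sc26/dataset/` is inferred from the staged dataset names, not confirmed by the script.

```bash
# Hypothetical helper (not taken from sbatch_train_amd.sh).
# Links each staged ADIOS dataset into the example's dataset directory.
#   $1 = directory holding the staged <name>-v2.bp datasets (absolute path)
#   $2 = root of the HydraGNN clone
#   $3 = comma-separated dataset names, e.g. "ANI1x,Alexandria"
link_datasets() {
    data_dir=$1
    repo_dir=$2
    # Assumes dataset names contain no spaces.
    names=$(printf '%s' "$3" | tr ',' ' ')
    target="$repo_dir/examples/multidataset_hpo_sc26/dataset"
    mkdir -p "$target" || return 1
    for name in $names; do
        src="$data_dir/${name}-v2.bp"
        if [ ! -e "$src" ]; then
            echo "missing dataset: $src" >&2
            return 1
        fi
        # -n replaces an existing link instead of descending into it,
        # which matters because a .bp dataset may be a directory.
        ln -sfn "$src" "$target/${name}-v2.bp"
    done
}
```

A call such as `link_datasets "$HG_DATA_DIR" "$HG_REPO_DIR" "${HG_DATASETS:-ANI1x,Alexandria}"` would mirror the defaults in the environment variable reference. Failing on a missing dataset, instead of creating a dangling link, surfaces staging problems before the job reaches the data loader.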
+ +### Staged datasets + +| Dataset | Path | Size | +|---------|------|------| +| ANI1x | `$AI4S_SHARED_DIR/models/HydraGNN/weights/ANI1x-v2.bp` | 30 GB | +| Alexandria | `$AI4S_SHARED_DIR/models/HydraGNN/weights/Alexandria-v2.bp` | 23 GB | + +These are pre-processed ADIOS datasets from the [HydraGNN Predictive GFM 2024 model card](https://huggingface.co/mlupopa/HydraGNN_Predictive_GFM_2024). + +## 3. Run Instructions + +### Single node (8 GPUs) + +```bash +export AI4S_SHARED_DIR=/path/to/shared # set via /init-cluster +sbatch sbatch_train_amd.sh +``` + +### 2 nodes (16 GPUs) + +```bash +sbatch --nodes=2 sbatch_train_amd.sh +``` + +### 4 nodes (32 GPUs) + +```bash +sbatch --nodes=4 sbatch_train_amd.sh +``` + +### Custom parameters + +```bash +export HG_BATCH_SIZE=200 +export HG_NUM_EPOCH=4 +export HG_PRECISION=bf16 +sbatch --nodes=4 --time=04:00:00 sbatch_train_amd.sh +``` + +### Environment variable reference + +| Variable | Default | Description | +|----------|---------|-------------| +| `AI4S_SHARED_DIR` | (required) | Base path for shared assets | +| `HG_SIF` | `$AI4S_SHARED_DIR/images/pytorch_rocm7.2.2_...sif` | Apptainer SIF | +| `HG_OVERLAY` | `$AI4S_SHARED_DIR/models/HydraGNN/overlays/hydragnn-overlay.img` | Overlay image | +| `HG_DATASETS` | `ANI1x,Alexandria` | Comma-separated dataset names | +| `HG_DATA_DIR` | `$AI4S_SHARED_DIR/models/HydraGNN/weights` | Dir containing `-v2.bp` | +| `HG_BATCH_SIZE` | 128 (from config) | Per-rank batch size | +| `HG_NUM_EPOCH` | 1 (from config) | Training epochs | +| `HG_PRECISION` | `fp64` | `fp32`, `fp64`, or `bf16` | +| `HG_OUTPUT_DIR` | `$AI4S_SHARED_DIR/models/HydraGNN/outputs` | Output working directory | +| `HG_REPO_DIR` | `$AI4S_SHARED_DIR/models/HydraGNN/code/HydraGNN` | HydraGNN source clone | +| `HYDRAGNN_MAX_NUM_BATCH` | (unset = unlimited) | Cap batches per epoch (sanity: 50) | +| `HYDRAGNN_VALTEST` | `0` | Run validation/test loops (0=skip) | +| `SCRATCH_LOCAL` | `/scratch` | Node-local fast storage for 
MIOpen cache | + +### How the launch works + +``` +sbatch → SLURM allocates N nodes + └─ srun --mpi=pmix (N*8 ranks) + └─ apptainer exec --rocm --overlay hydragnn-overlay.img:ro $SIF + └─ bash rank_script.sh + └─ python -u gfm_mlip_all_mpnn.py --multi --multi_model_list=ANI1x,Alexandria +``` + +MPI is initialized by `hydragnn.utils.distributed.setup_ddp()`. Each rank opens the ADIOS datasets via `AdiosMultiDataset` with the MPI communicator. No DDStore in phase 1. + +## 4. Config File + +The upstream `gfm_mlip.json` (in `examples/multidataset_hpo_sc26/`) is used as-is. It is hardware-agnostic: + +| Parameter | Value | Notes | +|-----------|-------|-------| +| `mpnn_type` | MACE | Equivariant GNN architecture | +| `hidden_dim` | 128 | | +| `num_conv_layers` | 4 | | +| `force_weight` | 10.0 | Energy + forces training | +| `batch_size` | 128 | Overridable via `--batch_size` | +| `num_epoch` | 1 | Overridable via `--num_epoch` | +| `precision` | fp64 | Overridable via `--precision` | +| `learning_rate` | 0.001 | AdamW optimizer | +| `Checkpoint` | true | Writes to `logs/` in working dir | + +CLI flags (`--batch_size`, `--num_epoch`, `--precision`) override these at runtime. + +## 5. 
Expected Model Behavior + +With `num_epoch=1` (validation/smoke-test): +- Training should complete without errors +- Loss values should be finite and decreasing within the epoch +- All ranks should report consistent loss values + +For longer training runs (`num_epoch >= 10`): +- MAE loss on energy should decrease steadily +- Force prediction error should also decrease +- Early stopping (patience=10) will halt training if validation loss plateaus + +### Validated sanity test (1 node / 8 GPUs, 50 batches) + +Validated on MI355X (2026-05-19) with `HYDRAGNN_MAX_NUM_BATCH=50`, `HG_BATCH_SIZE=200`, `HG_PRECISION=fp64`, datasets `ANI1x,Alexandria`: + +| Metric | Value | +|--------|-------| +| Total training time | 183 s (50 batches) | +| Steady-state iteration time | 2.9–3.0 s/batch | +| Peak GPU memory (allocated) | 7.5 GB | +| Peak GPU memory (reserved) | 9.0 GB | +| Data load time | 2.9 s | +| Model creation time | 0.75 s | +| RCCL errors | None | +| All ranks converged | Yes | + +## 6. Performance & Scaling + +| Nodes | GPUs | Throughput (batch/s) | Time per batch | Notes | +|-------|------|---------------------|----------------|-------| +| 1 | 8 | ~0.42 | 2.4 s | Validated (50-batch sanity test, 2026-05-19) | +| 2 | 16 | ~0.33 | 3.0 s | Validated (30-batch, ob1/tcp MPI, 2026-05-19) | +| 4 | 32 | TBD | — | Expected to work with same flags | + +### Multi-node network architecture + +On Pensando/ionic fabrics, data NICs use `/31` point-to-point subnets that don't route between nodes — standard IB verbs cannot work for inter-node MPI. 
The architecture separates traffic into three paths: + +| Path | Interface | Protocol | Purpose | +|------|-----------|----------|---------| +| MPI (data bcast, OOB) | management NIC | TCP | ADIOS metadata bcast, rank coordination | +| RCCL bootstrap | same mgmt NIC | TCP socket | RCCL topology exchange | +| RCCL data (GPU allreduce) | IB HCA devices (400G each) | ANP plugin (RoCEv2, GDRDMA) | GPU-to-GPU collectives | + +**MPI configuration (in `sbatch_train_amd.sh`):** + +```bash +--env OMPI_MCA_pml=ob1 # ob1 PML (UCX PML hangs on ionic fabrics) +--env OMPI_MCA_btl=tcp,self # TCP BTL over management NIC +--env OMPI_MCA_btl_tcp_if_include=$NCCL_SOCKET_IFNAME # Management interface +--env MPI4PY_RC_THREADS=false # Avoid MPI_THREAD_MULTIPLE +``` + +**RCCL configuration:** + +```bash +--bind $RCCL_ANP_PLUGIN:$RCCL_ANP_PLUGIN:ro # ANP plugin (not exposed by --rocm) +--bind $LIBIONIC_PATH:$LIBIONIC_PATH:ro # ionic userspace driver +--env NCCL_NET_PLUGIN=$RCCL_ANP_PLUGIN # Select ANP transport +--env NCCL_IB_HCA=$NCCL_IB_HCA # All IB HCAs (from .cluster-config.yaml) +``` + +Without the ANP/libionic bind-mounts, RCCL falls back to socket transport over the management NIC (functional but significantly slower). + +**Why ob1/tcp for MPI is correct:** +- MPI over TCP on the mgmt NIC is sufficient for ADIOS I/O (metadata bcast at startup, not on the training hot path) +- GPU gradient allreduce uses RCCL over ANP with GDRDMA — completely independent of MPI + +## 7. Hyperparameter Recommendations + +| Scale | Batch size | Learning rate | Precision | Notes | +|-------|-----------|---------------|-----------|-------| +| 1 node (smoke test) | 128 | 0.001 | fp64 | Config defaults | +| 1 node (training) | 200 | 0.001 | fp64 | Per upstream scaling test | +| 2-4 nodes | 200 | 0.001 | fp64 | Linear scaling rule may apply for LR | + +## 8. 
Validation & Reproducibility + +- Random seed is fixed at `random_state = 0` in `gfm_mlip_all_mpnn.py` +- Config is saved to `logs//` at start of training +- Checkpoints (if enabled) write to `logs//` +- Existing pretrained weights in `$AI4S_SHARED_DIR/models/HydraGNN/weights/` are never modified + +## 9. Convergence Tracking + +HydraGNN prints epoch-level loss metrics when `HYDRAGNN_VALTEST=1`. Use the bundled parser to extract structured convergence data: + +```bash +# Run training with validation enabled: +export HYDRAGNN_VALTEST=1 +export HG_NUM_EPOCH=10 +export HYDRAGNN_MAX_NUM_BATCH=50 # cap batches per epoch for faster iteration +sbatch sbatch_train_amd.sh + +# Parse the output: +python examples/parse_convergence.py --log hydragnn-train-.out -o convergence.csv +python examples/parse_convergence.py --log hydragnn-train-.out --plot convergence.png +``` + +### Output format (SLURM log parsing) + +HydraGNN emits per-epoch lines: +``` +0: Epoch: 01, Train Loss: 8.01219610, Val Loss: 7.77382998, Test Loss: 12.61618049 +0: Tasks Train Loss: [1.351, 0.106, 0.791] +``` + +The parser extracts these into CSV columns: `epoch, batch, train_loss, val_loss, test_loss, tasks_train, tasks_val, wall_time_s, source`. + +### Validated convergence (1 node / 8 GPUs, 5 epochs, 50 batches/epoch) + +Validated on MI355X (2026-05-19) with `HYDRAGNN_VALTEST=1`, `HYDRAGNN_MAX_NUM_BATCH=50`, `HG_BATCH_SIZE=200`, `HG_PRECISION=fp64`: + +| Epoch | Train Loss | Val Loss | Test Loss | Time/batch | +|-------|-----------|----------|-----------|------------| +| 0 | 8.2051 | 8.2576 | 13.2743 | 2.44 s | +| 1 | 8.0122 | 7.7738 | 12.6162 | 2.44 s | +| 2 | 7.3135 | 6.8208 | 11.1788 | 2.41 s | +| 3 | 6.5180 | 6.2752 | 10.1997 | 2.43 s | +| 4 | 6.0819 | 6.1667 | 10.1823 | 2.39 s | + +Total training time: 720 s (12 min). Training loss fell 25.9% (8.2051 to 6.0819) over 5 epochs, and validation and test loss also decreased every epoch. + +### TensorBoard (alternative) + +HydraGNN also writes TensorBoard events when `HYDRAGNN_VALTEST=1`.
Event files appear in `$HG_OUTPUT_DIR/logs//`. The parser supports this mode: + +```bash +python examples/parse_convergence.py --tbdir "$HG_OUTPUT_DIR"/logs/hydragnn-train--N1/ -o convergence.csv +``` + +Note: TensorBoard scalars are only populated when validation runs (they remain empty with `HYDRAGNN_VALTEST=0`). + +## Phase 2: DDStore (future) + +For larger-scale runs, add DDStore for improved data distribution: + +```bash +# Phase 2 (not yet implemented): +# Add --ddstore flag and set: +export HYDRAGNN_AGGR_BACKEND=mpi +export HYDRAGNN_DDSTORE_METHOD=1 +export HYDRAGNN_CUSTOM_DATALOADER=1 +export HYDRAGNN_NUM_WORKERS=2 +``` + +This requires verifying MPI DDStore functionality on MI355X and may need RCCL tuning for inter-node communication.
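For a quick look at convergence without the bundled parser, the per-epoch `Epoch: ...` lines shown under Convergence Tracking can be reduced to CSV with standard tools. The sketch below is not `parse_convergence.py`: the function name `epochs_to_csv` is hypothetical, it assumes the log lines match the sample format exactly, and it emits only four of the columns the bundled parser produces.

```bash
# Hypothetical helper: extract epoch-level losses from a HydraGNN SLURM log.
# Reads the files given as arguments, or stdin when none are given.
# Assumes lines of the form:
#   0: Epoch: 01, Train Loss: <float>, Val Loss: <float>, Test Loss: <float>
epochs_to_csv() {
    echo "epoch,train_loss,val_loss,test_loss"
    # -n with the trailing p flag prints only lines where the pattern matched,
    # so "Tasks Train Loss" lines and other output are dropped.
    sed -nE 's/^.*Epoch: *([0-9]+), *Train Loss: *([0-9.eE+-]+), *Val Loss: *([0-9.eE+-]+), *Test Loss: *([0-9.eE+-]+).*$/\1,\2,\3,\4/p' "$@"
}
```

Passing the SLURM output file as the argument, with the result redirected to a CSV file, gives a table that can be compared directly against the validated convergence numbers.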