diff --git a/.claude/commands/run-hydragnn.md b/.claude/commands/run-hydragnn.md
index 298eb75..79a18fe 100644
--- a/.claude/commands/run-hydragnn.md
+++ b/.claude/commands/run-hydragnn.md
@@ -13,10 +13,12 @@ Check if `.cluster-config.yaml` (repo root) or `~/.config/ai4science-studio/clus
 - **Training** — Smaller-scale training (the full GFM pretraining is Frontier-scale)
 
 **Q1. (Inference) Checkpoint and config**
-Do you have a `.pk` checkpoint and matching `config.json` from the HF Hub [`mlupopa/HydraGNN_Predictive_GFM_2024`](https://huggingface.co/mlupopa/HydraGNN_Predictive_GFM_2024)? If not, I will generate the download command.
+First, auto-discover: run `find -maxdepth 5 -name "*.pk" 2>/dev/null` and `find -maxdepth 5 -name "config.json" 2>/dev/null` (substituting `paths.projects` from cluster config) to check for existing checkpoints on shared storage. Present any results to the user. If nothing is found, ask:
+- Do you have a `.pk` checkpoint and matching `config.json` from the HF Hub [`mlupopa/HydraGNN_Predictive_GFM_2024`](https://huggingface.co/mlupopa/HydraGNN_Predictive_GFM_2024)?
+- Options: **Yes, provide paths** / **No, download for me** / **Auto-discovered (use found path)**
 
 **Q2. Output directory**
-Where to write predictions? Default: `/workspace/results`.
+Where to write predictions? Default: `/hydragnn-results` (read `paths.projects` from cluster config, never use $HOME for large outputs).
 ---
 
diff --git a/.cursor/skills/ai4science-run-models/SKILL.md b/.cursor/skills/ai4science-run-models/SKILL.md
index 1ea6f58..7ea7106 100644
--- a/.cursor/skills/ai4science-run-models/SKILL.md
+++ b/.cursor/skills/ai4science-run-models/SKILL.md
@@ -95,6 +95,12 @@ If the user has access to a terminal:
 - For MatterGen: check generated CIF files
 - For GP-MoLFormer: check CSV with RDKit validity
 
+## Overlay build: always use $TMPDIR, not NFS
+
+When directing users to build an Apptainer ext3 overlay, always point them to the `build_overlay_amd.sh` script in the model's `examples/` directory. That script already handles the correct pattern: build on `$TMPDIR` (node-local fast disk), then copy the finished `.img` file to NFS.
+
+**Do not suggest building the overlay directly on NFS paths** (`/shared/`, `/scratch/`, etc.). An ext3 image served via FUSE over NFS incurs massive per-file overhead when writing thousands of Python package files — both `cp -r` and `tar | tar` are equally slow because the bottleneck is the FUSE layer, not the copy tool. The fix is to create and populate the image on local disk and copy the completed file to NFS as one large sequential write.
+
 ## Model-specific notes
 
 ### Models with auto-download weights
diff --git a/.cursor/skills/ai4science-studio/SKILL.md b/.cursor/skills/ai4science-studio/SKILL.md
index ece48a5..ffd3efc 100755
--- a/.cursor/skills/ai4science-studio/SKILL.md
+++ b/.cursor/skills/ai4science-studio/SKILL.md
@@ -216,8 +216,24 @@ done
 
 Always verify at end: `assert 'rocm' in torch.__version__`
 
+**Always build the overlay image on `$TMPDIR` (node-local disk), not on NFS.**
+
+An ext3 overlay image is served via FUSE. Writing thousands of small Python package files through FUSE into an image stored on NFS is extremely slow — both `cp -r` and `tar | tar` are equally bottlenecked because the destination is the FUSE layer over NFS, not the copy tool. The correct pattern:
+
+```bash
+LOCAL_OVERLAY="${TMPDIR:-/tmp}/-overlay-${SLURM_JOB_ID:-$$}.img"
+apptainer overlay create --size "$OVERLAY_SIZE_MB" "$LOCAL_OVERLAY"
+
+apptainer exec --rocm --overlay "${LOCAL_OVERLAY}:rw" ...  # populate on local disk (fast)
+
+# Copy finished image to NFS — one large sequential write (fast)
+cp "$LOCAL_OVERLAY" "$OVERLAY"  # $OVERLAY = /shared/... NFS destination
+rm -f "$LOCAL_OVERLAY"
+```
+
 **Validated overlay sizes:**
 - ORBIT-2 (pytorch-lightning + xformers + wandb + mpi4py + ...): 7 GB overlay, ~2 GB content
+- HydraGNN (torch-scatter + torch-geometric + mpi4py + ...): 4 GB overlay, ~522 MB content
 - StormCast (earth2studio + cartopy): 4 GB overlay, ~1.7 GB content
 
 ## Local cluster config
diff --git a/earth_science/models/ORBIT-2/examples/build_overlay_amd.sh b/earth_science/models/ORBIT-2/examples/build_overlay_amd.sh
index 4bab746..d794bfa 100644
--- a/earth_science/models/ORBIT-2/examples/build_overlay_amd.sh
+++ b/earth_science/models/ORBIT-2/examples/build_overlay_amd.sh
@@ -39,6 +39,16 @@
 # ORBIT2_STAGE_DIR="" (empty). The script will install directly into the
 # overlay and strip torch afterward. Simpler, but requires ~7 GB overlay.
 #
+# ── Why the overlay is built on $TMPDIR ──────────────────────────────────────
+# The overlay image is an ext3 filesystem served via FUSE. Writing thousands
+# of small Python package files through FUSE into an image that lives on NFS
+# is extremely slow (ext3-on-NFS via FUSE). We instead:
+#   1. Create and populate the ext3 image on $TMPDIR (node-local fast disk).
+#   2. Copy the finished image file (one large sequential write) to the NFS
+#      destination ($ORBIT2_OVERLAY).
+# This makes the file-intensive install step fast and leaves a single large
+# file on NFS.
+#
 # ── Key environment variables ─────────────────────────────────────────────────
 #   ORBIT2_SIF          Path to the Apptainer SIF image (required)
 #   ORBIT2_OVERLAY      Output overlay path
@@ -96,9 +106,13 @@ else
     OVERLAY_SIZE_MB="${ORBIT2_OVERLAY_SIZE_MB:-8192}"
 fi
 
+# Build overlay on local disk; copy finished image to NFS destination.
+LOCAL_OVERLAY="${TMPDIR:-/tmp}/orbit2-overlay-${SLURM_JOB_ID:-$$}.img"
+
 echo "=== ORBIT-2 overlay build ==="
 echo " SIF : $ORBIT2_SIF"
-echo " Overlay : $OVERLAY (${OVERLAY_SIZE_MB} MB)"
+echo " Overlay (NFS): $OVERLAY (${OVERLAY_SIZE_MB} MB)"
+echo " Build on : $LOCAL_OVERLAY (node-local $TMPDIR)"
 echo " ROCm whl tag : $ROCM_WHL_TAG"
 echo " Staging mode : $USE_STAGING (stage dir: ${STAGE_DIR:-n/a})"
 echo " Node : $(hostname) Date: $(date)"
@@ -223,20 +237,27 @@ INNEREOF
 chmod +x "$SCRIPTS_DIR/overlay_install.sh"
 
 # ---------------------------------------------------------------------------
-# Create the overlay image and run the install inside the container
+# Create overlay on local disk, run install, then copy image to NFS
 # ---------------------------------------------------------------------------
-echo "Creating ${OVERLAY_SIZE_MB} MB ext3 overlay image ..."
-apptainer overlay create --size "$OVERLAY_SIZE_MB" "$OVERLAY"
+echo "Creating ${OVERLAY_SIZE_MB} MB ext3 overlay on local disk: $LOCAL_OVERLAY ..."
+apptainer overlay create --size "$OVERLAY_SIZE_MB" "$LOCAL_OVERLAY"
 
 echo "Running install inside container ..."
 apptainer exec \
     --rocm \
-    --overlay "${OVERLAY}:rw" \
+    --overlay "${LOCAL_OVERLAY}:rw" \
     --bind "$SCRIPTS_DIR":/scripts \
-    --bind "$(dirname "$OVERLAY")":/overlay-dir \
+    --bind "${STAGE_DIR}:${STAGE_DIR}" \
     "$ORBIT2_SIF" \
     bash /scripts/overlay_install.sh
 
+echo ""
+echo "--- Copying finished overlay to NFS: $OVERLAY ---"
+echo " Image size: $(du -sh "$LOCAL_OVERLAY" | cut -f1)"
+cp "$LOCAL_OVERLAY" "$OVERLAY"
+rm -f "$LOCAL_OVERLAY"
+echo " Done."
+
 echo ""
 echo "=== Overlay ready ==="
 ls -lh "$OVERLAY"
diff --git a/material_science/models/HydraGNN/examples/build_overlay_amd.sh b/material_science/models/HydraGNN/examples/build_overlay_amd.sh
new file mode 100755
index 0000000..c2d703d
--- /dev/null
+++ b/material_science/models/HydraGNN/examples/build_overlay_amd.sh
@@ -0,0 +1,208 @@
+#!/usr/bin/env bash
+# Build a persistent Apptainer ext3 overlay pre-loaded with all HydraGNN pip
+# dependencies. Run once per cluster; reuse the overlay image across jobs to
+# skip the pip install phase on every submission.
+#
+# ── Quick start ──────────────────────────────────────────────────────────────
+#   export HG_SIF=/path/to/rocm_pytorch.sif
+#   sbatch build_overlay_amd.sh
+#   # then in your inference job:
+#   export HG_OVERLAY=/path/to/hydragnn-overlay.img
+#   sbatch sbatch_infer_amd.sh
+#
+# ── torch-scatter / torch-sparse / torch-cluster / torch-spline-conv ─────────
+# No official ROCm wheels exist at data.pyg.org. We use pre-built ROCm
+# wheels from https://github.com/Looong01/pyg-rocm-build (release 15):
+#   - PyTorch 2.10.x + ROCm 7.1 wheels run fine on ROCm 7.2.x (ABI compat)
+#   - Python 3.12, linux_x86_64
+# torch-geometric (pure Python) is installed from PyPI as normal.
+#
+# ── Why staging is needed ────────────────────────────────────────────────────
+# pip install --target does not see the SIF's /opt/venv, so torch gets
+# re-downloaded (~6 GB) and fills the overlay. We stage into a writable
+# scratch dir, strip torch/nvidia/triton, then copy ~1-2 GB into the overlay.
+#
+# ── Why the overlay is built on $TMPDIR ──────────────────────────────────────
+# The overlay image is an ext3 filesystem served via FUSE. Writing thousands
+# of small Python package files through FUSE into an image that lives on NFS
+# is extremely slow (ext3-on-NFS via FUSE). We instead:
+#   1. Create and populate the ext3 image on $TMPDIR (node-local fast disk).
+#   2. Copy the finished image file (one large sequential write) to the NFS
+#      destination ($HG_OVERLAY).
+# This makes the file-intensive install step fast and leaves a single large
+# file on NFS.
+#
+# ── Key environment variables ─────────────────────────────────────────────────
+#   HG_SIF              Path to the Apptainer SIF image (required)
+#   HG_OVERLAY          Output overlay path
+#                       (default: /hydragnn-overlay.img)
+#   HG_STAGE_DIR        Writable scratch dir for staging install (~6 GB)
+#                       (default: /hydragnn-stage)
+#   HG_OVERLAY_SIZE_MB  Size of the ext3 overlay in MB (default: 4096)
+#   PYG_ROCM_RELEASE    Looong01 release tag to use (default: 15)
+#   PYG_PYTHON_TAG      Python tag in zip filename (default: py312)
+
+#SBATCH --job-name=hydragnn-overlay
+#SBATCH --partition=lux
+#SBATCH --account=vultr_lux
+#SBATCH --nodes=1
+#SBATCH --ntasks=1
+#SBATCH --gres=gpu:1
+#SBATCH --cpus-per-task=8
+#SBATCH --time=03:00:00
+#SBATCH --output=hydragnn-overlay-build-%j.out
+#SBATCH --error=hydragnn-overlay-build-%j.out
+
+set -euo pipefail
+
+if [[ -z "${HG_SIF:-}" ]]; then
+    echo "error: set HG_SIF to your Apptainer SIF path before submitting" >&2
+    exit 1
+fi
+
+OVERLAY="${HG_OVERLAY:-$(dirname "$HG_SIF")/hydragnn-overlay.img}"
+STAGE_DIR="${HG_STAGE_DIR:-$(dirname "$OVERLAY")/hydragnn-stage}"
+OVERLAY_SIZE_MB="${HG_OVERLAY_SIZE_MB:-4096}"
+PYG_ROCM_RELEASE="${PYG_ROCM_RELEASE:-15}"
+PYG_PYTHON_TAG="${PYG_PYTHON_TAG:-py312}"
+
+PYG_ZIP_NAME="torch-2.10.0-rocm-7.1-${PYG_PYTHON_TAG}-linux_x86_64.zip"
+PYG_ZIP_URL="https://github.com/Looong01/pyg-rocm-build/releases/download/${PYG_ROCM_RELEASE}/${PYG_ZIP_NAME}"
+PYG_WHEELS_DIR="${STAGE_DIR}/pyg-rocm-wheels"
+
+# Build the overlay on local disk to avoid ext3-on-NFS small-file I/O penalty;
+# copy the finished image file to NFS destination as one sequential write.
+LOCAL_OVERLAY="${TMPDIR:-/tmp}/hydragnn-overlay-${SLURM_JOB_ID:-$$}.img"
+
+echo "=== HydraGNN overlay build ==="
+echo " SIF : $HG_SIF"
+echo " Overlay (NFS): $OVERLAY (${OVERLAY_SIZE_MB} MB)"
+echo " Build on : $LOCAL_OVERLAY (node-local $TMPDIR)"
+echo " Stage dir : $STAGE_DIR"
+echo " PyG wheels : Looong01 release ${PYG_ROCM_RELEASE} (torch 2.10, ROCm 7.1, ${PYG_PYTHON_TAG})"
+echo " Node : $(hostname) Date: $(date)"
+echo ""
+
+if [[ -f "$OVERLAY" ]]; then
+    echo "Overlay already exists: $OVERLAY"
+    echo "Delete it to rebuild: rm $OVERLAY"
+    exit 0
+fi
+
+mkdir -p "$STAGE_DIR" "$PYG_WHEELS_DIR"
+
+# ---------------------------------------------------------------------------
+# Download and extract Looong01 PyG ROCm wheels (on the host, before container)
+# ---------------------------------------------------------------------------
+PYG_ZIP="${STAGE_DIR}/${PYG_ZIP_NAME}"
+if [[ ! -f "$PYG_ZIP" ]]; then
+    echo "--- Downloading PyG ROCm wheels from Looong01 release ${PYG_ROCM_RELEASE} ---"
+    curl -fL "$PYG_ZIP_URL" -o "$PYG_ZIP"
+else
+    echo "--- PyG ROCm zip already downloaded: $PYG_ZIP ---"
+fi
+
+echo "--- Extracting wheels ---"
+unzip -q -o "$PYG_ZIP" -d "$PYG_WHEELS_DIR"
+echo "Wheels extracted:"
+ls "$PYG_WHEELS_DIR"/*.whl 2>/dev/null || ls "$PYG_WHEELS_DIR"
+
+# ---------------------------------------------------------------------------
+# Write the install script that runs inside the container
+# ---------------------------------------------------------------------------
+SCRIPTS_DIR="${TMPDIR:-/tmp}/hydragnn-overlay-scripts-${SLURM_JOB_ID:-$$}"
+mkdir -p "$SCRIPTS_DIR"
+
+cat > "$SCRIPTS_DIR/overlay_install.sh" << INNEREOF
+#!/usr/bin/env bash
+set -euo pipefail
+source /opt/venv/bin/activate
+
+PKG=/opt/hydragnn-pkgs
+STAGE="${STAGE_DIR}/pkgs"
+PYG_WHEELS="${PYG_WHEELS_DIR}"
+
+rm -rf "\$STAGE" && mkdir -p "\$STAGE"
+echo "--- torch: \$(python3 -c 'import torch; print(torch.__version__)') ---"
+echo "--- Installing to staging dir: \$STAGE ---"
+
+echo "--- Step 1: PyG ROCm wheels (torch-scatter, torch-sparse, torch-cluster, etc.) ---"
+# --no-deps: these wheels depend on torch, which is already in the SIF venv
+pip install -q --no-cache-dir --no-deps --target "\$STAGE" "\$PYG_WHEELS"/*.whl
+echo " installed: \$(ls \$STAGE | grep torch | tr '\n' ' ')"
+
+echo "--- Step 2: torch-geometric (pure Python) ---"
+pip install -q --no-cache-dir --target "\$STAGE" torch-geometric 2>&1 | tail -3
+
+echo "--- Step 3: HydraGNN runtime deps ---"
+pip install -q --no-cache-dir --target "\$STAGE" \
+    mpi4py tqdm pyyaml tensorboard matplotlib \
+    scikit-learn scipy h5py ase 2>&1 | tail -5
+
+echo "--- Step 4: HydraGNN package (clone + install, no deps) ---"
+if [[ ! -d /tmp/HydraGNN-src ]]; then
+    git clone -q --depth=1 --branch Predictive_GFM_2024 \
+        https://github.com/ORNL/HydraGNN.git /tmp/HydraGNN-src
+fi
+pip install -q --no-cache-dir --target "\$STAGE" --no-deps /tmp/HydraGNN-src 2>&1 | tail -3
+
+echo "--- Step 5: strip torch / nvidia / triton from staging dir ---"
+for pkg in torch torchvision torchaudio torchgen functorch nvidia triton; do
+    rm -rf "\${STAGE}/\${pkg}" "\${STAGE}/\${pkg}"-*.dist-info 2>/dev/null || true
+done
+echo " Staging size after strip: \$(du -sh \$STAGE | cut -f1)"
+
+echo "--- Step 6: copy stripped packages → overlay (local disk, fast) ---"
+mkdir -p "\$PKG"
+cp -r "\$STAGE/." "\$PKG/"
+# Final safety strip
+for pkg in torch torchvision torchaudio torchgen functorch nvidia triton; do
+    rm -rf "\${PKG}/\${pkg}" "\${PKG}/\${pkg}"-*.dist-info 2>/dev/null || true
+done
+echo " Overlay size: \$(du -sh \$PKG | cut -f1)"
+
+echo "--- Step 7: verify ---"
+PYTHONPATH="\$PKG" python3 -c "
+import torch, torch_scatter, torch_geometric, hydragnn
+print('torch :', torch.__version__)
+print('torch_scatter : ok')
+print('torch_geometric:', torch_geometric.__version__)
+print('hydragnn : ok')
+assert torch.cuda.is_available() or 'rocm' in torch.__version__, 'ROCm torch expected'
+print('Verification passed.')
+"
+echo "=== Overlay install complete ==="
+INNEREOF
+
+chmod +x "$SCRIPTS_DIR/overlay_install.sh"
+
+# ---------------------------------------------------------------------------
+# Create overlay on local disk, run install, then copy image to NFS
+# ---------------------------------------------------------------------------
+echo "Creating ${OVERLAY_SIZE_MB} MB ext3 overlay on local disk: $LOCAL_OVERLAY ..."
+apptainer overlay create --size "$OVERLAY_SIZE_MB" "$LOCAL_OVERLAY"
+
+echo "Running install inside container ..."
+apptainer exec \
+    --rocm \
+    --overlay "${LOCAL_OVERLAY}:rw" \
+    --bind "$SCRIPTS_DIR":/scripts \
+    --bind "${STAGE_DIR}:${STAGE_DIR}" \
+    --bind "${PYG_WHEELS_DIR}:${PYG_WHEELS_DIR}" \
+    "$HG_SIF" \
+    bash /scripts/overlay_install.sh
+
+echo ""
+echo "--- Copying finished overlay to NFS: $OVERLAY ---"
+echo " Image size: $(du -sh "$LOCAL_OVERLAY" | cut -f1)"
+cp "$LOCAL_OVERLAY" "$OVERLAY"
+rm -f "$LOCAL_OVERLAY"
+echo " Done."
+
+echo ""
+echo "=== Overlay ready ==="
+ls -lh "$OVERLAY"
+echo ""
+echo "To use in inference jobs:"
+echo " export HG_OVERLAY=$OVERLAY"
+echo " sbatch sbatch_infer_amd.sh"
diff --git a/material_science/models/HydraGNN/examples/run_inference.sh b/material_science/models/HydraGNN/examples/run_inference.sh
index 356cc0f..b73bacd 100755
--- a/material_science/models/HydraGNN/examples/run_inference.sh
+++ b/material_science/models/HydraGNN/examples/run_inference.sh
@@ -47,7 +47,6 @@ if [[ -z "${HG_CHECKPOINT}" || -z "${HG_CONFIG}" ]]; then
 fi
 
 mkdir -p "$HG_OUTPUT_DIR"
-cd /workspace/HydraGNN
 
 echo "=== HydraGNN Predictive GFM 2024 — Inference ==="
 echo " Checkpoint : $HG_CHECKPOINT"
@@ -56,27 +55,45 @@ echo " Output : $HG_OUTPUT_DIR"
 echo ""
 
 python - </hydragnn-results)
+
+#SBATCH -J hydragnn-infer
+#SBATCH --partition=lux
+#SBATCH --account=vultr_lux
+#SBATCH --nodes=1
+#SBATCH --gres=gpu:1
+#SBATCH --ntasks-per-node=1
+#SBATCH --cpus-per-task=8
+#SBATCH -t 00:30:00
+#SBATCH -o hydragnn-infer-%j.out
+#SBATCH -e hydragnn-infer-%j.out
+
+set -euo pipefail
+
+# ---------------------------------------------------------------------------
+# SCRIPT_DIR — works both at submit time and inside the SLURM job
+# ---------------------------------------------------------------------------
+if [[ -n "${SLURM_JOB_ID:-}" ]]; then
+    _ORIG_CMD=$(scontrol show job "$SLURM_JOB_ID" | sed -n 's/.*Command=\(\S\+\).*/\1/p')
+    SCRIPT_DIR=$(cd "$(dirname "$_ORIG_CMD")" && pwd)
+else
+    SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
+fi
+
+# ---------------------------------------------------------------------------
+# Validate required inputs
+# ---------------------------------------------------------------------------
+for var in HG_SIF HG_OVERLAY HG_CHECKPOINT HG_CONFIG; do
+    if [[ -z "${!var:-}" ]]; then
+        echo "error: $var must be set" >&2
+        exit 2
+    fi
+done
+
+if [[ ! -f "$HG_OVERLAY" ]]; then
+    echo "error: overlay not found: $HG_OVERLAY" >&2
+    echo " Build it first: sbatch build_overlay_amd.sh" >&2
+    exit 2
+fi
+
+HG_OUTPUT_DIR="${HG_OUTPUT_DIR:-$(dirname "$HG_CHECKPOINT")/results}"
+mkdir -p "$HG_OUTPUT_DIR"
+
+# ---------------------------------------------------------------------------
+# ROCm env
+# ---------------------------------------------------------------------------
+export HSA_NO_SCRATCH_RECLAIM=1
+export MIOPEN_USER_DB_PATH="${TMPDIR:-/tmp}/hydragnn-miopen-${SLURM_JOB_ID:-$$}"
+mkdir -p "$MIOPEN_USER_DB_PATH"
+
+# ---------------------------------------------------------------------------
+# Sanity-check GPU visibility
+# ---------------------------------------------------------------------------
+GPU_OK=$(apptainer exec --rocm --overlay "${HG_OVERLAY}:ro" "$HG_SIF" \
+    python3 -c "import torch; print(torch.cuda.is_available())" 2>/dev/null || echo "False")
+if [[ "$GPU_OK" != "True" ]]; then
+    echo "WARNING: torch.cuda.is_available() = False — check SIF or try HSA_OVERRIDE_GFX_VERSION=9.5.0" >&2
+fi
+
+# ---------------------------------------------------------------------------
+# Run inference
+# ---------------------------------------------------------------------------
+echo "=== HydraGNN Predictive GFM 2024 — Inference ==="
+echo " SIF : $HG_SIF"
+echo " Overlay : $HG_OVERLAY"
+echo " Checkpoint : $HG_CHECKPOINT"
+echo " Config : $HG_CONFIG"
+echo " Output : $HG_OUTPUT_DIR"
+echo ""
+
+apptainer exec --rocm \
+    --overlay "${HG_OVERLAY}:ro" \
+    --bind "$(dirname "$HG_CHECKPOINT"):$(dirname "$HG_CHECKPOINT"):ro" \
+    --bind "$(dirname "$HG_CONFIG"):$(dirname "$HG_CONFIG"):ro" \
+    --bind "${HG_OUTPUT_DIR}:${HG_OUTPUT_DIR}" \
+    --bind "$SCRIPT_DIR":/examples:ro \
+    --env PYTHONPATH="/opt/hydragnn-pkgs:${PYTHONPATH:-}" \
+    --env HG_CHECKPOINT="$HG_CHECKPOINT" \
+    --env HG_CONFIG="$HG_CONFIG" \
+    --env HG_OUTPUT_DIR="$HG_OUTPUT_DIR" \
+    "$HG_SIF" \
+    bash -c "source /opt/venv/bin/activate && bash /examples/run_inference.sh"
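
Review note on the final copy step in both `build_overlay_amd.sh` scripts: each script exits early when `$OVERLAY` already exists, so a `cp "$LOCAL_OVERLAY" "$OVERLAY"` that is interrupted partway (job timeout, node failure) could leave a truncated image that a later run treats as finished. One possible hardening is sketched below. `publish_image` and the `.partial.$$` suffix are hypothetical names introduced only for illustration and are not part of this change. The sketch assumes the temporary file and the destination sit in the same directory, so the final `mv` is a rename and not a second copy.

```shell
#!/usr/bin/env bash
set -euo pipefail

# publish_image SRC DEST  (hypothetical helper, not part of the scripts in this diff)
# Copy SRC to a temporary name next to DEST, then rename it into place, so that
# DEST only appears once the whole image has been written.
publish_image() {
    local src="$1" dest="$2"
    local partial="${dest}.partial.$$"

    mkdir -p "$(dirname "$dest")"

    # One large sequential write to the destination filesystem.
    if ! cp "$src" "$partial"; then
        rm -f "$partial"   # do not leave a truncated file behind
        return 1
    fi

    # Rename within the same directory; DEST is never a half-written file.
    mv -f "$partial" "$dest"

    # Remove the node-local copy only after the destination is in place.
    rm -f "$src"
}
```

In the build scripts this would stand in for the `cp "$LOCAL_OVERLAY" "$OVERLAY"` / `rm -f "$LOCAL_OVERLAY"` pair, called as `publish_image "$LOCAL_OVERLAY" "$OVERLAY"`. It does not clean up `$LOCAL_OVERLAY` when the install step itself fails; a `trap` on `EXIT` would be one way to cover that case.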