AMDResearch · ashwinma · May 17, 2026 · May 16, 2026
diff --git a/.claude/commands/run-hydragnn.md b/.claude/commands/run-hydragnn.md
@@ -13,10 +13,12 @@ Check if `.cluster-config.yaml` (repo root) or `~/.config/ai4science-studio/clus
 - **Training** — Smaller-scale training (the full GFM pretraining is Frontier-scale)
 
 **Q1. (Inference) Checkpoint and config**
-Do you have a `.pk` checkpoint and matching `config.json` from the HF Hub [`mlupopa/HydraGNN_Predictive_GFM_2024`](https://huggingface.co/mlupopa/HydraGNN_Predictive_GFM_2024)? If not, I will generate the download command.
+First, auto-discover: run `find <paths.projects> -maxdepth 5 -name "*.pk" 2>/dev/null` and `find <paths.projects> -maxdepth 5 -name "config.json" 2>/dev/null` (substituting `paths.projects` from cluster config) to check for existing checkpoints on shared storage. Present any results to the user, then ask (offer the auto-discovered option only if something was found):
+- Do you have a `.pk` checkpoint and matching `config.json` from the HF Hub [`mlupopa/HydraGNN_Predictive_GFM_2024`](https://huggingface.co/mlupopa/HydraGNN_Predictive_GFM_2024)?
+- Options: **Yes, provide paths** / **No, download for me** / **Auto-discovered (use found path)**
 
 **Q2. Output directory**
-Where to write predictions? Default: `/workspace/results`.
+Where to write predictions? Default: `<paths.projects>/hydragnn-results` (read `paths.projects` from cluster config; never use `$HOME` for large outputs).
 
 ---
 

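The auto-discovery step in Q1 can also be done without shelling out to `find`. Below is a minimal Python sketch, assuming the same depth limit of 5. The function name `discover_checkpoints` and the pairing rule (a `.pk` file counts only when a `config.json` sits in the same directory) are illustrative assumptions, not part of the command file.

```python
import os


def discover_checkpoints(root, max_depth=5):
    """Return sorted (checkpoint, config) path pairs found under root.

    Mirrors `find root -maxdepth 5`: root is depth 0, so a file directly
    inside root is at depth 1.
    """
    root = os.path.abspath(root)
    pairs = []
    # os.walk skips unreadable directories silently, like `2>/dev/null`.
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        dir_depth = 0 if rel == "." else rel.count(os.sep) + 1
        if dir_depth + 1 >= max_depth:
            # Files in subdirectories would exceed max_depth; stop descending.
            dirnames[:] = []
        if dir_depth + 1 > max_depth or "config.json" not in filenames:
            continue
        for name in filenames:
            if name.endswith(".pk"):
                pairs.append(
                    (os.path.join(dirpath, name),
                     os.path.join(dirpath, "config.json"))
                )
    return sorted(pairs)
```

An empty list means nothing was found and the checkpoint question should be asked; otherwise the pairs can be shown to the user as candidate paths.
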
diff --git a/.cursor/skills/ai4science-run-models/SKILL.md b/.cursor/skills/ai4science-run-models/SKILL.md
@@ -95,6 +95,12 @@ If the user has access to a terminal:
 - For MatterGen: check generated CIF files
 - For GP-MoLFormer: check CSV with RDKit validity
 
+## Overlay build: always use $TMPDIR, not NFS
+
+When directing users to build an Apptainer ext3 overlay, always point them to the `build_overlay_amd.sh` script in the model's `examples/` directory. That script already handles the correct pattern: build on `$TMPDIR` (node-local fast disk), then copy the finished `.img` file to NFS.
+
+**Do not suggest building the overlay directly on NFS paths** (`/shared/`, `/scratch/`, etc.). An ext3 image served via FUSE over NFS incurs massive per-file overhead when writing thousands of Python package files — both `cp -r` and `tar | tar` are equally slow because the bottleneck is the FUSE layer, not the copy tool. The fix is to create and populate the image on local disk and copy the completed file to NFS as one large sequential write.
+
 ## Model-specific notes
 
 ### Models with auto-download weights

diff --git a/.cursor/skills/ai4science-studio/SKILL.md b/.cursor/skills/ai4science-studio/SKILL.md
@@ -216,8 +216,24 @@ done
 
 Always verify at end: `assert 'rocm' in torch.__version__`
 
+**Always build the overlay image on `$TMPDIR` (node-local disk), not on NFS.**
+
+An ext3 overlay image is served via FUSE. Writing thousands of small Python package files through FUSE into an image stored on NFS is extremely slow. `cp -r` and `tar | tar` are equally slow because the bottleneck is the FUSE layer over NFS, not the copy tool. The correct pattern:
+
+```bash
+LOCAL_OVERLAY="${TMPDIR:-/tmp}/<model>-overlay-${SLURM_JOB_ID:-$$}.img"
+apptainer overlay create --size "$OVERLAY_SIZE_MB" "$LOCAL_OVERLAY"
+
+apptainer exec --rocm --overlay "${LOCAL_OVERLAY}:rw" ...  # populate on local disk (fast)
+
+# Copy finished image to NFS — one large sequential write (fast)
+cp "$LOCAL_OVERLAY" "$OVERLAY"   # $OVERLAY = /shared/... NFS destination
+rm -f "$LOCAL_OVERLAY"
+```
+
 **Validated overlay sizes:**
 - ORBIT-2 (pytorch-lightning + xformers + wandb + mpi4py + ...): 7 GB overlay, ~2 GB content
+- HydraGNN (torch-scatter + torch-geometric + mpi4py + ...): 4 GB overlay, ~522 MB content
 - StormCast (earth2studio + cartopy): 4 GB overlay, ~1.7 GB content
 
 ## Local cluster config

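One thing the copy step in the pattern above does not cover: if the job is killed while the image is being written to NFS, a truncated file is left at the final path. A hedged Python sketch of the usual remedy follows: copy to a temporary name in the destination directory, then rename. `publish_image` and the `.partial` suffix are hypothetical names, not part of the scripts in this PR.

```python
import os
import shutil


def publish_image(local_path, dest_path):
    """Copy local_path to dest_path so dest_path never holds a partial file."""
    partial = dest_path + ".partial"
    try:
        # One large sequential write to the destination filesystem.
        shutil.copyfile(local_path, partial)
        # Rename within the same directory; on POSIX this replaces the
        # target in a single step instead of writing it incrementally.
        os.replace(partial, dest_path)
    finally:
        # Remove a leftover temp file if the copy or rename failed.
        if os.path.exists(partial):
            os.remove(partial)
    # Delete the node-local build copy only after a successful publish.
    os.remove(local_path)
    return dest_path
```

If the copy fails, the exception propagates, the local image is kept for a retry, and nothing appears at `dest_path`.
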
diff --git a/earth_science/models/ORBIT-2/examples/build_overlay_amd.sh b/earth_science/models/ORBIT-2/examples/build_overlay_amd.sh
@@ -39,6 +39,16 @@
 #     ORBIT2_STAGE_DIR="" (empty).  The script will install directly into the
 #     overlay and strip torch afterward.  Simpler, but requires ~7 GB overlay.
 #
+# ── Why the overlay is built on $TMPDIR ──────────────────────────────────────
+#   The overlay image is an ext3 filesystem served via FUSE.  Writing thousands
+#   of small Python package files through FUSE into an image that lives on NFS
+#   is extremely slow (ext3-on-NFS via FUSE).  We instead:
+#     1. Create and populate the ext3 image on $TMPDIR (node-local fast disk).
+#     2. Copy the finished image file (one large sequential write) to the NFS
+#        destination ($ORBIT2_OVERLAY).
+#   This makes the file-intensive install step fast and leaves a single large
+#   file on NFS.
+#
 # ── Key environment variables ─────────────────────────────────────────────────
 #   ORBIT2_SIF              Path to the Apptainer SIF image (required)
 #   ORBIT2_OVERLAY          Output overlay path
@@ -96,9 +106,13 @@ else
   OVERLAY_SIZE_MB="${ORBIT2_OVERLAY_SIZE_MB:-8192}"
 fi
 
+# Build overlay on local disk; copy finished image to NFS destination.
+LOCAL_OVERLAY="${TMPDIR:-/tmp}/orbit2-overlay-${SLURM_JOB_ID:-$$}.img"
+
 echo "=== ORBIT-2 overlay build ==="
 echo "  SIF          : $ORBIT2_SIF"
-echo "  Overlay      : $OVERLAY  (${OVERLAY_SIZE_MB} MB)"
+echo "  Overlay (NFS): $OVERLAY  (${OVERLAY_SIZE_MB} MB)"
+echo "  Build on     : $LOCAL_OVERLAY  (node-local ${TMPDIR:-/tmp})"
 echo "  ROCm whl tag : $ROCM_WHL_TAG"
 echo "  Staging mode : $USE_STAGING  (stage dir: ${STAGE_DIR:-n/a})"
 echo "  Node         : $(hostname)  Date: $(date)"
@@ -223,20 +237,27 @@ INNEREOF
 chmod +x "$SCRIPTS_DIR/overlay_install.sh"
 
 # ---------------------------------------------------------------------------
-# Create the overlay image and run the install inside the container
+# Create overlay on local disk, run install, then copy image to NFS
 # ---------------------------------------------------------------------------
-echo "Creating ${OVERLAY_SIZE_MB} MB ext3 overlay image ..."
-apptainer overlay create --size "$OVERLAY_SIZE_MB" "$OVERLAY"
+echo "Creating ${OVERLAY_SIZE_MB} MB ext3 overlay on local disk: $LOCAL_OVERLAY ..."
+apptainer overlay create --size "$OVERLAY_SIZE_MB" "$LOCAL_OVERLAY"
 
 echo "Running install inside container ..."
 apptainer exec \
     --rocm \
-    --overlay "${OVERLAY}:rw" \
+    --overlay "${LOCAL_OVERLAY}:rw" \
     --bind "$SCRIPTS_DIR":/scripts \
-    --bind "$(dirname "$OVERLAY")":/overlay-dir \
+    ${STAGE_DIR:+--bind "${STAGE_DIR}:${STAGE_DIR}"} \
     "$ORBIT2_SIF" \
     bash /scripts/overlay_install.sh
 
+echo ""
+echo "--- Copying finished overlay to NFS: $OVERLAY ---"
+echo "  Image size: $(du -sh "$LOCAL_OVERLAY" | cut -f1)"
+cp "$LOCAL_OVERLAY" "$OVERLAY"
+rm -f "$LOCAL_OVERLAY"
+echo "  Done."
+
 echo ""
 echo "=== Overlay ready ==="
 ls -lh "$OVERLAY"

diff --git a/material_science/models/HydraGNN/examples/build_overlay_amd.sh b/material_science/models/HydraGNN/examples/build_overlay_amd.sh
@@ -0,0 +1,208 @@
+#!/usr/bin/env bash
+# Build a persistent Apptainer ext3 overlay pre-loaded with all HydraGNN pip
+# dependencies.  Run once per cluster; reuse the overlay image across jobs to
+# skip the pip install phase on every submission.
+#
+# ── Quick start ──────────────────────────────────────────────────────────────
+#   export HG_SIF=/path/to/rocm_pytorch.sif
+#   sbatch build_overlay_amd.sh
+#   # then in your inference job:
+#   export HG_OVERLAY=/path/to/hydragnn-overlay.img
+#   sbatch sbatch_infer_amd.sh
+#
+# ── torch-scatter / torch-sparse / torch-cluster / torch-spline-conv ─────────
+#   No official ROCm wheels exist at data.pyg.org.  We use pre-built ROCm
+#   wheels from https://github.com/Looong01/pyg-rocm-build (release 15):
+#     - PyTorch 2.10.x + ROCm 7.1 wheels run fine on ROCm 7.2.x (ABI compat)
+#     - Python 3.12, linux_x86_64
+#   torch-geometric (pure Python) is installed from PyPI as normal.
+#
+# ── Why staging is needed ────────────────────────────────────────────────────
+#   pip install --target does not see the SIF's /opt/venv, so torch gets
+#   re-downloaded (~6 GB) and fills the overlay.  We stage into a writable
+#   scratch dir, strip torch/nvidia/triton, then copy the rest (~0.5 GB) into the overlay.
+#
+# ── Why the overlay is built on $TMPDIR ──────────────────────────────────────
+#   The overlay image is an ext3 filesystem served via FUSE.  Writing thousands
+#   of small Python package files through FUSE into an image that lives on NFS
+#   is extremely slow (ext3-on-NFS via FUSE).  We instead:
+#     1. Create and populate the ext3 image on $TMPDIR (node-local fast disk).
+#     2. Copy the finished image file (one large sequential write) to the NFS
+#        destination ($HG_OVERLAY).
+#   This makes the file-intensive install step fast and leaves a single large
+#   file on NFS.
+#
+# ── Key environment variables ─────────────────────────────────────────────────
+#   HG_SIF              Path to the Apptainer SIF image (required)
+#   HG_OVERLAY          Output overlay path
+#                       (default: <sif-dir>/hydragnn-overlay.img)
+#   HG_STAGE_DIR        Writable scratch dir for staging install (~6 GB)
+#                       (default: <overlay-dir>/hydragnn-stage)
+#   HG_OVERLAY_SIZE_MB  Size of the ext3 overlay in MB (default: 4096)
+#   PYG_ROCM_RELEASE    Looong01 release tag to use (default: 15)
+#   PYG_PYTHON_TAG      Python tag in zip filename (default: py312)
+
+#SBATCH --job-name=hydragnn-overlay
+#SBATCH --partition=lux
+#SBATCH --account=vultr_lux
+#SBATCH --nodes=1
+#SBATCH --ntasks=1
+#SBATCH --gres=gpu:1
+#SBATCH --cpus-per-task=8
+#SBATCH --time=03:00:00
+#SBATCH --output=hydragnn-overlay-build-%j.out
+#SBATCH --error=hydragnn-overlay-build-%j.out
+
+set -euo pipefail
+
+if [[ -z "${HG_SIF:-}" ]]; then
+  echo "error: set HG_SIF to your Apptainer SIF path before submitting" >&2
+  exit 1
+fi
+
+OVERLAY="${HG_OVERLAY:-$(dirname "$HG_SIF")/hydragnn-overlay.img}"
+STAGE_DIR="${HG_STAGE_DIR:-$(dirname "$OVERLAY")/hydragnn-stage}"
+OVERLAY_SIZE_MB="${HG_OVERLAY_SIZE_MB:-4096}"
+PYG_ROCM_RELEASE="${PYG_ROCM_RELEASE:-15}"
+PYG_PYTHON_TAG="${PYG_PYTHON_TAG:-py312}"
+
+PYG_ZIP_NAME="torch-2.10.0-rocm-7.1-${PYG_PYTHON_TAG}-linux_x86_64.zip"
+PYG_ZIP_URL="https://github.com/Looong01/pyg-rocm-build/releases/download/${PYG_ROCM_RELEASE}/${PYG_ZIP_NAME}"
+PYG_WHEELS_DIR="${STAGE_DIR}/pyg-rocm-wheels"
+
+# Build the overlay on local disk to avoid ext3-on-NFS small-file I/O penalty;
+# copy the finished image file to NFS destination as one sequential write.
+LOCAL_OVERLAY="${TMPDIR:-/tmp}/hydragnn-overlay-${SLURM_JOB_ID:-$$}.img"
+
+echo "=== HydraGNN overlay build ==="
+echo "  SIF          : $HG_SIF"
+echo "  Overlay (NFS): $OVERLAY  (${OVERLAY_SIZE_MB} MB)"
+echo "  Build on     : $LOCAL_OVERLAY  (node-local ${TMPDIR:-/tmp})"
+echo "  Stage dir    : $STAGE_DIR"
+echo "  PyG wheels   : Looong01 release ${PYG_ROCM_RELEASE} (torch 2.10, ROCm 7.1, ${PYG_PYTHON_TAG})"
+echo "  Node         : $(hostname)  Date: $(date)"
+echo ""
+
+if [[ -f "$OVERLAY" ]]; then
+  echo "Overlay already exists: $OVERLAY"
+  echo "Delete it to rebuild: rm $OVERLAY"
+  exit 0
+fi
+
+mkdir -p "$STAGE_DIR" "$PYG_WHEELS_DIR"
+
+# ---------------------------------------------------------------------------
+# Download and extract Looong01 PyG ROCm wheels (on the host, before container)
+# ---------------------------------------------------------------------------
+PYG_ZIP="${STAGE_DIR}/${PYG_ZIP_NAME}"
+if [[ ! -f "$PYG_ZIP" ]]; then
+  echo "--- Downloading PyG ROCm wheels from Looong01 release ${PYG_ROCM_RELEASE} ---"
+  curl -fL "$PYG_ZIP_URL" -o "$PYG_ZIP"
+else
+  echo "--- PyG ROCm zip already downloaded: $PYG_ZIP ---"
+fi
+
+echo "--- Extracting wheels ---"
+unzip -q -o "$PYG_ZIP" -d "$PYG_WHEELS_DIR"
+echo "Wheels extracted:"
+ls "$PYG_WHEELS_DIR"/*.whl 2>/dev/null || ls "$PYG_WHEELS_DIR"
+
+# ---------------------------------------------------------------------------
+# Write the install script that runs inside the container
+# ---------------------------------------------------------------------------
+SCRIPTS_DIR="${TMPDIR:-/tmp}/hydragnn-overlay-scripts-${SLURM_JOB_ID:-$$}"
+mkdir -p "$SCRIPTS_DIR"
+
+cat > "$SCRIPTS_DIR/overlay_install.sh" << INNEREOF
+#!/usr/bin/env bash
+set -euo pipefail
+source /opt/venv/bin/activate
+
+PKG=/opt/hydragnn-pkgs
+STAGE="${STAGE_DIR}/pkgs"
+PYG_WHEELS="${PYG_WHEELS_DIR}"
+
+rm -rf "\$STAGE" && mkdir -p "\$STAGE"
+echo "--- torch: \$(python3 -c 'import torch; print(torch.__version__)') ---"
+echo "--- Installing to staging dir: \$STAGE ---"
+
+echo "--- Step 1: PyG ROCm wheels (torch-scatter, torch-sparse, torch-cluster, etc.) ---"
+# --no-deps: these wheels depend on torch, which is already in the SIF venv
+pip install -q --no-cache-dir --no-deps --target "\$STAGE" "\$PYG_WHEELS"/*.whl
+echo "  installed: \$(ls \$STAGE | grep torch | tr '\n' ' ')"
+
+echo "--- Step 2: torch-geometric (pure Python) ---"
+pip install -q --no-cache-dir --target "\$STAGE" torch-geometric 2>&1 | tail -3
+
+echo "--- Step 3: HydraGNN runtime deps ---"
+pip install -q --no-cache-dir --target "\$STAGE" \
+    mpi4py tqdm pyyaml tensorboard matplotlib \
+    scikit-learn scipy h5py ase 2>&1 | tail -5
+
+echo "--- Step 4: HydraGNN package (clone + install, no deps) ---"
+if [[ ! -d /tmp/HydraGNN-src ]]; then
+    git clone -q --depth=1 --branch Predictive_GFM_2024 \
+        https://github.com/ORNL/HydraGNN.git /tmp/HydraGNN-src
+fi
+pip install -q --no-cache-dir --target "\$STAGE" --no-deps /tmp/HydraGNN-src 2>&1 | tail -3
+
+echo "--- Step 5: strip torch / nvidia / triton from staging dir ---"
+for pkg in torch torchvision torchaudio torchgen functorch nvidia triton; do
+    rm -rf "\${STAGE}/\${pkg}" "\${STAGE}/\${pkg}"-*.dist-info 2>/dev/null || true
+done
+echo "  Staging size after strip: \$(du -sh \$STAGE | cut -f1)"
+
+echo "--- Step 6: copy stripped packages → overlay (local disk, fast) ---"
+mkdir -p "\$PKG"
+cp -r "\$STAGE/." "\$PKG/"
+# Final safety strip
+for pkg in torch torchvision torchaudio torchgen functorch nvidia triton; do
+    rm -rf "\${PKG}/\${pkg}" "\${PKG}/\${pkg}"-*.dist-info 2>/dev/null || true
+done
+echo "  Overlay size: \$(du -sh \$PKG | cut -f1)"
+
+echo "--- Step 7: verify ---"
+PYTHONPATH="\$PKG" python3 -c "
+import torch, torch_scatter, torch_geometric, hydragnn
+print('torch         :', torch.__version__)
+print('torch_scatter : ok')
+print('torch_geometric:', torch_geometric.__version__)
+print('hydragnn      : ok')
+assert getattr(torch.version, 'hip', None) or 'rocm' in torch.__version__, 'ROCm torch expected'
+print('Verification passed.')
+"
+echo "=== Overlay install complete ==="
+INNEREOF
+
+chmod +x "$SCRIPTS_DIR/overlay_install.sh"
+
+# ---------------------------------------------------------------------------
+# Create overlay on local disk, run install, then copy image to NFS
+# ---------------------------------------------------------------------------
+echo "Creating ${OVERLAY_SIZE_MB} MB ext3 overlay on local disk: $LOCAL_OVERLAY ..."
+apptainer overlay create --size "$OVERLAY_SIZE_MB" "$LOCAL_OVERLAY"
+
+echo "Running install inside container ..."
+apptainer exec \
+    --rocm \
+    --overlay "${LOCAL_OVERLAY}:rw" \
+    --bind "$SCRIPTS_DIR":/scripts \
+    --bind "${STAGE_DIR}:${STAGE_DIR}" \
+    --bind "${PYG_WHEELS_DIR}:${PYG_WHEELS_DIR}" \
+    "$HG_SIF" \
+    bash /scripts/overlay_install.sh
+
+echo ""
+echo "--- Copying finished overlay to NFS: $OVERLAY ---"
+echo "  Image size: $(du -sh "$LOCAL_OVERLAY" | cut -f1)"
+cp "$LOCAL_OVERLAY" "$OVERLAY"
+rm -f "$LOCAL_OVERLAY"
+echo "  Done."
+
+echo ""
+echo "=== Overlay ready ==="
+ls -lh "$OVERLAY"
+echo ""
+echo "To use in inference jobs:"
+echo "  export HG_OVERLAY=$OVERLAY"
+echo "  sbatch sbatch_infer_amd.sh"
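Step 5 of the install script removes the packages that the SIF already provides so they do not consume overlay space. The same rule is sketched below in Python, for anyone who wants to check it outside the container. `strip_packages` is a hypothetical helper; the package list is copied from the script.

```python
import glob
import os
import shutil

SIF_PROVIDED = ("torch", "torchvision", "torchaudio", "torchgen",
                "functorch", "nvidia", "triton")


def strip_packages(stage_dir, names=SIF_PROVIDED):
    """Delete <name> and <name>-*.dist-info under stage_dir; return removed paths."""
    removed = []
    for name in names:
        targets = [os.path.join(stage_dir, name)]
        # The hyphen keeps torch_scatter-*.dist-info out of the torch match.
        targets += glob.glob(
            os.path.join(glob.escape(stage_dir), name + "-*.dist-info"))
        for path in targets:
            if os.path.isdir(path):
                shutil.rmtree(path)
                removed.append(path)
            elif os.path.exists(path):
                os.remove(path)
                removed.append(path)
    return sorted(removed)
```

Names are matched exactly, so `torch_scatter` and `torch_geometric` survive while `torch` itself is removed.
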
diff --git a/material_science/models/HydraGNN/examples/run_inference.sh b/material_science/models/HydraGNN/examples/run_inference.sh
@@ -47,7 +47,6 @@ if [[ -z "${HG_CHECKPOINT}" || -z "${HG_CONFIG}" ]]; then
 fi
 
 mkdir -p "$HG_OUTPUT_DIR"
-cd /workspace/HydraGNN
 
 echo "=== HydraGNN Predictive GFM 2024 — Inference ==="
 echo "  Checkpoint : $HG_CHECKPOINT"
@@ -56,27 +55,45 @@ echo "  Output     : $HG_OUTPUT_DIR"
 echo ""
 
 python - <<PYEOF
-import hydragnn
-import torch
+import json, os, torch
+from collections import OrderedDict
 
 config_file = "${HG_CONFIG}"
 checkpoint  = "${HG_CHECKPOINT}"
 output_dir  = "${HG_OUTPUT_DIR}"
 
-print(f"Loading model from config: {config_file}")
-model = hydragnn.load_existing_model(config_file, checkpoint)
-model.eval()
+os.makedirs(output_dir, exist_ok=True)
+
+with open(config_file) as f:
+    config = json.load(f)
+
+from hydragnn.models.create import create_model_config
+from hydragnn.utils.distributed import get_device_name
+
+print(f"Creating model from config: {config_file}")
+model = create_model_config(
+    config=config["NeuralNetwork"],
+    verbosity=config["Verbosity"]["level"],
+)
 
 device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
 print(f"Device: {device}")
 model = model.to(device)
 
-# --- Replace the block below with your actual input graph construction ---
-# Inputs must match the featurization used during training (see upstream HydraGNN docs).
-# Example placeholder: build an ASE Atoms object or a torch_geometric Data object
-# and call hydragnn.run_prediction(model, data) per upstream API.
+print(f"Loading checkpoint: {checkpoint}")
+# Remap all saved tensors onto the selected device, whichever device wrote them.
+ckpt = torch.load(checkpoint, map_location=device)
+state_dict = ckpt["model_state_dict"]
+# GFM checkpoints were saved with DDP (keys prefixed "module.") but we infer
+# without DDP wrapping, so strip the prefix before load_state_dict.
+if next(iter(state_dict)).startswith("module."):
+    state_dict = OrderedDict((k[len("module."):], v) for k, v in state_dict.items())
+model.load_state_dict(state_dict)
+model.eval()
+
 print("")
-print("Model loaded successfully. Build your input graph and call:")
-print("  hydragnn.run_prediction(model, data)")
-print("See recipes/inference/README.md for details.")
+print("Model loaded successfully.")
+print("Build your input graph (torch_geometric.data.Data) and call:")
+print("  pred = model(data.to(device))")
+print(f"Results directory: {output_dir}")
 PYEOF
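
The prefix handling in the loader above is easy to check in isolation. Here is a minimal sketch with plain values standing in for tensors. `strip_ddp_prefix` is a hypothetical helper name, and unlike the script it tests every key rather than only the first one.

```python
from collections import OrderedDict


def strip_ddp_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading DDP prefix removed from each key."""
    return OrderedDict(
        (key[len(prefix):] if key.startswith(prefix) else key, value)
        for key, value in state_dict.items()
    )


saved = OrderedDict([("module.conv.weight", 1), ("module.conv.bias", 2)])
print(list(strip_ddp_prefix(saved)))  # ['conv.weight', 'conv.bias']
```

Keys without the prefix pass through unchanged, so the helper is safe to apply to checkpoints saved without DDP as well.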