6 changes: 4 additions & 2 deletions .claude/commands/run-hydragnn.md
@@ -13,10 +13,12 @@ Check if `.cluster-config.yaml` (repo root) or `~/.config/ai4science-studio/clus
- **Training** — Smaller-scale training (the full GFM pretraining is Frontier-scale)

**Q1. (Inference) Checkpoint and config**
-Do you have a `.pk` checkpoint and matching `config.json` from the HF Hub [`mlupopa/HydraGNN_Predictive_GFM_2024`](https://huggingface.co/mlupopa/HydraGNN_Predictive_GFM_2024)? If not, I will generate the download command.
+First, auto-discover: run `find <paths.projects> -maxdepth 5 -name "*.pk" 2>/dev/null` and `find <paths.projects> -maxdepth 5 -name "config.json" 2>/dev/null` (substituting `paths.projects` from cluster config) to check for existing checkpoints on shared storage. Present any results to the user. If nothing is found, ask:
+- Do you have a `.pk` checkpoint and matching `config.json` from the HF Hub [`mlupopa/HydraGNN_Predictive_GFM_2024`](https://huggingface.co/mlupopa/HydraGNN_Predictive_GFM_2024)?
+- Options: **Yes, provide paths** / **No, download for me** / **Auto-discovered (use found path)**

**Q2. Output directory**
-Where to write predictions? Default: `/workspace/results`.
+Where to write predictions? Default: `<paths.projects>/hydragnn-results` (read `paths.projects` from cluster config, never use $HOME for large outputs).

---
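
The auto-discovery step in Q1 runs two `find` commands with `-maxdepth 5`. As a rough illustration of what that search covers, here is a minimal Python sketch of the same depth-limited lookup. The helper name `find_files` is hypothetical and not part of the command file, which calls `find` directly.

```python
import fnmatch
import os


def find_files(root, pattern, max_depth=5):
    """Return sorted file paths under root whose basename matches pattern.

    Depth is counted the way find's -maxdepth counts it: root itself is
    depth 0, so a file directly inside root is at depth 1.
    """
    root = os.path.abspath(root)
    matches = []
    # os.walk skips unreadable directories by default, similar in spirit
    # to discarding find's errors with 2>/dev/null.
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        dir_depth = 0 if rel == "." else rel.count(os.sep) + 1
        file_depth = dir_depth + 1
        if file_depth >= max_depth:
            # Anything inside a subdirectory would exceed max_depth.
            dirnames[:] = []
        if file_depth <= max_depth:
            for name in filenames:
                if fnmatch.fnmatch(name, pattern):
                    matches.append(os.path.join(dirpath, name))
    return sorted(matches)
```

Under these assumptions, `find_files(projects_dir, "*.pk")` and `find_files(projects_dir, "config.json")` would return the candidate paths to present to the user, where `projects_dir` stands for the `paths.projects` value read from the cluster config.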

6 changes: 6 additions & 0 deletions .cursor/skills/ai4science-run-models/SKILL.md
@@ -95,6 +95,12 @@ If the user has access to a terminal:
- For MatterGen: check generated CIF files
- For GP-MoLFormer: check CSV with RDKit validity

## Overlay build: always use $TMPDIR, not NFS

When directing users to build an Apptainer ext3 overlay, always point them to the `build_overlay_amd.sh` script in the model's `examples/` directory. That script already handles the correct pattern: build on `$TMPDIR` (node-local fast disk), then copy the finished `.img` file to NFS.

**Do not suggest building the overlay directly on NFS paths** (`/shared/`, `/scratch/`, etc.). An ext3 image served via FUSE over NFS incurs massive per-file overhead when writing thousands of Python package files — both `cp -r` and `tar | tar` are equally slow because the bottleneck is the FUSE layer, not the copy tool. The fix is to create and populate the image on local disk and copy the completed file to NFS as one large sequential write.

## Model-specific notes

### Models with auto-download weights
16 changes: 16 additions & 0 deletions .cursor/skills/ai4science-studio/SKILL.md
@@ -216,8 +216,24 @@ done

Always verify at end: `assert 'rocm' in torch.__version__`

**Always build the overlay image on `$TMPDIR` (node-local disk), not on NFS.**

An ext3 overlay image is served via FUSE. Writing thousands of small Python package files through FUSE into an image stored on NFS is extremely slow — both `cp -r` and `tar | tar` are equally bottlenecked because the destination is the FUSE layer over NFS, not the copy tool. The correct pattern:

```bash
LOCAL_OVERLAY="${TMPDIR:-/tmp}/<model>-overlay-${SLURM_JOB_ID:-$$}.img"
apptainer overlay create --size "$OVERLAY_SIZE_MB" "$LOCAL_OVERLAY"

apptainer exec --rocm --overlay "${LOCAL_OVERLAY}:rw" ... # populate on local disk (fast)

# Copy finished image to NFS — one large sequential write (fast)
cp "$LOCAL_OVERLAY" "$OVERLAY" # $OVERLAY = /shared/... NFS destination
rm -f "$LOCAL_OVERLAY"
```

**Validated overlay sizes:**
- ORBIT-2 (pytorch-lightning + xformers + wandb + mpi4py + ...): 7 GB overlay, ~2 GB content
- HydraGNN (torch-scatter + torch-geometric + mpi4py + ...): 4 GB overlay, ~522 MB content
- StormCast (earth2studio + cartopy): 4 GB overlay, ~1.7 GB content

## Local cluster config
33 changes: 27 additions & 6 deletions earth_science/models/ORBIT-2/examples/build_overlay_amd.sh
@@ -39,6 +39,16 @@
# ORBIT2_STAGE_DIR="" (empty). The script will install directly into the
# overlay and strip torch afterward. Simpler, but requires ~7 GB overlay.
#
# ── Why the overlay is built on $TMPDIR ──────────────────────────────────────
# The overlay image is an ext3 filesystem served via FUSE. Writing thousands
# of small Python package files through FUSE into an image that lives on NFS
# is extremely slow (ext3-on-NFS via FUSE). We instead:
# 1. Create and populate the ext3 image on $TMPDIR (node-local fast disk).
# 2. Copy the finished image file (one large sequential write) to the NFS
# destination ($ORBIT2_OVERLAY).
# This makes the file-intensive install step fast and leaves a single large
# file on NFS.
#
# ── Key environment variables ─────────────────────────────────────────────────
# ORBIT2_SIF Path to the Apptainer SIF image (required)
# ORBIT2_OVERLAY Output overlay path
@@ -96,9 +106,13 @@ else
OVERLAY_SIZE_MB="${ORBIT2_OVERLAY_SIZE_MB:-8192}"
fi

+# Build overlay on local disk; copy finished image to NFS destination.
+LOCAL_OVERLAY="${TMPDIR:-/tmp}/orbit2-overlay-${SLURM_JOB_ID:-$$}.img"
+
echo "=== ORBIT-2 overlay build ==="
echo " SIF : $ORBIT2_SIF"
-echo " Overlay : $OVERLAY (${OVERLAY_SIZE_MB} MB)"
+echo " Overlay (NFS): $OVERLAY (${OVERLAY_SIZE_MB} MB)"
+echo " Build on : $LOCAL_OVERLAY (node-local ${TMPDIR:-/tmp})"
echo " ROCm whl tag : $ROCM_WHL_TAG"
echo " Staging mode : $USE_STAGING (stage dir: ${STAGE_DIR:-n/a})"
echo " Node : $(hostname) Date: $(date)"
@@ -223,20 +237,27 @@ INNEREOF
chmod +x "$SCRIPTS_DIR/overlay_install.sh"

# ---------------------------------------------------------------------------
-# Create the overlay image and run the install inside the container
+# Create overlay on local disk, run install, then copy image to NFS
# ---------------------------------------------------------------------------
-echo "Creating ${OVERLAY_SIZE_MB} MB ext3 overlay image ..."
-apptainer overlay create --size "$OVERLAY_SIZE_MB" "$OVERLAY"
+echo "Creating ${OVERLAY_SIZE_MB} MB ext3 overlay on local disk: $LOCAL_OVERLAY ..."
+apptainer overlay create --size "$OVERLAY_SIZE_MB" "$LOCAL_OVERLAY"

echo "Running install inside container ..."
apptainer exec \
    --rocm \
-    --overlay "${OVERLAY}:rw" \
+    --overlay "${LOCAL_OVERLAY}:rw" \
    --bind "$SCRIPTS_DIR":/scripts \
-    --bind "$(dirname "$OVERLAY")":/overlay-dir \
+    --bind "${STAGE_DIR}:${STAGE_DIR}" \
    "$ORBIT2_SIF" \
    bash /scripts/overlay_install.sh

+echo ""
+echo "--- Copying finished overlay to NFS: $OVERLAY ---"
+echo " Image size: $(du -sh "$LOCAL_OVERLAY" | cut -f1)"
+cp "$LOCAL_OVERLAY" "$OVERLAY"
+rm -f "$LOCAL_OVERLAY"
+echo " Done."
+
echo ""
echo "=== Overlay ready ==="
ls -lh "$OVERLAY"
208 changes: 208 additions & 0 deletions material_science/models/HydraGNN/examples/build_overlay_amd.sh
@@ -0,0 +1,208 @@
#!/usr/bin/env bash
# Build a persistent Apptainer ext3 overlay pre-loaded with all HydraGNN pip
# dependencies. Run once per cluster; reuse the overlay image across jobs to
# skip the pip install phase on every submission.
#
# ── Quick start ──────────────────────────────────────────────────────────────
# export HG_SIF=/path/to/rocm_pytorch.sif
# sbatch build_overlay_amd.sh
# # then in your inference job:
# export HG_OVERLAY=/path/to/hydragnn-overlay.img
# sbatch sbatch_infer_amd.sh
#
# ── torch-scatter / torch-sparse / torch-cluster / torch-spline-conv ─────────
# No official ROCm wheels exist at data.pyg.org. We use pre-built ROCm
# wheels from https://github.com/Looong01/pyg-rocm-build (release 15):
# - PyTorch 2.10.x + ROCm 7.1 wheels run fine on ROCm 7.2.x (ABI compat)
# - Python 3.12, linux_x86_64
# torch-geometric (pure Python) is installed from PyPI as normal.
#
# ── Why staging is needed ────────────────────────────────────────────────────
# pip install --target does not see the SIF's /opt/venv, so torch gets
# re-downloaded (~6 GB) and fills the overlay. We stage into a writable
# scratch dir, strip torch/nvidia/triton, then copy ~1-2 GB into the overlay.
#
# ── Why the overlay is built on $TMPDIR ──────────────────────────────────────
# The overlay image is an ext3 filesystem served via FUSE. Writing thousands
# of small Python package files through FUSE into an image that lives on NFS
# is extremely slow (ext3-on-NFS via FUSE). We instead:
# 1. Create and populate the ext3 image on $TMPDIR (node-local fast disk).
# 2. Copy the finished image file (one large sequential write) to the NFS
# destination ($HG_OVERLAY).
# This makes the file-intensive install step fast and leaves a single large
# file on NFS.
#
# ── Key environment variables ─────────────────────────────────────────────────
# HG_SIF Path to the Apptainer SIF image (required)
# HG_OVERLAY Output overlay path
# (default: <sif-dir>/hydragnn-overlay.img)
# HG_STAGE_DIR Writable scratch dir for staging install (~6 GB)
# (default: <overlay-dir>/hydragnn-stage)
# HG_OVERLAY_SIZE_MB Size of the ext3 overlay in MB (default: 4096)
# PYG_ROCM_RELEASE Looong01 release tag to use (default: 15)
# PYG_PYTHON_TAG Python tag in zip filename (default: py312)

#SBATCH --job-name=hydragnn-overlay
#SBATCH --partition=lux
#SBATCH --account=vultr_lux
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --time=03:00:00
#SBATCH --output=hydragnn-overlay-build-%j.out
#SBATCH --error=hydragnn-overlay-build-%j.out

set -euo pipefail

if [[ -z "${HG_SIF:-}" ]]; then
    echo "error: set HG_SIF to your Apptainer SIF path before submitting" >&2
    exit 1
fi

OVERLAY="${HG_OVERLAY:-$(dirname "$HG_SIF")/hydragnn-overlay.img}"
STAGE_DIR="${HG_STAGE_DIR:-$(dirname "$OVERLAY")/hydragnn-stage}"
OVERLAY_SIZE_MB="${HG_OVERLAY_SIZE_MB:-4096}"
PYG_ROCM_RELEASE="${PYG_ROCM_RELEASE:-15}"
PYG_PYTHON_TAG="${PYG_PYTHON_TAG:-py312}"

PYG_ZIP_NAME="torch-2.10.0-rocm-7.1-${PYG_PYTHON_TAG}-linux_x86_64.zip"
PYG_ZIP_URL="https://github.com/Looong01/pyg-rocm-build/releases/download/${PYG_ROCM_RELEASE}/${PYG_ZIP_NAME}"
PYG_WHEELS_DIR="${STAGE_DIR}/pyg-rocm-wheels"

# Build the overlay on local disk to avoid ext3-on-NFS small-file I/O penalty;
# copy the finished image file to NFS destination as one sequential write.
LOCAL_OVERLAY="${TMPDIR:-/tmp}/hydragnn-overlay-${SLURM_JOB_ID:-$$}.img"

echo "=== HydraGNN overlay build ==="
echo " SIF : $HG_SIF"
echo " Overlay (NFS): $OVERLAY (${OVERLAY_SIZE_MB} MB)"
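# Use the same fallback as LOCAL_OVERLAY: a bare $TMPDIR aborts under `set -u` when it is unset.
echo " Build on : $LOCAL_OVERLAY (node-local ${TMPDIR:-/tmp})"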
echo " Stage dir : $STAGE_DIR"
echo " PyG wheels : Looong01 release ${PYG_ROCM_RELEASE} (torch 2.10, ROCm 7.1, ${PYG_PYTHON_TAG})"
echo " Node : $(hostname) Date: $(date)"
echo ""

if [[ -f "$OVERLAY" ]]; then
    echo "Overlay already exists: $OVERLAY"
    echo "Delete it to rebuild: rm $OVERLAY"
    exit 0
fi

mkdir -p "$STAGE_DIR" "$PYG_WHEELS_DIR"

# ---------------------------------------------------------------------------
# Download and extract Looong01 PyG ROCm wheels (on the host, before container)
# ---------------------------------------------------------------------------
PYG_ZIP="${STAGE_DIR}/${PYG_ZIP_NAME}"
if [[ ! -f "$PYG_ZIP" ]]; then
    echo "--- Downloading PyG ROCm wheels from Looong01 release ${PYG_ROCM_RELEASE} ---"
    curl -fL "$PYG_ZIP_URL" -o "$PYG_ZIP"
else
    echo "--- PyG ROCm zip already downloaded: $PYG_ZIP ---"
fi

echo "--- Extracting wheels ---"
unzip -q -o "$PYG_ZIP" -d "$PYG_WHEELS_DIR"
echo "Wheels extracted:"
ls "$PYG_WHEELS_DIR"/*.whl 2>/dev/null || ls "$PYG_WHEELS_DIR"

# ---------------------------------------------------------------------------
# Write the install script that runs inside the container
# ---------------------------------------------------------------------------
SCRIPTS_DIR="${TMPDIR:-/tmp}/hydragnn-overlay-scripts-${SLURM_JOB_ID:-$$}"
mkdir -p "$SCRIPTS_DIR"

cat > "$SCRIPTS_DIR/overlay_install.sh" << INNEREOF
#!/usr/bin/env bash
set -euo pipefail
source /opt/venv/bin/activate

PKG=/opt/hydragnn-pkgs
STAGE="${STAGE_DIR}/pkgs"
PYG_WHEELS="${PYG_WHEELS_DIR}"

rm -rf "\$STAGE" && mkdir -p "\$STAGE"
echo "--- torch: \$(python3 -c 'import torch; print(torch.__version__)') ---"
echo "--- Installing to staging dir: \$STAGE ---"

echo "--- Step 1: PyG ROCm wheels (torch-scatter, torch-sparse, torch-cluster, etc.) ---"
# --no-deps: these wheels depend on torch, which is already in the SIF venv
pip install -q --no-cache-dir --no-deps --target "\$STAGE" "\$PYG_WHEELS"/*.whl
echo " installed: \$(ls \$STAGE | grep torch | tr '\n' ' ')"

echo "--- Step 2: torch-geometric (pure Python) ---"
pip install -q --no-cache-dir --target "\$STAGE" torch-geometric 2>&1 | tail -3

echo "--- Step 3: HydraGNN runtime deps ---"
pip install -q --no-cache-dir --target "\$STAGE" \
    mpi4py tqdm pyyaml tensorboard matplotlib \
    scikit-learn scipy h5py ase 2>&1 | tail -5

echo "--- Step 4: HydraGNN package (clone + install, no deps) ---"
if [[ ! -d /tmp/HydraGNN-src ]]; then
    git clone -q --depth=1 --branch Predictive_GFM_2024 \
        https://github.com/ORNL/HydraGNN.git /tmp/HydraGNN-src
fi
pip install -q --no-cache-dir --target "\$STAGE" --no-deps /tmp/HydraGNN-src 2>&1 | tail -3

echo "--- Step 5: strip torch / nvidia / triton from staging dir ---"
for pkg in torch torchvision torchaudio torchgen functorch nvidia triton; do
    rm -rf "\${STAGE}/\${pkg}" "\${STAGE}/\${pkg}"-*.dist-info 2>/dev/null || true
done
echo " Staging size after strip: \$(du -sh \$STAGE | cut -f1)"

echo "--- Step 6: copy stripped packages → overlay (local disk, fast) ---"
mkdir -p "\$PKG"
cp -r "\$STAGE/." "\$PKG/"
# Final safety strip
for pkg in torch torchvision torchaudio torchgen functorch nvidia triton; do
    rm -rf "\${PKG}/\${pkg}" "\${PKG}/\${pkg}"-*.dist-info 2>/dev/null || true
done
echo " Overlay size: \$(du -sh \$PKG | cut -f1)"

echo "--- Step 7: verify ---"
PYTHONPATH="\$PKG" python3 -c "
import torch, torch_scatter, torch_geometric, hydragnn
print('torch :', torch.__version__)
print('torch_scatter : ok')
print('torch_geometric:', torch_geometric.__version__)
print('hydragnn : ok')
assert torch.cuda.is_available() or 'rocm' in torch.__version__, 'ROCm torch expected'
print('Verification passed.')
"
echo "=== Overlay install complete ==="
INNEREOF

chmod +x "$SCRIPTS_DIR/overlay_install.sh"

# ---------------------------------------------------------------------------
# Create overlay on local disk, run install, then copy image to NFS
# ---------------------------------------------------------------------------
echo "Creating ${OVERLAY_SIZE_MB} MB ext3 overlay on local disk: $LOCAL_OVERLAY ..."
apptainer overlay create --size "$OVERLAY_SIZE_MB" "$LOCAL_OVERLAY"

echo "Running install inside container ..."
apptainer exec \
    --rocm \
    --overlay "${LOCAL_OVERLAY}:rw" \
    --bind "$SCRIPTS_DIR":/scripts \
    --bind "${STAGE_DIR}:${STAGE_DIR}" \
    --bind "${PYG_WHEELS_DIR}:${PYG_WHEELS_DIR}" \
    "$HG_SIF" \
    bash /scripts/overlay_install.sh

echo ""
echo "--- Copying finished overlay to NFS: $OVERLAY ---"
echo " Image size: $(du -sh "$LOCAL_OVERLAY" | cut -f1)"
cp "$LOCAL_OVERLAY" "$OVERLAY"
rm -f "$LOCAL_OVERLAY"
echo " Done."

echo ""
echo "=== Overlay ready ==="
ls -lh "$OVERLAY"
echo ""
echo "To use in inference jobs:"
echo " export HG_OVERLAY=$OVERLAY"
echo " sbatch sbatch_infer_amd.sh"
43 changes: 30 additions & 13 deletions material_science/models/HydraGNN/examples/run_inference.sh
@@ -47,7 +47,6 @@ if [[ -z "${HG_CHECKPOINT}" || -z "${HG_CONFIG}" ]]; then
fi

mkdir -p "$HG_OUTPUT_DIR"
-cd /workspace/HydraGNN

echo "=== HydraGNN Predictive GFM 2024 — Inference ==="
echo " Checkpoint : $HG_CHECKPOINT"
@@ -56,27 +55,45 @@ echo " Output : $HG_OUTPUT_DIR"
echo ""

python - <<PYEOF
-import hydragnn
-import torch
+import json, os, torch
+from collections import OrderedDict

config_file = "${HG_CONFIG}"
checkpoint = "${HG_CHECKPOINT}"
output_dir = "${HG_OUTPUT_DIR}"

-print(f"Loading model from config: {config_file}")
-model = hydragnn.load_existing_model(config_file, checkpoint)
-model.eval()
+os.makedirs(output_dir, exist_ok=True)
+
+with open(config_file) as f:
+    config = json.load(f)
+
+from hydragnn.models.create import create_model_config
+from hydragnn.utils.distributed import get_device_name
+
+print(f"Creating model from config: {config_file}")
+model = create_model_config(
+    config=config["NeuralNetwork"],
+    verbosity=config["Verbosity"]["level"],
+)
+
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device: {device}")
model = model.to(device)

-# --- Replace the block below with your actual input graph construction ---
-# Inputs must match the featurization used during training (see upstream HydraGNN docs).
-# Example placeholder: build an ASE Atoms object or a torch_geometric Data object
-# and call hydragnn.run_prediction(model, data) per upstream API.
+print(f"Loading checkpoint: {checkpoint}")
+map_location = {"cuda:0": str(device)}
+ckpt = torch.load(checkpoint, map_location=map_location)
+state_dict = ckpt["model_state_dict"]
+# GFM checkpoints were saved with DDP (keys prefixed "module.") but we infer
+# without DDP wrapping, so strip the prefix before load_state_dict.
+if next(iter(state_dict)).startswith("module."):
+    state_dict = OrderedDict((k[len("module."):], v) for k, v in state_dict.items())
+model.load_state_dict(state_dict)
+model.eval()

print("")
-print("Model loaded successfully. Build your input graph and call:")
-print(" hydragnn.run_prediction(model, data)")
-print("See recipes/inference/README.md for details.")
+print("Model loaded successfully.")
+print("Build your input graph (torch_geometric.data.Data) and call:")
+print(" pred = model(data.to(device))")
+print(f"Results directory: {output_dir}")
PYEOF
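
The checkpoint-loading change above strips the `module.` prefix that DDP adds to parameter names. The following is a minimal, torch-free sketch of that key handling so it can be checked in isolation. The helper name `strip_ddp_prefix` is hypothetical, and the integer values are placeholders standing in for tensors. Unlike the inline version in the script, this sketch checks each key individually rather than only the first one.

```python
from collections import OrderedDict


def strip_ddp_prefix(state_dict, prefix="module."):
    """Return state_dict with a leading DDP prefix removed from its keys.

    If the first key does not carry the prefix, the input is returned
    unchanged, mirroring the guard used in the script.
    """
    if not state_dict or not next(iter(state_dict)).startswith(prefix):
        return state_dict
    return OrderedDict(
        (key[len(prefix):] if key.startswith(prefix) else key, value)
        for key, value in state_dict.items()
    )


# Placeholder values stand in for tensors; only the key handling matters here.
saved = OrderedDict([("module.conv.weight", 1), ("module.conv.bias", 2)])
print(list(strip_ddp_prefix(saved)))  # ['conv.weight', 'conv.bias']
```

Applied to a real checkpoint, the result would be passed to `model.load_state_dict(...)` as the script does.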