Merged
22 changes: 18 additions & 4 deletions .claude/commands/init-cluster.md
@@ -44,19 +44,28 @@ which apptainer 2>/dev/null && apptainer --version
which docker 2>/dev/null && docker --version 2>/dev/null
docker info 2>/dev/null | grep -i "amd\|runtime"

# 8. Common scratch / project directories
# 8. Common scratch / project directories (shared storage)
for d in /scratch/$USER /scratch /lustre /gpfs /tmp/$USER; do
  [[ -d "$d" ]] && echo "scratch: $d"
done

# 9. Internet connectivity from this node
# 9. Node-local fast storage (for MIOpen cache, tmp writes during training)
for d in /scratch /local /nvme /tmp; do
  [[ -d "$d" && -w "$d" ]] && echo "scratch_local: $d" && break
done

# 10. RCCL / network interface discovery (for multi-node)
ip -o link show up 2>/dev/null | awk -F': ' '{print $2}' | grep -v lo
ibstat 2>/dev/null | grep "CA '" | awk -F"'" '{print $2}'

# 11. Internet connectivity from this node
curl -s --connect-timeout 5 -o /dev/null -w "%{http_code}" https://huggingface.co 2>/dev/null

# 10. Proxy settings
# 12. Proxy settings
echo "HTTP_PROXY=${HTTP_PROXY:-<unset>}"
echo "HTTPS_PROXY=${HTTPS_PROXY:-<unset>}"

# 11. Existing SIF files
# 13. Existing SIF files
find "$HOME" /scratch /projects /opt -maxdepth 4 -name "*.sif" 2>/dev/null | head -20
```

@@ -131,6 +140,7 @@ slurm:
paths:
  home: "<$HOME value>"
  scratch: "<confirmed scratch path>" # also export as AI4S_SHARED_DIR
  scratch_local: "<node-local fast storage>" # for MIOpen cache, tmp writes (e.g. /scratch, /local, /tmp)
  projects: "<scratch>/models" # per-model artifact root
  sif_cache: "<scratch>/images" # shared base SIF files

@@ -147,6 +157,10 @@ network:
  internet_access: <true|false>
  proxy: "<proxy URL or empty>"

rccl:
  socket_ifname: "<interface for RCCL OOB, e.g. enp193s0f1np1>" # discover: ip link show up
  ib_hca: "<IB HCA devices, e.g. ionic_0,ionic_1,...>" # discover: ibstat | grep CA

discovered_sifs:
  # SIF files found during init (informational; confirm they are under sif_cache)
  # - <scratch>/images/pytorch_rocm7.2.2_ubuntu24.04_py3.12_pytorch_release_2.10.0.sif
4 changes: 4 additions & 0 deletions .cluster-config.example.yaml
@@ -36,3 +36,7 @@ gpu:
network:
  internet_access: true # can compute nodes reach the internet?
  proxy: "" # HTTP proxy if required (e.g. "http://proxy:8080")
  mgmt_iface: "" # Management NIC for MPI/OOB (discover: ip -o link show up)
  ib_hca: "" # IB HCA device list for RCCL (discover: ibstat | grep 'CA ')
  rccl_anp_plugin: "" # Path to librccl-anp.so on host (default: /opt/rocm/lib/librccl-anp.so)
  libionic_path: "" # Path to libionic.so.1 on host (default: /usr/lib/x86_64-linux-gnu/libionic.so.1)
25 changes: 23 additions & 2 deletions .cursor/skills/ai4science-material-science/SKILL.md
@@ -21,8 +21,29 @@ The `material_science/` domain covers **materials**, **chemistry**, and related
- Keep recipes **explicit** about input representations (graphs, crystals, SMILES, etc.) and any **unit conventions** (including **energy** vs **force** targets and graph vs node outputs where relevant).
- Prefer citing **benchmarks** and **baseline** numbers from literature or model cards when adding evaluation snippets.
- Do not commit large **proprietary structure databases**; link to public sources or describe how users supply their own data.
- **Institutional AMD** clusters and **staging** large artifacts (**Hugging Face CLI**, **Globus** / Constellation mirrors, OLCF copy-out when applicable) belong in `data-access.md`–style runbooks for **ADIOS** or similar scientific I/O when models use them.
- **Institutional AMD** clusters and **staging** large artifacts (**Hugging Face CLI**, **Globus** / Constellation mirrors, OLCF copy-out when applicable) belong in `data-access.md`-style runbooks for **ADIOS** or similar scientific I/O when models use them.

## HydraGNN multi-node training pattern

HydraGNN uses MPI-based distributed training with ADIOS datasets. Key patterns for AMD clusters:

- **Launch:** `srun --mpi=pmix apptainer exec --rocm --overlay <overlay>:ro $SIF bash <rank_script>`. The `--mpi=pmix` flag is required because `mpi4py` calls `MPI_Init` inside the container.
- **Non-DDStore path (phase 1):** Use `--multi --multi_model_list=<datasets>` without `--ddstore`. Each rank opens ADIOS files directly via `AdiosMultiDataset`. Simpler and sufficient for moderate scales (1-8 nodes).
- **DDStore path (phase 2):** Add the `--ddstore` flag plus the env vars `HYDRAGNN_AGGR_BACKEND=mpi`, `HYDRAGNN_DDSTORE_METHOD=1`, and `HYDRAGNN_CUSTOM_DATALOADER=1`. Required for large-scale runs (32+ nodes), where per-rank ADIOS I/O becomes a bottleneck.
- **Dataset convention:** ADIOS datasets are expected at `./dataset/<name>-v2.bp` relative to the training script. Symlink from shared storage rather than copying.
- **Config file:** The upstream `gfm_mlip.json` is hardware-agnostic and works on any cluster. Use CLI flags (`--batch_size`, `--num_epoch`, `--precision`) to override parameters at runtime without modifying the file.
- **Env vars inside container:** Set `OMP_NUM_THREADS` to match `--cpus-per-task`, and set `MIOPEN_DISABLE_CACHE=1`, `MIOPEN_USER_DB_PATH=$SCRATCH_LOCAL/<jobid>/miopen` (node-local fast storage from `.cluster-config.yaml`), and `HYDRAGNN_USE_VARIABLE_GRAPH_SIZE=1`.
- **Multi-node MPI transport (ob1/tcp):** On Pensando/ionic fabrics, data NICs use `/31` subnets that don't route between nodes, so IB verbs cannot carry MPI traffic. Use `OMPI_MCA_pml=ob1`, `OMPI_MCA_btl=tcp,self`, `OMPI_MCA_btl_tcp_if_include=$MGMT_NIC`, and `MPI4PY_RC_THREADS=false`. RCCL uses the ANP plugin (`librccl-anp.so`) over the ionic native transport (RoCEv2/GDRDMA) for GPU allreduce, independent of MPI. The ANP plugin and `libionic.so.1` must be bind-mounted from the host into the container.
- **Convergence capture:** Run with `HYDRAGNN_VALTEST=1` to get epoch-level loss reporting. Parse with `examples/parse_convergence.py --log <slurm_output>`.
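
The container environment listed above could be assembled per job along these lines. This is a minimal sketch, not part of HydraGNN: the helper names `container_env` and `as_apptainer_flags` are hypothetical, and it assumes the job runs under Slurm so that `SLURM_JOB_ID` and `SLURM_CPUS_PER_TASK` are set, with `scratch_local` taken from `.cluster-config.yaml`.

```python
import os


def container_env(scratch_local, environ=os.environ):
    """Build the per-job env vars named in the list above (hypothetical helper)."""
    job_id = environ.get("SLURM_JOB_ID", "nojob")   # fallback is an assumption
    cpus = environ.get("SLURM_CPUS_PER_TASK", "1")  # mirrors --cpus-per-task
    return {
        "OMP_NUM_THREADS": cpus,
        "MIOPEN_DISABLE_CACHE": "1",
        # Node-local path, one directory per job, as described above.
        "MIOPEN_USER_DB_PATH": os.path.join(scratch_local, job_id, "miopen"),
        "HYDRAGNN_USE_VARIABLE_GRAPH_SIZE": "1",
    }


def as_apptainer_flags(env):
    """Turn the dict into repeated `--env KEY=VALUE` arguments."""
    flags = []
    for key, value in env.items():
        flags += ["--env", f"{key}={value}"]
    return flags
```

The resulting list can be spliced into the `apptainer exec` command line; exporting the same variables in the rank script would work equally well.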

## HydraGNN inference pattern

GFM 2024 checkpoints use an older model architecture (branch `Predictive_GFM_2024`) that is structurally incompatible with the training-pinned SHA. The inference script auto-clones the correct branch and uses `sys.path.insert(0, ...)` to load matching code. Key points:

- **Code version mismatch:** Never load old GFM checkpoints with new HydraGNN code — state_dict keys differ (e.g. `heads_NN.0.0.weight` vs `heads_NN.0.energy.0.weight`).
- **Config format:** Old configs use `model_type`; new code expects `mpnn_type`. In old configs `output_heads` is a flat dict; new code expects a list of dicts.
- **Inference repo:** Set `HG_INFER_REPO` or let `run_inference.sh` clone `Predictive_GFM_2024` to `$HG_OUTPUT_DIR/HydraGNN-infer` on first run.
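
A quick pre-flight check can tell the two config generations apart before a checkpoint is loaded. The sketch below relies only on the `model_type` / `mpnn_type` distinction stated above; the function names are hypothetical, and because the exact nesting of these keys is not given here, it searches the whole config tree rather than assuming a path.

```python
def find_key(node, key):
    """Return True if `key` appears anywhere in a nested dict/list structure."""
    if isinstance(node, dict):
        return key in node or any(find_key(v, key) for v in node.values())
    if isinstance(node, list):
        return any(find_key(v, key) for v in node)
    return False


def config_generation(config):
    """Classify a loaded JSON config as 'new', 'old', or 'unknown'.

    Assumption: a config that contains `mpnn_type` is treated as new even if
    a stray `model_type` key is also present.
    """
    if find_key(config, "mpnn_type"):
        return "new"
    if find_key(config, "model_type"):
        return "old"
    return "unknown"
```

An inference wrapper might refuse to proceed with the training-pinned code when this returns `"old"` and switch to the `Predictive_GFM_2024` checkout instead.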

## Safety and licensing

- Chemistry and materials models may have **use-based** restrictions; mirror the model cards limitations in the local `README.md`.
- Chemistry and materials models may have **use-based** restrictions; mirror the model card's limitations in the local `README.md`.
34 changes: 34 additions & 0 deletions .cursor/skills/ai4science-studio/SKILL.md
@@ -315,6 +315,40 @@ srun --mpi=pmix apptainer exec \

**Multi-node scaling:** Change `--nodes=N --ntasks=N*<gpus_per_node>` in the SBATCH header; the `srun` command is unchanged.

## GPU visibility: `--gpus-per-node` NOT `--gpus-per-task` (MI355X / RCCL)

**Use `--gpus-per-node=8`, NOT `--gpus-per-task=1`** for multi-GPU distributed training with the RCCL/NCCL backend on AMD Instinct MI355X.

**Why:** With `--gpus-per-task=1`, SLURM's cgroup restricts each rank to seeing only 1 GPU (`torch.cuda.device_count()=1`). RCCL still tries to read the full KFD topology (`/sys/class/kfd/kfd/topology/nodes/*/io_links/*/properties`) to discover XGMI links, but cannot reconcile 8-GPU topology info with 1-GPU access. This produces:
```
NCCL WARN Could not read node # N
ncclUnhandledCudaError: Call to CUDA function failed.
```

**Fix:** Use `--gpus-per-node=8` (all GPUs visible to all ranks). Frameworks like HydraGNN, DeepSpeed, and plain PyTorch DDP use `SLURM_LOCALID` to select the correct device per rank when `device_count() > 1`:
```python
import os
import torch
local_rank = int(os.environ["SLURM_LOCALID"])  # node-local rank ID set by srun
torch.cuda.set_device(local_rank)  # rank 0→GPU 0, rank 1→GPU 1, ...
```
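
A small guard can surface the misconfiguration before RCCL initializes. This is a sketch under the assumption that ranks are launched by `srun`; `select_local_device` is a hypothetical helper, not a HydraGNN or PyTorch API. Note that it only trips on ranks whose local ID is 1 or higher, since local rank 0 still sees one GPU under `--gpus-per-task=1`.

```python
import os


def select_local_device(device_count, environ=os.environ):
    """Map SLURM_LOCALID to a GPU index, failing early if that GPU is hidden."""
    local_rank = int(environ["SLURM_LOCALID"])
    if local_rank >= device_count:
        raise RuntimeError(
            f"local rank {local_rank} has no matching GPU "
            f"(device_count={device_count}); check for --gpus-per-task=1"
        )
    return local_rank
```

In a training script this would be called as `torch.cuda.set_device(select_local_device(torch.cuda.device_count()))`.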

**MI355X RCCL env vars (from AMD documentation):**
- Single-node: only `HSA_NO_SCRATCH_RECLAIM=1` is needed
- Multi-node: full set (`NCCL_NET_PLUGIN`, `NCCL_IB_HCA`, `NCCL_IB_GID_INDEX`, etc.) — see `sbatch_train_amd.sh` in HydraGNN examples

**Multi-node MPI transport (ob1/tcp — correct for Pensando/ionic fabric):**
- Pensando ionic data NICs use `/31` point-to-point subnets; nodes cannot route to each other on data interfaces. Standard IB verbs do NOT work for MPI inter-node traffic.
- MPI uses ob1 PML with TCP BTL over the management NIC: `OMPI_MCA_pml=ob1`, `OMPI_MCA_btl=tcp,self`, `OMPI_MCA_btl_tcp_if_include=<mgmt_iface>`, `MPI4PY_RC_THREADS=false`.
- RCCL uses the ANP plugin (`librccl-anp.so`) plus `libionic.so.1` (both bind-mounted from the host) over the ionic native transport (RoCEv2/GDRDMA) for GPU allreduce, independent of MPI/UCX.
- Without the ANP bind-mounts, RCCL falls back to socket transport over the management NIC, which works but is much slower.
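
The MPI transport settings above can be generated from the management interface name recorded in the cluster config. This is a minimal sketch: `mpi_tcp_env` and `as_export_lines` are hypothetical helpers, and the interface name in the usage note is only the example value quoted in the config template.

```python
import shlex


def mpi_tcp_env(mgmt_iface):
    """Return the Open MPI / mpi4py settings for ob1 over TCP on one interface."""
    if not mgmt_iface:
        raise ValueError("mgmt_iface must name the management NIC")
    return {
        "OMPI_MCA_pml": "ob1",
        "OMPI_MCA_btl": "tcp,self",
        "OMPI_MCA_btl_tcp_if_include": mgmt_iface,
        "MPI4PY_RC_THREADS": "false",
    }


def as_export_lines(env):
    """Render the settings as shell `export` lines for a rank script."""
    return [f"export {key}={shlex.quote(value)}" for key, value in env.items()]
```

For example, `as_export_lines(mpi_tcp_env("enp193s0f1np1"))` yields lines that can be pasted into the rank script. The RCCL variables are deliberately left out, since RCCL's transport is configured independently of MPI.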

## PMIx shared memory fix for Apptainer

When using `srun --mpi=pmix` with Apptainer, ranks may segfault with `PMIx ERROR: UNREACHABLE` if `/dev/shm` or `/tmp` is shared from the host. Fix: pass these flags to `apptainer exec`:
```bash
--env PMIX_MCA_gds=hash
--env PMIX_MCA_psec=native
```

---

# Docker-specific patterns