AMDResearch · ashwinma · May 20, 2026 · May 19, 2026 · May 19, 2026 · May 19, 2026
diff --git a/.claude/commands/init-cluster.md b/.claude/commands/init-cluster.md
@@ -44,19 +44,28 @@ which apptainer 2>/dev/null && apptainer --version
 which docker 2>/dev/null && docker --version 2>/dev/null
 docker info 2>/dev/null | grep -i "amd\|runtime"
 
-# 8. Common scratch / project directories
+# 8. Common scratch / project directories (shared storage)
 for d in /scratch/$USER /scratch /lustre /gpfs /tmp/$USER; do
     [[ -d "$d" ]] && echo "scratch: $d"
 done
 
-# 9. Internet connectivity from this node
+# 9. Node-local fast storage (for MIOpen cache, tmp writes during training)
+for d in /scratch /local /nvme /tmp; do
+    [[ -d "$d" && -w "$d" ]] && echo "scratch_local: $d" && break
+done
+
+# 10. RCCL / network interface discovery (for multi-node)
+ip -o link show up 2>/dev/null | awk -F': ' '{print $2}' | grep -vx lo
+ibstat 2>/dev/null | grep "CA '" | awk -F"'" '{print $2}'
+
+# 11. Internet connectivity from this node
 curl -s --connect-timeout 5 -o /dev/null -w "%{http_code}" https://huggingface.co 2>/dev/null
 
-# 10. Proxy settings
+# 12. Proxy settings
 echo "HTTP_PROXY=${HTTP_PROXY:-<unset>}"
 echo "HTTPS_PROXY=${HTTPS_PROXY:-<unset>}"
 
-# 11. Existing SIF files
+# 13. Existing SIF files
 find "$HOME" /scratch /projects /opt -maxdepth 4 -name "*.sif" 2>/dev/null | head -20
 ```
 
@@ -131,6 +140,7 @@ slurm:
 paths:
   home: "<$HOME value>"
   scratch: "<confirmed scratch path>"   # also export as AI4S_SHARED_DIR
+  scratch_local: "<node-local fast storage>"  # for MIOpen cache, tmp writes (e.g. /scratch, /local, /tmp)
   projects: "<scratch>/models"          # per-model artifact root
   sif_cache: "<scratch>/images"         # shared base SIF files
 
@@ -147,6 +157,10 @@ network:
   internet_access: <true|false>
   proxy: "<proxy URL or empty>"
 
+rccl:
+  socket_ifname: "<interface for RCCL OOB, e.g. enp193s0f1np1>"  # discover: ip link show up
+  ib_hca: "<IB HCA devices, e.g. ionic_0,ionic_1,...>"           # discover: ibstat | grep "CA '"
+
 discovered_sifs:
   # SIF files found during init (informational — confirm they are under sif_cache)
   # - <scratch>/images/pytorch_rocm7.2.2_ubuntu24.04_py3.12_pytorch_release_2.10.0.sif

diff --git a/.cluster-config.example.yaml b/.cluster-config.example.yaml
@@ -36,3 +36,7 @@ gpu:
 network:
   internet_access: true  # can compute nodes reach the internet?
   proxy: ""              # HTTP proxy if required (e.g. "http://proxy:8080")
+  mgmt_iface: ""         # Management NIC for MPI/OOB (discover: ip -o link show up)
+  ib_hca: ""             # IB HCA device list for RCCL (discover: ibstat | grep "CA '")
+  rccl_anp_plugin: ""    # Path to librccl-anp.so on host (default: /opt/rocm/lib/librccl-anp.so)
+  libionic_path: ""      # Path to libionic.so.1 on host (default: /usr/lib/x86_64-linux-gnu/libionic.so.1)
diff --git a/.cursor/skills/ai4science-material-science/SKILL.md b/.cursor/skills/ai4science-material-science/SKILL.md
@@ -21,8 +21,29 @@ The `material_science/` domain covers **materials**, **chemistry**, and related
 - Keep recipes **explicit** about input representations (graphs, crystals, SMILES, etc.) and any **unit conventions** (including **energy** vs **force** targets and graph vs node outputs where relevant).
 - Prefer citing **benchmarks** and **baseline** numbers from literature or model cards when adding evaluation snippets.
 - Do not commit large **proprietary structure databases**; link to public sources or describe how users supply their own data.
-- **Institutional AMD** clusters and **staging** large artifacts (**Hugging Face CLI**, **Globus** / Constellation mirrors, OLCF copy-out when applicable) belong in `data-access.md`–style runbooks for **ADIOS** or similar scientific I/O when models use them.
+- **Institutional AMD** clusters and **staging** large artifacts (**Hugging Face CLI**, **Globus** / Constellation mirrors, OLCF copy-out when applicable) belong in `data-access.md`-style runbooks for **ADIOS** or similar scientific I/O when models use them.
+
+## HydraGNN multi-node training pattern
+
+HydraGNN uses MPI-based distributed training with ADIOS datasets. Key patterns for AMD clusters:
+
+- **Launch:** `srun --mpi=pmix apptainer exec --rocm --overlay <overlay>:ro $SIF bash <rank_script>`. The `--mpi=pmix` flag is required because `mpi4py` calls `MPI_Init` inside the container.
+- **Non-DDStore path (phase 1):** Use `--multi --multi_model_list=<datasets>` without `--ddstore`. Each rank opens ADIOS files directly via `AdiosMultiDataset`. Simpler and sufficient for moderate scales (1-8 nodes).
+- **DDStore path (phase 2):** Add the `--ddstore` flag plus the env vars `HYDRAGNN_AGGR_BACKEND=mpi`, `HYDRAGNN_DDSTORE_METHOD=1`, `HYDRAGNN_CUSTOM_DATALOADER=1`. Required for large-scale runs (32+ nodes), where per-rank ADIOS I/O becomes a bottleneck.
+- **Dataset convention:** ADIOS datasets are expected at `./dataset/<name>-v2.bp` relative to the training script. Symlink from shared storage rather than copying.
+- **Config file:** The upstream `gfm_mlip.json` is hardware-agnostic and works on any cluster. Use CLI flags (`--batch_size`, `--num_epoch`, `--precision`) to override parameters at runtime without modifying the file.
+- **Env vars inside container:** Set `OMP_NUM_THREADS` to match `--cpus-per-task`, and also set `MIOPEN_DISABLE_CACHE=1`, `MIOPEN_USER_DB_PATH=$SCRATCH_LOCAL/<jobid>/miopen` (node-local fast storage from `.cluster-config.yaml`), and `HYDRAGNN_USE_VARIABLE_GRAPH_SIZE=1`.
+- **Multi-node MPI transport (ob1/tcp):** On Pensando/ionic fabrics, data NICs use `/31` subnets that don't route between nodes, so IB verbs cannot carry MPI traffic. Use `OMPI_MCA_pml=ob1`, `OMPI_MCA_btl=tcp,self`, `OMPI_MCA_btl_tcp_if_include=$MGMT_NIC`, `MPI4PY_RC_THREADS=false`. Independently of MPI, RCCL uses the ANP plugin (`librccl-anp.so`) over ionic native transport (RoCEv2/GDRDMA) for GPU allreduce. The ANP plugin and `libionic.so.1` must be bind-mounted from the host into the container.
+- **Convergence capture:** Run with `HYDRAGNN_VALTEST=1` to get epoch-level loss reporting. Parse with `examples/parse_convergence.py --log <slurm_output>`.
+
+## HydraGNN inference pattern
+
+GFM 2024 checkpoints use an older model architecture (branch `Predictive_GFM_2024`) that is structurally incompatible with the training-pinned SHA. The inference script auto-clones the correct branch and uses `sys.path.insert(0, ...)` to load matching code. Key points:
+
+- **Code version mismatch:** Never load old GFM checkpoints with new HydraGNN code — state_dict keys differ (e.g. `heads_NN.0.0.weight` vs `heads_NN.0.energy.0.weight`).
+- **Config format:** Old configs use `model_type`; new code expects `mpnn_type`. Old `output_heads` is a flat dict; new code expects a list of dicts.
+- **Inference repo:** Set `HG_INFER_REPO` or let `run_inference.sh` clone `Predictive_GFM_2024` to `$HG_OUTPUT_DIR/HydraGNN-infer` on first run.
 
 ## Safety and licensing
 
-- Chemistry and materials models may have **use-based** restrictions; mirror the model card’s limitations in the local `README.md`.
+- Chemistry and materials models may have **use-based** restrictions; mirror the model card's limitations in the local `README.md`.
diff --git a/.cursor/skills/ai4science-studio/SKILL.md b/.cursor/skills/ai4science-studio/SKILL.md
@@ -315,6 +315,40 @@ srun --mpi=pmix apptainer exec \
 
 **Multi-node scaling:** Change `--nodes=N --ntasks=N*<gpus_per_node>` in the SBATCH header; the `srun` command is unchanged.
 
+## GPU visibility: `--gpus-per-node` NOT `--gpus-per-task` (MI355X / RCCL)
+
+**Use `--gpus-per-node=8`, NOT `--gpus-per-task=1`** for multi-GPU distributed training with the RCCL/NCCL backend on AMD Instinct MI355X.
+
+**Why:** With `--gpus-per-task=1`, SLURM's cgroup restricts each rank to a single GPU (`torch.cuda.device_count() == 1`). RCCL still reads the full KFD topology (`/sys/class/kfd/kfd/topology/nodes/*/io_links/*/properties`) to discover XGMI links, but cannot reconcile the 8-GPU topology with 1-GPU access. This produces:
+```
+NCCL WARN Could not read node # N
+ncclUnhandledCudaError: Call to CUDA function failed.
+```
+
+**Fix:** Use `--gpus-per-node=8` (all GPUs visible to all ranks). Frameworks like HydraGNN, DeepSpeed, and plain PyTorch DDP can then use `SLURM_LOCALID` to select the correct device per rank when `device_count() > 1`:
+```python
+local_rank = int(os.environ["SLURM_LOCALID"])
+torch.cuda.set_device(local_rank)  # rank 0→GPU 0, rank 1→GPU 1, ...
+```
+
+**MI355X RCCL env vars (from AMD documentation):**
+- Single-node: only `HSA_NO_SCRATCH_RECLAIM=1` is needed
+- Multi-node: full set (`NCCL_NET_PLUGIN`, `NCCL_IB_HCA`, `NCCL_IB_GID_INDEX`, etc.) — see `sbatch_train_amd.sh` in HydraGNN examples
+
+**Multi-node MPI transport (ob1/tcp — correct for Pensando/ionic fabric):**
+- Pensando ionic data NICs use `/31` point-to-point subnets; nodes cannot route to each other on data interfaces. Standard IB verbs do NOT work for MPI inter-node traffic.
+- MPI uses ob1 PML with TCP BTL over the management NIC: `OMPI_MCA_pml=ob1`, `OMPI_MCA_btl=tcp,self`, `OMPI_MCA_btl_tcp_if_include=<mgmt_iface>`, `MPI4PY_RC_THREADS=false`.
+- RCCL uses the ANP plugin (`librccl-anp.so`) plus `libionic.so.1` (both bind-mounted from the host) over ionic native transport (RoCEv2/GDRDMA) for GPU allreduce, independent of MPI/UCX.
+- Without the ANP bind-mounts, RCCL falls back to socket transport over the management NIC (works but much slower).
+
+## PMIx shared memory fix for Apptainer
+
+When using `srun --mpi=pmix` with Apptainer, ranks may segfault with `PMIx ERROR: UNREACHABLE` if `/dev/shm` or `/tmp` is shared from the host. Fix by passing these flags to `apptainer exec`:
+```bash
+--env PMIX_MCA_gds=hash
+--env PMIX_MCA_psec=native
+```
+
 ---
 
 # Docker-specific patterns