Add HydraGNN distributed training recipe for AMD MI355X #32
Merged
Adds sbatch_train_amd.sh for multi-node/multi-GPU HydraGNN GFM training using MPI-enabled ADIOS2 multi-dataset loading (no DDStore) on AMD Instinct MI355X via Apptainer.

Key design decisions validated on Lux cluster:
- Use --gpus-per-node=8 (not --gpus-per-task=1) so that RCCL can discover the full topology via KFD sysfs; per-task isolation causes "Could not read node" RCCL errors on MI355X
- Single-node runs only need HSA_NO_SCRATCH_RECLAIM=1 (wiki-documented)
- Multi-node RCCL env vars (IB HCA, socket ifname, etc.) are parameterized for site-specific override
- Monkey-patch AdiosMultiDataset.avg_num_neighbors to skip the expensive full-dataset degree scan at init

Also updates:
- build_overlay_amd.sh: MPI-enabled adios2 build from source with cmake
- model.yaml: add training recipe and AI4S_SHARED_DIR env var
- recipes/train/README.md: full runbook
- init-cluster: add scratch_local and RCCL network discovery
- Studio + material science skills: GPU visibility lesson, PMIx fix

Co-authored-by: Cursor <cursoragent@cursor.com>
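The avg_num_neighbors monkey-patch described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the AdiosMultiDataset class below is a local stand-in, and it assumes avg_num_neighbors is a property (the real HydraGNN class may expose it differently). The helper name patch_avg_num_neighbors is hypothetical; the value 13.74 is the precomputed figure quoted in the PR description.

```python
# Stand-in for HydraGNN's AdiosMultiDataset. The real class, its import path,
# and whether avg_num_neighbors is a property or a method are assumptions here.
class AdiosMultiDataset:
    def __init__(self, degrees):
        self.degrees = degrees
        self.scans = 0  # counts how often the expensive path runs

    @property
    def avg_num_neighbors(self):
        # Expensive path: walks every sample's neighbor count.
        self.scans += 1
        return sum(self.degrees) / len(self.degrees)


PRECOMPUTED_AVG_NUM_NEIGHBORS = 13.74  # value quoted in the PR description


def patch_avg_num_neighbors(cls, value):
    """Replace the class-level property with one that returns a constant.

    Hypothetical helper: returns the original property so the patch can be undone.
    """
    original = cls.__dict__["avg_num_neighbors"]
    cls.avg_num_neighbors = property(lambda self: value)
    return original


original = patch_avg_num_neighbors(AdiosMultiDataset, PRECOMPUTED_AVG_NUM_NEIGHBORS)
dataset = AdiosMultiDataset(degrees=[12, 14, 15])
print(dataset.avg_num_neighbors)  # served from the constant, no scan
print(dataset.scans)
```

Patching at class level means every instance created afterwards skips the scan. The trade-off is that the constant must be recomputed whenever the dataset mix changes.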
50-batch sanity test passed on MI355X (1 node, 8 GPUs):
- 2.4 s/batch steady state, 130 s total training time
- 7.5 GB peak allocated / 9.0 GB reserved per GPU
- No RCCL errors, all ranks converged
- Environment: fp64, batch_size=200, ANI1x+Alexandria datasets

Also adds HYDRAGNN_MAX_NUM_BATCH, HYDRAGNN_VALTEST, SCRATCH_LOCAL to the environment variable reference table.

Co-authored-by: Cursor <cursoragent@cursor.com>
Apptainer inherits the host process environment by default, so SLURM_JOB_ID, SLURM_JOB_NUM_NODES, SLURM_PROCID, and SLURM_CPUS_PER_TASK, which srun's PMIx launcher sets in each rank's environment, are already visible inside the container. Explicit --env lines for these were misleading, since they implied the variables would not propagate otherwise.

Co-authored-by: Cursor <cursoragent@cursor.com>
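As an illustration of relying on the inherited environment, a rank inside the container could read the SLURM variables directly. This is a sketch under stated assumptions: the function names slurm_context and needs_network_config are hypothetical and not part of this PR, and the example values are made up for the demonstration. The multi-node check mirrors the rule in this PR that single-node runs do not need IB/network config.

```python
import os

# Variables that srun sets per rank and that Apptainer passes through by default.
SLURM_KEYS = (
    "SLURM_JOB_ID",
    "SLURM_JOB_NUM_NODES",
    "SLURM_PROCID",
    "SLURM_CPUS_PER_TASK",
)


def slurm_context(env=None):
    """Collect the per-rank SLURM settings from an environment mapping."""
    env = os.environ if env is None else env
    missing = [key for key in SLURM_KEYS if key not in env]
    if missing:
        raise KeyError("missing SLURM variables: " + ", ".join(missing))
    return {
        "job_id": env["SLURM_JOB_ID"],
        "num_nodes": int(env["SLURM_JOB_NUM_NODES"]),
        "rank": int(env["SLURM_PROCID"]),
        "cpus_per_task": int(env["SLURM_CPUS_PER_TASK"]),
    }


def needs_network_config(ctx):
    """Only multi-node jobs need the IB/network settings."""
    return ctx["num_nodes"] > 1


# Made-up values standing in for what srun would export.
example_env = {
    "SLURM_JOB_ID": "12345",
    "SLURM_JOB_NUM_NODES": "2",
    "SLURM_PROCID": "3",
    "SLURM_CPUS_PER_TASK": "8",
}
ctx = slurm_context(example_env)
print(ctx, needs_network_config(ctx))
```

Calling slurm_context() with no argument reads os.environ, which is what a rank launched by srun would do.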
Previously, RCCL fell back to socket transport because the ANP plugin and libionic were not bind-mounted into the container (--rocm does not expose them). This adds the required bind-mounts and MPI ob1/tcp configuration for Pensando/ionic fabrics.

Key changes:
- Bind-mount librccl-anp.so and libionic.so.1 from host into container
- Add MPI ob1/tcp transport (ionic /31 subnets don't route for verbs)
- Read all cluster-specific values from .cluster-config.yaml (gitignored)
- Add network fields to .cluster-config.example.yaml
- Add inference recipe and convergence tracking tooling
- Remove all cluster-specific names from committed files

Validated: 2-node/16-GPU training with NCCL_DEBUG=INFO confirms RCCL-ANP plugin loaded, all 8 ionic HCAs active, GDRDMA channels established for cross-node GPU allreduce.

Co-authored-by: Cursor <cursoragent@cursor.com>
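The bind-mount and transport changes can be pictured with a small command builder. This is a hedged sketch, not the script from this PR: the host paths, the container directory /opt/host-libs, the image name hydragnn.sif, and both function names are placeholders. It also assumes the MPI in use is Open MPI, where ob1 and tcp are selected through the OMPI_MCA_pml and OMPI_MCA_btl environment variables.

```python
def build_apptainer_cmd(image, host_libs, inner_cmd, container_lib_dir="/opt/host-libs"):
    """Build an `apptainer exec` argument list that bind-mounts host libraries.

    Each host library is mounted under container_lib_dir with its own file name.
    """
    cmd = ["apptainer", "exec", "--rocm"]
    for host_path in host_libs:
        name = host_path.rsplit("/", 1)[-1]
        cmd += ["--bind", f"{host_path}:{container_lib_dir}/{name}"]
    return cmd + [image] + list(inner_cmd)


def mpi_tcp_env():
    """Open MPI settings that select the ob1 PML over the TCP BTL (assumed Open MPI)."""
    return {"OMPI_MCA_pml": "ob1", "OMPI_MCA_btl": "tcp,self"}


# Placeholder paths: the real locations come from the site's cluster config.
cmd = build_apptainer_cmd(
    "hydragnn.sif",
    ["/path/on/host/librccl-anp.so", "/path/on/host/libionic.so.1"],
    ["python", "train.py"],
)
print(" ".join(cmd))
print(mpi_tcp_env())
```

Keeping the paths as inputs matches the intent of reading cluster-specific values from .cluster-config.yaml rather than committing them.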
Summary
- sbatch_train_amd.sh for multi-node HydraGNN GFM training on AMD Instinct MI355X via SLURM + Apptainer (MPI-enabled ADIOS, no DDStore)
- build_overlay_amd.sh to compile adios2 from source with MPI, pin HydraGNN SHA, and use node-local scratch for build I/O
- recipes/train/README.md with validated performance numbers from a 50-batch sanity test (1 node, 8 GPUs)

What's included
- examples/sbatch_train_amd.sh
- examples/build_overlay_amd.sh
- examples/run_train.sh
- recipes/train/README.md
- model.yaml: AI4S_SHARED_DIR and training env vars
- .cursor/skills/ai4science-studio/SKILL.md: --gpus-per-node requirement, PMIx shared memory fix
- .cursor/skills/ai4science-material-science/SKILL.md
- .claude/commands/init-cluster.md

Validated on
Key design decisions
- $HG_OUTPUT_DIR/hydragnn-rank-<jobid>.sh for debuggability.
- avg_num_neighbors: injects precomputed value (13.74) to skip expensive full-dataset neighbor-degree scan at init.
- --env.
- NODES > 1: single-node runs don't need IB/network config.

Test plan