Add HydraGNN AMD inference scripts and fix overlay build pattern#31

Merged: ashwinma merged 1 commit into main from feat/hydragnn-amd-inference on May 17, 2026

Conversation

@ashwinma
Collaborator

Summary

  • New build_overlay_amd.sh: builds ext3 overlay on $TMPDIR (node-local disk),
    copies finished image to NFS as one sequential write — avoids ext3-on-NFS
    FUSE per-file overhead. Uses Looong01/pyg-rocm-build r15 wheels for ROCm 7.1+.
  • New sbatch_infer_amd.sh: SLURM job for Apptainer-based inference.
  • Fix run_inference.sh: GFM checkpoints are saved under DDP, so their keys
    carry a module. prefix, but inference runs without the DDP wrapper. The
    script now strips the prefix instead of adding it. Also removes the
    Docker-era cd /workspace/HydraGNN and replaces hydragnn.run_prediction()
    (which requires a Dataset config key absent from GFM configs) with direct
    create_model_config + torch.load + load_state_dict.
  • Propagate $TMPDIR overlay pattern to ORBIT-2/examples/build_overlay_amd.sh.
  • Update Cursor and Claude skills with lessons learned.
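
The $TMPDIR overlay pattern from the first bullet can be sketched in Python as follows. This is only an illustration of the idea (build on node-local scratch, then publish with one sequential copy), not the contents of build_overlay_amd.sh, which is a shell script. The helper name `build_then_publish`, the `fake_build` stand-in, and the `.partial` rename step are assumptions made for this sketch.

```python
import os
import shutil
import tempfile
from pathlib import Path


def build_then_publish(build_fn, dest, scratch_root=None):
    """Build a file on local scratch, then copy it to dest in one pass.

    build_fn is a hypothetical callable that writes the finished image to
    the path it is given. scratch_root defaults to tempfile.gettempdir(),
    which honours the TMPDIR environment variable.
    """
    scratch_root = scratch_root or tempfile.gettempdir()
    dest = Path(dest)
    with tempfile.TemporaryDirectory(dir=scratch_root) as workdir:
        local_image = Path(workdir) / dest.name
        # All the small per-file writes happen on node-local disk.
        build_fn(local_image)
        # One sequential copy to shared storage. Copying to a temporary
        # name first (an assumption of this sketch) keeps readers from
        # seeing a half-written image.
        partial = dest.parent / (dest.name + ".partial")
        shutil.copyfile(local_image, partial)
        os.replace(partial, dest)
    return dest


def fake_build(path):
    # Stand-in for the real overlay build, which is not shown in this PR page.
    path.write_bytes(b"overlay-bytes")


out_dir = tempfile.mkdtemp()
result = build_then_publish(fake_build, Path(out_dir) / "overlay.img")
print(result.read_bytes())
```

The point of the design is that the shared filesystem only ever sees one large file write, regardless of how many small files the build touches inside the image.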

Validated on AMD MI355X (ROCm 7.2.2, torch 2.10, lux partition)

  • Overlay build: COMPLETED
  • Inference: COMPLETED (exit 0, 63s), gfm_0.229 checkpoint loaded cleanly

- Add build_overlay_amd.sh: builds ext3 overlay on $TMPDIR (node-local
  disk) then copies the finished image to NFS as one sequential write,
  avoiding the ext3-on-NFS per-file FUSE overhead. Uses
  Looong01/pyg-rocm-build release 15 wheels (torch-scatter, torch-sparse,
  etc.) for ROCm 7.1+.
- Add sbatch_infer_amd.sh: SLURM job script for inference with Apptainer.
- Fix run_inference.sh: GFM checkpoints are saved with DDP (module.* keys)
  but inference runs without DDP wrapping — strip the module. prefix rather
  than adding it. Removes Docker-era cd /workspace/HydraGNN and replaces
  hydragnn.run_prediction() (requires Dataset config key absent from GFM
  configs) with direct create_model_config + torch.load + load_state_dict.
- Apply same $TMPDIR overlay pattern to ORBIT-2/examples/build_overlay_amd.sh.
- Propagate overlay and inference lessons to Cursor and Claude skills.
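
The checkpoint-key fix described above can be illustrated with a minimal sketch. It uses a plain mapping with integer values as a stand-in for a real state dict, so it runs without PyTorch. The helper name `strip_ddp_prefix` and the example keys are assumptions for illustration and are not taken from run_inference.sh.

```python
from collections import OrderedDict


def strip_ddp_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading DDP prefix removed.

    Keys that do not start with the prefix are kept unchanged, so the
    function is also safe on checkpoints saved without DDP.
    """
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


# Hypothetical keys; in a real checkpoint the values would be tensors.
ddp_state = OrderedDict([("module.conv.weight", 1), ("module.conv.bias", 2)])
plain_state = strip_ddp_prefix(ddp_state)
print(list(plain_state))
```

In a PyTorch setting the cleaned mapping would then be passed to `model.load_state_dict(...)` on the unwrapped model. Where the state dict sits inside the file returned by `torch.load` depends on how the checkpoint was saved and is not stated here.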

Validated end-to-end on AMD MI355X (ROCm 7.2.2, torch 2.10) with
mlupopa/HydraGNN_Predictive_GFM_2024 gfm_0.229 checkpoint.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
@ashwinma ashwinma merged commit 34f5a93 into main May 17, 2026
2 checks passed
@ashwinma ashwinma deleted the feat/hydragnn-amd-inference branch May 17, 2026 21:01