HydraGNN perf characterization + iterative sysopt-loop recipe (MI355X) by ashwinma · Pull Request #36 · AMDResearch/ai4science-studio

ashwinma · 2026-05-26T21:45:09Z

Summary

This PR adds a complete HydraGNN performance characterization and iterative optimization workflow for AMD MI355X GPU clusters, built around an agent-first recipe pattern using TraceLens/Omnistat observability tooling and Claude Code CLI.

What's new

Performance analysis recipe (recipes/perf-analysis/)
A multi-subagent bottleneck-analysis workflow that runs after a HydraGNN training job:

launcher subagent: submits the job, manages Omnistat user-mode telemetry + VictoriaMetrics, writes a structured manifest.json
Parallel tracelens_analyst + omnistat_analyst subagents: analyze PyTorch kineto traces and GPU hardware counters (MFMA, HBM, power) via PromQL
Parallel tracelens_verifier + omnistat_verifier subagents: independently challenge each analyst's top claims to guard against hallucination
synthesizer subagent: merges verified claims into a single combined_report.md

Iterative sysopt-loop recipe (recipes/perf-optimizer-loop/)
An autonomous optimizer loop that drives repeated sbatch submissions, applies one lever per iteration, and accepts or reverts based on the primary FOM (epoch_time_s):

orchestrator subagent: owns the loop, polls SLURM, handles resume-from-disk, and dispatches all other subagents
lever_picker subagent: reads lever_catalog.yaml + the previous iteration's combined_report.md to score and pick the next lever
fom_extractor subagent: computes six figures of merit (epoch time, throughput, MFMA TFLOPS, energy, power, loss) and builds a 1-second TraceLens↔Omnistat kernel-correlation table
story_writer subagent: produces a story.md narrative and foms.png progression chart
run_optimizer_loop.sh: CLI driver that invokes the orchestrator via Claude Code CLI inside a detached tmux session, with preflight checks and retry-with-backoff
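
The accept-or-revert rule described above can be sketched in a few lines. This is a minimal illustration, not the orchestrator's actual logic: `LoopState`, `judge_iteration`, the `min_gain` threshold, the lever names, and the timing values are all hypothetical; only the idea of comparing each iteration's `epoch_time_s` against the best value so far comes from the description.

```python
from dataclasses import dataclass, field


@dataclass
class LoopState:
    """Hypothetical loop state: best primary FOM so far plus lever history."""
    best_epoch_time_s: float
    accepted: list = field(default_factory=list)
    reverted: list = field(default_factory=list)


def judge_iteration(state: LoopState, lever: str, epoch_time_s: float,
                    min_gain: float = 0.0) -> bool:
    """Accept the lever only if epoch time improved by more than min_gain.

    min_gain is a fractional improvement (0.01 == 1%); it is an assumption
    added here to keep run-to-run noise from being accepted as a win.
    """
    gain = (state.best_epoch_time_s - epoch_time_s) / state.best_epoch_time_s
    if gain > min_gain:
        state.best_epoch_time_s = epoch_time_s
        state.accepted.append(lever)
        return True
    state.reverted.append(lever)
    return False


# Illustrative values only, not measurements from this PR.
state = LoopState(best_epoch_time_s=100.0)
judge_iteration(state, "lever_a", 98.0)    # faster: accepted, becomes new baseline
judge_iteration(state, "lever_b", 101.0)   # slower: reverted, baseline unchanged
print(state)
```

Applying one lever per iteration keeps the comparison attributable: each accepted or reverted entry corresponds to exactly one change.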

Instrumented sbatch wrapper (sbatch_train_perf_amd.sh)
A perf-focused variant of the existing training script that adds: Omnistat user-mode telemetry, per-rank kineto profiling, optional rocprofiler-sdk kernel-dispatch tracing, a per-node mount-health probe (exits 42 if any node has a broken NFS mount), and an opt-in pre-train hook for lever patches without editing the script.

Node health microbench (microbench_node_health.sh)
A ~30-second per-node survey: host inventory, container mount probe, dual-NUMA STREAM bandwidth, and HIP kernel launch latency. Designed to run as a SLURM Prolog to surface per-node hardware regressions before they waste a full job.

Lever catalog (lever_catalog.yaml)
Machine-readable record of every optimization knob tried, including blocked levers with full evidence so future agents never re-attempt known-dead paths:

torch.compile: blocked by AOT-Autograd double-backward (HydraGNN's energy_force_loss calls autograd.grad(create_graph=True))
torch.jit.script: blocked by **conv_args expansion at MACEStack.py:397
PYTORCH_TUNABLEOP_TUNING=1: blocked by GPU memory access faults in hipBLASLt tuning on ROCm 7.2.2
rccl_high_priority (TORCH_NCCL_HIGH_PRIORITY=1 GPU_MAX_HW_QUEUES=2): accepted, −1.3% epoch time, −8.4% J/sample — the one lever that worked, now baked into the sbatch defaults
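
To illustrate how a picker could use such a catalog, here is a small sketch that models the entries as plain Python dicts. The schema (`name`, `status`, `evidence`), the status values, and the `example_untried_lever` entry are assumptions made for this example; the real layout of lever_catalog.yaml is not shown in this PR description.

```python
# Hypothetical in-memory view of the catalog. The four real entries are taken
# from the list above; the field names and the last entry are illustrative.
CATALOG = [
    {"name": "torch.compile", "status": "blocked",
     "evidence": "AOT-Autograd double-backward in energy_force_loss"},
    {"name": "torch.jit.script", "status": "blocked",
     "evidence": "**conv_args expansion at MACEStack.py:397"},
    {"name": "PYTORCH_TUNABLEOP_TUNING=1", "status": "blocked",
     "evidence": "GPU memory access faults in hipBLASLt tuning on ROCm 7.2.2"},
    {"name": "rccl_high_priority", "status": "accepted",
     "evidence": "-1.3% epoch time, -8.4% J/sample"},
    {"name": "example_untried_lever", "status": "untried", "evidence": ""},
]


def untried_levers(catalog):
    """Levers that are neither blocked nor already accepted."""
    return [entry["name"] for entry in catalog if entry["status"] == "untried"]


def blocked_without_evidence(catalog):
    """Blocked levers with no recorded reason; these invite re-attempts."""
    return [entry["name"] for entry in catalog
            if entry["status"] == "blocked" and not entry["evidence"].strip()]


print("candidates:", untried_levers(CATALOG))
print("needs evidence:", blocked_without_evidence(CATALOG))
```

Keeping the evidence next to each blocked entry is what lets a later agent skip a dead path without re-running it.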

Key findings (2-node MI355X, 16× GPUs, HydraGNN GFM-MLIP MACE)

Best: 75.0 s/epoch at batch=400, fp64, 8 workers, persistent_workers=1
Workload is dispatch-bound: a HIP launch latency of ~3.7 µs caps dispatch at ~270K launches/s (3.7 µs × 270K/s ≈ 1 s of launch time per wall-clock second). The evidence is that fp32 runs < 2% faster than fp64 despite having 2× the peak MFMA throughput
Path to a step-change requires upstream HydraGNN PRs: move energy_force_loss into EnhancedModelWrapper.forward (unblocks torch.compile) and unpack **conv_args at MACEStack.py:397 (unblocks torch.jit.script)
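
The dispatch-bound arithmetic can be checked directly. The sketch below uses only the two figures quoted above (~3.7 µs per launch, ~270K launches/s) and assumes launches are issued back to back from a single dispatch path, which is a simplification.

```python
launch_latency_s = 3.7e-6   # ~3.7 µs per HIP kernel launch (from the findings)
launches_per_s = 270_000    # ~270K launches/s (from the findings)

# Upper bound on launch rate if every launch costs the full latency.
ceiling_launches_per_s = 1.0 / launch_latency_s

# Fraction of each wall-clock second consumed by launch latency alone.
launch_bound_fraction = launch_latency_s * launches_per_s

print(f"launch-rate ceiling: {ceiling_launches_per_s:,.0f} launches/s")
print(f"wall time spent launching: {launch_bound_fraction:.1%}")
```

With essentially the whole second spent on launches, doubling peak math throughput (fp64 to fp32) leaves little room to shorten the step, which is consistent with the < 2% speedup reported.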

Lessons captured

19 lessons in .cursor/skills/ai4science-perf-analysis/SKILL.md covering: per-node mount-health probe design, omnistat config pitfalls, dispatch-vs-compute discrimination via fp32/fp64, rocprofiler counter validation, SLURM --export gotchas, and how confounded baselines produce false node-class narratives.

Adds a multi-subagent bottleneck-analysis workflow for HydraGNN on Lux:

- recipes/perf-analysis/: README + 6 agent prompt files (launcher, tracelens_analyst, tracelens_verifier, omnistat_analyst, omnistat_verifier, synthesizer) and an Omnistat user-mode config template.
- examples/sbatch_train_perf_amd.sh: 2-node SLURM wrapper that injects the PyTorch profiler into HydraGNN's JSON config (under NeuralNetwork.Profile, where train_validate_test() actually reads it), starts/stops omnistat-usermode around srun, and renders the omnistat config from a @JOB_DIR@ template at submit time (configparser does not interpolate os.environ, so %(SLURM_JOB_ID)s is unreliable).
- .cursor/skills/ai4science-perf-analysis/SKILL.md: orchestration guide for the main agent.
- .cursor/skills/ai4science-studio/SKILL.md: captures the cross-cutting lessons surfaced by the smoke run on job 6762: configparser interpolation, /tmp/omni_rmsjobinfo permission collision, VM -fs.disableMmap on login node, PromQL rmsjob_info join, HydraGNN Profile-block scoping, two-phase analyst+verifier rationale, and UTC timestamp handling for omnistat-inspect.

Validated end-to-end on a 2-node (16 × MI355X) HydraGNN training run: COMPLETED in 8m28s, 353 MB rank-0 trace + 1.6 MB Omnistat DB, all 6 analyst claims verified by independent re-derivation, combined_report.md generated.

Co-authored-by: Cursor <cursoragent@cursor.com>
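
The configparser point is easy to reproduce. The sketch below contrasts rendering a `@JOB_DIR@` placeholder before parsing with relying on `%(SLURM_JOB_ID)s` interpolation. The section name, key name, and paths are illustrative assumptions, not the contents of the real template.

```python
import configparser
import os

# Illustrative template; the section and key names are assumptions.
TEMPLATE = "[omnistat.usermode]\nvictoria_datadir = @JOB_DIR@/omnistat/vm\n"


def render(template: str, job_dir: str) -> str:
    """Substitute the @JOB_DIR@ placeholder before configparser sees the text."""
    return template.replace("@JOB_DIR@", job_dir)


def load(text: str) -> configparser.ConfigParser:
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser


# Rendered at submit time: the value is already a literal path.
rendered = load(render(TEMPLATE, "/tmp/job-1234"))
datadir = rendered["omnistat.usermode"]["victoria_datadir"]

# Relying on %(SLURM_JOB_ID)s instead: configparser only interpolates from the
# file's own options and defaults, so the environment variable is never used.
os.environ["SLURM_JOB_ID"] = "1234"
unrendered = load(
    "[omnistat.usermode]\nvictoria_datadir = /tmp/%(SLURM_JOB_ID)s/vm\n"
)
try:
    unrendered["omnistat.usermode"]["victoria_datadir"]
    interpolation_failed = False
except configparser.InterpolationMissingOptionError:
    interpolation_failed = True

print(datadir, interpolation_failed)
```

Rendering at submit time also leaves a fully resolved config file in the job directory, which is easier to audit after the run.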

Bumps the pinned HydraGNN SHA from 6c45f16 to upstream main (2fb0bd0, 2026-05-20) for both sbatch_train_amd.sh and sbatch_train_perf_amd.sh, and rebuilds the Apptainer overlay on top of that SHA with a small, opt-in patch that exposes a HYDRAGNN_PERSISTENT_WORKERS env var. The existing HYDRAGNN_NUM_WORKERS knob is unchanged; together the two let the dataloader keep workers alive across epochs instead of forking them at every epoch boundary.

Why this matters: single-node MI355X probes with NW=8 + PW=1 reduce the steady-state per-iteration time from 2.35 s/it (baseline NW=0, PW=0) to ~0.25 s/it, a ~9x speedup, and eliminate the bimodal iteration-time distribution the perf-analysis subagents had previously flagged as the primary cause of >96% GPU idle time in job 6762.

Build script changes:
- Replace `git clone --depth=1 + git fetch --depth=1 <sha>` with a full clone + checkout, since some remote configs reject fetch-by-sha on shallow clones.
- Stage examples/patches/*.patch and apply them inside the container with `patch -p1 --dry-run` first so a stale patch fails fast.
- Beef up the verify step to print `hydragnn.__file__` and confirm the patch is present in the installed source.

sbatch / probe scripts:
- Drop the rank-0-only `python3 -c "import hydragnn"` verify block. That import triggers `from mpi4py import MPI`, which calls MPI_Init() on rank 0 alone; under `srun --mpi=pmix` this leaves ranks 1..N-1 hanging in their later MPI_Init() and the whole job deadlocks in `dist.init_process_group → _create_c10d_store`. Move the same verification logic inside the main python, where all ranks align at the same MPI_Init.

Skill update:
- Add an "Anti-pattern: rank-0 mpi4py import before srun" section to ai4science-studio SKILL.md so a fresh session recognizes the symptom (all ranks stuck in `_create_c10d_store`) and the cure.

Patch is documented in examples/patches/README.md and is opt-in: the default behaviour with HYDRAGNN_PERSISTENT_WORKERS unset matches stock upstream main.
Co-authored-by: Cursor <cursoragent@cursor.com>
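
As a rough picture of how the two knobs could map onto dataloader arguments, the sketch below builds a keyword dict from the environment. It is not the patch itself: `dataloader_kwargs` is a hypothetical helper, and only the two environment variable names come from the commit message. The guard reflects PyTorch's rule that `persistent_workers=True` requires `num_workers > 0`.

```python
import os


def dataloader_kwargs(env=os.environ):
    """Translate the two env knobs into DataLoader keyword arguments.

    Unset or empty values fall back to 0, which matches the stock behaviour
    of no worker processes and no persistence.
    """
    num_workers = int(env.get("HYDRAGNN_NUM_WORKERS") or 0)
    persistent = bool(int(env.get("HYDRAGNN_PERSISTENT_WORKERS") or 0))

    kwargs = {"num_workers": num_workers}
    # PyTorch raises ValueError for persistent_workers=True with num_workers=0,
    # so only request persistence when worker processes actually exist.
    if persistent and num_workers > 0:
        kwargs["persistent_workers"] = True
    return kwargs


print(dataloader_kwargs({"HYDRAGNN_NUM_WORKERS": "8",
                         "HYDRAGNN_PERSISTENT_WORKERS": "1"}))
```

The result would be passed along the lines of `DataLoader(dataset, batch_size=..., **dataloader_kwargs())`; with both variables unset the call reduces to the default single-process loader.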

…rd gfx950 limits

The omnistat configs used by jobs 6762 and 6840 had `enable_rocprofiler = False` because the staged copy under `/shared/aaji/models/HydraGNN/perf-runs/` had drifted from the in-repo template, silently dropping all GPU hardware counters from both analyses. This change closes that gap and documents what we learned getting the counter pipeline working on Lux MI355X.

Script changes (sbatch_train_perf_amd.sh):
* Default OMNISTAT_TEMPLATE to the in-repo `recipes/perf-analysis/*.template` instead of a /shared staged path; the staged copy was the source of the silent-drop bug.
* Parse `enable_rocprofiler` and `profile` from the chosen template at submit time and surface them in the run banner. Print a WARN if rocprofiler is off, so a future submitter catches the silent-counters case before the run finishes.

Template changes (omnistat-lux.config.template):
* Reduce to a single working `constant`-mode profile with 5 counters (FETCH_SIZE + 4 fp64 SQ_INSTS_VALU_*_F64). This is the empirically validated set that fits gfx950's per-block hardware register budget AND avoids the omnistat periodic-mode bug surfaced during this work.
* Add a long comment block explaining the constraints (what fails, why, and the workaround) so a future maintainer doesn't recompose the broken 7-counter or two-set version.

SKILL changes (.cursor/skills/ai4science-studio/SKILL.md), four new lessons under "Performance-analysis subagent workflow":
* §8: never read configs from a /shared staged copy; always read from the in-repo template (the root cause of the silent drop above).
* §9: omnistat rocprofiler-sdk extension is a separate C++ build (not a pip extra). `pip install -e '.[query]'` does NOT trigger the build even with BUILD_ROCPROFILER_SDK_EXTENSION=1; use `python setup.py build_ext --inplace` on a compute node where ROCm headers are present.
* §10: gfx950 rocprofiler counter limits: FETCH_SIZE + WRITE_SIZE in one profile fails (error 38, both in TCC block); 4 x SQ_INSTS_VALU_*_F64 work in one profile.
* §11: omnistat sampling_mode = periodic in 1.12.0+git.65ea9ac only fires the FIRST counter set, so half the data is silently lost. Use `constant` mode with one set that fits in the hardware limits.

End-to-end verification: 1-node, 10-batch HydraGNN run (job 6858) with the new config collected non-zero FETCH_SIZE and SQ_INSTS_VALU_MFMA_MOPS_F64 samples on all 8 cards. Raw samples + per-card summary at /shared/aaji/models/HydraGNN/perf-runs/6858/omnistat/hardware_counters/. The rocprofiler MFMA_F64 rate triangulates with TraceLens's analytical roofline (TFLOP/s during burst windows) and Omnistat's 1-Hz utilization duty cycle; see the addendum to perf-runs/6840/combined_report.md.

Co-authored-by: Cursor <cursoragent@cursor.com>
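
The submit-time guard against silent counter loss could look roughly like the sketch below. The wrapper itself is a shell script, so this Python version only illustrates the check; placing `enable_rocprofiler` under an `[omnistat.collectors]` section and the banner wording are assumptions.

```python
import configparser


def rocprofiler_enabled(config_text: str) -> bool:
    """Read enable_rocprofiler from a rendered config; missing means off."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    # Section name is an assumption; fallback covers a missing section or key.
    return parser.getboolean("omnistat.collectors", "enable_rocprofiler",
                             fallback=False)


def run_banner(config_text: str) -> list:
    """Banner lines that make the counter state visible before the job runs."""
    enabled = rocprofiler_enabled(config_text)
    lines = [f"rocprofiler counters: {'on' if enabled else 'off'}"]
    if not enabled:
        lines.append("WARN: enable_rocprofiler is off; "
                     "no GPU hardware counters will be collected")
    return lines


drifted = "[omnistat.collectors]\nenable_rocprofiler = False\n"
for line in run_banner(drifted):
    print(line)
```

Treating a missing key as "off" is deliberate here: a drifted or truncated config then produces the same warning as an explicit `False`.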

Adds an OMNISTAT_KERNEL_TRACE=1 env-gate to sbatch_train_perf_amd.sh that loads omnistat's libomnistat_trace.so via ROCP_TOOL_LIBRARIES, flips enable_kernel_trace=True in the rendered config, and surfaces the state in the run banner. Hard-fails (exit 2) if the env-gate is on but the tool library is missing. This is the same defensive pattern as the rocprofiler counter-state guard from §8/SKILL, which keeps us from re-creating the "silent zero-counters" failure mode for the kernel-trace surface.

The kernel-trace library itself is built separately on a compute node inside the workload SIF; launcher.md picks up the build step (idempotent, gated on OMNISTAT_KERNEL_TRACE=1). SKILL §12 documents the three rocprofiler-sdk surfaces (device counting, kernel tracing, rocprofv3) and when to pick which.

Default behavior is unchanged: enable_kernel_trace=False in the template, no apptainer-exec changes when the gate is off.

Co-authored-by: Cursor <cursoragent@cursor.com>
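
A minimal sketch of the env-gate behaviour follows, written in Python for illustration even though the wrapper is a shell script. `kernel_trace_env` and the library paths are hypothetical; the variable names `OMNISTAT_KERNEL_TRACE` and `ROCP_TOOL_LIBRARIES` and the exit code 2 come from the commit message.

```python
import sys
import tempfile
from pathlib import Path


def kernel_trace_env(env, lib_path):
    """Return extra environment for the job, or exit 2 on a broken gate."""
    if env.get("OMNISTAT_KERNEL_TRACE") != "1":
        return {}  # gate off: default behaviour, nothing injected
    if not Path(lib_path).is_file():
        print(f"ERROR: OMNISTAT_KERNEL_TRACE=1 but {lib_path} is missing",
              file=sys.stderr)
        raise SystemExit(2)  # fail at submit time instead of tracing nothing
    return {"ROCP_TOOL_LIBRARIES": str(lib_path)}


# A temporary file stands in for the real tool library.
with tempfile.NamedTemporaryFile(suffix=".so") as fake_lib:
    gate_off = kernel_trace_env({}, fake_lib.name)
    gate_on = kernel_trace_env({"OMNISTAT_KERNEL_TRACE": "1"}, fake_lib.name)

try:
    kernel_trace_env({"OMNISTAT_KERNEL_TRACE": "1"},
                     "/nonexistent/libomnistat_trace.so")
    exit_code = 0
except SystemExit as exc:
    exit_code = exc.code

print(gate_off, sorted(gate_on), exit_code)
```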

Validation probe (job 7034, lux-mi355x-b1, 1 node x 8 ranks, 5 batches):
- libomnistat_trace.so loaded on every rank via ROCP_TOOL_LIBRARIES
- 22.5k records/rank x 8 ranks (~180k dispatches), 0 dropped
- 20/20 successful HTTP flushes per rank to localhost:8101/kernel_trace
- VictoriaMetrics shows 130 distinct kernel names x 8 cards x 2 metrics
- Top-by-duration confirms expected shape (NCCL 4.2s; Tensile GEMMs 1.7s; PyTorch elementwise tail) for a small-batch single-node DDP run

Two fixes folded in:
1. Endpoint port alignment: the kernel-trace collector registers /kernel_trace on the SAME Flask app as the other omnistat collectors, so the tool library must POST to the [omnistat.collectors] port (8101 in our template), NOT the library's own default 8001. Auto-derive the port from the rendered config to prevent drift.
2. Empty-string env passthrough trap (latent, surfaced by job 7033): HydraGNN's load_data.py and train_validate_test.py both use `os.getenv(K) is not None` then `int(os.environ[K])` to detect optional knobs (HYDRAGNN_NUM_WORKERS, HYDRAGNN_PERSISTENT_WORKERS, HYDRAGNN_MAX_NUM_BATCH). The old `--env KEY="${KEY:-}"` passthrough passes empty strings, which pass the existence check and then crash on int(''). Replaced with a conditional-array pattern that only appends --env flags when the value is non-empty. Applied to both sbatch_train_amd.sh and sbatch_train_perf_amd.sh; documented as SKILL §13 with an explicit "audit every */models/*/examples/*.sh" directive for future agents.

SKILL also captures the SIF build recipe (pip-cmake + .deb-with-zstd extraction for libcurl headers; apt-get download is unavailable inside the SIF) so the next agent can build libomnistat_trace.so without re-discovering these gotchas.

Co-authored-by: Cursor <cursoragent@cursor.com>
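
The empty-string trap and the conditional-array fix can both be shown in a few lines. In the sketch below, `parse_knob_as_described` mirrors the `is not None` then `int(...)` pattern quoted above (using a dict in place of `os.environ`), and `env_flags` is a hypothetical Python rendering of the shell conditional-array pattern; neither is the actual HydraGNN or sbatch code.

```python
KNOBS = ["HYDRAGNN_NUM_WORKERS", "HYDRAGNN_PERSISTENT_WORKERS",
         "HYDRAGNN_MAX_NUM_BATCH"]


def parse_knob_as_described(env, key):
    """Existence check followed by int(): an empty string passes the check."""
    if env.get(key) is not None:
        return int(env[key])  # int('') raises ValueError
    return None


def env_flags(env, keys):
    """Append --env KEY=VALUE only for keys that are set and non-empty."""
    flags = []
    for key in keys:
        value = env.get(key, "")
        if value != "":
            flags += ["--env", f"{key}={value}"]
    return flags


# What the old passthrough effectively produced: unset knobs become "".
container_env = {"HYDRAGNN_NUM_WORKERS": "8", "HYDRAGNN_PERSISTENT_WORKERS": ""}

try:
    parse_knob_as_described(container_env, "HYDRAGNN_PERSISTENT_WORKERS")
    crashed = False
except ValueError:
    crashed = True

flags = env_flags(container_env, KNOBS)
print(crashed, flags)
```

With the conditional pattern, an unset knob never reaches the container at all, so the existence check behaves as the workload expects.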

…failure debug)

Adds three procedural notes that were only in chat after the kernel-trace build and validation session:

§14. Cluster-state checks before queueing build/probe jobs on Lux. `sinfo -p lux,rad` + `sinfo -R` distinguishes "queued" from "queued forever" when the partition is drained for slurm maintenance. Includes the salloc-vs-srun bind-mount gotcha: `salloc + apptainer exec` doesn't auto-bind /shared/aaji/tools/ in a way that survives across hops, so for one-shot interactive builds inside the SIF prefer `srun ... apptainer exec --bind /shared/aaji/tools:/shared/aaji/tools`.

§15. Dual-failure debug pattern: separate "tool wired correctly?" from "workload happy?". When a probe job fails, look for the instrumentation's startup banner before re-building. Worked example: kernel-trace job 7033 crashed in HydraGNN's create_dataloaders (workload bug §13), yet libomnistat_trace.so still printed its 0/0 exit summary on every rank, proving the tool wiring was correct all along. Saved a full rebuild cycle here.

Also refreshes the perf-analysis recipe README:
- Drops the stale "rocprofiler counters disabled by default" line (they have been on by default since commit a23e5c4).
- Adds a "Telemetry knobs" table documenting the OMNISTAT_KERNEL_TRACE switch, the trace-lib path override, and the smoke-signal value of OMNISTAT_TRACE_LOG=1.

Co-authored-by: Cursor <cursoragent@cursor.com>

ashwinma and others added 6 commits May 21, 2026 00:54

ashwinma changed the title ~~Performance Optimization Exploration of HydraGNN~~ HydraGNN perf characterization + iterative sysopt-loop recipe (MI355X) May 26, 2026

ashwinma closed this May 26, 2026

ashwinma deleted the aaji/perf-agents-hydragnn branch May 26, 2026 22:17

ashwinma wants to merge 6 commits into main from aaji/perf-agents-hydragnn
