
HydraGNN perf characterization + iterative sysopt-loop recipe (MI355X) #37

Open
ashwinma wants to merge 14 commits into main from aaji/perf-optimizer-loop-hydragnn


@ashwinma
Collaborator

Summary

This PR adds a complete HydraGNN performance characterization and iterative optimization workflow for AMD MI355X GPU clusters, built around an agent-first recipe pattern using
TraceLens/Omnistat observability tooling and Claude Code CLI.

What's new

Performance analysis recipe (recipes/perf-analysis/)
A multi-subagent bottleneck-analysis workflow that runs after a HydraGNN training job:

  • launcher subagent: submits the job, manages Omnistat user-mode telemetry + VictoriaMetrics, writes a structured manifest.json
  • Parallel tracelens_analyst + omnistat_analyst subagents: analyze PyTorch kineto traces and GPU hardware counters (MFMA, HBM, power) via PromQL
  • Parallel tracelens_verifier + omnistat_verifier subagents: independently challenge each analyst's top claims to guard against hallucination
  • synthesizer subagent: merges verified claims into a single combined_report.md

Iterative sysopt-loop recipe (recipes/perf-optimizer-loop/)
An autonomous optimizer loop that drives repeated sbatch submissions, applies one lever per iteration, and accepts or reverts based on the primary FOM (epoch_time_s):

  • orchestrator subagent: owns the loop, polls SLURM, handles resume-from-disk, and dispatches all other subagents
  • lever_picker subagent: reads lever_catalog.yaml + the previous iteration's combined_report.md to score and pick the next lever
  • fom_extractor subagent: computes six figures of merit (epoch time, throughput, MFMA TFLOPS, energy, power, loss) and builds a 1-second TraceLens↔Omnistat kernel-correlation table
  • story_writer subagent: produces a story.md narrative and foms.png progression chart
  • run_optimizer_loop.sh: CLI driver that invokes the orchestrator via Claude Code CLI inside a detached tmux session, with preflight checks and retry-with-backoff
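
The accept-or-revert step described above can be sketched as follows. This is a minimal illustration, not the orchestrator's actual logic: the function name `decide`, the `min_gain` noise margin, and the example numbers are assumptions. Only the primary FOM (`epoch_time_s`, lower is better) comes from the recipe.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    accepted: bool
    delta_pct: float  # negative means the candidate epoch is faster


def decide(best_epoch_time_s: float, candidate_epoch_time_s: float,
           min_gain: float = 0.0) -> Decision:
    """Accept a lever only if it lowers epoch_time_s by more than min_gain.

    min_gain is a hypothetical noise margin expressed as a fraction
    (0.01 == 1%); the recipe's real threshold is not stated in this PR.
    """
    if best_epoch_time_s <= 0:
        raise ValueError("best_epoch_time_s must be positive")
    delta = (candidate_epoch_time_s - best_epoch_time_s) / best_epoch_time_s
    return Decision(accepted=delta < -min_gain, delta_pct=100.0 * delta)


# Illustrative numbers only.
faster = decide(76.0, 75.0)   # candidate is faster than the current best
slower = decide(76.0, 84.0)   # candidate is slower, so the lever is reverted
print(faster, slower)
```

Keeping the decision a pure function of two numbers makes it easy to replay against a stored `foms.csv` when resuming a loop from disk.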

Instrumented sbatch wrapper (sbatch_train_perf_amd.sh)
A perf-focused variant of the existing training script that adds: Omnistat user-mode telemetry, per-rank kineto profiling, optional rocprofiler-sdk kernel-dispatch tracing, a per-node
mount-health probe (exits 42 if any node has a broken NFS mount), and an opt-in pre-train hook for lever patches without editing the script.

Node health microbench (microbench_node_health.sh)
A ~30-second per-node survey: host inventory, container mount probe, dual-NUMA STREAM bandwidth, and HIP kernel launch latency. Designed to run as a SLURM Prolog to surface per-node
hardware regressions before they waste a full job.

Lever catalog (lever_catalog.yaml)
Machine-readable record of every optimization knob tried, including blocked levers with full evidence so future agents never re-attempt known-dead paths:

  • torch.compile: blocked by AOT-Autograd double-backward (HydraGNN's energy_force_loss calls autograd.grad(create_graph=True))
  • torch.jit.script: blocked by **conv_args expansion at MACEStack.py:397
  • PYTORCH_TUNABLEOP_TUNING=1: blocked by GPU memory access faults in hipBLASLt tuning on ROCm 7.2.2
  • rccl_high_priority (TORCH_NCCL_HIGH_PRIORITY=1 GPU_MAX_HW_QUEUES=2): accepted, −1.3% epoch time, −8.4% J/sample — the one lever that worked, now baked into the sbatch defaults

Key findings (2-node MI355X, 16× GPUs, HydraGNN GFM-MLIP MACE)

  • Best: 75.0 s/epoch at batch=400, fp64, 8 workers, persistent_workers=1
  • Workload is dispatch-bound: the ~3.7 µs HIP launch latency sets a ceiling of roughly 270K launches/s (3.7 µs × 270K ≈ 1 s of launch time per second), as evidenced by fp32 showing < 2% speedup vs fp64 despite having 2× peak MFMA throughput
  • Path to a step-change requires upstream HydraGNN PRs: move energy_force_loss into EnhancedModelWrapper.forward (unblocks torch.compile) and unpack **conv_args at
    MACEStack.py:397 (unblocks torch.jit.script)
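
The dispatch-ceiling arithmetic behind the second finding can be checked in a few lines. The two input numbers are taken from the finding; treating launches as fully serialized on one submission path is an assumption of this sketch.

```python
launch_latency_s = 3.7e-6      # ~3.7 µs per HIP kernel launch (from the findings)
launches_per_s = 270_000       # ~270K launches/s (from the findings)

# Maximum launch rate implied by the latency alone (assumes serialized launches).
ceiling_launches_per_s = 1.0 / launch_latency_s

# Fraction of each wall-clock second spent just launching kernels.
launch_busy_fraction = launch_latency_s * launches_per_s

print(f"ceiling ~ {ceiling_launches_per_s:,.0f} launches/s")
print(f"launch-path occupancy ~ {launch_busy_fraction:.3f}")
```

An occupancy close to 1.0 means faster arithmetic cannot shorten the epoch, which is consistent with fp32 and fp64 landing at nearly the same epoch time.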

Lessons captured

19 lessons in .cursor/skills/ai4science-perf-analysis/SKILL.md covering: per-node mount-health probe design, omnistat config pitfalls, dispatch-vs-compute discrimination via fp32/fp64,
rocprofiler counter validation, SLURM --export gotchas, and how confounded baselines produce false node-class narratives.

ashwinma and others added 13 commits May 28, 2026 19:00
Adds a multi-subagent bottleneck-analysis workflow for HydraGNN on Lux:

- recipes/perf-analysis/ — README + 6 agent prompt files (launcher,
  tracelens_analyst, tracelens_verifier, omnistat_analyst,
  omnistat_verifier, synthesizer) and an Omnistat user-mode config
  template.
- examples/sbatch_train_perf_amd.sh — 2-node SLURM wrapper that injects
  the PyTorch profiler into HydraGNN's JSON config (under
  NeuralNetwork.Profile, where train_validate_test() actually reads it),
  starts/stops omnistat-usermode around srun, and renders the omnistat
  config from a @JOB_DIR@ template at submit time (configparser does
  not interpolate os.environ, so %(SLURM_JOB_ID)s is unreliable).
- .cursor/skills/ai4science-perf-analysis/SKILL.md — orchestration
  guide for the main agent.
- .cursor/skills/ai4science-studio/SKILL.md — captures the cross-cutting
  lessons surfaced by the smoke run on JOB 6762: configparser
  interpolation, /tmp/omni_rmsjobinfo permission collision, VM
  -fs.disableMmap on login node, PromQL rmsjob_info join, HydraGNN
  Profile-block scoping, two-phase analyst+verifier rationale, and UTC
  timestamp handling for omnistat-inspect.
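
The configparser point can be reproduced in isolation. The sketch below is illustrative only: the `[demo]` section, the `data_dir` key, and the paths are hypothetical and are not taken from the Omnistat template. It shows that `%(SLURM_JOB_ID)s` is not resolved from the environment, while a pre-rendered `@JOB_DIR@` placeholder works.

```python
import configparser
import os

TEMPLATE = """\
[demo]
data_dir = @JOB_DIR@/omnistat
"""

BROKEN = """\
[demo]
data_dir = /tmp/%(SLURM_JOB_ID)s/omnistat
"""


def render(template: str, job_dir: str) -> str:
    """Substitute the @JOB_DIR@ placeholder before configparser sees the text."""
    return template.replace("@JOB_DIR@", job_dir)


def read_data_dir(text: str) -> str:
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser.get("demo", "data_dir")


# Setting the variable in the environment does not help: configparser only
# interpolates from the same section, [DEFAULT], or explicitly passed vars.
os.environ["SLURM_JOB_ID"] = "12345"
try:
    read_data_dir(BROKEN)
    interpolation_failed = False
except configparser.InterpolationMissingOptionError:
    interpolation_failed = True

rendered_dir = read_data_dir(render(TEMPLATE, "/scratch/job-12345"))
print(interpolation_failed, rendered_dir)
```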

Validated end-to-end on a 2-node (16 × MI355X) HydraGNN training run:
COMPLETED in 8m28s, 353 MB rank-0 trace + 1.6 MB Omnistat DB, all 6
analyst claims verified by independent re-derivation, combined_report.md
generated.

Co-authored-by: Cursor <cursoragent@cursor.com>
Bumps the pinned HydraGNN SHA from 6c45f16 to upstream main (2fb0bd0,
2026-05-20) for both sbatch_train_amd.sh and sbatch_train_perf_amd.sh,
and rebuilds the Apptainer overlay on top of that SHA with a small,
opt-in patch that exposes a HYDRAGNN_PERSISTENT_WORKERS env var. The
existing HYDRAGNN_NUM_WORKERS knob is unchanged; together the two let
the dataloader keep workers alive across epochs instead of forking them
every epoch boundary.

Why this matters: single-node MI355X probes with HYDRAGNN_NUM_WORKERS=8 and
HYDRAGNN_PERSISTENT_WORKERS=1 (NW=8, PW=1) reduce the steady-state
per-iteration time from 2.35 s/it (baseline NW=0, PW=0) to ~0.25 s/it, a ~9x
speedup, and eliminate the bimodal iteration-time distribution the
perf-analysis subagents had previously flagged as the primary cause of >96%
GPU idle time in job 6762.

Build script changes:
- Replace `git clone --depth=1 + git fetch --depth=1 <sha>` with a full
  clone + checkout, since some remote configs reject fetch-by-sha on
  shallow clones.
- Stage examples/patches/*.patch and apply them inside the container
  with `patch -p1 --dry-run` first so a stale patch fails fast.
- Beef up the verify step to print `hydragnn.__file__` and confirm the
  patch is present in the installed source.

sbatch / probe scripts:
- Drop the rank-0-only `python3 -c "import hydragnn"` verify block.
  That import triggers `from mpi4py import MPI` which calls MPI_Init()
  on rank 0 alone; under `srun --mpi=pmix` this leaves ranks 1..N-1
  hanging in their later MPI_Init() and the whole job deadlocks in
  `dist.init_process_group → _create_c10d_store`. Move the same
  verification logic inside the main python where all ranks align at
  the same MPI_Init.

Skill update:
- Add an "Anti-pattern: rank-0 mpi4py import before srun" section to
  ai4science-studio SKILL.md so a fresh session recognizes the symptom
  (all ranks stuck in `_create_c10d_store`) and the cure.

Patch is documented in examples/patches/README.md and is opt-in: the
default behaviour with HYDRAGNN_PERSISTENT_WORKERS unset matches stock
upstream main.

Co-authored-by: Cursor <cursoragent@cursor.com>
…rd gfx950 limits

The omnistat configs used by jobs 6762 and 6840 had `enable_rocprofiler = False`
because the staged copy under `/shared/aaji/models/HydraGNN/perf-runs/` had drifted
from the in-repo template, silently dropping all GPU hardware counters from both
analyses. This change closes that gap and documents what we learned getting the
counter pipeline working on Lux MI355X.

Script changes (sbatch_train_perf_amd.sh):
  * Default OMNISTAT_TEMPLATE to the in-repo `recipes/perf-analysis/*.template`
    instead of a /shared staged path; the staged copy was the source of the
    silent-drop bug.
  * Parse `enable_rocprofiler` and `profile` from the chosen template at submit
    time and surface them in the run banner. Print a WARN if rocprofiler is off,
    so a future submitter catches the silent-counters case before the run finishes.
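
A submit-time check of this kind could look roughly like the following. It is a sketch, not the code in sbatch_train_perf_amd.sh: the section names in `SAMPLE` are hypothetical, the helper scans every section because the real location of `enable_rocprofiler` and `profile` in the template is not stated here, and the banner wording is invented for illustration.

```python
import configparser
from typing import List, Optional


def find_option(parser: configparser.ConfigParser, key: str) -> Optional[str]:
    """Return the first value of `key` found in any section, or None if absent."""
    for section in parser.sections():
        if parser.has_option(section, key):
            return parser.get(section, key, raw=True)
    return None


def banner_lines(config_text: str) -> List[str]:
    """Summarize the rocprofiler state and warn when counters are disabled."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    raw = find_option(parser, "enable_rocprofiler")
    enabled = raw is not None and raw.strip().lower() in {"1", "true", "yes", "on"}
    profile = find_option(parser, "profile") or "<unset>"
    lines = [f"omnistat rocprofiler: {'on' if enabled else 'off'} (profile={profile})"]
    if not enabled:
        lines.append("WARN: enable_rocprofiler is off; GPU hardware counters will not be collected")
    return lines


# Hypothetical section layout, for illustration only.
SAMPLE = """\
[collectors]
enable_rocprofiler = False

[collectors.rocprofiler]
profile = default
"""

for line in banner_lines(SAMPLE):
    print(line)
```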

Template changes (omnistat-lux.config.template):
  * Reduce to a single working `constant`-mode profile with 5 counters
    (FETCH_SIZE + 4 fp64 SQ_INSTS_VALU_*_F64). This is the empirically-validated
    set that fits gfx950's per-block hardware register budget AND avoids the
    omnistat periodic-mode bug surfaced during this work.
  * Add a long comment block explaining the constraints — what fails, why,
    and the workaround — so a future maintainer doesn't recompose the broken
    7-counter or two-set version.

SKILL changes (.cursor/skills/ai4science-studio/SKILL.md), four new lessons
under "Performance-analysis subagent workflow":
  * §8: never read configs from a /shared staged copy — always from the
    in-repo template (the root-cause of the silent-drop above).
  * §9: omnistat rocprofiler-sdk extension is a separate C++ build (not a
    pip extra). `pip install -e '.[query]'` does NOT trigger the build even
    with BUILD_ROCPROFILER_SDK_EXTENSION=1; use `python setup.py build_ext
    --inplace` on a compute node where ROCm headers are present.
  * §10: gfx950 rocprofiler counter limits — FETCH_SIZE + WRITE_SIZE in one
    profile fails (error 38, both in TCC block), 4 x SQ_INSTS_VALU_*_F64
    work in one profile.
  * §11: omnistat sampling_mode = periodic in 1.12.0+git.65ea9ac only fires
    the FIRST counter set — half the data silently lost. Use `constant` mode
    with one set that fits in the hardware limits.

End-to-end verification: 1-node, 10-batch HydraGNN run (job 6858) with the
new config collected non-zero FETCH_SIZE and SQ_INSTS_VALU_MFMA_MOPS_F64
samples on all 8 cards. Raw samples + per-card summary at
/shared/aaji/models/HydraGNN/perf-runs/6858/omnistat/hardware_counters/.
The rocprofiler MFMA_F64 rate triangulates with TraceLens's analytical
roofline (TFLOP/s during burst windows) and Omnistat's 1-Hz utilization
duty cycle — see the addendum to perf-runs/6840/combined_report.md.

Co-authored-by: Cursor <cursoragent@cursor.com>
Adds an OMNISTAT_KERNEL_TRACE=1 env-gate to sbatch_train_perf_amd.sh that
loads omnistat's libomnistat_trace.so via ROCP_TOOL_LIBRARIES, flips
enable_kernel_trace=True in the rendered config, and surfaces the state
in the run banner. Hard-fails (exit 2) if the env-gate is on but the
tool library is missing — same defensive pattern as the rocprofiler
counter-state guard from §8/SKILL, which keeps us from re-creating the
"silent zero-counters" failure mode for the kernel-trace surface.

The kernel-trace library itself is built separately on a compute node
inside the workload SIF; launcher.md picks up the build step (idempotent,
gated on OMNISTAT_KERNEL_TRACE=1). SKILL §12 documents the three
rocprofiler-sdk surfaces (device counting, kernel tracing, rocprofv3)
and when to pick which.

Default behavior is unchanged: enable_kernel_trace=False in the template,
no apptainer-exec changes when the gate is off.

Co-authored-by: Cursor <cursoragent@cursor.com>
Validation probe (job 7034, lux-mi355x-b1, 1 node x 8 ranks, 5 batches):
- libomnistat_trace.so loaded on every rank via ROCP_TOOL_LIBRARIES
- 22.5k records/rank x 8 ranks (~180k dispatches), 0 dropped
- 20/20 successful HTTP flushes per rank to localhost:8101/kernel_trace
- VictoriaMetrics shows 130 distinct kernel names x 8 cards x 2 metrics
- Top-by-duration confirms expected shape (NCCL 4.2s; Tensile GEMMs 1.7s;
  PyTorch elementwise tail) for a small-batch single-node DDP run

Two fixes folded in:

1. Endpoint port alignment: the kernel-trace collector registers
   /kernel_trace on the SAME Flask app as the other omnistat collectors,
   so the tool library must POST to [omnistat.collectors] port (8101 in
   our template), NOT the library's own default 8001. Auto-derive the
   port from the rendered config to prevent drift.

2. Empty-string env passthrough trap (latent, surfaced by job 7033):
   HydraGNN's load_data.py and train_validate_test.py both use
   `os.getenv(K) is not None` then `int(os.environ[K])` to detect
   optional knobs (HYDRAGNN_NUM_WORKERS, HYDRAGNN_PERSISTENT_WORKERS,
   HYDRAGNN_MAX_NUM_BATCH). The old `--env KEY="${KEY:-}"` passthrough
   passes empty strings, which pass the existence check and then crash
   on int(''). Replaced with a conditional-array pattern that only
   appends --env flags when the value is non-empty. Applied to both
   sbatch_train_amd.sh and sbatch_train_perf_amd.sh; documented as
   SKILL §13 with an explicit "audit every */models/*/examples/*.sh"
   directive for future agents.
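
The empty-string trap in fix 2 is easy to reproduce. The sketch below renders the idea in Python rather than in the sbatch script's shell: `fragile_read` mirrors the check-then-`int()` pattern described above, and `env_flags` is a hypothetical helper that only emits `--env` flags for non-empty values.

```python
from typing import List, Mapping, Optional, Sequence

KNOBS = ("HYDRAGNN_NUM_WORKERS", "HYDRAGNN_PERSISTENT_WORKERS", "HYDRAGNN_MAX_NUM_BATCH")


def fragile_read(env: Mapping[str, str], key: str) -> Optional[int]:
    """The pattern described above: existence check, then int()."""
    if env.get(key) is not None:
        return int(env[key])      # raises ValueError when the value is ''
    return None


def env_flags(env: Mapping[str, str], keys: Sequence[str] = KNOBS) -> List[str]:
    """Build --env flags only for knobs that are set to a non-empty value."""
    flags: List[str] = []
    for key in keys:
        value = env.get(key, "")
        if value:
            flags += ["--env", f"{key}={value}"]
    return flags


# An empty string passes the existence check and then breaks int('').
demo_env = {"HYDRAGNN_NUM_WORKERS": "8", "HYDRAGNN_PERSISTENT_WORKERS": ""}
try:
    fragile_read(demo_env, "HYDRAGNN_PERSISTENT_WORKERS")
    crashed = False
except ValueError:
    crashed = True

print(crashed, env_flags(demo_env))
```

Filtering at the launcher keeps the workload code untouched, which matters when the consuming checks live in an upstream package.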

SKILL also captures the SIF build recipe (pip-cmake + .deb-with-zstd
extraction for libcurl headers; apt-get download is unavailable inside
the SIF) so the next agent can build libomnistat_trace.so without
re-discovering these gotchas.

Co-authored-by: Cursor <cursoragent@cursor.com>
…failure debug)

Adds three procedural notes that were only in chat after the
kernel-trace build and validation session:

§14. Cluster-state checks before queueing build/probe jobs on Lux.
     `sinfo -p lux,rad` + `sinfo -R` distinguishes "queued" from
     "queued forever" when the partition is drained for slurm
     maintenance. Includes the salloc-vs-srun bind-mount gotcha:
     `salloc + apptainer exec` doesn't auto-bind /shared/aaji/tools/
     in a way that survives across hops, so for one-shot interactive
     builds inside the SIF prefer
     `srun ... apptainer exec --bind /shared/aaji/tools:/shared/aaji/tools`.

§15. Dual-failure debug pattern — separate "tool wired correctly?"
     from "workload happy?". When a probe job fails, look for the
     instrumentation's startup banner before re-building. Worked
     example: kernel-trace job 7033 crashed in HydraGNN's
     create_dataloaders (workload bug §13), yet
     libomnistat_trace.so still printed its 0/0 exit summary on every
     rank — proving the tool wiring was correct all along. Saved a
     full rebuild cycle here.

Also refreshes the perf-analysis recipe README:
- Drops the stale "rocprofiler counters disabled by default" line
  (they have been on by default since commit a23e5c4).
- Adds a "Telemetry knobs" table documenting the
  OMNISTAT_KERNEL_TRACE switch, the trace-lib path override, and the
  smoke-signal value of OMNISTAT_TRACE_LOG=1.

Co-authored-by: Cursor <cursoragent@cursor.com>
…e CLI)

New recipe at material_science/models/HydraGNN/recipes/perf-optimizer-loop/
that drives 3-5 iterations of the existing perf-analysis pipeline, picks one
lever per iter from lever_catalog.yaml, accepts/auto-reverts based on
primary FOM (epoch_time_s), and writes a progression story.

Subagent contract mirrors perf-analysis/: orchestrator -> lever_picker ->
launcher -> tracelens+omnistat analysts (parallel) -> verifiers (parallel) ->
synthesizer -> fom_extractor -> accept/revert -> next iter, ending in
story_writer with story.md + foms.png.

The lever catalog has 9 entries ranked by expected payoff at R2 state, including:
batch_800, torch_compile_e3nn (rank-script patch), rccl_high_priority,
tunable_op, num_workers_12, nccl_minchannels, precision_fp32 (gated on
compute-bound state), kernel_trace_diag_only (gated diagnostic-only).
bf16 is intentionally excluded (MACE-class equivariance requires fp32 floor).
Each lever cites the ROCm blog / doc that motivates it.

fom_extractor adds a TraceLens<->Omnistat 1-second-window correlation
(kernel_correlation.csv) so we can attribute the dominant kernel without
turning on the omnistat kernel-trace collector (which inflates VM
cardinality 1000x; see SKILL #12).
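
One way to picture the 1-second-window correlation is below. This is a simplified sketch and not the fom_extractor implementation: the tuple layouts, kernel names, and metric values are hypothetical, and each kernel is attributed wholly to the second in which it started.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

KernelEvent = Tuple[float, float, str]   # (start_s, duration_s, kernel_name)
Sample = Tuple[int, float]               # (unix_second, metric_value)
Row = Tuple[int, str, float, float]      # (window_start_s, kernel, busy_s, metric_value)


def correlate(events: List[KernelEvent], samples: List[Sample]) -> List[Row]:
    """For each 1-second window that has a telemetry sample, report the kernel
    that accumulated the most duration among events starting in that window."""
    busy: Dict[int, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for start_s, duration_s, name in events:
        busy[int(start_s)][name] += duration_s   # bucket by the starting second

    rows: List[Row] = []
    for second, value in sorted(samples):
        if second not in busy:
            continue                             # no trace coverage for this sample
        name, total = max(busy[second].items(), key=lambda kv: kv[1])
        rows.append((second, name, total, value))
    return rows


# Illustrative data only.
events = [(100.1, 0.3, "gemm"), (100.5, 0.2, "nccl"),
          (100.8, 0.15, "gemm"), (101.2, 0.4, "nccl")]
samples = [(100, 410.0), (101, 395.0), (102, 120.0)]
rows = correlate(events, samples)
for row in rows:
    print(row)
```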

Async unattended execution: examples/run_optimizer_loop.sh runs pre-flight
checks (cluster up, disk free, claude CLI, ANTHROPIC_API_KEY, https egress)
then invokes the orchestrator via claude CLI with retry-with-backoff. Designed
for a tmux session on rad-vultr-login so the loop survives ssh disconnect.
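
Retry-with-backoff, as used by the driver, follows a common shape. The sketch below shows that shape in Python. The attempt count, base delay, and growth factor are assumptions, not the values in run_optimizer_loop.sh.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(fn: Callable[[], T], attempts: int = 4,
                       base_delay_s: float = 30.0, factor: float = 2.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn until it succeeds or `attempts` is exhausted.

    Delays grow geometrically: base, base*factor, base*factor**2, ...
    The last failure is re-raised unchanged.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay_s * factor ** attempt)
    raise AssertionError("unreachable")


# Demo with a fake sleep so nothing actually waits.
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


delays = []
result = retry_with_backoff(flaky, sleep=delays.append)
print(result, delays)
```

Injecting `sleep` keeps the helper testable without real waiting.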

sbatch_train_perf_amd.sh: extend OPT_ENVS to forward lever knobs
(HG_TORCH_COMPILE, TORCH_NCCL_HIGH_PRIORITY, PYTORCH_TUNABLEOP_*, etc.) only
when non-empty (avoids the empty-string trap from SKILL #13), and add
HG_RANK_PRE_TRAIN_HOOK with parent-dir bind-mount so rank_script_patch
levers (e.g. torch.compile) work end-to-end.

parse_convergence.py: add --json-foms emitter for the FOM extractor.

.gitignore: cover stray omnistat/VM logs and convergence parser outputs.

Co-authored-by: Cursor <cursoragent@cursor.com>
…diagnosis

Loop-43b33ec1 iter-1 (job 7088) was misdiagnosed as "a-nodes lack /home";
actually only lux-mi355x-a6 had a broken per-user autofs/NFS mount, while
lux-mi355x-a1 in the same allocation was healthy. The pmix collective fence
on a1 deadlocked waiting for the failed a6 ranks until SLURM killed the job at
the 30 min wall-time limit. A post-loop probe of all 12 a-nodes confirmed the
per-node fault (a1-a5 and a7-a12 healthy; a6 still bad as of 2026-05-22 18:23 UTC).

Changes:

* sbatch_train_perf_amd.sh: inline per-job mount-health probe (one srun
  task per node, checks /home/$USER, /shared/$USER, SIF path). Exits 42
  with a clear FATAL line listing broken nodes and suggesting the retry
  command. Override via HG_SKIP_NODE_HEALTH_PROBE=1.

* recipes/perf-optimizer-loop/agents/orchestrator.md: new step 2e-bis to
  handle sacct ExitCode 42:0 by re-submitting the same lever with
  sbatch --exclude=<bad-nodes>, appending to loop-<uuid>/known_bad_nodes.txt,
  WITHOUT penalizing the lever or shrinking the node class. Escalates after
  3 recurrences of the same node.

* recipes/perf-optimizer-loop/agents/lever_picker.md: ignore foms.csv rows
  whose immediately-preceding STATUS event was ITER_BROKEN_NODE (infra fail,
  not lever evidence).

* .cursor/skills/ai4science-perf-analysis/SKILL.md: capture the correct
  cluster lesson (known-bad node list: lux-mi355x-a6) plus four other
  real lessons from the loop-43b33ec1 pilot (torch.compile hook path,
  parse_convergence epoch over-count, MFMA TFLOPS methodology, epoch_time
  as primary FOM).
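
The broken-node handling in the orchestrator step can be sketched as a small decision function. This is an illustration under assumptions: the function names are hypothetical, the node names are placeholders, and "escalates after 3 recurrences" is read here as escalating on the third report of the same node.

```python
from collections import Counter
from typing import Iterable, List, Tuple

BROKEN_NODE_EXIT = "42:0"     # sacct ExitCode when the mount-health probe fails
MAX_RECURRENCES = 3           # assumed: escalate on the third report of a node


def handle_exit(exit_code: str, bad_nodes: Iterable[str],
                history: Counter) -> Tuple[str, List[str]]:
    """Return (action, exclude_list) for one finished job.

    action is "continue" (not a node fault), "resubmit" (retry the same
    lever with the returned nodes excluded), or "escalate".
    `history` counts how often each node was reported bad; updated in place.
    """
    if exit_code != BROKEN_NODE_EXIT:
        return "continue", []
    nodes = sorted(set(bad_nodes))
    history.update(nodes)
    if any(history[n] >= MAX_RECURRENCES for n in nodes):
        return "escalate", sorted(history)
    return "resubmit", sorted(history)


def exclude_flag(nodes: List[str]) -> str:
    """Format the node list for sbatch --exclude."""
    return "--exclude=" + ",".join(nodes)


history: Counter = Counter()
first = handle_exit("42:0", ["node-06"], history)
print(first, exclude_flag(first[1]))
```

Note that the lever itself is never marked as failed here, which matches the intent of not penalizing a lever for an infrastructure fault.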

Co-authored-by: Cursor <cursoragent@cursor.com>
…ssons

Findings from loop-c3e4df1c (2026-05-22, 5 iters, best -1.3% epoch /
-8.4% energy via rccl_high_priority) and compile-investigation-aa480554
(2026-05-23, 6 hooks across 3 hypotheses, all FAIL with same root cause).

Catalog changes (material_science/models/HydraGNN/recipes/perf-optimizer-loop/lever_catalog.yaml):

* torch_compile_e3nn -> status: blocked with full root-cause docstring.
  Root cause: HydraGNN's energy_force_loss calls
  torch.autograd.grad(graph_energy_pred, data.pos, create_graph=True),
  which requires DOUBLE BACKWARD. Every torch.compile mode goes through
  AOT Autograd which pre-compiles the backward and cannot itself be
  differentiated. Verbatim error: "RuntimeError: torch.compile with
  aot_autograd does not currently support double backward". Lever
  reference impl is preserved (now raises RuntimeError) as a cautionary
  record so future agents see the failure modes documented inline.

* tunable_op -> renamed to tunable_op_live with status: blocked.
  PYTORCH_TUNABLEOP_TUNING=1 causes "Memory access fault by GPU node-N"
  on all 16 MI355X GPUs during hipBLASLt live tuning. Loop-c3e4df1c
  iter-3 / job 7188.

* tunable_op_warmup_then_use -> new safe replacement. Documents the
  2-phase pattern: dedicated warmup sbatch generates
  tunableop_results_<N>.csv files, then production runs consume them
  with PYTORCH_TUNABLEOP_ENABLED=1 (no _TUNING=1).

* num_workers_12 -> node-class-dependent note. Improved a-nodes
  (-12.4% in loop-43b33ec1) but regressed b-nodes (+10.7% in
  loop-c3e4df1c). lever_picker now told to consider node class.

Lever picker change (agents/lever_picker.md):

* Drop any lever with status: blocked from the candidate set. Read
  blocked_reason + blocked_evidence to explain the rejection but never
  pick. This prevents future iterations from re-trying torch.compile
  or live TunableOp.
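
The filtering rule can be sketched as follows. The catalog is shown as plain Python dicts rather than parsed YAML, the `candidate` status value is hypothetical, and the reason strings are paraphrased from this PR.

```python
from typing import Dict, List, Tuple

catalog: List[Dict[str, str]] = [
    {"name": "torch_compile_e3nn", "status": "blocked",
     "blocked_reason": "AOT Autograd does not support double backward"},
    {"name": "tunable_op_live", "status": "blocked",
     "blocked_reason": "GPU memory access faults during hipBLASLt tuning"},
    {"name": "num_workers_12", "status": "candidate"},   # hypothetical status value
]


def split_catalog(levers: List[Dict[str, str]]) -> Tuple[List[str], List[str]]:
    """Return (candidate names, human-readable rejection notes).

    Blocked levers are never returned as candidates; their recorded reason
    is surfaced so the picker can explain why they were skipped.
    """
    candidates: List[str] = []
    rejected: List[str] = []
    for lever in levers:
        if lever.get("status") == "blocked":
            reason = lever.get("blocked_reason", "no reason recorded")
            rejected.append(f"{lever['name']}: {reason}")
        else:
            candidates.append(lever["name"])
    return candidates, rejected


candidates, rejected = split_catalog(catalog)
print(candidates)
print(rejected)
```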

Skill change (.cursor/skills/ai4science-perf-analysis/SKILL.md):

* Add lessons 6-12 capturing the new findings: torch.compile BLOCKED
  (#6), allow_in_graph weaker than docs suggest (#7), TunableOp live
  unsafe (#8), lever payoff is node-class-dependent (#9), b-nodes
  ~17% faster baseline (#10), hydragnn/train re-export shadow (#11),
  sbatch --export var propagation (#12).

* Add lux-mi355x-a10 to the known-bad nodes list (broken PMIx, NOT
  a mount fault; loop-c3e4df1c iter-1 / job 7161).

Sbatch change (examples/sbatch_train_perf_amd.sh):

* Fix $SIF -> $HG_SIF typo in node-health-probe stanza (runtime-fixed
  by the orchestrator during loop-c3e4df1c iter-0 retry; lands here so
  future loops don't need the same retry).

Runtime artifacts (story.md, foms.png, foms.csv, STATUS.txt, hooks)
remain under /shared/aaji/models/HydraGNN/perf-runs/ and are not
committed. The investigation report at
/shared/aaji/models/HydraGNN/perf-runs/compile-investigation-aa480554-2a15-4e04-80dd-c58afc672a8d/INVESTIGATION_REPORT.md
is referenced in the catalog citation field.

Co-authored-by: Cursor <cursoragent@cursor.com>
…t BLOCKED

Findings from pathc-pathb-a3b8f3f0 (2026-05-23): Path C (torch.jit.script)
failed cleanly, and Path B (loop-e3fac2af) found no new accepted lever but
uncovered the project's most important diagnostic: HydraGNN GFM-MLIP MACE
on b-nodes is DISPATCH-BOUND, not compute-bound (fp32 ≈ fp64 with both
verified by precision-diagnostic logs).

Catalog changes (lever_catalog.yaml):

* torch_jit_script (NEW lever entry, status: blocked). Root cause:
  MACEStack.py:397 uses **conv_args keyword-arg expansion which
  TorchScript JIT does not support. Despite e3nn's @compile_mode("script")
  decorator, the call site is not scriptable. Both PyTorch compiler
  paths (torch.compile + torch.jit.script) are now definitively dead
  for HydraGNN MLIP MACE as shipped.

* rccl_high_priority -> status: accepted-baked-in. The only lever that
  produced a real improvement across 3 loops (-1.3% epoch, -8.4% energy
  in loop-c3e4df1c iter-2). Now part of the baseline contract.

* batch_800 -> NODE-CLASS-DEPENDENT regressor. Cross-loop evidence:
  +56.8% on a-nodes (loop-43b33ec1), +87.2% on b-nodes (loop-e3fac2af).
  expected_payoff updated to "none-for-MLIP-MACE".

* tunable_op_warmup_then_use -> status: blocked. Even the 2-phase split
  fails — the warmup phase itself triggers GPU mem faults in hipBLASLt
  tuning. Both TunableOp levers (live and warmup) are now blocked.

* nccl_minchannels -> confirmed regressor at N=2 (+10.3% in
  loop-e3fac2af iter-3). expected_payoff updated; lever_picker should
  drop entirely for N<8 runs.

* precision_fp32 -> repurposed as diagnostic-realized. The +1.3%
  result (within noise) IS the project's key dispatch-vs-compute
  discriminator. fp32 verified genuinely active via precision-diagnostic
  rank-0 logs (model_param_dtype=torch.float32). Catalog notes now
  explain the dispatch-bound conclusion and point to SKILL.md lesson #13.

Skill change (.cursor/skills/ai4science-perf-analysis/SKILL.md):

* +Lesson #13: fp32-vs-fp64 epoch-time comparison is the canonical
  low-cost dispatch-vs-compute discriminator on MI355X (fp32 peak is
  2x fp64 peak). Includes the three-branch decision tree for next-lever
  choice, comparing fp32 epoch time against fp64 (fp32 ≈ fp64 →
  dispatch-bound, fp32 ≪ fp64 → compute-bound, fp32 > fp64 → memory-bound).

* +Lesson #14: torch.jit.script also blocked for HydraGNN MLIP MACE.
  Both PyTorch compiler paths are dead until upstream fixes.

* +Lesson #15: TunableOp _TUNING=1 unsafe regardless of phase
  (live or warmup-then-use). Both lever variants now blocked.

* +Lesson #16: b-nodes hit the dispatch ceiling at the baseline.
  Every compute-side lever tested on b-nodes regressed or was neutral.
  a-nodes still had headroom for num_workers_12. lever_picker must
  encode node-class state.

* +Lesson #17: Current cross-loop best (75 s/epoch) decomposes as
  baseline-on-b-nodes (76 s, free) + rccl_high_priority (-1.3%).
  Single accepted lever across 3 loops + 1 investigation.
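
The three-branch discriminator from lesson #13 can be written as a small classifier. This is a sketch under stated assumptions: the `noise` band of 2% is hypothetical, and any fp32 gain beyond that band is treated as the compute-bound branch, although the lesson itself describes that branch as fp32 being much faster.

```python
def classify_bound(t_fp32_s: float, t_fp64_s: float, noise: float = 0.02) -> str:
    """Classify a workload from its fp32 and fp64 epoch times.

    rel is the fractional change of fp32 relative to fp64:
      |rel| <= noise  -> dispatch-bound (precision does not matter)
      rel  < -noise   -> compute-bound  (fp32 is faster)
      rel  >  noise   -> memory-bound   (fp32 is slower)
    """
    if t_fp64_s <= 0:
        raise ValueError("t_fp64_s must be positive")
    rel = (t_fp32_s - t_fp64_s) / t_fp64_s
    if abs(rel) <= noise:
        return "dispatch-bound"
    return "compute-bound" if rel < 0 else "memory-bound"


# Illustrative epoch times in seconds.
print(classify_bound(77.0, 76.0))
print(classify_bound(40.0, 76.0))
print(classify_bound(90.0, 76.0))
```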

Runtime artifacts (story.md, foms.csv, STATUS.txt, hooks, SESSION_REPORT.md,
pathc-result.md) remain under /shared/aaji/models/HydraGNN/perf-runs/ and
are not committed; the catalog citation fields reference them by absolute
path.

Co-authored-by: Cursor <cursoragent@cursor.com>
…icrobench

Direct microbench on 12× lux-mi355x-a + 3× lux-mi355x-b (CPU/NUMA/kernel
inventory, HIP launch latency, NCCL single-node all_reduce, STREAM dual-NUMA,
1-node and 2-node HydraGNN sanity) shows a-nodes and b-nodes are physically
identical and the previously reported 17% gap was confounded — all three
"baseline" jobids ran on the same lux-mi355x-b[1-2] pair (verified via
nodes_list field). a-vs-b 2-node sanity now: a3+a4 = 77.0 s/epoch vs
b3+b4 = 78.4 s/epoch (1.8% a-pair faster, within noise).

One real per-node outlier: lux-mi355x-a5 — intermittent dual-NUMA STREAM
COPY at 179 GB/s vs cluster median 220 GB/s, recovers partially on retest.
Captured for the sysadmin alongside a6 (mount EIO) and a10 (PMIx ring).
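
Flagging a per-node outlier against the cluster median can be sketched as below. The 179 GB/s and 220 GB/s figures come from the text above. The node names, the other bandwidth values, and the 10% tolerance are assumptions for illustration, not what microbench_node_health.sh uses.

```python
from statistics import median
from typing import Dict


def flag_slow_nodes(bandwidth_gbs: Dict[str, float],
                    tolerance: float = 0.10) -> Dict[str, float]:
    """Return {node: fractional shortfall vs. the median} for every node whose
    bandwidth is more than `tolerance` below the cluster median."""
    med = median(bandwidth_gbs.values())
    return {
        node: (med - bw) / med
        for node, bw in bandwidth_gbs.items()
        if bw < med * (1.0 - tolerance)
    }


# Hypothetical survey; only 179 and 220 GB/s are taken from the report.
survey = {"node-1": 220.0, "node-2": 221.0, "node-3": 219.0,
          "node-4": 220.0, "node-5": 179.0}
flagged = flag_slow_nodes(survey)
for node, shortfall in flagged.items():
    print(f"{node}: {shortfall:.1%} below median")
```

Comparing against the median rather than a fixed number keeps the check meaningful across clusters with different memory configurations.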

Changes:
- examples/microbench_node_health.sh — new reusable node-health probe
  (mount-write + STREAM dual-NUMA + HIP launch latency + GPU/firmware inventory).
  ~30 s wall-clock, exit nonzero on failure — suitable for SLURM Prolog use.
- recipes/perf-optimizer-loop/lever_catalog.yaml — revise 4 entries
  (batch_800, rccl_high_priority, num_workers_12, precision_fp32) marked
  with "[REVISED 2026-05-26]" to remove the a-vs-b framing; the
  observations themselves are intact.
- .cursor/skills/ai4science-perf-analysis/SKILL.md — mark lessons 9, 10,
  16 as REVISED and add lessons 18 (the a-vs-b false-narrative root cause)
  and 19 (the per-node prologue probe argument). Add a5 to the
  Confirmed bad nodes list.

Runtime artefacts (REPORT.md + SYSADMIN_MESSAGE.md + raw probe outputs)
live under /shared/aaji/microbench-a-vs-b/ and are not committed.

Co-authored-by: Cursor <cursoragent@cursor.com>
Single-file orientation so a fresh agent can pick up mid-flight without
re-discovering anything. Captures: current best config (75.0 s/epoch on
2 nodes), prioritized next work (8/16-node scaling sweep, two upstream
HydraGNN PRs, second-user repro, Frontier-scale doc), landmines to avoid
(each cross-referenced to a SKILL lesson), and pointers to every artefact
(runtime dirs, microbench report, sysadmin message, ticket comment id).

Discovery paths:
- recipe README links to it at the top
- SKILL "Repository entry points" calls it out for the iterative-loop
  recipe
- keyword-rich title (HANDOFF, next agent, perf-optimizer, MI355X,
  dispatch-bound) so grep/glob surfaces it
- model.yaml's existing recipe_path: recipes/perf-optimizer-loop/
  already points the model-discovery flow to its folder

Co-authored-by: Cursor <cursoragent@cursor.com>
Remove username, site-specific paths (/shared/aaji, /home/aaji,
rad-vultr-login), cluster-specific node names (lux-mi355x-*),
partition/account names (lux, vultr_lux), and the now-refuted
a-nodes-vs-b-nodes 17% performance gap narrative from all committed
files.

- Replace hardcoded paths with $AI4S_SHARED_DIR throughout scripts
  and agent prompt files; add AI4S_SHARED_DIR required-or-fail guard
  in run_optimizer_loop.sh
- Replace partition/account SBATCH directives with generic defaults
- Rewrite lessons #9, #10, #16, #18 in SKILL.md to drop cluster-
  specific node-class framing; preserve the methodological lesson
  (always check nodes_list before drawing class conclusions)
- Delete HANDOFF.md (ephemeral session artifact, not repo material);
  add HANDOFF.md to .gitignore
- Remove Authors/contacts block from HANDOFF.md before deletion

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
@ashwinma force-pushed the aaji/perf-optimizer-loop-hydragnn branch from e426397 to 1c6ff07 on May 28, 2026 19:01
Add run_fom_extractor.py (FOM + TraceLens/Omnistat kernel_correlation) and
dispatch-attribution.md, plus lesson 20 in the perf-analysis skill on
VictoriaMetrics windowing and reading TraceLens ops_summary before escalating
to kernel-trace.

Co-authored-by: Cursor <cursoragent@cursor.com>