diff --git a/.cursor/skills/ai4science-perf-analysis/SKILL.md b/.cursor/skills/ai4science-perf-analysis/SKILL.md
new file mode 100644
index 0000000..aee6733
--- /dev/null
+++ b/.cursor/skills/ai4science-perf-analysis/SKILL.md
@@ -0,0 +1,67 @@
+---
+name: ai4science-perf-analysis
+description: Runs the AMD AI agents bottleneck-analysis workflow on a HydraGNN (or other ai4science) training job using TraceLens + Omnistat user-mode + paired analyst/verifier subagents. Use when the user wants to "analyze a 2-node HydraGNN run", "find bottlenecks with TraceLens/Omnistat", or "run the AMD AI agents on a job".
+---
+
+# AI4Science perf-analysis (multi-subagent bottleneck workflow)
+
+## When this skill applies
+
+The user wants an automated bottleneck analysis of a multi-node training/inference run using AMD's open-source observability tooling (TraceLens, Omnistat). This is **distinct** from the `ai4science-run-models` skill — that one launches a model; this one diagnoses a model after it runs.
+
+Default target: HydraGNN on Lux MI355X. The same pattern can extend to ORBIT-2 / StormCast / GP-MoLFormer in iteration 2.
+
+## Repository entry points
+
+- Recipe: [material_science/models/HydraGNN/recipes/perf-analysis/](../../../material_science/models/HydraGNN/recipes/perf-analysis/)
+- Sbatch wrapper: [material_science/models/HydraGNN/examples/sbatch_train_perf_amd.sh](../../../material_science/models/HydraGNN/examples/sbatch_train_perf_amd.sh)
+- Agent prompt files: `material_science/models/HydraGNN/recipes/perf-analysis/agents/*.md`
+
+## Orchestration loop (what the main agent does)
+
+1. **Read the recipe README** to refresh on the run topology and artifact layout.
+2. **Dispatch the launcher subagent** (Task tool, `generalPurpose` or `shell`) with the prompt at `agents/launcher.md`. Wait for the job to complete and the manifest to be written.
+3. **Dispatch the two analyst subagents in parallel** (one Task call per subagent in the same message):
+   - `tracelens_analyst` — prompt at `agents/tracelens_analyst.md`
+   - `omnistat_analyst` — prompt at `agents/omnistat_analyst.md`
+4. **Dispatch the two verifier subagents in parallel** once analysts return:
+   - `tracelens_verifier`
+   - `omnistat_verifier`
+5. **Dispatch the synthesizer subagent**, hand it both `verified_claims.json` files.
+6. **Print the combined report** to the user with a short executive summary in chat.
+
+## Subagent contract (every subagent must)
+
+- Read **only** the files listed in its `## Inputs` section.
+- Produce **only** the file(s) listed in its `## Outputs` section.
+- Write a single line `STATUS=ok|partial|fail; reason=<short>` to stdout as the last line.
+- Never escalate to multi-node `srun`; verifiers may use **at most one** 1-node `srun -p lux -A vultr_lux -N1 --time=00:05:00` interactive probe.
+- Never edit files outside `/shared/aaji/models/HydraGNN/perf-runs/<jobid>/`.
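+
+An orchestrator can check the status line mechanically. The following is a minimal sketch that assumes the exact `STATUS=...; reason=...` shape from the contract above; `parse_status` is an illustrative helper name, not something the recipe ships.
+
+```python
+import re
+
+# Matches the contract line: STATUS=ok|partial|fail; reason=<short>
+_STATUS_RE = re.compile(r"^STATUS=(ok|partial|fail); reason=(.*)$")
+
+
+def parse_status(stdout_text):
+    """Return (status, reason) from the last non-empty stdout line, or None."""
+    lines = [ln.strip() for ln in stdout_text.splitlines() if ln.strip()]
+    if not lines:
+        return None
+    match = _STATUS_RE.match(lines[-1])
+    return (match.group(1), match.group(2)) if match else None
+
+
+print(parse_status("some log output\nSTATUS=partial; reason=trace missing\n"))
+```
+
+Returning `None` for a missing or malformed line lets the orchestrator treat a contract violation as its own failure case instead of guessing a status.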
+
+## Cluster constraints (Lux)
+
+- Partition: `lux`. Account: `vultr_lux`. Read from [.cluster-config.yaml](../../../.cluster-config.yaml).
+- Central Prometheus at `atlvrmonad01:9090` is **unreachable from compute nodes** (TCP RST, IP allowlist). Do **not** point omnistat-query at it. Always use the user-mode VictoriaMetrics that the launcher started.
+- System-mode Omnistat is running on every compute node at port 8001. Don't touch it; we run our own user-mode collector alongside.
+- The login node (`rad-vultr-login`) has no GPU and no MPI. All analysis runs after the job — no live profiling on the login node.
+
+## When something goes wrong
+
+- If the launcher times out waiting for `sacct` to report a terminal state, it must run `scontrol show job <id>` and write `state=TIMEOUT_WAITING` in the manifest; the orchestrator should surface this to the user and stop.
+- If TraceLens or omnistat-inspect can't be installed (network, py-version), the launcher's install step writes a clear error to stderr and exits non-zero. The orchestrator must NOT fall back to "skip analysis"; it must surface the error.
+- If a verifier refutes an analyst's top claim, it stays in the report (with `verdict=refuted`) so future iterations don't re-derive the same wrong conclusion.
+
+## Lessons captured (smoke-test on JOB 6762)
+
+The first 2-node end-to-end run on Lux exposed five real bugs that are now all worked around in [`sbatch_train_perf_amd.sh`](../../../material_science/models/HydraGNN/examples/sbatch_train_perf_amd.sh) and the agent prompts:
+
+1. **`%(SLURM_JOB_ID)s` in omnistat config is broken** — `configparser` doesn't interpolate `os.environ`. Use `@JOB_DIR@` placeholder and `sed` at submit time.
+2. **`/tmp/omni_rmsjobinfo` permission collision with system-mode Omnistat** — override `job_detection_file` to a per-job path under `/shared/aaji/...`.
+3. **VictoriaMetrics on login node needs `-fs.disableMmap`** to load even a 1.6 MB DB (cgroup mmap restriction).
+4. **HydraGNN `Profile` block must be under `NeuralNetwork`**, not at the top level — `train_validate_test()` is called with `config["NeuralNetwork"]` so that's the scope its Profiler reads.
+5. **PromQL with `jobid="..."` requires joining via `rmsjob_info`** — `rocm_*` metrics don't carry a `jobid` label directly. Verifier subagents must use:
+   ```promql
+   metric * on (instance) group_left() (max by (instance) (rmsjob_info{jobid="..."}))
+   ```
+
+The full list (with cross-cutting context) is at the end of [.cursor/skills/ai4science-studio/SKILL.md](../ai4science-studio/SKILL.md). When fixing a bug here, check there first and propagate the lesson.
diff --git a/.cursor/skills/ai4science-studio/SKILL.md b/.cursor/skills/ai4science-studio/SKILL.md
index 7355375..d9b203d 100755
--- a/.cursor/skills/ai4science-studio/SKILL.md
+++ b/.cursor/skills/ai4science-studio/SKILL.md
@@ -349,6 +349,58 @@ When using `srun --mpi=pmix` with Apptainer, ranks may segfault with `PMIx ERROR
 --env PMIX_MCA_psec=native
 ```
 
+## Never `import` mpi4py-bearing modules from a rank-0-only python before the real `srun`
+
+If a model script does `from mpi4py import MPI` at import time (most do — directly, or transitively via the model's `__init__.py`), the import calls `MPI_Init()` because `mpi4py.rc.initialize=True` is the default. Under `srun --mpi=pmix`, `MPI_Init()` is a **collective rendezvous**: every rank in the SLURM step must arrive at it together before any rank can return.
+
+**Anti-pattern that deadlocks the job:**
+```bash
+# Inside rank.sh — runs once per SLURM rank, 0..N-1
+if [[ "${SLURM_PROCID:-0}" == "0" ]]; then
+  python3 - <<'PYEOF'      # short-lived rank-0-only python
+import hydragnn       # ← triggers mpi4py auto-MPI_Init on rank 0 alone
+print("[verify] ...")
+PYEOF
+fi
+
+exec python -u -c "..."    # the real distributed python on every rank
+```
+
+**What goes wrong:** rank 0 enters `MPI_Init()` from the verify python and waits there. Ranks 1..N-1 reach `MPI_Init()` from the main python a moment later. PMIx counts rank 0 as already-arrived for *that* rendezvous, so the verify python returns and exits — taking rank 0's MPI participation with it. When rank 0's main python then calls `MPI_Init()` again, PMIx no longer matches it with ranks 1..N-1, and the whole job hangs in `dist.init_process_group → _create_c10d_store` with no useful error message until the 30-minute TCPStore timeout.
+
+**Symptoms (py-spy on a stuck rank):**
+```
+_create_c10d_store (torch/distributed/rendezvous.py:200)
+_env_rendezvous_handler (torch/distributed/rendezvous.py:281)
+init_process_group (torch/distributed/distributed_c10d.py:1806)
+setup_ddp (hydragnn/utils/distributed/distributed.py:254)
+```
+…on every rank except rank 0, which has died or is in a different MPI_Init that will never match.
+
+**Correct patterns (pick one):**
+
+1. **Move the verify inside the main python**, gated on `SLURM_PROCID == 0`. Same information gets logged, but mpi4py's `MPI_Init` happens once per rank, in lockstep:
+   ```bash
+   exec python -u -c "
+   import os
+   if os.environ.get('SLURM_PROCID', '0') == '0':
+       import inspect, hydragnn
+       print('[verify]', hydragnn.__file__, flush=True)
+   # ...rest of training driver...
+   "
+   ```
+
+2. **Disable mpi4py auto-init at the top of the verify script** (only safe if the verify never actually calls any MPI op):
+   ```python
+   import mpi4py
+   mpi4py.rc.initialize = False
+   import hydragnn  # now safe to import without joining a rendezvous
+   ```
+
+3. **Print the verify info on rank 0 from the launching shell only**, never from a python that imports mpi4py-bearing modules.
+
+This bit us once on the HydraGNN perf-analysis recipe (see git log on `aaji/perf-agents-hydragnn`). Watch for it whenever you add `python3 -c "..."` scaffolding around an `srun` that uses `--mpi=pmix`.
+
 ---
 
 # Docker-specific patterns
@@ -464,3 +516,239 @@ export LD_LIBRARY_PATH="/opt/venv/lib/python3.12/site-packages/torch/lib:${LD_LI
 ```
 
 Only use `docker run -e` for variables you are **introducing** (e.g. `-e ORBIT2_ROOT=/orbit2`). Never use it for `LD_LIBRARY_PATH`.
+
+## Performance-analysis subagent workflow (TraceLens + Omnistat user-mode)
+
+This pattern lives in [material_science/models/HydraGNN/recipes/perf-analysis/](../../../material_science/models/HydraGNN/recipes/perf-analysis/) and is documented in detail in [.cursor/skills/ai4science-perf-analysis/SKILL.md](../ai4science-perf-analysis/SKILL.md). Lessons that apply to **any** ai4science model that wants the same workflow:
+
+### 1. ConfigParser does NOT interpolate `os.environ`
+
+Upstream Omnistat configs (`omnistat.ornl`, `omnistat.frontier`) include lines like:
+```ini
+victoria_datadir = /lustre/.../%(SLURM_JOB_ID)s
+```
+
+This **does not work** with stock omnistat-usermode because `omnistat.utils.readConfig` uses plain `configparser.ConfigParser()`, whose `BasicInterpolation` only resolves keys defined inside the config file (DEFAULT section or local section), **not** environment variables. Calling `.get()` on a key with `%(SLURM_JOB_ID)s` raises `InterpolationMissingOptionError`.
+
+**Fix:** Use a `@JOB_DIR@`-style placeholder in the template and run `sed` from the sbatch wrapper at submit time:
+```bash
+sed -e "s|@JOB_DIR@|${HG_OUTPUT_DIR}|g" omnistat-lux.config.template > "${HG_OUTPUT_DIR}/omnistat.config"
+```
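+
+The failure and the fix can both be reproduced with the standard library alone. The sketch below is illustrative: the section name, key, and paths are placeholders chosen for the demo, and the `str.replace` call stands in for the `sed` substitution.
+
+```python
+import configparser
+
+# Illustrative config text; section name and paths are placeholders.
+BROKEN = "[omnistat.usermode]\nvictoria_datadir = /tmp/example/%(SLURM_JOB_ID)s\n"
+TEMPLATE = "[omnistat.usermode]\nvictoria_datadir = @JOB_DIR@/victoria\n"
+
+
+def read_datadir(text):
+    """Parse config text with a plain ConfigParser and return victoria_datadir."""
+    parser = configparser.ConfigParser()
+    parser.read_string(text)
+    return parser.get("omnistat.usermode", "victoria_datadir")
+
+
+try:
+    read_datadir(BROKEN)
+    broken_result = "resolved"
+except configparser.InterpolationMissingOptionError:
+    # BasicInterpolation looks the key up inside the file, never in os.environ.
+    broken_result = "InterpolationMissingOptionError"
+
+# Same substitution the sed line performs, applied before parsing.
+rendered = TEMPLATE.replace("@JOB_DIR@", "/tmp/example/job-6762")
+good_result = read_datadir(rendered)
+print(broken_result, good_result)
+```
+
+Because the placeholder is resolved before the parser ever sees the text, the rendered file contains no `%(...)s` token for `BasicInterpolation` to trip over.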
+
+### 2. omnistat-usermode `/tmp/omni_rmsjobinfo` permission collision
+
+If the cluster runs system-mode Omnistat as user `omnidc` (or any non-root, non-self user), `/tmp/omni_rmsjobinfo` will already exist and be owned by that user. omnistat-usermode's `omnistat-rms-env` step then fails with `PermissionError` on the very first `--start`, and **exporters never launch** (silent: VictoriaMetrics is up but DB stays empty).
+
+**Fix:** override the path in the user config:
+```ini
+[omnistat.collectors.rms]
+job_detection_mode = file-based
+job_detection_file = /shared/aaji/.../omni_rmsjobinfo
+step_detection_file = /shared/aaji/.../omni_rmsjobinfo_step
+```
+
+Path must be on a shared FS reachable by every compute node in the allocation (NFS/Lustre/etc.) because each node's `omnistat-rms-env` runs via `srun -N $NODES --ntasks-per-node=1` and writes to that path.
+
+### 3. VictoriaMetrics needs `-fs.disableMmap` on the login node
+
+Lux login node `vm.max_map_count` is 1048576 (high), but the cgroup memory limit is 8 GB, and even the small (1.6 MB) per-job Omnistat DB cannot mmap successfully — VM panics with:
+```
+FATAL: cannot mmap "/.../index.bin": cannot mmap file with size 4096 bytes; already memory mapped files: 0: no such device
+```
+
+**Fix:** add `-fs.disableMmap` to the VictoriaMetrics command line whenever you start VM on the login node to read a job-scoped DB:
+```bash
+victoria-metrics-prod -storageDataPath=$DB -fs.disableMmap -httpListenAddr=127.0.0.1:8428 ...
+```
+
+### 4. PromQL: filter by jobid via `rmsjob_info` join, not direct label
+
+`rocm_*` metrics from the omnistat collector do **not** carry a `jobid` label themselves. The omnistat-inspect query layer joins them with `rmsjob_info` per-instance:
+```promql
+avg(rocm_utilization_percentage * on (instance) group_left() (max by (instance) (rmsjob_info{jobid="6762"})))
+```
+
+A naive `rocm_utilization_percentage{jobid="6762"}` returns 0 series even though the data is there. Verifier subagents that re-derive metrics from raw curl must use the join pattern.
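+
+A verifier that builds its own queries can generate the join instead of hand-writing it. This is a minimal sketch: `jobid_join` and `query_url` are hypothetical helper names, the base URL is the local VictoriaMetrics address used elsewhere in this document, and the timestamp is an arbitrary example value.
+
+```python
+from urllib.parse import urlencode
+
+
+def jobid_join(metric, jobid):
+    """Wrap a metric in the rmsjob_info join so only samples from jobid remain."""
+    selector = 'rmsjob_info{jobid="%s"}' % jobid
+    return "%s * on (instance) group_left() (max by (instance) (%s))" % (metric, selector)
+
+
+def query_url(base, promql, epoch_seconds):
+    """Build an instant-query URL for the Prometheus-compatible HTTP API."""
+    return base + "/api/v1/query?" + urlencode({"query": promql, "time": epoch_seconds})
+
+
+expr = "avg(" + jobid_join("rocm_utilization_percentage", "6762") + ")"
+url = query_url("http://127.0.0.1:8428", expr, 1700000000)
+print(expr)
+print(url)
+```
+
+Letting `urlencode` escape the expression avoids quoting mistakes with the braces, parentheses, and double quotes when the URL is later passed to `curl`.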
+
+### 5. HydraGNN `Profile` block goes under `NeuralNetwork`, not the top level
+
+`hydragnn.train.train_validate_test()` is invoked with `config["NeuralNetwork"]` and reads `config["Profile"]` inside that scope:
+```python
+profiler = Profiler("./logs/" + model_with_config_name)
+if "Profile" in config:
+    profiler.setup(config["Profile"])
+```
+
+So injecting `{"Profile": {"enable": 1, "target_epoch": 1}}` at the top of `gfm_mlip.json` does **nothing** — the profiler stays disabled. The block must be inside `NeuralNetwork`:
+```python
+cfg.setdefault("NeuralNetwork", {})["Profile"] = {"enable": 1, "target_epoch": epoch}
+```
+
+(This is specific to the HydraGNN multi-dataset HPO example; other HydraGNN entrypoints may pass the full config and behave differently. Always trace where the entrypoint reads `Profile` before injecting.)
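+
+The scoping can be checked without running HydraGNN. In the sketch below the JSON literal is a stand-in for the parsed config file, and `inject_profile` is an illustrative wrapper around the one-line injection shown above.
+
+```python
+import json
+
+
+def inject_profile(cfg, epoch):
+    """Place the Profile block under NeuralNetwork, where this entrypoint reads it."""
+    cfg.setdefault("NeuralNetwork", {})["Profile"] = {"enable": 1, "target_epoch": epoch}
+    return cfg
+
+
+# Stand-in for the parsed contents of a config such as gfm_mlip.json.
+cfg = json.loads('{"NeuralNetwork": {"Training": {"num_epoch": 2}}}')
+inject_profile(cfg, epoch=1)
+
+nn_scope = cfg["NeuralNetwork"]  # the sub-dict that train_validate_test() receives
+print("Profile" in nn_scope, "Profile" in cfg)
+```
+
+The second membership test is the one that matters: a `Profile` key at the top level would never be visible inside the `NeuralNetwork` scope.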
+
+### 6. Two-phase analyst+verifier pattern
+
+Every claim from the analyst subagent (TraceLens or omnistat-inspect output) is independently re-derived by a verifier subagent from raw data:
+- TraceLens claims → re-derived by parsing the raw `.pt.trace.json` event stream.
+- Omnistat claims → re-derived via raw `curl` PromQL against the same VictoriaMetrics endpoint.
+
+This catches both tool bugs and analyst hallucinations. Refuted claims stay in `verified_claims.json` (so future iterations don't re-derive the same wrong conclusion); the synthesizer drops only `verdict=refuted` from the final report. In the smoke test, all 6 claims verified within tolerance, but the structure is what catches a future flaky case.
+
+### 7. Datetimes from omnistat-inspect are UTC
+
+`job_info.json` returns `start_time`/`end_time` as ISO-8601 strings without a `Z` or offset — they are **UTC**. Verifier subagents that build PromQL `start`/`end` from these MUST use `calendar.timegm(time.strptime(s, "%Y-%m-%dT%H:%M:%S"))`, **not** `time.mktime` (which interprets local time and produces wrong epoch on most clusters). Wrong epoch → empty PromQL result series → false "no data" finding.
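+
+A minimal sketch of the conversion, assuming the timestamp format quoted above. `utc_epoch` uses the `calendar.timegm` route named in this lesson; `utc_epoch_dt` is an equivalent `datetime` variant added only for comparison. Both function names are illustrative.
+
+```python
+import calendar
+import time
+from datetime import datetime, timezone
+
+FMT = "%Y-%m-%dT%H:%M:%S"
+
+
+def utc_epoch(stamp):
+    """Interpret a naive ISO-8601 string as UTC and return epoch seconds."""
+    return calendar.timegm(time.strptime(stamp, FMT))
+
+
+def utc_epoch_dt(stamp):
+    """Same conversion through datetime, attaching UTC explicitly."""
+    return int(datetime.strptime(stamp, FMT).replace(tzinfo=timezone.utc).timestamp())
+
+
+start = utc_epoch("2024-01-01T00:00:00")
+print(start, utc_epoch_dt("2024-01-01T00:00:00"))
+```
+
+Neither function consults the local timezone, so the result is the same on a login node set to local time and on a compute node set to UTC.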
+
+### 8. Never read configs from a `/shared` "staged copy" — always from the in-repo template
+
+We hit this once: the omnistat config under `/shared/aaji/models/HydraGNN/perf-runs/omnistat-lux.config` was a stale copy of the in-repo `omnistat-lux.config.template`, with `enable_rocprofiler = False` left over from an early v1 decision. The in-repo template had since been updated to `enable_rocprofiler = True` with the `hbm_flops_f64` profile, but the staged copy was never refreshed, so two perf-analysis runs (jobs 6762 and 6840) silently collected zero hardware counters (no HBM bandwidth, no fp64 VALU/MFMA FLOPs). The TraceLens analytical roofline still ran fine, masking the gap until someone asked "where are the HBM counters?"
+
+Two compounding mistakes:
+1. The sbatch script defaulted `OMNISTAT_TEMPLATE` to the `/shared` staged path, not the in-repo path.
+2. There was no run-time signal that rocprofiler was off — the run banner printed `Omnistat tmpl: <path>` but not the rocprofiler state of that template.
+
+**Fix pattern for any model recipe that reads a config template**:
+- Default the path to the in-repo source of truth: `${SCRIPT_DIR}/../recipes/<recipe>/<file>.template`. Only override via env var for ad-hoc probes.
+- After resolving the template, parse the critical knobs and **print them in the run banner**, with a `WARN` line if any expected counter group is disabled. Example:
+```bash
+_ROCPROF_STATE=$(awk -F'[= \t]+' '/^enable_rocprofiler/ {print $2; exit}' "$OMNISTAT_TEMPLATE")
+echo "  Rocprofiler  : enable_rocprofiler=${_ROCPROF_STATE:-?}"
+[[ "$_ROCPROF_STATE" != "True" ]] && echo "  WARN: rocprofiler counters DISABLED — HBM/FLOP counters NOT collected" >&2
+```
+- Don't `exit` on it: a config-only probe that intentionally turns rocprofiler off is legitimate. Just make it impossible to be silently wrong.
+
+**Detection rule for any future template**: if a config file lives **both** in `<repo>/.../<recipe>/*.template` and on a shared filesystem (`/shared`, `/lustre`, etc.), assume the shared copy is stale until proven otherwise. Either delete the staged copy or refresh it from the repo before submitting.
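+
+One way to apply the detection rule before submitting is a byte-for-byte comparison of the two files. The sketch below is self-contained and uses temporary stand-in files; `staged_copy_is_stale` is a hypothetical helper, and it assumes the staged file is meant to be an unmodified copy of the template.
+
+```python
+import filecmp
+import os
+import tempfile
+
+
+def staged_copy_is_stale(repo_template, staged_copy):
+    """True when the staged file's bytes differ from the in-repo template."""
+    return not filecmp.cmp(repo_template, staged_copy, shallow=False)
+
+
+# Self-contained demo with temporary stand-ins for the two files.
+with tempfile.TemporaryDirectory() as tmp:
+    repo = os.path.join(tmp, "omnistat-lux.config.template")
+    staged = os.path.join(tmp, "omnistat-lux.config")
+    with open(repo, "w") as fh:
+        fh.write("enable_rocprofiler = True\n")
+    with open(staged, "w") as fh:
+        fh.write("enable_rocprofiler = False\n")
+    stale = staged_copy_is_stale(repo, staged)
+
+    # Refreshing the staged copy from the template clears the flag.
+    with open(staged, "w") as fh:
+        fh.write("enable_rocprofiler = True\n")
+    fresh = not staged_copy_is_stale(repo, staged)
+
+print(stale, fresh)
+```
+
+`shallow=False` forces a content comparison, which matters here because a stale copy can easily have a newer modification time than the template it was copied from.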
+
+### 9. Omnistat rocprofiler-sdk extension is a separate C++ build (not a pip extra)
+
+`pip install -e '.[query]'` does NOT build the rocprofiler-sdk Python binding even with `BUILD_ROCPROFILER_SDK_EXTENSION=1` exported, because pip skips the build step on a re-install of an editable package. Symptom: `enable_rocprofiler = True` in the config + `omnistat-monitor` exporter dies silently with `Missing ROCProfiler-SDK extension: build with installation required` and `ModuleNotFoundError: No module named 'rocprofiler_sdk'`.
+
+The doc-recommended path for editable installs (`docs/installation/extensions.md` upstream) is:
+```bash
+cd /path/to/omnistat-src
+BUILD_ROCPROFILER_SDK_EXTENSION=1 python setup.py build_ext --inplace
+```
+This must run on a **compute node** (where `/opt/rocm/include/rocprofiler-sdk/` headers and `/opt/rocm/share/rocprofiler-sdk/counter_defs.yaml` are present — login nodes often have neither). Build deps: `cmake-build-extension`, `nanobind`, `cmake>=3.15`, ninja, a C++20 compiler. Output is one `.so` file named `omnistat/rocprofiler_sdk_extension.cpython-3*-x86_64-linux-gnu.so`. Verify with `python3 -c "from omnistat.rocprofiler_sdk_extension import get_samplers, initialize"`.
+
+Imports note: omnistat's collector imports `from omnistat.rocprofiler_sdk_extension import ...`, NOT `import rocprofiler_sdk`. A naive sanity check that imports the upstream `rocprofiler_sdk` Python package will always fail (that package is unrelated, ships with ROCm, and is not on the omnistat venv path).
+
+### 10. gfx950 (MI355X) rocprofiler counter register limits — multiple counters in same hardware block fail
+
+ROCm rocprofiler-sdk on gfx950 returns
+```
+create profile failed with error code 38: Request exceeds the capabilities of the hardware to collect
+```
+when a single profile asks for more counters than fit in the per-block register file. Per-block limits are not documented uniformly; we determined empirically on `lux-mi355x-b1` (ROCm 7.2.2):
+
+- **TCC (L2 cache) block**: **`FETCH_SIZE` AND `WRITE_SIZE` together fail.** Either one alone works (with up to 4 SQ_INSTS_VALU_*_F64 counters in the same profile).
+- **SQ (scalar/vector unit) block**: at least 4 SQ_INSTS_VALU_*_F64 counters fit in one profile.
+- Workaround: use `sampling_mode = periodic` and `counters = [[set0], [set1]]` to time-multiplex — but see lesson 11.
+
+Always probe interactively first (1 node `salloc`, run `omnistat-monitor --configfile <test>` foreground for 8 s, grep stderr for "create profile failed") before submitting an sbatch with new counters; the omnistat user-mode launcher swallows the error from the spawned exporter and the only symptom from the outside is "Missing exporter on <host>".
+
+### 11. Omnistat `sampling_mode = periodic` only fires the first set in this build
+
+In `omnistat 1.12.0+git.65ea9ac` (PR #271 + main merged) on `lux-mi355x-b1` with rocprofiler-sdk freshly built, `sampling_mode = periodic` with two counter sets only fires the FIRST set. Every subsequent set returns identically zero across all cards and timestamps. Half the intended counter coverage is silently lost.
+
+We surfaced this with the `[[FETCH+4×SQ_F64], [WRITE+4×SQ_F64]]` config: FETCH and MFMA_MOPS samples were healthy (~50% non-zero, ~7% non-zero respectively), but WRITE and FMA/ADD/MUL were uniformly zero. Workaround: use `sampling_mode = constant` and a single counter set that fits in the per-block hardware limits — for fp64 + HBM-read on gfx950 that's:
+```ini
+[omnistat.collectors.rocprofiler.fp64_and_hbm_read]
+sampling_mode = constant
+counters = ["FETCH_SIZE", "SQ_INSTS_VALU_ADD_F64", "SQ_INSTS_VALU_MUL_F64", "SQ_INSTS_VALU_FMA_F64", "SQ_INSTS_VALU_MFMA_MOPS_F64"]
+```
+Trade-off: lose `WRITE_SIZE` (HBM-write bandwidth) until the periodic-mode bug is fixed upstream. For most ML training workloads, FETCH_SIZE is the dominant traffic anyway.
+
+### 12. Three rocprofiler-sdk surfaces — pick the right one
+
+Omnistat exposes **two** independent rocprofiler-sdk integrations, and there's a **third** path that bypasses omnistat entirely. They sample disjoint things and can mostly co-exist; pick by the question you're answering.
+
+| Surface | What you get | Cardinality / cost | Build artifact | Switch in our stack |
+|---|---|---|---|---|
+| **Device counting** (omnistat collector, `enable_rocprofiler=True`) | Whole-card hardware counters: FETCH_SIZE, fp64 VALU/MFMA, etc. Sampled at 1 Hz, scraped by the omnistat exporter. | One time-series per (card, counter). Constant-mode is essentially free. | `omnistat/rocprofiler_sdk_extension.cpython-*.so` (Python binding, built in-place from `rocprofiler-sdk/CMakeLists.txt` with `BUILD_KERNEL_TRACE_LIB=OFF`) | `enable_rocprofiler = True` in `omnistat-lux.config.template` (default). |
+| **Kernel tracing** (omnistat collector, `enable_kernel_trace=True`) | Per-kernel-name dispatch count and total duration on every card, every node. Aggregated in 1 s bins, POSTed via HTTP to the omnistat exporter. | One time-series per (card, kernel-name) — can be 1k+ unique kernel names per training step. Light overhead per dispatch (microseconds), but cardinality blows up VictoriaMetrics if left on for hours. | `build-trace/libomnistat_trace.so` (standalone C++ tool library, built from same CMake with `-DBUILD_KERNEL_TRACE_LIB=ON`) | `OMNISTAT_KERNEL_TRACE=1` env-gate in `sbatch_train_perf_amd.sh` — flips `enable_kernel_trace=True` in the rendered config and exports `ROCP_TOOL_LIBRARIES=<path-to-libomnistat_trace.so>` into the apptainer exec. |
+| **`rocprofv3` standalone trace** (no omnistat) | Full kernel + memcpy + HSA API trace on a single rank. JSON / Perfetto / CSV output you read with TraceLens or chrome://tracing. | One trace file per rank, multi-MB per second of training. Heaviest of the three. | None — uses `/opt/rocm/bin/rocprofv3` directly. | Wrap a single rank's command in `rocprofv3 --kernel-trace -o ./rank0_trace -- python ...`. We don't run this in our standard sbatch yet. |
+
+**When to use which:**
+
+- **Device counting only** → multi-node, multi-hour runs. You want HBM bandwidth and fp64 FLOP rate per card across the whole job, with telemetry that survives a 2-hour wallclock without filling the DB.
+- **Kernel tracing on top of device counting** → short probe runs (≤ 5 minutes, ≤ 1 node) when you need to know *which kernel* dominated time without rank-0-only kineto traces. Combine with device counting to correlate per-kernel duration against whole-card FLOP rate.
+- **`rocprofv3` instead of omnistat** → deepest single-rank kernel detail (per-dispatch timing, register usage, kernel arguments). Use when omnistat's aggregate-by-name view loses the signal you need (e.g. tail latency on one specific dispatch). Don't run it across all ranks; the file size will eat the NFS share.
+
+**Cross-runtime gotcha:** `enable_kernel_trace=True` without the tool library loaded does *nothing* — the omnistat collector just sits with an empty endpoint. The `OMNISTAT_KERNEL_TRACE=1` switch in `sbatch_train_perf_amd.sh` enforces this with a hard `exit 2` if `libomnistat_trace.so` isn't found at the configured path; do **not** loosen that check, otherwise you re-create the exact "silent zero counters" failure mode from §8.
+
+**Build location:** Both omnistat artifacts (the Python extension and the kernel-trace tool library) must be built on a **compute node** inside the same SIF the workload runs in — login nodes don't have apptainer or the ROCm headers, and a host-side build will pick up the wrong libcurl / glibc. Standard build:
+
+```bash
+salloc -p lux -A vultr_lux -N 1 --time=00:15:00 --gpus-per-node=1 \
+  apptainer exec --rocm \
+    /shared/aaji/images/pytorch_rocm7.2.2_ubuntu24.04_py3.12_pytorch_release_2.10.0.sif \
+    bash -c 'cd /shared/aaji/tools/omnistat-src && \
+      cmake -S rocprofiler-sdk/ -B build-trace/ -DBUILD_KERNEL_TRACE_LIB=ON && \
+      cmake --build build-trace/ -j 8'
+```
+
+Verify with `ls -la /shared/aaji/tools/omnistat-src/build-trace/libomnistat_trace.so` (expect a file of several hundred kB, not the tiny 4 kB stub you get when CMake fails halfway).
+
+**Lux SIF build prerequisites (one-time gotchas):** the `pytorch_rocm7.2.2_ubuntu24.04_py3.12_pytorch_release_2.10.0.sif` ships **without** `cmake` and **without** libcurl headers (only the `.so.4` runtime). Both omnistat artifacts need cmake; the kernel-trace library additionally needs `<curl/curl.h>`. Working build recipe inside the SIF:
+
+1. `python -m venv /tmp/build_venv && /tmp/build_venv/bin/pip install cmake nanobind zstandard` — installs `cmake` 4.x as a wheel, plus `zstandard` for unpacking modern Ubuntu `.deb` files (which use `data.tar.zst`).
+2. Fetch `libcurl4-openssl-dev_*.deb` from `archive.ubuntu.com/ubuntu/pool/main/c/curl/`, `ar x` it, decompress `data.tar.zst` with `zstandard.ZstdDecompressor` in Python, untar to a tmp prefix.
+3. Pass `-DCURL_INCLUDE_DIR=<prefix>/usr/include/x86_64-linux-gnu -DCURL_LIBRARY=/usr/lib/x86_64-linux-gnu/libcurl.so.4` to cmake configure. The library finds rocprofiler-sdk via `/opt/rocm/lib/cmake/rocprofiler-sdk/`.
+
+Verified on `lux-mi355x-b1` job 7030: 533 kB `libomnistat_trace.so`, ELF64, all rocprofiler-sdk symbols resolved, env vars `OMNISTAT_TRACE_{MAX_INTERVAL,BUFFER_SIZE,ENDPOINT_PORT,LOG}` baked in. `apt-get download` does **not** work in the SIF — apt sources aren't configured — so don't waste a node on it.
+
+**Endpoint port alignment:** the kernel-trace **collector** registers `/kernel_trace` on the same Flask app as the rest of the omnistat exporter, so it listens on `[omnistat.collectors] port` (8101 in our config). The kernel-trace **tool library**'s default `OMNISTAT_TRACE_ENDPOINT_PORT` is 8001, which is wrong for our setup — they must match or every dispatch record gets a connection-refused. The sbatch wrapper auto-derives the port from the rendered config; do not hard-code 8001.
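+
+The port derivation amounts to reading one key from the rendered config. The sbatch wrapper does this in bash; the Python sketch below shows the same idea under the assumption that the rendered file has a `port` key in `[omnistat.collectors]`, as described above. `collector_port` is an illustrative name and the config text is a stand-in.
+
+```python
+import configparser
+
+
+def collector_port(config_text):
+    """Return the exporter port from the [omnistat.collectors] section."""
+    parser = configparser.ConfigParser()
+    parser.read_string(config_text)
+    return parser.getint("omnistat.collectors", "port")
+
+
+# Stand-in for the rendered omnistat.config.
+rendered = "[omnistat.collectors]\nport = 8101\n"
+
+# Environment the tool library should see so both sides agree on the port.
+trace_env = {"OMNISTAT_TRACE_ENDPOINT_PORT": str(collector_port(rendered))}
+print(trace_env)
+```
+
+Reading the value from the rendered file, instead of repeating the number in two places, keeps the collector and the tool library aligned if the config port ever changes.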
+
+### 13. Empty-string env passthrough breaks `os.getenv() is not None` checks (HydraGNN-class trap)
+
+HydraGNN's `train_validate_test.py` and `preprocess/load_data.py` detect optional knobs with the idiom
+
+```python
+if os.getenv("HYDRAGNN_NUM_WORKERS") is not None:
+    num_workers = int(os.environ["HYDRAGNN_NUM_WORKERS"])
+```
+
+This **fails** when the variable is set to the empty string: `getenv() is not None` returns `True`, and `int("")` raises `ValueError`. Our older `--env KEY="${KEY:-}"` passthrough pattern passes empty strings whenever the wrapper's caller didn't set the var, which masquerades as "set but empty" inside the container.
+
+**Symptom (probe job 7033 on this repo):** all 8 ranks crash on the second line of `create_dataloaders` with `ValueError: invalid literal for int() with base 10: ''` before any kernel ever launches.
+
+**Fix:** build a separate array of `--env` flags that are only appended when the value is non-empty:
+
+```bash
+OPT_ENVS=()
+for k in HYDRAGNN_NUM_WORKERS HYDRAGNN_PERSISTENT_WORKERS HYDRAGNN_MAX_NUM_BATCH; do
+  v="${!k:-}"
+  [[ -n "$v" ]] && OPT_ENVS+=( --env "${k}=${v}" )
+done
+
+srun ... apptainer exec ... "${OPT_ENVS[@]}" "$HG_SIF" bash "$RANK_SCRIPT"
+```
+
+Applied to both `sbatch_train_amd.sh` and `sbatch_train_perf_amd.sh` for HydraGNN. **General lesson** for any model recipe: never use `--env KEY="${KEY:-}"` for variables that the workload reads with `os.getenv(K) is not None` — either pass a numeric default (`--env KEY="${KEY:-0}"`) or use the conditional-array pattern above. Audit every `*/models/*/examples/*.sh` against this when introducing new optional env knobs.
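+
+On the workload side, the trap can also be avoided by treating an empty value as unset. This is a sketch of a defensive reader, not HydraGNN's actual code; `env_int` is a hypothetical helper.
+
+```python
+import os
+
+
+def env_int(name, default=None):
+    """Return the variable as an int; treat unset and empty/blank values as unset."""
+    raw = os.environ.get(name, "")
+    if raw.strip() == "":
+        return default
+    return int(raw)
+
+
+os.environ["HYDRAGNN_NUM_WORKERS"] = ""  # what --env KEY="${KEY:-}" produces
+empty_case = env_int("HYDRAGNN_NUM_WORKERS", default=0)
+
+os.environ["HYDRAGNN_NUM_WORKERS"] = "4"
+set_case = env_int("HYDRAGNN_NUM_WORKERS", default=0)
+
+print(empty_case, set_case)
+```
+
+A non-numeric value such as `"four"` still raises `ValueError`, which is the desired behaviour: only the "set but empty" case is folded into "unset".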
+
+### 14. Cluster-state checks before queueing build/probe jobs on Lux
+
+`sinfo -p lux,rad` is the cheapest way to know whether a build alloc will ever start. Lux and rad partitions get drained for slurm maintenance roughly every few weeks — the symptom is `salloc: job NNNN queued and waiting for resources` followed by `(Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)`. The drain reasons are visible via `sinfo -R`; treat any `slurm deploy maintenance`-class reason as "wait, don't queue". Cancel pending allocs with `scancel` when you discover the drain — they don't time out on their own.
+
+When the cluster is up but you only need a one-shot interactive build inside the SIF, **prefer a direct `srun … apptainer exec` over `salloc` + `apptainer exec`**:
+
+```bash
+srun -p lux -A vultr_lux -N 1 --time=00:15:00 --gpus-per-node=1 --cpus-per-task=8 \
+  --job-name=<descriptive> \
+  apptainer exec --rocm \
+    --bind /shared/aaji/tools:/shared/aaji/tools \
+    "$HG_SIF" \
+    bash -c '<your build commands>'
+```
+
+Why: `salloc` doesn't auto-bind `/shared/aaji/tools/` into the apptainer namespace (only the `omnistat-src` symlink, which is bind-mounted system-wide, makes it through). Direct `srun … apptainer exec --bind …` forces the bind explicitly and writes the build artifact to its persistent NFS path in one shot. Verified during the kernel-trace library build (job 7030).
+
+### 15. Dual-failure debug pattern — separate "tool wired correctly?" from "workload happy?"
+
+When a probe job fails, ask **two** questions before retrying with a bigger run:
+
+1. **Did the new tool/instrumentation initialize?** Look for its banner / startup log even on a crashed run — a healthy tool prints something before the workload imports anything heavy.
+2. **Did the workload reach the part the tool measures?** A crash at line 863 of `gfm_mlip_all_mpnn.py` (HydraGNN's `create_dataloaders`) is upstream of any GPU kernel launch, so any "0 dispatches captured" is consistent with both "tool broken" and "workload broken before kernels".
+
+**Concrete example from this session (kernel-trace probe job 7033):** all 8 ranks crashed in `int(os.environ["HYDRAGNN_NUM_WORKERS"])` (`ValueError: invalid literal for int() with base 10: ''`). The kernel-trace library *also* printed `[hostname][pid][omnistat] Trace summary: 0/0 processed records (0/0 successful flushes)` per rank — meaning the rocprofiler-sdk tool loaded, registered its callback, and only saw zero dispatches because the workload crashed first. Fixing the env-passthrough bug (§13) and re-running gave 22.5k records/rank — the kernel-trace wiring was correct all along.
+
+**Action:** when you see "0 records / 0 events / 0 metrics" after a failed probe, *don't* re-build the instrumentation. Read the workload error, fix that, and re-run before changing anything tool-side. This saved one full rebuild cycle here.
diff --git a/material_science/models/HydraGNN/examples/build_overlay_amd.sh b/material_science/models/HydraGNN/examples/build_overlay_amd.sh
index d005b34..a915c28 100755
--- a/material_science/models/HydraGNN/examples/build_overlay_amd.sh
+++ b/material_science/models/HydraGNN/examples/build_overlay_amd.sh
@@ -40,7 +40,7 @@ HG_BASE="${AI4S_SHARED_DIR:?AI4S_SHARED_DIR must be set}/models/HydraGNN"
 HG_SIF="${HG_SIF:-${AI4S_SHARED_DIR}/images/pytorch_rocm7.2.2_ubuntu24.04_py3.12_pytorch_release_2.10.0.sif}"
 OVERLAY="${HG_OVERLAY:-${HG_BASE}/overlays/hydragnn-overlay.img}"
 OVERLAY_SIZE_MB="${HG_OVERLAY_SIZE_MB:-4096}"
-HG_HYDRAGNN_SHA="${HG_HYDRAGNN_SHA:-6c45f1682783e66dc89e9e23009f61716186432b}"
+HG_HYDRAGNN_SHA="${HG_HYDRAGNN_SHA:-2fb0bd0157e3c85a74f9841887155095bd163303}"
 PYG_ROCM_RELEASE="${PYG_ROCM_RELEASE:-15}"
 PYG_PYTHON_TAG="${PYG_PYTHON_TAG:-py312}"
 
@@ -82,6 +82,27 @@ echo ""
 
 mkdir -p "$LOCAL_TMP" "$STAGE" "$PYG_WHEELS_DIR"
 
+# ---------------------------------------------------------------------------
+# Stage local patches (applied to the HydraGNN source inside the container)
+# ---------------------------------------------------------------------------
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+PATCHES_SRC="${SCRIPT_DIR}/patches"
+PATCHES_STAGE="${LOCAL_TMP}/patches"
+mkdir -p "$PATCHES_STAGE"
+if [[ -d "$PATCHES_SRC" ]]; then
+  shopt -s nullglob
+  for p in "$PATCHES_SRC"/*.patch; do
+    cp "$p" "$PATCHES_STAGE/"
+  done
+  shopt -u nullglob
+fi
+PATCH_COUNT=$(find "$PATCHES_STAGE" -maxdepth 1 -name '*.patch' | wc -l)
+echo "  Local patches to apply: $PATCH_COUNT"
+if [[ "$PATCH_COUNT" -gt 0 ]]; then
+  for p in "$PATCHES_STAGE"/*.patch; do echo "    - $(basename "$p")"; done
+fi
+echo ""
+
 # ---------------------------------------------------------------------------
 # Download and extract Looong01 PyG ROCm wheels
 # ---------------------------------------------------------------------------
@@ -153,15 +174,42 @@ cp "$ADIOS2_INST"/lib/libadios2*.so* "$STAGE/adios2/" 2>/dev/null || true
 echo "  adios2 installed (MPI + Python bindings)"
 cd /
 
-echo "--- Step 5: HydraGNN source (main branch) ---"
+echo "--- Step 5: HydraGNN source (pinned SHA: $BUILD_SHA) ---"
 if [[ ! -d /tmp/HydraGNN-src ]]; then
-    git clone -q --depth=1 https://github.com/ORNL/HydraGNN.git /tmp/HydraGNN-src
-    cd /tmp/HydraGNN-src && git fetch --depth=1 origin "$BUILD_SHA" && git checkout "$BUILD_SHA"
+    # Full clone (not --depth=1) so arbitrary SHAs can be checked out; some
+    # remotes refuse fetch-by-sha on shallow clones depending on
+    # uploadpack.allowReachableSHA1InWant config.
+    git clone -q https://github.com/ORNL/HydraGNN.git /tmp/HydraGNN-src
+    cd /tmp/HydraGNN-src && git checkout -q "$BUILD_SHA"
+    HEAD_SHA="$(git rev-parse HEAD)"
+    if [[ "$HEAD_SHA" != "$BUILD_SHA" ]]; then
+      echo "ERROR: checkout produced $HEAD_SHA, expected $BUILD_SHA" >&2
+      exit 1
+    fi
     cd /
 fi
+
+echo "--- Step 5b: apply local patches (BUILD_PATCHES=$BUILD_PATCHES) ---"
+if [[ -d "$BUILD_PATCHES" ]]; then
+    cd /tmp/HydraGNN-src
+    shopt -s nullglob
+    for p in "$BUILD_PATCHES"/*.patch; do
+        echo "  applying: $(basename "$p")"
+        # Dry-run first so we fail fast with a clear message if the patch
+        # was authored against a different SHA than BUILD_SHA.
+        if ! patch -p1 --dry-run --silent < "$p"; then
+            echo "ERROR: patch $(basename "$p") does not apply cleanly to SHA $BUILD_SHA" >&2
+            exit 1
+        fi
+        patch -p1 < "$p"
+    done
+    shopt -u nullglob
+    cd /
+fi
+
 # Copy the source tree directly rather than pip-installing (avoids partial package issues)
 cp -r /tmp/HydraGNN-src/hydragnn "$STAGE/hydragnn"
-echo "  HydraGNN source copied (SHA: $BUILD_SHA)"
+echo "  HydraGNN source copied (SHA: $BUILD_SHA + $(ls "$BUILD_PATCHES"/*.patch 2>/dev/null | wc -l) patches)"
 
 echo "--- Step 6: strip torch / nvidia / triton / cuda from staging dir ---"
 for pkg in torch torchvision torchaudio torchgen functorch nvidia triton cuda sympy; do
@@ -202,7 +250,14 @@ print('adios2        :', adios2.__version__, '(MPI: OK)')
 
 python3 -c "import vesin; print('vesin         : ok')"
 python3 -c "import e3nn; print('e3nn          :', e3nn.__version__)"
-python3 -c "import hydragnn; print('hydragnn      : ok')"
+python3 -c "
+import hydragnn, os, inspect
+print('hydragnn      : imported from', os.path.dirname(hydragnn.__file__))
+# Confirm the persistent_workers patch is present iff a patch file was applied.
+src = inspect.getsource(__import__('hydragnn.preprocess.load_data', fromlist=['_']))
+patched = 'HYDRAGNN_PERSISTENT_WORKERS' in src
+print('hydragnn      : persistent_workers patch =', patched)
+"
 
 echo "=== Overlay install complete ==="
 INNEREOF
@@ -224,6 +279,7 @@ apptainer exec \
     --env BUILD_STAGE="$STAGE" \
     --env BUILD_PYG_WHEELS="$PYG_WHEELS_DIR" \
     --env BUILD_SHA="$HG_HYDRAGNN_SHA" \
+    --env BUILD_PATCHES="$PATCHES_STAGE" \
     "$HG_SIF" \
     bash "${LOCAL_TMP}/overlay_install.sh"
 
diff --git a/material_science/models/HydraGNN/examples/patches/README.md b/material_science/models/HydraGNN/examples/patches/README.md
new file mode 100644
index 0000000..91aa844
--- /dev/null
+++ b/material_science/models/HydraGNN/examples/patches/README.md
@@ -0,0 +1,40 @@
+# HydraGNN local patches
+
+Small, opt-in patches applied during overlay build (see
+`examples/build_overlay_amd.sh`). Each patch is unified-format and applies
+cleanly with `patch -p1` against the pinned `HG_HYDRAGNN_SHA` in the build
+script.
+
+## load_data.persistent_workers.patch
+
+**Target SHA:** `2fb0bd0157e3c85a74f9841887155095bd163303` (upstream `main`, 2026-05-20)
+
+**Lines changed:** +4 / -0 in `hydragnn/preprocess/load_data.py`
+
+**What it does**
+
+Upstream already exposes `num_workers` as an env-var knob
+(`HYDRAGNN_NUM_WORKERS`) but hard-codes `persistent_workers = False`. This
+patch adds a companion opt-in env var:
+
+- `HYDRAGNN_PERSISTENT_WORKERS=1` and `num_workers > 0`
+  → `persistent_workers=True` is passed to all DataLoaders
+- Otherwise the default upstream behaviour (`persistent_workers=False`) is
+  unchanged.
+
+**Why**
+
+When `num_workers > 0` and `persistent_workers=False`, PyTorch forks the
+worker processes at the start of every epoch and re-imports the dataset
+module + re-mmaps ADIOS2 BP files. This shows up in our profiler traces as a
+bimodal iteration-time distribution — most iterations at the steady-state
+s/it, plus a long tail clustered around epoch boundaries and after
+validation hops. Setting `persistent_workers=True` keeps the workers alive
+across epochs and is PyTorch's documented cure for that pattern.
+
+**Status**
+
+Ad-hoc local patch for the `ai4science-studio` perf-analysis recipe. Once
+its impact is measured, the right follow-up is a small upstream PR to
+`ORNL/HydraGNN` adding the `HYDRAGNN_PERSISTENT_WORKERS` env knob with the
+same opt-in semantics.
diff --git a/material_science/models/HydraGNN/examples/patches/load_data.persistent_workers.patch b/material_science/models/HydraGNN/examples/patches/load_data.persistent_workers.patch
new file mode 100644
index 0000000..21be2bd
--- /dev/null
+++ b/material_science/models/HydraGNN/examples/patches/load_data.persistent_workers.patch
@@ -0,0 +1,15 @@
+diff --git a/hydragnn/preprocess/load_data.py b/hydragnn/preprocess/load_data.py
+index f595e2b..cabff01 100644
+--- a/hydragnn/preprocess/load_data.py
++++ b/hydragnn/preprocess/load_data.py
+@@ -286,6 +286,10 @@ def create_dataloaders(
+         num_workers = 0
+         if os.getenv("HYDRAGNN_NUM_WORKERS") is not None:
+             num_workers = int(os.environ["HYDRAGNN_NUM_WORKERS"])
++        # ai4science-studio perf-analysis patch: opt-in persistent workers.
++        # Default behaviour unchanged (False) unless HYDRAGNN_PERSISTENT_WORKERS=1.
++        if os.getenv("HYDRAGNN_PERSISTENT_WORKERS") is not None and num_workers > 0:
++            persistent_workers = bool(int(os.environ["HYDRAGNN_PERSISTENT_WORKERS"]))
+ 
+         use_custom_dataloader = 0
+         if os.getenv("HYDRAGNN_CUSTOM_DATALOADER") is not None:
diff --git a/material_science/models/HydraGNN/examples/sbatch_train_amd.sh b/material_science/models/HydraGNN/examples/sbatch_train_amd.sh
index 20dfe2c..b42b200 100755
--- a/material_science/models/HydraGNN/examples/sbatch_train_amd.sh
+++ b/material_science/models/HydraGNN/examples/sbatch_train_amd.sh
@@ -101,7 +101,7 @@ done
 # ---------------------------------------------------------------------------
 # Clone HydraGNN source if not present
 # ---------------------------------------------------------------------------
-HG_HYDRAGNN_SHA="${HG_HYDRAGNN_SHA:-6c45f1682783e66dc89e9e23009f61716186432b}"
+HG_HYDRAGNN_SHA="${HG_HYDRAGNN_SHA:-2fb0bd0157e3c85a74f9841887155095bd163303}"
 if [[ ! -d "${HG_REPO_DIR}/examples/multidataset_hpo_sc26" ]]; then
   echo "--- Cloning HydraGNN (pinned SHA: ${HG_HYDRAGNN_SHA}) ---"
   git clone https://github.com/ORNL/HydraGNN.git "$HG_REPO_DIR"
@@ -311,6 +311,18 @@ fi
 echo "--- Launching training: $TOTAL_RANKS ranks across $NODES nodes ---"
 echo ""
 
+# Optional env passthrough — HydraGNN's training code uses
+# `os.getenv(KEY) is not None` to detect these knobs, so an empty string
+# passes the check and then `int("")` blows up. Build a separate array of
+# `--env` flags that are only added when the var is set non-empty.
+OPT_ENVS=()
+for k in HYDRAGNN_MAX_NUM_BATCH HYDRAGNN_NUM_WORKERS HYDRAGNN_PERSISTENT_WORKERS; do
+  v="${!k:-}"
+  if [[ -n "$v" ]]; then
+    OPT_ENVS+=( --env "${k}=${v}" )
+  fi
+done
+
 srun --mpi=pmix \
     apptainer exec \
     --rocm \
@@ -328,11 +340,11 @@ srun --mpi=pmix \
     --env HG_PRECISION="$HG_PRECISION" \
     --env HG_BATCH_SIZE="${HG_BATCH_SIZE:-200}" \
     --env HG_NUM_EPOCH="${HG_NUM_EPOCH:-1}" \
-    --env HYDRAGNN_MAX_NUM_BATCH="${HYDRAGNN_MAX_NUM_BATCH:-}" \
     --env HYDRAGNN_VALTEST="${HYDRAGNN_VALTEST:-0}" \
     --env HYDRAGNN_TRACE_LEVEL="${HYDRAGNN_TRACE_LEVEL:-1}" \
     --env HG_EXAMPLE_DIR="$EXAMPLE_DIR" \
     --env HG_OUTPUT_DIR="$HG_OUTPUT_DIR" \
+    "${OPT_ENVS[@]}" \
     "${RCCL_MULTINODE_ENVS[@]}" \
     "$HG_SIF" \
     bash "$RANK_SCRIPT"
diff --git a/material_science/models/HydraGNN/examples/sbatch_train_perf_amd.sh b/material_science/models/HydraGNN/examples/sbatch_train_perf_amd.sh
new file mode 100755
index 0000000..effef88
--- /dev/null
+++ b/material_science/models/HydraGNN/examples/sbatch_train_perf_amd.sh
@@ -0,0 +1,505 @@
+#!/usr/bin/env bash
+# HydraGNN 2-node training with PyTorch profiling + Omnistat user-mode telemetry.
+#
+# This is a thin variant of sbatch_train_amd.sh tailored for the perf-analysis
+# recipe at material_science/models/HydraGNN/recipes/perf-analysis/.
+#
+# Differences from sbatch_train_amd.sh:
+#   1. Wraps srun with omnistat-usermode --start/--stopexporters/--stopserver.
+#   2. Generates a per-job copy of gfm_mlip.json with a "Profile" block injected
+#      so HydraGNN's built-in torch.profiler captures rank-0 of node-0 only.
+#   3. Hard-codes the lux partition / vultr_lux account so this script works
+#      out-of-the-box on Lux. Override via SBATCH_PARTITION/SBATCH_ACCOUNT if
+#      submitting to a different cluster.
+#   4. Defaults are tuned for a quick perf run: --nodes=2, NUM_EPOCH=2,
+#      MAX_NUM_BATCH=30, time=00:30:00 — enough for the wait=5/warmup=3/active=3
+#      profiler schedule plus a clean epoch 0 for warm caches.
+#
+# Quick start:
+#   export AI4S_SHARED_DIR=/shared/aaji
+#   sbatch material_science/models/HydraGNN/examples/sbatch_train_perf_amd.sh
+#
+# Required:
+#   AI4S_SHARED_DIR — shared base path
+#   /shared/aaji/tools/omnistat-pr271/    (created by the launcher subagent)
+#   /shared/aaji/tools/victoriametrics/   (created by the launcher subagent)
+#
+# Optional env-var overrides (with defaults):
+#   HG_NUM_EPOCH=2
+#   HYDRAGNN_MAX_NUM_BATCH=30
+#   HG_PRECISION=fp64
+#   HG_BATCH_SIZE=200
+#   PROFILE_TARGET_EPOCH=1
+#   OMNISTAT_USERMODE_INTERVAL=1   # seconds
+
+#SBATCH --job-name=hydragnn-perf
+#SBATCH --partition=lux
+#SBATCH --account=vultr_lux
+#SBATCH --nodes=2
+#SBATCH --ntasks-per-node=8
+#SBATCH --gpus-per-node=8
+#SBATCH --cpus-per-task=16
+#SBATCH --time=00:30:00
+#SBATCH --output=hydragnn-train-%j.out
+#SBATCH --error=hydragnn-train-%j.out
+
+set -euo pipefail
+
+# ---------------------------------------------------------------------------
+# SCRIPT_DIR — works both at submit time and inside the SLURM spool
+# ---------------------------------------------------------------------------
+if [[ -n "${SLURM_JOB_ID:-}" ]]; then
+  _ORIG_CMD=$(scontrol show job "$SLURM_JOB_ID" | sed -n 's/.*Command=\(\S\+\).*/\1/p')
+  SCRIPT_DIR=$(cd "$(dirname "$_ORIG_CMD")" && pwd)
+else
+  SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
+fi
+
+# ---------------------------------------------------------------------------
+# Paths and configuration
+# ---------------------------------------------------------------------------
+HG_BASE="${AI4S_SHARED_DIR:?AI4S_SHARED_DIR must be set}/models/HydraGNN"
+HG_SIF="${HG_SIF:-${AI4S_SHARED_DIR}/images/pytorch_rocm7.2.2_ubuntu24.04_py3.12_pytorch_release_2.10.0.sif}"
+HG_OVERLAY="${HG_OVERLAY:-${HG_BASE}/overlays/hydragnn-overlay.img}"
+HG_DATASETS="${HG_DATASETS:-ANI1x,Alexandria}"
+HG_DATA_DIR="${HG_DATA_DIR:-${HG_BASE}/weights}"
+HG_BATCH_SIZE="${HG_BATCH_SIZE:-200}"
+HG_NUM_EPOCH="${HG_NUM_EPOCH:-2}"
+HG_PRECISION="${HG_PRECISION:-fp64}"
+HG_OUTPUT_DIR="${HG_OUTPUT_DIR:-${HG_BASE}/perf-runs/${SLURM_JOB_ID:-$$}}"
+HG_REPO_DIR="${HG_REPO_DIR:-${HG_BASE}/code/HydraGNN}"
+HYDRAGNN_MAX_NUM_BATCH="${HYDRAGNN_MAX_NUM_BATCH:-30}"
+PROFILE_TARGET_EPOCH="${PROFILE_TARGET_EPOCH:-1}"
+
+OMNISTAT_VENV="${OMNISTAT_VENV:-/shared/aaji/tools/omnistat-pr271}"
+# Read the omnistat config from the in-repo recipe (single source of truth).
+# Override with OMNISTAT_TEMPLATE=/path/to/file if you need a custom probe config.
+# A staged copy under ${HG_BASE}/perf-runs/ is intentionally NOT used as the
+# default — it has drifted in the past from the in-repo template, silently
+# disabling rocprofiler counters. See ai4science-studio SKILL.md for context.
+OMNISTAT_TEMPLATE="${OMNISTAT_TEMPLATE:-${SCRIPT_DIR}/../recipes/perf-analysis/omnistat-lux.config.template}"
+OMNISTAT_USERMODE_INTERVAL="${OMNISTAT_USERMODE_INTERVAL:-1}"
+
+# Kernel-dispatch tracing (per-kernel name + duration histogram from every
+# rank). OFF by default — opt in with OMNISTAT_KERNEL_TRACE=1. When enabled
+# we auto-flip enable_kernel_trace=True in the rendered config and load
+# libomnistat_trace.so via ROCP_TOOL_LIBRARIES inside the container.
+# See material_science/models/HydraGNN/recipes/perf-analysis/agents/launcher.md
+# for how to build libomnistat_trace.so on a compute node.
+OMNISTAT_KERNEL_TRACE="${OMNISTAT_KERNEL_TRACE:-0}"
+OMNISTAT_TRACE_LIB="${OMNISTAT_TRACE_LIB:-/shared/aaji/tools/omnistat-src/build-trace/libomnistat_trace.so}"
+# IMPORTANT: the kernel-trace collector registers /kernel_trace on the SAME
+# Flask app as the other omnistat endpoints, so the tool library must POST to
+# the [omnistat.collectors] `port` (8101 here), NOT the library's own default
+# of 8001. Mismatch = silent zero kernel metrics. Auto-derive from the
+# template so we can never drift.
+_OMNI_PORT=$(awk -F'[= \t]+' '/^[[:space:]]*port[[:space:]]*=/ {print $2; exit}' "$OMNISTAT_TEMPLATE")
+OMNISTAT_TRACE_ENDPOINT_PORT="${OMNISTAT_TRACE_ENDPOINT_PORT:-${_OMNI_PORT:-8101}}"
+
+NODES="${SLURM_JOB_NUM_NODES:-2}"
+GPUS_PER_NODE=8
+TOTAL_RANKS=$((NODES * GPUS_PER_NODE))
+
+# ---------------------------------------------------------------------------
+# Validate required files
+# ---------------------------------------------------------------------------
+for var in HG_SIF HG_OVERLAY OMNISTAT_TEMPLATE; do
+  if [[ ! -e "${!var}" ]]; then
+    echo "ERROR: $var not found: ${!var}" >&2
+    exit 2
+  fi
+done
+
+# Surface the rocprofiler state of the chosen template so a silent
+# "counters off" never happens again. Computed here, printed in the banner.
+_ROCPROF_STATE=$(awk -F'[= \t]+' '/^enable_rocprofiler/ {print $2; exit}' "$OMNISTAT_TEMPLATE")
+_ROCPROF_PROFILE=$(awk -F'[= \t]+' '/^profile/ {print $2; exit}' "$OMNISTAT_TEMPLATE")
+
+if [[ ! -x "${OMNISTAT_VENV}/bin/omnistat-usermode" ]]; then
+  echo "ERROR: omnistat-usermode not found at ${OMNISTAT_VENV}/bin/omnistat-usermode" >&2
+  echo "       Run the launcher subagent first to install (see recipes/perf-analysis/agents/launcher.md)" >&2
+  exit 2
+fi
+
+IFS=',' read -ra DATASET_ARRAY <<< "$HG_DATASETS"
+for ds in "${DATASET_ARRAY[@]}"; do
+  ds_path="${HG_DATA_DIR}/${ds}-v2.bp"
+  if [[ ! -d "$ds_path" ]]; then
+    echo "ERROR: Dataset not found: $ds_path" >&2
+    exit 2
+  fi
+done
+
+# ---------------------------------------------------------------------------
+# Clone HydraGNN source if not present (mirrors sbatch_train_amd.sh)
+# ---------------------------------------------------------------------------
+HG_HYDRAGNN_SHA="${HG_HYDRAGNN_SHA:-2fb0bd0157e3c85a74f9841887155095bd163303}"
+if [[ ! -d "${HG_REPO_DIR}/examples/multidataset_hpo_sc26" ]]; then
+  echo "--- Cloning HydraGNN (pinned SHA: ${HG_HYDRAGNN_SHA}) ---"
+  git clone https://github.com/ORNL/HydraGNN.git "$HG_REPO_DIR"
+  git -C "$HG_REPO_DIR" checkout "$HG_HYDRAGNN_SHA"
+fi
+
+# ---------------------------------------------------------------------------
+# Symlink datasets into the expected location
+# ---------------------------------------------------------------------------
+EXAMPLE_DIR="${HG_REPO_DIR}/examples/multidataset_hpo_sc26"
+DATASET_DIR="${EXAMPLE_DIR}/dataset"
+mkdir -p "$DATASET_DIR"
+
+for ds in "${DATASET_ARRAY[@]}"; do
+  target="${DATASET_DIR}/${ds}-v2.bp"
+  source="${HG_DATA_DIR}/${ds}-v2.bp"
+  if [[ ! -e "$target" ]]; then
+    ln -sfn "$source" "$target"
+  fi
+done
+
+# ---------------------------------------------------------------------------
+# Render the per-job omnistat config from the template (substitute @JOB_DIR@)
+# This sidesteps configparser's lack of os.environ interpolation.
+# ---------------------------------------------------------------------------
+mkdir -p "$HG_OUTPUT_DIR"
+OMNISTAT_CONFIG="${HG_OUTPUT_DIR}/omnistat.config"
+sed -e "s|@JOB_DIR@|${HG_OUTPUT_DIR}|g" "$OMNISTAT_TEMPLATE" > "$OMNISTAT_CONFIG"
+
+# Opt-in kernel tracing: flip enable_kernel_trace=True in the rendered config
+# and verify the rocprofiler-sdk tool library exists. Bail out loudly if
+# the user asked for it but the .so isn't present — silent fall-through is
+# exactly the "rocprofiler counters disabled" failure mode we already burned
+# a day on (see SKILL.md §11).
+if [[ "$OMNISTAT_KERNEL_TRACE" == "1" ]]; then
+  if [[ ! -f "$OMNISTAT_TRACE_LIB" ]]; then
+    echo "ERROR: OMNISTAT_KERNEL_TRACE=1 but tool library not found:" >&2
+    echo "       $OMNISTAT_TRACE_LIB" >&2
+    echo "       Build it on a compute node:" >&2
+    echo "         salloc -p lux -A vultr_lux -N 1 --time=00:15:00 --gpus-per-node=1 \\" >&2
+    echo "           apptainer exec --rocm \"\$HG_SIF\" bash -c '" >&2
+    echo "             cd /shared/aaji/tools/omnistat-src && \\" >&2
+    echo "             cmake -S rocprofiler-sdk/ -B build-trace/ -DBUILD_KERNEL_TRACE_LIB=ON && \\" >&2
+    echo "             cmake --build build-trace/ -j 8'" >&2
+    exit 2
+  fi
+  sed -i 's/^enable_kernel_trace = False/enable_kernel_trace = True/' "$OMNISTAT_CONFIG"
+fi
+_KTRACE_STATE=$(awk -F'[= \t]+' '/^enable_kernel_trace/ {print $2; exit}' "$OMNISTAT_CONFIG")
+
+# ---------------------------------------------------------------------------
+# Generate the per-job profile config (inject "Profile" block)
+# Done on the submit/host side (login or compute), uses python3 if available.
+# ---------------------------------------------------------------------------
+HG_CONFIG_OVERRIDE="${HG_OUTPUT_DIR}/gfm_mlip_profile.json"
+SRC_CONFIG="${EXAMPLE_DIR}/gfm_mlip.json"
+
+python3 - "$SRC_CONFIG" "$HG_CONFIG_OVERRIDE" "$PROFILE_TARGET_EPOCH" <<'PYEOF'
+import json, sys
+src, dst, epoch = sys.argv[1], sys.argv[2], int(sys.argv[3])
+with open(src) as f:
+    cfg = json.load(f)
+# HydraGNN's train_validate_test() is called with config["NeuralNetwork"];
+# its Profiler reads config["Profile"] from THAT scope, so the block must
+# go under NeuralNetwork, not at the top level.
+cfg.setdefault("NeuralNetwork", {})["Profile"] = {"enable": 1, "target_epoch": epoch}
+with open(dst, "w") as f:
+    json.dump(cfg, f, indent=2)
+print(f"Wrote {dst} with NeuralNetwork.Profile.enable=1, target_epoch={epoch}")
+PYEOF
+
+# ---------------------------------------------------------------------------
+# Print job configuration
+# ---------------------------------------------------------------------------
+echo "=== HydraGNN Perf Analysis Run ==="
+echo "  Nodes        : $NODES"
+echo "  GPUs/node    : $GPUS_PER_NODE"
+echo "  Total ranks  : $TOTAL_RANKS"
+echo "  SIF          : $HG_SIF"
+echo "  Overlay      : $HG_OVERLAY"
+echo "  Datasets     : $HG_DATASETS"
+echo "  Precision    : $HG_PRECISION"
+echo "  Batch size   : $HG_BATCH_SIZE"
+echo "  Num epochs   : $HG_NUM_EPOCH"
+echo "  Max batches  : $HYDRAGNN_MAX_NUM_BATCH"
+echo "  Profile epoch: $PROFILE_TARGET_EPOCH (rank-0 only)"
+echo "  Output dir   : $HG_OUTPUT_DIR"
+echo "  Repo dir     : $HG_REPO_DIR"
+echo "  Profile cfg  : $HG_CONFIG_OVERRIDE"
+echo "  Omnistat venv: $OMNISTAT_VENV"
+echo "  Omnistat tmpl: $OMNISTAT_TEMPLATE"
+echo "  Omnistat cfg : $OMNISTAT_CONFIG (rendered)"
+echo "  Rocprofiler  : enable_rocprofiler=${_ROCPROF_STATE:-?} profile=${_ROCPROF_PROFILE:-?}"
+if [[ "$_ROCPROF_STATE" != "True" ]]; then
+  echo "  WARN         : rocprofiler counters DISABLED — HBM/FLOP counters will NOT be collected" >&2
+fi
+echo "  KernelTrace  : enable_kernel_trace=${_KTRACE_STATE:-?} (opt-in via OMNISTAT_KERNEL_TRACE=1)"
+if [[ "$OMNISTAT_KERNEL_TRACE" == "1" ]]; then
+  echo "  TraceLib     : $OMNISTAT_TRACE_LIB"
+  echo "  TracePort    : $OMNISTAT_TRACE_ENDPOINT_PORT"
+fi
+echo "  Node(s)      : ${SLURM_NODELIST:-$(hostname)}"
+echo "  Date         : $(date)"
+echo ""
+
+# ---------------------------------------------------------------------------
+# Start Omnistat user-mode (datadir from omnistat-lux.config: under HG_OUTPUT_DIR)
+# ---------------------------------------------------------------------------
+echo "--- Starting Omnistat user-mode (interval=${OMNISTAT_USERMODE_INTERVAL}s) ---"
+export PATH="${OMNISTAT_VENV}/bin:${PATH}"
+"${OMNISTAT_VENV}/bin/omnistat-usermode" --configfile "$OMNISTAT_CONFIG" --start --interval "$OMNISTAT_USERMODE_INTERVAL" \
+    2>&1 | tee "${HG_OUTPUT_DIR}/omnistat_start.log" || {
+  echo "WARN: omnistat-usermode --start returned nonzero; continuing without telemetry" >&2
+}
+
+# Make sure we tear down even if the training fails
+cleanup_omnistat() {
+  echo "--- Stopping Omnistat user-mode ---"
+  "${OMNISTAT_VENV}/bin/omnistat-usermode" --configfile "$OMNISTAT_CONFIG" --stopexporters || true
+  "${OMNISTAT_VENV}/bin/omnistat-usermode" --configfile "$OMNISTAT_CONFIG" --stopserver || true
+}
+trap cleanup_omnistat EXIT
+
+# ---------------------------------------------------------------------------
+# Write rank script (runs inside the container on every rank)
+# Mirrors sbatch_train_amd.sh but uses HG_CONFIG_OVERRIDE and gates the
+# profiler to rank 0 only via PROFILE_RANK0_ONLY.
+# ---------------------------------------------------------------------------
+RANK_SCRIPT="${HG_OUTPUT_DIR}/hydragnn-rank-${SLURM_JOB_ID:-$$}.sh"
+cat > "$RANK_SCRIPT" << 'RANKEOF'
+#!/usr/bin/env bash
+set -euo pipefail
+source /opt/venv/bin/activate
+
+# The overlay's /opt/hydragnn-pkgs is rebuilt from HG_HYDRAGNN_SHA + patches/
+# (see build_overlay_amd.sh), so we import directly from there. Keep the
+# bind-mounted clone available for `examples/multidataset_hpo_sc26/*.py`.
+export PYTHONPATH="/opt/hydragnn-pkgs:${PYTHONPATH:-}"
+export LD_LIBRARY_PATH="/opt/hydragnn-pkgs/adios2:/opt/ompi/lib:${LD_LIBRARY_PATH:-}"
+
+SCRATCH_RANK="${SCRATCH_LOCAL:-/scratch}/${USER:?}/hydragnn-${SLURM_JOB_ID:-$$}/${SLURM_PROCID:-0}"
+mkdir -p "$SCRATCH_RANK"
+export TMPDIR="$SCRATCH_RANK"
+
+export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-16}"
+export MIOPEN_DISABLE_CACHE=1
+export MIOPEN_USER_DB_PATH="${SCRATCH_RANK}/miopen"
+mkdir -p "$MIOPEN_USER_DB_PATH"
+export PYTHONNOUSERSITE=1
+
+export HYDRAGNN_USE_VARIABLE_GRAPH_SIZE=1
+export HYDRAGNN_AGGR_BACKEND=mpi
+export HYDRAGNN_VALTEST="${HYDRAGNN_VALTEST:-0}"
+export HYDRAGNN_USE_FSDP=0
+export HYDRAGNN_TRACE_LEVEL="${HYDRAGNN_TRACE_LEVEL:-1}"
+
+export HSA_NO_SCRATCH_RECLAIM=1
+
+cd "$HG_OUTPUT_DIR"
+# NOTE: do NOT `import hydragnn` from a separate rank-0-only python here.
+# That import triggers `from mpi4py import MPI` which calls MPI_Init() in
+# rank 0 only and leaves ranks 1..N-1 hanging in their later MPI_Init()
+# because the PMIX rendezvous requires all ranks concurrently. We log the
+# imported path inside the main python's first stanza instead.
+
+export HYDRAGNN_AVG_NUM_NEIGHBORS="${HYDRAGNN_AVG_NUM_NEIGHBORS:-13.735293601560318}"
+
+# Profiler is enabled for the whole world (the JSON config already has
+# Profile.enable=1) but we want trace files from RANK 0 only so we don't
+# stomp on each other in $HG_OUTPUT_DIR/logs/. When PROFILE_RANK0_ONLY=1,
+# every non-zero rank writes its own copy of the JSON with Profile.enable=0
+# to $TMPDIR and passes that copy to the training script instead.
+exec python -u -c "
+import sys, os, runpy
+
+# Rank 0 logs which HydraGNN copy is being imported and whether the patch is present.
+# Done inside the main python (after MPI ranks align); see the comment
+# above about why this cannot live in a separate rank-0-only python.
+if os.environ.get('SLURM_PROCID', '0') == '0':
+    import inspect, hydragnn
+    pkg_dir = os.path.dirname(hydragnn.__file__)
+    src = inspect.getsource(__import__('hydragnn.preprocess.load_data', fromlist=['_']))
+    print('[hydragnn] imported from:', pkg_dir, flush=True)
+    print('[hydragnn] persistent_workers patch present:', 'HYDRAGNN_PERSISTENT_WORKERS' in src, flush=True)
+
+# Rank 0 keeps the Profile block; others zero it out so HydraGNN's Profiler
+# falls through to the null context. Block lives under NeuralNetwork because
+# that is the dict train_validate_test() receives.
+if os.environ.get('PROFILE_RANK0_ONLY', '0') == '1' and os.environ.get('SLURM_PROCID', '0') != '0':
+    import json
+    cfg_path = os.environ['HG_CONFIG_OVERRIDE']
+    with open(cfg_path) as f:
+        cfg = json.load(f)
+    cfg.setdefault('NeuralNetwork', {}).setdefault('Profile', {})['enable'] = 0
+    # Write to a per-rank scratch path to avoid races with rank 0
+    rank_cfg = os.path.join(os.environ.get('TMPDIR', '/tmp'), 'gfm_mlip_profile_rank.json')
+    with open(rank_cfg, 'w') as f:
+        json.dump(cfg, f)
+    cfg_arg = rank_cfg
+else:
+    cfg_arg = os.environ['HG_CONFIG_OVERRIDE']
+
+avg_nn = float(os.environ.get('HYDRAGNN_AVG_NUM_NEIGHBORS', '0'))
+if avg_nn > 0:
+    import hydragnn.utils.datasets.adiosdataset as adm
+    _orig_init = adm.AdiosMultiDataset.__init__
+    def _patched_init(self, *a, **kw):
+        _orig_init(self, *a, **kw)
+        self.avg_num_neighbors = avg_nn
+    adm.AdiosMultiDataset.__init__ = _patched_init
+
+batch_size = int(os.environ.get('HG_BATCH_SIZE') or 200)
+max_num_batch = int(os.environ.get('HYDRAGNN_MAX_NUM_BATCH', '0'))
+num_samples_val = max_num_batch * batch_size if max_num_batch > 0 else 0
+
+sys.argv = [
+    os.environ['HG_EXAMPLE_DIR'] + '/gfm_mlip_all_mpnn.py',
+    '--inputfile=' + cfg_arg,
+    '--multi',
+    '--everyone',
+    '--multi_model_list=' + os.environ['HG_DATASETS'],
+    '--precision=' + os.environ['HG_PRECISION'],
+    '--batch_size=' + str(batch_size),
+    '--num_epoch=' + os.environ.get('HG_NUM_EPOCH', '1'),
+    '--startfrom=none',
+    '--log=hydragnn-train-' + os.environ.get('SLURM_JOB_ID','0') + '-N' + os.environ.get('SLURM_JOB_NUM_NODES','1'),
+]
+
+if num_samples_val > 0:
+    sys.argv += [
+        '--num_samples=' + str(num_samples_val),
+        '--oversampling',
+        '--oversampling_num_samples=' + str(num_samples_val),
+    ]
+
+runpy.run_path(sys.argv[0], run_name='__main__')
+"
+RANKEOF
+chmod +x "$RANK_SCRIPT"
+
+# ---------------------------------------------------------------------------
+# Multi-node network environment (read from .cluster-config.yaml)
+# Identical to sbatch_train_amd.sh — kept verbatim to avoid drift.
+# ---------------------------------------------------------------------------
+RCCL_MULTINODE_ENVS=()
+MPI_MULTINODE_ENVS=()
+if [[ "$NODES" -gt 1 ]]; then
+  _CLUSTER_CFG="${SCRIPT_DIR}/../../../../.cluster-config.yaml"
+  if [[ -f "$_CLUSTER_CFG" ]]; then
+    _yaml_get() { grep "^  $1:" "$_CLUSTER_CFG" 2>/dev/null | sed 's/.*: *"\?\([^"]*\)"\?.*/\1/' | grep -v '^$'; }
+    : "${NCCL_IB_HCA:=$(_yaml_get ib_hca)}"
+    : "${NCCL_SOCKET_IFNAME:=$(_yaml_get mgmt_iface)}"
+    : "${RCCL_ANP_PLUGIN:=$(_yaml_get rccl_anp_plugin)}"
+    : "${LIBIONIC_PATH:=$(_yaml_get libionic_path)}"
+  fi
+
+  : "${NCCL_IB_HCA:?Set NCCL_IB_HCA or configure network.ib_hca in .cluster-config.yaml}"
+  : "${NCCL_SOCKET_IFNAME:?Set NCCL_SOCKET_IFNAME or configure network.mgmt_iface in .cluster-config.yaml}"
+  : "${RCCL_ANP_PLUGIN:?Set RCCL_ANP_PLUGIN or configure network.rccl_anp_plugin in .cluster-config.yaml}"
+  : "${LIBIONIC_PATH:?Set LIBIONIC_PATH or configure network.libionic_path in .cluster-config.yaml}"
+
+  MPI_MULTINODE_ENVS=(
+    --env OMPI_MCA_pml=ob1
+    --env OMPI_MCA_btl=tcp,self
+    --env OMPI_MCA_btl_tcp_if_include="$NCCL_SOCKET_IFNAME"
+    --env MPI4PY_RC_THREADS=false
+  )
+
+  RCCL_MULTINODE_ENVS=(
+    --bind "${RCCL_ANP_PLUGIN}:${RCCL_ANP_PLUGIN}:ro"
+    --bind "${LIBIONIC_PATH}:${LIBIONIC_PATH}:ro"
+    --env NCCL_NET_PLUGIN="$RCCL_ANP_PLUGIN"
+    --env NCCL_IB_HCA="$NCCL_IB_HCA"
+    --env NCCL_IB_GID_INDEX=1
+    --env NCCL_GDR_FLUSH_DISABLE=1
+    --env RCCL_GDR_FLUSH_GPU_MEM_NO_RELAXED_ORDERING=0
+    --env NCCL_GDRCOPY_ENABLE=0
+    --env NCCL_IB_QPS_PER_CONNECTION=1
+    --env HSA_NO_SCRATCH_RECLAIM=1
+    --env NCCL_IB_TC=96
+    --env NCCL_IB_FIFO_TC=192
+    --env NCCL_IGNORE_CPU_AFFINITY=1
+    --env NCCL_PXN_DISABLE=0
+    --env NET_OPTIONAL_RECV_COMPLETION=1
+    --env NCCL_IB_USE_INLINE=1
+    --env NCCL_SOCKET_IFNAME="$NCCL_SOCKET_IFNAME"
+    --env RCCL_LL128_FORCE_ENABLE=1
+    --env NCCL_IB_PCI_RELAXED_ORDERING=1
+    --env NCCL_DMABUF_ENABLE=1
+    --env NCCL_DEBUG="${NCCL_DEBUG:-WARN}"
+  )
+fi
+
+# ---------------------------------------------------------------------------
+# Optional: kernel-trace tool library bind + env (loaded by rocprofiler-sdk
+# when ROCP_TOOL_LIBRARIES points at the .so). The library POSTs dispatch
+# records to the omnistat collector on localhost:OMNISTAT_TRACE_ENDPOINT_PORT.
+# ---------------------------------------------------------------------------
+KTRACE_BIND_ENVS=()
+if [[ "$OMNISTAT_KERNEL_TRACE" == "1" ]]; then
+  KTRACE_BIND_ENVS=(
+    --bind "${OMNISTAT_TRACE_LIB}:${OMNISTAT_TRACE_LIB}:ro"
+    --env ROCP_TOOL_LIBRARIES="${OMNISTAT_TRACE_LIB}"
+    --env OMNISTAT_TRACE_ENDPOINT_PORT="${OMNISTAT_TRACE_ENDPOINT_PORT}"
+    --env OMNISTAT_TRACE_LOG="${OMNISTAT_TRACE_LOG:-1}"
+  )
+fi
+
+# ---------------------------------------------------------------------------
+# Optional env passthrough — only forward HYDRAGNN_NUM_WORKERS /
+# HYDRAGNN_PERSISTENT_WORKERS / HYDRAGNN_MAX_NUM_BATCH when set NON-EMPTY.
+# HydraGNN's load_data.py uses `os.getenv(KEY) is not None` to detect the
+# var, so an empty string passes the check and then `int("")` blows up.
+# Building a separate array keeps `apptainer exec` from receiving stale
+# `--env KEY=` flags. (See SKILL §13; observed on probe job 7033.)
+# ---------------------------------------------------------------------------
+OPT_ENVS=()
+for k in HYDRAGNN_NUM_WORKERS HYDRAGNN_PERSISTENT_WORKERS HYDRAGNN_MAX_NUM_BATCH; do
+  v="${!k:-}"
+  if [[ -n "$v" ]]; then
+    OPT_ENVS+=( --env "${k}=${v}" )
+  fi
+done
+
+# ---------------------------------------------------------------------------
+# Launch distributed training via srun + apptainer
+# ---------------------------------------------------------------------------
+echo "--- Launching training: $TOTAL_RANKS ranks across $NODES nodes ---"
+echo ""
+
+srun --mpi=pmix \
+    apptainer exec \
+    --rocm \
+    --overlay "${HG_OVERLAY}:ro" \
+    --bind "/opt/ompi:/opt/ompi:ro" \
+    --bind "${SCRATCH_LOCAL:-/scratch}:${SCRATCH_LOCAL:-/scratch}" \
+    --bind "${HG_REPO_DIR}:${HG_REPO_DIR}:ro" \
+    --bind "${HG_DATA_DIR}:${HG_DATA_DIR}:ro" \
+    --bind "${HG_OUTPUT_DIR}:${HG_OUTPUT_DIR}" \
+    "${KTRACE_BIND_ENVS[@]}" \
+    --env PMIX_MCA_gds=hash \
+    --env PMIX_MCA_psec=native \
+    "${MPI_MULTINODE_ENVS[@]}" \
+    --env SCRATCH_LOCAL="${SCRATCH_LOCAL:-/scratch}" \
+    --env HG_DATASETS="$HG_DATASETS" \
+    --env HG_PRECISION="$HG_PRECISION" \
+    --env HG_BATCH_SIZE="${HG_BATCH_SIZE:-200}" \
+    --env HG_NUM_EPOCH="${HG_NUM_EPOCH:-1}" \
+    --env HYDRAGNN_VALTEST="${HYDRAGNN_VALTEST:-0}" \
+    --env HYDRAGNN_TRACE_LEVEL="${HYDRAGNN_TRACE_LEVEL:-1}" \
+    --env HG_EXAMPLE_DIR="$EXAMPLE_DIR" \
+    --env HG_OUTPUT_DIR="$HG_OUTPUT_DIR" \
+    --env HG_REPO_DIR="$HG_REPO_DIR" \
+    --env HG_CONFIG_OVERRIDE="$HG_CONFIG_OVERRIDE" \
+    "${OPT_ENVS[@]}" \
+    --env PROFILE_RANK0_ONLY=1 \
+    "${RCCL_MULTINODE_ENVS[@]}" \
+    "$HG_SIF" \
+    bash "$RANK_SCRIPT"
+
+TRAIN_RC=$?
+
+echo ""
+echo "=== HydraGNN perf-analysis training complete (rc=$TRAIN_RC) ==="
+echo "  Logs: ${HG_OUTPUT_DIR}/"
+echo "  Trace dir: ${HG_OUTPUT_DIR}/logs/"
+echo "  Omnistat DB: ${HG_OUTPUT_DIR}/omnistat-db/"
+
+exit $TRAIN_RC
diff --git a/material_science/models/HydraGNN/model.yaml b/material_science/models/HydraGNN/model.yaml
index 9f10928..6f84764 100644
--- a/material_science/models/HydraGNN/model.yaml
+++ b/material_science/models/HydraGNN/model.yaml
@@ -25,6 +25,10 @@ recipes:
     recipe_path: recipes/train/
     script: examples/run_train.sh
     sbatch_script: examples/sbatch_train_amd.sh
+  - task: perf-analysis
+    description: 2-node bottleneck analysis with TraceLens + Omnistat (user-mode) via paired analyst/verifier subagents
+    recipe_path: recipes/perf-analysis/
+    sbatch_script: examples/sbatch_train_perf_amd.sh
 
 env_vars:
   # Required for HPC scripts (sbatch_*_amd.sh, build_overlay_amd.sh)
diff --git a/material_science/models/HydraGNN/recipes/perf-analysis/README.md b/material_science/models/HydraGNN/recipes/perf-analysis/README.md
new file mode 100644
index 0000000..cdea3ce
--- /dev/null
+++ b/material_science/models/HydraGNN/recipes/perf-analysis/README.md
@@ -0,0 +1,124 @@
+# HydraGNN Performance Analysis with AMD AI Agents
+
+Multi-subagent workflow that launches a 2-node HydraGNN training on Lux (MI355X), captures PyTorch traces and Omnistat telemetry, then runs paired analyst + verifier subagents to identify bottlenecks and propose remedies.
+
+> **Audience:** internal AMD performance-engineering. The output is a diagnosis of the run, not a scientific claim about HydraGNN.
+
+## What this recipe does
+
+1. **Launcher** subagent submits a 2-node training (`sbatch_train_perf_amd.sh`) on the `lux` partition with PyTorch profiling armed for one target epoch and Omnistat user-mode collecting on every node.
+2. **Analyst** subagents (`tracelens_analyst` + `omnistat_analyst`) run after the job, in parallel, and each emits a `claims.json`.
+3. **Verifier** subagents independently re-derive the top claims from raw data and optionally probe cheap remedies on a 1-node interactive `srun`.
+4. **Synthesizer** merges, ranks, and writes `combined_report.md`.
+
+```mermaid
+flowchart TD
+    User([User]) --> Main[Main agent: orchestrator]
+    Main -->|Task| Launcher[launcher: sbatch + manifest]
+    Launcher -->|jobid + paths| Main
+    Main -->|Task parallel| TLA[tracelens_analyst]
+    Main -->|Task parallel| OSA[omnistat_analyst]
+    TLA -->|claims.json| TLV[tracelens_verifier]
+    OSA -->|claims.json| OSV[omnistat_verifier]
+    TLV --> Synth[synthesizer]
+    OSV --> Synth
+    Synth -->|combined_report.md| User
+```
+
+## Quick start
+
+```bash
+# 1. From the ai4science-studio repo root, ensure the cluster config is present
+ls .cluster-config.yaml
+
+# 2. Set the shared dir and submit the perf-analysis job
+export AI4S_SHARED_DIR=/shared/aaji
+sbatch material_science/models/HydraGNN/examples/sbatch_train_perf_amd.sh
+
+# 3. Once the job completes, drive the analyst + verifier + synthesizer subagents
+#    via the Cursor agent (or Claude Code). The agent reads
+#    .cursor/skills/ai4science-perf-analysis/SKILL.md to dispatch them.
+```
+
+The orchestrating agent is expected to run on the **login node** (`rad-vultr-login`) with shell + filesystem access to `/shared/aaji`. Verifier subagents that need a compute node use short interactive `srun -p lux -A vultr_lux -N1 --time=00:05:00` allocations; they never grab 2-node allocations.
+
+## Artifacts
+
+After a complete run:
+
+```
+/shared/aaji/models/HydraGNN/perf-runs/<jobid>/
+├── manifest.json                      # written by launcher
+├── omnistat-db/                       # VictoriaMetrics datadir
+├── logs/<model_name>/*.pt.trace.json  # PyTorch trace (rank 0 only)
+├── hydragnn-train-<jobid>.out         # SLURM stdout
+├── tracelens/
+│   ├── report.xlsx                    # TraceLens output
+│   ├── claims.json                    # tracelens_analyst output
+│   └── verified_claims.json           # tracelens_verifier output
+├── omnistat/
+│   ├── inspect_outputs/*.json         # omnistat-inspect raw outputs
+│   ├── claims.json                    # omnistat_analyst output
+│   └── verified_claims.json           # omnistat_verifier output
+└── combined_report.md                 # synthesizer output
+```
+
+The `combined_report.md` is the deliverable: ranked bottlenecks, remedies tried with deltas, remedies proposed but not tried, and clearly-flagged "system limit reached" cases.
+
+## Prerequisites
+
+- Lux access (`vultr_lux` account) with the standard HydraGNN setup at `/shared/aaji/models/HydraGNN/` — overlay, weights, code clone. See [HydraGNN/recipes/train/](../train/).
+- One-time install of Omnistat (PR #271 branch `jorda/skills`, with `origin/main` merged in), VictoriaMetrics, and TraceLens — performed lazily by the launcher subagent into `/shared/aaji/tools/`.
+- The launcher writes `gfm_mlip_with_profile.json` at submit time (does not modify the upstream `gfm_mlip.json`).
+
+## Telemetry knobs (set at `sbatch` submit time)
+
+| Env var | Default | What it does |
+|---|---|---|
+| `OMNISTAT_KERNEL_TRACE` | `0` | When `1`, loads `libomnistat_trace.so` via `ROCP_TOOL_LIBRARIES` on every rank and turns on the omnistat kernel-trace collector. Adds per-kernel dispatch count + duration time series across all 8 cards on every node. Requires the library to be built once (see `agents/launcher.md` step 1e). Validated end-to-end on job 7034. |
+| `OMNISTAT_TRACE_LIB` | `/shared/aaji/tools/omnistat-src/build-trace/libomnistat_trace.so` | Override only for development. The wrapper hard-fails if the path is missing when `OMNISTAT_KERNEL_TRACE=1`. |
+| `OMNISTAT_TRACE_LOG` | `1` | Library prints `[host][pid][omnistat] Trace summary: N/N processed records (M/M successful flushes)` on rank exit; useful as a smoke-signal that the tool initialized even when the workload crashed. |
+
+`enable_rocprofiler=True` is **on** in `omnistat-lux.config.template` by default (since commit `a23e5c4`); the rendered config's state is printed in the sbatch banner so a silent "counters off" cannot recur. Kernel tracing and device-counting can co-exist on a single GPU but raise per-job VictoriaMetrics cardinality — pick by run length and the question you're asking. See `ai4science-studio` SKILL §12 for the device-counting vs kernel-trace vs `rocprofv3` decision matrix.
+
+## Bottleneck taxonomy (used by analysts and verifiers)
+
+Each claim must map to one of:
+
+| class | examples |
+|---|---|
+| `gpu_compute` | matmul/MFMA bound; low MFU; low TFLOPS vs roofline |
+| `gpu_memory_hbm` | HBM bandwidth-bound; high `FETCH_SIZE`+`WRITE_SIZE` |
+| `cpu_dispatch` | Python overhead between kernels; aten dispatch dominating; "GPU starvation" |
+| `comm_xgmi` | intra-node allreduce / scatter-gather over XGMI |
+| `comm_scaleout` | inter-node RCCL over ionic; ANP plugin throughput |
+| `dataloader` | torch DataLoader; ADIOS2 read time; preprocessing |
+| `host_io` | `/shared` NFS reads, MIOpen cache misses |
+| `host_cpu` | OMP_NUM_THREADS, CPU saturation, NUMA effects |
+
+## Agent prompt files
+
+Per-subagent prompts (each self-contained and kept to roughly 200 lines):
+
+- [agents/launcher.md](agents/launcher.md)
+- [agents/tracelens_analyst.md](agents/tracelens_analyst.md)
+- [agents/tracelens_verifier.md](agents/tracelens_verifier.md)
+- [agents/omnistat_analyst.md](agents/omnistat_analyst.md)
+- [agents/omnistat_verifier.md](agents/omnistat_verifier.md)
+- [agents/synthesizer.md](agents/synthesizer.md)
+
+The orchestrating agent dispatches each as a Task subagent (or shell script in iteration 2 with LangGraph). Each subagent's output JSON is the typed state for the next.
+
+## Out of scope (iteration 1)
+
+- LangGraph or any async backplane — subagents communicate via JSON files only.
+- Multi-rank trace fusion (`TraceLens_generate_multi_rank_collective_report_pytorch`) — single-rank trace for now.
+- A/B comparative runs — recommended in `combined_report.md` but not auto-launched.
+- Editing upstream HydraGNN — everything driven via JSON config + env vars.
+
+## Safety / etiquette
+
+- Verifier remedy probes use **1-node** interactive allocations only, ≤5 min each.
+- The agent never re-runs the full 2-node job during analysis; this is a deliberate design choice.
+- All artifacts under `/shared/aaji/...`; large traces and DBs are not committed (see repo `.gitignore`).
+- Research/perf-engineering only — no scientific claims about HydraGNN should be derived from these short runs.
diff --git a/material_science/models/HydraGNN/recipes/perf-analysis/agents/launcher.md b/material_science/models/HydraGNN/recipes/perf-analysis/agents/launcher.md
new file mode 100644
index 0000000..65b66bb
--- /dev/null
+++ b/material_science/models/HydraGNN/recipes/perf-analysis/agents/launcher.md
@@ -0,0 +1,214 @@
+# launcher subagent
+
+Submits a 2-node HydraGNN training on Lux with PyTorch profiling and Omnistat user-mode telemetry, waits for completion, and writes `manifest.json` for downstream subagents.
+
+## Inputs
+
+- Repo at `/home/aaji/git/ai4science-studio` on branch `aaji/perf-agents-hydragnn`.
+- Cluster config at `.cluster-config.yaml` (partition `lux`, account `vultr_lux`).
+- Optional env-var overrides: `HG_BATCH_SIZE`, `HYDRAGNN_MAX_NUM_BATCH`, `HG_NUM_EPOCH`, `HG_PRECISION`, `PROFILE_TARGET_EPOCH`.
+
+## Outputs
+
+- `/shared/aaji/tools/omnistat-pr271/` — Python venv with omnistat (PR #271 + main merged) and TraceLens.
+- `/shared/aaji/tools/victoriametrics/victoria-metrics-prod` — VictoriaMetrics binary.
+- `/shared/aaji/models/HydraGNN/perf-runs/omnistat-lux.config` — Omnistat user-mode config.
+- `/shared/aaji/models/HydraGNN/perf-runs/<jobid>/manifest.json` — manifest schema below.
+- `/shared/aaji/models/HydraGNN/perf-runs/<jobid>/hydragnn-train-<jobid>.out` — symlinked from output dir.
+- `/shared/aaji/models/HydraGNN/perf-runs/<jobid>/logs/` — symlinked rank-0 trace.
+- `/shared/aaji/models/HydraGNN/perf-runs/<jobid>/omnistat-db/` — VictoriaMetrics datadir.
+
+## Manifest schema
+
+```json
+{
+  "jobid": "<int as str>",
+  "submitted_at": "ISO 8601 UTC",
+  "completed_at": "ISO 8601 UTC",
+  "exit_state": "COMPLETED|FAILED|TIMEOUT|...",
+  "nodes": 2,
+  "gpus_per_node": 8,
+  "ranks": 16,
+  "nodelist": "lux-mi355x-aN,lux-mi355x-bM",
+  "partition": "lux",
+  "account": "vultr_lux",
+  "runtime_seconds": <float>,
+  "config_used": "<path to gfm_mlip_with_profile.json>",
+  "profile_target_epoch": <int>,
+  "trace_paths": ["<path to .pt.trace.json>"],
+  "omnistat_db_path": "<dir>",
+  "training_log_path": "<path>",
+  "perf_run_dir": "<dir>",
+  "hg_precision": "fp64",
+  "hg_batch_size": 200,
+  "hg_num_epoch": 2,
+  "hydragnn_max_num_batch": 30,
+  "tools_versions": {
+    "omnistat_commit": "<sha>",
+    "tracelens_commit": "<sha>",
+    "victoriametrics_version": "<v>"
+  }
+}
+```
+
+## Steps
+
+### 1. Lazy install tools (idempotent)
+
+```bash
+set -euo pipefail
+TOOLS=/shared/aaji/tools
+mkdir -p "$TOOLS"
+
+# 1a. Omnistat: jorda/skills branch with origin/main merged in
+SRC=$TOOLS/omnistat-src
+if [[ ! -d $SRC/.git ]]; then
+  git clone https://github.com/ROCm/omnistat.git "$SRC"
+fi
+cd "$SRC"
+git fetch --all --prune
+git checkout jorda/skills 2>/dev/null || git checkout -b jorda/skills origin/jorda/skills
+git reset --hard origin/jorda/skills
+if ! git merge --no-edit origin/main; then
+  echo "ERROR: merging origin/main into jorda/skills produced conflicts; resolve manually" >&2
+  exit 2
+fi
+OMNISTAT_COMMIT=$(git rev-parse HEAD)
+
+# 1b. Venv (py3.12 to match cluster python)
+VENV=$TOOLS/omnistat-pr271
+if [[ ! -d $VENV ]]; then
+  python3 -m venv "$VENV"
+fi
+"$VENV/bin/pip" install --quiet --upgrade pip
+"$VENV/bin/pip" install --quiet -e "$SRC[query]"
+
+# 1c. TraceLens (same venv)
+"$VENV/bin/pip" install --quiet "git+https://github.com/AMD-AGI/TraceLens.git"
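+# Record the installed commit sha from pip's PEP 610 direct_url.json (the package version alone is not a commit)
+TRACELENS_COMMIT=$("$VENV/bin/python" -c "import json, importlib.metadata as md; print(json.loads(md.distribution('TraceLens').read_text('direct_url.json'))['vcs_info']['commit_id'])")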
+
+# 1d. VictoriaMetrics binary
+VM_DIR=$TOOLS/victoriametrics
+if [[ ! -x $VM_DIR/victoria-metrics-prod ]]; then
+  mkdir -p "$VM_DIR"
+  cd "$VM_DIR"
+  VM_VERSION=$(curl -fsSL https://api.github.com/repos/VictoriaMetrics/VictoriaMetrics/releases/latest | grep '"tag_name":' | sed -E 's/.*"([^"]+)".*/\1/')
+  wget -q "https://github.com/VictoriaMetrics/VictoriaMetrics/releases/download/${VM_VERSION}/victoria-metrics-linux-amd64-${VM_VERSION}.tar.gz"
+  tar xzf "victoria-metrics-linux-amd64-${VM_VERSION}.tar.gz"
+  echo "$VM_VERSION" > VERSION
+fi
+VM_VERSION=$(cat "$VM_DIR/VERSION")
+```
+
+### 1e. (Optional) Build the omnistat kernel-trace tool library
+
+Only required if the run will set `OMNISTAT_KERNEL_TRACE=1` to enable per-kernel dispatch tracing. Idempotent — skip if `build-trace/libomnistat_trace.so` already exists.
+
+```bash
+TRACE_LIB=$TOOLS/omnistat-src/build-trace/libomnistat_trace.so
+if [[ "${OMNISTAT_KERNEL_TRACE:-0}" == "1" && ! -f "$TRACE_LIB" ]]; then
+  # Must build inside the same SIF the workload uses — login nodes lack
+  # apptainer and ROCm headers. C++20 + libcurl + rocprofiler-sdk needed.
+  salloc -p lux -A vultr_lux -N 1 --time=00:15:00 --gpus-per-node=1 \
+    apptainer exec --rocm \
+      "/shared/aaji/images/pytorch_rocm7.2.2_ubuntu24.04_py3.12_pytorch_release_2.10.0.sif" \
+      bash -c "
+        set -e
+        cd $TOOLS/omnistat-src
+        cmake -S rocprofiler-sdk/ -B build-trace/ -DBUILD_KERNEL_TRACE_LIB=ON
+        cmake --build build-trace/ -j 8
+      "
+  test -f "$TRACE_LIB" || { echo "ERROR: kernel-trace build did not produce $TRACE_LIB" >&2; exit 2; }
+fi
+```
+
+The sbatch wrapper expects the `.so` at `/shared/aaji/tools/omnistat-src/build-trace/libomnistat_trace.so` by default; override with `OMNISTAT_TRACE_LIB=/path/to/lib.so`. See `ai4science-studio` SKILL §12 for when to enable kernel tracing vs the default device-counting collector.
+
+### 2. Author the Omnistat user-mode config (once, in the shared `perf-runs/` dir)
+
+If `/shared/aaji/models/HydraGNN/perf-runs/omnistat-lux.config` doesn't exist, write it from the template at `/home/aaji/git/ai4science-studio/material_science/models/HydraGNN/recipes/perf-analysis/omnistat-lux.config.template`. The template uses `%(SLURM_JOB_ID)s` placeholders that omnistat-usermode resolves at runtime.
+
+### 3. Generate the per-job profile config
+
+Read the upstream `gfm_mlip.json` from `/shared/aaji/models/HydraGNN/code/HydraGNN/examples/multidataset_hpo_sc26/gfm_mlip.json` and inject the `Profile` block **inside `NeuralNetwork`** (not at the top level — `train_validate_test()` is invoked with `config["NeuralNetwork"]` as `config`, so its Profiler reads `config["Profile"]` from that scope):
+
+```python
+cfg.setdefault("NeuralNetwork", {})["Profile"] = {
+    "enable": 1,
+    "target_epoch": int(os.environ.get("PROFILE_TARGET_EPOCH", "1")),
+}
+```
+
+Write the result to `/shared/aaji/models/HydraGNN/perf-runs/<jobid>/gfm_mlip_with_profile.json` (the path recorded as `config_used` in the manifest). Use Python (`json.load`/`json.dump`); do NOT do this with sed.
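+
+A minimal, self-contained sketch of this step follows. The function name `inject_profile` and its arguments are illustrative assumptions, not part of the wrapper; only the `NeuralNetwork.Profile` block and the `PROFILE_TARGET_EPOCH` default come from the text above.
+
+```python
+import json
+import os
+
+
+def inject_profile(src_path, dst_path, target_epoch=None):
+    """Copy the JSON config at src_path to dst_path with NeuralNetwork.Profile set.
+
+    Hypothetical helper: the upstream file at src_path is only read, never modified.
+    """
+    if target_epoch is None:
+        target_epoch = int(os.environ.get("PROFILE_TARGET_EPOCH", "1"))
+    with open(src_path) as f:
+        cfg = json.load(f)
+    # The Profile block must sit inside NeuralNetwork, not at the top level.
+    cfg.setdefault("NeuralNetwork", {})["Profile"] = {
+        "enable": 1,
+        "target_epoch": target_epoch,
+    }
+    with open(dst_path, "w") as f:
+        json.dump(cfg, f, indent=2)
+    return cfg
+```
+
+Writing to a separate destination path keeps the upstream `gfm_mlip.json` untouched, which matches the "does not modify upstream" constraint of this recipe.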
+
+(The wrapper `examples/sbatch_train_perf_amd.sh` already does this; this step exists in the launcher prompt for the case where the launcher is reused for a non-HydraGNN model.)
+
+### 4. Submit the job
+
+```bash
+export AI4S_SHARED_DIR=/shared/aaji
+export HG_OUTPUT_DIR=/shared/aaji/models/HydraGNN/perf-runs/PENDING-$$
+mkdir -p "$HG_OUTPUT_DIR"
+
+cd /home/aaji/git/ai4science-studio
+OUT=$(sbatch \
+  --partition=lux \
+  --account=vultr_lux \
+  --nodes=2 \
+  --ntasks-per-node=8 \
+  --gpus-per-node=8 \
+  --cpus-per-task=16 \
+  --time=00:30:00 \
+  --output="${HG_OUTPUT_DIR}/hydragnn-train-%j.out" \
+  --error="${HG_OUTPUT_DIR}/hydragnn-train-%j.out" \
+  material_science/models/HydraGNN/examples/sbatch_train_perf_amd.sh)
+JOBID=$(echo "$OUT" | awk '{print $NF}')
+echo "Submitted JOBID=$JOBID"
+```
+
+The sbatch wrapper inherits these env vars and exports `HG_CONFIG_OVERRIDE`, `OMNISTAT_VENV`, `OMNISTAT_CONFIG`, `OMNISTAT_USERMODE_INTERVAL`, `PROFILE_RANK0_ONLY=1` to the rank script.
+
+### 5. Wait for terminal state
+
+Poll `sacct -j $JOBID -X -n --format=State,ExitCode,Elapsed,NodeList -P` every 30 s. Treat any of `COMPLETED|FAILED|TIMEOUT|CANCELLED|NODE_FAIL|OUT_OF_MEMORY` as terminal. **Hard ceiling: 45 minutes** of polling. If the job is still not terminal by then, write `exit_state=TIMEOUT_WAITING` to the manifest and exit 1.
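+
+The terminal-state check can be sketched as below. The helper names are hypothetical; the field order follows the `--format` list above, and the first-word split is there because `sacct` can report states such as `CANCELLED by <uid>`.
+
+```python
+TERMINAL_STATES = {
+    "COMPLETED", "FAILED", "TIMEOUT", "CANCELLED", "NODE_FAIL", "OUT_OF_MEMORY",
+}
+
+
+def parse_sacct_state(line):
+    """Return the bare job state from one pipe-delimited `sacct -X -n -P` line."""
+    state_field = line.strip().split("|")[0]
+    # Keep only the first word so "CANCELLED by 1234" maps to "CANCELLED".
+    return state_field.split()[0] if state_field else ""
+
+
+def is_terminal(line):
+    """True when the sacct line reports one of the terminal states listed above."""
+    return parse_sacct_state(line) in TERMINAL_STATES
+```
+
+An empty line (job not yet visible to `sacct`) is treated as non-terminal, so the poll loop simply keeps waiting until the 45-minute ceiling.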
+
+### 6. Move artifacts into final perf-run dir
+
+Once terminal:
+
+```bash
+PERF_RUN=/shared/aaji/models/HydraGNN/perf-runs/${JOBID}
+mv "$HG_OUTPUT_DIR" "$PERF_RUN"
+# Find the rank-0 trace
+TRACE=$(find "$PERF_RUN/logs" -name '*.pt.trace.json*' 2>/dev/null | head -1 || true)
+# Find the omnistat DB (the sbatch wrapper writes it under the perf-run dir)
+DB="$PERF_RUN/omnistat-db"
+```
+
+### 7. Write the manifest
+
+Use Python in the omnistat venv to write the manifest JSON exactly per the schema above. Include all `tools_versions` captured in step 1.
+
+### 8. Final stdout
+
+Print:
+```
+STATUS=ok; reason=jobid=<jobid> state=<state> runtime=<sec>s perf_run=<path>
+```
+
+## Failure modes (must surface, never silently swallow)
+
+| Failure | Action |
+|---|---|
+| Merge conflict in step 1 | Exit 2 with the message in the script. |
+| `sbatch` returns nonzero | Exit 3, print sbatch stderr verbatim. |
+| Job state == `FAILED` | Still write the manifest (with `exit_state=FAILED`), but **also** print the last 50 lines of the SLURM out to help downstream subagents skip cleanly. |
+| No trace file found | Manifest gets `trace_paths=[]`. tracelens_analyst will gracefully no-op. |
+| omnistat-db empty | Manifest still written; omnistat_analyst will detect and report it. |
+
+## Notes for the implementing agent
+
+- The launcher subagent runs on the **login node**. Do not call `srun` from inside the launcher; only `sbatch`.
+- Do not import torch in the launcher's venv — it should stay lightweight.
+- All paths absolute, never relative.
+- Use `flock` on the install step if multiple invocations may race: `( flock -x 9; install_steps; ) 9>"$TOOLS/.install.lock"`.
diff --git a/material_science/models/HydraGNN/recipes/perf-analysis/agents/omnistat_analyst.md b/material_science/models/HydraGNN/recipes/perf-analysis/agents/omnistat_analyst.md
new file mode 100644
index 0000000..c34f6e4
--- /dev/null
+++ b/material_science/models/HydraGNN/recipes/perf-analysis/agents/omnistat_analyst.md
@@ -0,0 +1,143 @@
+# omnistat_analyst subagent
+
+Drive `omnistat-inspect` (PR #271) through the analyze-job phases on the user-mode VictoriaMetrics database, then map findings to the bottleneck taxonomy.
+
+## Inputs
+
+- `<perf_run_dir>/manifest.json`
+- The Omnistat DB at `manifest.omnistat_db_path`.
+- The `omnistat-pr271` venv at `/shared/aaji/tools/omnistat-pr271/`.
+- The VictoriaMetrics binary at `/shared/aaji/tools/victoriametrics/victoria-metrics-prod`.
+
+## Outputs
+
+- `<perf_run_dir>/omnistat/inspect_outputs/db_info.json`
+- `<perf_run_dir>/omnistat/inspect_outputs/job_info.json`
+- `<perf_run_dir>/omnistat/inspect_outputs/data_check.json`
+- `<perf_run_dir>/omnistat/inspect_outputs/health.json`
+- `<perf_run_dir>/omnistat/inspect_outputs/stats_global.json`
+- `<perf_run_dir>/omnistat/inspect_outputs/stats_<category>_<level>.json` — for any anomalous category, drilled to `node` and `gpu-id`/`interface-id`.
+- `<perf_run_dir>/omnistat/report_summary.md`
+- `<perf_run_dir>/omnistat/claims.json` (same schema as `tracelens/claims.json`)
+
+## Steps
+
+### 1. Start VictoriaMetrics on the DB (login node)
+
+Following the [load-database SKILL](https://github.com/ROCm/omnistat/blob/jorda/skills/skills/load-database/SKILL.md), with one Lux-specific addition: pass `-fs.disableMmap` so VM can open a job-scoped DB inside the login-node cgroup (without it, even a 1.6 MB DB panics with "cannot mmap file with size 4096 bytes ... no such device").
+
+```bash
+DB=$(jq -r .omnistat_db_path <manifest>)
+PERF_RUN=$(jq -r .perf_run_dir <manifest>)   # used below for vm.log, vm.pid, inspect_outputs
+[[ -d "$DB/data" ]] || { echo "STATUS=fail; reason=no_db_data"; exit 1; }
+
+# Pick free port
+PORT=8428
+ss -lnt "sport = :$PORT" 2>/dev/null | grep -q LISTEN && PORT=8429
+
+VMLOG=$PERF_RUN/omnistat/vm.log
+mkdir -p "$PERF_RUN/omnistat/inspect_outputs"
+nohup /shared/aaji/tools/victoriametrics/victoria-metrics-prod \
+    -storageDataPath="$DB" \
+    -httpListenAddr=127.0.0.1:$PORT \
+    -retentionPeriod=100y \
+    -search.disableCache \
+    -search.latencyOffset=0 \
+    -search.maxPointsPerTimeseries=90000 \
+    -fs.disableMmap \
+    > "$VMLOG" 2>&1 &
+VM_PID=$!
+echo $VM_PID > "$PERF_RUN/omnistat/vm.pid"
+TSDB_URL="http://127.0.0.1:$PORT"
+
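+# Wait up to 15 s for ready; on timeout, fail as described under "Failure modes"
+VM_READY=0
+for i in {1..15}; do
+  sleep 1
+  if curl -sf "$TSDB_URL/api/v1/status/tsdb" > /dev/null; then VM_READY=1; break; fi
+done
+if [[ $VM_READY -ne 1 ]]; then
+  tail -n 20 "$VMLOG"
+  echo "STATUS=fail; reason=vm_startup_timeout"
+  exit 1
+fi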
+
+# Trap-style cleanup is the orchestrator's job; this subagent leaves VM running
+# so the omnistat_verifier can hit the same endpoint.
+```
+
+### 2. Run analyze-job phases (per the SKILL)
+
+```bash
+. /shared/aaji/tools/omnistat-pr271/bin/activate
+SCRATCH=$PERF_RUN/omnistat/scratch
+mkdir -p "$SCRATCH"
+JOBID=$(jq -r .jobid <manifest>)
+
+OUTD=$PERF_RUN/omnistat/inspect_outputs
+
+omnistat-inspect --tsdb-url $TSDB_URL db info > $OUTD/db_info.json
+omnistat-inspect --tsdb-url $TSDB_URL --scratch-dir $SCRATCH job $JOBID info > $OUTD/job_info.json
+omnistat-inspect --tsdb-url $TSDB_URL --scratch-dir $SCRATCH job $JOBID data-check > $OUTD/data_check.json
+omnistat-inspect --tsdb-url $TSDB_URL --scratch-dir $SCRATCH job $JOBID health > $OUTD/health.json
+omnistat-inspect --tsdb-url $TSDB_URL --scratch-dir $SCRATCH job $JOBID stats --level global > $OUTD/stats_global.json
+```
+
+### 3. Identify anomalous categories and drill down
+
+For each category in `stats_global.json` (`gpu`, `host`, `network`, `xgmi`, `vendor`):
+
+- Compute coefficient of variation (`stddev/mean`) for utilization-like gauges.
+- Flag a category as anomalous if **any** of:
+  - Mean GPU utilization < 60% (high cpu_dispatch / dataloader / comm risk)
+  - Network rx/tx rate stddev/mean > 0.3 (interface imbalance)
+  - VRAM stddev/mean > 0.2 (workload heterogeneity or memory leak)
+  - HBM throttle / power throttle counts > 0
+  - XGMI traffic stddev/mean > 0.3 (unbalanced intra-node comm)
+
+For each flagged category, run `--level node` and either `--level gpu-id` (for `gpu`/`xgmi`) or `--level interface-id` (for `network`):
+
+```bash
+for cat in gpu network xgmi host vendor; do
+  if is_anomalous "$cat"; then
+    omnistat-inspect --tsdb-url $TSDB_URL --scratch-dir $SCRATCH job $JOBID stats --category $cat --level node > $OUTD/stats_${cat}_node.json
+    case $cat in
+      gpu|xgmi)    fine=gpu-id ;;
+      network)     fine=interface-id ;;
+      *)           fine="" ;;
+    esac
+    [[ -n "$fine" ]] && omnistat-inspect --tsdb-url $TSDB_URL --scratch-dir $SCRATCH job $JOBID stats --category $cat --level $fine > $OUTD/stats_${cat}_${fine//-/_}.json
+  fi
+done
+```
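+
+The shell loop above leaves `is_anomalous` undefined. One possible shape for the GPU-side check is sketched below. It takes plain lists of per-GPU values because the layout of `stats_global.json` is not shown here; the function names, the use of population stddev, and the zero fallback are all assumptions.
+
+```python
+import statistics
+
+
+def coefficient_of_variation(values):
+    """Population stddev divided by mean; 0.0 when the ratio is undefined."""
+    if len(values) < 2:
+        return 0.0
+    mean = statistics.fmean(values)
+    if mean == 0:
+        return 0.0
+    return statistics.pstdev(values) / mean
+
+
+def gpu_is_anomalous(util_pct, vram_pct, throttle_events):
+    """Apply the GPU thresholds from the list above to non-empty per-GPU samples."""
+    return (
+        statistics.fmean(util_pct) < 60.0               # low mean utilization
+        or coefficient_of_variation(vram_pct) > 0.2      # VRAM heterogeneity
+        or throttle_events > 0                           # any throttle event
+    )
+```
+
+The network and XGMI checks would reuse `coefficient_of_variation` with the 0.3 threshold.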
+
+### 4. Translate findings into claims
+
+Concrete rules for the first iteration:
+
+- Mean GPU util < 50% → `class=cpu_dispatch` (or `dataloader`, lower confidence) — magnitude = `1 - mean_util`.
+- HBM `vram_used_percentage` near 100% **and** `omnistat_fom` low → `class=gpu_memory_hbm`.
+- Power throttling > 0 events → `class=gpu_compute` with remedy "raise power cap if available; otherwise system limit".
+- Inter-node `network` rate << intra-node `xgmi` rate AND comm-heavy workload → `class=comm_scaleout` with remedy "verify ANP plugin via `NCCL_DEBUG=INFO`".
+- High `network` interface CV at `interface-id` level → `class=comm_scaleout` with remedy "specific interface (e.g. ionic_3) is degraded; investigate cable/firmware".
+- Hot GPU (>peak temp threshold from `gpus/mi355x.md` if present, else conservative 90°C) → `class=gpu_compute` with remedy "thermal-bound; check airflow".
+- Per-node CPU mem usage CV > 0.3 → `class=host_cpu` with remedy "rebalance num_workers or check NUMA pinning".
+
+### 5. Emit report and claims
+
+`report_summary.md` should mirror Phase 1-3 of the analyze-job SKILL: discovery → data check → health → global stats → drill-down narrative. Cap at ~80 lines.
+
+`claims.json` schema is identical to the tracelens version. Cap at 6 claims.
+
+### 6. Final stdout
+
+```
+STATUS=ok; reason=<n_claims> claims; vm_port=<port>; vm_pid=<pid>
+```
+
+The orchestrator (or the verifier, when finished) is responsible for stopping VM via `kill $(cat <perf_run>/omnistat/vm.pid)`.
+
+## Failure modes
+
+- DB empty (no metrics): write a single claim `class=host_io` `hypothesis="omnistat-usermode never collected"` and STATUS=partial. Likely cause: collector/server didn't start; check sbatch wrapper logs.
+- omnistat-inspect not on PATH: STATUS=fail; the launcher should have installed it.
+- VictoriaMetrics startup timeout (15 s): STATUS=fail; print last 20 lines of `vm.log`.
+
+## Notes
+
+- This subagent has read-write access to the perf-run dir.
+- It must NOT modify or delete the omnistat DB.
+- Never query `atlvrmonad01:9090` — it's IP-blocked and not the right database for this job anyway.
diff --git a/material_science/models/HydraGNN/recipes/perf-analysis/agents/omnistat_verifier.md b/material_science/models/HydraGNN/recipes/perf-analysis/agents/omnistat_verifier.md
new file mode 100644
index 0000000..e1d4cc1
--- /dev/null
+++ b/material_science/models/HydraGNN/recipes/perf-analysis/agents/omnistat_verifier.md
@@ -0,0 +1,92 @@
+# omnistat_verifier subagent
+
+Independently re-derive the top 2-3 claims from `omnistat/claims.json` by issuing raw PromQL via `curl` against the same VictoriaMetrics endpoint, at the **finest sampling step**. Optionally probe one cheap remedy on a 1-node interactive `srun`.
+
+## Inputs
+
+- `<perf_run_dir>/manifest.json`
+- `<perf_run_dir>/omnistat/claims.json`
+- VictoriaMetrics already running (PID at `<perf_run_dir>/omnistat/vm.pid`, port from analyst output).
+- `<perf_run_dir>/omnistat/inspect_outputs/job_info.json` (for time range and sampling interval).
+
+## Outputs
+
+- `<perf_run_dir>/omnistat/verified_claims.json` (same schema as `tracelens/verified_claims.json`).
+
+## Steps
+
+### 1. Re-derive top claims via raw PromQL
+
+Read the time range from `job_info.json` (`start_time`, `end_time`, `sampling_interval`). For each top claim, issue a `query_range` at the finest step (`max(sampling_interval, runtime/90000)`).
+
+Examples:
+
+**Important:** `rocm_*` and `omnistat_host_*` metrics do **not** carry a `jobid` label directly. Filter by joining with `rmsjob_info` per-instance:
+
+```promql
+metric_name * on (instance) group_left() (max by (instance) (rmsjob_info{jobid="$JOBID"}))
+```
+
+A naive `rocm_utilization_percentage{jobid="..."}` will return 0 series even though the data is there. Use the join in every query.
+
+Also: `start_time`/`end_time` from `job_info.json` are **UTC**. Convert with `calendar.timegm(time.strptime(...))` (NOT `time.mktime`, which assumes local time) before passing to PromQL.
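+
+The join, the UTC conversion, and the step choice can be sketched together as follows. The helper names are hypothetical, and the timestamp format string is an assumption; check it against the actual `job_info.json` before use.
+
+```python
+import calendar
+import time
+
+
+def job_scoped(metric, jobid):
+    """Wrap a metric name in the rmsjob_info join described above."""
+    return (
+        f"{metric} * on (instance) group_left() "
+        f'(max by (instance) (rmsjob_info{{jobid="{jobid}"}}))'
+    )
+
+
+def to_epoch_utc(timestamp, fmt="%Y-%m-%dT%H:%M:%S"):
+    """Parse a UTC timestamp string into Unix seconds (assumed format)."""
+    # calendar.timegm treats the struct_time as UTC; time.mktime would not.
+    return calendar.timegm(time.strptime(timestamp, fmt))
+
+
+def finest_step(sampling_interval, runtime_seconds, max_points=90000):
+    """Smallest step that keeps a query_range under the per-series point cap."""
+    return max(sampling_interval, runtime_seconds / max_points)
+```
+
+For example, `job_scoped("rocm_utilization_percentage", jobid)` yields the expression to wrap in `avg(...)` for the first row of the table below.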
+
+| Claim | PromQL |
+|---|---|
+| "Mean GPU util = X%" | `avg(rocm_utilization_percentage * on (instance) group_left() (max by (instance) (rmsjob_info{jobid="$JOBID"})))` |
+| "VRAM near 100% on N nodes" | `max(rocm_vram_used_percentage * on (instance) group_left() (max by (instance) (rmsjob_info{jobid="$JOBID"})))` per `instance` |
+| "Network tx rate plateau" | `rate(omnistat_network_tx_bytes * on (instance) group_left() (max by (instance) (rmsjob_info{jobid="$JOBID"}))[60s])` peak |
+| "XGMI imbalance" | per-card stddev of `rocm_xgmi_*_bytes` rate, joined as above |
+| "Power throttling events" | sum increase of throttle counters over the job span, joined as above |
+
+```bash
+PORT=$(jq -r '.[0].port' <vm_info>)  # or pull from analyst stdout
+curl -sG "http://127.0.0.1:$PORT/api/v1/query_range" \
+    --data-urlencode "query=avg(rocm_utilization_percentage * on (instance) group_left() (max by (instance) (rmsjob_info{jobid=\"$JOBID\"})))" \
+    --data-urlencode "start=$START" \
+    --data-urlencode "end=$END" \
+    --data-urlencode "step=$STEP" \
+    | jq '.data.result | map(.values[][1] | tonumber) | add/length'
+```
+
+The verifier should re-derive the actual number from the PromQL result and compare it to `claim.magnitude.value`. Within ±20% of the claimed value → `verdict=verified`. Outside that band → `refuted`. Empty result series → `inconclusive`.
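+
+A minimal sketch of that comparison is shown below. The function name and the zero-claim handling are assumptions; the mean aggregation mirrors the `add/length` in the `jq` filter above and would need to change for claims stated as a peak or a maximum.
+
+```python
+def verdict(claimed, derived_values, tolerance=0.20):
+    """Classify a claim from re-derived samples; returns (verdict, verifier_value)."""
+    if not derived_values:
+        # Empty result series: there is nothing to compare against.
+        return "inconclusive", None
+    derived = sum(derived_values) / len(derived_values)
+    if claimed == 0:
+        # Relative error is undefined here; assumption: only an exact zero verifies.
+        return ("verified" if derived == 0 else "refuted"), derived
+    rel_err = abs(derived - claimed) / abs(claimed)
+    return ("verified" if rel_err <= tolerance else "refuted"), derived
+```
+
+The second element of the tuple is what would go into `verifier_value` in `verified_claims.json`.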
+
+### 2. Optional remedy probe
+
+Same constraints as the tracelens verifier — at most one 1-node `srun -p lux -A vultr_lux -N1 --time=00:05:00`. Telemetry-side probes are typically configuration changes (e.g. enable rocprofiler `hbm` counter set in the omnistat config and re-run for 1 minute on 1 node to see if HBM bandwidth is actually saturating).
+
+If a probe needs the omnistat config to change, write a temp config file based on `omnistat-lux.config` with the rocprofiler section uncommented, point `OMNISTAT_CONFIG` at it, and submit via the existing sbatch script as a tiny job: set `HG_NUM_EPOCH=1 HYDRAGNN_MAX_NUM_BATCH=10` in the environment and pass `--nodes=1` to `sbatch`. Budget that branch generously (10 min), since it includes the omnistat collector startup.
+
+If multi-node is required (e.g. ANP plugin retest), set `remedy_probe.ran=false; notes="multi-node probe deferred"`.
+
+### 3. Emit `verified_claims.json`
+
+Same shape as the tracelens version. Each claim copied through with `verdict`, `verifier_evidence`, `verifier_value`, `remedy_probe`.
+
+### 4. Stop VictoriaMetrics
+
+After all queries are done:
+
+```bash
+kill "$(cat <perf_run>/omnistat/vm.pid)" 2>/dev/null || true
+```
+
+The synthesizer doesn't need VM running.
+
+### 5. Final stdout
+
+```
+STATUS=ok; reason=<n_verified>v/<n_refuted>r/<n_inconclusive>i; probe=<class|none>
+```
+
+## Failure modes
+
+- VM not running: try to start it the same way the analyst did before giving up; if still down, STATUS=fail.
+- Query returns no data for a key metric the claim relies on: that's an `inconclusive` verdict, not a fail.
+- Network blip mid-query: retry once with 5s sleep.
+
+## Notes
+
+- Use **only** `curl` for re-derivation, **not** `omnistat-inspect`. The whole point is an independent path.
+- Use the finest step that VM allows (`runtime / 90000` floor); document the actual step used in `verifier_evidence`.
+- Never delete the DB or any analyst output.
diff --git a/material_science/models/HydraGNN/recipes/perf-analysis/agents/synthesizer.md b/material_science/models/HydraGNN/recipes/perf-analysis/agents/synthesizer.md
new file mode 100644
index 0000000..476d884
--- /dev/null
+++ b/material_science/models/HydraGNN/recipes/perf-analysis/agents/synthesizer.md
@@ -0,0 +1,131 @@
+# synthesizer subagent
+
+Merge `tracelens/verified_claims.json` and `omnistat/verified_claims.json` into a single ranked, deduplicated bottleneck report.
+
+## Inputs
+
+- `<perf_run_dir>/manifest.json`
+- `<perf_run_dir>/tracelens/verified_claims.json`
+- `<perf_run_dir>/omnistat/verified_claims.json`
+- (Optional) `<perf_run_dir>/tracelens/report_summary.md`, `<perf_run_dir>/omnistat/report_summary.md`
+
+## Outputs
+
+- `<perf_run_dir>/combined_report.md`
+
+## Steps
+
+### 1. Load and filter
+
+```python
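+import json
+
+# perf_run is the <perf_run_dir> that holds manifest.json; tag each claim with its source
+with open(f"{perf_run}/tracelens/verified_claims.json") as f:
+    tl = json.load(f)
+with open(f"{perf_run}/omnistat/verified_claims.json") as f:
+    om = json.load(f)
+all_claims = [{"src": "TL", **c} for c in tl] + [{"src": "OS", **c} for c in om]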
+```
+
+Drop `verdict == "refuted"`. Keep `inconclusive` claims but tag them so the user sees the gap.
+
+### 2. Deduplicate by class + topic
+
+Group claims by `class`. Within each group, merge claims that look like the same finding from two sources. Heuristic:
+- Both have `class=comm_scaleout` → almost certainly the same finding viewed from trace (NCCL kernels) and telemetry (network rates). Merge into one entry that lists both `src` values and both magnitudes.
+- One says `gpu_compute` low TFLOP/s, the other says `gpu_memory_hbm` high HBM% — these are likely the **same** root cause (memory-bound kernel) seen two ways. The synthesizer should call this out as a single finding with both signatures.
+
+### 3. Rank
+
+Score = `magnitude.value × confidence_weight × (corroborated ? 1.5 : 1.0)`.
+`confidence_weight`: high=1.0, medium=0.7, low=0.4. `corroborated` = both TL and OS contributed to the merged entry.
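+
+The scoring rule above can be sketched as follows. This is a minimal illustration, not the recipe's implementation: the `score` helper and the `srcs` key on a merged entry are hypothetical names, and the sketch assumes the `magnitude.value` fields being compared share a unit (fractions here), since raw values in different units do not rank meaningfully against each other.
+
+```python
+CONFIDENCE_WEIGHT = {"high": 1.0, "medium": 0.7, "low": 0.4}
+
+def score(entry):
+    """Score one merged entry; `srcs` (hypothetical key) lists contributing sources."""
+    corroborated = {"TL", "OS"} <= set(entry.get("srcs", []))
+    boost = 1.5 if corroborated else 1.0
+    return entry["magnitude"]["value"] * CONFIDENCE_WEIGHT[entry["confidence"]] * boost
+
+# Illustrative entries only; the numbers are made up for the example.
+entries = [
+    {"class": "comm_scaleout", "srcs": ["TL", "OS"], "confidence": "medium",
+     "magnitude": {"value": 0.30}},
+    {"class": "gpu_compute", "srcs": ["TL"], "confidence": "high",
+     "magnitude": {"value": 0.34}},
+]
+ranked = sorted(entries, key=score, reverse=True)
+```
+
+Note that the corroboration boost does not always outweigh a lower confidence weight: in this example the TL-only, high-confidence entry still ranks first.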
+
+### 4. Tag "system limit reached"
+
+A claim is a system limit if any of:
+- `proposed_remedy is null`
+- `verdict=verified` and the remedy probe ran, but `remedy_probe.delta_pct` is small (absolute value < 5%)
+- the metric matches a documented MI355X spec ceiling (e.g. fp64 39 TFLOP/s, HBM 8 TB/s, ANP scale-out ~25 GB/s)
+
+Mark these explicitly so the report doesn't promise a fix that won't materialize.
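+
+One way to express the three conditions as a predicate is sketched below. The `is_system_limit` name, the `spec_ceiling` argument, and the 10% `tolerance` used to decide that a metric "matches" a ceiling are all assumptions made for illustration; the recipe does not define how close counts as a match.
+
+```python
+def is_system_limit(claim, spec_ceiling=None, tolerance=0.10):
+    """Return True if the claim should be tagged "system limit reached".
+
+    spec_ceiling (hypothetical): documented ceiling in the same unit as
+    magnitude.value, or None when no ceiling applies to this metric.
+    """
+    # Condition 1: the analyst proposed no remedy.
+    if claim.get("proposed_remedy") is None:
+        return True
+    # Condition 2: verified claim whose probed remedy barely moved the number.
+    delta = (claim.get("remedy_probe") or {}).get("delta_pct")
+    if claim.get("verdict") == "verified" and delta is not None and abs(delta) < 5.0:
+        return True
+    # Condition 3: observed value sits within `tolerance` of the spec ceiling.
+    if spec_ceiling:
+        return claim["magnitude"]["value"] >= (1.0 - tolerance) * spec_ceiling
+    return False
+```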
+
+### 5. Write `combined_report.md`
+
+Layout:
+
+```markdown
+# HydraGNN bottleneck analysis — JOBID <id>
+
+**Run:** 2 nodes × 8 GPUs (MI355X), <runtime>s, precision=<p>, batch=<b>
+**Sources:** TraceLens v<x> on rank-0 trace, Omnistat user-mode (<sampling>s sampling)
+**Verdict counts:** <verified>/<refuted>/<inconclusive>
+
+## Executive summary
+
+<1 paragraph, 4-5 sentences. Named bottleneck class first, magnitude, then "system-limit" or "improvable">
+
+## Ranked bottlenecks
+
+### 1. <Class> — <pithy title>  [score: <s>] [<corroborated|TL-only|OS-only>] [<system-limit|improvable>]
+
+**Hypothesis:** <merged hypothesis>
+**Magnitude:** <number, sources>
+**Evidence:** <bullets, with paths>
+**Remedy:** <if any> — **tried?** <yes/no>; **delta:** <if tried>
+**Verifier note:** <verifier_evidence digest>
+
+### 2. ...
+
+## Remedies tried in this session
+
+| Class | Probe | Baseline | After remedy | Delta |
+|---|---|---|---|---|
+| ... |
+
+## Remedies proposed but not tried (cost/risk)
+
+| Class | Remedy | Why deferred |
+|---|---|---|
+
+## Limits reached
+
+| Class | Spec ceiling | Observed |
+|---|---|---|
+
+## Inconclusive claims
+
+| Source | Class | Hypothesis | Why inconclusive |
+|---|---|---|---|
+
+## Next steps
+
+1. <most informative A/B run, e.g. fp64 vs bf16 at 1 node>
+2. <next observability gap to fill, e.g. enable rocprofiler hbm counters>
+3. <next iteration: multi-rank trace fusion / rank-N comparison / longer run>
+
+## Artifacts
+
+- `tracelens/report.xlsx`
+- `tracelens/verified_claims.json`
+- `omnistat/inspect_outputs/`
+- `omnistat/verified_claims.json`
+- `manifest.json`
+
+---
+Generated <ISO timestamp> by `material_science/models/HydraGNN/recipes/perf-analysis/`.
+```
+
+### 6. Final stdout
+
+```
+STATUS=ok; reason=<n_findings>; top=<class>; tried=<n_probes>
+```
+
+## Style rules
+
+- Specific numbers from evidence — never vague ("most of the time", "huge").
+- If TL and OS disagree on a number that's not refuted, show **both** with sources rather than pick one.
+- When `remedy_probe.ran=false` because of multi-node requirement, say so explicitly.
+- Do not editorialize about HydraGNN as a model; only about the run.
+
+## Failure modes
+
+- Both inputs missing → STATUS=fail.
+- Only one input present → emit a partial report with a banner explaining which side is missing; STATUS=partial.
diff --git a/material_science/models/HydraGNN/recipes/perf-analysis/agents/tracelens_analyst.md b/material_science/models/HydraGNN/recipes/perf-analysis/agents/tracelens_analyst.md
new file mode 100644
index 0000000..b33572d
--- /dev/null
+++ b/material_science/models/HydraGNN/recipes/perf-analysis/agents/tracelens_analyst.md
@@ -0,0 +1,124 @@
+# tracelens_analyst subagent
+
+Generate a TraceLens performance report for the rank-0 PyTorch trace and produce structured `claims.json` mapped to the bottleneck taxonomy.
+
+## Inputs
+
+- `/shared/aaji/models/HydraGNN/perf-runs/<jobid>/manifest.json`
+
+## Outputs
+
+- `<perf_run_dir>/tracelens/report.xlsx` — the raw TraceLens Excel.
+- `<perf_run_dir>/tracelens/report_summary.md` — human-readable digest of the workbook (top-N kernels, GPU idle %, comm vs compute split, roofline categorization).
+- `<perf_run_dir>/tracelens/claims.json` — structured claims (schema below).
+
+## Claim schema
+
+Every entry:
+
+```json
+{
+  "id": "TL-001",
+  "class": "gpu_compute|gpu_memory_hbm|cpu_dispatch|comm_xgmi|comm_scaleout|dataloader|host_io|host_cpu",
+  "hypothesis": "<one-sentence claim>",
+  "magnitude": {
+    "metric": "<e.g. 'fraction_of_step_time'>",
+    "value": 0.34,
+    "unit": "fraction|seconds|GBps|TFLOPs|count"
+  },
+  "evidence_paths": ["<path to xlsx>:<sheet>:<row>", "<path to raw json>:<grep pattern>"],
+  "evidence_excerpt": "<200 chars max>",
+  "confidence": "high|medium|low",
+  "proposed_remedy": "<short imperative>",
+  "remedy_test_command_or_null": "srun -p lux -A vultr_lux -N1 --time=00:05:00 ... | null"
+}
+```
+
+## Steps
+
+### 1. Sanity check
+
+```bash
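+. /shared/aaji/tools/omnistat-pr271/bin/activate
+MANIFEST="/shared/aaji/models/HydraGNN/perf-runs/<jobid>/manifest.json"  # substitute the real job ID
+PERF_RUN=$(jq -r .perf_run_dir "$MANIFEST")
+TRACE=$(jq -r '.trace_paths[0] // empty' "$MANIFEST")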
+[[ -z "$TRACE" ]] && { echo "STATUS=partial; reason=no_trace_in_manifest"; exit 0; }
+mkdir -p "$PERF_RUN/tracelens"
+```
+
+### 2. Run TraceLens
+
+```bash
+set -o pipefail  # keep TraceLens's exit status visible through the tee pipe
+cd "$PERF_RUN/tracelens"
+TraceLens_generate_perf_report_pytorch \
+    --profile_json_path "$TRACE" \
+    --output_xlsx_path report.xlsx \
+    --output_csvs_dir csvs/ \
+    --enable_kernel_summary \
+    --short_kernel_study 2>&1 | tee tracelens.log
+```
+
+CLI confirmed against TraceLens v0.1.0+ on Lux: `--output_xlsx_path` controls the workbook path; `--output_csvs_dir` writes per-sheet CSVs (handy for the verifier). `--enable_kernel_summary` adds the kernel-level summary sheet. `--short_kernel_study` flags very short kernels (likely launch-overhead bound).
+
+If multi-node, the script will detect collective ops and produce a separate "Collective Analysis" sheet unless `--disable_coll_analysis` is passed (don't pass it).
+
+### 3. Extract digest from the workbook
+
+Use `openpyxl` (already a TraceLens dep) in the venv:
+
+```python
+import openpyxl, json, pathlib
+wb = openpyxl.load_workbook("report.xlsx", data_only=True, read_only=True)
+sheets = wb.sheetnames  # expected: GPU_Timeline, Ops_Summary, Roofline, NN_Module, ...
+```
+
+Pull these specific signals (skip a sheet gracefully if absent):
+
+| Sheet | Signals to extract |
+|---|---|
+| `GPU_Timeline` | total_gpu_time, idle_time, kernel_time, comm_time (sum durations of NCCL/RCCL events) |
+| `Ops_Summary` | top-10 ops by total time + their percent |
+| `Roofline` | per-op TFLOP/s and TB/s; flag any in compute-bound vs memory-bound region |
+| `NN_Module` | top-3 modules by GPU time |
+
+Write the digest to `report_summary.md` as a few markdown tables.
+
+### 4. Translate findings into claims
+
+Concrete rules for the first iteration (the agent may add more):
+
+- If `comm_time / total_step_time > 0.20` → `class=comm_xgmi` if intra-node only OR `comm_scaleout` if inter-node (detect by NCCL kernel name `AllReduce` + `nNodes>1` flag). Magnitude = the ratio.
+- If `idle_time / total_step_time > 0.15` → `class=cpu_dispatch` (could also be dataloader; raise as `confidence=medium`).
+- For each top-3 op: classify by Roofline region. Compute-bound + low TFLOP/s → `class=gpu_compute`. Memory-bound + high TB/s → `class=gpu_memory_hbm`. Memory-bound + low TB/s → `class=gpu_memory_hbm` with confidence=low + remedy "verify with HBM counters via Omnistat rocprofiler probe".
+- If `dataloader` events appear in the timeline (look for `enumerate(DataLoader)#`) and their cumulative time > 5% → `class=dataloader`.
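+
+The ratio-based rules above could be applied to the digest roughly like this. The `classify` helper and the `signals` dict layout are hypothetical, and the sketch simplifies the inter-node check to `n_nodes > 1` instead of also inspecting NCCL kernel names; the Roofline rule is omitted because it needs per-op data.
+
+```python
+def classify(signals, n_nodes):
+    """Turn digest timings (seconds) into draft claims using the thresholds above."""
+    step = signals["total_step_time"]
+    claims = []
+
+    def add(cls, ratio, **extra):
+        claims.append({"class": cls, "magnitude": {
+            "metric": "fraction_of_step_time", "value": ratio, "unit": "fraction"},
+            **extra})
+
+    comm = signals["comm_time"] / step
+    if comm > 0.20:
+        add("comm_scaleout" if n_nodes > 1 else "comm_xgmi", comm)
+    idle = signals["idle_time"] / step
+    if idle > 0.15:
+        # Idle GPU time could also be the dataloader, hence medium confidence.
+        add("cpu_dispatch", idle, confidence="medium")
+    loader = signals.get("dataloader_time", 0.0) / step
+    if loader > 0.05:
+        add("dataloader", loader)
+    return claims
+```
+
+The returned dicts are drafts: `id`, `hypothesis`, evidence fields, and remedies still have to be filled in before they match the claim schema.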
+
+### 5. Emit `claims.json`
+
+Cap at 6 claims. Order by `magnitude.value` (when comparable). Each claim must include a remedy that is concrete and **verifiable**. Examples:
+
+| Hypothesis | Remedy |
+|---|---|
+| "RCCL allreduce accounts for 35% of step time" | "Try `RCCL_LL128_FORCE_ENABLE=0` or a larger gradient bucket via DDP's `bucket_cap_mb`" |
+| "fp64 GEMM at 12 TFLOP/s, well below MI355X 39 TFLOP/s peak" | "Test `HG_PRECISION=bf16` for 5 batches on 1 node" |
+| "DataLoader events occupy 18% of step time" | "Increase `HYDRAGNN_NUM_WORKERS` from 0 to 4" |
+| "fp64 SQ FMA approaches MI355X spec ceiling" | "null — system limit; only path is bf16/fp32" |
+
+A null remedy is acceptable for "limit reached" findings.
+
+### 6. Final stdout
+
+```
+STATUS=ok; reason=<n_claims> claims, top=<class>:<magnitude>
+```
+
+## Failure modes
+
+- TraceLens import error → STATUS=fail; do not attempt to "patch" the install.
+- Trace too large to load (>2 GB) → write a single claim of class `cpu_dispatch` with hypothesis "trace too large to fully analyze; profiler captured >X seconds — schedule may be misconfigured".
+- No NCCL events in trace → multi-node still ran but profiler missed comm; flag as `inconclusive` rather than claiming "comm is fast".
+
+## Notes
+
+- Do NOT run `omnistat-inspect`; that's the other analyst's job. If you need the omnistat DB to validate something, hand the question to the omnistat_verifier instead.
+- Do NOT modify the raw trace file.
+- Keep `evidence_paths` machine-parseable so the verifier can re-derive the number from the same source.
diff --git a/material_science/models/HydraGNN/recipes/perf-analysis/agents/tracelens_verifier.md b/material_science/models/HydraGNN/recipes/perf-analysis/agents/tracelens_verifier.md
new file mode 100644
index 0000000..a9494ae
--- /dev/null
+++ b/material_science/models/HydraGNN/recipes/perf-analysis/agents/tracelens_verifier.md
@@ -0,0 +1,108 @@
+# tracelens_verifier subagent
+
+Independently re-derive the top 2-3 claims from `tracelens/claims.json` against the raw `.pt.trace.json` and (optionally) probe **one** cheap remedy on a 1-node interactive `srun`.
+
+## Inputs
+
+- `<perf_run_dir>/manifest.json`
+- `<perf_run_dir>/tracelens/claims.json`
+- The raw trace at `manifest.trace_paths[0]`.
+
+## Outputs
+
+- `<perf_run_dir>/tracelens/verified_claims.json` — same schema as `claims.json` plus a `verdict` field per entry.
+
+## Verdict schema
+
+Each input claim is copied through with one new field:
+
+```json
+{
+  "...all fields from claims.json...": "...",
+  "verdict": "verified|refuted|inconclusive",
+  "verifier_evidence": "<200 chars: how you re-derived it>",
+  "verifier_value": <re-derived number>,
+  "remedy_probe": {
+    "ran": true|false,
+    "command": "<srun command or null>",
+    "baseline_seconds_per_step": <float|null>,
+    "remedy_seconds_per_step": <float|null>,
+    "delta_pct": <float|null>,
+    "notes": "<short>"
+  }
+}
+```
+
+## Steps
+
+### 1. Re-derive top 3 claims from the raw JSON
+
+```python
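+import gzip
+import json
+from pathlib import Path
+
+# manifest_path: <perf_run_dir>/manifest.json (set by the caller)
+manifest = json.loads(Path(manifest_path).read_text())
+trace = Path(manifest["trace_paths"][0])
+
+def open_trace(p):
+    # PyTorch profiler traces may be gzip-compressed (*.json.gz).
+    return gzip.open(p, "rt") if str(p).endswith(".gz") else open(p, "r")
+
+with open_trace(trace) as f:
+    data = json.load(f)
+events = data["traceEvents"]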
+
+For each top claim, re-derive the number independently of the TraceLens report:
+
+| Claim class | Re-derivation |
+|---|---|
+| `comm_xgmi` / `comm_scaleout` | sum `dur` for events whose `name` matches `^(nccl\|rccl\|ncclKernel\|AllReduce\|AllGather\|ReduceScatter)`; divide by total span (`max(ts+dur)-min(ts)` of GPU events). Compare to claim's `magnitude.value`. |
+| `gpu_compute` | filter `cat=='kernel'`, find top-N by `sum(dur)`; check the top kernel name + percent matches the claim. |
+| `cpu_dispatch` | `(total_span - sum(GPU kernel dur)) / total_span` should match "idle %". |
+| `dataloader` | sum `dur` of events with `name` containing `enumerate(DataLoader)`. |
+
+If `verifier_value` is within ±20% (relative) of the claim's `magnitude.value` → `verdict=verified`. If it falls outside → `verdict=refuted`, recording the re-derived number. If the relevant events cannot be isolated → `verdict=inconclusive`.
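+
+As a rough sketch of the communication re-derivation and the ±20% rule, under stated assumptions: `comm_fraction` and `verdict` are hypothetical helper names, "GPU events" are taken to be events with `cat == "kernel"`, and durations are summed without de-overlapping, so concurrent streams can overcount.
+
+```python
+import re
+
+COMM_RE = re.compile(r"^(nccl|rccl|ncclKernel|AllReduce|AllGather|ReduceScatter)")
+
+def comm_fraction(events):
+    """Fraction of the GPU-kernel span spent in collective kernels, or None."""
+    gpu = [e for e in events if e.get("cat") == "kernel" and "dur" in e]
+    if not gpu:
+        return None
+    span = max(e["ts"] + e["dur"] for e in gpu) - min(e["ts"] for e in gpu)
+    comm = sum(e["dur"] for e in gpu if COMM_RE.match(e["name"]))
+    return comm / span if span > 0 else None
+
+def verdict(claimed, rederived, tol=0.20):
+    """Compare a claimed magnitude with the re-derived one (relative tolerance)."""
+    if rederived is None:
+        return "inconclusive"
+    if claimed == 0:
+        return "verified" if rederived == 0 else "refuted"
+    return "verified" if abs(rederived - claimed) <= tol * abs(claimed) else "refuted"
+```
+
+A call such as `verdict(claim["magnitude"]["value"], comm_fraction(events))` then yields the `verdict` field, with `comm_fraction(events)` stored as `verifier_value`.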
+
+### 2. Pick at most ONE remedy probe
+
+Choose the highest-impact `verified` claim with a non-null `remedy_test_command_or_null`. Constraints:
+
+- 1 node only (`-N1`).
+- ≤5 minutes wall-time.
+- Must produce a comparable number (steps/sec or seconds/step) to the original 2-node baseline. If the remedy can only be tested at scale (e.g. RCCL inter-node tweak), set `remedy_probe.ran=false; notes="remedy requires multi-node; deferred to next iteration"`.
+
+Example probe for the bf16 / fp64 ceiling claim — run via the existing HydraGNN sbatch script in interactive mode:
+
+```bash
+srun -p lux -A vultr_lux -N1 --ntasks=8 --gpus-per-node=8 --cpus-per-task=16 \
+    --time=00:05:00 --pty bash -c '
+export AI4S_SHARED_DIR=/shared/aaji
+export HG_PRECISION=bf16
+export HYDRAGNN_MAX_NUM_BATCH=10
+export HG_NUM_EPOCH=1
+export HG_OUTPUT_DIR=/tmp/perf-probe-$$
+mkdir -p $HG_OUTPUT_DIR
+bash /home/aaji/git/ai4science-studio/material_science/models/HydraGNN/examples/sbatch_train_amd.sh \
+    2>&1 | tee /tmp/probe-$SLURM_JOB_ID.out
+grep -oE "[0-9.]+s/it" /tmp/probe-$SLURM_JOB_ID.out | tail -5
+'
+```
+
+Note: `sbatch_train_amd.sh` is designed for `sbatch`, not `srun --pty`. For probes prefer using its rank-script logic directly via a small ad-hoc wrapper rather than running the whole sbatch under srun. The verifier may inline a minimal Python invocation that bypasses sbatch and just runs the upstream `gfm_mlip_all_mpnn.py` for 10 batches inside the container with the alternate precision.
+
+If the verifier judges that no cheap probe is feasible, set `remedy_probe.ran=false` with a clear note and proceed.
+
+### 3. Emit `verified_claims.json`
+
+Same length and order as `claims.json`. Even refuted claims stay in the file — the synthesizer will drop them, but they should remain visible.
+
+### 4. Final stdout
+
+```
+STATUS=ok; reason=<n_verified>v/<n_refuted>r/<n_inconclusive>i; probe=<class|none>
+```
+
+## Constraints
+
+- Never modify `claims.json` — only write `verified_claims.json`.
+- Never run a 2-node `srun` or `sbatch` from this subagent.
+- If the trace file is missing, write `verified_claims.json` as `[]` and `STATUS=partial; reason=no_trace`.
+
+## Why two phases (analyst + verifier)?
+
+Hallucination defense. The analyst can produce a number from a tool's report; the verifier re-derives it from raw events. When the two disagree, the user sees both numbers and the synthesizer flags the conflict in the final report rather than silently picking one. The omnistat side uses the same approach, re-deriving PromQL results with `curl`.
diff --git a/material_science/models/HydraGNN/recipes/perf-analysis/omnistat-lux.config.template b/material_science/models/HydraGNN/recipes/perf-analysis/omnistat-lux.config.template
new file mode 100644
index 0000000..1eee37e
--- /dev/null
+++ b/material_science/models/HydraGNN/recipes/perf-analysis/omnistat-lux.config.template
@@ -0,0 +1,133 @@
+#--
+# Omnistat user-mode config for Lux (MI355X)
+#
+# Adapted from HydraGNN's upstream omnistat.hydragnn-external.config.
+# Differences from upstream / from system-mode default on Lux:
+#   - enable_amd_smi (newer SMI), not rocm_smi
+#   - rocprofiler counters ENABLED with a single constant-mode profile
+#     (hbm_flops_f64, defined below)
+#   - allowed_ips includes 0.0.0.0 so the user-mode VM can scrape
+#   - victoria_datadir under /shared/aaji (NFS, persists after job)
+#--
+
+[omnistat.collectors]
+
+port = 8101
+enable_rocm_smi = False
+enable_amd_smi = True
+enable_ras_ecc = True
+enable_rms = True
+enable_rocprofiler = True
+enable_network = True
+enable_vendor_counters = True
+enable_xgmi = True
+enable_cu_occupancy = False
+
+# Kernel-dispatch tracing collector (per-kernel name + duration histogram on
+# every node). Disabled by default: it requires the standalone tool library
+# `libomnistat_trace.so` to be built (see rocprofiler-sdk/README.md
+# "Kernel Tracing Library") and loaded into the workload via
+# ROCP_TOOL_LIBRARIES. The sbatch wrapper enables this automatically when
+# OMNISTAT_KERNEL_TRACE=1 is set; otherwise it stays False.
+#
+# IMPORTANT: kernel tracing and the rocprofiler device-counting service
+# (enable_rocprofiler above) BOTH talk to rocprofiler-sdk. They can co-exist
+# on a single GPU, but if you're chasing per-kernel attribution it's usually
+# cleaner to either (a) run kernel-trace alone, or (b) run device counting
+# alone — see ai4science-studio SKILL §11 for the tradeoffs.
+enable_kernel_trace = False
+
+# Path to local ROCm install (Lux: 7.2.x)
+rocm_path = /opt/rocm
+
+# allowed scrapers: localhost (the user-mode VM sits on the same node) + 0.0.0.0
+# wildcard for any user-mode VictoriaMetrics that might run on a sibling node.
+allowed_ips = 127.0.0.1, 0.0.0.0
+
+[omnistat.collectors.rms]
+
+host_skip = "login.*"
+enable_annotations = True
+# Lux note: /tmp/omni_rmsjobinfo (the upstream default) is owned by user
+# `omnidc` on Lux compute nodes for the system-mode collector, so user-mode
+# omnistat-rms-env hits PermissionError if it tries to write there. The
+# sbatch wrapper substitutes @JOB_DIR@ with the job-scoped perf-run dir
+# (NFS-mounted on every compute node) before passing the resolved config
+# to omnistat-usermode --configfile.
+job_detection_mode = file-based
+job_detection_file = @JOB_DIR@/omni_rmsjobinfo
+step_detection_file = @JOB_DIR@/omni_rmsjobinfo_step
+
+[omnistat.collectors.host]
+enable_proc_io_stats = True
+
+# Rocprofiler — HBM bandwidth + FP64 VALU/MFMA counters in one block.
+# Counter group selection follows the user-validated pattern (one profile,
+# one sub-section, all counters listed together; rocprofiler-sdk
+# time-multiplexes hardware counter sets internally as needed).
+#
+# All counters listed in the profile below are defined for gfx950 (MI355X) in
+# /opt/rocm/share/rocprofiler-sdk/counter_defs.yaml; verified before pinning.
+#
+# Useful derivations from these raw counters (compute in the verifier or
+# downstream; rocprofiler-sdk also exposes derived counters
+# MfmaFlopsF64 and TotalFlopsF64 if a downstream consumer prefers those):
+#   fp64_GFLOP/s = (SQ_INSTS_VALU_MFMA_MOPS_F64 * 512
+#                   + (SQ_INSTS_VALU_FMA_F64*2
+#                      + SQ_INSTS_VALU_ADD_F64
+#                      + SQ_INSTS_VALU_MUL_F64) * 64
+#                  ) / wall_seconds / 1e9
+#   HBM_GB/s = (FETCH_SIZE + WRITE_SIZE) / wall_seconds / 1e9
+#   (WRITE_SIZE is not collected by the profile below, so only the
+#    FETCH_SIZE read term is available from this config.)
+[omnistat.collectors.rocprofiler]
+profile = hbm_flops_f64
+
+[omnistat.collectors.rocprofiler.hbm_flops_f64]
+# MI355X (gfx950) rocprofiler counter limits — see ai4science-studio SKILL §10.
+#
+# Constraints surfaced on lux-mi355x-b1 / ROCm 7.2.2:
+#   * FETCH_SIZE + WRITE_SIZE in ONE profile fails: "error code 38: Request
+#     exceeds the capabilities of the hardware to collect" (both in TCC block).
+#   * FETCH_SIZE OR WRITE_SIZE + 4 x SQ_INSTS_VALU_*_F64 in one profile works.
+#   * sampling_mode = periodic in omnistat 1.12.0+git.65ea9ac (PR #271 + main)
+#     ONLY fires the first counter set — half the data silently lost. (See
+#     SKILL §11.)
+#
+# Workaround: use `constant` mode with a single counter set that fits the
+# hardware limits and avoids the omnistat periodic-mode bug. We capture HBM read traffic only
+# (FETCH_SIZE) plus the 4 fp64 instruction counters needed to derive the
+# fp64 FLOP rate using the formula:
+#     fp64_GFLOP/s = (MFMA_MOPS_F64 * 512 + (FMA_F64*2 + ADD_F64 + MUL_F64) * 64)
+#                    / wall_seconds / 1e9
+# Drop SQ_INSTS_VALU_TRANS_F64 (not relevant for GEMM-heavy ML training).
+#
+# WRITE_SIZE (HBM-write bandwidth) is NOT collected in this profile until
+# upstream omnistat fixes the periodic-mode bug. For most ML training
+# workloads, FETCH_SIZE is the dominant component of HBM traffic anyway.
+#
+# All counters on one line: configparser multi-line values require every
+# continuation line (including the closing `]`) to have leading whitespace.
+sampling_mode = constant
+counters = ["FETCH_SIZE", "SQ_INSTS_VALU_ADD_F64", "SQ_INSTS_VALU_MUL_F64", "SQ_INSTS_VALU_FMA_F64", "SQ_INSTS_VALU_MFMA_MOPS_F64"]
+
+[omnistat.query]
+prometheus_url = http://localhost:8428
+system_name = lux
+
+#--
+# User-mode settings — VictoriaMetrics + per-job datadir
+# (Like the [rms] section above, @JOB_DIR@ is substituted by the sbatch
+#  wrapper at runtime; ConfigParser's %(VAR)s interpolation does NOT pull
+#  from os.environ, so we can't rely on %(SLURM_JOB_ID)s here.)
+#--
+[omnistat.usermode]
+
+victoria_binary = /shared/aaji/tools/victoriametrics/victoria-metrics-prod
+victoria_datadir = @JOB_DIR@/omnistat-db
+victoria_logfile = vic_server.log
+
+# No external VM; we keep everything local to the job.
+external_victoria = False
+
+# Bind exporter to a single core to keep monitoring overhead off the data path.
+exporter_corebinding = 0
+
+ssh_key = ~/.ssh/id_rsa