feat(power): vendor-agnostic GPU power/telemetry aggregation core by arygupt · Pull Request #1635 · SemiAnalysisAI/InferenceX

arygupt · 2026-06-01T19:46:24Z

Summary

Lifts the vendor-agnostic measured-power aggregation core onto main as its own PR, decoupled from the in-flight AMD sweep on #1574. aggregate_power.py parses per-GPU samples from both nvidia-smi (gb300) and amd-smi (mi355x) through one path and emits:

avg_power_w + per-stage prefill/decode_avg_power_w
per-worker workers[] rollup (role, num_gpus, temp/util/mem)
joules_per_{input,output,total}_token (decode-only J/out on disagg)
temp/util/mem (AMD gfx_activity, used_vram, hotspot)

run() accepts Path | Iterable[Path], so a multinode worker spanning N CSVs aggregates as N×GPUs. Per-GPU mean denominators count only the GPUs that reported each metric, so heterogeneous CSV schemas don't dilute temp/util/mem averages.

Backward compatibility

run()'s signature is a backward-compatible superset of main's (Path | Iterable[Path], new keyword-only disagg=False). process_result.py keeps working unchanged — verified locally: test_process_result.py 25/25 green against this core.

CI

Adds .github/workflows/test-aggregate-power.yml. The 82 tests in test_aggregate_power.py previously gated no workflow — this closes that gap. The workflow triggers only on changes to aggregate_power.py / test_aggregate_power.py.

Test plan

pytest test_aggregate_power.py → 82 passed (covers gb300 + mi355x CSV parsing, per-stage rollup, J/token, per-GPU denominators)
pytest test_process_result.py → 25 passed (coupling unchanged)
CI: test-aggregate-power green on this PR

🤖 Generated with Claude Code

Note

Medium Risk
Large change to benchmark metric semantics (disagg joules attribution, multinode GPU counting) that downstream ETL/dashboards may rely on; mitigated by extensive tests and backward-compatible run()/aggregate_power() APIs.

Overview
Expands aggregate_power.py from power-only, single-CSV aggregation into a vendor-agnostic telemetry pipeline (NVIDIA, AMD amd-smi, srt-slurm) that patches agg JSON for InferenceX ETL.

Multinode & correctness: run() and aggregate_metrics() accept Path | Iterable[Path] plus --csv-glob; GPU IDs are namespaced by CSV stem so repeated local indices across nodes do not under-count GPUs. _load_bench_window adds fallbacks for multinode bench JSON (date + duration, then file mtime).

New metrics & schema: Regex detection for temp (hotspot/junction preferred on AMD), util (gfx_activity), and mem used; cluster fields avg_temp_c, peak_temp_c, avg_util_pct, avg_mem_used_mb; optional metrics average only over GPUs that reported them.

Disaggregated serving: Parses perf_samples_<role>_w<idx>_<host>.csv, rolls up workers[], and with --disagg applies per-stage joules_per_input_token / joules_per_output_token when both prefill and decode exist; frontend/agg excluded from stage energy.

Tests & CI: Large expansion of test_aggregate_power.py (~82 tests) and new .github/workflows/test-aggregate-power.yml on changes to those files. aggregate_power() remains a thin wrapper for backward compatibility; process_result.py callers unchanged.

^{Reviewed by Cursor Bugbot for commit a829fa8. Bugbot is set up for automated code reviews on this repo. Configure here.}

Lift the full measured-power aggregation core onto main, independent of the in-flight AMD sweep on #1574. aggregate_power.py parses per-GPU samples from both nvidia-smi (gb300) and amd-smi (mi355x) through one vendor-agnostic path and emits: - avg_power_w + per-stage prefill/decode_avg_power_w - per-worker workers[] rollup (role, num_gpus, temp/util/mem) - joules_per_{input,output,total}_token (decode-only J/out on disagg) - temp/util/mem (AMD gfx_activity, used_vram, hotspot) run() accepts Path | Iterable[Path] so a multinode worker spanning N CSVs aggregates as N x GPUs. Per-GPU mean denominators count only the GPUs that reported each metric, so heterogeneous CSV schemas don't dilute temp/util/mem averages. run()'s signature is a backward-compatible superset of main's, so process_result.py keeps working unchanged (verified: test_process_result 25/25 green against this core). Adds .github/workflows/test-aggregate-power.yml -- the 82 tests in test_aggregate_power.py previously gated no CI workflow. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>



+# Back-compat shim — some external callers may have imported _parse_power.
+_parse_power = _parse_numeric_cell


claude · 2026-06-01T20:00:48Z

+        hosts_by_worker.setdefault(key, set()).add(host)
+    if not by_worker:
+        return None
+
+    out: list[dict] = []
+    for (role, worker_idx), worker_paths in by_worker.items():
+        sources: list[tuple[Path, list[tuple[float, str | None, dict[str, float]]], bool]] = []
+        for path in worker_paths:
+            read = _read_samples(path, start_unix, end_unix)
+            if read is None:
+                continue
+            rows, saw_gpu_col = read
+            sources.append((path, rows, saw_gpu_col))
+        if not sources:
+            continue
+        # Namespace across paths within a worker too — a 16-GPU decode worker
+        # spans 4 nodes, each reporting local indices 0..3.
+        result = _aggregate_rows(sources, namespace=len(sources) > 1)
+        if result is None:
+            continue
+        entry: dict = {
+            "role": role,
+            "worker_idx": worker_idx,
+            "hosts": sorted(hosts_by_worker[(role, worker_idx)]),


🟡 The per-worker hosts list includes nodes whose CSVs failed to read. hosts_by_worker is populated from every parseable filename in the first pass, before _read_samples runs — so when a CSV is missing/empty/lacks a power column, its host is silently dropped from the metric computation but still appears in hosts[], leaving the emitted entry's num_gpus / avg_power_w reflecting fewer nodes than the hosts list claims. Fix by tracking hosts from the successful sources list (e.g. via _parse_perfmon_label(path) on each surviving source) rather than the pre-read parse.

Extended reasoning...

The bug. aggregate_power_by_worker runs two passes over the input paths. The first pass (the for p in paths loop) extracts the (role, worker_idx, host) label from each parseable filename and stuffs the host into hosts_by_worker[(role, worker_idx)] — before any CSV has been opened. The second pass then opens each CSV via _read_samples, which returns None and gets skipped if the file is missing / empty / has no detectable power column / hits an OS or CSV error. When emitting the worker entry, hosts: sorted(hosts_by_worker[(role, worker_idx)]) is computed from the pre-read superset, while num_gpus and avg_power_w are computed only from the sources list of CSVs that successfully read. The result: the emitted hosts list overstates the nodes that actually contributed.\n\nStep-by-step proof. Topology: a 4-node decode worker, hosts {dn0, dn1, dn2, dn3}, with perf_samples_decode_w0_dn0..3.csv written by srt-slurm perfmon. Node dn3's perfmon failed to start mid-run, leaving its CSV absent.\n 1. First pass: _parse_perfmon_label parses all four filenames and adds dn0..dn3 to hosts_by_worker[(decode, 0)].\n 2. Second pass: _read_samples returns None for dn3 (file missing → path.is_file() False), so only dn0/dn1/dn2 enter sources.\n 3. _aggregate_rows correctly returns {num_gpus: 3*<gpus_per_node>, power: <avg over dn0..dn2 readings>}.\n 4. The entry is constructed with hosts: ['dn0', 'dn1', 'dn2', 'dn3'] (4 entries) but num_gpus: 24 and avg_power_w reflecting only 3 nodes. Internally inconsistent.\n\nWhy existing code doesn't prevent it. The function does go to lengths to handle CSV-read failures (skip on None, return None for the worker if no sources survived) — but the hosts set is populated before any read happens, so the skip-on-None path never touches it. There's no test for the partial-read case (test_aggregate_power_by_worker_one_worker_spans_multiple_nodes uses all-successful CSVs), so this gap isn't currently exercised.\n\nImpact. Severity is nit: the computed metrics (num_gpus, avg_power_w, avg_temp_c, etc.) are correct — only the hosts[] observability list is misleading. A downstream consumer counting num_gpus / len(hosts) to infer GPUs/node, or eyeballing the run to confirm all expected nodes contributed, would be misled. The numeric energy metrics that drive J/token computations are unaffected.\n\nFix. Build hosts_by_worker (or the per-worker entry's hosts field) from the surviving sources list instead of from the pre-read filename parse — e.g. inside the for path in worker_paths loop, after sources.append(...) succeeds, add host to a per-worker set, and use that set for the emitted hosts: sorted(...) value. Trivial refactor; only the hosts metadata changes.

arygupt requested a review from a team June 1, 2026 19:46

github-project-automation Bot added this to InferenceMAX Board Jun 1, 2026

github-code-quality Bot found potential problems Jun 1, 2026

View reviewed changes

Comment thread utils/aggregate_power.py

# Back-compat shim — some external callers may have imported _parse_power.

_parse_power = _parse_numeric_cell

claude Bot reviewed Jun 1, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(power): vendor-agnostic GPU power/telemetry aggregation core#1635

feat(power): vendor-agnostic GPU power/telemetry aggregation core#1635
arygupt wants to merge 1 commit into
mainfrom
feat/agg-power-core

arygupt commented Jun 1, 2026 •

edited by cursor Bot

Loading

Uh oh!

claude Bot Jun 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant



		# Back-compat shim — some external callers may have imported _parse_power.
		_parse_power = _parse_numeric_cell

Conversation

arygupt commented Jun 1, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Backward compatibility

CI

Test plan

Uh oh!

claude Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

arygupt commented Jun 1, 2026 •

edited by cursor Bot

Loading