Skip to content

feat(power): vendor-agnostic GPU power/telemetry aggregation core#1635

Open
arygupt wants to merge 1 commit into
mainfrom
feat/agg-power-core
Open

feat(power): vendor-agnostic GPU power/telemetry aggregation core#1635
arygupt wants to merge 1 commit into
mainfrom
feat/agg-power-core

Conversation

@arygupt
Copy link
Copy Markdown
Collaborator

@arygupt arygupt commented Jun 1, 2026

Summary

Lifts the vendor-agnostic measured-power aggregation core onto main as its own PR, decoupled from the in-flight AMD sweep on #1574. aggregate_power.py parses per-GPU samples from both nvidia-smi (gb300) and amd-smi (mi355x) through one path and emits:

  • avg_power_w + per-stage prefill/decode_avg_power_w
  • per-worker workers[] rollup (role, num_gpus, temp/util/mem)
  • joules_per_{input,output,total}_token (decode-only J/out on disagg)
  • temp/util/mem (AMD gfx_activity, used_vram, hotspot)

run() accepts Path | Iterable[Path], so a multinode worker spanning N CSVs aggregates as N×GPUs. Per-GPU mean denominators count only the GPUs that reported each metric, so heterogeneous CSV schemas don't dilute temp/util/mem averages.

Backward compatibility

run()'s signature is a backward-compatible superset of main's (Path | Iterable[Path], new keyword-only disagg=False). process_result.py keeps working unchanged — verified locally: test_process_result.py 25/25 green against this core.

CI

Adds .github/workflows/test-aggregate-power.yml. The 82 tests in test_aggregate_power.py previously gated no workflow — this closes that gap. The workflow triggers only on changes to aggregate_power.py / test_aggregate_power.py.

Test plan

  • pytest test_aggregate_power.py82 passed (covers gb300 + mi355x CSV parsing, per-stage rollup, J/token, per-GPU denominators)
  • pytest test_process_result.py25 passed (coupling unchanged)
  • CI: test-aggregate-power green on this PR

🤖 Generated with Claude Code


Note

Medium Risk
Large change to benchmark metric semantics (disagg joules attribution, multinode GPU counting) that downstream ETL/dashboards may rely on; mitigated by extensive tests and backward-compatible run()/aggregate_power() APIs.

Overview
Expands aggregate_power.py from power-only, single-CSV aggregation into a vendor-agnostic telemetry pipeline (NVIDIA, AMD amd-smi, srt-slurm) that patches agg JSON for InferenceX ETL.

Multinode & correctness: run() and aggregate_metrics() accept Path | Iterable[Path] plus --csv-glob; GPU IDs are namespaced by CSV stem so repeated local indices across nodes do not under-count GPUs. _load_bench_window adds fallbacks for multinode bench JSON (date + duration, then file mtime).

New metrics & schema: Regex detection for temp (hotspot/junction preferred on AMD), util (gfx_activity), and mem used; cluster fields avg_temp_c, peak_temp_c, avg_util_pct, avg_mem_used_mb; optional metrics average only over GPUs that reported them.

Disaggregated serving: Parses perf_samples_<role>_w<idx>_<host>.csv, rolls up workers[], and with --disagg applies per-stage joules_per_input_token / joules_per_output_token when both prefill and decode exist; frontend/agg excluded from stage energy.

Tests & CI: Large expansion of test_aggregate_power.py (~82 tests) and new .github/workflows/test-aggregate-power.yml on changes to those files. aggregate_power() remains a thin wrapper for backward compatibility; process_result.py callers unchanged.

Reviewed by Cursor Bugbot for commit a829fa8. Bugbot is set up for automated code reviews on this repo. Configure here.

Lift the full measured-power aggregation core onto main, independent of
the in-flight AMD sweep on #1574. aggregate_power.py parses per-GPU
samples from both nvidia-smi (gb300) and amd-smi (mi355x) through one
vendor-agnostic path and emits:

- avg_power_w + per-stage prefill/decode_avg_power_w
- per-worker workers[] rollup (role, num_gpus, temp/util/mem)
- joules_per_{input,output,total}_token (decode-only J/out on disagg)
- temp/util/mem (AMD gfx_activity, used_vram, hotspot)

run() accepts Path | Iterable[Path] so a multinode worker spanning N
CSVs aggregates as N x GPUs. Per-GPU mean denominators count only the
GPUs that reported each metric, so heterogeneous CSV schemas don't
dilute temp/util/mem averages.

run()'s signature is a backward-compatible superset of main's, so
process_result.py keeps working unchanged (verified: test_process_result
25/25 green against this core).

Adds .github/workflows/test-aggregate-power.yml -- the 82 tests in
test_aggregate_power.py previously gated no CI workflow.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@arygupt arygupt requested a review from a team June 1, 2026 19:46
Comment thread utils/aggregate_power.py


# Back-compat shim — some external callers may have imported _parse_power.
_parse_power = _parse_numeric_cell
Comment thread utils/aggregate_power.py
Comment on lines +513 to +536
hosts_by_worker.setdefault(key, set()).add(host)
if not by_worker:
return None

out: list[dict] = []
for (role, worker_idx), worker_paths in by_worker.items():
sources: list[tuple[Path, list[tuple[float, str | None, dict[str, float]]], bool]] = []
for path in worker_paths:
read = _read_samples(path, start_unix, end_unix)
if read is None:
continue
rows, saw_gpu_col = read
sources.append((path, rows, saw_gpu_col))
if not sources:
continue
# Namespace across paths within a worker too — a 16-GPU decode worker
# spans 4 nodes, each reporting local indices 0..3.
result = _aggregate_rows(sources, namespace=len(sources) > 1)
if result is None:
continue
entry: dict = {
"role": role,
"worker_idx": worker_idx,
"hosts": sorted(hosts_by_worker[(role, worker_idx)]),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The per-worker hosts list includes nodes whose CSVs failed to read. hosts_by_worker is populated from every parseable filename in the first pass, before _read_samples runs — so when a CSV is missing/empty/lacks a power column, its host is silently dropped from the metric computation but still appears in hosts[], leaving the emitted entry's num_gpus / avg_power_w reflecting fewer nodes than the hosts list claims. Fix by tracking hosts from the successful sources list (e.g. via _parse_perfmon_label(path) on each surviving source) rather than the pre-read parse.

Extended reasoning...

The bug. aggregate_power_by_worker runs two passes over the input paths. The first pass (the for p in paths loop) extracts the (role, worker_idx, host) label from each parseable filename and stuffs the host into hosts_by_worker[(role, worker_idx)]before any CSV has been opened. The second pass then opens each CSV via _read_samples, which returns None and gets skipped if the file is missing / empty / has no detectable power column / hits an OS or CSV error. When emitting the worker entry, hosts: sorted(hosts_by_worker[(role, worker_idx)]) is computed from the pre-read superset, while num_gpus and avg_power_w are computed only from the sources list of CSVs that successfully read. The result: the emitted hosts list overstates the nodes that actually contributed.\n\nStep-by-step proof. Topology: a 4-node decode worker, hosts {dn0, dn1, dn2, dn3}, with perf_samples_decode_w0_dn0..3.csv written by srt-slurm perfmon. Node dn3's perfmon failed to start mid-run, leaving its CSV absent.\n 1. First pass: _parse_perfmon_label parses all four filenames and adds dn0..dn3 to hosts_by_worker[(decode, 0)].\n 2. Second pass: _read_samples returns None for dn3 (file missing → path.is_file() False), so only dn0/dn1/dn2 enter sources.\n 3. _aggregate_rows correctly returns {num_gpus: 3*<gpus_per_node>, power: <avg over dn0..dn2 readings>}.\n 4. The entry is constructed with hosts: ['dn0', 'dn1', 'dn2', 'dn3'] (4 entries) but num_gpus: 24 and avg_power_w reflecting only 3 nodes. Internally inconsistent.\n\nWhy existing code doesn't prevent it. The function does go to lengths to handle CSV-read failures (skip on None, return None for the worker if no sources survived) — but the hosts set is populated before any read happens, so the skip-on-None path never touches it. There's no test for the partial-read case (test_aggregate_power_by_worker_one_worker_spans_multiple_nodes uses all-successful CSVs), so this gap isn't currently exercised.\n\nImpact. Severity is nit: the computed metrics (num_gpus, avg_power_w, avg_temp_c, etc.) are correct — only the hosts[] observability list is misleading. A downstream consumer counting num_gpus / len(hosts) to infer GPUs/node, or eyeballing the run to confirm all expected nodes contributed, would be misled. The numeric energy metrics that drive J/token computations are unaffected.\n\nFix. Build hosts_by_worker (or the per-worker entry's hosts field) from the surviving sources list instead of from the pre-read filename parse — e.g. inside the for path in worker_paths loop, after sources.append(...) succeeds, add host to a per-worker set, and use that set for the emitted hosts: sorted(...) value. Trivial refactor; only the hosts metadata changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

1 participant