feat(power): vendor-agnostic GPU power/telemetry aggregation core#1635
feat(power): vendor-agnostic GPU power/telemetry aggregation core#1635arygupt wants to merge 1 commit into
Conversation
Lift the full measured-power aggregation core onto main, independent of the in-flight AMD sweep on #1574. aggregate_power.py parses per-GPU samples from both nvidia-smi (gb300) and amd-smi (mi355x) through one vendor-agnostic path and emits: - avg_power_w + per-stage prefill/decode_avg_power_w - per-worker workers[] rollup (role, num_gpus, temp/util/mem) - joules_per_{input,output,total}_token (decode-only J/out on disagg) - temp/util/mem (AMD gfx_activity, used_vram, hotspot) run() accepts Path | Iterable[Path] so a multinode worker spanning N CSVs aggregates as N x GPUs. Per-GPU mean denominators count only the GPUs that reported each metric, so heterogeneous CSV schemas don't dilute temp/util/mem averages. run()'s signature is a backward-compatible superset of main's, so process_result.py keeps working unchanged (verified: test_process_result 25/25 green against this core). Adds .github/workflows/test-aggregate-power.yml -- the 82 tests in test_aggregate_power.py previously gated no CI workflow. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
|
||
|
|
||
| # Back-compat shim — some external callers may have imported _parse_power. | ||
| _parse_power = _parse_numeric_cell |
| hosts_by_worker.setdefault(key, set()).add(host) | ||
| if not by_worker: | ||
| return None | ||
|
|
||
| out: list[dict] = [] | ||
| for (role, worker_idx), worker_paths in by_worker.items(): | ||
| sources: list[tuple[Path, list[tuple[float, str | None, dict[str, float]]], bool]] = [] | ||
| for path in worker_paths: | ||
| read = _read_samples(path, start_unix, end_unix) | ||
| if read is None: | ||
| continue | ||
| rows, saw_gpu_col = read | ||
| sources.append((path, rows, saw_gpu_col)) | ||
| if not sources: | ||
| continue | ||
| # Namespace across paths within a worker too — a 16-GPU decode worker | ||
| # spans 4 nodes, each reporting local indices 0..3. | ||
| result = _aggregate_rows(sources, namespace=len(sources) > 1) | ||
| if result is None: | ||
| continue | ||
| entry: dict = { | ||
| "role": role, | ||
| "worker_idx": worker_idx, | ||
| "hosts": sorted(hosts_by_worker[(role, worker_idx)]), |
There was a problem hiding this comment.
🟡 The per-worker hosts list includes nodes whose CSVs failed to read. hosts_by_worker is populated from every parseable filename in the first pass, before _read_samples runs — so when a CSV is missing/empty/lacks a power column, its host is silently dropped from the metric computation but still appears in hosts[], leaving the emitted entry's num_gpus / avg_power_w reflecting fewer nodes than the hosts list claims. Fix by tracking hosts from the successful sources list (e.g. via _parse_perfmon_label(path) on each surviving source) rather than the pre-read parse.
Extended reasoning...
The bug. aggregate_power_by_worker runs two passes over the input paths. The first pass (the for p in paths loop) extracts the (role, worker_idx, host) label from each parseable filename and stuffs the host into hosts_by_worker[(role, worker_idx)] — before any CSV has been opened. The second pass then opens each CSV via _read_samples, which returns None and gets skipped if the file is missing / empty / has no detectable power column / hits an OS or CSV error. When emitting the worker entry, hosts: sorted(hosts_by_worker[(role, worker_idx)]) is computed from the pre-read superset, while num_gpus and avg_power_w are computed only from the sources list of CSVs that successfully read. The result: the emitted hosts list overstates the nodes that actually contributed.\n\nStep-by-step proof. Topology: a 4-node decode worker, hosts {dn0, dn1, dn2, dn3}, with perf_samples_decode_w0_dn0..3.csv written by srt-slurm perfmon. Node dn3's perfmon failed to start mid-run, leaving its CSV absent.\n 1. First pass: _parse_perfmon_label parses all four filenames and adds dn0..dn3 to hosts_by_worker[(decode, 0)].\n 2. Second pass: _read_samples returns None for dn3 (file missing → path.is_file() False), so only dn0/dn1/dn2 enter sources.\n 3. _aggregate_rows correctly returns {num_gpus: 3*<gpus_per_node>, power: <avg over dn0..dn2 readings>}.\n 4. The entry is constructed with hosts: ['dn0', 'dn1', 'dn2', 'dn3'] (4 entries) but num_gpus: 24 and avg_power_w reflecting only 3 nodes. Internally inconsistent.\n\nWhy existing code doesn't prevent it. The function does go to lengths to handle CSV-read failures (skip on None, return None for the worker if no sources survived) — but the hosts set is populated before any read happens, so the skip-on-None path never touches it. There's no test for the partial-read case (test_aggregate_power_by_worker_one_worker_spans_multiple_nodes uses all-successful CSVs), so this gap isn't currently exercised.\n\nImpact. Severity is nit: the computed metrics (num_gpus, avg_power_w, avg_temp_c, etc.) are correct — only the hosts[] observability list is misleading. A downstream consumer counting num_gpus / len(hosts) to infer GPUs/node, or eyeballing the run to confirm all expected nodes contributed, would be misled. The numeric energy metrics that drive J/token computations are unaffected.\n\nFix. Build hosts_by_worker (or the per-worker entry's hosts field) from the surviving sources list instead of from the pre-read filename parse — e.g. inside the for path in worker_paths loop, after sources.append(...) succeeds, add host to a per-worker set, and use that set for the emitted hosts: sorted(...) value. Trivial refactor; only the hosts metadata changes.
Summary
Lifts the vendor-agnostic measured-power aggregation core onto
mainas its own PR, decoupled from the in-flight AMD sweep on #1574.aggregate_power.pyparses per-GPU samples from bothnvidia-smi(gb300) andamd-smi(mi355x) through one path and emits:avg_power_w+ per-stageprefill/decode_avg_power_wworkers[]rollup (role, num_gpus, temp/util/mem)joules_per_{input,output,total}_token(decode-only J/out on disagg)gfx_activity,used_vram, hotspot)run()acceptsPath | Iterable[Path], so a multinode worker spanning N CSVs aggregates as N×GPUs. Per-GPU mean denominators count only the GPUs that reported each metric, so heterogeneous CSV schemas don't dilute temp/util/mem averages.Backward compatibility
run()'s signature is a backward-compatible superset of main's (Path | Iterable[Path], new keyword-onlydisagg=False).process_result.pykeeps working unchanged — verified locally:test_process_result.py25/25 green against this core.CI
Adds
.github/workflows/test-aggregate-power.yml. The 82 tests intest_aggregate_power.pypreviously gated no workflow — this closes that gap. The workflow triggers only on changes toaggregate_power.py/test_aggregate_power.py.Test plan
pytest test_aggregate_power.py→ 82 passed (covers gb300 + mi355x CSV parsing, per-stage rollup, J/token, per-GPU denominators)pytest test_process_result.py→ 25 passed (coupling unchanged)test-aggregate-powergreen on this PR🤖 Generated with Claude Code
Note
Medium Risk
Large change to benchmark metric semantics (disagg joules attribution, multinode GPU counting) that downstream ETL/dashboards may rely on; mitigated by extensive tests and backward-compatible
run()/aggregate_power()APIs.Overview
Expands
aggregate_power.pyfrom power-only, single-CSV aggregation into a vendor-agnostic telemetry pipeline (NVIDIA, AMD amd-smi, srt-slurm) that patches agg JSON for InferenceX ETL.Multinode & correctness:
run()andaggregate_metrics()acceptPath | Iterable[Path]plus--csv-glob; GPU IDs are namespaced by CSV stem so repeated local indices across nodes do not under-count GPUs._load_bench_windowadds fallbacks for multinode bench JSON (date+duration, then file mtime).New metrics & schema: Regex detection for temp (hotspot/junction preferred on AMD), util (
gfx_activity), and mem used; cluster fieldsavg_temp_c,peak_temp_c,avg_util_pct,avg_mem_used_mb; optional metrics average only over GPUs that reported them.Disaggregated serving: Parses
perf_samples_<role>_w<idx>_<host>.csv, rolls upworkers[], and with--disaggapplies per-stagejoules_per_input_token/joules_per_output_tokenwhen both prefill and decode exist; frontend/agg excluded from stage energy.Tests & CI: Large expansion of
test_aggregate_power.py(~82 tests) and new.github/workflows/test-aggregate-power.ymlon changes to those files.aggregate_power()remains a thin wrapper for backward compatibility;process_result.pycallers unchanged.Reviewed by Cursor Bugbot for commit a829fa8. Bugbot is set up for automated code reviews on this repo. Configure here.