fix(power): classify zero-decode-GPU multinode runs as aggregated#1646
fix(power): classify zero-decode-GPU multinode runs as aggregated#1646arygupt wants to merge 1 commit into
Conversation
process_result.py assumed multinode implies disaggregated — it hard-required the DECODE_* env vars and labeled every multinode run disagg=true. An aggregated (agg) multinode run has no separate decode pool: it reports DECODE_GPUS=0 and may not export the DECODE_* detail vars at all, which both crashed the script and (when it survived) mislabeled the run as disaggregated on the dashboard. Detect topology from the GPU split instead: decode_gpus == 0 => aggregated. For that case, read the DECODE_* detail vars tolerantly (default 0/empty) and set disagg=false, overriding the DISAGG env flag. Disaggregated runs (decode_gpus > 0) keep the full strict env contract, unchanged. The per-GPU throughput denominators already degraded correctly for decode_gpus == 0; this makes the classification honest too. Unit-tested only — there is no agg-multinode recipe in the sweep matrix yet. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
| detail_env = { | ||
| 'PREFILL_NUM_WORKERS': os.environ.get('PREFILL_NUM_WORKERS', '0'), | ||
| 'PREFILL_TP': os.environ.get('PREFILL_TP', '0'), | ||
| 'PREFILL_EP': os.environ.get('PREFILL_EP', '0'), | ||
| 'PREFILL_DP_ATTN': os.environ.get('PREFILL_DP_ATTN', ''), | ||
| 'DECODE_NUM_WORKERS': os.environ.get('DECODE_NUM_WORKERS', '0'), | ||
| 'DECODE_TP': os.environ.get('DECODE_TP', '0'), | ||
| 'DECODE_EP': os.environ.get('DECODE_EP', '0'), | ||
| 'DECODE_DP_ATTN': os.environ.get('DECODE_DP_ATTN', ''), | ||
| } |
There was a problem hiding this comment.
🔴 In the new is_agg branch, all eight detail env vars are read tolerantly with os.environ.get(..., '0'/''), including PREFILL_NUM_WORKERS/TP/EP/DP_ATTN. This contradicts both the PR description ("the DECODE_* detail vars are read tolerantly") and the inline comment one line above ("the DECODE_* detail vars may be absent"). An agg multinode run still has a colocated prefill pool (line 105 enforces prefill_gpus > 0), so the PREFILL_* topology vars should remain strictly required — otherwise a launcher that forgets to export PREFILL_TP will silently emit prefill_tp=0 into the agg JSON consumed by downstream ETL, masking a real misconfiguration. Fix: in the is_agg branch, call get_required_env_vars(['PREFILL_NUM_WORKERS','PREFILL_TP','PREFILL_EP','PREFILL_DP_ATTN']) and only default the DECODE_* vars tolerantly.
Extended reasoning...
What the bug is
The new is_agg branch in utils/process_result.py (lines 74-83) reads all eight detail env vars tolerantly:
detail_env = {
'PREFILL_NUM_WORKERS': os.environ.get('PREFILL_NUM_WORKERS', '0'),
'PREFILL_TP': os.environ.get('PREFILL_TP', '0'),
'PREFILL_EP': os.environ.get('PREFILL_EP', '0'),
'PREFILL_DP_ATTN': os.environ.get('PREFILL_DP_ATTN', ''),
'DECODE_NUM_WORKERS': os.environ.get('DECODE_NUM_WORKERS', '0'),
...
}This is broader than what the PR description and the inline comment justify. Both state the intent is only to tolerate the DECODE_* vars:
- PR description: "for the agg case, the DECODE_ detail vars are read tolerantly (default 0/empty) instead of being hard-required — an agg run has no decode pool and may not export them."*
- Inline comment (line 72-73): "Aggregated multinode has no decode pool, so the DECODE_ detail vars may be absent."*
Why the existing code does not prevent it
The agg branch only enforces topology (PREFILL_GPUS, DECODE_GPUS) via get_required_env_vars. Lines 105-106 then enforce prefill_gpus > 0 — so by the time we use prefill_tp/prefill_ep/etc., we know there is a prefill pool whose parallelism is topologically meaningful. Yet the PREFILL_* detail vars describing that pool are silently defaulted to 0/empty if missing.
The new test test_multinode_agg_detected_when_decode_gpus_zero only pop()s the DECODE_* keys (lines 432-433) and keeps all PREFILL_* keys set, so it neither verifies the strict-PREFILL contract nor exercises the over-broad tolerance.
Impact
For an agg run that forgets to export PREFILL_TP (or any other PREFILL_* detail var), the script produces an agg JSON with prefill_tp=0 (or empty prefill_dp_attention) instead of failing fast like the disagg branch would. Since the agg JSON is consumed by downstream ETL, a 0 silently passes through as valid data — a data-quality regression introduced by this PR.
Step-by-step proof
Launcher misconfiguration scenario for an aggregated multinode run:
IS_MULTINODE=true,PREFILL_GPUS=16,DECODE_GPUS=0→ exported correctly.PREFILL_TPis forgotten by the launcher (not exported).PREFILL_EP,PREFILL_NUM_WORKERS,PREFILL_DP_ATTNare exported correctly.- Script reaches the multinode block.
is_agg = (decode_gpus == 0)→ True. - The
is_aggdict at lines 74-83 reads'PREFILL_TP': os.environ.get('PREFILL_TP', '0')→'0'. prefill_tp = int(detail_env['PREFILL_TP'])→0.multi_node_data['prefill_tp'] = 0written toagg_*.json.- ETL ingests
prefill_tp=0for a 16-GPU prefill pool — silently wrong; no error.
Compare to the disagg branch (lines 87-90): the same forgotten PREFILL_TP would have raised EnvironmentError: Missing required environment variables: PREFILL_TP, failing the run cleanly.
Fix
In the is_agg branch, keep PREFILL_* strict and only relax DECODE_*:
if is_agg:
prefill_env = get_required_env_vars([
'PREFILL_NUM_WORKERS', 'PREFILL_TP', 'PREFILL_EP', 'PREFILL_DP_ATTN',
])
detail_env = {
**prefill_env,
'DECODE_NUM_WORKERS': os.environ.get('DECODE_NUM_WORKERS', '0'),
'DECODE_TP': os.environ.get('DECODE_TP', '0'),
'DECODE_EP': os.environ.get('DECODE_EP', '0'),
'DECODE_DP_ATTN': os.environ.get('DECODE_DP_ATTN', ''),
}This matches the PR's stated contract, matches the inline comment, and matches what the existing agg test already exercises (it sets all PREFILL_* and only omits DECODE_*).
What
process_result.pyassumed multinode ⇒ disaggregated. It hard-required theDECODE_*env vars and labeled every multinode rundisagg=true.Per Bryan's request: detect aggregated multinode by the GPU split — if a multinode run reports
DECODE_GPUS == 0, it's aggregated (one colocated pool), not disaggregated.Changes
is_agg = decode_gpus == 0. Topology — not theDISAGGenv flag — is the source of truth for the displayeddisaggfield.DECODE_*detail vars are read tolerantly (default0/empty) instead of being hard-required — an agg run has no decode pool and may not export them. Disaggregated runs (decode_gpus > 0) keep the full strict contract, unchanged.mainaggregation call is already cluster-wide (doesn't passdisagg), which is what agg wants.The per-GPU throughput denominators already degraded correctly when
decode_gpus == 0(output_tput_denominator = total_gpus); this PR makes the classification honest to match.Tests
utils/test_process_result.py— addedtest_multinode_agg_detected_when_decode_gpus_zero(assertsdisagg=False, decode fields zeroed, cluster-wide denominators, and that the run succeeds without theDECODE_*vars). Full suite green: 26 passed.Scope / caveat
Unit-tested only. There is no agg-multinode recipe in the sweep matrix and no hardware to validate against, so this is a defensive correctness fix (stop reporting "disagg" for a zero-decode run), not an end-to-end-validated path like the gb300/mi355x disagg work. No
perf-changelog.yamlchange → does not trigger a sweep; onlytest-process-result.ymlruns.🤖 Generated with Claude Code
Note
Low Risk
Benchmark metadata labeling and env validation only; disagg multinode behavior is unchanged and coverage is unit-test scoped.
Overview
Multinode result processing no longer treats every multinode run as disaggregated. When
DECODE_GPUSis 0, the run is classified as aggregated: outputdisaggis set to false (overriding a misleadingDISAGG=trueenv), andDECODE_*detail variables are optional with safe defaults instead of hard-required.Disaggregated multinode (
decode_gpus > 0) still requires the full prefill/decode env contract. A new unit test covers agg detection, missingDECODE_*vars, and cluster-wide per-GPU throughput denominators.Reviewed by Cursor Bugbot for commit 3c08c30. Bugbot is set up for automated code reviews on this repo. Configure here.