Skip to content

fix(power): classify zero-decode-GPU multinode runs as aggregated#1646

Open
arygupt wants to merge 1 commit into
mainfrom
feat/multinode-agg-detect
Open

fix(power): classify zero-decode-GPU multinode runs as aggregated#1646
arygupt wants to merge 1 commit into
mainfrom
feat/multinode-agg-detect

Conversation

@arygupt
Copy link
Copy Markdown
Collaborator

@arygupt arygupt commented Jun 2, 2026

What

process_result.py assumed multinode ⇒ disaggregated. It hard-required the DECODE_* env vars and labeled every multinode run disagg=true.

Per Bryan's request: detect aggregated multinode by the GPU split — if a multinode run reports DECODE_GPUS == 0, it's aggregated (one colocated pool), not disaggregated.

Changes

  • Detector: is_agg = decode_gpus == 0. Topology — not the DISAGG env flag — is the source of truth for the displayed disagg field.
  • Env contract: for the agg case, the DECODE_* detail vars are read tolerantly (default 0/empty) instead of being hard-required — an agg run has no decode pool and may not export them. Disaggregated runs (decode_gpus > 0) keep the full strict contract, unchanged.
  • Power: untouched. The main aggregation call is already cluster-wide (doesn't pass disagg), which is what agg wants.

The per-GPU throughput denominators already degraded correctly when decode_gpus == 0 (output_tput_denominator = total_gpus); this PR makes the classification honest to match.

Tests

utils/test_process_result.py — added test_multinode_agg_detected_when_decode_gpus_zero (asserts disagg=False, decode fields zeroed, cluster-wide denominators, and that the run succeeds without the DECODE_* vars). Full suite green: 26 passed.

Scope / caveat

Unit-tested only. There is no agg-multinode recipe in the sweep matrix and no hardware to validate against, so this is a defensive correctness fix (stop reporting "disagg" for a zero-decode run), not an end-to-end-validated path like the gb300/mi355x disagg work. No perf-changelog.yaml change → does not trigger a sweep; only test-process-result.yml runs.

🤖 Generated with Claude Code


Note

Low Risk
Benchmark metadata labeling and env validation only; disagg multinode behavior is unchanged and coverage is unit-test scoped.

Overview
Multinode result processing no longer treats every multinode run as disaggregated. When DECODE_GPUS is 0, the run is classified as aggregated: output disagg is set to false (overriding a misleading DISAGG=true env), and DECODE_* detail variables are optional with safe defaults instead of hard-required.

Disaggregated multinode (decode_gpus > 0) still requires the full prefill/decode env contract. A new unit test covers agg detection, missing DECODE_* vars, and cluster-wide per-GPU throughput denominators.

Reviewed by Cursor Bugbot for commit 3c08c30. Bugbot is set up for automated code reviews on this repo. Configure here.

process_result.py assumed multinode implies disaggregated — it hard-required
the DECODE_* env vars and labeled every multinode run disagg=true. An
aggregated (agg) multinode run has no separate decode pool: it reports
DECODE_GPUS=0 and may not export the DECODE_* detail vars at all, which both
crashed the script and (when it survived) mislabeled the run as disaggregated
on the dashboard.

Detect topology from the GPU split instead: decode_gpus == 0 => aggregated.
For that case, read the DECODE_* detail vars tolerantly (default 0/empty) and
set disagg=false, overriding the DISAGG env flag. Disaggregated runs
(decode_gpus > 0) keep the full strict env contract, unchanged. The per-GPU
throughput denominators already degraded correctly for decode_gpus == 0; this
makes the classification honest too.

Unit-tested only — there is no agg-multinode recipe in the sweep matrix yet.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@arygupt arygupt requested a review from a team June 2, 2026 20:13
Comment thread utils/process_result.py
Comment on lines +74 to +83
detail_env = {
'PREFILL_NUM_WORKERS': os.environ.get('PREFILL_NUM_WORKERS', '0'),
'PREFILL_TP': os.environ.get('PREFILL_TP', '0'),
'PREFILL_EP': os.environ.get('PREFILL_EP', '0'),
'PREFILL_DP_ATTN': os.environ.get('PREFILL_DP_ATTN', ''),
'DECODE_NUM_WORKERS': os.environ.get('DECODE_NUM_WORKERS', '0'),
'DECODE_TP': os.environ.get('DECODE_TP', '0'),
'DECODE_EP': os.environ.get('DECODE_EP', '0'),
'DECODE_DP_ATTN': os.environ.get('DECODE_DP_ATTN', ''),
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 In the new is_agg branch, all eight detail env vars are read tolerantly with os.environ.get(..., '0'/''), including PREFILL_NUM_WORKERS/TP/EP/DP_ATTN. This contradicts both the PR description ("the DECODE_* detail vars are read tolerantly") and the inline comment one line above ("the DECODE_* detail vars may be absent"). An agg multinode run still has a colocated prefill pool (line 105 enforces prefill_gpus > 0), so the PREFILL_* topology vars should remain strictly required — otherwise a launcher that forgets to export PREFILL_TP will silently emit prefill_tp=0 into the agg JSON consumed by downstream ETL, masking a real misconfiguration. Fix: in the is_agg branch, call get_required_env_vars(['PREFILL_NUM_WORKERS','PREFILL_TP','PREFILL_EP','PREFILL_DP_ATTN']) and only default the DECODE_* vars tolerantly.

Extended reasoning...

What the bug is

The new is_agg branch in utils/process_result.py (lines 74-83) reads all eight detail env vars tolerantly:

detail_env = {
    'PREFILL_NUM_WORKERS': os.environ.get('PREFILL_NUM_WORKERS', '0'),
    'PREFILL_TP':          os.environ.get('PREFILL_TP', '0'),
    'PREFILL_EP':          os.environ.get('PREFILL_EP', '0'),
    'PREFILL_DP_ATTN':     os.environ.get('PREFILL_DP_ATTN', ''),
    'DECODE_NUM_WORKERS':  os.environ.get('DECODE_NUM_WORKERS', '0'),
    ...
}

This is broader than what the PR description and the inline comment justify. Both state the intent is only to tolerate the DECODE_* vars:

  • PR description: "for the agg case, the DECODE_ detail vars are read tolerantly (default 0/empty) instead of being hard-required — an agg run has no decode pool and may not export them."*
  • Inline comment (line 72-73): "Aggregated multinode has no decode pool, so the DECODE_ detail vars may be absent."*

Why the existing code does not prevent it

The agg branch only enforces topology (PREFILL_GPUS, DECODE_GPUS) via get_required_env_vars. Lines 105-106 then enforce prefill_gpus > 0 — so by the time we use prefill_tp/prefill_ep/etc., we know there is a prefill pool whose parallelism is topologically meaningful. Yet the PREFILL_* detail vars describing that pool are silently defaulted to 0/empty if missing.

The new test test_multinode_agg_detected_when_decode_gpus_zero only pop()s the DECODE_* keys (lines 432-433) and keeps all PREFILL_* keys set, so it neither verifies the strict-PREFILL contract nor exercises the over-broad tolerance.

Impact

For an agg run that forgets to export PREFILL_TP (or any other PREFILL_* detail var), the script produces an agg JSON with prefill_tp=0 (or empty prefill_dp_attention) instead of failing fast like the disagg branch would. Since the agg JSON is consumed by downstream ETL, a 0 silently passes through as valid data — a data-quality regression introduced by this PR.

Step-by-step proof

Launcher misconfiguration scenario for an aggregated multinode run:

  1. IS_MULTINODE=true, PREFILL_GPUS=16, DECODE_GPUS=0 → exported correctly.
  2. PREFILL_TP is forgotten by the launcher (not exported).
  3. PREFILL_EP, PREFILL_NUM_WORKERS, PREFILL_DP_ATTN are exported correctly.
  4. Script reaches the multinode block. is_agg = (decode_gpus == 0) → True.
  5. The is_agg dict at lines 74-83 reads 'PREFILL_TP': os.environ.get('PREFILL_TP', '0')'0'.
  6. prefill_tp = int(detail_env['PREFILL_TP'])0.
  7. multi_node_data['prefill_tp'] = 0 written to agg_*.json.
  8. ETL ingests prefill_tp=0 for a 16-GPU prefill pool — silently wrong; no error.

Compare to the disagg branch (lines 87-90): the same forgotten PREFILL_TP would have raised EnvironmentError: Missing required environment variables: PREFILL_TP, failing the run cleanly.

Fix

In the is_agg branch, keep PREFILL_* strict and only relax DECODE_*:

if is_agg:
    prefill_env = get_required_env_vars([
        'PREFILL_NUM_WORKERS', 'PREFILL_TP', 'PREFILL_EP', 'PREFILL_DP_ATTN',
    ])
    detail_env = {
        **prefill_env,
        'DECODE_NUM_WORKERS': os.environ.get('DECODE_NUM_WORKERS', '0'),
        'DECODE_TP':          os.environ.get('DECODE_TP', '0'),
        'DECODE_EP':          os.environ.get('DECODE_EP', '0'),
        'DECODE_DP_ATTN':     os.environ.get('DECODE_DP_ATTN', ''),
    }

This matches the PR's stated contract, matches the inline comment, and matches what the existing agg test already exercises (it sets all PREFILL_* and only omits DECODE_*).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

1 participant