fix(power): classify zero-decode-GPU multinode runs as aggregated by arygupt · Pull Request #1646 · SemiAnalysisAI/InferenceX

arygupt · 2026-06-02T20:13:01Z

What

process_result.py assumed multinode ⇒ disaggregated. It hard-required the DECODE_* env vars and labeled every multinode run disagg=true.

Per Bryan's request: detect aggregated multinode by the GPU split — if a multinode run reports DECODE_GPUS == 0, it's aggregated (one colocated pool), not disaggregated.

Changes

Detector: is_agg = decode_gpus == 0. Topology — not the DISAGG env flag — is the source of truth for the displayed disagg field.
Env contract: for the agg case, the DECODE_* detail vars are read tolerantly (default 0/empty) instead of being hard-required — an agg run has no decode pool and may not export them. Disaggregated runs (decode_gpus > 0) keep the full strict contract, unchanged.
Power: untouched. The main aggregation call is already cluster-wide (doesn't pass disagg), which is what agg wants.

The per-GPU throughput denominators already degraded correctly when decode_gpus == 0 (output_tput_denominator = total_gpus); this PR makes the classification honest to match.

Tests

utils/test_process_result.py — added test_multinode_agg_detected_when_decode_gpus_zero (asserts disagg=False, decode fields zeroed, cluster-wide denominators, and that the run succeeds without the DECODE_* vars). Full suite green: 26 passed.

Scope / caveat

Unit-tested only. There is no agg-multinode recipe in the sweep matrix and no hardware to validate against, so this is a defensive correctness fix (stop reporting "disagg" for a zero-decode run), not an end-to-end-validated path like the gb300/mi355x disagg work. No perf-changelog.yaml change → does not trigger a sweep; only test-process-result.yml runs.

🤖 Generated with Claude Code

Note

Low Risk
Benchmark metadata labeling and env validation only; disagg multinode behavior is unchanged and coverage is unit-test scoped.

Overview
Multinode result processing no longer treats every multinode run as disaggregated. When DECODE_GPUS is 0, the run is classified as aggregated: output disagg is set to false (overriding a misleading DISAGG=true env), and DECODE_* detail variables are optional with safe defaults instead of hard-required.

Disaggregated multinode (decode_gpus > 0) still requires the full prefill/decode env contract. A new unit test covers agg detection, missing DECODE_* vars, and cluster-wide per-GPU throughput denominators.

^{Reviewed by Cursor Bugbot for commit 3c08c30. Bugbot is set up for automated code reviews on this repo. Configure here.}

process_result.py assumed multinode implies disaggregated — it hard-required the DECODE_* env vars and labeled every multinode run disagg=true. An aggregated (agg) multinode run has no separate decode pool: it reports DECODE_GPUS=0 and may not export the DECODE_* detail vars at all, which both crashed the script and (when it survived) mislabeled the run as disaggregated on the dashboard. Detect topology from the GPU split instead: decode_gpus == 0 => aggregated. For that case, read the DECODE_* detail vars tolerantly (default 0/empty) and set disagg=false, overriding the DISAGG env flag. Disaggregated runs (decode_gpus > 0) keep the full strict env contract, unchanged. The per-GPU throughput denominators already degraded correctly for decode_gpus == 0; this makes the classification honest too. Unit-tested only — there is no agg-multinode recipe in the sweep matrix yet. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

claude · 2026-06-02T20:20:59Z

+        detail_env = {
+            'PREFILL_NUM_WORKERS': os.environ.get('PREFILL_NUM_WORKERS', '0'),
+            'PREFILL_TP': os.environ.get('PREFILL_TP', '0'),
+            'PREFILL_EP': os.environ.get('PREFILL_EP', '0'),
+            'PREFILL_DP_ATTN': os.environ.get('PREFILL_DP_ATTN', ''),
+            'DECODE_NUM_WORKERS': os.environ.get('DECODE_NUM_WORKERS', '0'),
+            'DECODE_TP': os.environ.get('DECODE_TP', '0'),
+            'DECODE_EP': os.environ.get('DECODE_EP', '0'),
+            'DECODE_DP_ATTN': os.environ.get('DECODE_DP_ATTN', ''),
+        }


🔴 In the new is_agg branch, all eight detail env vars are read tolerantly with os.environ.get(..., '0'/''), including PREFILL_NUM_WORKERS/TP/EP/DP_ATTN. This contradicts both the PR description ("the DECODE_* detail vars are read tolerantly") and the inline comment one line above ("the DECODE_* detail vars may be absent"). An agg multinode run still has a colocated prefill pool (line 105 enforces prefill_gpus > 0), so the PREFILL_* topology vars should remain strictly required — otherwise a launcher that forgets to export PREFILL_TP will silently emit prefill_tp=0 into the agg JSON consumed by downstream ETL, masking a real misconfiguration. Fix: in the is_agg branch, call get_required_env_vars(['PREFILL_NUM_WORKERS','PREFILL_TP','PREFILL_EP','PREFILL_DP_ATTN']) and only default the DECODE_* vars tolerantly.

Extended reasoning...

What the bug is

The new is_agg branch in utils/process_result.py (lines 74-83) reads all eight detail env vars tolerantly:

detail_env = { 'PREFILL_NUM_WORKERS': os.environ.get('PREFILL_NUM_WORKERS', '0'), 'PREFILL_TP': os.environ.get('PREFILL_TP', '0'), 'PREFILL_EP': os.environ.get('PREFILL_EP', '0'), 'PREFILL_DP_ATTN': os.environ.get('PREFILL_DP_ATTN', ''), 'DECODE_NUM_WORKERS': os.environ.get('DECODE_NUM_WORKERS', '0'), ... }

This is broader than what the PR description and the inline comment justify. Both state the intent is only to tolerate the DECODE_* vars:

PR description: "for the agg case, the DECODE_ detail vars are read tolerantly (default 0/empty) instead of being hard-required — an agg run has no decode pool and may not export them."*

Inline comment (line 72-73): "Aggregated multinode has no decode pool, so the DECODE_ detail vars may be absent."*

Why the existing code does not prevent it

The agg branch only enforces topology (PREFILL_GPUS, DECODE_GPUS) via get_required_env_vars. Lines 105-106 then enforce prefill_gpus > 0 — so by the time we use prefill_tp/prefill_ep/etc., we know there is a prefill pool whose parallelism is topologically meaningful. Yet the PREFILL_* detail vars describing that pool are silently defaulted to 0/empty if missing.

The new test test_multinode_agg_detected_when_decode_gpus_zero only pop()s the DECODE_* keys (lines 432-433) and keeps all PREFILL_* keys set, so it neither verifies the strict-PREFILL contract nor exercises the over-broad tolerance.

Impact

For an agg run that forgets to export PREFILL_TP (or any other PREFILL_* detail var), the script produces an agg JSON with prefill_tp=0 (or empty prefill_dp_attention) instead of failing fast like the disagg branch would. Since the agg JSON is consumed by downstream ETL, a 0 silently passes through as valid data — a data-quality regression introduced by this PR.

Step-by-step proof

Launcher misconfiguration scenario for an aggregated multinode run:

IS_MULTINODE=true, PREFILL_GPUS=16, DECODE_GPUS=0 → exported correctly.

PREFILL_TP is forgotten by the launcher (not exported).

PREFILL_EP, PREFILL_NUM_WORKERS, PREFILL_DP_ATTN are exported correctly.

Script reaches the multinode block. is_agg = (decode_gpus == 0) → True.

The is_agg dict at lines 74-83 reads 'PREFILL_TP': os.environ.get('PREFILL_TP', '0') → '0'.

prefill_tp = int(detail_env['PREFILL_TP']) → 0.

multi_node_data['prefill_tp'] = 0 written to agg_*.json.

ETL ingests prefill_tp=0 for a 16-GPU prefill pool — silently wrong; no error.

Compare to the disagg branch (lines 87-90): the same forgotten PREFILL_TP would have raised EnvironmentError: Missing required environment variables: PREFILL_TP, failing the run cleanly.

Fix

In the is_agg branch, keep PREFILL_* strict and only relax DECODE_*:

if is_agg: prefill_env = get_required_env_vars([ 'PREFILL_NUM_WORKERS', 'PREFILL_TP', 'PREFILL_EP', 'PREFILL_DP_ATTN', ]) detail_env = { **prefill_env, 'DECODE_NUM_WORKERS': os.environ.get('DECODE_NUM_WORKERS', '0'), 'DECODE_TP': os.environ.get('DECODE_TP', '0'), 'DECODE_EP': os.environ.get('DECODE_EP', '0'), 'DECODE_DP_ATTN': os.environ.get('DECODE_DP_ATTN', ''), }

This matches the PR's stated contract, matches the inline comment, and matches what the existing agg test already exercises (it sets all PREFILL_* and only omits DECODE_*).

arygupt requested a review from a team June 2, 2026 20:13

github-project-automation Bot added this to InferenceMAX Board Jun 2, 2026

claude Bot reviewed Jun 2, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(power): classify zero-decode-GPU multinode runs as aggregated#1646

fix(power): classify zero-decode-GPU multinode runs as aggregated#1646
arygupt wants to merge 1 commit into
mainfrom
feat/multinode-agg-detect

arygupt commented Jun 2, 2026 •

edited by cursor Bot

Loading

Uh oh!

claude Bot Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

arygupt commented Jun 2, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Changes

Tests

Scope / caveat

Uh oh!

claude Bot Jun 2, 2026

Choose a reason for hiding this comment

What the bug is

Why the existing code does not prevent it

Impact

Step-by-step proof

Fix

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

arygupt commented Jun 2, 2026 •

edited by cursor Bot

Loading