[B300][vLLM] Add MiniMax-M2.5 FP4 disagg Dynamo configs #1652
Split of #1560 (B300 half).
- Add minimaxm2.5-fp4-b300-dynamo-vllm to nvidia-master.yaml (1k1k + 8k1k search spaces; image vllm/vllm-openai:v0.20.1, model nvidia/MiniMax-M2.5-NVFP4).
- Add srt-slurm recipes under benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m2.5-b300/.
- Wire minimax + dynamo-vllm routing into runners/launch_b300-nv.sh.
- Append perf-changelog entry.
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you.
PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow
As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
# B300-only: full-node TP=8 decode (the 8 GPUs of a single B300 node).
# Cousin of tp4-1p1d.yaml but exercises the wider TP that B300's per-node
# GPU count makes available. Only the smallest concurrencies (1,4,8) —
# this topology is decode-latency focused, not throughput.
🟡 The header comments in both new tp8-1p1d.yaml files claim the recipe exercises 'the smallest concurrencies (1,4,8)', but the benchmark.concurrencies field is just "4" in both files (and the corresponding nvidia-master.yaml entries use conc-list: [4]). Either update the comments to say (4), or extend the conc-lists to include 1 and 8 if those were intended.
Both new B300 tp8-1p1d recipes contain a self-contradictory header comment vs. their actual benchmark configuration:
- `benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m2.5-b300/1k1k/tp8-1p1d.yaml`, line 5: `# available. Only the smallest concurrencies (1,4,8) —` versus line 77: `concurrencies: "4"`
- `benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m2.5-b300/8k1k/tp8-1p1d.yaml`, line 5: `# available. Smallest concurrencies only (1,4,8).` versus line 72: `concurrencies: "4"`
The master config corroborates the file-level value: in .github/configs/nvidia-master.yaml, both tp8-1p1d entries (the 1k1k and 8k1k slots under minimaxm2.5-fp4-b300-dynamo-vllm) use conc-list: [4] — not [1, 4, 8]. So the comment is factually wrong about what the recipe exercises.
Impact: Documentation-only. The recipes still run correctly (only concurrency 4 is benchmarked, matching the nvidia-master.yaml wiring). The risk is misleading future readers who look at the recipe in isolation, infer that 1 and 8 are also covered, and either skip adding them to a follow-up sweep or get confused when results only show one point.
Why existing review didn't catch it: The concurrency values are spread across three places (the recipe's header comment, the recipe's concurrencies string, and nvidia-master.yaml). The comment was likely an earlier intent that got narrowed during tuning, but the prose wasn't updated to match.
Step-by-step proof:
- Open `benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m2.5-b300/1k1k/tp8-1p1d.yaml`. The header (lines 3–6) says: `# B300-only: full-node TP=8 decode … Only the smallest concurrencies (1,4,8) — this topology is decode-latency focused, not throughput.`
- Scroll to the `benchmark:` block at the bottom: `concurrencies: "4"`. The recipe's own format for multiple values would be `"1x4x8"` (cf. the sibling `tp4-1p1d.yaml`, which uses `"4x16"`).
- Open `.github/configs/nvidia-master.yaml` and find the `tp8-1p1d` cells (1k1k and 8k1k) under `minimaxm2.5-fp4-b300-dynamo-vllm`: both have `conc-list: [4]`. No 1 or 8.
- So at runtime the orchestrator and the recipe agree on concurrency = 4 only. The comment, in two separate files, asserts (1,4,8). Mismatch confirmed.
Fix (trivial, pick one):
- Update both header comments to say `Smallest concurrency only (4).` and adjust the surrounding prose, OR
- Expand `concurrencies` to `"1x4x8"` in both recipe yamls and `conc-list: [1, 4, 8]` in both nvidia-master.yaml cells if 1 and 8 were meant to be in scope (the comment says they should be there).
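The three-way comparison described above (header comment, recipe `concurrencies` string, master `conc-list`) can be sketched as a small check. This is only an illustration: the function names are hypothetical, the `x`-separated format is inferred from the `"4x16"` and `"1x4x8"` examples quoted in this review, and the values are passed in directly instead of being read from the YAML files.

```python
import re


def parse_concurrencies(value: str) -> list[int]:
    """Split an 'x'-separated concurrency string such as "4x16" into integers."""
    return [int(part) for part in value.split("x")]


def comment_concurrencies(comment: str) -> list[int]:
    """Return the first parenthesised integer list in a comment, e.g. "(1,4,8)"."""
    match = re.search(r"\((\d+(?:\s*,\s*\d+)*)\)", comment)
    return [int(n) for n in match.group(1).split(",")] if match else []


def find_mismatches(comment: str, concurrencies: str, conc_list: list[int]) -> list[str]:
    """Report disagreements between the comment, the recipe field, and the master list."""
    problems = []
    recipe = parse_concurrencies(concurrencies)
    if recipe != conc_list:
        problems.append(f"recipe {recipe} != master conc-list {conc_list}")
    claimed = comment_concurrencies(comment)
    if claimed and claimed != recipe:
        problems.append(f"comment claims {claimed} but recipe runs {recipe}")
    return problems


# Values quoted in the review: comment says (1,4,8), recipe and master both say 4.
header = "# GPU count makes available. Only the smallest concurrencies (1,4,8)"
for problem in find_mismatches(header, "4", [4]):
    print(problem)
```

With the values from this PR the recipe and master list agree, so only the comment mismatch is reported; either fix option makes the list of problems empty.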
- "Same 1k/1k and 8k/1k search space as gb300, plus a new tp8-1p1d at low concurrencies for both ISLs"
pr-link: https://github.com/NVIDIA/InferenceMAX/pull/83
🟡 perf-changelog.yaml:3418-3419 has two doc nits in the new minimaxm2.5-fp4-b300-dynamo-vllm entry. (1) The description says "Same 1k/1k and 8k/1k search space as gb300" but no gb300 config key exists on main — only the GB200 sibling does, and the GB300 sibling lives in a separate yet-to-merge PR, so this is a dangling forward-reference. (2) The pr-link is https://github.com/NVIDIA/InferenceMAX/pull/83, but every other entry in this 3400+ line file (including the GB200 sibling at line 3411 → pull/1642) links to SemiAnalysisAI/InferenceX/pull/<n>; this PR is #1652, so the link should be https://github.com/SemiAnalysisAI/InferenceX/pull/1652.
Summary
Two metadata-only nits in the new perf-changelog entry for minimaxm2.5-fp4-b300-dynamo-vllm (perf-changelog.yaml:3413–3419).
1. Dangling gb300 reference (line 3418)
The description bullet reads:
"Same 1k/1k and 8k/1k search space as gb300, plus a new tp8-1p1d at low concurrencies for both ISLs"
But no gb300 config key exists in the repository. Grepping nvidia-master.yaml for minimaxm2.5.*gb returns only minimaxm2.5-fp4-gb200-dynamo-vllm (line 9909). The PR description itself acknowledges: "B300 half of split #1560 (GB300 sibling lives in a separate PR so one CI failure doesn’t block the other)." So the changelog is referring to a sibling artifact that has not yet merged.
Note also that this is not simply a typo for gb200 — the B300 1k/1k search space contains cells that the existing GB200 entry does not (dep2-2p3d, dep2-2p3d-c6144, tp4-1p2d, tp8-1p1d, dep4-4p1d, dep8-4p1d, tp4ep-2p1d), so the reference really does point at the unmerged GB300 sibling. Readers who later consult the changelog will see "search space as gb300" with no easy way to find that gb300 entry (it either does not exist yet, or, if/when the sibling lands, will live elsewhere in the file with no link from here).
Fix options:
- Reference the merged sibling: `"Same 1k/1k and 8k/1k search space as gb200, plus ..."`, or
- Make the cross-reference explicit: `"Same 1k/1k and 8k/1k search space as the GB300 sibling in #<sibling-PR-number>, plus ..."` so the dangling forward-reference becomes a clickable pointer.
2. pr-link points to a different repo (line 3419)
The entry sets:
pr-link: https://github.com/NVIDIA/InferenceMAX/pull/83
This is the only entry in the 3400+ line perf-changelog.yaml that uses NVIDIA/InferenceMAX as the host. Every other entry uses https://github.com/SemiAnalysisAI/InferenceX/pull/<n>, including the immediately-preceding GB200 sibling at line 3411 which points to pull/1642 (the #1642 PR in this repo). Since this PR is #1652 in SemiAnalysisAI/InferenceX, the consistent link would be:
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1652
The README confirms the repo was renamed: "InferenceX™ (formerly InferenceMAX)", so the URL is using the former name with what appears to be an unrelated PR number (#83 in NVIDIA/InferenceMAX is not a public repo referenced anywhere else in this codebase). Git history shows commit 57fb086 (perf-changelog: link minimaxm2.5-fp4-b300 entry to PR #83) explicitly retargeted the link from PLACEHOLDER_PR_LINK to this URL, likely a copy-paste mistake (#83 is an internal/fork PR number, not the public one).
Proof / step-by-step
- Grep for `NVIDIA/InferenceMAX` in the repo → only one hit: `perf-changelog.yaml:3419`.
- Grep for `pr-link.*SemiAnalysisAI/InferenceX` → ~600 hits, including line 3411 which uses `pull/1642`.
- Grep for `minimaxm.?2\.5.*gb` in the repo → only `gb200` matches; no `gb300` anywhere.
- PR description states the GB300 sibling is in a separate not-yet-merged PR.
- The B300 search space added in this PR differs structurally from the existing GB200 entry, so the `gb300` reference is not a typo for `gb200`.
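The first two grep steps amount to finding `pr-link` entries whose repository differs from the one nearly every other entry uses. A minimal sketch of that audit follows, assuming each link sits on its own `pr-link:` line; the helper name and the sample text are illustrative (only the `pull/1642` and `pull/83` links come from this review, and the other entry is made up).

```python
import re
from collections import Counter

# Matches lines such as "  pr-link: https://github.com/<owner>/<repo>/pull/<n>".
PR_LINK = re.compile(
    r"^\s*pr-link:\s*https://github\.com/([^/\s]+/[^/\s]+)/pull/(\d+)\s*$",
    re.MULTILINE,
)


def pr_link_outliers(changelog_text: str) -> list[tuple[str, str]]:
    """Return (repo, pr_number) pairs whose repo is not the most common one."""
    links = PR_LINK.findall(changelog_text)
    if not links:
        return []
    majority_repo, _count = Counter(repo for repo, _ in links).most_common(1)[0]
    return [(repo, number) for repo, number in links if repo != majority_repo]


# Illustrative sample; "example-key" and pull/1 are placeholders, not real entries.
sample = """\
- config-keys:
    - minimaxm2.5-fp4-gb200-dynamo-vllm
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1642
- config-keys:
    - example-key
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1
- config-keys:
    - minimaxm2.5-fp4-b300-dynamo-vllm
  pr-link: https://github.com/NVIDIA/InferenceMAX/pull/83
"""

print(pr_link_outliers(sample))
```

A text-level regex is used here rather than a YAML parser so the sketch has no third-party dependency; a real check might load the file with a YAML library instead.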
Impact
Documentation-only, no behavioral effect; both verifier sets unanimously rate this a nit. But the changelog is the canonical history that readers (and tooling) consult to chase recipe changes, and both issues actively break it: anyone clicking through the pr-link lands on a 404 (or an unrelated repo), and anyone trying to compare the b300 search space against the referenced gb300 finds no such entry.
Suggested fix
- config-keys:
    - minimaxm2.5-fp4-b300-dynamo-vllm
  description:
    - "Add MiniMax-M2.5 NVFP4 B300 disaggregated multinode vLLM benchmarks via Dynamo"
    - "Image: vllm/vllm-openai:v0.20.1"
    - "Same 1k/1k and 8k/1k search space as gb200, plus a new tp8-1p1d at low concurrencies for both ISLs"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1652
(Or rephrase to point at the GB300 sibling PR number once it is open.)
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26857293503
@Ankur-singh this is failing, can u take a look? https://github.com/SemiAnalysisAI/InferenceX/actions/runs/26857293503/job/79211249388?pr=1652
/reuse-sweep-run
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26865500202
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 15a96d5.
cd "$SRT_REPO_DIR" || exit 1
git checkout main
mkdir -p recipes/vllm/minimax-m2.5
cp -rT "$GITHUB_WORKSPACE/benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m2.5-b300" recipes/vllm/minimax-m2.5
Missing seeded venv for Dynamo
High Severity
New minimaxm2.5 dynamo-vllm recipes pin dynamo.wheel, but launch_b300-nv.sh still creates the srtctl venv with plain uv venv, so pip is missing and Dynamo wheel prefetch can fail before jobs start.
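One way to confirm this finding before a job is submitted is to check whether the venv actually contains pip. The sketch below is an assumption-laden illustration, not the runner's code: the helper name is hypothetical, and it only inspects the standard `site-packages` layouts without executing anything. If the check fails, the usual remedy with uv is to create the environment with seed packages (`uv venv --seed`), though whether that matches the intended fix for `launch_b300-nv.sh` is not stated in this thread.

```python
from pathlib import Path


def venv_has_pip(venv_dir: str) -> bool:
    """Return True if a pip package directory exists in the venv's site-packages."""
    root = Path(venv_dir)
    # POSIX venvs use lib/pythonX.Y/site-packages; Windows venvs use Lib/site-packages.
    candidates = list(root.glob("lib/python*/site-packages/pip"))
    candidates += list(root.glob("Lib/site-packages/pip"))
    return any(path.is_dir() for path in candidates)


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python check_venv.py /path/to/.venv
    venv = sys.argv[1] if len(sys.argv) > 1 else ".venv"
    print(f"{venv}: pip present = {venv_has_pip(venv)}")
```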


B300 half of split #1560 (GB300 sibling lives in a separate PR so one CI failure doesn't block the other).
Summary
- Add `minimaxm2.5-fp4-b300-dynamo-vllm` to `.github/configs/nvidia-master.yaml` (1k1k + 8k1k search spaces, including a new `tp8-1p1d` cell at low concurrencies)
- Add srt-slurm recipes under `benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m2.5-b300/`
- Wire `minimaxm2.5 + fp4 + dynamo-vllm` routing into `runners/launch_b300-nv.sh`
Image: `vllm/vllm-openai:v0.20.1`; model: `nvidia/MiniMax-M2.5-NVFP4`.
Note
Low Risk
Benchmark/CI and runner wiring only; no production inference, auth, or application runtime changes.
Overview
Adds B300 disaggregated multinode benchmarking for MiniMax-M2.5 NVFP4 with Dynamo + vLLM (`vllm/vllm-openai:v0.20.1`).
A new `minimaxm2.5-fp4-b300-dynamo-vllm` entry in `nvidia-master.yaml` defines 1k/1k and 8k/1k fixed-seq-len search spaces (prefill/decode worker counts, TP/EP, optional `dp-attn`), pointing `CONFIG_FILE` at `recipes/vllm/minimax-m2.5/...` including new low-concurrency `tp8-1p1d` cells. Matching Slurm recipe YAMLs land under `benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m2.5-b300/` (TP/EP/DP decode variants, expert-parallel and data-parallel decode configs, Nixl KV transfer).
`runners/launch_b300-nv.sh` now routes `minimaxm2.5` + `fp4` + `dynamo-vllm` to the on-disk model path and copies those recipes into NVIDIA `srt-slurm` on `main`. `perf-changelog.yaml` documents the new config key.
Reviewed by Cursor Bugbot for commit 15a96d5. Bugbot is set up for automated code reviews on this repo.