[B300][vLLM] Add MiniMax-M2.5 FP4 disagg Dynamo configs#1652

Merged: functionstackx merged 4 commits into main from split-pr1560-minimax-b300 on Jun 3, 2026

@Ankur-singh Ankur-singh commented Jun 3, 2026

B300 half of split #1560 (GB300 sibling lives in a separate PR so one CI failure doesn't block the other).

Summary

  • Add minimaxm2.5-fp4-b300-dynamo-vllm to .github/configs/nvidia-master.yaml (1k1k + 8k1k search spaces, including a new tp8-1p1d cell at low concurrencies)
  • Add srt-slurm vLLM recipes under benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m2.5-b300/
  • Wire minimaxm2.5 + fp4 + dynamo-vllm routing into runners/launch_b300-nv.sh
  • Append perf-changelog entry

Image: vllm/vllm-openai:v0.20.1; model: nvidia/MiniMax-M2.5-NVFP4.


Note

Low Risk
Benchmark/CI and runner wiring only; no production inference, auth, or application runtime changes.

Overview
Adds B300 disaggregated multinode benchmarking for MiniMax-M2.5 NVFP4 with Dynamo + vLLM (vllm/vllm-openai:v0.20.1).

A new minimaxm2.5-fp4-b300-dynamo-vllm entry in nvidia-master.yaml defines 1k/1k and 8k/1k fixed-seq-len search spaces (prefill/decode worker counts, TP/EP, optional dp-attn), pointing CONFIG_FILE at recipes/vllm/minimax-m2.5/... including new low-concurrency tp8-1p1d cells. Matching Slurm recipe YAMLs land under benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m2.5-b300/ (TP/EP/DP decode variants, expert-parallel and data-parallel decode configs, Nixl KV transfer).

runners/launch_b300-nv.sh now routes minimaxm2.5 + fp4 + dynamo-vllm to the on-disk model path and copies those recipes into NVIDIA srt-slurm on main. perf-changelog.yaml documents the new config key.

Reviewed by Cursor Bugbot for commit 15a96d5. Bugbot is set up for automated code reviews on this repo.

Split of #1560 — B300 half.

- Add minimaxm2.5-fp4-b300-dynamo-vllm to nvidia-master.yaml (1k1k + 8k1k
  search spaces; image vllm/vllm-openai:v0.20.1, model nvidia/MiniMax-M2.5-NVFP4).
- Add srt-slurm recipes under benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m2.5-b300/.
- Wire minimax + dynamo-vllm routing into runners/launch_b300-nv.sh.
- Append perf-changelog entry.

github-actions Bot commented Jun 3, 2026

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Comment on lines +3 to +6
# B300-only: full-node TP=8 decode (the 8 GPUs of a single B300 node).
# Cousin of tp4-1p1d.yaml but exercises the wider TP that B300's per-node
# GPU count makes available. Only the smallest concurrencies (1,4,8) —
# this topology is decode-latency focused, not throughput.

🟡 The header comments in both new tp8-1p1d.yaml files claim the recipe exercises 'the smallest concurrencies (1,4,8)', but the benchmark.concurrencies field is just "4" in both files (and the corresponding nvidia-master.yaml entries use conc-list: [4]). Either update the comments to say (4), or extend the conc-lists to include 1 and 8 if those were intended.

Extended reasoning...

Both new B300 tp8-1p1d recipes contain a self-contradictory header comment vs. their actual benchmark configuration:

  • benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m2.5-b300/1k1k/tp8-1p1d.yaml line 5: # available. Only the smallest concurrencies (1,4,8) — versus line 77: concurrencies: "4"
  • benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m2.5-b300/8k1k/tp8-1p1d.yaml line 5: # available. Smallest concurrencies only (1,4,8). versus line 72: concurrencies: "4"

The master config corroborates the file-level value: in .github/configs/nvidia-master.yaml, both tp8-1p1d entries (the 1k1k and 8k1k slots under minimaxm2.5-fp4-b300-dynamo-vllm) use conc-list: [4] — not [1, 4, 8]. So the comment is factually wrong about what the recipe exercises.

Impact: Documentation-only. The recipes still run correctly (only concurrency 4 is benchmarked, matching the nvidia-master.yaml wiring). The risk is misleading future readers who look at the recipe in isolation, infer that 1 and 8 are also covered, and either skip adding them to a follow-up sweep or get confused when results only show one point.

Why existing review didn't catch it: The concurrency values are spread across three places (the recipe yaml's header comment, the recipe yaml's concurrencies string, and nvidia-master.yaml). The comment was likely an earlier intent that got narrowed during tuning, but the prose wasn't updated to match.

Step-by-step proof:

  1. Open benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m2.5-b300/1k1k/tp8-1p1d.yaml. The header (lines 3–6) says: # B300-only: full-node TP=8 decode … Only the smallest concurrencies (1,4,8) — this topology is decode-latency focused, not throughput.
  2. Scroll to the benchmark: block at the bottom — concurrencies: "4". The recipe's own format for multiple values would be "1x4x8" (cf. the sibling tp4-1p1d.yaml which uses "4x16").
  3. Open .github/configs/nvidia-master.yaml and find the tp8-1p1d cells (1k1k and 8k1k) under minimaxm2.5-fp4-b300-dynamo-vllm: both have conc-list: [4]. No 1 or 8.
  4. So at runtime the orchestrator and the recipe agree on concurrency = 4 only. The comment, in two separate files, asserts (1,4,8). Mismatch confirmed.

Fix (trivial, pick one):

  • Update both header comments to say Smallest concurrency only (4). and adjust the surrounding prose, OR
  • Expand concurrencies to "1x4x8" in both recipe yamls and set conc-list: [1, 4, 8] in both nvidia-master.yaml cells, if 1 and 8 were meant to be covered (the comment says they should be there).
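
The three-way comparison in the proof above can be sketched in Python. This is a minimal illustration and not part of the PR: the function names are hypothetical, the header text and values are copied from this comment, and the "x"-separated format is inferred from the recipe strings quoted above ("4x16").

```python
import re


def parse_recipe_concurrencies(value: str) -> list[int]:
    """Split a recipe-style string such as "4x16" into [4, 16]."""
    return [int(part) for part in value.split("x")]


def parse_header_concurrencies(comment: str) -> list[int]:
    """Extract a parenthesised integer list such as "(1,4,8)" from a comment."""
    match = re.search(r"\((\d+(?:\s*,\s*\d+)*)\)", comment)
    if match is None:
        return []
    return [int(part) for part in match.group(1).split(",")]


def find_mismatches(header_comment: str, recipe_value: str, master_conc_list: list[int]) -> list[str]:
    """Compare the header claim, the recipe string, and the master conc-list."""
    claimed = parse_header_concurrencies(header_comment)
    recipe = parse_recipe_concurrencies(recipe_value)
    problems = []
    if claimed and claimed != recipe:
        problems.append(f"header claims {claimed}, recipe runs {recipe}")
    if recipe != list(master_conc_list):
        problems.append(f"recipe runs {recipe}, master lists {list(master_conc_list)}")
    return problems


# Values as reported in this review comment.
header = "# available. Only the smallest concurrencies (1,4,8)"
print(find_mismatches(header, "4", [4]))
# prints ['header claims [1, 4, 8], recipe runs [4]']
```

Either fix option makes the check pass: a header that says (4) with the current values, or "1x4x8" together with conc-list [1, 4, 8] and the current header, both yield an empty list.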

Comment thread perf-changelog.yaml Outdated
Comment on lines +3418 to +3419
- "Same 1k/1k and 8k/1k search space as gb300, plus a new tp8-1p1d at low concurrencies for both ISLs"
pr-link: https://github.com/NVIDIA/InferenceMAX/pull/83

🟡 perf-changelog.yaml:3418-3419 has two doc nits in the new minimaxm2.5-fp4-b300-dynamo-vllm entry. (1) The description says "Same 1k/1k and 8k/1k search space as gb300" but no gb300 config key exists on main — only the GB200 sibling does, and the GB300 sibling lives in a separate yet-to-merge PR, so this is a dangling forward-reference. (2) The pr-link is https://github.com/NVIDIA/InferenceMAX/pull/83, but every other entry in this 3400+ line file (including the GB200 sibling at line 3411 → pull/1642) links to SemiAnalysisAI/InferenceX/pull/<n>; this PR is #1652, so the link should be https://github.com/SemiAnalysisAI/InferenceX/pull/1652.

Extended reasoning...

Summary

Two metadata-only nits in the new perf-changelog entry for minimaxm2.5-fp4-b300-dynamo-vllm (perf-changelog.yaml:3413–3419).

1. Dangling gb300 reference (line 3418)

The description bullet reads:

"Same 1k/1k and 8k/1k search space as gb300, plus a new tp8-1p1d at low concurrencies for both ISLs"

But no gb300 config key exists in the repository. Grepping nvidia-master.yaml for minimaxm2.5.*gb returns only minimaxm2.5-fp4-gb200-dynamo-vllm (line 9909). The PR description itself acknowledges: "B300 half of split #1560 (GB300 sibling lives in a separate PR so one CI failure doesn’t block the other)." So the changelog is referring to a sibling artifact that has not yet merged.

Note also that this is not simply a typo for gb200 — the B300 1k/1k search space contains cells that the existing GB200 entry does not (dep2-2p3d, dep2-2p3d-c6144, tp4-1p2d, tp8-1p1d, dep4-4p1d, dep8-4p1d, tp4ep-2p1d), so the reference really does point at the unmerged GB300 sibling. Readers who later consult the changelog will see "search space as gb300" with no easy way to find that gb300 entry (it either does not exist yet, or, if/when the sibling lands, will live elsewhere in the file with no link from here).

Fix options:

  • Reference the merged sibling: "Same 1k/1k and 8k/1k search space as gb200, plus ...", or
  • Make the cross-reference explicit: "Same 1k/1k and 8k/1k search space as the GB300 sibling in #<sibling-PR-number>, plus ..." so the dangling forward-reference becomes a clickable pointer.

2. pr-link points to a different repo (line 3419)

The entry sets:

pr-link: https://github.com/NVIDIA/InferenceMAX/pull/83

This is the only entry in the 3400+ line perf-changelog.yaml that uses NVIDIA/InferenceMAX as the host. Every other entry uses https://github.com/SemiAnalysisAI/InferenceX/pull/<n>, including the immediately-preceding GB200 sibling at line 3411 which points to pull/1642 (the #1642 PR in this repo). Since this PR is #1652 in SemiAnalysisAI/InferenceX, the consistent link would be:

pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1652

The README confirms the repo was renamed: "InferenceX™ (formerly InferenceMAX)". The URL therefore uses the former name, under a different organization, with what appears to be an unrelated PR number (NVIDIA/InferenceMAX is not referenced anywhere else in this codebase). Git history shows commit 57fb086 (perf-changelog: link minimaxm2.5-fp4-b300 entry to PR #83) explicitly retargeted the link from PLACEHOLDER_PR_LINK to this URL, which is likely a copy-paste mistake (#83 is an internal/fork PR number, not the public one).

Proof / step-by-step

  1. Grep for NVIDIA/InferenceMAX in the repo → only one hit: perf-changelog.yaml:3419.
  2. Grep for pr-link.*SemiAnalysisAI/InferenceX → ~600 hits, including line 3411 which uses pull/1642.
  3. Grep for minimaxm.?2\.5.*gb in the repo → only gb200 matches; no gb300 anywhere.
  4. PR description states the GB300 sibling is in a separate not-yet-merged PR.
  5. The B300 search space added in this PR differs structurally from the existing GB200 entry, so the gb300 reference is not a typo for gb200.
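
The grep in steps 1 and 2 can also be expressed as a small lint. The following is a sketch under stated assumptions: the function name and regular expression are illustrative, the expected repository is taken from this comment, and the sample lines are the two pr-link values quoted above rather than the real file.

```python
import re

EXPECTED_REPO = "SemiAnalysisAI/InferenceX"

# Matches lines such as "  pr-link: https://github.com/<org>/<repo>/pull/<n>".
PR_LINK = re.compile(
    r"^\s*pr-link:\s*https://github\.com/([^/\s]+/[^/\s]+)/pull/(\d+)\s*$"
)


def find_foreign_pr_links(changelog_text: str, expected_repo: str = EXPECTED_REPO):
    """Return (line_number, repo, pr_number) for pr-link lines outside expected_repo."""
    hits = []
    for line_number, line in enumerate(changelog_text.splitlines(), start=1):
        match = PR_LINK.match(line)
        if match and match.group(1) != expected_repo:
            hits.append((line_number, match.group(1), int(match.group(2))))
    return hits


# Two-line stand-in for perf-changelog.yaml (line numbers are relative to this sample).
sample = """\
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1642
  pr-link: https://github.com/NVIDIA/InferenceMAX/pull/83
"""
print(find_foreign_pr_links(sample))
# prints [(2, 'NVIDIA/InferenceMAX', 83)]
```

Run against the full changelog text, a check like this would flag only the one entry identified in step 1, assuming every other pr-link follows the SemiAnalysisAI/InferenceX pattern as reported.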

Impact

Documentation-only, with no behavioral effect; both verifier sets unanimously rate this a nit. But the changelog is the canonical history that readers (and tooling) consult to trace recipe changes, and both issues actively break it: anyone clicking through the pr-link lands on a 404 (or an unrelated repo), and anyone trying to compare the b300 search space against the referenced gb300 finds no such entry.

Suggested fix

- config-keys:
    - minimaxm2.5-fp4-b300-dynamo-vllm
  description:
    - "Add MiniMax-M2.5 NVFP4 B300 disaggregated multinode vLLM benchmarks via Dynamo"
    - "Image: vllm/vllm-openai:v0.20.1"
    - "Same 1k/1k and 8k/1k search space as gb200, plus a new tp8-1p1d at low concurrencies for both ISLs"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1652

(or rephrase to point at the GB300 sibling PR number once it is open).

@functionstackx

@Ankur-singh this is failing, can u take a look? https://github.com/SemiAnalysisAI/InferenceX/actions/runs/26857293503/job/79211249388?pr=1652

@functionstackx

/reuse-sweep-run

@functionstackx functionstackx merged commit 7d4063d into main Jun 3, 2026
16 of 22 checks passed
@functionstackx functionstackx deleted the split-pr1560-minimax-b300 branch June 3, 2026 05:25

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 15a96d5.

Comment thread runners/launch_b300-nv.sh
cd "$SRT_REPO_DIR" || exit 1
git checkout main
mkdir -p recipes/vllm/minimax-m2.5
cp -rT "$GITHUB_WORKSPACE/benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m2.5-b300" recipes/vllm/minimax-m2.5

Missing seeded venv for Dynamo

High Severity

New minimaxm2.5 dynamo-vllm recipes pin dynamo.wheel, but launch_b300-nv.sh still creates the srtctl venv with plain uv venv, so pip is missing and Dynamo wheel prefetch can fail before jobs start.
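
A quick way to confirm the symptom is to ask the venv's interpreter whether pip is importable. The helper below is a hypothetical sketch, not code from this repo; `has_pip` and the example path are assumptions. One possible remedy, not confirmed in this thread, is creating the environment with `uv venv --seed`, which installs seed packages such as pip.

```python
import subprocess
import sys


def has_pip(python_executable: str) -> bool:
    """Return True if `import pip` succeeds under the given interpreter."""
    try:
        result = subprocess.run(
            [python_executable, "-c", "import pip"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except OSError:
        # The interpreter path does not exist or is not executable.
        return False
    return result.returncode == 0


# Check the interpreter running this script; a venv would be checked via its own
# interpreter path, e.g. has_pip(".venv/bin/python") (path is an assumption).
print(f"pip importable in {sys.executable}: {has_pip(sys.executable)}")
```

Running such a check right after venv creation in the launcher would surface the missing pip before any wheel prefetch is attempted.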
