[NV] Add MiniMax-M2.5 FP4 GB300 Dynamo vLLM recipes by jasonlizhengjian · Pull Request #1641 · SemiAnalysisAI/InferenceX

jasonlizhengjian · 2026-06-02T18:23:08Z

Summary

Add GB300 MiniMax-M2.5 FP4 Dynamo vLLM recipes.

Based on the GB300 portion of #1560.

Note

Low Risk
Benchmark and runner configuration only; no application runtime or security logic changes.

Overview
Adds GB300 coverage for MiniMax-M2.5 NVFP4 with Dynamo + disaggregated vLLM multinode benchmarks.

A new minimaxm2.5-fp4-gb300-dynamo-vllm entry in nvidia-master.yaml defines 1k/1k and 8k/1k scenarios with prefill/decode worker layouts (TP/EP, multi-decode, DP-attn) and CONFIG_FILE pointers into new Slurm recipes. 13 recipe YAMLs under benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m2.5/ (1k1k and 8k1k variants) configure Dynamo, Nixl KV transfer, and sa-bench concurrencies.

runners/launch_gb300-nv.sh maps the model to /data/models/MiniMax-M2.5-NVFP4, stages recipes into srt-slurm on main, and removes stale eval artifacts before copying results. perf-changelog.yaml documents the new config key.

Reviewed by Cursor Bugbot for commit 66b7e04.

github-actions · 2026-06-02T18:23:20Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you!

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. Failures are often just flakes, and simply re-running the failed jobs will fix them. If failed jobs are re-run, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

github-actions · 2026-06-02T18:30:35Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26839794620
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26839794620

claude · 2026-06-02T18:34:51Z

+      no-enable-prefix-caching: true
+      max-model-len: 2048
+      max-cudagraph-capture-size: 2048
+      max-num-batched-tokens: 2048
+      max-num-seqs: 864
+      gpu-memory-utilization: 0.90
+      stream-interval: 32
+
+benchmark:
+  type: "sa-bench"
+  isl: 1024
+  osl: 1024
+  concurrencies: "2048"
+  random_range_ratio: 0.8


🔴 All 12 new MiniMax-M2.5 recipes set trust-remote-code: true only in the prefill vllm_config block — the decode block is missing it. Since MiniMax-M2.5 ships custom HuggingFace modeling code and decode workers independently load the model, every decode engine will fail at startup with the HF trust_remote_code=True required error, blocking all 12 recipes from reaching a ready state. Add trust-remote-code: true under each decode: block (the dsv4 reference recipes already do this — see benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb300-1p6d-dep4-tp4.yaml lines 75 and 94).

Extended reasoning...

What's wrong

Every one of the 12 new MiniMax-M2.5 recipes added under benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m2.5/ sets trust-remote-code: true in the prefill block of vllm_config but not in the decode block. Verified by counting occurrences in each file — every recipe yields exactly 1 match, all in the prefill block:

1k1k/dep2-2p3d-c6144.yaml: 1
1k1k/dep2-2p3d.yaml: 1
1k1k/dep8-2p1d.yaml: 1
1k1k/tp4-1p1d.yaml: 1
1k1k/tp4-1p2d.yaml: 1
1k1k/tp4ep-1p1d.yaml: 1
1k1k/tp4ep-1p3d.yaml: 1
8k1k/dep4-4p1d.yaml: 1
8k1k/dep8-4p1d.yaml: 1
8k1k/tp4-1p1d.yaml: 1
8k1k/tp4ep-1p1d.yaml: 1
8k1k/tp4ep-2p1d.yaml: 1
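A per-file count like the one above can be reproduced with a short shell sketch. The function name count_key_per_recipe is hypothetical, and the match is a line-oriented grep rather than a YAML parse, so it reports how often the key appears in each file but not which block it sits in.

```shell
# count_key_per_recipe DIR KEY
# Print "relative/path.yaml: N" for every recipe under DIR, where N is the
# number of lines that set KEY. (Hypothetical helper, not part of the repo.)
count_key_per_recipe() {
    local dir="$1" key="$2" f
    # sort keeps the listing order stable between runs
    find "$dir" -type f -name '*.yaml' | sort | while IFS= read -r f; do
        # grep -c prints 0 and exits non-zero when nothing matches;
        # "|| true" keeps the zero count without failing the loop
        printf '%s: %s\n' "${f#"$dir"/}" \
            "$(grep -c -- "^[ ]*$key:" "$f" || true)"
    done
}
```

From the repository root, this would be invoked as `count_key_per_recipe benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m2.5 trust-remote-code`.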

The reference dsv4 disagg recipe benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb300-1p6d-dep4-tp4.yaml sets trust-remote-code: true in both prefill (line 75) and decode (line 94) — that is the deliberate, established convention in this directory tree for models that need it.

Why it manifests

The vllm_config.prefill and vllm_config.decode blocks are passed as independent CLI argument sets to two separate vLLM engine processes — one for each worker role. There is no inheritance: every other flag (kv-transfer-config, kv-cache-dtype, no-enable-prefix-caching, max-model-len, max-cudagraph-capture-size, max-num-batched-tokens, stream-interval) is explicitly repeated in both blocks in every minimax recipe, confirming the author knows the blocks don't inherit.

Decode workers each spin up their own vLLM engine that independently calls HuggingFace's AutoModelForCausalLM.from_pretrained(...) on nvidia/MiniMax-M2.5-NVFP4. MiniMax-M2.5 ships custom modeling code (modeling_minimax.py / configuration_minimax.py), so HF refuses to load it without trust_remote_code=True and raises:

ValueError: ... requires you to execute the configuration file ... in that repo on your local machine. ... pass trust_remote_code=True to remove this error.

This fires at engine startup, before any inference happens, so the decode engine never reaches a ready state and the disaggregated job stalls waiting for decode readiness until the readiness timeout expires.

Why this is provably a bug

The asymmetry cannot be intentional:

If MiniMax-M2.5 does not need trust_remote_code, then the prefill setting in all 12 recipes is dead config — and the author wouldn't have added it.

If it does need it (which the prefill setting itself asserts), then the decode workers will crash at HF model loading.

Both branches make the decode omission a bug. The existing in-repo launcher experimental/token_position_decode_slo/minimax-m2.5/serve_minimax_tep8_sbatch.sh (line 49) passes --trust-remote-code to vLLM, further confirming the model requires it. There is also no global default in runners/launch_gb300-nv.sh that injects --trust-remote-code for decode workers.

Impact

All 12 recipes — driving the entire new minimaxm2.5-fp4-gb300-dynamo-vllm config in .github/configs/nvidia-master.yaml (1k/1k and 8k/1k sweeps across TP4, TP4+EP, DEP2, DEP8, and multi-decode layouts) — will fail to start. No benchmark numbers will be produced for this entire config until the fix lands.

Step-by-step proof (worked example: 1k1k/dep2-2p3d.yaml)

1. CI selects the conc-list: [2048] entry in nvidia-master.yaml whose additional-settings points to recipes/vllm/minimax-m2.5/1k1k/dep2-2p3d.yaml.

2. srtctl apply reads that file; the YAML's vllm_config.prefill includes trust-remote-code: true, but vllm_config.decode (lines 60–73 in the new file) does not.

3. srt-slurm launches 2 prefill workers and 3 decode workers as separate SLURM steps. Each decode worker invokes vllm serve nvidia/MiniMax-M2.5-NVFP4 --kv-transfer-config ... --kv-cache-dtype fp8 --data-parallel-size 2 ... --no-enable-prefix-caching ..., without --trust-remote-code.

4. vLLM's engine startup calls AutoConfig.from_pretrained('nvidia/MiniMax-M2.5-NVFP4', trust_remote_code=False). Because the repo ships custom configuration_minimax.py, HF raises ValueError: The repository for nvidia/MiniMax-M2.5-NVFP4 contains custom code which must be executed to correctly load the model. ... pass trust_remote_code=True to remove this error.

5. Decode worker exits non-zero before serving anything. The disaggregated frontend never sees decode become ready; VLLM_ENGINE_READY_TIMEOUT_S=3600 fires after an hour and the SLURM job is killed with no useful output.

Fix

Add trust-remote-code: true to the decode: block of every recipe under benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m2.5/ (12 files total):

decode:
  kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
  kv-cache-dtype: "fp8"
  trust-remote-code: true  # <-- add this line
  ...

This mirrors the dsv4 pattern at benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb300-1p6d-dep4-tp4.yaml line 94.
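One way to catch this kind of omission before a run is a small check that looks for the key inside a named block. This is a sketch under assumptions: block_has_key is a hypothetical helper, it relies only on space indentation rather than a real YAML parser, and it succeeds if any block with the given name in the file contains the key.

```shell
# block_has_key FILE BLOCK KEY
# Exit 0 if KEY is set somewhere inside the mapping that starts at "BLOCK:",
# exit 1 otherwise. (Hypothetical helper; indentation scan, not a YAML parser.)
block_has_key() {
    awk -v block="$2" -v key="$3" '
        # Number of leading spaces on a line.
        function indent(s) { match(s, /^ */); return RLENGTH }
        # Skip blank lines and comment-only lines.
        /^[ \t]*(#|$)/ { next }
        # A line at or above the block header depth ends the block.
        in_block && indent($0) <= depth { in_block = 0 }
        {
            line = $0
            sub(/^ +/, "", line)
            if (in_block && index(line, key ":") == 1) found = 1
            if (!in_block && line == block ":") { in_block = 1; depth = indent($0) }
        }
        END { exit(found ? 0 : 1) }
    ' "$1"
}

# Example loop (paths as named in the review):
# for f in benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m2.5/*/*.yaml; do
#     block_has_key "$f" decode trust-remote-code || echo "missing in decode: $f"
# done
```

Because the scan keys on the block name alone, a recipe that has more than one mapping named decode would need a stricter check, for example one that also tracks the parent key.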

claude · 2026-06-02T18:34:51Z

+elif [[ $FRAMEWORK == "dynamo-vllm" && $MODEL_PREFIX == "minimaxm2.5" ]]; then
+    git clone https://github.com/NVIDIA/srt-slurm.git "$SRT_REPO_DIR"
+    cd "$SRT_REPO_DIR"
+    git checkout main
+    mkdir -p recipes/vllm/minimax-m2.5
+    cp -rT "$GITHUB_WORKSPACE/benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m2.5" recipes/vllm/minimax-m2.5
 else


🟡 The new minimax-m2.5 branch in runners/launch_gb300-nv.sh (line 146) runs git checkout main on NVIDIA/srt-slurm, which is the most volatile reference in the repo. Every other arm pins to a less-volatile named branch (aflowers/gb200-dsv4-recipes, sa-submission-q2-2026) or a commit SHA (agentic uses 6e34b8b). The 13 new recipes set setup_script: install-deps.sh while existing dsv4 recipes use vllm-container-deps.sh — so this arm has a real, schema-level dependency on whatever currently exists at srt-slurm@main, and a future upstream rename will silently break it. Pinning to a SHA at the install-deps.sh era would make this reproducible.

Extended reasoning...

What this is

runners/launch_gb300-nv.sh now has a new dynamo-vllm && minimaxm2.5 branch (lines 143-149 in the diff):

elif [[ $FRAMEWORK == "dynamo-vllm" && $MODEL_PREFIX == "minimaxm2.5" ]]; then
    git clone https://github.com/NVIDIA/srt-slurm.git "$SRT_REPO_DIR"
    cd "$SRT_REPO_DIR"
    git checkout main
    mkdir -p recipes/vllm/minimax-m2.5
    cp -rT "$GITHUB_WORKSPACE/benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m2.5" recipes/vllm/minimax-m2.5

The other arms in the same file pin more deliberately: agentic checks out a commit SHA (6e34b8b...), the existing dsv4 dynamo-vllm arm uses the named feature branch aflowers/gb200-dsv4-recipes, and the glm5 / default arm uses sa-submission-q2-2026. main is the only moving target where unrelated upstream work is regularly committed.

Why the moving target matters here (concrete proof)

All 13 new recipes added by this PR declare:

setup_script: install-deps.sh

…whereas every existing recipe under benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/ uses:

setup_script: vllm-container-deps.sh

So this PR has a real, schema-level dependency on a script name that only exists on the current state of srt-slurm@main. Step through what happens when upstream renames or removes install-deps.sh (a perfectly normal refactor on a development branch):

1. The next CI run clones NVIDIA/srt-slurm@main fresh into $SRT_REPO_DIR.

2. The recipe is copied in with setup_script: install-deps.sh.

3. srtctl apply reads the recipe and tries to invoke install-deps.sh from the cloned repo, where it no longer exists or has been renamed.

4. The minimax-m2.5 GB300 benchmark fails (or worse, runs with stale dependency setup) silently from the PR author's point of view. The change that breaks it is in a different repo, on a different branch, by a different author.
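A preflight check in the launcher could turn this silent failure into an immediate, readable one. The sketch below is hypothetical: check_setup_script is not an existing helper, and the directory in which srt-slurm keeps its setup scripts is not stated in this review, so the caller has to supply it.

```shell
# check_setup_script RECIPE SCRIPTS_DIR
# Exit 0 if the script named by the recipe's setup_script entry exists in
# SCRIPTS_DIR, 1 if it is missing, 2 if the recipe names no setup script.
# (Hypothetical helper; SCRIPTS_DIR is an assumption supplied by the caller.)
check_setup_script() {
    local recipe="$1" scripts_dir="$2" name
    # First "setup_script:" entry in the file; surrounding quotes are stripped.
    name=$(awk '$1 == "setup_script:" { print $2; exit }' "$recipe" | tr -d "\"'")
    if [ -z "$name" ]; then
        echo "no setup_script entry in $recipe" >&2
        return 2
    fi
    if [ ! -f "$scripts_dir/$name" ]; then
        echo "$recipe references $name, which is missing from $scripts_dir" >&2
        return 1
    fi
    return 0
}
```

Calling this for each staged recipe right after the clone would make the job fail at staging time instead of partway through srtctl apply.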

The dsv4 and glm5 arms also use mutable references, but they are project-specific feature branches that upstream maintainers don't actively churn — the volatility delta vs. main is real.

Addressing the refutation

The refutation argues this is consistent with the existing pattern (other arms use branches too) and that the author deliberately chose main because that's where install-deps.sh lives. Both points are accurate, but they actually reinforce the concern: the PR is intentionally taking a dependency on the current state of main, which is exactly the moment to pin to a SHA. The author already knows which commit currently has the script with the expected name and shape — capturing it as a SHA costs nothing and makes the benchmark reproducible. The agentic precedent in this same file shows the pattern is already used in the repo when reproducibility matters.

Fix

Replace git checkout main with the current SHA of NVIDIA/srt-slurm@main at the time of this PR (the commit that carries install-deps.sh in its current form), matching how the agentic arm pins to 6e34b8b. Severity is nit rather than blocking because (a) other arms in this file already use named branches without SHA pinning, and (b) this is benchmark/CI infrastructure rather than runtime code.
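As an illustration of the pinning idea, the launcher could refuse anything that is not a full commit SHA. Both function names below are hypothetical, and the SHA to pin is deliberately left to the caller because this review does not state it; abbreviated SHAs such as 6e34b8b are rejected by this particular sketch, which is a design choice and not something the repo requires.

```shell
# is_full_sha REF
# Succeed only if REF is exactly 40 hexadecimal characters.
is_full_sha() {
    case "$1" in
        *[!0-9a-fA-F]*) return 1 ;;   # contains a non-hex character
    esac
    [ "${#1}" -eq 40 ]
}

# checkout_pinned REPO_DIR REF
# Refuse branch names and short SHAs, then detach HEAD at the given commit.
# (Hypothetical helper, not part of runners/launch_gb300-nv.sh.)
checkout_pinned() {
    local repo_dir="$1" ref="$2"
    if ! is_full_sha "$ref"; then
        echo "refusing to check out moving ref '$ref'; pass a full commit SHA" >&2
        return 1
    fi
    git -C "$repo_dir" checkout --detach "$ref"
}
```

With this in place, `checkout_pinned "$SRT_REPO_DIR" main` fails fast, while a call with a full SHA checks out exactly that commit regardless of what later lands on main.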

github-actions · 2026-06-02T20:27:48Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26840145659
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26840145659


jasonlizhengjian requested a review from a team June 2, 2026 18:23

jasonlizhengjian requested review from jgangani and kedarpotdar-nv as code owners June 2, 2026 18:23

github-project-automation Bot added this to InferenceMAX Board Jun 2, 2026

jasonlizhengjian added the full-sweep-enabled label Jun 2, 2026

claude Bot reviewed Jun 2, 2026


jasonlizhengjian removed the full-sweep-enabled label Jun 3, 2026

jasonlizhengjian mentioned this pull request Jun 3, 2026

[GB300][vLLM] Add MiniMax-M2.5 FP4 disagg Dynamo configs #1653

Closed

jasonlizhengjian added 3 commits June 2, 2026 21:47

Add MiniMax-M2.5 FP4 GB300 Dynamo vLLM recipes

ac23274

fix: pin minimax gb300 sweep to nv runners

97fbd88

Fix GB300 eval artifact copy

45759a7

functionstackx force-pushed the nv/jasonli/minimaxm2.5-fp4-gb300-only branch from d5387cb to 45759a7 on June 3, 2026 01:48

Merge remote-tracking branch 'inferencex/main' into merge-gb300-fp4-p…

66b7e04

…r1641 # Conflicts: # perf-changelog.yaml

jasonlizhengjian added the full-sweep-enabled label Jun 3, 2026

jasonlizhengjian wants to merge 4 commits into main from nv/jasonli/minimaxm2.5-fp4-gb300-only
