[NV] Add MiniMax-M2.5 FP4 GB300 Dynamo vLLM recipes #1641
jasonlizhengjian wants to merge 4 commits
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you.
PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow
As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26839794620
no-enable-prefix-caching: true
max-model-len: 2048
max-cudagraph-capture-size: 2048
max-num-batched-tokens: 2048
max-num-seqs: 864
gpu-memory-utilization: 0.90
stream-interval: 32

benchmark:
  type: "sa-bench"
  isl: 1024
  osl: 1024
  concurrencies: "2048"
  random_range_ratio: 0.8
🔴 All 12 new MiniMax-M2.5 recipes set trust-remote-code: true only in the prefill vllm_config block — the decode block is missing it. Since MiniMax-M2.5 ships custom HuggingFace modeling code and decode workers independently load the model, every decode engine will fail at startup with the HF trust_remote_code=True required error, blocking all 12 recipes from reaching a ready state. Add trust-remote-code: true under each decode: block (the dsv4 reference recipes already do this — see benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb300-1p6d-dep4-tp4.yaml lines 75 and 94).
Extended reasoning...
What's wrong
Every one of the 12 new MiniMax-M2.5 recipes added under benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m2.5/ sets trust-remote-code: true in the prefill block of vllm_config but not in the decode block. Verified by counting occurrences in each file — every recipe yields exactly 1 match, all in the prefill block:
1k1k/dep2-2p3d-c6144.yaml: 1
1k1k/dep2-2p3d.yaml: 1
1k1k/dep8-2p1d.yaml: 1
1k1k/tp4-1p1d.yaml: 1
1k1k/tp4-1p2d.yaml: 1
1k1k/tp4ep-1p1d.yaml: 1
1k1k/tp4ep-1p3d.yaml: 1
8k1k/dep4-4p1d.yaml: 1
8k1k/dep8-4p1d.yaml: 1
8k1k/tp4-1p1d.yaml: 1
8k1k/tp4ep-1p1d.yaml: 1
8k1k/tp4ep-2p1d.yaml: 1
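The per-file counts above could be reproduced with a short loop such as the one below. This is a minimal sketch, not the reviewer's actual command: the helper names count_flag and report_counts are hypothetical, and the pattern assumes the key is written literally as trust-remote-code: true.

```bash
# count_flag FILE: print how many lines of FILE set trust-remote-code: true.
# (Hypothetical helper for illustration; not part of the repository.)
count_flag() {
  # grep -c prints the count but exits 1 on zero matches, so mask that status.
  grep -c -E '^[[:space:]]*trust-remote-code:[[:space:]]*true' "$1" || true
}

# report_counts DIR: print "relative/path.yaml: N" for every recipe under DIR.
report_counts() {
  dir="$1"
  find "$dir" -name '*.yaml' | sort | while read -r f; do
    printf '%s: %s\n' "${f#"$dir"/}" "$(count_flag "$f")"
  done
}
```

Running report_counts on benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m2.5 would list one line per recipe. A count of 1 in a recipe with separate prefill and decode blocks means one of the two blocks lacks the flag.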
The reference dsv4 disagg recipe benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb300-1p6d-dep4-tp4.yaml sets trust-remote-code: true in both prefill (line 75) and decode (line 94) — that is the deliberate, established convention in this directory tree for models that need it.
Why it manifests
The vllm_config.prefill and vllm_config.decode blocks are passed as independent CLI argument sets to two separate vLLM engine processes — one for each worker role. There is no inheritance: every other flag (kv-transfer-config, kv-cache-dtype, no-enable-prefix-caching, max-model-len, max-cudagraph-capture-size, max-num-batched-tokens, stream-interval) is explicitly repeated in both blocks in every minimax recipe, confirming the author knows the blocks don't inherit.
Decode workers each spin up their own vLLM engine that independently calls HuggingFace's AutoModelForCausalLM.from_pretrained(...) on nvidia/MiniMax-M2.5-NVFP4. MiniMax-M2.5 ships custom modeling code (modeling_minimax.py / configuration_minimax.py), so HF refuses to load it without trust_remote_code=True and raises:
ValueError: ... requires you to execute the configuration file ... in that repo on your local machine. ... pass trust_remote_code=True to remove this error.
This fires at engine startup, before any inference happens, so the decode engine never reaches a ready state and the disaggregated job stalls waiting for decode readiness until the readiness timeout expires.
Why this is provably a bug
The asymmetry cannot be intentional:
- If MiniMax-M2.5 does not need trust_remote_code, then the prefill setting in all 12 recipes is dead config, and the author wouldn't have added it.
- If it does need it (which the prefill setting itself asserts), then the decode workers will crash at HF model loading.
Both branches make the decode omission a bug. The existing in-repo launcher experimental/token_position_decode_slo/minimax-m2.5/serve_minimax_tep8_sbatch.sh (line 49) passes --trust-remote-code to vLLM, further confirming the model requires it. There is also no global default in runners/launch_gb300-nv.sh that injects --trust-remote-code for decode workers.
Impact
All 12 recipes — driving the entire new minimaxm2.5-fp4-gb300-dynamo-vllm config in .github/configs/nvidia-master.yaml (1k/1k and 8k/1k sweeps across TP4, TP4+EP, DEP2, DEP8, and multi-decode layouts) — will fail to start. No benchmark numbers will be produced for this entire config until the fix lands.
Step-by-step proof (worked example: 1k1k/dep2-2p3d.yaml)
- CI selects the conc-list: [2048] entry in nvidia-master.yaml whose additional-settings points to recipes/vllm/minimax-m2.5/1k1k/dep2-2p3d.yaml.
- srtctl apply reads that file; the YAML's vllm_config.prefill includes trust-remote-code: true, but vllm_config.decode (lines 60–73 in the new file) does not.
- srt-slurm launches 2 prefill workers and 3 decode workers as separate SLURM steps. Each decode worker invokes vllm serve nvidia/MiniMax-M2.5-NVFP4 --kv-transfer-config ... --kv-cache-dtype fp8 --data-parallel-size 2 ... --no-enable-prefix-caching ..., without --trust-remote-code.
- vLLM's engine startup calls AutoConfig.from_pretrained('nvidia/MiniMax-M2.5-NVFP4', trust_remote_code=False). Because the repo ships custom configuration_minimax.py, HF raises ValueError: The repository for nvidia/MiniMax-M2.5-NVFP4 contains custom code which must be executed to correctly load the model. ... pass trust_remote_code=True to remove this error.
- Decode worker exits non-zero before serving anything. The disaggregated frontend never sees decode become ready; VLLM_ENGINE_READY_TIMEOUT_S=3600 fires after an hour and the SLURM job is killed with no useful output.
Fix
Add trust-remote-code: true to the decode: block of every recipe under benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m2.5/ (12 files total):
decode:
  kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
  kv-cache-dtype: "fp8"
  trust-remote-code: true # <-- add this line
  ...
This mirrors the dsv4 pattern at benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb300-1p6d-dep4-tp4.yaml line 94.
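A check along the following lines could confirm that the flag sits inside a specific block rather than just somewhere in the file. It is a sketch under stated assumptions: block_has_flag is a hypothetical helper, the recipe is assumed to use space indentation with the block key (for example decode:) alone on its line, and a file with several blocks of the same name passes if any of them carries the flag.

```bash
# block_has_flag FILE BLOCK: succeed if a "BLOCK:" mapping in FILE contains
# trust-remote-code: true among its more deeply indented lines.
# (Hypothetical helper; indentation-based, not a full YAML parser.)
block_has_flag() {
  awk -v block="$2" '
    # Number of leading spaces on a line.
    function indent(s) { match(s, /^ */); return RLENGTH }
    /^ *(#|$)/ { next }                               # skip blanks and comments
    {
      if (inside && indent($0) <= depth) inside = 0   # left the block
      if (inside && $0 ~ /^ *trust-remote-code: *true/) found = 1
      if ($0 ~ ("^ *" block ": *$")) { inside = 1; depth = indent($0) }
    }
    END { exit found ? 0 : 1 }
  ' "$1"
}
```

For example, block_has_flag 1k1k/dep2-2p3d.yaml decode would be expected to fail before the fix and succeed once the line is added, assuming the recipe matches the layout described above.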
elif [[ $FRAMEWORK == "dynamo-vllm" && $MODEL_PREFIX == "minimaxm2.5" ]]; then
    git clone https://github.com/NVIDIA/srt-slurm.git "$SRT_REPO_DIR"
    cd "$SRT_REPO_DIR"
    git checkout main
    mkdir -p recipes/vllm/minimax-m2.5
    cp -rT "$GITHUB_WORKSPACE/benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m2.5" recipes/vllm/minimax-m2.5
else
🟡 The new minimax-m2.5 branch in runners/launch_gb300-nv.sh (line 146) runs git checkout main on NVIDIA/srt-slurm, which is the most volatile reference in the repo. Every other arm pins to a less-volatile named branch (aflowers/gb200-dsv4-recipes, sa-submission-q2-2026) or a commit SHA (agentic uses 6e34b8b). The 13 new recipes set setup_script: install-deps.sh while existing dsv4 recipes use vllm-container-deps.sh — so this arm has a real, schema-level dependency on whatever currently exists at srt-slurm@main, and a future upstream rename will silently break it. Pinning to a SHA at the install-deps.sh era would make this reproducible.
Extended reasoning...
What this is
runners/launch_gb300-nv.sh now has a new dynamo-vllm && minimaxm2.5 branch (lines 143-149 in the diff):
elif [[ $FRAMEWORK == "dynamo-vllm" && $MODEL_PREFIX == "minimaxm2.5" ]]; then
    git clone https://github.com/NVIDIA/srt-slurm.git "$SRT_REPO_DIR"
    cd "$SRT_REPO_DIR"
    git checkout main
    mkdir -p recipes/vllm/minimax-m2.5
    cp -rT "$GITHUB_WORKSPACE/benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m2.5" recipes/vllm/minimax-m2.5
The other arms in the same file pin more deliberately: agentic checks out a commit SHA (6e34b8b...), the existing dsv4 dynamo-vllm arm uses the named feature branch aflowers/gb200-dsv4-recipes, and the glm5 / default arm uses sa-submission-q2-2026. main is the only moving target where unrelated upstream work is regularly committed.
Why the moving target matters here (concrete proof)
All 13 new recipes added by this PR declare:
setup_script: install-deps.sh
whereas every existing recipe under benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/ uses:
setup_script: vllm-container-deps.sh
So this PR has a real, schema-level dependency on a script name that only exists on the current state of srt-slurm@main. Step through what happens when upstream renames or removes install-deps.sh (a perfectly normal refactor on a development branch):
- Next CI run clones NVIDIA/srt-slurm@main fresh into $SRT_REPO_DIR.
- The recipe is copied in with setup_script: install-deps.sh.
- srtctl apply reads the recipe and tries to invoke install-deps.sh from the cloned repo, which no longer exists or has been renamed.
- The minimax-m2.5 GB300 benchmark fails (or worse, runs with stale dependency setup) silently from the PR author's point of view. The change that breaks it is in a different repo, on a different branch, by a different author.
The dsv4 and glm5 arms also use mutable references, but they are project-specific feature branches that upstream maintainers don't actively churn — the volatility delta vs. main is real.
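The failure mode stepped through above could be surfaced earlier with a preflight check that runs right after the recipes are copied in. The sketch below is illustrative only: setup_script_of and check_setup_script are hypothetical helpers, and the directory where srt-slurm keeps its setup scripts is not stated in this thread, so it is passed in as an argument.

```bash
# setup_script_of RECIPE: print the value of the first setup_script key.
# (Hypothetical helper; strips optional quotes and trailing comments.)
setup_script_of() {
  sed -n 's/^[[:space:]]*setup_script:[[:space:]]*\([^[:space:]#]*\).*$/\1/p' "$1" \
    | head -n 1 | tr -d "\"'"
}

# check_setup_script RECIPE SCRIPTS_DIR: fail loudly if the script the recipe
# names is missing from SCRIPTS_DIR (an assumed location inside the clone).
check_setup_script() {
  name="$(setup_script_of "$1")"
  if [ -z "$name" ]; then
    echo "no setup_script key in $1" >&2
    return 2
  fi
  if [ ! -f "$2/$name" ]; then
    echo "recipe $1 expects $name, not found in $2" >&2
    return 1
  fi
}
```

Calling check_setup_script for each copied recipe before srtctl apply would turn an upstream rename into an immediate, attributable error instead of a late benchmark failure.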
Addressing the refutation
The refutation argues this is consistent with the existing pattern (other arms use branches too) and that the author deliberately chose main because that's where install-deps.sh lives. Both points are accurate, but they actually reinforce the concern: the PR is intentionally taking a dependency on the current state of main, which is exactly the moment to pin to a SHA. The author already knows which commit currently has the script with the expected name and shape — capturing it as a SHA costs nothing and makes the benchmark reproducible. The agentic precedent in this same file shows the pattern is already used in the repo when reproducibility matters.
Fix
Replace git checkout main with the current SHA of NVIDIA/srt-slurm@main at the time of this PR (the commit that carries install-deps.sh in its current form), matching how the agentic arm pins to 6e34b8b. Severity is nit rather than blocking because (a) other arms in this file already use named branches without SHA pinning, and (b) this is benchmark/CI infrastructure rather than runtime code.
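One way to keep a pinned arm from drifting back to a branch name is to validate the ref before checking it out. The following is a sketch, not the repository's code: is_full_sha and checkout_pinned are hypothetical helpers, and the SHA to pin is deliberately left to the author since this thread does not record it.

```bash
# is_full_sha REF: succeed only if REF is a full 40-character lowercase hex SHA.
is_full_sha() {
  case "$1" in
    *[!0-9a-f]*) return 1 ;;   # contains a non-hex character
  esac
  [ "${#1}" -eq 40 ]
}

# checkout_pinned REPO_DIR REF: refuse mutable refs such as branch names,
# then check out the commit in detached-HEAD state.
checkout_pinned() {
  if ! is_full_sha "$2"; then
    echo "refusing to check out mutable ref: $2" >&2
    return 1
  fi
  git -C "$1" checkout --detach "$2"
}
```

In the launcher, git checkout main would become something like checkout_pinned "$SRT_REPO_DIR" "$SRT_SLURM_SHA", where SRT_SLURM_SHA is a hypothetical variable holding the chosen commit. An abbreviated SHA or a branch name is rejected before git is invoked.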
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26840145659
d5387cb to 45759a7
…r1641 # Conflicts: # perf-changelog.yaml
Summary
Add GB300 MiniMax-M2.5 FP4 Dynamo vLLM recipes.
Based on the GB300 portion of #1560.
Note
Low Risk
Benchmark and runner configuration only; no application runtime or security logic changes.
Overview
Adds GB300 coverage for MiniMax-M2.5 NVFP4 with Dynamo + disaggregated vLLM multinode benchmarks.
A new minimaxm2.5-fp4-gb300-dynamo-vllm entry in nvidia-master.yaml defines 1k/1k and 8k/1k scenarios with prefill/decode worker layouts (TP/EP, multi-decode, DP-attn) and CONFIG_FILE pointers into new Slurm recipes. 13 recipe YAMLs under benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m2.5/ (1k1k and 8k1k variants) configure Dynamo, Nixl KV transfer, and sa-bench concurrencies. runners/launch_gb300-nv.sh maps the model to /data/models/MiniMax-M2.5-NVFP4, stages recipes into srt-slurm on main, and removes stale eval artifacts before copying results. perf-changelog.yaml documents the new config key.
Reviewed by Cursor Bugbot for commit 66b7e04.