dsr1-fp4-mi355x-sglang: bump ROCm 7.0->7.2 image + add TP4 search-space#1645
dsr1-fp4-mi355x-sglang: bump ROCm 7.0->7.2 image + add TP4 search-space#1645JohnQinAMD wants to merge 2 commits into
Conversation
Bump image lmsysorg/sglang:v0.5.12-rocm700-mi35x -> lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260519 (ROCm 7.0->7.2; same image validated multi-node in PR #1566). Add tp:4 search-space. Recipe unchanged (already QR-INT4 + range-ratio 0.8). fp8-mi355x untouched. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. |
1 similar comment
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. |
…1645) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
| - "Image: lmsysorg/sglang:v0.5.12-rocm700-mi35x -> lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260519 (ROCm 7.0 -> 7.2; same image already validated multi-node in PR #1566)" | ||
| - "Add tp:4 to search-space (8k1k for non-mtp; 1k1k+8k1k for mtp) to surface the TP4 throughput-per-GPU frontier" | ||
| - "Local MI355X bench (matched cfg: range-ratio 0.8 + ROCM_QUICK_REDUCE_QUANTIZATION=INT4): 8k1k decode TPOT 9-25% faster at every conc vs the rocm700 CI; c64 now leads both axes (TPOT 16.5 vs 18.1 ms, tput 2954 vs 2841 tok/s/gpu). TP4 c64 8k1k = 4254 tok/s/gpu (> TP8 2954)" | ||
| pr-link: PR_LINK_PENDING |
There was a problem hiding this comment.
🔴 The new perf-changelog entry uses pr-link: PR_LINK_PENDING instead of the canonical placeholder https://github.com/SemiAnalysisAI/InferenceX/pull/XXX documented in AGENTS.md:128 and used by all ~100 existing entries. On a clean merge this non-URL string lands on main as-is; on a merge-conflict path it breaks utils/merge_with_reuse.sh (the script filters PR entries by 'XXX' in link or link.endswith('/pull/{pr}'), neither of which matches PR_LINK_PENDING, so the entry is silently skipped and the tool aborts with "No PR contributions found"). Replace with https://github.com/SemiAnalysisAI/InferenceX/pull/XXX (or the final /pull/1645 URL).
Extended reasoning...
What the bug is
perf-changelog.yaml line 3406 sets pr-link: PR_LINK_PENDING for the new entry. The repository's documented convention (AGENTS.md:128) is a real GitHub URL or the literal placeholder https://github.com/SemiAnalysisAI/InferenceX/pull/XXX, and every one of the ~100 existing entries in the file follows that format. PR_LINK_PENDING is neither, and nothing in the repo rewrites it before merge (a repo-wide grep finds the string only in perf-changelog.yaml itself).
Why this matters in practice
The auto-merge helper utils/merge_with_reuse.sh recognises a PR's own changelog entry via this filter (lines ~140-141):
link = str(entry.get('pr-link') or '')
if 'XXX' not in link and not link.endswith(f'/pull/{pr}'):
continuePR_LINK_PENDING contains no XXX and does not end with /pull/1645, so the entry is silently skipped. The subsequent guard (if not contribs: sys.exit("No PR contributions found...")) then aborts the reuse-merge tool whenever a perf-changelog.yaml conflict arises against this PR — the documented recovery path is broken.
Step-by-step proof
- Author opens PR dsr1-fp4-mi355x-sglang: bump ROCm 7.0->7.2 image + add TP4 search-space #1645 with
pr-link: PR_LINK_PENDINGin the new changelog entry. - Another perf-changelog PR lands first, creating a merge conflict on
perf-changelog.yamlfor dsr1-fp4-mi355x-sglang: bump ROCm 7.0->7.2 image + add TP4 search-space #1645. - A maintainer runs
utils/merge_with_reuse.shto auto-resolve, passingpr=1645. - The script iterates
pr_dataand computeslink = "PR_LINK_PENDING". The check'XXX' not in link and not link.endswith('/pull/1645')evaluatesTrue and True == True, socontinueruns and this PR's contribution is dropped. - After the loop,
contribsis empty →sys.exit("No PR contributions found ...")fires; the auto-merge aborts. - Even without conflicts, the literal
PR_LINK_PENDINGends up onmain, breaking the documented convention and any tooling that parses pr-links as URLs.
How to fix
Replace pr-link: PR_LINK_PENDING with either the canonical placeholder https://github.com/SemiAnalysisAI/InferenceX/pull/XXX (which merge_with_reuse.sh and convention both recognise) or, since the PR number is known, the final https://github.com/SemiAnalysisAI/InferenceX/pull/1645. A one-character / one-line diff.
There was a problem hiding this comment.
Fixed — pr-link was updated to the real URL in the second commit (4f52374) immediately after the PR was opened.
What
Bump the SGLang DSR1-FP4 MI355X benchmark image from
lmsysorg/sglang:v0.5.12-rocm700-mi35x(ROCm 7.0) tolmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260519(ROCm 7.2), and add atp:4search-space point. Applies to bothdsr1-fp4-mi355x-sglanganddsr1-fp4-mi355x-sglang-mtp.Why
The SGLang DSR1-FP4 MI355X config is the last one still on ROCm 7.0 (the ATOM config is already on 7.2.3). Local MI355X benches with the matched recipe (range-ratio 0.8 +
ROCM_QUICK_REDUCE_QUANTIZATION=INT4, MTP):The target image is the same one already validated multi-node in PR #1566, so it's low-risk.
Scope / safety
dsr1_fp4_mi355x_mtp.shis unchanged (already has QR-INT4 + range-ratio 0.8 + the MLA/MTP flags; handles--tensor-parallel-size=4).amd-master.yaml+ oneperf-changelog.yamltrigger entry.Caveat
The ATOM config keeps 8k1k TP4 commented out (likely an OOM at high conc for that stack). For SGLang, TP4 (4 GPUs, ~90 GB MXFP4 weights) was benched cleanly at c4-64; if CI OOMs at tp4/8k1k, fall back to
conc-start: 32like ATOM.Note
Low Risk
Benchmark-only YAML and changelog changes; serving recipes are unchanged and the target image was already validated on related MI355X workloads.
Overview
Updates DeepSeek-R1 FP4 MI355X SGLang benchmarks (
dsr1-fp4-mi355x-sglangand-mtp) to move off the last ROCm 7.0 image ontolmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260519(ROCm 7.2), matching the build already exercised on multi-node disagg in PR #1566.The sweep matrix gains
tp: 4alongside existingtp: 8—8k1k for both configs, plus 1k1k for the MTP variant—to capture a better throughput-per-GPU point.agentic-codingon the non-MTP block is commented out for this validation PR to limit CI cost;perf-changelog.yamlrecords the image bump, TP4 addition, and local perf notes.Reviewed by Cursor Bugbot for commit 4f52374. Bugbot is set up for automated code reviews on this repo. Configure here.