Skip to content

dsr1-fp4-mi355x-sglang: bump ROCm 7.0->7.2 image + add TP4 search-space#1645

Open
JohnQinAMD wants to merge 2 commits into
mainfrom
dsr1-mi355x-sglang-rocm720-bump
Open

dsr1-fp4-mi355x-sglang: bump ROCm 7.0->7.2 image + add TP4 search-space#1645
JohnQinAMD wants to merge 2 commits into
mainfrom
dsr1-mi355x-sglang-rocm720-bump

Conversation

@JohnQinAMD
Copy link
Copy Markdown
Collaborator

@JohnQinAMD JohnQinAMD commented Jun 2, 2026

What

Bump the SGLang DSR1-FP4 MI355X benchmark image from lmsysorg/sglang:v0.5.12-rocm700-mi35x (ROCm 7.0) to lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260519 (ROCm 7.2), and add a tp:4 search-space point. Applies to both dsr1-fp4-mi355x-sglang and dsr1-fp4-mi355x-sglang-mtp.

Why

The SGLang DSR1-FP4 MI355X config is the last one still on ROCm 7.0 (the ATOM config is already on 7.2.3). Local MI355X benches with the matched recipe (range-ratio 0.8 + ROCM_QUICK_REDUCE_QUANTIZATION=INT4, MTP):

  • 8k1k decode TPOT is 9-25% faster at every concurrency vs the current rocm700 CI.
  • At c64 the rocm720 build leads both axes: TPOT 16.5 vs 18.1 ms and tput 2954 vs 2841 tok/s/gpu.
  • TP4 at c64/8k1k reaches 4254 tok/s/gpu (> TP8's 2954) — adds a throughput-per-GPU frontier point.

The target image is the same one already validated multi-node in PR #1566, so it's low-risk.

Scope / safety

  • Recipe dsr1_fp4_mi355x_mtp.sh is unchanged (already has QR-INT4 + range-ratio 0.8 + the MLA/MTP flags; handles --tensor-parallel-size=4).
  • fp8-mi355x configs are untouched (still rocm700) — only the two fp4-sglang blocks change.
  • Diff: 5 insertions / 2 deletions in amd-master.yaml + one perf-changelog.yaml trigger entry.

Caveat

The ATOM config keeps 8k1k TP4 commented out (likely an OOM at high conc for that stack). For SGLang, TP4 (4 GPUs, ~90 GB MXFP4 weights) was benched cleanly at c4-64; if CI OOMs at tp4/8k1k, fall back to conc-start: 32 like ATOM.


Note

Low Risk
Benchmark-only YAML and changelog changes; serving recipes are unchanged and the target image was already validated on related MI355X workloads.

Overview
Updates DeepSeek-R1 FP4 MI355X SGLang benchmarks (dsr1-fp4-mi355x-sglang and -mtp) to move off the last ROCm 7.0 image onto lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260519 (ROCm 7.2), matching the build already exercised on multi-node disagg in PR #1566.

The sweep matrix gains tp: 4 alongside existing tp: 88k1k for both configs, plus 1k1k for the MTP variant—to capture a better throughput-per-GPU point. agentic-coding on the non-MTP block is commented out for this validation PR to limit CI cost; perf-changelog.yaml records the image bump, TP4 addition, and local perf notes.

Reviewed by Cursor Bugbot for commit 4f52374. Bugbot is set up for automated code reviews on this repo. Configure here.

Bump image lmsysorg/sglang:v0.5.12-rocm700-mi35x -> lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260519 (ROCm 7.0->7.2; same image validated multi-node in PR #1566). Add tp:4 search-space. Recipe unchanged (already QR-INT4 + range-ratio 0.8). fp8-mi355x untouched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 2, 2026

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

1 similar comment
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 2, 2026

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

…1645)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Comment thread perf-changelog.yaml Outdated
- "Image: lmsysorg/sglang:v0.5.12-rocm700-mi35x -> lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260519 (ROCm 7.0 -> 7.2; same image already validated multi-node in PR #1566)"
- "Add tp:4 to search-space (8k1k for non-mtp; 1k1k+8k1k for mtp) to surface the TP4 throughput-per-GPU frontier"
- "Local MI355X bench (matched cfg: range-ratio 0.8 + ROCM_QUICK_REDUCE_QUANTIZATION=INT4): 8k1k decode TPOT 9-25% faster at every conc vs the rocm700 CI; c64 now leads both axes (TPOT 16.5 vs 18.1 ms, tput 2954 vs 2841 tok/s/gpu). TP4 c64 8k1k = 4254 tok/s/gpu (> TP8 2954)"
pr-link: PR_LINK_PENDING
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The new perf-changelog entry uses pr-link: PR_LINK_PENDING instead of the canonical placeholder https://github.com/SemiAnalysisAI/InferenceX/pull/XXX documented in AGENTS.md:128 and used by all ~100 existing entries. On a clean merge this non-URL string lands on main as-is; on a merge-conflict path it breaks utils/merge_with_reuse.sh (the script filters PR entries by 'XXX' in link or link.endswith('/pull/{pr}'), neither of which matches PR_LINK_PENDING, so the entry is silently skipped and the tool aborts with "No PR contributions found"). Replace with https://github.com/SemiAnalysisAI/InferenceX/pull/XXX (or the final /pull/1645 URL).

Extended reasoning...

What the bug is

perf-changelog.yaml line 3406 sets pr-link: PR_LINK_PENDING for the new entry. The repository's documented convention (AGENTS.md:128) is a real GitHub URL or the literal placeholder https://github.com/SemiAnalysisAI/InferenceX/pull/XXX, and every one of the ~100 existing entries in the file follows that format. PR_LINK_PENDING is neither, and nothing in the repo rewrites it before merge (a repo-wide grep finds the string only in perf-changelog.yaml itself).

Why this matters in practice

The auto-merge helper utils/merge_with_reuse.sh recognises a PR's own changelog entry via this filter (lines ~140-141):

link = str(entry.get('pr-link') or '')
if 'XXX' not in link and not link.endswith(f'/pull/{pr}'):
    continue

PR_LINK_PENDING contains no XXX and does not end with /pull/1645, so the entry is silently skipped. The subsequent guard (if not contribs: sys.exit("No PR contributions found...")) then aborts the reuse-merge tool whenever a perf-changelog.yaml conflict arises against this PR — the documented recovery path is broken.

Step-by-step proof

  1. Author opens PR dsr1-fp4-mi355x-sglang: bump ROCm 7.0->7.2 image + add TP4 search-space #1645 with pr-link: PR_LINK_PENDING in the new changelog entry.
  2. Another perf-changelog PR lands first, creating a merge conflict on perf-changelog.yaml for dsr1-fp4-mi355x-sglang: bump ROCm 7.0->7.2 image + add TP4 search-space #1645.
  3. A maintainer runs utils/merge_with_reuse.sh to auto-resolve, passing pr=1645.
  4. The script iterates pr_data and computes link = "PR_LINK_PENDING". The check 'XXX' not in link and not link.endswith('/pull/1645') evaluates True and True == True, so continue runs and this PR's contribution is dropped.
  5. After the loop, contribs is empty → sys.exit("No PR contributions found ...") fires; the auto-merge aborts.
  6. Even without conflicts, the literal PR_LINK_PENDING ends up on main, breaking the documented convention and any tooling that parses pr-links as URLs.

How to fix

Replace pr-link: PR_LINK_PENDING with either the canonical placeholder https://github.com/SemiAnalysisAI/InferenceX/pull/XXX (which merge_with_reuse.sh and convention both recognise) or, since the PR number is known, the final https://github.com/SemiAnalysisAI/InferenceX/pull/1645. A one-character / one-line diff.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — pr-link was updated to the real URL in the second commit (4f52374) immediately after the PR was opened.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

2 participants