From a79476d39ff0f00b18c77004f703576b58540935 Mon Sep 17 00:00:00 2001
From: Chun Fang
Date: Mon, 1 Jun 2026 12:47:41 +0000
Subject: [PATCH 1/2] Enable Rust frontend (VLLM_USE_RUST_FRONTEND=1)

The Rust frontend does not change the kernels, attention, MoE GEMM, or
KV cache, so throughput and TPOT should be unaffected. TTFT should
improve, because the Rust frontend reduces the frontend CPU time spent
between receiving a request and generating the first token.
---
 benchmarks/single_node/minimaxm2.5_fp4_mi355x.sh | 1 +
 1 file changed, 1 insertion(+)

diff --git a/benchmarks/single_node/minimaxm2.5_fp4_mi355x.sh b/benchmarks/single_node/minimaxm2.5_fp4_mi355x.sh
index 4d8fbc9ed..df9323d0d 100755
--- a/benchmarks/single_node/minimaxm2.5_fp4_mi355x.sh
+++ b/benchmarks/single_node/minimaxm2.5_fp4_mi355x.sh
@@ -25,6 +25,7 @@ if [ -n "$ROCR_VISIBLE_DEVICES" ]; then
 fi
 
 export VLLM_ROCM_USE_AITER=1
+export VLLM_USE_RUST_FRONTEND=1
 EXTRA_VLLM_ARGS=""
 # if [ "$TP" -ge 4 ]; then
 #   # AITER CK fused MoE kernels lack compiled tiles for N=intermediate_size/TP

From 7d601a090f5965770c4e27cbba706768ddde4eaf Mon Sep 17 00:00:00 2001
From: Chun Fang
Date: Mon, 1 Jun 2026 12:58:10 +0000
Subject: [PATCH 2/2] Update perf-changelog

---
 perf-changelog.yaml | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/perf-changelog.yaml b/perf-changelog.yaml
index 61ce924d5..d5a28c244 100644
--- a/perf-changelog.yaml
+++ b/perf-changelog.yaml
@@ -3348,3 +3348,11 @@
   description:
     - "Add MTP speculative-decoding sibling for dsv4-fp4-mi355x-vllm (model: deepseek-ai/DeepSeek-V4-Pro) on vllm/vllm-openai-rocm:v0.22.0, per vllm-project/vllm#43385"
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1630
+
+- config-keys:
+    - minimaxm2.5-fp4-mi355x-vllm
+  description:
+    - "Enable vLLM Rust request frontend by exporting VLLM_USE_RUST_FRONTEND=1 in benchmarks/single_node/minimaxm2.5_fp4_mi355x.sh (v0.22.0 ROCm image ships the vllm-rs binary, so the flag engages it). Environment-only change; serve flags, TP/EP, attention/kernel settings unchanged"
+    - "The Rust frontend replaces only the Python serving/API layer (HTTP, tokenization, scheduling glue, detokenization) and spawns the same Python EngineCore, so GPU kernels/attention/MoE GEMM/KV cache are untouched"
+    - "A/B sweep (28 single-node points, 1k1k + 8k1k, TP 1/2/4) vs the Python-frontend baseline (run 26696260751): throughput Pareto-neutral (peak tok/s/GPU within 1.5%, frontiers coincident) and TPOT flat (+-0.5%); TTFT improves ~8% at 1k1k and ~22% at 8k1k (every point), the expected signature of lower frontend CPU latency before first token, scaling with input length"
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1634