Add GitHub Action to collect SPEED-Bench AL matrix by qiching · Pull Request #1650 · SemiAnalysisAI/InferenceX

qiching · 2026-06-02T21:33:20Z

Summary

Adds a push-button GitHub Action that produces the DeepSeek-V4-Pro SPEED-Bench acceptance-length (AL) matrix — thinking_on/off × MTP (num_speculative_tokens) 1–8 — on the self-hosted B300 runners, and (optionally) opens a PR that updates benchmarks/speedbench-reference-al.yaml. This is the AL-distribution collection that the synthetic-acceptance MTP framework consumes as its golden reference.

Triggered manually via workflow_dispatch (MTP levels, thinking modes, category, output length, allocation time, and whether to auto-open a PR are all inputs).

What's in this PR

| File | Role |
| --- | --- |
| `benchmarks/single_node/dsv4_fp4_b300_vllm_speedbench_matrix.sh` | The AL collector. For each `(thinking, MTP)` cell: start a vLLM server, run SPEED-Bench on one category, derive AL from `/metrics` (`accepted_tokens / drafts + 1`), and emit a YAML matrix identical in shape to `benchmarks/speedbench-reference-al.yaml`. |
| `runners/launch_b300-nv.sh` | Two opt-in hooks (both default to prior behavior): `BENCH_SCRIPT_OVERRIDE` (run a specific script instead of the auto-selected throughput benchmark) and `SALLOC_TIME_LIMIT` (raise the Slurm time limit; the 16 server starts need more than the 180-min default). |
| `.github/workflows/speedbench-al.yml` | `workflow_dispatch` entry point: passes the matrix tunables into the launcher, uploads the matrix + server logs as artifacts, and optionally opens a PR updating the reference YAML. |
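The AL formula in the collector row, `accepted_tokens / drafts + 1`, can be sketched as a small filter over Prometheus-style `/metrics` text. This is a minimal illustration, not the collector's code: the helper name `compute_al` is hypothetical, and the two counter names are assumptions that should be checked against the `/metrics` output of the vLLM build in the benchmark image.

```shell
#!/usr/bin/env bash
# Sketch: derive acceptance length (AL) from Prometheus-style /metrics text.
# ASSUMPTION: these counter names are illustrative defaults; override them
# if the vLLM build in the image exposes different names.
ACCEPTED_METRIC="${ACCEPTED_METRIC:-vllm:spec_decode_num_accepted_tokens_total}"
DRAFTS_METRIC="${DRAFTS_METRIC:-vllm:spec_decode_num_drafts_total}"

# Reads metrics text on stdin and prints AL = accepted / drafts + 1.
compute_al() {
  awk -v acc="$ACCEPTED_METRIC" -v drf="$DRAFTS_METRIC" '
    /^#/ { next }                       # skip HELP/TYPE comment lines
    {
      name = $1
      sub(/\{.*/, "", name)             # drop the {label="..."} suffix
      if (name == acc) accepted += $NF  # sum across label sets
      if (name == drf) drafts   += $NF
    }
    END {
      if (drafts <= 0) { print "no drafts recorded" > "/dev/stderr"; exit 1 }
      printf "%.2f\n", accepted / drafts + 1
    }'
}

# Example use against a running server (URL is an assumption):
#   curl -s http://localhost:8000/metrics | compute_al
```

Because each cell starts its own server, the counters presumably begin at zero for every cell; if a server were reused across cells, the ratio would need to be taken over the difference between two scrapes instead.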

How it fits together

speedbench-al.yml  --(env: BENCH_SCRIPT_OVERRIDE, SALLOC_TIME_LIMIT, MTP_LIST, ...)-->
  runners/launch_b300-nv.sh  --(salloc + srun --export=ALL into the vLLM container)-->
    dsv4_fp4_b300_vllm_speedbench_matrix.sh  -->  speedbench-reference-al.yaml

The workflow only passes parameters and opens the PR; the launcher acquires the GPU node and enters the container; the collector runs the measurement. This reuses the existing single-node launcher path rather than duplicating the salloc/srun/enroot/mount logic.

Model path handling

The collector serves from `SERVE_MODEL="${MODEL_PATH:-$MODEL}"`:

- In CI, the workflow sets MODEL to the HF id deepseek-ai/DeepSeek-V4-Pro; the launcher resolves MODEL_PATH to the pre-staged local weights (its basename is in STAGED_MODELS) and mounts them, so the collector serves locally with no download.
- For a standalone local run, MODEL_PATH is unset and MODEL is itself a local path, so the same script works unchanged.
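The two cases above follow directly from the `${VAR:-default}` expansion. A minimal sketch, assuming a hypothetical helper `resolve_serve_model` and made-up local paths (neither is from the PR):

```shell
#!/usr/bin/env bash
# Sketch of the fallback: prefer MODEL_PATH, else fall back to MODEL.
# resolve_serve_model is a hypothetical name used only for illustration.
resolve_serve_model() {
  local model_path="${1-}"
  local model="${2-}"
  # ":-" falls back when model_path is unset OR empty.
  printf '%s\n' "${model_path:-$model}"
}

# CI-like case: the launcher resolved a staged path (path is made up).
#   resolve_serve_model "/staged/DeepSeek-V4-Pro" "deepseek-ai/DeepSeek-V4-Pro"
# Standalone case: no MODEL_PATH, MODEL is already a local path (made up).
#   resolve_serve_model "" "/home/me/weights/DeepSeek-V4-Pro"
```

Note that `:-` treats an empty `MODEL_PATH` the same as an unset one, so exporting it as an empty string still falls back to `MODEL`.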

Measurement config (for reviewers)

- Max OSL = 4096 (--speed-bench-output-len 4096), exposed as the workflow output-len input. This is the recommended setting and is applied to every cell.
- --max-model-len 16384 is the server's total context budget (real SPEED-Bench prompt length + the 4096-token output), not the OSL. It is a workload constant for this benchmark (there is no ISL/OSL sweep here), which is why it is fixed rather than injected per-config like the throughput recipes.
- Category defaults to coding; thinking-on cells use chat_template_kwargs = {"thinking": true, "reasoning_effort": "high"} to match the golden/production config.
- The reference AL matrix was measured with exactly this config, so the values the Action produces are directly comparable.
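To make the shape of the sweep concrete, the sketch below enumerates the `(thinking, MTP)` cells with the settings listed above and prints one line per cell. It launches nothing. `MTP_LIST` appears in this PR; `THINKING_MODES`, `OUTPUT_LEN`, and `list_cells` are hypothetical names, and printing `none` for thinking-off cells is an assumption about how those cells are configured.

```shell
#!/usr/bin/env bash
# Sketch: enumerate the matrix cells and their per-cell settings (dry run).
# THINKING_MODES, OUTPUT_LEN and list_cells are hypothetical names.
MTP_LIST="${MTP_LIST:-1 2 3 4 5 6 7 8}"
THINKING_MODES="${THINKING_MODES:-off on}"
OUTPUT_LEN="${OUTPUT_LEN:-4096}"
MAX_MODEL_LEN=16384   # fixed workload constant, not swept

list_cells() {
  local mode mtp kwargs
  for mode in $THINKING_MODES; do
    if [ "$mode" = "on" ]; then
      kwargs='{"thinking": true, "reasoning_effort": "high"}'
    else
      kwargs='none'   # assumption: thinking-off cells pass no kwargs
    fi
    for mtp in $MTP_LIST; do
      printf 'thinking_%s mtp=%s output_len=%s max_model_len=%s chat_template_kwargs=%s\n' \
        "$mode" "$mtp" "$OUTPUT_LEN" "$MAX_MODEL_LEN" "$kwargs"
    done
  done
}
```

With the defaults this yields 2 × 8 = 16 cells, which matches the 16 server starts mentioned for the Slurm time limit.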

Deliberate, documented exception: temporary `--chat-template-kwargs` shim

The collector contains a small monkeypatch shim (the apply_chat_template_kwargs_shim function) that patches vllm.benchmarks at runtime to add a real --chat-template-kwargs CLI option. This is atypical for this repo (no other script patches a third-party library), so it is called out explicitly here:

- Why it's needed: until vllm-project/vllm#44244 ships in the benchmark image, speed_bench/CustomDataset pre-renders the chat template client-side without chat_template_kwargs and posts to /v1/completions, so thinking mode cannot be enabled via --extra-body or --default-chat-template-kwargs. The shim wires a proper --chat-template-kwargs through get_samples into CustomDataset.sample's apply_chat_template.
- Why it's safe: it is idempotent (guarded by a marker check, so re-running is a no-op), is applied only when a thinking-on cell is requested, asserts its anchors match exactly, and aborts the whole run with exit 1 if the patch fails rather than silently producing wrong (non-thinking) numbers.
- Lifecycle: delete the entire shim block once #44244 is released in the benchmark image. It is intentionally self-contained and marked TODO for that removal.
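The safety properties listed above (marker guard, exact anchor assertion, hard failure) can be illustrated with a generic guarded patch helper. This is not the PR's apply_chat_template_kwargs_shim; `patch_once`, its arguments, and the marker-comment format are hypothetical, and the sketch replaces a whole line rather than performing the real edit to vllm.benchmarks.

```shell
#!/usr/bin/env bash
# Sketch of a guarded, idempotent in-place patch (hypothetical helper).
# usage: patch_once FILE MARKER ANCHOR REPLACEMENT
patch_once() {
  local file="$1"
  local marker="$2"
  local anchor="$3"
  local replacement="$4"
  local hits tmp
  # Idempotence guard: a previous run already left the marker behind.
  if grep -qF -- "$marker" "$file"; then
    return 0
  fi
  # Anchor assertion: exactly one line may contain the anchor text.
  hits="$(grep -cF -- "$anchor" "$file" || true)"
  if [ "$hits" -ne 1 ]; then
    echo "patch_once: expected 1 anchor in $file, found $hits" >&2
    return 1
  fi
  tmp="$(mktemp)" || return 1
  # Replace the anchor line with the replacement plus a marker comment.
  # Values go through the environment so awk does not interpret escapes.
  if A="$anchor" R="$replacement" M="$marker" awk '
       index($0, ENVIRON["A"]) { print ENVIRON["R"] "  # " ENVIRON["M"]; next }
       { print }
     ' "$file" > "$tmp"; then
    cat "$tmp" > "$file"   # keep the original file's permissions
    rm -f "$tmp"
  else
    rm -f "$tmp"
    return 1
  fi
}

# A caller that must not continue on failure would use:
#   patch_once "$target" "$marker" "$anchor" "$replacement" || exit 1
```

Failing on zero or multiple anchor matches is what keeps a changed upstream file from being patched in the wrong place; the run stops instead of measuring a non-thinking configuration.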

This is the only part that does not look like the rest of the repo; it is a known trade-off, not an oversight.

Backward compatibility

Both launcher hooks are pure opt-in (${BENCH_SCRIPT_OVERRIDE:-}, ${SALLOC_TIME_LIMIT:-180}) — existing callers that don't set them get exactly the previous behavior. This follows the repo's existing ${VAR:-default} switch pattern (EVAL_ONLY, RUN_EVAL, etc.).
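A minimal sketch of how such opt-in hooks behave, assuming hypothetical helper names (`select_bench_script`, `salloc_time_limit`) and a placeholder auto-selected script name; the actual logic in runners/launch_b300-nv.sh may be structured differently:

```shell
#!/usr/bin/env bash
# Sketch of the opt-in pattern; helper names are hypothetical.

# Prints the override if set and non-empty, else the auto-selected script.
select_bench_script() {
  local auto_selected="$1"
  local override="${BENCH_SCRIPT_OVERRIDE:-}"
  if [ -n "$override" ]; then
    printf '%s\n' "$override"
  else
    printf '%s\n' "$auto_selected"
  fi
}

# Prints the Slurm time limit in minutes, defaulting to 180.
salloc_time_limit() {
  printf '%s\n' "${SALLOC_TIME_LIMIT:-180}"
}
```

A caller that exports neither variable gets the auto-selected script and the 180-minute limit, which is the previous behavior.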

Test plan

- Test via workflow_dispatch with a trimmed matrix (mtp-list: "1", thinking-modes: "off", open-pr: false) to validate the full CI chain (model loads, dataset downloads, AL is computed, artifact uploads).
- Confirm the produced YAML matches the expected shape and that the thinking-on/off AL values at MTP level 1 are sane (locally observed: thinking_on: 1.79, thinking_off: 1.92).
- Full run (mtp-list: "1 2 3 4 5 6 7 8", thinking-modes: "off on") with open-pr: true; review the auto-opened reference-YAML PR before merging.
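For reviewers who prefer the command line to the Actions UI, the dispatches in this plan could be issued with the GitHub CLI. The sketch below only assembles the `gh workflow run` arguments from the input names used above; `dispatch_speedbench` and the `DRY_RUN` switch are hypothetical, and by default the command is printed rather than executed.

```shell
#!/usr/bin/env bash
# Sketch: build the gh CLI invocation for a speedbench-al.yml dispatch.
# dispatch_speedbench and DRY_RUN are hypothetical; DRY_RUN defaults to 1,
# so the arguments are printed one per line instead of being executed.
dispatch_speedbench() {
  local mtp_list="$1"
  local thinking_modes="$2"
  local open_pr="$3"
  set -- gh workflow run speedbench-al.yml \
    -f "mtp-list=${mtp_list}" \
    -f "thinking-modes=${thinking_modes}" \
    -f "open-pr=${open_pr}"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$@"
  else
    "$@"
  fi
}

# Trimmed smoke test from the plan:
#   dispatch_speedbench "1" "off" "false"
# Full matrix:
#   dispatch_speedbench "1 2 3 4 5 6 7 8" "off on" "true"
```

Setting `DRY_RUN=0` runs the command for real, which requires an authenticated `gh` and permission to dispatch workflows in the repository.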

Push-button (workflow_dispatch) collection of the DeepSeek-V4-Pro SPEED-Bench acceptance-length matrix (thinking on/off x MTP 1-8) on self-hosted B300 runners, optionally opening a PR that updates benchmarks/speedbench-reference-al.yaml.

- benchmarks/single_node/dsv4_fp4_b300_vllm_speedbench_matrix.sh: per (thinking, MTP) cell, serve vLLM, run SPEED-Bench, derive AL from /metrics, and emit the YAML matrix. Serves from MODEL_PATH (the local pre-staged weights resolved by the launcher), falling back to MODEL for a standalone local run. Carries a temporary --chat-template-kwargs shim until vllm-project/vllm#44244 lands in the benchmark image (idempotent, applied only for thinking-on cells).
- runners/launch_b300-nv.sh: add opt-in BENCH_SCRIPT_OVERRIDE and SALLOC_TIME_LIMIT hooks; both default to the prior behavior.
- .github/workflows/speedbench-al.yml: workflow_dispatch entry point; MODEL is the HF id so the launcher resolves the staged MODEL_PATH.

github-project-automation Bot added this to InferenceMAX Board Jun 2, 2026

xinli-sw mentioned this pull request Jun 2, 2026 in [Tracking Issue] Synthetic Acceptance for MTP Benchmarks #1651 (Open, 3 tasks)

qiching wants to merge 1 commit into SemiAnalysisAI:main from qiching:albecheng/speedbench-al-action
