Add GitHub Action to collect SPEED-Bench AL matrix #1650
Draft
qiching wants to merge 1 commit into
Push-button (workflow_dispatch) collection of the DeepSeek-V4-Pro SPEED-Bench acceptance-length matrix (thinking on/off x MTP 1-8) on self-hosted B300 runners, optionally opening a PR that updates benchmarks/speedbench-reference-al.yaml.

- benchmarks/single_node/dsv4_fp4_b300_vllm_speedbench_matrix.sh: per (thinking, MTP) cell, serve vLLM, run SPEED-Bench, derive AL from /metrics, and emit the YAML matrix. Serves from MODEL_PATH (the local pre-staged weights resolved by the launcher), falling back to MODEL for a standalone local run. Carries a temporary --chat-template-kwargs shim until vllm-project/vllm#44244 lands in the benchmark image (idempotent, applied only for thinking-on cells).
- runners/launch_b300-nv.sh: add opt-in BENCH_SCRIPT_OVERRIDE and SALLOC_TIME_LIMIT hooks; both default to the prior behavior.
- .github/workflows/speedbench-al.yml: workflow_dispatch entry point; MODEL is the HF id so the launcher resolves the staged MODEL_PATH.
Summary
Adds a push-button GitHub Action that produces the DeepSeek-V4-Pro SPEED-Bench acceptance-length (AL) matrix, thinking_on/off × MTP (num_speculative_tokens) 1–8, on the self-hosted B300 runners, and (optionally) opens a PR that updates benchmarks/speedbench-reference-al.yaml. This is the AL-distribution collection that the synthetic-acceptance MTP framework consumes as its golden reference.
Triggered manually via workflow_dispatch (MTP levels, thinking modes, category, output length, allocation time, and whether to auto-open a PR are all inputs).
What's in this PR
- benchmarks/single_node/dsv4_fp4_b300_vllm_speedbench_matrix.sh: per (thinking, MTP) cell, start a vLLM server, run SPEED-Bench on one category, derive AL from /metrics (accepted_tokens / drafts + 1), and emit a YAML matrix identical in shape to benchmarks/speedbench-reference-al.yaml.
- runners/launch_b300-nv.sh: adds two opt-in hooks, BENCH_SCRIPT_OVERRIDE (run a specific script instead of the auto-selected throughput benchmark) and SALLOC_TIME_LIMIT (raise the Slurm time limit; the 16 server starts need more than the 180-min default).
- .github/workflows/speedbench-al.yml: workflow_dispatch entry point; passes the matrix tunables into the launcher, uploads the matrix + server logs as artifacts, and optionally opens a PR updating the reference YAML.
How it fits together
The workflow only passes parameters and opens the PR; the launcher acquires the GPU node and enters the container; the collector runs the measurement. This reuses the existing single-node launcher path rather than duplicating the salloc/srun/enroot/mount logic.
Model path handling
The collector serves from SERVE_MODEL="${MODEL_PATH:-$MODEL}":
- In CI, the workflow sets MODEL to the HF id deepseek-ai/DeepSeek-V4-Pro; the launcher resolves MODEL_PATH to the pre-staged local weights (its basename is in STAGED_MODELS) and mounts them, so the collector serves locally with no download.
- For a standalone local run, MODEL_PATH is unset and MODEL is itself a local path, so the same script works unchanged.
Measurement config (for reviewers)
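For reviewers who want to sanity-check a cell by hand, the AL derivation named in this PR (accepted_tokens / drafts + 1) could be sketched roughly as below. This is an illustration, not the collector's actual code: the counter names are assumptions and should be checked against what the benchmark image's vLLM build exposes on /metrics, and sum_metric and compute_al are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: derive acceptance length (AL) from Prometheus-style metrics text.
# ASSUMPTION: these counter names are illustrative; verify them against the
# /metrics output of the vLLM build in use.
ACCEPTED_METRIC="${ACCEPTED_METRIC:-vllm:spec_decode_num_accepted_tokens_total}"
DRAFTS_METRIC="${DRAFTS_METRIC:-vllm:spec_decode_num_drafts_total}"

# Sum every sample of one counter read from stdin (all label sets combined).
sum_metric() {
  awk -v name="$1" '
    /^#/ { next }                       # skip HELP/TYPE comment lines
    {
      metric = $1
      sub(/\{.*/, "", metric)           # drop the {label="..."} part
      if (metric == name) total += $NF  # last field is the sample value
    }
    END { printf "%.0f\n", total }
  '
}

# AL = accepted_tokens / drafts + 1; the +1 accounts for the token the
# target model emits on every step regardless of draft acceptance.
compute_al() {
  awk -v a="$1" -v d="$2" 'BEGIN {
    if (d <= 0) { print "no drafts recorded" > "/dev/stderr"; exit 1 }
    printf "%.2f\n", a / d + 1
  }'
}

# Possible usage against a running server (host and port are assumptions):
#   metrics="$(curl -s http://localhost:8000/metrics)"
#   accepted="$(printf '%s\n' "$metrics" | sum_metric "$ACCEPTED_METRIC")"
#   drafts="$(printf '%s\n' "$metrics" | sum_metric "$DRAFTS_METRIC")"
#   compute_al "$accepted" "$drafts"
```

Because each cell starts a fresh server, the counters begin at zero for that cell, so no before/after delta is needed in this sketch.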
- Output length is 4096 tokens (--speed-bench-output-len 4096), exposed as the workflow output-len input. This is the recommended setting and is applied to every cell.
- --max-model-len 16384 is the server's total context budget (real SPEED-Bench prompt length + the 4096-token output), not the OSL. It is a workload constant for this benchmark (there is no ISL/OSL sweep here), which is why it is fixed rather than injected per-config like the throughput recipes.
- Category is coding; thinking-on cells use chat_template_kwargs = {"thinking": true, "reasoning_effort": "high"} to match the golden/production config.
Deliberate, documented exception: temporary --chat-template-kwargs shim
The collector contains a small monkeypatch shim (the apply_chat_template_kwargs_shim function) that patches vllm.benchmarks at runtime to add a real --chat-template-kwargs CLI option. This is non-typical for this repo (no other script patches a third-party library), so it is called out explicitly:
- speed_bench/CustomDataset pre-renders the chat template client-side without chat_template_kwargs and posts to /v1/completions, so thinking mode cannot be enabled via --extra-body or --default-chat-template-kwargs. The shim wires a proper --chat-template-kwargs through get_samples into CustomDataset.sample's apply_chat_template.
- It fails the whole run (exit 1) if the patch fails rather than silently producing wrong (non-thinking) numbers.
- It is temporary, to be removed once vllm-project/vllm#44244 lands in the benchmark image; there is a TODO for that removal.
This is the only part that does not look like the rest of the repo; it is a known trade-off, not an oversight.
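The control flow around the shim (apply only for thinking-on cells, apply at most once, abort on failure) could look roughly like the sketch below. It covers the guard logic only: patch_benchmark_client is a hypothetical stand-in for the real patching step, and the marker-file approach to idempotency is an assumption, not necessarily how apply_chat_template_kwargs_shim does it.

```shell
#!/usr/bin/env bash
# Sketch of the guard logic only; the actual patching is out of scope here.
# ASSUMPTIONS: patch_benchmark_client, maybe_apply_shim, and SHIM_MARKER are
# hypothetical names introduced for illustration.
SHIM_MARKER="${SHIM_MARKER:-${TMPDIR:-/tmp}/chat_template_kwargs_shim.applied}"

# Hypothetical stand-in for the real patch step; returns non-zero on failure.
patch_benchmark_client() {
  return 0
}

# Apply the shim for thinking-on cells only, at most once, and fail loudly.
maybe_apply_shim() {
  local thinking="$1"
  if [ "$thinking" != "on" ]; then
    return 0                      # thinking-off cells need no patch
  fi
  if [ -e "$SHIM_MARKER" ]; then
    return 0                      # already applied: idempotent no-op
  fi
  if ! patch_benchmark_client; then
    echo "ERROR: shim failed to apply; refusing to record non-thinking numbers" >&2
    return 1                      # caller is expected to abort the run
  fi
  : > "$SHIM_MARKER"              # record success for later cells
}

# In a collector loop one might write:
#   maybe_apply_shim "$thinking" || exit 1
```

Returning a status from the function and leaving the exit to the caller keeps the guard testable in isolation while still aborting the whole run on failure.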
Backward compatibility
Both launcher hooks are pure opt-in (${BENCH_SCRIPT_OVERRIDE:-}, ${SALLOC_TIME_LIMIT:-180}); existing callers that don't set them get exactly the previous behavior. This follows the repo's existing ${VAR:-default} switch pattern (EVAL_ONLY, RUN_EVAL, etc.).
Test plan
- workflow_dispatch with a trimmed matrix (mtp-list: "1", thinking-modes: "off", open-pr: false) to validate the full CI chain (model loads, dataset downloads, AL is computed, artifact uploads).
- [...] (thinking_on: 1.79, thinking_off: 1.92).
- Full matrix (mtp-list: "1 2 3 4 5 6 7 8", thinking-modes: "off on") with open-pr: true; review the auto-opened reference-YAML PR before merging.
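For reference, the ${VAR:-default} behavior that the Backward compatibility and Model path handling sections rely on can be illustrated with the minimal sketch below. The variable names come from this PR description; the resolve_* helper functions and the example script name are hypothetical and are not claimed to match the launcher's real structure.

```shell
#!/usr/bin/env bash
# Sketch of the ${VAR:-default} opt-in pattern. Variable names are taken from
# the PR description; the resolve_* helpers are hypothetical.

# Use the override script when set and non-empty, else the auto-selected one.
resolve_bench_script() {
  local auto_selected="$1"
  printf '%s\n' "${BENCH_SCRIPT_OVERRIDE:-$auto_selected}"
}

# Slurm time limit in minutes; 180 is the default named in the description.
resolve_time_limit() {
  printf '%s\n' "${SALLOC_TIME_LIMIT:-180}"
}

# Serve from the staged local path when the launcher provides one,
# otherwise fall back to MODEL (which may itself be a local path).
resolve_serve_model() {
  printf '%s\n' "${MODEL_PATH:-$MODEL}"
}

# Possible usage (the script name here is a made-up example):
#   script="$(resolve_bench_script auto_selected_throughput.sh)"
#   limit="$(resolve_time_limit)"
#   model="$(resolve_serve_model)"
```

With none of the variables set, every helper returns the prior default, which is the property that keeps existing callers unaffected.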