
Add GitHub Action to collect SPEED-Bench AL matrix #1650

Draft
qiching wants to merge 1 commit into SemiAnalysisAI:main from qiching:albecheng/speedbench-al-action

qiching commented Jun 2, 2026

Summary

Adds a push-button GitHub Action that produces the DeepSeek-V4-Pro SPEED-Bench acceptance-length (AL) matrix (thinking_on/off × MTP num_speculative_tokens 1–8) on the self-hosted B300 runners and, optionally, opens a PR that updates benchmarks/speedbench-reference-al.yaml. This is the AL-distribution collection that the synthetic-acceptance MTP framework consumes as its golden reference.

Triggered manually via workflow_dispatch (MTP levels, thinking modes, category, output length, allocation time, and whether to auto-open a PR are all inputs).

What's in this PR

| File | Role |
| --- | --- |
| benchmarks/single_node/dsv4_fp4_b300_vllm_speedbench_matrix.sh | The AL collector. For each (thinking, MTP) cell: start a vLLM server, run SPEED-Bench on one category, derive AL from /metrics (AL = 1 + accepted_tokens / drafts), and emit a YAML matrix identical in shape to benchmarks/speedbench-reference-al.yaml. |
| runners/launch_b300-nv.sh | Two opt-in hooks (both default to prior behavior): BENCH_SCRIPT_OVERRIDE (run a specific script instead of the auto-selected throughput benchmark) and SALLOC_TIME_LIMIT (raise the Slurm time limit; the 16 server starts need more than the 180-min default). |
| .github/workflows/speedbench-al.yml | workflow_dispatch entry point: passes the matrix tunables into the launcher, uploads the matrix + server logs as artifacts, and optionally opens a PR updating the reference YAML. |
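
The AL derivation in the collector row can be sketched as follows. This is a minimal illustration, not the collector's actual code: it assumes the server exposes Prometheus-format counters for accepted tokens and drafts, and the two metric names used as defaults are assumptions that should be checked against what the vLLM build in the benchmark image really exports.

```shell
# Sketch: derive acceptance length (AL) from Prometheus-format metrics text.
# ASSUMPTION: the default metric names below are illustrative, not confirmed
# by the PR; override them to match the names on your server's /metrics.
ACCEPTED_METRIC="${ACCEPTED_METRIC:-vllm:spec_decode_num_accepted_tokens_total}"
DRAFTS_METRIC="${DRAFTS_METRIC:-vllm:spec_decode_num_drafts_total}"

# Reads metrics text on stdin and prints AL = 1 + accepted_tokens / drafts.
compute_al() {
  ACC="$ACCEPTED_METRIC" DR="$DRAFTS_METRIC" awk '
    /^#/ { next }                          # skip HELP/TYPE comment lines
    {
      name = $1
      sub(/\{.*/, "", name)                # drop the {label="..."} suffix
      if (name == ENVIRON["ACC"]) accepted += $NF   # sum across label sets
      if (name == ENVIRON["DR"])  drafts   += $NF
    }
    END {
      if (drafts <= 0) { print "no drafts recorded" > "/dev/stderr"; exit 1 }
      printf "%.2f\n", 1 + accepted / drafts
    }'
}
```

A typical use would be `curl -s http://localhost:8000/metrics | compute_al`, where the host and port are placeholders. The function fails instead of printing a value when no drafts were recorded, so a cell that never ran speculative decoding cannot silently report an AL.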

How it fits together

speedbench-al.yml  --(env: BENCH_SCRIPT_OVERRIDE, SALLOC_TIME_LIMIT, MTP_LIST, ...)-->
  runners/launch_b300-nv.sh  --(salloc + srun --export=ALL into the vLLM container)-->
    dsv4_fp4_b300_vllm_speedbench_matrix.sh  -->  speedbench-reference-al.yaml

The workflow only passes parameters and opens the PR; the launcher acquires the GPU node and enters the container; the collector runs the measurement. This reuses the existing single-node launcher path rather than duplicating the salloc/srun/enroot/mount logic.

Model path handling

The collector serves from SERVE_MODEL="${MODEL_PATH:-$MODEL}":

  • In CI, the workflow sets MODEL to the HF id deepseek-ai/DeepSeek-V4-Pro; the launcher resolves MODEL_PATH to the pre-staged local weights (its basename is in STAGED_MODELS) and mounts them, so the collector serves locally with no download.
  • For a standalone local run, MODEL_PATH is unset and MODEL is itself a local path, so the same script works unchanged.
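
A minimal sketch of that fallback is shown below. The variable names MODEL_PATH and MODEL come from the PR; the function name and the explicit error for the both-unset case are illustrative additions, not something the PR states.

```shell
# Sketch of the serve-path fallback: SERVE_MODEL="${MODEL_PATH:-$MODEL}".
# resolve_serve_model is a hypothetical helper name used for illustration.
resolve_serve_model() {
  # Prefer the launcher-resolved local path; otherwise fall back to MODEL.
  local serve_model="${MODEL_PATH:-${MODEL:-}}"
  if [ -z "$serve_model" ]; then
    echo "neither MODEL_PATH nor MODEL is set" >&2
    return 1
  fi
  printf '%s\n' "$serve_model"
}
```

In CI both variables are set and the staged path wins; in a standalone run only MODEL is set and is used directly.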

Measurement config (for reviewers)

  • Max output sequence length (OSL) = 4096 (--speed-bench-output-len 4096), exposed as the workflow output-len input. This is the recommended setting and is applied to every cell.
  • --max-model-len 16384 is the server's total context budget (real SPEED-Bench prompt length + the 4096-token output), not the OSL. It is a workload constant for this benchmark (there is no ISL/OSL sweep here), which is why it is fixed rather than injected per-config like the throughput recipes.
  • Category defaults to coding; thinking-on cells use chat_template_kwargs = {"thinking": true, "reasoning_effort": "high"} to match the golden/production config.
  • The reference AL matrix was measured with exactly this config, so the values the Action produces are directly comparable.
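
As a worked check of the context budget described above: with a 16384-token total and a 4096-token output reservation, 12288 tokens remain for the prompt. The helper names below are hypothetical; only the two constants come from the PR.

```shell
# Sketch: context-budget arithmetic for the fixed measurement config.
MAX_MODEL_LEN=16384   # --max-model-len from the PR
OUTPUT_LEN=4096       # --speed-bench-output-len from the PR

# Tokens left for the prompt once the output budget is reserved.
prompt_budget() {
  echo $(( MAX_MODEL_LEN - OUTPUT_LEN ))
}

# Succeeds when a prompt of $1 tokens fits alongside the full output budget.
fits_context() {
  [ "$1" -le "$(prompt_budget)" ]
}
```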

Deliberate, documented exception: temporary --chat-template-kwargs shim

The collector contains a small monkeypatch shim (the apply_chat_template_kwargs_shim function) that patches vllm.benchmarks at runtime to add a real --chat-template-kwargs CLI option. This is atypical for this repo (no other script patches a third-party library), so it is called out explicitly:

  • Why it's needed: until vllm-project/vllm#44244 ships in the benchmark image, speed_bench/CustomDataset pre-renders the chat template client-side without chat_template_kwargs and posts to /v1/completions, so thinking mode cannot be enabled via --extra-body or --default-chat-template-kwargs. The shim wires a proper --chat-template-kwargs through get_samples into CustomDataset.sample's apply_chat_template.
  • Why it's safe: it is idempotent (guarded by a marker check, so re-running is a no-op), is applied only when a thinking-on cell is requested, asserts its anchors match exactly, and aborts the whole run with exit 1 if the patch fails rather than silently producing wrong (non-thinking) numbers.
  • Lifecycle: delete the entire shim block once #44244 is released in the benchmark image. It is intentionally self-contained and marked TODO for that removal.

This is the only part that does not look like the rest of the repo; it is a known trade-off, not an oversight.
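
The safety properties listed above (marker-guarded idempotence, exact anchor matching, hard failure) follow a general pattern that can be sketched as below. This is not the shim itself: apply_guarded_patch and its arguments are hypothetical, and the sketch replaces one whole line in an arbitrary text file rather than patching vllm.benchmarks.

```shell
# Sketch of a marker-guarded, fail-fast patch. All names are illustrative.
# Usage: apply_guarded_patch FILE MARKER ANCHOR REPLACEMENT
apply_guarded_patch() {
  local file="$1" marker="$2" anchor="$3" replacement="$4"
  [ -f "$file" ] || { echo "patch target $file not found" >&2; return 1; }

  # Idempotent: a previous run already left the marker, so do nothing.
  if grep -qF -- "$marker" "$file"; then
    return 0
  fi

  # The anchor must occur on exactly one line, or the target has drifted.
  local hits
  hits="$(grep -cF -- "$anchor" "$file")" || true
  if [ "$hits" -ne 1 ]; then
    echo "patch anchor matched $hits lines in $file (expected 1)" >&2
    return 1
  fi

  # Replace the anchor line and record the marker as a comment after it.
  local tmp
  tmp="$(mktemp)" || return 1
  A="$anchor" R="$replacement" M="$marker" awk '
    index($0, ENVIRON["A"]) { print ENVIRON["R"]; print "# " ENVIRON["M"]; next }
    { print }' "$file" > "$tmp" || { rm -f "$tmp"; return 1; }
  cat "$tmp" > "$file" && rm -f "$tmp"    # keep the original file mode
}
```

A caller that must never continue unpatched would invoke it as `apply_guarded_patch "$target" "$marker" "$anchor" "$new_line" || exit 1`, which mirrors the fail-the-whole-run behavior described for the shim.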

Backward compatibility

Both launcher hooks are pure opt-in (${BENCH_SCRIPT_OVERRIDE:-}, ${SALLOC_TIME_LIMIT:-180}) — existing callers that don't set them get exactly the previous behavior. This follows the repo's existing ${VAR:-default} switch pattern (EVAL_ONLY, RUN_EVAL, etc.).
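
The opt-in behavior can be sketched as follows. The two environment variables and the 180-minute default come from the PR; the function names and the way the default script is passed in are assumptions for illustration, not the launcher's actual structure.

```shell
# Sketch of the two opt-in launcher hooks; helper names are hypothetical.

# Prints the script to run: the override if set and non-empty, else the
# auto-selected default passed as $1.
choose_bench_script() {
  local default_script="$1"
  local override="${BENCH_SCRIPT_OVERRIDE:-}"
  if [ -n "$override" ]; then
    printf '%s\n' "$override"
  else
    printf '%s\n' "$default_script"
  fi
}

# Prints the Slurm time limit in minutes, defaulting to the prior 180.
salloc_time_limit() {
  printf '%s\n' "${SALLOC_TIME_LIMIT:-180}"
}
```

A launcher could then pass `--time="$(salloc_time_limit)"` to salloc. Callers that export neither variable get the default script and the 180-minute limit, which is the backward-compatibility property described above.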

Test plan

  • Test via workflow_dispatch with a trimmed matrix (mtp-list: "1", thinking-modes: "off", open-pr: false) to validate the full CI chain (model loads, dataset downloads, AL is computed, artifact uploads).
  • Confirm the produced YAML matches the expected shape and that thinking-on/off level-1 AL values are sane (locally observed: thinking_on: 1.79, thinking_off: 1.92).
  • Full run (mtp-list: "1 2 3 4 5 6 7 8", thinking-modes: "off on") with open-pr: true; review the auto-opened reference-YAML PR before merging.
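
One way to trigger the trimmed run from a terminal is the GitHub CLI's `gh workflow run` with `-f` inputs. The sketch below assumes the workflow file is dispatchable on the target ref and that the input names are exactly those in the test plan; the wrapper function and its DRY_RUN switch are hypothetical.

```shell
# Sketch: dispatch the workflow with the test-plan inputs via the GitHub CLI.
# Usage: dispatch_speedbench_al MTP_LIST THINKING_MODES OPEN_PR
# DRY_RUN (hypothetical, default 1) prints the command instead of running it.
dispatch_speedbench_al() {
  set -- gh workflow run speedbench-al.yml \
    -f "mtp-list=$1" \
    -f "thinking-modes=$2" \
    -f "open-pr=$3"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"          # display only; multi-word values are shown unquoted
  else
    "$@"
  fi
}
```

For the smoke test in the first bullet this would be `dispatch_speedbench_al 1 off false`, and for the full run `DRY_RUN=0 dispatch_speedbench_al "1 2 3 4 5 6 7 8" "off on" true`.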

Push-button (workflow_dispatch) collection of the DeepSeek-V4-Pro
SPEED-Bench acceptance-length matrix (thinking on/off x MTP 1-8) on
self-hosted B300 runners, optionally opening a PR that updates
benchmarks/speedbench-reference-al.yaml.

- benchmarks/single_node/dsv4_fp4_b300_vllm_speedbench_matrix.sh:
  per (thinking, MTP) cell, serve vLLM, run SPEED-Bench, derive AL from
  /metrics, and emit the YAML matrix. Serves from MODEL_PATH (the local
  pre-staged weights resolved by the launcher), falling back to MODEL for
  a standalone local run. Carries a temporary --chat-template-kwargs shim
  until vllm-project/vllm#44244 lands in the benchmark image (idempotent,
  applied only for thinking-on cells).
- runners/launch_b300-nv.sh: add opt-in BENCH_SCRIPT_OVERRIDE and
  SALLOC_TIME_LIMIT hooks; both default to the prior behavior.
- .github/workflows/speedbench-al.yml: workflow_dispatch entry point;
  MODEL is the HF id so the launcher resolves the staged MODEL_PATH.