[Tracking Issue] Synthetic Acceptance for MTP Benchmarks #1651

@xinli-sw

Proposed Change: Synthetic Acceptance

Recently, all major serving frameworks (vLLM, SGLang, TensorRT-LLM) have added support for synthetic acceptance length (AL) for benchmarking purposes: the server can be configured to accept a specific number of draft tokens so that, on average, it matches a target acceptance length. With this feature, benchmark runs can reproduce the expected generation speed far more consistently, and we gain much more control over the fidelity of the acceptance rate.
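As an illustration of how a server could hit a target AL on average, here is a minimal sketch. It assumes AL counts the one token every step emits plus the accepted draft tokens, and it uses a simple floor-plus-Bernoulli rule. The function name `sample_accepted_drafts` is hypothetical, and this is not the implementation used by vLLM, SGLang, or TensorRT-LLM.

```python
import math
import random


def sample_accepted_drafts(target_al: float, num_draft_tokens: int,
                           rng: random.Random) -> int:
    """Return how many draft tokens to accept in one decoding step.

    Assumption: AL = 1 + (accepted draft tokens per step), so the mean
    number of accepted drafts must equal target_al - 1.
    """
    mean_accepted = target_al - 1.0
    if not 0.0 <= mean_accepted <= num_draft_tokens:
        raise ValueError("target AL must lie in [1, num_draft_tokens + 1]")
    base = math.floor(mean_accepted)
    frac = mean_accepted - base
    # Accept `base` drafts, plus one more with probability `frac`,
    # so the expected value is exactly mean_accepted.
    return base + (1 if rng.random() < frac else 0)


rng = random.Random(0)  # fixed seed so the run is repeatable
steps = 100_000
accepted = [sample_accepted_drafts(2.75, 3, rng) for _ in range(steps)]
empirical_al = 1.0 + sum(accepted) / steps
```

Other sampling rules can give the same mean with a different per-step distribution, which is one reason the rule itself needs to be agreed on across frameworks.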

Doing so will also ensure a consistent benchmarking environment across frameworks (FWs), avoid honest mistakes in client-side settings (e.g. the chat template), and provide uniform performance expectations when random or pseudorandom text is used for benchmarking.

For each model and number of speculated tokens, SemiAnalysis chooses a fixed AL based on the observed MTP AL, calibrated on the SpeedBench coding dataset using model-provider-recommended sampling parameters (see M0). For example, if running DSV4-Pro on SpeedBench coding with 3 MTP draft tokens gives a mean AL of ~2.75, the settings would be roughly [MTP1: AL 1.85, MTP2: AL 2.45, MTP3: AL 2.75] (mock numbers).
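The per-model settings could be stored as a simple lookup table. The sketch below reuses the mock numbers from the example above; the table name `REFERENCE_AL` and the helper `target_al` are hypothetical, and the sanity check assumes AL is bounded by the number of draft tokens plus one.

```python
# Mock numbers copied from the example above; these are not measured values.
REFERENCE_AL = {
    ("DSV4-Pro", 1): 1.85,
    ("DSV4-Pro", 2): 2.45,
    ("DSV4-Pro", 3): 2.75,
}


def target_al(model: str, num_draft_tokens: int) -> float:
    """Look up the fixed AL for a model and draft-token count."""
    key = (model, num_draft_tokens)
    if key not in REFERENCE_AL:
        raise KeyError(
            f"no reference AL for {model} with {num_draft_tokens} draft tokens"
        )
    al = REFERENCE_AL[key]
    # Assumption: AL counts one guaranteed token plus accepted drafts,
    # so it cannot exceed num_draft_tokens + 1.
    if not 1.0 <= al <= num_draft_tokens + 1:
        raise ValueError(f"reference AL {al} is out of range for {key}")
    return al
```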

We propose the following methodology:

M0: New GitHub Action for Auditable Reference AL Generation

We introduce a new GitHub Action that integrates SpeedBench into InferenceX. For each model that supports MTP, we use the workflow to generate reference AL values on the coding dataset, with thinking both on and off. All reference values will be generated using vLLM for consistency, for 1 to 7 speculative tokens.
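To make the size of the reference sweep concrete, the sketch below enumerates the runs described above for a list of models. The dictionary keys and the `reference_jobs` helper are hypothetical and do not reflect the actual workflow inputs.

```python
from itertools import product


def reference_jobs(models):
    """Enumerate reference AL runs: thinking on/off, 1 to 7 speculative tokens."""
    return [
        {
            "model": model,
            "framework": "vllm",   # all reference values come from vLLM
            "dataset": "coding",   # SpeedBench coding dataset
            "thinking": thinking,
            "num_speculative_tokens": k,
        }
        for model, thinking, k in product(models, (True, False), range(1, 8))
    ]


jobs = reference_jobs(["DSV4-Pro"])
```

Each model contributes 2 x 7 = 14 reference runs under this layout.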

M1: Evaluation runs for correct implementation of MTP

During evaluation, the synthetic acceptance feature is disabled and the same model-provider-recommended sampling parameters are used. The eval script now has two parts:

  1. Run SpeedBench. The acceptance rate is collected automatically from the server at the end of the run, and we assert that the achieved acceptance length is above a threshold derived from the reference values.
  2. Run the existing GSM8k evals to ensure accuracy still passes.
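The first check could be sketched as follows. This assumes the server exposes counts of accepted draft tokens and of draft steps, and that AL is computed as one plus their ratio. The 5% relative tolerance and both function names are placeholders, since the issue does not specify how the threshold is derived.

```python
def mean_acceptance_length(num_accepted_tokens: int, num_drafts: int) -> float:
    """AL = 1 + accepted draft tokens per draft step (assumed definition)."""
    if num_drafts <= 0:
        raise ValueError("num_drafts must be positive")
    return 1.0 + num_accepted_tokens / num_drafts


def check_acceptance_length(measured_al: float, reference_al: float,
                            rel_tolerance: float = 0.05) -> float:
    """Fail if the measured AL falls below the reference minus a tolerance.

    rel_tolerance is a hypothetical value chosen for illustration only.
    """
    threshold = reference_al * (1.0 - rel_tolerance)
    if measured_al < threshold:
        raise AssertionError(
            f"measured AL {measured_al:.3f} is below threshold {threshold:.3f}"
        )
    return threshold


measured = mean_acceptance_length(num_accepted_tokens=1750, num_drafts=1000)
check_acceptance_length(measured, reference_al=2.75)
```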

M2: Apply Synthetic Acceptance During Benchmarks

During benchmarking, synthetic acceptance is enabled with the target AL, and the prompt generation setup is the same as for non-MTP runs. We will roll this out on single-node tests first, followed by multi-node tests. To keep the benchmarks fair, we will ensure that the implementation of synthetic MTP acceptance is open, auditable, and mathematically equivalent across all participating FWs.

  • M0: New GitHub Action for Auditable Reference AL Generation
  • M1: Evaluation runs for correct implementation of MTP
  • M2: Apply Synthetic Acceptance During Benchmarks
