[Tracking Issue] Synthetic Acceptance for MTP Benchmarks #1651

# Proposed Change: Synthetic Acceptance

Recently, all major serving frameworks (vLLM, SGLang, TensorRT-LLM) have added support for synthetic acceptance length (AL) for benchmarking purposes: the server can be configured to accept a specific number of draft tokens so that it matches, on average, a target acceptance length. Using this feature, we can match expected generation speeds with far higher consistency and have much more control over the fidelity of the acceptance rate.
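To make the idea concrete, here is a minimal sketch of one way a server could pick the number of accepted draft tokens per step so that the mean matches a target AL. It is not the implementation used by vLLM, SGLang, or TensorRT-LLM. It assumes that AL counts the bonus token (AL = 1 + accepted drafts) and uses stochastic rounding; the function names are hypothetical.

```python
import math
import random


def sample_accepted_drafts(target_al: float, num_draft_tokens: int,
                           rng: random.Random) -> int:
    """Return how many draft tokens to accept in one verification step.

    Assumption: AL = 1 + accepted drafts (the bonus token is counted),
    so the mean number of accepted drafts must equal target_al - 1.
    """
    mean_accepted = target_al - 1.0
    if not 0.0 <= mean_accepted <= num_draft_tokens:
        raise ValueError("target_al must lie in [1, 1 + num_draft_tokens]")
    low = math.floor(mean_accepted)
    frac = mean_accepted - low
    # Stochastic rounding: accept low + 1 drafts with probability frac,
    # otherwise accept low drafts. The expected value is mean_accepted.
    return low + (1 if rng.random() < frac else 0)


def simulated_mean_al(target_al: float, num_draft_tokens: int,
                      steps: int = 100_000, seed: int = 0) -> float:
    """Average AL over many simulated steps, for a quick sanity check."""
    rng = random.Random(seed)
    total = sum(1 + sample_accepted_drafts(target_al, num_draft_tokens, rng)
                for _ in range(steps))
    return total / steps
```

With a target AL of 2.75 and 3 draft tokens, each step accepts either 1 or 2 drafts, and the simulated mean converges to the target as the number of steps grows. Other schemes (for example, independent per-position acceptance) can reach the same mean with a different distribution, which is why the equivalence requirement in M2 matters.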

Doing so will also ensure consistent benchmarking environments across frameworks (FWs), avoid honest mistakes in client-side settings (e.g. the chat template), and provide uniform performance expectations when using random and pseudorandom text for benchmarking purposes.

For each model and number of speculated tokens, SemiAnalysis chooses a fixed AL based on the observed MTP AL, calibrated on the SpeedBench coding dataset using model-provider-recommended sampling parameters (see M0). For example, if running DSV4-Pro on SpeedBench coding with 3 MTP draft tokens gives a mean AL of ~2.75, the settings would be roughly [MTP1: AL 1.85, MTP2: AL 2.45, MTP3: AL 2.75] (mock numbers).
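As a rough illustration, the fixed AL values could live in a lookup keyed by model and number of draft tokens. The table below reuses the mock numbers from the example above; the names `FIXED_AL`, `fixed_al`, and `implied_acceptance_rate` are hypothetical, and the conversion to a per-draft-token acceptance rate assumes AL = 1 + accepted drafts.

```python
# Mock numbers from the example above; real values would come from M0.
FIXED_AL = {
    "DSV4-Pro": {1: 1.85, 2: 2.45, 3: 2.75},
}


def fixed_al(model: str, num_draft_tokens: int) -> float:
    """Look up the fixed target AL for a model and draft-token count."""
    return FIXED_AL[model][num_draft_tokens]


def implied_acceptance_rate(al: float, num_draft_tokens: int) -> float:
    """Average fraction of draft tokens accepted per step.

    Assumption: AL counts the bonus token, so accepted drafts = AL - 1.
    """
    return (al - 1.0) / num_draft_tokens


if __name__ == "__main__":
    for k in sorted(FIXED_AL["DSV4-Pro"]):
        al = fixed_al("DSV4-Pro", k)
        print(f"MTP{k}: AL {al:.2f}, rate {implied_acceptance_rate(al, k):.3f}")
```

Under that assumption, the mock MTP3 setting of AL 2.75 corresponds to accepting 1.75 of 3 draft tokens per step on average.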

We propose the following methodology:

## M0: New GitHub Action for Auditable Reference AL Generation
We introduce a new GitHub Action that integrates [SpeedBench](https://huggingface.co/datasets/nvidia/SPEED-Bench) into InferenceX. For each model that supports MTP, we use the workflow to generate reference AL values on the coding dataset, with thinking both on and off. All reference values will be generated using vLLM for consistency, for speculative token counts from 1 to 7.
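A minimal sketch of how the workflow might turn end-of-run totals into a reference table follows. The counter names and the `RunCounters` type are hypothetical and do not correspond to actual vLLM metric names; the formula again assumes AL = 1 + accepted drafts per verification step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunCounters:
    """Hypothetical end-of-run totals; field names are illustrative only."""
    accepted_draft_tokens: int
    verification_steps: int


def mean_acceptance_length(counters: RunCounters) -> float:
    """Mean AL for one run, counting the bonus token of each step."""
    if counters.verification_steps <= 0:
        raise ValueError("verification_steps must be positive")
    return 1.0 + counters.accepted_draft_tokens / counters.verification_steps


def build_reference_table(runs: dict) -> dict:
    """Map (thinking_on, num_draft_tokens) -> mean AL, rounded for reporting.

    `runs` maps the same keys to RunCounters, one entry per SpeedBench run.
    """
    return {key: round(mean_acceptance_length(c), 3)
            for key, c in sorted(runs.items())}
```

For example, a run with 1750 accepted draft tokens over 1000 verification steps yields a mean AL of 2.75. A full table would hold one entry per thinking mode and per draft-token count from 1 to 7.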

## M1: Evaluation runs for correct implementation of MTP

During evaluation the synthetic acceptance feature is disabled and the same model-provider-recommended sampling parameters are used. The eval script now includes two pieces:

1. Run SpeedBench. The acceptance rate is automatically collected from the server at the end of the run, and we assert that the achieved acceptance length is above a threshold derived from the reference values.
2. Run the existing GSM8K evals to ensure the accuracy checks still pass.
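The assertion in step 1 could look roughly like the sketch below. The issue does not specify how the threshold is derived from the reference value, so the relative `tolerance` here is a hypothetical placeholder, as is the function name.

```python
def check_acceptance_length(achieved_al: float, reference_al: float,
                            tolerance: float = 0.05) -> float:
    """Fail if the achieved AL is not above the threshold; return the threshold.

    Assumption: the threshold is the reference AL minus a relative slack.
    The 5% default is illustrative, not a value from this proposal.
    """
    threshold = reference_al * (1.0 - tolerance)
    if not achieved_al > threshold:
        raise AssertionError(
            f"achieved AL {achieved_al:.3f} is not above threshold "
            f"{threshold:.3f} (reference {reference_al:.3f})"
        )
    return threshold
```

With a reference AL of 2.75 and the illustrative 5% slack, an achieved AL of 2.70 passes while 2.50 fails, which would flag a likely MTP misconfiguration before the GSM8K step runs.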

## M2: Apply Synthetic Acceptance During Benchmarks

During benchmarking, synthetic acceptance is enabled with the target AL, and the same prompt generation setup is used as for non-MTP runs. We will roll this out with single-node tests first, followed by multi-node tests. To keep the benchmarks fair, we require the implementation of synthetic MTP acceptance to be open, auditable, and mathematically equivalent across all participating frameworks.


- [ ] M0: New GitHub Action for Auditable Reference AL Generation
- [ ] M1: Evaluation runs for correct implementation of MTP
- [ ] M2: Apply Synthetic Acceptance During Benchmarks
