Add metrics for speculative decoding (acceptance rate, average acceptance length) by scottjlee · Pull Request #11441 · sgl-project/sglang

scottjlee · 2025-10-10T23:09:05Z

Re-attempt of #11144, which introduced an additional device sync per verify pass (see discussion here). This PR instead iterates over a list that already exists on the CPU, which avoids the device sync (see here for the change).

Motivation

To better understand results when using speculative decoding, it would be helpful to have request-level information on the acceptance rate and average acceptance length.

Modifications

Add the following metrics tracked at the request level:

spec_accept_rate: Speculative decoding acceptance rate for this request (# accepted tokens / # draft tokens).
spec_accept_length: Number of accepted tokens for each verification pass, averaged across all passes for this request.
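The two definitions above can be sketched as a small per-request accumulator. This is a minimal illustration that follows the written definitions literally; the class name `SpecStats` and its fields are hypothetical and are not taken from the sglang code. Which tokens count as "accepted" in the real implementation (for example, whether the token produced by the target model on each pass is included) is not specified here, so treat the counting convention as an assumption.

```python
from dataclasses import dataclass


@dataclass
class SpecStats:
    """Hypothetical per-request counters; names are illustrative only."""

    verify_ct: int = 0        # number of verification passes seen so far
    accepted_tokens: int = 0  # accepted tokens summed over all passes
    draft_tokens: int = 0     # draft tokens proposed, summed over all passes

    def record_pass(self, num_draft: int, num_accepted: int) -> None:
        """Record the outcome of one verification pass."""
        self.verify_ct += 1
        self.draft_tokens += num_draft
        self.accepted_tokens += num_accepted

    def accept_rate(self) -> float:
        """# accepted tokens / # draft tokens (0.0 if nothing was drafted)."""
        if self.draft_tokens == 0:
            return 0.0
        return self.accepted_tokens / self.draft_tokens

    def accept_length(self) -> float:
        """Accepted tokens per verification pass, averaged over all passes."""
        if self.verify_ct == 0:
            return 0.0
        return self.accepted_tokens / self.verify_ct


stats = SpecStats()
stats.record_pass(num_draft=3, num_accepted=2)
stats.record_pass(num_draft=3, num_accepted=1)
```

With these two made-up passes, `stats.accept_rate()` is 3 / 6 and `stats.accept_length()` is 3 / 2.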

Both metrics are reported in the final log output for each request (example below):

[2025-10-02 23:42:16] Finish: obj=GenerateReqInput(video_data=None, rid='5799660f79fd4dc481ac2226d5b97102', return_logprob=False, logprob_start_len=-1, top_logprobs_num=0, token_ids_logprob=None, return_text_in_logprobs=True, stream=True, log_metrics=True, return_hidden_states=False, modalities=[], session_params=None, lora_id=None, custom_logit_processor=None, bootstrap_host=None, bootstrap_port=None, bootstrap_room=None, bootstrap_pair_key=None, data_parallel_rank=None, background=False, conversation_id=None, priority=None, extra_key=None, no_logs=False, custom_labels=None, label=None, return_bytes=False), out={'meta_info': {'id': '5799660f79fd4dc481ac2226d5b97102', 'finish_reason': {'type': 'stop', 'matched': 2}, 'prompt_tokens': 23, 'weight_version': 'default', 'completion_tokens': 32, 'cached_tokens': 0, 'spec_verify_ct': 14, 'spec_accept_rate': 0.40476190476190477, 'spec_accept_length': 2.2857142857142856, 'e2e_latency': 0.15333986282348633}}

We also expose the batch-level acceptance rate (total accepted tokens / total draft tokens in the batch) as a Prometheus metric, sglang:spec_accept_rate. Example output:

curl -s http://localhost:8001/metrics | grep -E "spec_accept"
# HELP sglang:spec_accept_length The average acceptance length of speculative decoding.
# TYPE sglang:spec_accept_length gauge
sglang:spec_accept_length{engine_type="unified",model_name="meta-llama/Llama-2-7b-chat-hf",pp_rank="0",tp_rank="0"} 2.4782608695652173
# HELP sglang:spec_accept_rate The average acceptance rate of speculative decoding (`accepted tokens / total draft tokens` in batch).
# TYPE sglang:spec_accept_rate gauge
sglang:spec_accept_rate{engine_type="unified",model_name="meta-llama/Llama-2-7b-chat-hf",pp_rank="0",tp_rank="0"} 0.8260869565217391
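For scripted checks, the gauge values can be pulled out of the text exposition output shown above. The following is a rough sketch using only the standard library; `parse_gauges` is a hypothetical helper, not part of sglang, and the sample text is copied from the output above. It keeps one value per metric name, so it assumes a single label set per metric.

```python
import re

# Sample copied from the /metrics output above.
SAMPLE = '''\
# HELP sglang:spec_accept_length The average acceptance length of speculative decoding.
# TYPE sglang:spec_accept_length gauge
sglang:spec_accept_length{engine_type="unified",model_name="meta-llama/Llama-2-7b-chat-hf",pp_rank="0",tp_rank="0"} 2.4782608695652173
# TYPE sglang:spec_accept_rate gauge
sglang:spec_accept_rate{engine_type="unified",model_name="meta-llama/Llama-2-7b-chat-hf",pp_rank="0",tp_rank="0"} 0.8260869565217391
'''

# Metric name, optional {labels}, whitespace, then the sample value.
_LINE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{.*\})?\s+(\S+)$')


def parse_gauges(text: str, prefix: str) -> dict[str, float]:
    """Return {metric_name: value} for metrics whose name starts with prefix."""
    values: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _LINE.match(line)
        if match and match.group(1).startswith(prefix):
            # A later sample with the same name overwrites an earlier one.
            values[match.group(1)] = float(match.group(2))
    return values


gauges = parse_gauges(SAMPLE, "sglang:spec_accept")
```

In practice the text would come from the server's `/metrics` endpoint, as in the `curl` command above, rather than from a hard-coded string.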

Accuracy Tests

Should not affect model outputs or accuracy.

Benchmarking and Profiling

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.

scottjlee · 2025-10-10T23:14:43Z

+            req.spec_accepted_tokens += (
+                sum(1 for idx in accept_index_row if idx != -1) - 1
+            )


this is the change since #11144
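A standalone sketch of the counting step in the diff above, for readers who want to try it in isolation. It assumes that `-1` marks unused slots in each row and that one non-padding entry per row is not a draft token (both inferred from the `!= -1` filter and the trailing `- 1` in the diff). The sample rows are made up, and `count_accepted_drafts` is a hypothetical helper name; `accept_index_cpu` is the name used in the commit message for the existing CPU-side list.

```python
def count_accepted_drafts(accept_index_row: list[int]) -> int:
    """Count non-padding entries in one row, minus one (as in the diff)."""
    return sum(1 for idx in accept_index_row if idx != -1) - 1


# Made-up example: one row per request, already a plain Python list on the
# CPU, so iterating over it needs no device synchronization.
accept_index_cpu = [
    [0, 1, 2, -1],     # three valid entries -> 2 counted
    [4, -1, -1, -1],   # one valid entry     -> 0 counted
]

accepted_per_request = [count_accepted_drafts(row) for row in accept_index_cpu]
```

Because the rows are ordinary Python lists, the per-request count is a cheap host-side loop, which matches the intent stated in the PR description.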

ShangmingCai · 2025-10-11

LGTM, this change seems pretty lightweight.

zhyncs · 2025-10-12T05:28:18Z

@hnyls2002

ShangmingCai

LGTM

Scott Lee and others added 13 commits October 1, 2025 12:30

add ,

fe7c45d

Merge branch 'main' into sjl/1001-spec-metrics

a2ec9a7

add sglang:spec_accept_rate prometheus metric

a7a2697

Merge branch 'sjl/1001-spec-metrics' of github.com:scottjlee/sglang into sjl/1001-spec-metrics

6653101

Merge branch 'main' into sjl/1001-spec-metrics

df7dc43

Merge branch 'main' into sjl/1001-spec-metrics

3a5d644

Merge branch 'sjl/1001-spec-metrics' of github.com:scottjlee/sglang into sjl/1001-spec-metrics

4618b8a

fix typo

a9ff581

Merge branch 'main' into sjl/1001-spec-metrics

dda4670

Merge branch 'main' into sjl/1001-spec-metrics

5e2d2ce

Merge branch 'main' into sjl/1001-spec-metrics

b9f6bf8

per req spec metrics behind server launch flag

c868a61

use existing accept_index_cpu to calculate req acc rate

0ee38a4

scottjlee mentioned this pull request Oct 10, 2025

Add metrics for speculative decoding (acceptance rate, average acceptance length) #11144

Merged

4 tasks

scottjlee marked this pull request as ready for review October 10, 2025 23:12

scottjlee requested review from Ying1123, hnyls2002, kssteven418, merrymercy and xiezhq-hermann as code owners October 10, 2025 23:12

scottjlee commented Oct 10, 2025

scottjlee mentioned this pull request Oct 10, 2025

[RFC] [Feature] Improving request-level timing and performance metrics #11141

Closed

7 tasks

zhyncs self-assigned this Oct 10, 2025

zhyncs added the run-ci label Oct 10, 2025

Merge branch 'main' into sjl/1001-spec-metrics

e3f58e3

zhyncs assigned ShangmingCai Oct 10, 2025

Merge branch 'main' into sjl/1001-spec-metrics

61af364

zhyncs added the high priority label Oct 12, 2025

zhyncs approved these changes Oct 12, 2025

Merge branch 'main' into sjl/1001-spec-metrics

7ef9d25

zhyncs requested a review from ShangmingCai October 13, 2025 06:15

ShangmingCai approved these changes Oct 13, 2025

Merge branch 'main' into sjl/1001-spec-metrics

a9ea487

zhyncs merged commit b6fb5d7 into sgl-project:main Oct 13, 2025
191 of 225 checks passed

merrymercy reviewed Oct 13, 2025

Comment thread python/sglang/srt/managers/tokenizer_manager.py

lpc0220 pushed a commit to lpc0220/sglang that referenced this pull request Oct 29, 2025

Add metrics for speculative decoding (acceptance rate, average acceptance length) (sgl-project#11441)

5ed13ba

zhyncs merged 17 commits into sgl-project:main from scottjlee:sjl/1001-spec-metrics

4 participants
