
Add metrics for speculative decoding (acceptance rate, average acceptance length)#11441

Merged
zhyncs merged 17 commits into sgl-project:main from scottjlee:sjl/1001-spec-metrics
Oct 13, 2025

Conversation

Contributor

@scottjlee scottjlee commented Oct 10, 2025

Re-attempt of #11144, which introduced an additional device sync per verify pass (see discussion here). This PR updates the implementation to instead iterate over an existing list on CPU, avoiding the device sync (see here for the change).


Motivation

To better understand results when using speculative decoding, it would be helpful to have request-level information on the acceptance rate and average acceptance length.

Modifications

Add the following metrics tracked at the request level:

  • spec_accept_rate: Speculative decoding acceptance rate for this request (# accepted tokens / # draft tokens).
  • spec_accept_length: Number of accepted tokens for each verification pass, averaged across all passes for this request.

These metrics are collected per request and reported in the final log output (example below):

[2025-10-02 23:42:16] Finish: obj=GenerateReqInput(video_data=None, rid='5799660f79fd4dc481ac2226d5b97102', return_logprob=False, logprob_start_len=-1, top_logprobs_num=0, token_ids_logprob=None, return_text_in_logprobs=True, stream=True, log_metrics=True, return_hidden_states=False, modalities=[], session_params=None, lora_id=None, custom_logit_processor=None, bootstrap_host=None, bootstrap_port=None, bootstrap_room=None, bootstrap_pair_key=None, data_parallel_rank=None, background=False, conversation_id=None, priority=None, extra_key=None, no_logs=False, custom_labels=None, label=None, return_bytes=False), out={'meta_info': {'id': '5799660f79fd4dc481ac2226d5b97102', 'finish_reason': {'type': 'stop', 'matched': 2}, 'prompt_tokens': 23, 'weight_version': 'default', 'completion_tokens': 32, 'cached_tokens': 0, 'spec_verify_ct': 14, 'spec_accept_rate': 0.40476190476190477, 'spec_accept_length': 2.2857142857142856, 'e2e_latency': 0.15333986282348633}}
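The two fields in that log line reduce to simple ratios over per-request counters. A minimal sketch, assuming a fixed number of draft tokens per verify pass and that every pass also emits one target-model token on top of the accepted draft tokens; the function and argument names here are illustrative, not the scheduler's actual fields:

```python
def spec_metrics(spec_accepted_tokens: int, spec_verify_ct: int,
                 draft_tokens_per_verify: int) -> tuple[float, float]:
    """Per-request speculative decoding metrics (illustrative sketch)."""
    # Acceptance rate: accepted draft tokens / all draft tokens proposed.
    total_draft_tokens = spec_verify_ct * draft_tokens_per_verify
    spec_accept_rate = spec_accepted_tokens / total_draft_tokens
    # Acceptance length: tokens emitted per verify pass, counting the one
    # token the target model produces on every pass in addition to the
    # accepted draft tokens.
    spec_accept_length = (spec_accepted_tokens + spec_verify_ct) / spec_verify_ct
    return spec_accept_rate, spec_accept_length

print(spec_metrics(18, 6, 4))  # (0.75, 4.0)
```

A real implementation would also guard against `spec_verify_ct == 0` (e.g. requests that finish before any verify pass runs).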

We also expose the average acceptance rate of a batch (calculated as total accepted tokens / total draft tokens in the batch) as a Prometheus metric, sglang:spec_accept_rate. Example output below:

curl -s http://localhost:8001/metrics | grep -E "spec_accept"
# HELP sglang:spec_accept_length The average acceptance length of speculative decoding.
# TYPE sglang:spec_accept_length gauge
sglang:spec_accept_length{engine_type="unified",model_name="meta-llama/Llama-2-7b-chat-hf",pp_rank="0",tp_rank="0"} 2.4782608695652173
# HELP sglang:spec_accept_rate The average acceptance rate of speculative decoding (`accepted tokens / total draft tokens` in batch).
# TYPE sglang:spec_accept_rate gauge
sglang:spec_accept_rate{engine_type="unified",model_name="meta-llama/Llama-2-7b-chat-hf",pp_rank="0",tp_rank="0"} 0.8260869565217391
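A batch-level gauge like the one above can be wired up with `prometheus_client` directly. This is a hedged sketch, not SGLang's actual metrics plumbing: the label names and values mirror the example output, and `record_batch` is a hypothetical helper.

```python
from prometheus_client import CollectorRegistry, Gauge, generate_latest

registry = CollectorRegistry()
spec_accept_rate = Gauge(
    "sglang:spec_accept_rate",
    "The average acceptance rate of speculative decoding "
    "(`accepted tokens / total draft tokens` in batch).",
    ["engine_type", "model_name", "pp_rank", "tp_rank"],
    registry=registry,
)

def record_batch(accepted_tokens: int, draft_tokens: int) -> None:
    # Batch-level rate: total accepted / total drafted across the batch.
    spec_accept_rate.labels(
        engine_type="unified",
        model_name="meta-llama/Llama-2-7b-chat-hf",
        pp_rank="0",
        tp_rank="0",
    ).set(accepted_tokens / draft_tokens)

record_batch(19, 23)
print(generate_latest(registry).decode())
```

Scraping the registry (or the server's `/metrics` endpoint) then yields the `# HELP` / `# TYPE` / sample lines shown above.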

Accuracy Tests

Should not affect model outputs or accuracy.

Benchmarking and Profiling

Checklist

Comment on lines +380 to +382
req.spec_accepted_tokens += (
    sum(1 for idx in accept_index_row if idx != -1) - 1
)
Contributor Author

@scottjlee scottjlee Oct 10, 2025


this is the change since #11144

Collaborator

LGTM, this change seems pretty lightweight.

@zhyncs
Collaborator

zhyncs commented Oct 12, 2025

@hnyls2002

@zhyncs zhyncs requested a review from ShangmingCai October 13, 2025 06:15
Collaborator

@ShangmingCai ShangmingCai left a comment


LGTM

@zhyncs zhyncs merged commit b6fb5d7 into sgl-project:main Oct 13, 2025
191 of 225 checks passed
Comment thread on python/sglang/srt/managers/tokenizer_manager.py
lpc0220 pushed a commit to lpc0220/sglang that referenced this pull request Oct 29, 2025

4 participants