
[metrics] Add in queue metrics #4412

Closed

hebiao064 wants to merge 7 commits into sgl-project:main from hebiao064:add_in_queue_metrics

Conversation

@hebiao064 (Collaborator) commented Mar 14, 2025

Motivation

When serving LLMs at scale, understanding where time is spent during request processing is crucial for optimization. The current metrics don't provide enough granularity to identify specific bottlenecks in the request lifecycle.

Note on performance

These metrics are only emitted when --enable-metrics is specified.
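
For reference, metrics are enabled at server launch; the model path below is illustrative:

python -m sglang.launch_server --model-path /path/to/Meta-Llama-3-8B-Instruct --enable-metrics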

Future Work

If this approach is well-received, I plan to implement additional latency breakdowns for:

  • prefix_cache_lookup time

Metrics Result

# HELP sglang:num_running_reqs The number of running requests.
# TYPE sglang:num_running_reqs gauge
sglang:num_running_reqs{engine_type="unified",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
# HELP sglang:num_used_tokens The number of used tokens.
# TYPE sglang:num_used_tokens gauge
sglang:num_used_tokens{engine_type="unified",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
# HELP sglang:token_usage The token usage.
# TYPE sglang:token_usage gauge
sglang:token_usage{engine_type="unified",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
# HELP sglang:gen_throughput The generation throughput (token/s).
# TYPE sglang:gen_throughput gauge
sglang:gen_throughput{engine_type="unified",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
# HELP sglang:num_queue_reqs The number of requests in the waiting queue.
# TYPE sglang:num_queue_reqs gauge
sglang:num_queue_reqs{engine_type="unified",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
# HELP sglang:cache_hit_rate The prefix cache hit rate.
# TYPE sglang:cache_hit_rate gauge
sglang:cache_hit_rate{engine_type="unified",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
# HELP sglang:spec_accept_length The average acceptance length of speculative decoding.
# TYPE sglang:spec_accept_length gauge
sglang:spec_accept_length{engine_type="unified",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
# HELP sglang:avg_request_queue_latency The average request queue latency.
# TYPE sglang:avg_request_queue_latency gauge
sglang:avg_request_queue_latency{engine_type="unified",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 2.9325485229492188e-05
# HELP sglang:time_to_first_token_seconds Histogram of time to first token in seconds.
# TYPE sglang:time_to_first_token_seconds histogram
sglang:time_to_first_token_seconds_sum{model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.6962933540344238
sglang:time_to_first_token_seconds_bucket{le="0.1",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
sglang:time_to_first_token_seconds_bucket{le="0.3",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
sglang:time_to_first_token_seconds_bucket{le="0.5",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
sglang:time_to_first_token_seconds_bucket{le="0.7",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="0.9",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="1.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="2.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="4.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="6.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="8.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="10.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="20.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="40.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="60.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="80.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="120.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="160.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="+Inf",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_count{model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
# HELP sglang:e2e_request_latency_seconds Histogram of End-to-end request latency in seconds
# TYPE sglang:e2e_request_latency_seconds histogram
sglang:e2e_request_latency_seconds_sum{model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.6962814331054688
sglang:e2e_request_latency_seconds_bucket{le="0.1",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
sglang:e2e_request_latency_seconds_bucket{le="0.2",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
sglang:e2e_request_latency_seconds_bucket{le="0.4",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
sglang:e2e_request_latency_seconds_bucket{le="0.8",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="1.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="2.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="5.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="10.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="20.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="40.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="60.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="80.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="100.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="150.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="200.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="250.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="300.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="350.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="500.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="1000.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="+Inf",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_count{model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
# HELP sglang:time_per_output_token_seconds Histogram of time per output token in seconds.
# TYPE sglang:time_per_output_token_seconds histogram
sglang:time_per_output_token_seconds_sum{model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.034814071655273435
sglang:time_per_output_token_seconds_bucket{le="0.002",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
sglang:time_per_output_token_seconds_bucket{le="0.005",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
sglang:time_per_output_token_seconds_bucket{le="0.01",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
sglang:time_per_output_token_seconds_bucket{le="0.02",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
sglang:time_per_output_token_seconds_bucket{le="0.03",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
sglang:time_per_output_token_seconds_bucket{le="0.04",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.05",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.06",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.07",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.08",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.09",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.1",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.15",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.2",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.3",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.4",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.6",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.8",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="1.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="2.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="+Inf",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_count{model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
# HELP sglang:prompt_tokens_total Number of prefill tokens processed.
# TYPE sglang:prompt_tokens_total counter
sglang:prompt_tokens_total{model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 91.0
# HELP sglang:generation_tokens_total Number of generation tokens processed.
# TYPE sglang:generation_tokens_total counter
sglang:generation_tokens_total{model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 20.0
# HELP sglang:cached_tokens_total Number of cached prompt tokens.
# TYPE sglang:cached_tokens_total counter
sglang:cached_tokens_total{model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
# HELP sglang:num_requests_total Number of requests processed.
# TYPE sglang:num_requests_total counter
sglang:num_requests_total{model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
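
To spot-check the new gauge, you can scrape the Prometheus endpoint. A minimal sketch, assuming the server runs locally with --enable-metrics (host and port are illustrative):

import requests

# Fetch the Prometheus exposition text from the running server.
text = requests.get("http://localhost:30000/metrics", timeout=5).text

# Print only the queue-latency lines added by this PR.
for line in text.splitlines():
    if "avg_request_queue_latency" in line:
        print(line)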

Benchmark Result

python -m sglang.bench_one_batch --model-path /path/to/Llama-3.2-3B-Instruct --batch 1 --input-len 32768 --output-len 1 --quantization w8a8_fp8

# Before
Benchmark ...
Prefill. latency: 0.28452 s, throughput: 115169.06 token/s
Total. latency:  0.285 s, throughput: 115172.57 token/s

# After
Benchmark ...
Prefill. latency: 0.28434 s, throughput: 115242.84 token/s
Total. latency:  0.284 s, throughput: 115246.35 token/s

As expected, the instrumentation adds negligible overhead: the difference is within normal benchmark fluctuation, confirming that metrics collection does not measurably impact performance while still providing valuable insight.

Modifications

This PR takes a minimalist approach, focusing only on queue latency as a first step. We set queue_time_start when a request enters the waiting queue and queue_time_end when it is selected for processing, then calculate the average queue latency across all requests in a scheduled batch, as sketched below.
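
A minimal sketch of the idea (Req, on_enqueue, and on_schedule are simplified stand-ins for the scheduler code, not the exact diff):

import time

class Req:
    def __init__(self):
        self.queue_time_start = None
        self.queue_time_end = None

def on_enqueue(req: Req) -> None:
    # Stamped when the request enters the waiting queue.
    req.queue_time_start = time.time()

def on_schedule(can_run_list: list[Req], enable_metrics: bool):
    # Stamped when the scheduler selects requests for processing.
    now = time.time()
    for req in can_run_list:
        req.queue_time_end = now
    # Only pay the bookkeeping cost when --enable-metrics is on.
    if not (enable_metrics and can_run_list):
        return None
    total_queue_latency = sum(
        r.queue_time_end - r.queue_time_start for r in can_run_list
    )
    # Exported as the sglang:avg_request_queue_latency gauge.
    return total_queue_latency / len(can_run_list)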

Checklist

Comment thread on python/sglang/srt/managers/scheduler.py (outdated)
total_queue_latency = 0
avg_queue_latency = 0
for req in can_run_list:
    print(req.queue_time_start, req.queue_time_end)
Contributor

Remove the debug statement. Protect this under if self.enable_metrics.

Collaborator (Author)

Thanks, addressed the comments and moved this piece of code under if self.enable_metrics.
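
Roughly, the revised snippet becomes (a sketch of the described fix, not the exact diff):

if self.enable_metrics and can_run_list:
    total_queue_latency = 0
    for req in can_run_list:
        total_queue_latency += req.queue_time_end - req.queue_time_start
    avg_queue_latency = total_queue_latency / len(can_run_list)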

@hebiao064 hebiao064 requested a review from HaiShaw as a code owner March 14, 2025 17:54
@hebiao064 (Collaborator, Author)

[Screenshot 2025-03-14 at 6:10:03 PM] After I rebased my branch, it seems GitHub failed to pull it, so I will close this PR and create a new one. @merrymercy

@hebiao064 hebiao064 closed this Mar 15, 2025
@hebiao064 hebiao064 mentioned this pull request Mar 15, 2025