
[metrics] Add in queue metrics #4412

Closed

hebiao064 wants to merge 7 commits into sgl-project:main from hebiao064:add_in_queue_metrics

Conversation

@hebiao064 (Collaborator) commented Mar 14, 2025

Motivation

When serving LLMs at scale, understanding where time is spent during request processing is crucial for optimization. The current metrics don't provide enough granularity to identify specific bottlenecks in the request lifecycle.

Note on performance

These metrics are only emitted when --enable-metrics is specified.
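
For reference, metrics are enabled at server launch; the model path below is illustrative:

python -m sglang.launch_server --model-path /path/to/Meta-Llama-3-8B-Instruct --enable-metrics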

Future Work

If this approach is well-received, I plan to implement additional latency breakdowns for:

  • prefix_cache_lookup time

Metrics Result

# HELP sglang:num_running_reqs The number of running requests.
# TYPE sglang:num_running_reqs gauge
sglang:num_running_reqs{engine_type="unified",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
# HELP sglang:num_used_tokens The number of used tokens.
# TYPE sglang:num_used_tokens gauge
sglang:num_used_tokens{engine_type="unified",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
# HELP sglang:token_usage The token usage.
# TYPE sglang:token_usage gauge
sglang:token_usage{engine_type="unified",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
# HELP sglang:gen_throughput The generation throughput (token/s).
# TYPE sglang:gen_throughput gauge
sglang:gen_throughput{engine_type="unified",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
# HELP sglang:num_queue_reqs The number of requests in the waiting queue.
# TYPE sglang:num_queue_reqs gauge
sglang:num_queue_reqs{engine_type="unified",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
# HELP sglang:cache_hit_rate The prefix cache hit rate.
# TYPE sglang:cache_hit_rate gauge
sglang:cache_hit_rate{engine_type="unified",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
# HELP sglang:spec_accept_length The average acceptance length of speculative decoding.
# TYPE sglang:spec_accept_length gauge
sglang:spec_accept_length{engine_type="unified",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
# HELP sglang:avg_request_queue_latency The average request queue latency.
# TYPE sglang:avg_request_queue_latency gauge
sglang:avg_request_queue_latency{engine_type="unified",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 2.9325485229492188e-05
# HELP sglang:time_to_first_token_seconds Histogram of time to first token in seconds.
# TYPE sglang:time_to_first_token_seconds histogram
sglang:time_to_first_token_seconds_sum{model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.6962933540344238
sglang:time_to_first_token_seconds_bucket{le="0.1",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
sglang:time_to_first_token_seconds_bucket{le="0.3",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
sglang:time_to_first_token_seconds_bucket{le="0.5",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
sglang:time_to_first_token_seconds_bucket{le="0.7",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="0.9",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="1.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="2.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="4.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="6.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="8.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="10.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="20.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="40.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="60.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="80.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="120.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="160.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_bucket{le="+Inf",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_to_first_token_seconds_count{model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
# HELP sglang:e2e_request_latency_seconds Histogram of End-to-end request latency in seconds
# TYPE sglang:e2e_request_latency_seconds histogram
sglang:e2e_request_latency_seconds_sum{model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.6962814331054688
sglang:e2e_request_latency_seconds_bucket{le="0.1",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
sglang:e2e_request_latency_seconds_bucket{le="0.2",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
sglang:e2e_request_latency_seconds_bucket{le="0.4",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
sglang:e2e_request_latency_seconds_bucket{le="0.8",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="1.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="2.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="5.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="10.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="20.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="40.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="60.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="80.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="100.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="150.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="200.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="250.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="300.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="350.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="500.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="1000.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_bucket{le="+Inf",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:e2e_request_latency_seconds_count{model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
# HELP sglang:time_per_output_token_seconds Histogram of time per output token in seconds.
# TYPE sglang:time_per_output_token_seconds histogram
sglang:time_per_output_token_seconds_sum{model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.034814071655273435
sglang:time_per_output_token_seconds_bucket{le="0.002",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
sglang:time_per_output_token_seconds_bucket{le="0.005",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
sglang:time_per_output_token_seconds_bucket{le="0.01",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
sglang:time_per_output_token_seconds_bucket{le="0.02",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
sglang:time_per_output_token_seconds_bucket{le="0.03",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
sglang:time_per_output_token_seconds_bucket{le="0.04",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.05",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.06",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.07",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.08",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.09",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.1",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.15",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.2",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.3",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.4",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.6",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="0.8",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="1.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="2.0",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_bucket{le="+Inf",model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
sglang:time_per_output_token_seconds_count{model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
# HELP sglang:prompt_tokens_total Number of prefill tokens processed.
# TYPE sglang:prompt_tokens_total counter
sglang:prompt_tokens_total{model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 91.0
# HELP sglang:generation_tokens_total Number of generation tokens processed.
# TYPE sglang:generation_tokens_total counter
sglang:generation_tokens_total{model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 20.0
# HELP sglang:cached_tokens_total Number of cached prompt tokens.
# TYPE sglang:cached_tokens_total counter
sglang:cached_tokens_total{model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 0.0
# HELP sglang:num_requests_total Number of requests processed.
# TYPE sglang:num_requests_total counter
sglang:num_requests_total{model_name="/shared/public/models/Meta-Llama-3-8B-Instruct"} 1.0
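
To spot-check the new gauge, you can scrape the Prometheus endpoint. A minimal sketch, assuming the server runs locally with --enable-metrics (host and port are illustrative):

import requests

# Fetch the Prometheus exposition text from the running server.
text = requests.get("http://localhost:30000/metrics", timeout=5).text

# Print only the queue-latency lines added by this PR.
for line in text.splitlines():
    if "avg_request_queue_latency" in line:
        print(line)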

Benchmark Result

python -m sglang.bench_one_batch --model-path /path/to/Llama-3.2-3B-Instruct --batch 1 --input-len 32768 --output-len 1 --quantization w8a8_fp8

# Before
Benchmark ...
Prefill. latency: 0.28452 s, throughput: 115169.06 token/s
Total. latency:  0.285 s, throughput: 115172.57 token/s

# After
Benchmark ...
Prefill. latency: 0.28434 s, throughput: 115242.84 token/s
Total. latency:  0.284 s, throughput: 115246.35 token/s

As expected, the instrumentation adds negligible overhead: the difference is within normal benchmark fluctuation, confirming that metrics collection does not measurably impact performance while still providing valuable insight.

Modifications

This PR takes a minimalist approach, focusing only on queue latency as a first step. We set queue_time_start when a request enters the waiting queue and queue_time_end when it is selected for processing, then calculate the average queue latency across all requests in a scheduled batch, as sketched below.
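
A minimal sketch of the idea (Req, on_enqueue, and on_schedule are simplified stand-ins for the scheduler code, not the exact diff):

import time

class Req:
    def __init__(self):
        self.queue_time_start = None
        self.queue_time_end = None

def on_enqueue(req: Req) -> None:
    # Stamped when the request enters the waiting queue.
    req.queue_time_start = time.time()

def on_schedule(can_run_list: list[Req], enable_metrics: bool):
    # Stamped when the scheduler selects requests for processing.
    now = time.time()
    for req in can_run_list:
        req.queue_time_end = now
    # Only pay the bookkeeping cost when --enable-metrics is on.
    if not (enable_metrics and can_run_list):
        return None
    total_queue_latency = sum(
        r.queue_time_end - r.queue_time_start for r in can_run_list
    )
    # Exported as the sglang:avg_request_queue_latency gauge.
    return total_queue_latency / len(can_run_list)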

Checklist

Comment thread on python/sglang/srt/managers/scheduler.py (outdated)
total_queue_latency = 0
avg_queue_latency = 0
for req in can_run_list:
    print(req.queue_time_start, req.queue_time_end)
Contributor

Remove the debug statement. Protect this under if self.enable_metrics.

Collaborator (Author)

Thanks, addressed the comments and moved this piece of code under if self.enable_metrics.
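
Roughly, the revised snippet becomes (a sketch of the described fix, not the exact diff):

if self.enable_metrics and can_run_list:
    total_queue_latency = 0
    for req in can_run_list:
        total_queue_latency += req.queue_time_end - req.queue_time_start
    avg_queue_latency = total_queue_latency / len(can_run_list)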

@hebiao064 hebiao064 requested a review from HaiShaw as a code owner March 14, 2025 17:54
@hebiao064 (Collaborator, Author)

[Screenshot 2025-03-14 at 6:10:03 PM] After I rebased my branch, it seems GitHub failed to pull it, so I will close this PR and create a new one. @merrymercy

@hebiao064 hebiao064 closed this Mar 15, 2025
@hebiao064 hebiao064 mentioned this pull request Mar 15, 2025