
[PD] Improve disaggregation metrics output: update the metrics to keep reflecting real stats #7317

Merged
zhyncs merged 17 commits into sgl-project:main from SCDESPERTATE:main on Aug 25, 2025

Conversation

SCDESPERTATE (Contributor) commented Jun 18, 2025

Motivation

This PR connects the existing Prometheus metric definitions in the PD disaggregation code with the code paths where they should be updated. While the metrics (e.g., num_bootstrap_failed_reqs, num_transfer_failed_reqs) were already defined in #6188, they were never updated during runtime execution.

Modifications

Inserted metric update calls where:

  • a prefill/decode node becomes aware of a failure during bootstrapping/transferring (increment_bootstrap_failed_reqs() / increment_transfer_failed_reqs(); see the sketch below)
  • prefill/decode stats are logged (log_prefill_stats and log_decode_stats)

Also fixed a typo in naming (infight -> inflight).
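
For context, a minimal sketch of what such a collector could look like with prometheus_client. The metric and method names follow the ones mentioned in this PR, but the class itself is a simplified stand-in for SGLang's actual scheduler metrics collector, not its real implementation:

    from prometheus_client import Counter

    class SchedulerMetricsCollector:
        # Simplified stand-in for SGLang's real collector.
        def __init__(self, model_name: str, engine_type: str = "unified") -> None:
            self.labels = {"model_name": model_name, "engine_type": engine_type}
            # prometheus_client appends the _total suffix to counters, so
            # these are exposed as sglang:..._total, as in the output below.
            self.num_bootstrap_failed_reqs = Counter(
                "sglang:num_bootstrap_failed_reqs",
                "The number of bootstrap failed requests.",
                list(self.labels.keys()),
            )
            self.num_transfer_failed_reqs = Counter(
                "sglang:num_transfer_failed_reqs",
                "The number of transfer failed requests.",
                list(self.labels.keys()),
            )

        def increment_bootstrap_failed_reqs(self) -> None:
            self.num_bootstrap_failed_reqs.labels(**self.labels).inc()

        def increment_transfer_failed_reqs(self) -> None:
            self.num_transfer_failed_reqs.labels(**self.labels).inc()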

Result

A sample query of a decode node's metrics at a given moment:

$ curl http://localhost:30001/metrics
# HELP sglang:num_transfer_failed_reqs_total The number of transfer failed requests.
# TYPE sglang:num_transfer_failed_reqs_total counter
sglang:num_transfer_failed_reqs_total{engine_type="unified",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 156.0
# HELP sglang:num_running_reqs The number of running requests.
# TYPE sglang:num_running_reqs gauge
sglang:num_running_reqs{engine_type="unified",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 0.0
# HELP sglang:num_used_tokens The number of used tokens.
# TYPE sglang:num_used_tokens gauge
sglang:num_used_tokens{engine_type="unified",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 0.0
# HELP sglang:token_usage The token usage.
# TYPE sglang:token_usage gauge
sglang:token_usage{engine_type="unified",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 0.0
# HELP sglang:gen_throughput The generation throughput (token/s).
# TYPE sglang:gen_throughput gauge
sglang:gen_throughput{engine_type="unified",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 0.0
# HELP sglang:num_queue_reqs The number of requests in the waiting queue.
# TYPE sglang:num_queue_reqs gauge
sglang:num_queue_reqs{engine_type="unified",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 0.0
# HELP sglang:num_grammar_queue_reqs The number of requests in the grammar waiting queue.
# TYPE sglang:num_grammar_queue_reqs gauge
sglang:num_grammar_queue_reqs{engine_type="unified",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 0.0
# HELP sglang:cache_hit_rate The prefix cache hit rate.
# TYPE sglang:cache_hit_rate gauge
sglang:cache_hit_rate{engine_type="unified",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 0.0
# HELP sglang:spec_accept_length The average acceptance length of speculative decoding.
# TYPE sglang:spec_accept_length gauge
sglang:spec_accept_length{engine_type="unified",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 0.0
# HELP sglang:num_prefill_prealloc_queue_reqs The number of requests in the prefill prealloc queue.
# TYPE sglang:num_prefill_prealloc_queue_reqs gauge
sglang:num_prefill_prealloc_queue_reqs{engine_type="unified",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 0.0
# HELP sglang:num_prefill_infight_queue_reqs The number of requests in the prefill infight queue.
# TYPE sglang:num_prefill_infight_queue_reqs gauge
sglang:num_prefill_infight_queue_reqs{engine_type="unified",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 0.0
# HELP sglang:num_decode_prealloc_queue_reqs The number of requests in the decode prealloc queue.
# TYPE sglang:num_decode_prealloc_queue_reqs gauge
sglang:num_decode_prealloc_queue_reqs{engine_type="unified",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 8.0
# HELP sglang:num_decode_transfer_queue_reqs The number of requests in the decode transfer queue.
# TYPE sglang:num_decode_transfer_queue_reqs gauge
sglang:num_decode_transfer_queue_reqs{engine_type="unified",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 9.0
# HELP sglang:prompt_tokens_total Number of prefill tokens processed.
# TYPE sglang:prompt_tokens_total counter
sglang:prompt_tokens_total{model_name="/home/models/deepseek-ai__DeepSeek-R1"} 229440.0
# HELP sglang:generation_tokens_total Number of generation tokens processed.
# TYPE sglang:generation_tokens_total counter
sglang:generation_tokens_total{model_name="/home/models/deepseek-ai__DeepSeek-R1"} 2448.0
# HELP sglang:num_requests_total Number of requests processed.
# TYPE sglang:num_requests_total counter
sglang:num_requests_total{model_name="/home/models/deepseek-ai__DeepSeek-R1"} 58.0
# HELP sglang:num_aborted_requests_total Number of requests aborted.
# TYPE sglang:num_aborted_requests_total counter
sglang:num_aborted_requests_total{model_name="/home/models/deepseek-ai__DeepSeek-R1"} 1.0
# HELP sglang:time_to_first_token_seconds Histogram of time to first token in seconds.
# TYPE sglang:time_to_first_token_seconds histogram
sglang:time_to_first_token_seconds_sum{model_name="/home/models/deepseek-ai__DeepSeek-R1"} 858.7035551071167
sglang:time_to_first_token_seconds_bucket{le="0.1",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 0.0
sglang:time_to_first_token_seconds_bucket{le="0.2",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 0.0
sglang:time_to_first_token_seconds_bucket{le="0.4",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 0.0
sglang:time_to_first_token_seconds_bucket{le="0.6",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 0.0
sglang:time_to_first_token_seconds_bucket{le="0.8",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 1.0
sglang:time_to_first_token_seconds_bucket{le="1.0",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 1.0
sglang:time_to_first_token_seconds_bucket{le="2.0",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 3.0
sglang:time_to_first_token_seconds_bucket{le="4.0",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 5.0
sglang:time_to_first_token_seconds_bucket{le="6.0",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 9.0
sglang:time_to_first_token_seconds_bucket{le="8.0",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 14.0
sglang:time_to_first_token_seconds_bucket{le="10.0",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 20.0
sglang:time_to_first_token_seconds_bucket{le="20.0",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 24.0
sglang:time_to_first_token_seconds_bucket{le="40.0",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 58.0
sglang:time_to_first_token_seconds_bucket{le="60.0",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 58.0
sglang:time_to_first_token_seconds_bucket{le="80.0",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 58.0
sglang:time_to_first_token_seconds_bucket{le="100.0",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 58.0
sglang:time_to_first_token_seconds_bucket{le="200.0",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 58.0
sglang:time_to_first_token_seconds_bucket{le="400.0",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 58.0
sglang:time_to_first_token_seconds_bucket{le="+Inf",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 58.0
sglang:time_to_first_token_seconds_count{model_name="/home/models/deepseek-ai__DeepSeek-R1"} 58.0
# HELP sglang:e2e_request_latency_seconds Histogram of End-to-end request latency in seconds
# TYPE sglang:e2e_request_latency_seconds histogram
sglang:e2e_request_latency_seconds_sum{model_name="/home/models/deepseek-ai__DeepSeek-R1"} 933.9169390201569
sglang:e2e_request_latency_seconds_bucket{le="0.1",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 0.0
sglang:e2e_request_latency_seconds_bucket{le="0.2",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 0.0
sglang:e2e_request_latency_seconds_bucket{le="0.4",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 0.0
sglang:e2e_request_latency_seconds_bucket{le="0.6",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 0.0
sglang:e2e_request_latency_seconds_bucket{le="0.8",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 0.0
sglang:e2e_request_latency_seconds_bucket{le="1.0",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 0.0
sglang:e2e_request_latency_seconds_bucket{le="2.0",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 2.0
sglang:e2e_request_latency_seconds_bucket{le="4.0",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 3.0
sglang:e2e_request_latency_seconds_bucket{le="6.0",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 4.0
sglang:e2e_request_latency_seconds_bucket{le="8.0",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 5.0
sglang:e2e_request_latency_seconds_bucket{le="10.0",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 8.0
sglang:e2e_request_latency_seconds_bucket{le="20.0",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 24.0
sglang:e2e_request_latency_seconds_bucket{le="40.0",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 58.0
sglang:e2e_request_latency_seconds_bucket{le="60.0",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 58.0
sglang:e2e_request_latency_seconds_bucket{le="80.0",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 58.0
sglang:e2e_request_latency_seconds_bucket{le="100.0",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 58.0
sglang:e2e_request_latency_seconds_bucket{le="200.0",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 58.0
sglang:e2e_request_latency_seconds_bucket{le="400.0",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 58.0
sglang:e2e_request_latency_seconds_bucket{le="800.0",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 58.0
sglang:e2e_request_latency_seconds_bucket{le="+Inf",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 58.0
sglang:e2e_request_latency_seconds_count{model_name="/home/models/deepseek-ai__DeepSeek-R1"} 58.0
# HELP sglang:inter_token_latency_seconds Histogram of inter-token latency in seconds.
# TYPE sglang:inter_token_latency_seconds histogram
sglang:inter_token_latency_seconds_sum{model_name="/home/models/deepseek-ai__DeepSeek-R1"} 75.21362400054932
sglang:inter_token_latency_seconds_bucket{le="0.002",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 0.0
sglang:inter_token_latency_seconds_bucket{le="0.004",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 0.0
sglang:inter_token_latency_seconds_bucket{le="0.006",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 0.0
sglang:inter_token_latency_seconds_bucket{le="0.008",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 0.0
sglang:inter_token_latency_seconds_bucket{le="0.01",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 0.0
sglang:inter_token_latency_seconds_bucket{le="0.015",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 0.0
sglang:inter_token_latency_seconds_bucket{le="0.02",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 268.0
sglang:inter_token_latency_seconds_bucket{le="0.025",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 679.0
sglang:inter_token_latency_seconds_bucket{le="0.03",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 1854.0
sglang:inter_token_latency_seconds_bucket{le="0.035",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 2332.0
sglang:inter_token_latency_seconds_bucket{le="0.04",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 2338.0
sglang:inter_token_latency_seconds_bucket{le="0.06",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 2347.0
sglang:inter_token_latency_seconds_bucket{le="0.08",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 2348.0
sglang:inter_token_latency_seconds_bucket{le="0.1",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 2377.0
sglang:inter_token_latency_seconds_bucket{le="0.2",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 2405.0
sglang:inter_token_latency_seconds_bucket{le="0.4",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 2405.0
sglang:inter_token_latency_seconds_bucket{le="0.6",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 2410.0
sglang:inter_token_latency_seconds_bucket{le="0.8",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 2415.0
sglang:inter_token_latency_seconds_bucket{le="1.0",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 2415.0
sglang:inter_token_latency_seconds_bucket{le="2.0",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 2415.0
sglang:inter_token_latency_seconds_bucket{le="4.0",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 2415.0
sglang:inter_token_latency_seconds_bucket{le="6.0",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 2415.0
sglang:inter_token_latency_seconds_bucket{le="8.0",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 2415.0
sglang:inter_token_latency_seconds_bucket{le="+Inf",model_name="/home/models/deepseek-ai__DeepSeek-R1"} 2415.0
sglang:inter_token_latency_seconds_count{model_name="/home/models/deepseek-ai__DeepSeek-R1"} 2415.0
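
If you read this endpoint directly rather than through a Prometheus server, the exposition format above can be parsed with prometheus_client's parser. A small sketch, assuming the same port and metric name as in the output above:

    import requests
    from prometheus_client.parser import text_string_to_metric_families

    # Fetch the raw exposition text from the decode node.
    text = requests.get("http://localhost:30001/metrics").text

    # The parser groups samples into metric families; for counters the
    # family name is reported without the _total suffix.
    for family in text_string_to_metric_families(text):
        if family.name == "sglang:num_transfer_failed_reqs":
            for sample in family.samples:
                print(sample.name, sample.labels, sample.value)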

gemini-code-assist (Bot) left a comment

Summary of Changes

Hello @SCDESPERTATE, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses the issue of missing metric updates in the PD disaggregation component. It connects existing Prometheus metric definitions with the appropriate code paths to accurately reflect runtime execution, particularly regarding bootstrap and transfer failures. The changes involve injecting the metrics collector into relevant classes and incrementing the appropriate counters when failures occur.

Highlights

  • Metric Updates: Implemented updates to Prometheus metrics in the PD disaggregation part, specifically for tracking failed bootstrap and transfer requests.
  • Code Modifications: Added metric update calls in prefill/decode nodes to track failures during bootstrapping/transferring and in stats logging calls.
  • Integration: Ensured that the metrics collector is properly passed to the KV managers in both prefill and decode nodes.

gemini-code-assist (Bot) left a comment

Code Review

This pull request successfully integrates Prometheus metric updates into the PD disaggregation components. The changes correctly pass the metrics collector to relevant managers and add calls to increment failure counters (bootstrap and transfer failures) and log queue length statistics. The approach of conditioning metric collection on enable_metrics and a specific rank (attn_tp_rank == 0) is sound for distributed environments. The modifications are clear, well-targeted, and align with the PR's stated goals, enhancing system observability. No issues of medium or higher severity were found.
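
The guard the review mentions is easy to picture: stats are emitted only when metrics are enabled and only from attention-TP rank 0, so each value is reported once per engine. A minimal sketch of a queue-length gauge update in a decode stats-logging path; the scheduler class and queue attribute here are hypothetical stand-ins, not SGLang's real types:

    from dataclasses import dataclass, field
    from prometheus_client import Gauge

    # Gauge mirroring one of the HELP lines in the output above.
    NUM_DECODE_PREALLOC_QUEUE_REQS = Gauge(
        "sglang:num_decode_prealloc_queue_reqs",
        "The number of requests in the decode prealloc queue.",
        ["model_name", "engine_type"],
    )

    @dataclass
    class DecodeScheduler:  # simplified stand-in for the real scheduler
        model_name: str
        enable_metrics: bool = True
        attn_tp_rank: int = 0
        prealloc_queue: list = field(default_factory=list)

        def log_decode_stats(self) -> None:
            # Set the gauge only on attention-TP rank 0 and only when
            # metrics are enabled, so each stat is emitted exactly once.
            if self.enable_metrics and self.attn_tp_rank == 0:
                NUM_DECODE_PREALLOC_QUEUE_REQS.labels(
                    model_name=self.model_name, engine_type="unified"
                ).set(len(self.prealloc_queue))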

SCDESPERTATE changed the title from "Add the missing logic to update existing PD monitoring metrics" to "[PD] Improve disaggregation metrics monitoring" Jun 19, 2025
SCDESPERTATE changed the title from "[PD] Improve disaggregation metrics monitoring" to "[PD] Improve disaggregation metrics output: update the metrics to keep reflecting real stats" Jun 19, 2025
SCDESPERTATE marked this pull request as ready for review June 19, 2025 08:27
SCDESPERTATE (Contributor, Author) commented Jun 19, 2025

@merrymercy Please take a look, thx😊

fungaren commented:

@merrymercy @zhyncs Please take a look, thanks. Metrics are important for us when using PD disaggregation.

ishandhanani (Collaborator) commented:

Hi @SCDESPERTATE - can you please rebase? There is a merge conflict with scheduler.py

SCDESPERTATE (Contributor, Author) commented:

> Hi @SCDESPERTATE - can you please rebase? There is a merge conflict with scheduler.py

Okay, working on it.

SCDESPERTATE (Contributor, Author) commented:

> Hi @SCDESPERTATE - can you please rebase? There is a merge conflict with scheduler.py

Done. PTAL😊

ishandhanani (Collaborator) commented:

> Hi @SCDESPERTATE - can you please rebase? There is a merge conflict with scheduler.py
>
> Done. PTAL😊

Oof - sorry, another one, with conn.py in mooncake.

SCDESPERTATE (Contributor, Author) commented Aug 5, 2025

> Hi @SCDESPERTATE - can you please rebase? There is a merge conflict with scheduler.py
>
> Done. PTAL😊
>
> Oof - sorry, another one, with conn.py in mooncake.

Solved. 😊

ShangmingCai (Collaborator) left a comment

Looks clean. So we only handle the failed reqs metric first? Let me ping @ByronHsu to double-check on this.

ShangmingCai (Collaborator) commented:

@SCDESPERTATE Maybe fix isort or use pre-commit run --all-files.

SCDESPERTATE (Contributor, Author) commented:

> @SCDESPERTATE Maybe fix isort or use pre-commit run --all-files.

Got it. The lint error is fixed.

SCDESPERTATE (Contributor, Author) commented:

@ShangmingCai Could you retry the check? 😊 The failed task seems to have been caused by an unknown timeout.

SCDESPERTATE (Contributor, Author) commented Aug 10, 2025

@ShangmingCai This might be another timeout caused by network thrashing in CI. Could you help check the commits? 😊

ByronHsu (Collaborator) left a comment

At a high level, we should not pass the metrics collector into the KV manager and handle metrics individually in each transfer backend. Instead, we should emit metrics from prefill.py and decode.py to be more generic. For example, here we can do:

        self.scheduler.stream_output([req], req.return_logprob)
    elif poll == KVPoll.Failed:
        ...
        if self.scheduler.enable_metrics and self.scheduler.attn_tp_rank == 0:
            self.scheduler.metrics_collector.increment_bootstrap_failed_reqs()
        ...
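
To make the suggested pattern concrete, a self-contained sketch follows; KVPoll and the queue class here are simplified stand-ins for the real SGLang types, and the collector method matches the one named in this PR:

    from enum import Enum, auto

    class KVPoll(Enum):  # simplified stand-in for the real KVPoll states
        Bootstrapping = auto()
        WaitingForInput = auto()
        Success = auto()
        Failed = auto()

    class PrefillBootstrapQueue:  # hypothetical owner of the polling loop
        def __init__(self, scheduler):
            self.scheduler = scheduler

        def handle_poll(self, req, poll: KVPoll) -> None:
            if poll == KVPoll.Success:
                self.scheduler.stream_output([req], req.return_logprob)
            elif poll == KVPoll.Failed:
                # Count the failure from the scheduler side, as suggested,
                # instead of threading the collector into the KV manager.
                if self.scheduler.enable_metrics and self.scheduler.attn_tp_rank == 0:
                    self.scheduler.metrics_collector.increment_bootstrap_failed_reqs()
                # ...error handling for the failed request would go here.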

SCDESPERTATE (Contributor, Author) commented:

> At a high level, we should not pass the metrics collector into the KV manager and handle metrics individually in each transfer backend. Instead, we should emit metrics from prefill.py and decode.py to be more generic. [...]

Got it.

SCDESPERTATE (Contributor, Author) commented Aug 11, 2025

@ByronHsu The latest version seems okay; would you mind taking a look again? 😊

ShangmingCai (Collaborator) left a comment

Let me run the CI first.

SCDESPERTATE (Contributor, Author) commented:

> Let me run the CI first.

@ByronHsu CI checks passed without significant issues. PTAL😊

SCDESPERTATE requested a review from ByronHsu August 16, 2025 08:11
SCDESPERTATE (Contributor, Author) commented:

@ShangmingCai @ByronHsu Gentle ping, thanks.

ShangmingCai (Collaborator) left a comment

LGTM.

@zhyncs zhyncs merged commit b5c6529 into sgl-project:main Aug 25, 2025
155 of 166 checks passed
MahmoudAshraf97 pushed a commit to MahmoudAshraf97/sglang that referenced this pull request Sep 8, 2025