[PD] Improve disaggregation metrics output: update the metrics to keep reflecting real stats#7317
[PD] Improve disaggregation metrics output: update the metrics to keep reflecting real stats#7317zhyncs merged 17 commits intosgl-project:mainfrom
Conversation
There was a problem hiding this comment.
Summary of Changes
Hello @SCDESPERTATE, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request addresses the issue of missing metric updates in the PD disaggregation component. It connects existing Prometheus metric definitions with the appropriate code paths to accurately reflect runtime execution, particularly regarding bootstrap and transfer failures. The changes involve injecting the metrics collector into relevant classes and incrementing the appropriate counters when failures occur.
Highlights
- Metric Updates: Implemented updates to Prometheus metrics in the PD disaggregation part, specifically for tracking failed bootstrap and transfer requests.
- Code Modifications: Added metric update calls in prefill/decode nodes to track failures during bootstrapping/transferring and in stats logging calls.
- Integration: Ensured that the metrics collector is properly passed to the KV managers in both prefill and decode nodes.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review |
Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary |
Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help |
Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
-
Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
There was a problem hiding this comment.
Code Review
This pull request successfully integrates Prometheus metric updates into the PD disaggregation components. The changes correctly pass the metrics collector to relevant managers and add calls to increment failure counters (bootstrap and transfer failures) and log queue length statistics. The approach of conditioning metric collection on enable_metrics and a specific rank (attn_tp_rank == 0) is sound for distributed environments. The modifications are clear, well-targeted, and align with the PR's stated goals, enhancing system observability. No issues of medium or higher severity were found.
|
@merrymercy Please take a look, thx😊 |
|
@merrymercy @zhyncs Please take a look, thanks. Metrics are important for us to use PD disaggregation |
|
Hi @SCDESPERTATE - can you please rebase? There is a merge conflict with |
Okay, working on it. |
Done. PTAL😊 |
Oof - another one sorry with |
Solved. 😊 |
ShangmingCai
left a comment
There was a problem hiding this comment.
Looks clean. So we only handle the failed reqs metric first? Let me ping @ByronHsu to double-check on this.
|
@SCDESPERTATE Maybe fix |
Got it. Lint error is fixed. |
|
@ShangmingCai Could you retry the check?😊 Seems the failed task was caused by some unknown timeout. |
|
@ShangmingCai Might be another timeout issue caused by network thrashing in the CI problem. Could you help check the commits😊 |
ByronHsu
left a comment
There was a problem hiding this comment.
On a high level, we should not pass metrics collector into kv manager and handle metrics individually in transfer backend. Instead, we should emit metrics from prefill.py and decode.py to be more generic. For example, here we can do
elif poll == KVPoll.Failed:
...
if self.scheduler.enable_metrics and self.scheduler.attn_tp_rank == 0:
self.scheduler.metrics_collector.increment_bootstrap_failed_reqs()
...
Got it. |
|
@ByronHsu Seems okay in the latest version, would you mind taking a look again😊 |
ShangmingCai
left a comment
There was a problem hiding this comment.
Let me run the CI first.
@ByronHsu CI checks passed without significant issues. PTAL😊 |
|
@ShangmingCai @ByronHsu Gently ping, thanks. |
…p reflecting real stats (sgl-project#7317)
Motivation
This PR connects existing Prometheus metric definitions in PD disaggregation part with the appropriate code paths where they should be updated. While the metrics (e.g.,
num_bootstrap_failed_reqs,num_transfer_failed_reqs) were already defined in #6188 , they were not being updated during runtime execution.Modifications
Inserted metric update calls in:
increment_transfer_failed_reqs()/increment_transfer_failed_reqs)log_prefill_stats&&log_decode_stats)infight->inflight)Result
A querying result of a decode node's metrics at a moment