feat(metrics): add scheduler and hiradix cache metrics (#10218) #10225
hnyls2002 merged 19 commits into sgl-project:main
Conversation
Summary of Changes
Hello @ShawnKung, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the observability of the SGLang scheduler and KV cache mechanisms by introducing a suite of new performance metrics. These additions aim to provide deeper insights into memory management efficiency and prefill process dynamics, which are crucial for optimizing system performance and identifying bottlenecks.
Highlights
- Scheduler Metrics Enhancement: Introduced new metrics for the scheduler, including `new_token_ratio` and histograms for eviction duration, load-back duration, and chunked prefill loop counts, to provide more granular performance insights (see the collector sketch after this list).
- KV Cache Metrics Integration: Integrated the new scheduler metrics collector into various KV cache implementations (HiRadixCache, RadixCache, RadixCacheCpp, SWARadixCache) to observe and record eviction and load-back durations directly within the cache operations.
- Prefill Loop Count Tracking: Added a `prefill_loop_count` attribute to `ScheduleBatch` requests to track how many times a request goes through the prefill loop, enabling better analysis of chunked prefill behavior.
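To make the highlights concrete, below is a minimal sketch of how such histogram metrics can be declared and observed with `prometheus_client`; the metric name, label handling, and bucket boundaries are illustrative assumptions, not necessarily what the PR's `SchedulerMetricsCollector` actually uses.

```python
from prometheus_client import Histogram

# Minimal sketch (assumed metric name and buckets, not the PR's exact code) of
# how an eviction-duration histogram could be declared and observed.
class SchedulerMetricsCollectorSketch:
    def __init__(self, labels: dict) -> None:
        self.labels = labels
        self.eviction_duration = Histogram(
            "sglang_scheduler_eviction_duration_seconds",  # assumed name
            "Time spent evicting KV cache entries.",
            labelnames=list(labels.keys()),
            buckets=[0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0],  # assumed buckets
        )

    def observe_eviction_duration(self, duration: float) -> None:
        # Record one eviction's wall-clock duration under this scheduler's labels.
        self.eviction_duration.labels(**self.labels).observe(duration)
```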
Code Review
This pull request introduces new metrics for the scheduler and cache components, including eviction duration, load back duration, and chunked prefill loop count. The changes are generally well-implemented, but I've identified a potential runtime error due to an incorrect keyword argument and several opportunities to improve metric collection logic and code maintainability. My review includes suggestions to fix these issues.
| "--bucket-eviction-duration", | ||
| type=float, | ||
| nargs="+", | ||
| default=ServerArgs.bucket_eviction_duration, |
I think we shouldn't put them all into server args; instead, have default values and maybe allow configuration through environment variables or a JSON file.
Good idea! Putting everything into server args is indeed a bit verbose. I’m going to revise it and push a new commit.
Thanks a lot for the PR, these are all reasonable metrics to collect. My only complaint is that we probably do not want to put all the minor configs into server config parameters, as I mentioned in the comment.
Good idea! I've moved them into env vars. Thanks for the suggestion.
    @@ -1695,6 +1701,7 @@ def get_new_batch_prefill(self) -> Optional[ScheduleBatch]:
        # only record queue time when enable_metrics is True to avoid overhead
        for req in can_run_list:
            req.queue_time_end = time.perf_counter()
            req.prefill_loop_count += 1
Can you elaborate on what this loop count is for? It actually feels quite similar to what the queuing time is already doing.
@xiezhq-hermann This is meant to capture something slightly different from queuing time.
- Queuing time = how long a request waits before prefill starts.
- Prefill loop count = how many rounds a request goes through during prefill.
The main purpose is to check if our chunking configuration is reasonable. For example, in some deployment scenarios, if most prefill requests go through multiple rounds, it may indicate that the chunked_prefill_size could be increased to reduce prefill rounds.
I am a bit confused: when a request gets chunked, the next chunk also comes from the same request, so the number of rounds a request is processed over when chunking happens is basically length / chunk size, which can easily be analysed from the input length and the chunk size instead of recording a runtime metric.
@xiezhq-hermann I agree with your point that in theory the chunk count can be derived as length / chunk_size. However, since there’s currently no metric that reflects the actual chunk_size, when I need to compute the request’s chunk count I would have to hardcode the chunk size in the expression. That makes it less flexible and not aligned with the real deployment setup, where the chunk size may vary. Having a runtime metric is therefore more convenient and more direct in practice.
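For reference, the theoretical relationship discussed above can be written down directly; the helper below is purely illustrative (the function name and the fixed `chunked_prefill_size` argument are assumptions) and shows why a runtime counter is more convenient when the chunk size varies across deployments.

```python
import math

# Illustrative only: the theoretical number of prefill rounds for a request,
# assuming a fixed chunked_prefill_size. The runtime prefill_loop_count metric
# avoids hardcoding this value, which may differ between deployments.
def expected_prefill_rounds(input_len: int, chunked_prefill_size: int) -> int:
    return max(1, math.ceil(input_len / chunked_prefill_size))

# e.g. a 10k-token prompt with chunked_prefill_size=4096 needs 3 prefill rounds
assert expected_prefill_rounds(10_000, 4096) == 3
```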
    self.metrics_collector = SchedulerMetricsCollector(labels=labels)
    self.metrics_collector = SchedulerMetricsCollector(
        labels=labels,
        bucket_eviction_duration=self.get_histogram_conf_from_env(
I think we can do the environment variable reading inside the collector.py
good point
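A minimal sketch of what reading the bucket configuration inside the collector module could look like, assuming the buckets arrive as a comma-separated float list; the environment variable name, helper name, and default buckets are hypothetical, not necessarily the ones merged in the PR.

```python
import os
from typing import List, Optional

# Hypothetical helper for reading histogram buckets from an environment
# variable inside the collector module, e.g.
#   SGLANG_BUCKET_EVICTION_DURATION="0.01,0.1,0.5,1,2,5,10"
# The variable name and default buckets are illustrative assumptions.
DEFAULT_EVICTION_DURATION_BUCKETS: List[float] = [0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0]

def buckets_from_env(env_name: str, default: List[float]) -> List[float]:
    raw: Optional[str] = os.environ.get(env_name)
    if not raw:
        return default
    return [float(x) for x in raw.split(",")]
```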
    @@ -293,6 +304,11 @@ def evict(self, num_tokens: int):
            assert node.backuped
            self._evict_backuped(node)

        if num_evicted > 0 and self.scheduler_metrics_collector is not None:
            self.scheduler_metrics_collector.observe_eviction_duration(
Is this to measure the time taken for eviction to complete? Is it primarily for understanding the eviction overhead, or is there another purpose I am not aware of here? This overhead will vary a lot based on the number of tokens to evict, so simply recording the time probably wouldn't make much sense.
@xiezhq-hermann You make a good point.
My initial idea was mainly to have a quick observable signal: in cases where the system slows down noticeably, we can immediately see whether eviction or write-back is taking up significant time.
I think we could enhance the story by also recording evicted/write-back tokens and total bytes. With these, we could compute average time per token and effective eviction throughput, which would give a more complete picture of the overhead.
What do you think?
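As a rough illustration of that proposal, the wrapper below times an eviction and records both the duration and the evicted-token count, from which time per token and eviction throughput can be derived; the collector method names and the assumption that `evict` returns the number of freed tokens are hypothetical.

```python
import time

# Rough sketch (assumed interfaces): record both the eviction duration and the
# number of evicted tokens so that tokens/second throughput can be derived.
def evict_with_metrics(cache, num_tokens: int, collector=None) -> int:
    start = time.perf_counter()
    num_evicted = cache.evict(num_tokens)  # assumed to return the evicted count
    duration = time.perf_counter() - start
    if collector is not None and num_evicted > 0:
        collector.observe_eviction_duration(duration)  # assumed method
        collector.observe_evicted_tokens(num_evicted)  # assumed method
    return num_evicted
```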
I think recording the number of evicted tokens will be a helpful metric for understanding memory activity, e.g. we can plot the number of evicted tokens over time to see the peaks and lows, etc.
The overhead of eviction might not be as interesting, but I can see it being useful. We are pushing to replace the hiradix tree with a C++ version, and then this overhead might become less significant.
@xiezhq-hermann I agree, adding the number of evicted tokens is a good idea; I'll include that metric.
As for the eviction time, while it may become less relevant once the C++ version is in place, I think it's still useful at this stage. In our case, we may also develop a custom distributed storage backend for the hiradix tree, and having this metric framework in place will make it easier to extend with additional metrics we need later. That said, if you think the eviction time metric is unnecessary for the community, I can drop it.
@xiezhq-hermann hello, I've added these two new metrics for the numbers of evicted and loaded-back tokens.
Thank you @ShawnKung for the continued updates. I have two final suggestions:
@xiezhq-hermann Thank you for the suggestions. I've removed the |
@xiezhq-hermann Hi, just wanted to check in: I've addressed the latest comments.
I will handle the rest and get it merged soon, thanks!
@ShawnKung thanks for the PR. Do you think it would also be nice to have metrics for the |
@ShawnKung When do you think it is going to be merged? We'd like to use this feature to track cache/eviction-related metrics.
@xiezhq-hermann Hi, just checking in: it seems there's more interest in this feature now.
@ShawnKung I made some changes to it, can you take a look and let me know how you feel about it? Thanks!
@xiezhq-hermann Thanks! The changes look good to me.
Motivation
See Issue #10218 for detailed background and motivation.
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist