
feat(metrics): add scheduler and hiradix cache metrics (#10218) #10225

Merged

hnyls2002 merged 19 commits into sgl-project:main from ShawnKung:feat/kvcache-scheduler-metrics on Nov 10, 2025

Conversation

@ShawnKung (Contributor) commented Sep 9, 2025:

Motivation

See Issue #10218 for detailed background and motivation.


@gemini-code-assist (Bot) left a comment:

Summary of Changes

Hello @ShawnKung, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the observability of the SGLang scheduler and KV cache mechanisms by introducing a suite of new performance metrics. These additions aim to provide deeper insights into memory management efficiency and prefill process dynamics, which are crucial for optimizing system performance and identifying bottlenecks.

Highlights

  • Scheduler Metrics Enhancement: Introduced new metrics for the scheduler, including new_token_ratio and histograms for eviction duration, load-back duration, and chunked prefill loop counts, to provide more granular performance insights.
  • KV Cache Metrics Integration: Integrated the new scheduler metrics collector into various KV cache implementations (HiRadixCache, RadixCache, RadixCacheCpp, SWARadixCache) to observe and record eviction and load-back durations directly within the cache operations.
  • Prefill Loop Count Tracking: Added a prefill_loop_count attribute to ScheduleBatch requests to track how many times a request goes through the prefill loop, enabling better analysis of chunked prefill behavior.
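For orientation, here is a minimal sketch of what such metrics could look like when defined with prometheus_client. The metric names, label set, and bucket boundaries below are illustrative assumptions, not the exact definitions merged in this PR.

```python
# Illustrative only: names, labels, and buckets are assumptions,
# not the exact definitions from this PR.
from prometheus_client import Gauge, Histogram

new_token_ratio = Gauge(
    "sglang:new_token_ratio",
    "Ratio controlling how many new tokens are admitted per decode step.",
    ["model_name"],
)
eviction_duration = Histogram(
    "sglang:eviction_duration_seconds",
    "Time spent evicting KV cache entries from GPU memory.",
    ["model_name"],
    buckets=[0.001, 0.01, 0.1, 1.0, 10.0],
)
load_back_duration = Histogram(
    "sglang:load_back_duration_seconds",
    "Time spent loading KV cache entries back onto the GPU.",
    ["model_name"],
    buckets=[0.001, 0.01, 0.1, 1.0, 10.0],
)
```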

@gemini-code-assist (Bot) left a comment:

Code Review

This pull request introduces new metrics for the scheduler and cache components, including eviction duration, load back duration, and chunked prefill loop count. The changes are generally well-implemented, but I've identified a potential runtime error due to an incorrect keyword argument and several opportunities to improve metric collection logic and code maintainability. My review includes suggestions to fix these issues.

Outdated comment threads on:

  • python/sglang/srt/managers/scheduler.py
  • python/sglang/srt/mem_cache/radix_cache.py
  • python/sglang/srt/mem_cache/radix_cache_cpp.py
  • python/sglang/srt/mem_cache/swa_radix_cache.py
  • python/sglang/srt/metrics/collector.py
Comment thread on python/sglang/srt/server_args.py (outdated):

```python
"--bucket-eviction-duration",
type=float,
nargs="+",
default=ServerArgs.bucket_eviction_duration,
```
@xiezhq-hermann (Collaborator):
I think we shouldn't put them all into server args; instead, give them default values and make them configurable through environment variables or a JSON file.

@ShawnKung (Author):

Good idea! Putting everything into server args is indeed a bit verbose. I’m going to revise it and push a new commit.

@xiezhq-hermann (Collaborator):

Thanks a lot for the PR, these are all reasonable metrics to collect. My only complaint is that we probably do not want all the minor configs as server config parameters, as I mentioned in the comment.

@ShawnKung (Author):

> Thanks a lot for the PR, these are all reasonable metrics to collect. My only complaint is that we probably do not want all the minor configs as server config parameters, as I mentioned in the comment.

Good idea! I've moved them into env vars. Thanks for the suggestion.

Comment thread on python/sglang/srt/managers/scheduler.py (outdated):
```diff
@@ -1695,6 +1701,7 @@ def get_new_batch_prefill(self) -> Optional[ScheduleBatch]:
             # only record queue time when enable_metrics is True to avoid overhead
             for req in can_run_list:
                 req.queue_time_end = time.perf_counter()
+                req.prefill_loop_count += 1
```
@xiezhq-hermann (Collaborator):

Can you elaborate on what this loop count is for? It feels quite similar to what the queuing time is already capturing.

@ShawnKung (Author) commented Sep 16, 2025:

> Can you elaborate on what this loop count is for? It feels quite similar to what the queuing time is already capturing.

@xiezhq-hermann This is meant to capture something slightly different from queuing time.

  • Queuing time = how long a request waits before prefill starts.
  • Prefill loop count = how many rounds a request goes through during prefill.
    The main purpose is to check if our chunking configuration is reasonable. For example, in some deployment scenarios, if most prefill requests go through multiple rounds, it may indicate that the chunked_prefill_size could be increased to reduce prefill rounds.
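A quick back-of-envelope check of the relationship discussed here, assuming a fixed chunked_prefill_size; the values below are made up for illustration:

```python
# If chunking is the only reason a request re-enters prefill, the round
# count is roughly ceil(input_len / chunked_prefill_size).
import math

input_len = 20_000           # hypothetical prompt length in tokens
chunked_prefill_size = 8192  # hypothetical chunk size
expected_rounds = math.ceil(input_len / chunked_prefill_size)
print(expected_rounds)       # -> 3
```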

@xiezhq-hermann (Collaborator):

I am a bit confused: when a request gets chunked, the next chunk also comes from the same request, so the number of rounds a request is processed when chunking happens is basically length / chunk size. That can be easily derived from the input length and the chunk size instead of recording a runtime metric.

@ShawnKung (Author):

@xiezhq-hermann I agree with your point that in theory the chunk count can be derived as length / chunk_size. However, since there’s currently no metric that reflects the actual chunk_size, when I need to compute the request’s chunk count I would have to hardcode the chunk size in the expression. That makes it less flexible and not aligned with the real deployment setup, where the chunk size may vary. Having a runtime metric is therefore more convenient and more direct in practice.

Comment thread (outdated):

```diff
-        self.metrics_collector = SchedulerMetricsCollector(labels=labels)
+        self.metrics_collector = SchedulerMetricsCollector(
+            labels=labels,
+            bucket_eviction_duration=self.get_histogram_conf_from_env(
```
@xiezhq-hermann (Collaborator) commented Sep 15, 2025:

I think we can do the environment variable reading inside collector.py.

@ShawnKung (Author):

> I think we can do the environment variable reading inside collector.py.

Good point.
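A minimal sketch of what reading bucket boundaries inside collector.py could look like; the environment variable name, separator, and defaults are assumptions for illustration, and only the helper's name is taken from the diff above.

```python
import os

# Hypothetical parser: read space-separated histogram bucket boundaries
# from an environment variable, falling back to a default list.
def get_histogram_conf_from_env(env_name: str, default: list) -> list:
    raw = os.environ.get(env_name)
    if not raw:
        return default
    return [float(x) for x in raw.split()]

buckets = get_histogram_conf_from_env(
    "SGLANG_BUCKET_EVICTION_DURATION", [0.001, 0.01, 0.1, 1.0, 10.0]
)
```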

Comment thread (outdated):

```diff
@@ -293,6 +304,11 @@ def evict(self, num_tokens: int):
             assert node.backuped
             self._evict_backuped(node)

+        if num_evicted > 0 and self.scheduler_metrics_collector is not None:
+            self.scheduler_metrics_collector.observe_eviction_duration(
```
@xiezhq-hermann (Collaborator):

Is this to measure the time taken for eviction to complete? Is it primarily for understanding the eviction overhead, or is there another purpose I'm not aware of? This overhead will vary a lot based on the number of tokens to evict, so simply recording the time probably wouldn't make much sense.

@ShawnKung (Author) commented Sep 16, 2025:

> Is this to measure the time taken for eviction to complete? Is it primarily for understanding the eviction overhead, or is there another purpose I'm not aware of? This overhead will vary a lot based on the number of tokens to evict, so simply recording the time probably wouldn't make much sense.

@xiezhq-hermann You make a good point.
My initial idea was mainly to have a quick observable signal: in cases where the system slows down noticeably, we can immediately see whether eviction or write-back is taking up significant time.
I think we could enhance the story by also recording evicted/write-back tokens and total bytes. With these, we could compute average time per token and effective eviction throughput, which would give a more complete picture of the overhead.
What do you think?
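A sketch of the enhancement proposed above: time the eviction pass and record the evicted-token count alongside it, so throughput (tokens/s) can be derived downstream. Apart from observe_eviction_duration, which appears in the diff, the collector method names and the return value of evict are hypothetical.

```python
import time

def evict_with_metrics(cache, num_tokens: int, collector) -> None:
    # Time the whole eviction pass.
    start = time.perf_counter()
    num_evicted = cache.evict(num_tokens)  # assumed to return the evicted count
    duration = time.perf_counter() - start
    if num_evicted > 0 and collector is not None:
        collector.observe_eviction_duration(duration)
        # Hypothetical counter; combined with the duration it yields an
        # effective eviction throughput of num_evicted / duration tokens/s.
        collector.increment_evicted_tokens(num_evicted)
```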

@xiezhq-hermann (Collaborator):

I think recording the number of evicted tokens will be a helpful metric for understanding memory activity; e.g., we can plot the number of evicted tokens over time to see the peaks and lows.

@xiezhq-hermann (Collaborator):

The overhead of eviction might not be as interesting, but I can see it being useful. We are pushing to replace the hiradix tree with a C++ version, after which this overhead might become less significant.

@ShawnKung (Author):

@xiezhq-hermann I agree: adding the number of evicted tokens is a good idea; I'll include that metric.
As for the eviction time, while it may become less relevant once the C++ version is in place, I think it's still useful at this stage. In our case, we may also develop a custom distributed storage backend for the hiradix tree, and having this metric framework in place will make it easier to extend with additional metrics we need later. That said, if you think the eviction time metric is unnecessary for the community, I can drop it.

@ShawnKung force-pushed the feat/kvcache-scheduler-metrics branch from 5083933 to f69ff4a on September 17, 2025 at 07:32
@ShawnKung (Author):

@xiezhq-hermann Hello, I've added these two new metrics for evicted and loaded-back token counts:

  • sglang:evicted_tokens_total: A new Counter metric that tracks the total number of tokens evicted from the GPU to the CPU.
  • sglang:load_back_tokens_total: A new Counter metric that tracks the total number of tokens loaded back from the CPU to the GPU.
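For illustration, here is what such definitions could look like with prometheus_client; the label set is an assumption, not taken from the PR.

```python
from prometheus_client import Counter

# Illustrative only; the actual label set in the PR may differ.
evicted_tokens_total = Counter(
    "sglang:evicted_tokens_total",
    "Total number of KV cache tokens evicted from GPU to CPU.",
    ["model_name"],
)
load_back_tokens_total = Counter(
    "sglang:load_back_tokens_total",
    "Total number of KV cache tokens loaded back from CPU to GPU.",
    ["model_name"],
)
```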

@xiezhq-hermann (Collaborator):

Thank you @ShawnKung for the continued updates. I have two final suggestions:

  1. I still find the loop_count rather less meaningful; it is easier to just log the chunk size when starting the serving engine, and querying this over and over again is unnecessary.
  2. It feels rather odd to pass the scheduler_metrics_collector to every radix implementation; how about we just create a collector in the base radix cache so every derived class can use it?

@ShawnKung (Author):

> Thank you @ShawnKung for the continued updates. I have two final suggestions:
>
>   1. I still find the loop_count rather less meaningful; it is easier to just log the chunk size when starting the serving engine, and querying this over and over again is unnecessary.
>   2. It feels rather odd to pass the scheduler_metrics_collector to every radix implementation; how about we just create a collector in the base radix cache so every derived class can use it?

@xiezhq-hermann Thank you for the suggestions. I've removed the sglang:prefill_loop_count metric since it was not very meaningful.
I've also extracted a new metric collector RadixCacheMetricsCollector and a base cache class BasePrefixCacheMetricsMixin for cache-related metrics.
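A minimal sketch of the refactor described above, with a mixin on the base prefix cache owning the collector so derived caches share one recording path; the method and attribute names here are assumptions based on this comment, not the merged code.

```python
class BasePrefixCacheMetricsMixin:
    """Hypothetical mixin: derived caches (RadixCache, HiRadixCache,
    SWARadixCache, ...) record cache metrics through one shared path."""

    def init_cache_metrics(self, collector=None) -> None:
        self._cache_metrics_collector = collector

    def record_eviction(self, duration_s: float, num_tokens: int) -> None:
        if self._cache_metrics_collector is not None:
            self._cache_metrics_collector.observe_eviction_duration(duration_s)
            self._cache_metrics_collector.increment_evicted_tokens(num_tokens)

    def record_load_back(self, duration_s: float, num_tokens: int) -> None:
        if self._cache_metrics_collector is not None:
            self._cache_metrics_collector.observe_load_back_duration(duration_s)
            self._cache_metrics_collector.increment_load_back_tokens(num_tokens)
```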

@ShawnKung (Author):

@xiezhq-hermann Hi, just wanted to check in — I've addressed the latest comments.
Please let me know if there are any other issues I should fix.
If everything looks good, I'll resolve the conflicts so it can be merged.
Thanks!

@xiezhq-hermann (Collaborator):

> @xiezhq-hermann Hi, just wanted to check in — I've addressed the latest comments. Please let me know if there are any other issues I should fix. If everything looks good, I'll resolve the conflicts so it can be merged. Thanks!

I will handle the rest and get it merged soon. Thanks!

@MMuzzammil1 (Contributor):

@ShawnKung thanks for the PR. Do you think it would also be nice to have metrics for evict_host as well (either a flag for when host memory is exhausted, or a simple count of tokens being evicted from host memory)? This could be a good indicator to assess the memory status of the L2 cache.

@taegeonum:

@ShawnKung When do you think it is going to be merged? We'd like to use this feature to track cache/eviction related metrics.

@ShawnKung (Author):

@xiezhq-hermann Hi, just checking in — seems there’s more interest in this feature now.
Is there any plan or timeline for merging?
Happy to help with anything needed to get it merged. Thanks!


Comment thread on python/sglang/srt/mem_cache/hiradix_cache.py (outdated)
@ShawnKung force-pushed the feat/kvcache-scheduler-metrics branch from c84f3f9 to 7e7413f on October 21, 2025 at 09:02
@ShawnKung force-pushed the feat/kvcache-scheduler-metrics branch from 7e7413f to 71a4447 on October 21, 2025 at 09:06
@xiezhq-hermann (Collaborator):

@ShawnKung I made some changes to it; can you take a look and let me know how you feel about it? Thanks!

@ShawnKung (Author):

> @ShawnKung I made some changes to it; can you take a look and let me know how you feel about it? Thanks!

@xiezhq-hermann Thanks! The changes look good to me.

@xiezhq-hermann added the ready-to-merge label (the PR is ready to merge after the CI is green) on Nov 4, 2025
@xiezhq-hermann enabled auto-merge (squash) on November 10, 2025 at 04:48
@hnyls2002 disabled auto-merge on November 10, 2025 at 10:14
@hnyls2002 merged commit afee284 into sgl-project:main on Nov 10, 2025
256 of 281 checks passed

Labels

high priority, ready-to-merge (the PR is ready to merge after the CI is green), run-ci


6 participants