feat(metrics): add scheduler and hiradix cache metrics (#10218) #10225
hnyls2002 merged 19 commits into sgl-project:main
Conversation
Summary of Changes
Hello @ShawnKung, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the observability of the SGLang scheduler and KV cache mechanisms by introducing a suite of new performance metrics. These additions aim to provide deeper insights into memory management efficiency and prefill process dynamics, which are crucial for optimizing system performance and identifying bottlenecks.
Highlights
- Scheduler Metrics Enhancement: Introduced new metrics for the scheduler, including `new_token_ratio` and histograms for eviction duration, load-back duration, and chunked prefill loop counts, to provide more granular performance insights (see the collector sketch after this list).
- KV Cache Metrics Integration: Integrated the new scheduler metrics collector into various KV cache implementations (HiRadixCache, RadixCache, RadixCacheCpp, SWARadixCache) to observe and record eviction and load-back durations directly within the cache operations.
- Prefill Loop Count Tracking: Added a `prefill_loop_count` attribute to `ScheduleBatch` requests to track how many times a request goes through the prefill loop, enabling better analysis of chunked prefill behavior.
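To make the highlights concrete, below is a minimal sketch of how such histogram metrics can be declared and observed with `prometheus_client`; the metric name, label handling, and bucket boundaries are illustrative assumptions, not necessarily what the PR's `SchedulerMetricsCollector` actually uses.

```python
from prometheus_client import Histogram

# Minimal sketch (assumed metric name and buckets, not the PR's exact code) of
# how an eviction-duration histogram could be declared and observed.
class SchedulerMetricsCollectorSketch:
    def __init__(self, labels: dict) -> None:
        self.labels = labels
        self.eviction_duration = Histogram(
            "sglang_scheduler_eviction_duration_seconds",  # assumed name
            "Time spent evicting KV cache entries.",
            labelnames=list(labels.keys()),
            buckets=[0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0],  # assumed buckets
        )

    def observe_eviction_duration(self, duration: float) -> None:
        # Record one eviction's wall-clock duration under this scheduler's labels.
        self.eviction_duration.labels(**self.labels).observe(duration)
```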
Code Review
This pull request introduces new metrics for the scheduler and cache components, including eviction duration, load back duration, and chunked prefill loop count. The changes are generally well-implemented, but I've identified a potential runtime error due to an incorrect keyword argument and several opportunities to improve metric collection logic and code maintainability. My review includes suggestions to fix these issues.
| "--bucket-eviction-duration", | ||
| type=float, | ||
| nargs="+", | ||
| default=ServerArgs.bucket_eviction_duration, |
I think we shouldn't put them all into server args; instead, have default values and maybe allow configuration through environment variables or a JSON file.
Good idea! Putting everything into server args is indeed a bit verbose. I’m going to revise it and push a new commit.
Thanks a lot for the PR, these are all reasonable metrics to collect. My only complaint is that we probably do not want to put all the minor configs into server config parameters, as I mentioned in the comment.
Good idea! I've moved them into env vars. Thanks for the suggestion.
    @@ -1695,6 +1701,7 @@ def get_new_batch_prefill(self) -> Optional[ScheduleBatch]:
        # only record queue time when enable_metrics is True to avoid overhead
        for req in can_run_list:
            req.queue_time_end = time.perf_counter()
            req.prefill_loop_count += 1
Can you elaborate on what this loop count is for? It actually feels quite similar to what the queuing time is already doing.
@xiezhq-hermann This is meant to capture something slightly different from queuing time.
- Queuing time = how long a request waits before prefill starts.
- Prefill loop count = how many rounds a request goes through during prefill.
The main purpose is to check if our chunking configuration is reasonable. For example, in some deployment scenarios, if most prefill requests go through multiple rounds, it may indicate that the chunked_prefill_size could be increased to reduce prefill rounds.
I am a bit confused: when a request gets chunked, the next chunk also comes from the same request, so the number of rounds a request is processed over when chunking happens is basically length / chunk size, which can easily be analysed from the input length and the chunk size instead of recording a runtime metric.
@xiezhq-hermann I agree with your point that in theory the chunk count can be derived as length / chunk_size. However, since there’s currently no metric that reflects the actual chunk_size, when I need to compute the request’s chunk count I would have to hardcode the chunk size in the expression. That makes it less flexible and not aligned with the real deployment setup, where the chunk size may vary. Having a runtime metric is therefore more convenient and more direct in practice.
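For reference, the theoretical relationship discussed above can be written down directly; the helper below is purely illustrative (the function name and the fixed `chunked_prefill_size` argument are assumptions) and shows why a runtime counter is more convenient when the chunk size varies across deployments.

```python
import math

# Illustrative only: the theoretical number of prefill rounds for a request,
# assuming a fixed chunked_prefill_size. The runtime prefill_loop_count metric
# avoids hardcoding this value, which may differ between deployments.
def expected_prefill_rounds(input_len: int, chunked_prefill_size: int) -> int:
    return max(1, math.ceil(input_len / chunked_prefill_size))

# e.g. a 10k-token prompt with chunked_prefill_size=4096 needs 3 prefill rounds
assert expected_prefill_rounds(10_000, 4096) == 3
```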
    self.metrics_collector = SchedulerMetricsCollector(labels=labels)
    self.metrics_collector = SchedulerMetricsCollector(
        labels=labels,
        bucket_eviction_duration=self.get_histogram_conf_from_env(
I think we can do the environment variable reading inside the collector.py
good point
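A minimal sketch of what reading the bucket configuration inside the collector module could look like, assuming the buckets arrive as a comma-separated float list; the environment variable name, helper name, and default buckets are hypothetical, not necessarily the ones merged in the PR.

```python
import os
from typing import List, Optional

# Hypothetical helper for reading histogram buckets from an environment
# variable inside the collector module, e.g.
#   SGLANG_BUCKET_EVICTION_DURATION="0.01,0.1,0.5,1,2,5,10"
# The variable name and default buckets are illustrative assumptions.
DEFAULT_EVICTION_DURATION_BUCKETS: List[float] = [0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0]

def buckets_from_env(env_name: str, default: List[float]) -> List[float]:
    raw: Optional[str] = os.environ.get(env_name)
    if not raw:
        return default
    return [float(x) for x in raw.split(",")]
```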
    @@ -293,6 +304,11 @@ def evict(self, num_tokens: int):
            assert node.backuped
            self._evict_backuped(node)

        if num_evicted > 0 and self.scheduler_metrics_collector is not None:
            self.scheduler_metrics_collector.observe_eviction_duration(
Is this to measure the time taken for eviction to complete? Is it primarily for understanding the eviction overhead, or is there another purpose I am not aware of here? This overhead will vary a lot based on the number of tokens to evict, so simply recording the time probably wouldn't make much sense.
@xiezhq-hermann You make a good point.
My initial idea was mainly to have a quick observable signal: in cases where the system slows down noticeably, we can immediately see whether eviction or write-back is taking up significant time.
I think we could enhance the story by also recording evicted/write-back tokens and total bytes. With these, we could compute average time per token and effective eviction throughput, which would give a more complete picture of the overhead.
What do you think?
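As a rough illustration of that proposal, the wrapper below times an eviction and records both the duration and the evicted-token count, from which time per token and eviction throughput can be derived; the collector method names and the assumption that `evict` returns the number of freed tokens are hypothetical.

```python
import time

# Rough sketch (assumed interfaces): record both the eviction duration and the
# number of evicted tokens so that tokens/second throughput can be derived.
def evict_with_metrics(cache, num_tokens: int, collector=None) -> int:
    start = time.perf_counter()
    num_evicted = cache.evict(num_tokens)  # assumed to return the evicted count
    duration = time.perf_counter() - start
    if collector is not None and num_evicted > 0:
        collector.observe_eviction_duration(duration)  # assumed method
        collector.observe_evicted_tokens(num_evicted)  # assumed method
    return num_evicted
```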
I think recording the number of evicted tokens will be a helpful metric for understanding memory activity, e.g. we can plot the number of evicted tokens over time to see the peaks and lows, etc.
The overhead of eviction might not be as interesting, but I can see it being useful. We are pushing to replace the hiradix tree with a C++ version, and then this overhead might become less significant.
@xiezhq-hermann I agree, adding the number of evicted tokens is a good idea; I'll include that metric.
As for the eviction time, while it may become less relevant once the C++ version is in place, I think it's still useful at this stage. In our case, we may also develop a custom distributed storage backend for the hiradix tree, and having this metric framework in place will make it easier to extend with additional metrics we need later. That said, if you think the eviction time metric is unnecessary for the community, I can drop it.
@xiezhq-hermann hello, I've added these two new metrics for the numbers of evicted and loaded-back tokens.
Thank you @ShawnKung for the continued updates. I have two final suggestions:
@xiezhq-hermann Thank you for the suggestions. I've removed the |
@xiezhq-hermann Hi, just wanted to check in: I've addressed the latest comments.
I will handle the rest and get it merged soon, thanks!
@ShawnKung thanks for the PR. Do you think it would also be nice to have metrics for the |
@ShawnKung When do you think it is going to be merged? We'd like to use this feature to track cache/eviction-related metrics.
@xiezhq-hermann Hi, just checking in: it seems there's more interest in this feature now.
@ShawnKung I made some changes to it, can you take a look and let me know how you feel about it? Thanks!
@xiezhq-hermann Thanks! The changes look good to me.
Motivation
See Issue #10218 for detailed background and motivation.
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist