[HiCache] feat: Add detailed cache hit breakdown for HiCache in sglext and Prometheus metrics#17648
Conversation
Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
Summary of Changes (Gemini Code Assist): This pull request enhances the observability of HiCache by providing a detailed breakdown of where cached tokens originate within the system. Previously, only a total count of cached tokens was available, making it difficult to diagnose performance or understand cache effectiveness across different levels (GPU, CPU, and external storage). The changes introduce new data models in the OpenAI API response to expose this per-tier breakdown and extend Prometheus metrics to allow granular monitoring of cache hits by source, offering useful insights for debugging and operational analysis.
Code Review
This pull request introduces a valuable feature for HiCache by adding a detailed breakdown of cache hits in both the API usage response and Prometheus metrics. This will significantly improve observability and debugging capabilities. The changes are well-structured across multiple files, primarily plumbing the new detailed cache information through the system. The core logic for calculating and reporting the breakdown is sound. I've provided a couple of suggestions to refactor the aggregation logic in usage_processor.py and metrics/collector.py to enhance code readability and maintainability. Overall, this is a solid contribution.
Good feature. HiCache currently lacks visibility and monitoring.
I'm glad to hear it. I'd also like to add latency monitoring for retrieving pages from the L2 and L3 cache: right now, with the HiCache + MoonCake combination, there doesn't seem to be a good way to understand the latency at which tokens were delivered, due to the batched structure of requests to MoonCake itself.
A thought: will we be breaking OAI API compatibility with this addition? In PR #17434, we added a struct to return out-of-band information via an
Yes, that sounds great. I'll update it for this approach tomorrow.
@ishandhanani I think you are a professional in this area. PTAL |
updated to sgl_ext |
AI review for reference: https://app.devin.ai/review/sgl-project/sglang/pull/17648 @vladnosiv A few flags and a possible error were caught.
…h l3 Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
ready to merge |
/tag-and-rerun-ci |
PR #17648 moved the `sgl_ext` field from `choices[0]` to the response level and renamed it to `sglext`. The test was not updated accordingly, causing CI failures.

Changes:
- Access `response.get("sglext")` instead of `choices[0].get("sgl_ext")`
- Update the field name from `sgl_ext` to `sglext` (no underscore)

Failure example: https://github.com/sgl-project/sglang/actions/runs/21653054987/job/62421861711
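A minimal before/after sketch of the access-path change described above (the response dict is a hand-made example, not actual server output):

```python
# Hand-made example response; field names follow the PR description.
response = {
    "choices": [{"message": {"content": "hi"}}],
    "sglext": {"cached_tokens_details": {"device": 32, "host": 0}},
}

# Old access path (broken after the move): per-choice "sgl_ext"
old = response["choices"][0].get("sgl_ext")  # now None

# New access path: response-level "sglext" (no underscore)
new = response.get("sglext")

print(old)
print(new["cached_tokens_details"]["device"])
```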
…xt` and Prometheus metrics (sgl-project#17648) Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com> Co-authored-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com>
Motivation
When using HiCache, users need visibility into which cache level served their cached tokens. This is critical for debugging and monitoring (especially when HiCache uses external cache storage, like MoonCake).
Previously, only the total `cached_tokens` was reported in usage and metrics, without a breakdown by source.

Modifications
OpenAI API Response (`sglext`)

Added `cached_tokens_details` to the `sglext` extension object (returned per-choice when requested). This maintains OpenAI API compatibility by keeping extensions separate.

To receive the detailed breakdown, include `return_cached_tokens_details: true` in the request (similar to `return_routed_experts`).

Fields in `cached_tokens_details`:
- `device` - Tokens from the GPU KV cache (L1)
- `host` - Tokens from the CPU memory cache (L2)
- `storage` - Tokens from the storage backend (L3); only present when L3 is enabled
- `storage_backend` - Backend type (e.g., "HiCacheFile"); only present when L3 is enabled

Files changed:
- `python/sglang/srt/entrypoints/openai/protocol.py` - Added `CachedTokensDetails` model, `PromptTokensDetails` model, `return_cached_tokens_details` request field, `cached_tokens_details` in `SglExt`
- `python/sglang/srt/entrypoints/openai/utils.py` - Added `process_cached_tokens_details_from_ret()` helper
- `python/sglang/srt/entrypoints/openai/serving_chat.py` - Return `cached_tokens_details` in `sgl_ext`
- `python/sglang/srt/entrypoints/openai/serving_completions.py` - Return `cached_tokens_details` in `sgl_ext`
- `python/sglang/srt/entrypoints/openai/usage_processor.py` - Use the `PromptTokensDetails` model for type safety
- `python/sglang/srt/managers/scheduler_output_processor_mixin.py` - Build the breakdown from request state
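A small sketch of how a client might read the per-tier breakdown out of a response dict. The exact response shape is assumed from the fields listed above, and `summarize_cache_hits` is a hypothetical helper, not part of the PR:

```python
# Sketch only: extracting the per-tier cache breakdown from a response
# dict. The shape of "sglext" is assumed from the PR description.

def summarize_cache_hits(response: dict) -> dict:
    """Return per-tier cached-token counts, defaulting absent tiers to 0."""
    details = (response.get("sglext") or {}).get("cached_tokens_details") or {}
    return {
        "device": details.get("device", 0),    # L1: GPU KV cache
        "host": details.get("host", 0),        # L2: CPU memory cache
        "storage": details.get("storage", 0),  # L3: storage backend
        "backend": details.get("storage_backend"),
    }

# Example: an L3 hit served entirely from the file backend
resp = {
    "sglext": {
        "cached_tokens_details": {
            "device": 0,
            "host": 0,
            "storage": 128,
            "storage_backend": "HiCacheFile",
        }
    }
}
print(summarize_cache_hits(resp))
```

Defaulting absent tiers to 0 mirrors the description above, where the `storage` fields only appear when L3 is enabled.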
Prometheus Metrics

Extended the `sglang:cached_tokens_total` counter with a `cache_source` label:
- `cache_source="device"` - L1 hits
- `cache_source="host"` - L2 hits
- `cache_source="storage_{backend}"` - L3 hits (e.g., `storage_HiCacheFile`)

Files changed:
- `python/sglang/srt/metrics/collector.py` - Report tokens by source
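The labelling scheme above can be emulated in plain Python to show how one logical counter fans out by `cache_source`. This is only an illustration; the real implementation lives in `python/sglang/srt/metrics/collector.py` and uses a Prometheus counter, and `record_cache_hits` is a hypothetical helper name:

```python
from collections import Counter

# Plain-Python stand-in for the labelled Prometheus counter
# sglang:cached_tokens_total, keyed here by the cache_source label.
cached_tokens_total = Counter()

def record_cache_hits(device, host, storage, backend=None):
    cached_tokens_total["device"] += device  # cache_source="device" (L1)
    cached_tokens_total["host"] += host      # cache_source="host" (L2)
    if backend is not None:
        # cache_source="storage_{backend}" (L3), e.g. storage_HiCacheFile
        cached_tokens_total[f"storage_{backend}"] += storage

record_cache_hits(device=512, host=128, storage=64, backend="HiCacheFile")
record_cache_hits(device=256, host=0, storage=0)
print(dict(cached_tokens_total))
```

Keeping one counter with a label, rather than three separate counters, lets PromQL sum or slice hits by source with a single metric name.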
Storage Hit Tracking

Added proper tracking of tokens loaded from L3 storage during prefetch:
Files changed:
- `python/sglang/srt/managers/schedule_batch.py` - Added `storage_hit_length` and `_cache_breakdown_computed` fields
- `python/sglang/srt/mem_cache/hiradix_cache.py` - Track prefetched tokens per request via `prefetch_loaded_tokens_by_reqid`
- `python/sglang/srt/managers/scheduler.py` - Pop the storage hit count after prefetch completion
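A sketch of the per-request prefetch bookkeeping described above: tokens loaded from L3 storage accumulate per request id and are popped exactly once when the scheduler consumes them. The class and method names here are illustrative, not the actual HiRadixCache API; only the dict name `prefetch_loaded_tokens_by_reqid` comes from the PR:

```python
class PrefetchTracker:
    """Illustrative stand-in for the cache-side prefetch bookkeeping."""

    def __init__(self):
        self.prefetch_loaded_tokens_by_reqid = {}

    def on_pages_loaded(self, rid, num_tokens):
        # Called as pages arrive from the storage backend during prefetch.
        self.prefetch_loaded_tokens_by_reqid[rid] = (
            self.prefetch_loaded_tokens_by_reqid.get(rid, 0) + num_tokens
        )

    def pop_storage_hit_length(self, rid):
        # Popping (not reading) ensures each request's storage hits are
        # reported exactly once after prefetch completes.
        return self.prefetch_loaded_tokens_by_reqid.pop(rid, 0)

tracker = PrefetchTracker()
tracker.on_pages_loaded("req-1", 64)
tracker.on_pages_loaded("req-1", 64)
print(tracker.pop_storage_hit_length("req-1"))  # accumulated total
print(tracker.pop_storage_hit_length("req-1"))  # 0 on a second pop
```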
Accuracy Tests

N/A - This change only affects metadata reporting, not model outputs.
Benchmarking and Profiling
N/A - No impact on inference speed.
Example Outputs
Request with cache details:
```json
{ "messages": [...], "return_cached_tokens_details": true }
```
Response - L1 (device) cache hit:
Response - L3 (storage) cache hit after server restart:
Prometheus metrics: