
[HiCache] feat: Add detailed cache hit breakdown for HiCache in sglext and Prometheus metrics#17648

Merged
ishandhanani merged 16 commits into sgl-project:main from vladnosiv:cached-details · Feb 3, 2026
Conversation

@vladnosiv (Contributor) commented Jan 23, 2026

Motivation

When using HiCache, users need visibility into which cache level served their cached tokens. This is critical for debugging and monitoring (especially when HiCache uses external cache storage, like MoonCake).

Previously, only the total cached_tokens count was reported in usage and metrics, without a breakdown by source.

Modifications

OpenAI API Response (sglext)

Added cached_tokens_details to sglext extension object (returned per-choice when requested). This maintains OpenAI API compatibility by keeping extensions separate.

To receive the detailed breakdown, include return_cached_tokens_details: true in the request (similar to return_routed_experts).

Fields in cached_tokens_details:

  • device - Tokens from GPU KV cache (L1)
  • host - Tokens from CPU memory cache (L2)
  • storage - Tokens from the storage backend (L3); only present when L3 is enabled
  • storage_backend - Backend type (e.g., "HiCacheFile"); only present when L3 is enabled
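
For illustration, the extension payload could be modeled roughly as below. This is a minimal sketch, not the actual Pydantic CachedTokensDetails model in protocol.py; the class and method names here are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CachedTokensDetails:
    """Per-request breakdown of cached tokens by cache level (sketch)."""
    device: int = 0                        # L1: GPU KV cache
    host: int = 0                          # L2: CPU memory cache
    storage: Optional[int] = None          # L3: only set when L3 is enabled
    storage_backend: Optional[str] = None  # e.g. "HiCacheFile"

    def to_dict(self) -> dict:
        """Serialize, omitting the L3 fields when L3 is not enabled."""
        out = {"device": self.device, "host": self.host}
        if self.storage is not None:
            out["storage"] = self.storage
            out["storage_backend"] = self.storage_backend
        return out
```

With L3 disabled only device and host appear, matching the example responses further down.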

Files changed:

  • python/sglang/srt/entrypoints/openai/protocol.py - Added CachedTokensDetails model, PromptTokensDetails model, return_cached_tokens_details request field, cached_tokens_details in SglExt
  • python/sglang/srt/entrypoints/openai/utils.py - Added process_cached_tokens_details_from_ret() helper
  • python/sglang/srt/entrypoints/openai/serving_chat.py - Return cached_tokens_details in sgl_ext
  • python/sglang/srt/entrypoints/openai/serving_completions.py - Return cached_tokens_details in sgl_ext
  • python/sglang/srt/entrypoints/openai/usage_processor.py - Use PromptTokensDetails model for type safety
  • python/sglang/srt/managers/scheduler_output_processor_mixin.py - Build breakdown from request state

Prometheus Metrics

Extended sglang:cached_tokens_total counter with cache_source label:

  • cache_source="device" - L1 hits
  • cache_source="host" - L2 hits
  • cache_source="storage_{backend}" - L3 hits (e.g., storage_HiCacheFile)
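
The label derivation can be sketched as follows. This is an illustrative helper, not the actual collector.py code; the function name cache_source_increments is hypothetical:

```python
def cache_source_increments(device, host, storage=0, storage_backend=None):
    """Map a per-request cache-hit breakdown to (cache_source, tokens)
    pairs matching the label scheme above. Zero-valued sources are
    skipped so no empty counter series are created."""
    pairs = [("device", device), ("host", host)]
    if storage and storage_backend:
        pairs.append((f"storage_{storage_backend}", storage))
    return [(src, n) for src, n in pairs if n > 0]
```

Each returned pair would then be used to increment the counter with the corresponding cache_source label value.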

Files changed:

  • python/sglang/srt/metrics/collector.py - Report tokens by source

Storage Hit Tracking

Added proper tracking of tokens loaded from L3 storage during prefetch:

Files changed:

  • python/sglang/srt/managers/schedule_batch.py - Added storage_hit_length and _cache_breakdown_computed fields
  • python/sglang/srt/mem_cache/hiradix_cache.py - Track prefetched tokens per request via prefetch_loaded_tokens_by_reqid
  • python/sglang/srt/managers/scheduler.py - Pop storage hit count after prefetch completion
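
The prefetch bookkeeping follows a record-then-pop pattern; a rough sketch under assumed names (the class and methods here are illustrative, not the actual hiradix_cache.py code):

```python
class PrefetchHitTracker:
    """Track tokens loaded from L3 storage per request id; the scheduler
    pops the count once prefetch completes (sketch)."""

    def __init__(self):
        self._loaded = {}  # rid -> tokens loaded from storage

    def record(self, rid, num_tokens):
        """Accumulate tokens loaded during prefetch for a request."""
        self._loaded[rid] = self._loaded.get(rid, 0) + num_tokens

    def pop(self, rid):
        """Return and clear the storage hit count for a request.
        Returns 0 for requests that never hit storage; popping also
        serves as cleanup for aborted requests."""
        return self._loaded.pop(rid, 0)
```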

Accuracy Tests

N/A - This change only affects metadata reporting, not model outputs.

Benchmarking and Profiling

N/A - No impact on inference speed.

Example Outputs

Request with cache details:

{
    "messages": [...],
    "return_cached_tokens_details": true
}

Response - No cache hit:

"usage": {
    "prompt_tokens": 100,
    "completion_tokens": 50,
    "total_tokens": 150
}

Response - L1 (device) cache hit:

"choices": [...],
"usage": {
    "prompt_tokens": 14256,
    "prompt_tokens_details": {
        "cached_tokens": 14256
    },
    ...
},
"sglext": {
    "cached_tokens_details": {
        "device": 14256,
        "host": 0
    }
}

Response - L3 (storage) cache hit after server restart:

"choices": [...],
"usage": {
    "prompt_tokens": 14256,
    "prompt_tokens_details": {
        "cached_tokens": 14256
    },
    ...
},
"sglext": {
    "cached_tokens_details": {
        "device": 0,
        "host": 0,
        "storage": 14256,
        "storage_backend": "HiCacheFile"
    }
}
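
A client can read the breakdown straight off the parsed response body; a small sketch (the helper name summarize_cache_hit is hypothetical):

```python
def summarize_cache_hit(response: dict) -> str:
    """Summarize which cache tiers served a request, given a parsed
    response body that carries the sglext extension object."""
    details = response.get("sglext", {}).get("cached_tokens_details")
    if not details:
        return "no cache hit"
    # Keep only non-zero token counts; storage_backend is a string label.
    parts = [f"{src}={n}" for src, n in details.items()
             if src != "storage_backend" and n > 0]
    backend = details.get("storage_backend")
    if backend and details.get("storage"):
        parts.append(f"backend={backend}")
    return ", ".join(parts) or "no cache hit"
```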

Prometheus metrics:

sglang:cached_tokens_total{cache_source="device",model_name="Qwen/Qwen3-0.6B"} 14256.0
sglang:cached_tokens_total{cache_source="host",model_name="Qwen/Qwen3-0.6B"} 14256.0
sglang:cached_tokens_total{cache_source="storage_HiCacheFile",model_name="Qwen/Qwen3-0.6B"} 14256.0
(Grafana dashboard screenshot omitted)

Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
@gemini-code-assist (Bot) commented:
Summary of Changes

Hello @vladnosiv, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the observability of HiCache by providing a detailed breakdown of where cached tokens originate within the system. Previously, only a total count of cached tokens was available, making it difficult to diagnose performance or understand cache effectiveness across different levels (GPU, CPU, and external storage). The changes introduce new data models in the OpenAI API response to expose this per-tier breakdown and extend Prometheus metrics to allow for granular monitoring of cache hits by source, offering critical insights for debugging and operational analysis.

Highlights

  • Enhanced Cache Visibility: Introduced detailed breakdown of cached tokens by source (device, host, storage) in the OpenAI API response, specifically within the prompt_tokens_details field.
  • Granular Prometheus Metrics: Extended the sglang:cached_tokens_total Prometheus metric with a cache_source label to differentiate hits from device (L1), host (L2), and storage (L3) caches, providing more granular monitoring capabilities.
  • Improved Storage Hit Tracking: Implemented precise tracking of tokens loaded from L3 storage during prefetch operations, ensuring accurate reporting for storage-level cache hits and proper cleanup for aborted requests.


@gemini-code-assist (Bot) left a comment:
Code Review

This pull request introduces a valuable feature for HiCache by adding a detailed breakdown of cache hits in both the API usage response and Prometheus metrics. This will significantly improve observability and debugging capabilities. The changes are well-structured across multiple files, primarily plumbing the new detailed cache information through the system. The core logic for calculating and reporting the breakdown is sound. I've provided a couple of suggestions to refactor the aggregation logic in usage_processor.py and metrics/collector.py to enhance code readability and maintainability. Overall, this is a solid contribution.

Comment thread python/sglang/srt/entrypoints/openai/usage_processor.py Outdated
Comment thread python/sglang/srt/metrics/collector.py
Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
@stmatengss (Collaborator) commented:

Good feature. HiCache currently lacks visibility and monitoring.

@vladnosiv (Contributor, Author) commented:

Good feature. HiCache currently lacks visibility and monitoring.

I'm glad to hear it. I'd also like to add latency monitoring for retrieving pages from the L2 and L3 caches: right now, in the HiCache + MoonCake setup, there is no good way to see the latency at which tokens were delivered, due to the batched structure of requests to MoonCake itself.

@ishandhanani (Collaborator) commented:

A thought - will we be breaking OAI API compatibility with this addition?

In this PR #17434, we added a struct to return out of band information via an sglext struct. You'll notice that it is very similar to the nvext we use in Dynamo. What do you think about returning information via here?

@vladnosiv (Contributor, Author) commented:

In this PR #17434, we added a struct to return out of band information via an sglext struct. You'll notice that it is very similar to the nvext we use in Dynamo. What do you think about returning information via here?

Yes, that sounds great; I'll switch to that approach tomorrow.

@stmatengss (Collaborator) commented:

@ishandhanani I think you are a professional in this area. PTAL

Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
@vladnosiv (Contributor, Author) commented:

updated to sgl_ext

@vladnosiv changed the title from "[HiCache] feat: Add detailed cache hit breakdown for HiCache in usage response and Prometheus metrics" to "[HiCache] feat: Add detailed cache hit breakdown for HiCache in sgl_ext and Prometheus metrics" on Jan 29, 2026
@ishandhanani (Collaborator) commented Jan 30, 2026

AI review for reference: https://app.devin.ai/review/sgl-project/sglang/pull/17648 @vladnosiv

A few flags and a possible error were caught.

Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
…h l3

Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
@vladnosiv (Contributor, Author) commented:

ready to merge

@ishandhanani (Collaborator) commented:

/tag-and-rerun-ci

@github-actions github-actions Bot added the run-ci label Feb 3, 2026
@ishandhanani changed the title from "[HiCache] feat: Add detailed cache hit breakdown for HiCache in sgl_ext and Prometheus metrics" to "[HiCache] feat: Add detailed cache hit breakdown for HiCache in sglext and Prometheus metrics" on Feb 3, 2026
@ishandhanani ishandhanani merged commit e166ca8 into sgl-project:main Feb 3, 2026
85 of 99 checks passed
alisonshao added a commit that referenced this pull request Feb 5, 2026
PR #17648 moved the sgl_ext field from choices[0] to the response
level and renamed it to sglext. The test was not updated accordingly,
causing CI failures.

Changes:
- Access response.get("sglext") instead of choices[0].get("sgl_ext")
- Update field name from sgl_ext to sglext (no underscore)

Failure example: https://github.com/sgl-project/sglang/actions/runs/21653054987/job/62421861711
charlesHsuGG pushed a commit to charlesHsuGG/sglang that referenced this pull request Feb 5, 2026
sfiisf pushed a commit to sfiisf/sglang that referenced this pull request Feb 5, 2026
RubiaCx pushed a commit to RubiaCx/sglang that referenced this pull request Feb 8, 2026
charlesHsuGG pushed a commit to charlesHsuGG/sglang that referenced this pull request Feb 9, 2026
charlesHsuGG pushed a commit to charlesHsuGG/sglang that referenced this pull request Feb 9, 2026
Johnsonms pushed a commit to Johnsonms/sglang that referenced this pull request Feb 14, 2026
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026