[HiCache] feat: Add detailed cache hit breakdown for HiCache in sglext and Prometheus metrics#17648
Conversation
Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
Summary of Changes (Gemini Code Assist): This pull request enhances the observability of HiCache by providing a detailed breakdown of where cached tokens originate within the system. Previously, only a total count of cached tokens was available, making it difficult to diagnose performance or understand cache effectiveness across different levels (GPU, CPU, and external storage). The changes introduce new data models in the OpenAI API response to expose this per-tier breakdown and extend Prometheus metrics to allow granular monitoring of cache hits by source, offering useful insights for debugging and operational analysis.
Code Review
This pull request introduces a valuable feature for HiCache by adding a detailed breakdown of cache hits in both the API usage response and Prometheus metrics. This will significantly improve observability and debugging capabilities. The changes are well-structured across multiple files, primarily plumbing the new detailed cache information through the system. The core logic for calculating and reporting the breakdown is sound. I've provided a couple of suggestions to refactor the aggregation logic in usage_processor.py and metrics/collector.py to enhance code readability and maintainability. Overall, this is a solid contribution.
Good feature. HiCache currently lacks visibility and monitoring.
I'm glad to hear it. I'd also like to add latency monitoring for retrieving pages from the L2 and L3 cache: right now, with the HiCache + MoonCake combination, there doesn't seem to be a good way to understand the latency at which tokens were delivered, due to the batched structure of requests to MoonCake itself.
A thought: will we be breaking OAI API compatibility with this addition? In PR #17434, we added a struct to return out-of-band information via an
Yes, that sounds great. I'll update it for this approach tomorrow.
@ishandhanani I think you are a professional in this area. PTAL |
updated to sgl_ext |
AI review for reference: https://app.devin.ai/review/sgl-project/sglang/pull/17648 @vladnosiv A few flags and a possible error were caught.
…h l3 Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
ready to merge |
/tag-and-rerun-ci |
PR #17648 moved the `sgl_ext` field from `choices[0]` to the response level and renamed it to `sglext`. The test was not updated accordingly, causing CI failures.

Changes:
- Access `response.get("sglext")` instead of `choices[0].get("sgl_ext")`
- Update the field name from `sgl_ext` to `sglext` (no underscore)

Failure example: https://github.com/sgl-project/sglang/actions/runs/21653054987/job/62421861711
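A minimal before/after sketch of the access-path change described above (the response dict is a hand-made example, not actual server output):

```python
# Hand-made example response; field names follow the PR description.
response = {
    "choices": [{"message": {"content": "hi"}}],
    "sglext": {"cached_tokens_details": {"device": 32, "host": 0}},
}

# Old access path (broken after the move): per-choice "sgl_ext"
old = response["choices"][0].get("sgl_ext")  # now None

# New access path: response-level "sglext" (no underscore)
new = response.get("sglext")

print(old)
print(new["cached_tokens_details"]["device"])
```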
…xt` and Prometheus metrics (sgl-project#17648) Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com> Co-authored-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com>
Motivation
When using HiCache, users need visibility into which cache level served their cached tokens. This is critical for debugging and monitoring (especially when HiCache uses external cache storage, like MoonCake).
Previously, only the total `cached_tokens` was reported in usage and metrics, without a breakdown by source.

Modifications
OpenAI API Response (`sglext`)

Added `cached_tokens_details` to the `sglext` extension object (returned per-choice when requested). This maintains OpenAI API compatibility by keeping extensions separate.

To receive the detailed breakdown, include `return_cached_tokens_details: true` in the request (similar to `return_routed_experts`).

Fields in `cached_tokens_details`:
- `device` - Tokens from the GPU KV cache (L1)
- `host` - Tokens from the CPU memory cache (L2)
- `storage` - Tokens from the storage backend (L3); only present when L3 is enabled
- `storage_backend` - Backend type (e.g., "HiCacheFile"); only present when L3 is enabled

Files changed:
- `python/sglang/srt/entrypoints/openai/protocol.py` - Added `CachedTokensDetails` model, `PromptTokensDetails` model, `return_cached_tokens_details` request field, `cached_tokens_details` in `SglExt`
- `python/sglang/srt/entrypoints/openai/utils.py` - Added `process_cached_tokens_details_from_ret()` helper
- `python/sglang/srt/entrypoints/openai/serving_chat.py` - Return `cached_tokens_details` in `sgl_ext`
- `python/sglang/srt/entrypoints/openai/serving_completions.py` - Return `cached_tokens_details` in `sgl_ext`
- `python/sglang/srt/entrypoints/openai/usage_processor.py` - Use the `PromptTokensDetails` model for type safety
- `python/sglang/srt/managers/scheduler_output_processor_mixin.py` - Build the breakdown from request state
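A small sketch of how a client might read the per-tier breakdown out of a response dict. The exact response shape is assumed from the fields listed above, and `summarize_cache_hits` is a hypothetical helper, not part of the PR:

```python
# Sketch only: extracting the per-tier cache breakdown from a response
# dict. The shape of "sglext" is assumed from the PR description.

def summarize_cache_hits(response: dict) -> dict:
    """Return per-tier cached-token counts, defaulting absent tiers to 0."""
    details = (response.get("sglext") or {}).get("cached_tokens_details") or {}
    return {
        "device": details.get("device", 0),    # L1: GPU KV cache
        "host": details.get("host", 0),        # L2: CPU memory cache
        "storage": details.get("storage", 0),  # L3: storage backend
        "backend": details.get("storage_backend"),
    }

# Example: an L3 hit served entirely from the file backend
resp = {
    "sglext": {
        "cached_tokens_details": {
            "device": 0,
            "host": 0,
            "storage": 128,
            "storage_backend": "HiCacheFile",
        }
    }
}
print(summarize_cache_hits(resp))
```

Defaulting absent tiers to 0 mirrors the description above, where the `storage` fields only appear when L3 is enabled.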
Prometheus Metrics

Extended the `sglang:cached_tokens_total` counter with a `cache_source` label:
- `cache_source="device"` - L1 hits
- `cache_source="host"` - L2 hits
- `cache_source="storage_{backend}"` - L3 hits (e.g., `storage_HiCacheFile`)

Files changed:
- `python/sglang/srt/metrics/collector.py` - Report tokens by source
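The labelling scheme above can be emulated in plain Python to show how one logical counter fans out by `cache_source`. This is only an illustration; the real implementation lives in `python/sglang/srt/metrics/collector.py` and uses a Prometheus counter, and `record_cache_hits` is a hypothetical helper name:

```python
from collections import Counter

# Plain-Python stand-in for the labelled Prometheus counter
# sglang:cached_tokens_total, keyed here by the cache_source label.
cached_tokens_total = Counter()

def record_cache_hits(device, host, storage, backend=None):
    cached_tokens_total["device"] += device  # cache_source="device" (L1)
    cached_tokens_total["host"] += host      # cache_source="host" (L2)
    if backend is not None:
        # cache_source="storage_{backend}" (L3), e.g. storage_HiCacheFile
        cached_tokens_total[f"storage_{backend}"] += storage

record_cache_hits(device=512, host=128, storage=64, backend="HiCacheFile")
record_cache_hits(device=256, host=0, storage=0)
print(dict(cached_tokens_total))
```

Keeping one counter with a label, rather than three separate counters, lets PromQL sum or slice hits by source with a single metric name.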
Storage Hit Tracking

Added proper tracking of tokens loaded from L3 storage during prefetch:
Files changed:
- `python/sglang/srt/managers/schedule_batch.py` - Added `storage_hit_length` and `_cache_breakdown_computed` fields
- `python/sglang/srt/mem_cache/hiradix_cache.py` - Track prefetched tokens per request via `prefetch_loaded_tokens_by_reqid`
- `python/sglang/srt/managers/scheduler.py` - Pop the storage hit count after prefetch completion
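A sketch of the per-request prefetch bookkeeping described above: tokens loaded from L3 storage accumulate per request id and are popped exactly once when the scheduler consumes them. The class and method names here are illustrative, not the actual HiRadixCache API; only the dict name `prefetch_loaded_tokens_by_reqid` comes from the PR:

```python
class PrefetchTracker:
    """Illustrative stand-in for the cache-side prefetch bookkeeping."""

    def __init__(self):
        self.prefetch_loaded_tokens_by_reqid = {}

    def on_pages_loaded(self, rid, num_tokens):
        # Called as pages arrive from the storage backend during prefetch.
        self.prefetch_loaded_tokens_by_reqid[rid] = (
            self.prefetch_loaded_tokens_by_reqid.get(rid, 0) + num_tokens
        )

    def pop_storage_hit_length(self, rid):
        # Popping (not reading) ensures each request's storage hits are
        # reported exactly once after prefetch completes.
        return self.prefetch_loaded_tokens_by_reqid.pop(rid, 0)

tracker = PrefetchTracker()
tracker.on_pages_loaded("req-1", 64)
tracker.on_pages_loaded("req-1", 64)
print(tracker.pop_storage_hit_length("req-1"))  # accumulated total
print(tracker.pop_storage_hit_length("req-1"))  # 0 on a second pop
```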
Accuracy Tests

N/A - This change only affects metadata reporting, not model outputs.
Benchmarking and Profiling
N/A - No impact on inference speed.
Example Outputs
Request with cache details:
```json
{ "messages": [...], "return_cached_tokens_details": true }
```
Response - L1 (device) cache hit:
Response - L3 (storage) cache hit after server restart:
Prometheus metrics: