[Observability]: Fix vllm cached and prompt tokens #2576

Merged
kobe0938 merged 3 commits into LMCache:dev from sammshen:fix-vllm-hit-prom-metrics on Feb 22, 2026
Conversation

@sammshen (Contributor) commented Feb 9, 2026

FIX #2574

vLLM cached tokens and prompt tokens were previously only recorded for requests with can_load=True (i.e., LMCache cached tokens >= vLLM cached tokens), so requests with a negative need-to-load never showed up in the metrics.
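
A minimal sketch of that old gate, under the assumption in the parenthesis above (variable names are illustrative only, not the adapter's actual fields):

# Illustrative only: metric updates used to be skipped whenever LMCache
# covered fewer prompt tokens than vLLM's own prefix cache.
def old_metrics_recorded(lmcache_cached_tokens: int, vllm_cached_tokens: int) -> bool:
    can_load = lmcache_cached_tokens >= vllm_cached_tokens
    return can_load  # False => hit/prompt token counters never updated

print(old_metrics_recorded(64, 80))  # False, as in the reproduction below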

Reproduction: vLLM with prefix caching

# choose a chunk size larger than the vLLM block size (16), otherwise need-to-load can never go negative
PROMETHEUS_MULTIPROC_DIR=/tmp/lmcache_prometheus \
LMCACHE_CHUNK_SIZE=32 \
vllm serve Qwen/Qwen3-8B \
    --port 8000 --kv-transfer-config \
    '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
curl -s http://localhost:8000/metrics | grep "num_vllm_hit_tokens_total"
# clear previously exported multiprocess metrics
rm -rf /tmp/lmcache_prometheus/*
# NEGATIVE NEED TO LOAD
# TEXT len is 27
# (27 * 3) % 32 = 17 > 16
MULT=3
TEXT="Qwen3 is the latest generation of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts"

# repeat TEXT MULT times (printf reuses the format string once per argument from seq)
PROMPT=$(printf "%0.s$TEXT" $(seq 1 $MULT))

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"Qwen/Qwen3-8B\",
    \"prompt\": \"$PROMPT\",
    \"max_tokens\": 100,
    \"temperature\": 0.7
  }"

Logs:

(EngineCore_DP0 pid=568279) [2026-02-09 22:55:49,661] LMCache INFO: Reqid: cmpl-3ee2fc427a6a47a285084e63d76ddbae-0, Total tokens 81, LMCache hit tokens: 64, need to load: -16 (vllm_v1_adapter.py:1285:lmcache.integration.vllm.vllm_v1_adapter)
# HELP lmcache:num_vllm_hit_tokens_total Number of hit tokens in vllm
# TYPE lmcache:num_vllm_hit_tokens_total counter
lmcache:num_vllm_hit_tokens_total{model_name="Qwen/Qwen3-8B",role="worker",served_model_name="Qwen/Qwen3-8B",worker_id="0"} 80.0
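
For reference, here is how the numbers above fit together, assuming both caches count only whole chunks/blocks of the prompt (plain arithmetic, not LMCache code):

prompt_tokens = 81            # "Total tokens 81" in the log line
vllm_block_size = 16
lmcache_chunk_size = 32       # LMCACHE_CHUNK_SIZE from the reproduction

# Largest whole-block / whole-chunk prefixes of the prompt:
vllm_cached_tokens = (prompt_tokens // vllm_block_size) * vllm_block_size        # 80
lmcache_hit_tokens = (prompt_tokens // lmcache_chunk_size) * lmcache_chunk_size  # 64

need_to_load = lmcache_hit_tokens - vllm_cached_tokens  # 64 - 80 = -16

# need_to_load < 0 means can_load is False, so before this fix the
# lmcache:num_vllm_hit_tokens_total counter (80.0 above) was never incremented.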

Samuel Shen added 2 commits February 9, 2026 20:39
Signed-off-by: Samuel Shen <slshen@uchciago.edu>
Signed-off-by: Samuel Shen <slshen@uchciago.edu>
@gemini-code-assist (bot) commented:

Summary of Changes

Hello @sammshen, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves an observability gap in the vLLM integration where metrics for cached and prompt tokens were only being captured for requests explicitly marked for loading. The changes ensure that these crucial performance metrics are consistently reported for all requests, significantly enhancing the accuracy and completeness of monitoring data within the system.

Highlights

  • Metric Reporting Consistency: Ensured that load_spec is always retained in ReqMeta creation, even when can_load is false, to facilitate consistent metric reporting for all requests.
  • Expanded Metric Collection: Modified metric collection logic to update vllm_hit_tokens and prompt_tokens for all requests possessing a load_spec, irrespective of their can_load status, improving observability.
  • Adjusted Request Processing: Adjusted request iteration conditions to correctly handle requests where load_spec is present but can_load is false, aligning with the new metric collection strategy.


Changelog
  • lmcache/integration/vllm/vllm_v1_adapter.py (a condensed sketch of the combined changes follows after this list)
    • Removed the logic that would set load_spec to None if can_load was false, ensuring load_spec is always passed for metric purposes.
    • Moved the _stats_monitor.update_interval_vllm_hit_tokens and _stats_monitor.update_interval_prompt_tokens calls to a broader scope, allowing them to update metrics for any request with a load_spec.
    • Updated the conditional check in request processing loops from request.load_spec is None to request.load_spec is None or not request.load_spec.can_load to align with the new metric collection strategy.
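
Taken together, the Changelog items above amount to roughly the following shape. This is a hypothetical, self-contained sketch: only load_spec, can_load, vllm_cached_tokens, token_ids, ReqMeta, and the two update_interval_* monitor calls come from the PR itself; every other name and the class definitions are assumptions made for illustration, not the adapter's real code.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class LoadSpec:
    vllm_cached_tokens: int
    can_load: bool

@dataclass
class ReqMeta:
    token_ids: List[int]
    load_spec: Optional[LoadSpec]

class StatsMonitor:
    """Stand-in for the adapter's stats monitor (assumption)."""
    def __init__(self) -> None:
        self.vllm_hit_tokens = 0
        self.prompt_tokens = 0

    def update_interval_vllm_hit_tokens(self, n: int) -> None:
        self.vllm_hit_tokens += n

    def update_interval_prompt_tokens(self, n: int) -> None:
        self.prompt_tokens += n

def make_req_meta(token_ids: List[int], load_spec: Optional[LoadSpec]) -> ReqMeta:
    # Old behavior (removed by this PR): discard the spec when it cannot load,
    # which also hid vllm_cached_tokens from the metric code.
    #   if load_spec is not None and not load_spec.can_load:
    #       load_spec = None
    return ReqMeta(token_ids=token_ids, load_spec=load_spec)

def record_and_load(requests: List[ReqMeta], monitor: StatsMonitor) -> None:
    for request in requests:
        # Metrics are now updated for every request that carries a load_spec,
        # even when can_load is False (negative need-to-load).
        if request.load_spec is not None:
            monitor.update_interval_vllm_hit_tokens(request.load_spec.vllm_cached_tokens)
            monitor.update_interval_prompt_tokens(len(request.token_ids))
        # Loading itself is still skipped for requests that cannot load.
        if request.load_spec is None or not request.load_spec.can_load:
            continue
        # ... actual KV loading happens here in the real adapter ...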

@gemini-code-assist (bot) left a comment:

Code Review

This pull request fixes an observability issue in the vLLM integration, ensuring that cached and prompt tokens are correctly accounted for even when can_load=False. No vulnerabilities or critical issues were found, and the changes appear to be well-implemented and adhere to best practices. The fix correctly preserves the load_spec and updates metric collection logic. A suggestion has been made to refactor a section of the code to improve efficiency by consolidating two loops into a single pass.

Comment on lines 771 to 787 of lmcache/integration/vllm/vllm_v1_adapter.py:

 for idx, request in enumerate(metadata.requests):
-    if request.load_spec is None:
+    if request.load_spec is None or not request.load_spec.can_load:
         continue
     last_idx = idx

 for idx, request in enumerate(metadata.requests):
-    if request.load_spec is None:
+    # Update metrics for all requests that have a load_spec
+    if request.load_spec is not None:
+        self._stats_monitor.update_interval_vllm_hit_tokens(
+            request.load_spec.vllm_cached_tokens
+        )
+        self._stats_monitor.update_interval_prompt_tokens(
+            len(request.token_ids)
+        )
+
+    if request.load_spec is None or not request.load_spec.can_load:
         continue
@gemini-code-assist (bot) commented (severity: medium):

While the logic is correct, iterating over metadata.requests twice is inefficient. The first loop finds last_idx, and the second loop updates metrics and then processes loadable requests. This can be optimized into a single pass over metadata.requests to collect loadable requests and update metrics, followed by a loop over only the loadable requests. This would improve performance, especially when metadata.requests is large.

Suggested change

-for idx, request in enumerate(metadata.requests):
-    if request.load_spec is None or not request.load_spec.can_load:
-        continue
-    last_idx = idx
-
-for idx, request in enumerate(metadata.requests):
-    # Update metrics for all requests that have a load_spec
-    if request.load_spec is not None:
-        self._stats_monitor.update_interval_vllm_hit_tokens(
-            request.load_spec.vllm_cached_tokens
-        )
-        self._stats_monitor.update_interval_prompt_tokens(
-            len(request.token_ids)
-        )
-
-    if request.load_spec is None or not request.load_spec.can_load:
-        continue
+loadable_requests = []
+for idx, request in enumerate(metadata.requests):
+    # Update metrics for all requests that have a load_spec
+    if request.load_spec is not None:
+        self._stats_monitor.update_interval_vllm_hit_tokens(
+            request.load_spec.vllm_cached_tokens
+        )
+        self._stats_monitor.update_interval_prompt_tokens(
+            len(request.token_ids)
+        )
+    if request.load_spec and request.load_spec.can_load:
+        loadable_requests.append((idx, request))
+if not loadable_requests:
+    return
+last_idx = loadable_requests[-1][0]
+for idx, request in loadable_requests:

@sammshen sammshen added the full Run comprehensive tests on this PR label Feb 9, 2026
@sammshen sammshen requested review from ApostaC and kobe0938 February 9, 2026 22:57
@ApostaC (Contributor) left a comment:

LGTM! Do we want to add the Prometheus metrics curl request to the comprehensive test so that the behavior can be secured?

@deng451e (Collaborator) left a comment:

LGTM

@kobe0938 kobe0938 enabled auto-merge (squash) February 20, 2026 22:40
@maobaolong (Collaborator) left a comment:

lgtm

@kobe0938 kobe0938 merged commit dcd2e4f into LMCache:dev Feb 22, 2026
24 checks passed
sammshen added a commit to sammshen/LMCache that referenced this pull request Mar 1, 2026

* add hacky fix
* more elegant fix

Signed-off-by: Samuel Shen <slshen@uchciago.edu>
Co-authored-by: Samuel Shen <slshen@uchciago.edu>

Commits referencing this pull request were also pushed to hlin99/LMCache (Mar 2, 2026), mauryaavinash95/LMCache (Mar 7, 2026), shaoxiawjc/LMCache (Mar 11, 2026), and realAaronWu/LMCache (Mar 20, 2026).

Labels

full Run comprehensive tests on this PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] num_vllm_hit_tokens_total is 0 despite high need_to_load negative in logs

5 participants