[Observability]: Fix vllm cached and prompt tokens #2576

Merged
kobe0938 merged 3 commits into LMCache:dev from sammshen:fix-vllm-hit-prom-metrics on Feb 22, 2026
Conversation

@sammshen (Contributor) commented Feb 9, 2026

FIX #2574

vLLM cached tokens and prompt tokens were previously only recorded for requests with can_load=True (i.e., LMCache cached tokens >= vLLM cached tokens), so requests with a negative need-to-load never showed up in the metrics.
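
A minimal sketch of that old gate, under the assumption in the parenthesis above (variable names are illustrative only, not the adapter's actual fields):

# Illustrative only: metric updates used to be skipped whenever LMCache
# covered fewer prompt tokens than vLLM's own prefix cache.
def old_metrics_recorded(lmcache_cached_tokens: int, vllm_cached_tokens: int) -> bool:
    can_load = lmcache_cached_tokens >= vllm_cached_tokens
    return can_load  # False => hit/prompt token counters never updated

print(old_metrics_recorded(64, 80))  # False, as in the reproduction below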

Reproduction: vLLM with prefix caching

# choose a chunk size larger than the vLLM block size (16), otherwise need-to-load can never go negative
PROMETHEUS_MULTIPROC_DIR=/tmp/lmcache_prometheus \
LMCACHE_CHUNK_SIZE=32 \
vllm serve Qwen/Qwen3-8B \
    --port 8000 --kv-transfer-config \
    '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
curl -s http://localhost:8000/metrics | grep "num_vllm_hit_tokens_total"
# clear previously exported multiprocess metrics
rm -rf /tmp/lmcache_prometheus/*
# NEGATIVE NEED TO LOAD
# TEXT len is 27
# (27 * 3) % 32 = 17 > 16
MULT=3
TEXT="Qwen3 is the latest generation of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts"

# repeat TEXT MULT times (printf reuses the format string once per argument from seq)
PROMPT=$(printf "%0.s$TEXT" $(seq 1 $MULT))

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"Qwen/Qwen3-8B\",
    \"prompt\": \"$PROMPT\",
    \"max_tokens\": 100,
    \"temperature\": 0.7
  }"

Logs:

(EngineCore_DP0 pid=568279) [2026-02-09 22:55:49,661] LMCache INFO: Reqid: cmpl-3ee2fc427a6a47a285084e63d76ddbae-0, Total tokens 81, LMCache hit tokens: 64, need to load: -16 (vllm_v1_adapter.py:1285:lmcache.integration.vllm.vllm_v1_adapter)
# HELP lmcache:num_vllm_hit_tokens_total Number of hit tokens in vllm
# TYPE lmcache:num_vllm_hit_tokens_total counter
lmcache:num_vllm_hit_tokens_total{model_name="Qwen/Qwen3-8B",role="worker",served_model_name="Qwen/Qwen3-8B",worker_id="0"} 80.0
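
For reference, here is how the numbers above fit together, assuming both caches count only whole chunks/blocks of the prompt (plain arithmetic, not LMCache code):

prompt_tokens = 81            # "Total tokens 81" in the log line
vllm_block_size = 16
lmcache_chunk_size = 32       # LMCACHE_CHUNK_SIZE from the reproduction

# Largest whole-block / whole-chunk prefixes of the prompt:
vllm_cached_tokens = (prompt_tokens // vllm_block_size) * vllm_block_size        # 80
lmcache_hit_tokens = (prompt_tokens // lmcache_chunk_size) * lmcache_chunk_size  # 64

need_to_load = lmcache_hit_tokens - vllm_cached_tokens  # 64 - 80 = -16

# need_to_load < 0 means can_load is False, so before this fix the
# lmcache:num_vllm_hit_tokens_total counter (80.0 above) was never incremented.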

Samuel Shen added 2 commits February 9, 2026 20:39
Signed-off-by: Samuel Shen <slshen@uchciago.edu>
Signed-off-by: Samuel Shen <slshen@uchciago.edu>
@gemini-code-assist (bot) commented:

Summary of Changes

Hello @sammshen, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves an observability gap in the vLLM integration where metrics for cached and prompt tokens were only being captured for requests explicitly marked for loading. The changes ensure that these crucial performance metrics are consistently reported for all requests, significantly enhancing the accuracy and completeness of monitoring data within the system.

Highlights

  • Metric Reporting Consistency: Ensured that load_spec is always retained in ReqMeta creation, even when can_load is false, to facilitate consistent metric reporting for all requests.
  • Expanded Metric Collection: Modified metric collection logic to update vllm_hit_tokens and prompt_tokens for all requests possessing a load_spec, irrespective of their can_load status, improving observability.
  • Adjusted Request Processing: Adjusted request iteration conditions to correctly handle requests where load_spec is present but can_load is false, aligning with the new metric collection strategy.


Changelog
  • lmcache/integration/vllm/vllm_v1_adapter.py (a condensed sketch of the combined changes follows after this list)
    • Removed the logic that would set load_spec to None if can_load was false, ensuring load_spec is always passed for metric purposes.
    • Moved the _stats_monitor.update_interval_vllm_hit_tokens and _stats_monitor.update_interval_prompt_tokens calls to a broader scope, allowing them to update metrics for any request with a load_spec.
    • Updated the conditional check in request processing loops from request.load_spec is None to request.load_spec is None or not request.load_spec.can_load to align with the new metric collection strategy.
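
Taken together, the Changelog items above amount to roughly the following shape. This is a hypothetical, self-contained sketch: only load_spec, can_load, vllm_cached_tokens, token_ids, ReqMeta, and the two update_interval_* monitor calls come from the PR itself; every other name and the class definitions are assumptions made for illustration, not the adapter's real code.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class LoadSpec:
    vllm_cached_tokens: int
    can_load: bool

@dataclass
class ReqMeta:
    token_ids: List[int]
    load_spec: Optional[LoadSpec]

class StatsMonitor:
    """Stand-in for the adapter's stats monitor (assumption)."""
    def __init__(self) -> None:
        self.vllm_hit_tokens = 0
        self.prompt_tokens = 0

    def update_interval_vllm_hit_tokens(self, n: int) -> None:
        self.vllm_hit_tokens += n

    def update_interval_prompt_tokens(self, n: int) -> None:
        self.prompt_tokens += n

def make_req_meta(token_ids: List[int], load_spec: Optional[LoadSpec]) -> ReqMeta:
    # Old behavior (removed by this PR): discard the spec when it cannot load,
    # which also hid vllm_cached_tokens from the metric code.
    #   if load_spec is not None and not load_spec.can_load:
    #       load_spec = None
    return ReqMeta(token_ids=token_ids, load_spec=load_spec)

def record_and_load(requests: List[ReqMeta], monitor: StatsMonitor) -> None:
    for request in requests:
        # Metrics are now updated for every request that carries a load_spec,
        # even when can_load is False (negative need-to-load).
        if request.load_spec is not None:
            monitor.update_interval_vllm_hit_tokens(request.load_spec.vllm_cached_tokens)
            monitor.update_interval_prompt_tokens(len(request.token_ids))
        # Loading itself is still skipped for requests that cannot load.
        if request.load_spec is None or not request.load_spec.can_load:
            continue
        # ... actual KV loading happens here in the real adapter ...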

@gemini-code-assist (bot) left a comment:

Code Review

This pull request fixes an observability issue in the vLLM integration, ensuring that cached and prompt tokens are correctly accounted for even when can_load=False. No vulnerabilities or critical issues were found, and the changes appear to be well-implemented and adhere to best practices. The fix correctly preserves the load_spec and updates metric collection logic. A suggestion has been made to refactor a section of the code to improve efficiency by consolidating two loops into a single pass.

Comment on lines 771 to 787 of lmcache/integration/vllm/vllm_v1_adapter.py:

 for idx, request in enumerate(metadata.requests):
-    if request.load_spec is None:
+    if request.load_spec is None or not request.load_spec.can_load:
         continue
     last_idx = idx

 for idx, request in enumerate(metadata.requests):
-    if request.load_spec is None:
+    # Update metrics for all requests that have a load_spec
+    if request.load_spec is not None:
+        self._stats_monitor.update_interval_vllm_hit_tokens(
+            request.load_spec.vllm_cached_tokens
+        )
+        self._stats_monitor.update_interval_prompt_tokens(
+            len(request.token_ids)
+        )
+
+    if request.load_spec is None or not request.load_spec.can_load:
         continue
@gemini-code-assist (bot) commented (severity: medium):

While the logic is correct, iterating over metadata.requests twice is inefficient. The first loop finds last_idx, and the second loop updates metrics and then processes loadable requests. This can be optimized into a single pass over metadata.requests to collect loadable requests and update metrics, followed by a loop over only the loadable requests. This would improve performance, especially when metadata.requests is large.

Suggested change

-for idx, request in enumerate(metadata.requests):
-    if request.load_spec is None or not request.load_spec.can_load:
-        continue
-    last_idx = idx
-
-for idx, request in enumerate(metadata.requests):
-    # Update metrics for all requests that have a load_spec
-    if request.load_spec is not None:
-        self._stats_monitor.update_interval_vllm_hit_tokens(
-            request.load_spec.vllm_cached_tokens
-        )
-        self._stats_monitor.update_interval_prompt_tokens(
-            len(request.token_ids)
-        )
-
-    if request.load_spec is None or not request.load_spec.can_load:
-        continue
+loadable_requests = []
+for idx, request in enumerate(metadata.requests):
+    # Update metrics for all requests that have a load_spec
+    if request.load_spec is not None:
+        self._stats_monitor.update_interval_vllm_hit_tokens(
+            request.load_spec.vllm_cached_tokens
+        )
+        self._stats_monitor.update_interval_prompt_tokens(
+            len(request.token_ids)
+        )
+    if request.load_spec and request.load_spec.can_load:
+        loadable_requests.append((idx, request))
+if not loadable_requests:
+    return
+last_idx = loadable_requests[-1][0]
+for idx, request in loadable_requests:

@sammshen sammshen added the full Run comprehensive tests on this PR label Feb 9, 2026
@sammshen sammshen requested review from ApostaC and kobe0938 February 9, 2026 22:57
@ApostaC (Contributor) left a comment:

LGTM! Do we want to add the Prometheus metrics curl request to the comprehensive test so that the behavior can be secured?

@deng451e (Collaborator) left a comment:

LGTM

@kobe0938 kobe0938 enabled auto-merge (squash) February 20, 2026 22:40
@maobaolong (Collaborator) left a comment:

lgtm

@kobe0938 kobe0938 merged commit dcd2e4f into LMCache:dev Feb 22, 2026
24 checks passed
sammshen added a commit to sammshen/LMCache that referenced this pull request Mar 1, 2026

* add hacky fix
* more elegant fix

Signed-off-by: Samuel Shen <slshen@uchciago.edu>
Co-authored-by: Samuel Shen <slshen@uchciago.edu>

Commits referencing this pull request were also pushed to hlin99/LMCache (Mar 2, 2026), mauryaavinash95/LMCache (Mar 7, 2026), shaoxiawjc/LMCache (Mar 11, 2026), and realAaronWu/LMCache (Mar 20, 2026).

Labels

full Run comprehensive tests on this PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] num_vllm_hit_tokens_total is 0 despite high need_to_load negative in logs

5 participants