
fix: save_decode_cache is broken on vllm >= 0.9.2 #2821

Closed
hlin99 wants to merge 6 commits into LMCache:dev from hlin99:ww12_PR_save_decode_cache

Conversation

@hlin99
Contributor

@hlin99 hlin99 commented Mar 19, 2026

In newer vLLM versions (>= 0.9.2), request.num_computed_tokens is already updated before build_connector_meta is called. This causes the slice request.all_token_ids[num_computed_tokens:...] to exceed the length of all_token_ids during decode, silently returning an empty list instead of the new token.

As a result, tracker.token_ids never grows, input_token_len stays at prompt length, skip_save is always True, and KV cache is never saved during decode phase.

Fix by using len(request_tracker.token_ids) as the slice base, which is only updated after the slice and is always in sync with LMCache's own state.


Note

Medium Risk
Adjusts how new tokens are sliced during decode/cached-request scheduling; incorrect indexing could still impact cache save behavior, but the change is localized to metadata construction.

Overview
Fixes build_connector_meta token slicing for scheduled cached requests by basing the slice start on LMCache’s tracked token count (and taking min(tracker_len, request.num_computed_tokens)), instead of relying solely on request.num_computed_tokens.

This prevents empty new_token_ids slices on vLLM >= 0.9.2 (and remains safer under preemption/rollback), allowing RequestTracker.token_ids to grow correctly so decode-phase KV cache can be saved as intended.
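
A minimal sketch of the slicing described above, assuming the variable names that appear in the adapter excerpts later in this thread (simplified, not the exact patch):

    num_current_tokens = request.num_computed_tokens
    tracker_len = len(request_tracker.token_ids)
    # Take the smaller of the two counts: on vLLM >= 0.9.2 the scheduler may have
    # already advanced num_computed_tokens past the end of all_token_ids, while
    # after preemption/rollback it can be smaller than the tracker length.
    slice_base = min(tracker_len, num_current_tokens)
    new_token_ids = request.all_token_ids[slice_base : slice_base + num_new_tokens]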

Written by Cursor Bugbot for commit c608c66.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical bug affecting the save_decode_cache functionality in vLLM versions 0.9.2 and newer. The issue stemmed from an out-of-sync num_computed_tokens value, which led to incorrect token ID slicing and prevented the Key-Value (KV) cache from being properly saved during the decode phase. The implemented fix ensures that the correct token IDs are always captured, restoring the intended caching behavior and improving the efficiency of long-running inference requests.

Highlights

  • Fix for vLLM >= 0.9.2: Resolved an issue where save_decode_cache was broken in vLLM versions 0.9.2 and above, preventing the KV cache from being saved during the decode phase.
  • Corrected Token ID Slicing: Addressed the problem of incorrect token ID slicing during the decode phase, which occurred because request.num_computed_tokens was prematurely updated in newer vLLM versions.
  • Improved Cache Synchronization: Implemented a fix by using len(request_tracker.token_ids) as the slice base, ensuring it remains in sync with the LMCache's state and correctly captures new token IDs.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request addresses an issue where save_decode_cache is broken in vLLM versions >= 0.9.2. The fix involves changing the base for slicing tokens, which appears to be a sound approach. I have one minor suggestion for improvement.

Comment thread lmcache/integration/vllm/vllm_v1_adapter.py Outdated
…ing in decode phase

In newer vLLM versions (>= 0.9.2), request.num_computed_tokens is
already updated before build_connector_meta is called. This causes
the slice request.all_token_ids[num_computed_tokens:...] to exceed
the length of all_token_ids during decode, silently returning an
empty list instead of the new token.

As a result, tracker.token_ids never grows, input_token_len stays
at prompt length, skip_save is always True, and KV cache is never
saved during decode phase.

Fix by using len(request_tracker.token_ids) as the slice base,
which is only updated after the slice and is always in sync with
LMCache's own state.

Signed-off-by: Tony Lin <tony.lin@intel.com>
@hlin99 hlin99 force-pushed the ww12_PR_save_decode_cache branch from 0d40b9b to 54cfe69 on March 19, 2026 06:23
# TODO: this is a dangerous reference to the request object inside vllm
if request := self._unfinished_requests.get(req_id):
    num_current_tokens = request.num_computed_tokens
    # Use tracker's token_ids length as the slice base instead of
Contributor

if it's not computed, how can we have saved the decode cache previously?

Contributor Author

It worked either on vLLM < 0.9.2, where vLLM uses the legacy way, or before the change of #2007.... In #2007 num_current_tokens was changed to come from request.num_computed_tokens; before that it came from len(request_tracker.token_ids). @sammshen do you still remember why we made that change in the PR?

if request := self._unfinished_requests.get(req_id):
    num_current_tokens = request.num_computed_tokens
    # Use tracker's token_ids length as the slice base instead of
    # request.num_computed_tokens, because in newer vLLM versions
Contributor

Why would num_computed_tokens be updated before the forward pass that computes them has finished?

Contributor Author

The vLLM flow is: engine -> schedule+++ -> call this LMCache function -> schedule--- -> model executor. So the forward pass has not been called yet, and this is the vLLM design flow, not a vLLM bug...

    # num_computed_tokens may already be updated before
    # build_connector_meta is called, causing the slice to be empty
    # during decode phase.
    tracker_len = len(request_tracker.token_ids)
Contributor

I think the upstream issue is whether we can keep the semantics of request_tracker.token_ids consistent (is it the number of computed tokens? If so, why would it be mismatched with num_computed_tokens?)

Contributor Author

Same as above... I think LMCache can't use num_computed_tokens to index token_ids; it's unsafe per vLLM's design.

@hlin99
Contributor Author

hlin99 commented Mar 23, 2026

Thanks for your input @sammshen. Your points make sense to me. Give me some time to deep dive into the issue; it's weird that it seems to occur only under some conditions.

@hlin99
Contributor Author

hlin99 commented Mar 24, 2026

Hi @sammshen, I cleaned up my env, and on a CUDA device I can easily repro the issue with the configs below (so it's a common issue across platforms).

LMCache: 3137ce0
vllm: 2e67fa756d05518823196815a30bc633031ddb60 (today's latest)

lmcache.yaml
local_cpu: True
max_local_cpu_size: 10
save_unfull_chunk: true
save_decode_cache: true

The root cause is that num_computed_tokens is not updated together with request.all_token_ids. num_computed_tokens is updated in the scheduler, while the forward pass has not yet updated request.all_token_ids when LMCache tries to get new_token_ids from the list. https://github.com/LMCache/LMCache/blob/dev/lmcache/integration/vllm/vllm_v1_adapter.py#L1536

The slice is out of bounds of the list at this point, so input_token_len doesn't get incremented here (https://github.com/LMCache/LMCache/blob/dev/lmcache/integration/vllm/vllm_v1_adapter.py#L317), and the request fails to hit the chunk boundary and get stored....
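
To illustrate the silent failure with made-up values (a standalone sketch, not the adapter code): Python slicing past the end of a list returns an empty list rather than raising, so nothing flags the problem.

    all_token_ids = [101, 102, 103]    # prompt tokens only; the new decode token is not appended yet
    num_computed_tokens = 4            # already advanced by the scheduler past the end of the list
    num_new_tokens = 1
    new_token_ids = all_token_ids[num_computed_tokens : num_computed_tokens + num_new_tokens]
    print(new_token_ids)               # [] -- empty, so request_tracker.update() has nothing to add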

Let me know if you have any other questions. Thanks.

@hlin99 hlin99 requested a review from sammshen March 24, 2026 12:00
@sammshen
Contributor

Does this have to do with async scheduling (which is the default in the latest vLLM versions)?

@sammshen
Contributor

@hlin99 could you share your testing procedure?

@sammshen
Contributor

Would the correct thing to do be to update LMCache's request tracker to be num_computed_tokens?

@hlin99
Contributor Author

hlin99 commented Mar 31, 2026

@hlin99 could you share your testing procedure?

hi @sammshen

LMCache: 3137ce0
vllm: 2e67fa756d05518823196815a30bc633031ddb60 (today's latest)

lmcache.yaml
local_cpu: True
max_local_cpu_size: 10
save_unfull_chunk: true
save_decode_cache: true

server launch cmd:
CUDA_VISIBLE_DEVICES=4 PYTHONHASHSEED=0 LMCACHE_CONFIG_FILE=lmcache.yaml vllm serve --model /workspace/Meta-Llama-3-8B-Instruct/ --dtype bfloat16 --gpu-memory-utilization 0.95 --max-num-seqs 512 --max-model-len 4096 --enforce-eager --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both","kv_connector_extra_config": {"discard_partial_chunks": false, "lmcache_rpc_port": "producer1"}}'

vllm bench:
vllm bench serve --host 127.0.0.1 --port 8000 --model /workspace/Meta-Llama-3-8B-Instruct/ --dataset-name random --random-input-len 100 --random-output-len 2000 --num-prompts 20 --burstiness 100 --request-rate 3.6

You'll see whether the decode cache save is triggered at the 256 chunk boundary with & without this PR.

@hlin99
Contributor Author

hlin99 commented Mar 31, 2026

does this have to do with async scheduling (which is the default in the latest vllm versions)

Actually not...
request.num_computed_tokens is updated before the new decode token is appended to request.all_token_ids.
So, at this code point, if LMCache uses request.num_computed_tokens to index request.all_token_ids, it must be out of bounds, and new_token_ids is actually empty.
Later on, request_tracker.update(new_token_ids, ...) (~line 1630) actually updates nothing, so the number of decode tokens never grows enough to hit a chunk boundary.

num_current_tokens = request.num_computed_tokens
new_token_ids = request.all_token_ids[
    num_current_tokens : num_current_tokens + num_new_tokens
]

@sammshen
Contributor

conceptually, consulting request_tracker instead of num_computed_tokens is right

@sammshen
Contributor

sammshen commented Mar 31, 2026

I think this solves the save decode cache case but regresses the preemption case. The correct approach is probably to take a min:

num_current_tokens = request.num_computed_tokens
tracker_len = len(request_tracker.token_ids)
slice_base = min(num_current_tokens, tracker_len)
new_token_ids = request.all_token_ids[
    slice_base : slice_base + num_new_tokens
]

Neither one should be smaller than we expect. I made the change earlier because there were some cases (e.g. preemption) where request.num_computed_tokens <= len(request_tracker.token_ids), but it seems from your PR the opposite can be true too.

@sammshen
Contributor

@cursor review

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.


Comment thread lmcache/integration/vllm/vllm_v1_adapter.py Outdated
Use the minimum of tracker length and computed tokens to ensure
consistency. This handles two main scenarios:
1. Normal execution where tracker_len is the current state.
2. Preemption cases where num_computed_tokens might be smaller
   due to state rollback.

Signed-off-by: Tony Lin <tony.lin@intel.com>
@hlin99
Contributor Author

hlin99 commented Apr 1, 2026

I think this solves the save decode cache case but regresses the preemption case. The correct approach is probably to take a min:

num_current_tokens = request.num_computed_tokens
tracker_len = len(request_tracker.token_ids)
slice_base = min(num_current_tokens, tracker_len)
new_token_ids = request.all_token_ids[
    slice_base : slice_base + num_new_tokens
]

Neither one should be smaller than we expect. I made the change earlier because there were some cases (e.g. preemption) where request.num_computed_tokens <= len(request_tracker.token_ids), but it seems from your PR the opposite can be true too.

Taking a min is a good idea! The fix has been updated. Please take another look, thanks. @sammshen

hlin99 added 2 commits April 1, 2026 12:52
Signed-off-by: Tony Lin <tony.lin@intel.com>
@sammshen sammshen mentioned this pull request Apr 1, 2026
@sammshen
Contributor

sammshen commented Apr 1, 2026

There are two bugs with save decode cache; this PR should fix both: #2929

thanks for the catch @hlin99

@hlin99
Contributor Author

hlin99 commented Apr 2, 2026

There are two bugs with save decode cache; this PR should fix both: #2929

thanks for the catch @hlin99

Nice to have both issues fixed! Thanks.

@hlin99 hlin99 closed this Apr 2, 2026
@hlin99 hlin99 deleted the ww12_PR_save_decode_cache branch April 25, 2026 05:29