
fix: save_decode_cache is broken on vllm >= 0.9.2 #2821

Closed
hlin99 wants to merge 6 commits into LMCache:dev from hlin99:ww12_PR_save_decode_cache

Conversation

@hlin99
Contributor

@hlin99 hlin99 commented Mar 19, 2026

In newer vLLM versions (>= 0.9.2), request.num_computed_tokens is already updated before build_connector_meta is called. This causes the slice request.all_token_ids[num_computed_tokens:...] to exceed the length of all_token_ids during decode, silently returning an empty list instead of the new token.

As a result, tracker.token_ids never grows, input_token_len stays at prompt length, skip_save is always True, and KV cache is never saved during decode phase.

Fix by using len(request_tracker.token_ids) as the slice base, which is only updated after the slice and is always in sync with LMCache's own state.


Note

Medium Risk
Adjusts how new tokens are sliced during decode/cached-request scheduling; incorrect indexing could still impact cache save behavior, but the change is localized to metadata construction.

Overview
Fixes build_connector_meta token slicing for scheduled cached requests by basing the slice start on LMCache’s tracked token count (and taking min(tracker_len, request.num_computed_tokens)), instead of relying solely on request.num_computed_tokens.

This prevents empty new_token_ids slices on vLLM >= 0.9.2 (and remains safer under preemption/rollback), allowing RequestTracker.token_ids to grow correctly so decode-phase KV cache can be saved as intended.
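
A minimal sketch of the slicing described above, assuming the variable names that appear in the adapter excerpts later in this thread (simplified, not the exact patch):

    num_current_tokens = request.num_computed_tokens
    tracker_len = len(request_tracker.token_ids)
    # Take the smaller of the two counts: on vLLM >= 0.9.2 the scheduler may have
    # already advanced num_computed_tokens past the end of all_token_ids, while
    # after preemption/rollback it can be smaller than the tracker length.
    slice_base = min(tracker_len, num_current_tokens)
    new_token_ids = request.all_token_ids[slice_base : slice_base + num_new_tokens]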

Written by Cursor Bugbot for commit c608c66.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical bug affecting the save_decode_cache functionality in vLLM versions 0.9.2 and newer. The issue stemmed from an out-of-sync num_computed_tokens value, which led to incorrect token ID slicing and prevented the Key-Value (KV) cache from being properly saved during the decode phase. The implemented fix ensures that the correct token IDs are always captured, restoring the intended caching behavior and improving the efficiency of long-running inference requests.

Highlights

  • Fix for vLLM >= 0.9.2: Resolved an issue where save_decode_cache was broken in vLLM versions 0.9.2 and above, preventing the KV cache from being saved during the decode phase.
  • Corrected Token ID Slicing: Addressed the problem of incorrect token ID slicing during the decode phase, which occurred because request.num_computed_tokens was prematurely updated in newer vLLM versions.
  • Improved Cache Synchronization: Implemented a fix by using len(request_tracker.token_ids) as the slice base, ensuring it remains in sync with the LMCache's state and correctly captures new token IDs.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request addresses an issue where save_decode_cache is broken in vLLM versions >= 0.9.2. The fix involves changing the base for slicing tokens, which appears to be a sound approach. I have one minor suggestion for improvement.

Comment thread lmcache/integration/vllm/vllm_v1_adapter.py Outdated
…ing in decode phase

In newer vLLM versions (>= 0.9.2), request.num_computed_tokens is
already updated before build_connector_meta is called. This causes
the slice request.all_token_ids[num_computed_tokens:...] to exceed
the length of all_token_ids during decode, silently returning an
empty list instead of the new token.

As a result, tracker.token_ids never grows, input_token_len stays
at prompt length, skip_save is always True, and KV cache is never
saved during decode phase.

Fix by using len(request_tracker.token_ids) as the slice base,
which is only updated after the slice and is always in sync with
LMCache's own state.

Signed-off-by: Tony Lin <tony.lin@intel.com>
@hlin99 hlin99 force-pushed the ww12_PR_save_decode_cache branch from 0d40b9b to 54cfe69 on March 19, 2026 06:23
# TODO: this is a dangerous reference to the request object inside vllm
if request := self._unfinished_requests.get(req_id):
    num_current_tokens = request.num_computed_tokens
    # Use tracker's token_ids length as the slice base instead of
Contributor

if it's not computed, how can we have saved the decode cache previously?

Contributor Author

It worked either on vLLM < 0.9.2, where vLLM uses the legacy way, or before the change of #2007.... In #2007 num_current_tokens was changed to come from request.num_computed_tokens; before that it came from len(request_tracker.token_ids). @sammshen do you still remember why we made that change in the PR?

if request := self._unfinished_requests.get(req_id):
    num_current_tokens = request.num_computed_tokens
    # Use tracker's token_ids length as the slice base instead of
    # request.num_computed_tokens, because in newer vLLM versions
Contributor

Why would num_computed_tokens be updated before the forward pass that computes them has finished?

Contributor Author

The vLLM flow is: engine -> schedule+++ -> call this LMCache function -> schedule--- -> model executor. So the forward pass has not been called yet, and this is the vLLM design flow, not a vLLM bug...

    # num_computed_tokens may already be updated before
    # build_connector_meta is called, causing the slice to be empty
    # during decode phase.
    tracker_len = len(request_tracker.token_ids)
Contributor

I think the upstream issue is whether we can keep the semantics of request_tracker.token_ids consistent (is it the number of computed tokens? If so, why would it be mismatched with num_computed_tokens?)

Contributor Author

Same as above... I think LMCache can't use num_computed_tokens to index token_ids; it's unsafe per vLLM's design.

@hlin99
Contributor Author

hlin99 commented Mar 23, 2026

Thanks for your input @sammshen. Your points make sense to me. Give me some time to deep dive into the issue; it's weird that it seems to occur only under some conditions.

@hlin99
Contributor Author

hlin99 commented Mar 24, 2026

Hi @sammshen, I cleaned up my env, and on a CUDA device I can easily repro the issue with the configs below (so it's a common issue across platforms).

LMCache: 3137ce0
vllm: 2e67fa756d05518823196815a30bc633031ddb60 (today's latest)

lmcache.yaml
local_cpu: True
max_local_cpu_size: 10
save_unfull_chunk: true
save_decode_cache: true

The root cause is that num_computed_tokens is not updated together with request.all_token_ids. num_computed_tokens is updated in the scheduler, while the forward pass has not yet updated request.all_token_ids when LMCache tries to get new_token_ids from the list. https://github.com/LMCache/LMCache/blob/dev/lmcache/integration/vllm/vllm_v1_adapter.py#L1536

The slice is out of bounds of the list at this point, so input_token_len doesn't get incremented here (https://github.com/LMCache/LMCache/blob/dev/lmcache/integration/vllm/vllm_v1_adapter.py#L317), and the request fails to hit the chunk boundary and get stored....
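
To illustrate the silent failure with made-up values (a standalone sketch, not the adapter code): Python slicing past the end of a list returns an empty list rather than raising, so nothing flags the problem.

    all_token_ids = [101, 102, 103]    # prompt tokens only; the new decode token is not appended yet
    num_computed_tokens = 4            # already advanced by the scheduler past the end of the list
    num_new_tokens = 1
    new_token_ids = all_token_ids[num_computed_tokens : num_computed_tokens + num_new_tokens]
    print(new_token_ids)               # [] -- empty, so request_tracker.update() has nothing to add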

Let me know if you have any other questions. Thanks.

@hlin99 hlin99 requested a review from sammshen March 24, 2026 12:00
@sammshen
Contributor

Does this have to do with async scheduling (which is the default in the latest vLLM versions)?

@sammshen
Contributor

@hlin99 could you share your testing procedure?

@sammshen
Contributor

Would the correct thing to do be to update LMCache's request tracker to be num_computed_tokens?

@hlin99
Contributor Author

hlin99 commented Mar 31, 2026

@hlin99 could you share your testing procedure?

hi @sammshen

LMCache: 3137ce0
vllm: 2e67fa756d05518823196815a30bc633031ddb60 (today's latest)

lmcache.yaml
local_cpu: True
max_local_cpu_size: 10
save_unfull_chunk: true
save_decode_cache: true

server launch cmd:
CUDA_VISIBLE_DEVICES=4 PYTHONHASHSEED=0 LMCACHE_CONFIG_FILE=lmcache.yaml vllm serve --model /workspace/Meta-Llama-3-8B-Instruct/ --dtype bfloat16 --gpu-memory-utilization 0.95 --max-num-seqs 512 --max-model-len 4096 --enforce-eager --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both","kv_connector_extra_config": {"discard_partial_chunks": false, "lmcache_rpc_port": "producer1"}}'

vllm bench:
vllm bench serve --host 127.0.0.1 --port 8000 --model /workspace/Meta-Llama-3-8B-Instruct/ --dataset-name random --random-input-len 100 --random-output-len 2000 --num-prompts 20 --burstiness 100 --request-rate 3.6

You'll see whether the decode cache save is triggered at the 256 chunk boundary with & without this PR.

@hlin99
Contributor Author

hlin99 commented Mar 31, 2026

does this have to do with async scheduling (which is the default in the latest vllm versions)

Actually not...
request.num_computed_tokens is updated before the new decode token is appended to request.all_token_ids.
So, at this code point, if LMCache uses request.num_computed_tokens to index request.all_token_ids, it must be out of bounds, and new_token_ids is actually empty.
Later on, request_tracker.update(new_token_ids, ...) (~line 1630) actually updates nothing, so the number of decode tokens never grows enough to hit a chunk boundary.

num_current_tokens = request.num_computed_tokens
new_token_ids = request.all_token_ids[
    num_current_tokens : num_current_tokens + num_new_tokens
]

@sammshen
Contributor

conceptually, consulting request_tracker instead of num_computed_tokens is right

@sammshen
Contributor

sammshen commented Mar 31, 2026

I think this solves the save decode cache case but regresses the preemption case. The correct approach is probably to take a min:

num_current_tokens = request.num_computed_tokens
tracker_len = len(request_tracker.token_ids)
slice_base = min(num_current_tokens, tracker_len)
new_token_ids = request.all_token_ids[
    slice_base : slice_base + num_new_tokens
]

Neither one should be smaller than we expect. I made the change earlier because there were some cases (e.g. preemption) where request.num_computed_tokens <= len(request_tracker.token_ids), but it seems from your PR the opposite can be true too.

@sammshen
Contributor

@cursor review

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.


Comment thread lmcache/integration/vllm/vllm_v1_adapter.py Outdated
Use the minimum of tracker length and computed tokens to ensure
consistency. This handles two main scenarios:
1. Normal execution where tracker_len is the current state.
2. Preemption cases where num_computed_tokens might be smaller
   due to state rollback.

Signed-off-by: Tony Lin <tony.lin@intel.com>
@hlin99
Contributor Author

hlin99 commented Apr 1, 2026

I think this solves the save decode cache case but regresses the preemption case. The correct approach is probably to take a min:

num_current_tokens = request.num_computed_tokens
tracker_len = len(request_tracker.token_ids)
slice_base = min(num_current_tokens, tracker_len)
new_token_ids = request.all_token_ids[
    slice_base : slice_base + num_new_tokens
]

Neither one should be smaller than we expect. I made the change earlier because there were some cases (e.g. preemption) where request.num_computed_tokens <= len(request_tracker.token_ids), but it seems from your PR the opposite can be true too.

Taking a min is a good idea! The fix has been updated. Please take another look, thanks. @sammshen

hlin99 added 2 commits April 1, 2026 12:52
Signed-off-by: Tony Lin <tony.lin@intel.com>
@sammshen sammshen mentioned this pull request Apr 1, 2026
@sammshen
Contributor

sammshen commented Apr 1, 2026

There are two bugs with save decode cache; this PR should fix both: #2929

thanks for the catch @hlin99

@hlin99
Contributor Author

hlin99 commented Apr 2, 2026

There are two bugs with save decode cache; this PR should fix both: #2929

thanks for the catch @hlin99

Nice to have both issues fixed! Thanks.

@hlin99 hlin99 closed this Apr 2, 2026
@hlin99 hlin99 deleted the ww12_PR_save_decode_cache branch April 25, 2026 05:29