fix: save_decode_cache is broken on vllm >= 0.9.2 #2821
hlin99 wants to merge 6 commits into LMCache:dev
Conversation
Summary of Changes: Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed: this pull request resolves a critical bug affecting decode-phase KV cache saving on vLLM >= 0.9.2.
Code Review
This pull request addresses an issue where save_decode_cache is broken in vLLM versions >= 0.9.2. The fix involves changing the base for slicing tokens, which appears to be a sound approach. I have one minor suggestion for improvement.
…ing in decode phase

In newer vLLM versions (>= 0.9.2), request.num_computed_tokens is already updated before build_connector_meta is called. This causes the slice request.all_token_ids[num_computed_tokens:...] to exceed the length of all_token_ids during decode, silently returning an empty list instead of the new token. As a result, tracker.token_ids never grows, input_token_len stays at the prompt length, skip_save is always True, and the KV cache is never saved during the decode phase.

Fix by using len(request_tracker.token_ids) as the slice base, which is only updated after the slice and is always in sync with LMCache's own state.

Signed-off-by: Tony Lin <tony.lin@intel.com>
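A minimal, self-contained sketch of the failure mode and the fix described in this message. The class and field names mirror the adapter code discussed in this thread but are simplified stand-ins, and vLLM's exact off-by-one bookkeeping is glossed over; only the ordering and slice arithmetic matter:

```python
# Toy model of the decode-phase slicing bug (illustrative, not the real
# vLLM/LMCache classes).

class FakeRequest:
    def __init__(self):
        # Prompt (4 tokens) plus one token generated in the previous step.
        self.all_token_ids = [1, 2, 3, 4, 5]
        # On vLLM >= 0.9.2 the scheduler has already advanced this counter
        # past the end of all_token_ids when build_connector_meta runs.
        self.num_computed_tokens = 6

class FakeTracker:
    def __init__(self, seen):
        self.token_ids = list(seen)

req = FakeRequest()
tracker = FakeTracker([1, 2, 3, 4])  # LMCache has seen only the prompt so far
num_new_tokens = 1

# Old slice base: silently empty, so the tracker never grows and
# skip_save stays True forever.
old = req.all_token_ids[req.num_computed_tokens:
                        req.num_computed_tokens + num_new_tokens]
assert old == []

# Fixed slice base: LMCache's own tracker length, which only advances
# after the slice, so it can never run past all_token_ids.
base = len(tracker.token_ids)
new = req.all_token_ids[base:base + num_new_tokens]
assert new == [5]
tracker.token_ids.extend(new)  # the tracker now grows each decode step
```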
Force-pushed from 0d40b9b to 54cfe69 (Compare)
```python
# TODO: this is a dangerous reference to the request object inside vllm
if request := self._unfinished_requests.get(req_id):
    num_current_tokens = request.num_computed_tokens
    # Use tracker's token_ids length as the slice base instead of
```
If it's not computed, how could we have saved the decode cache previously?
It worked either on vLLM < 0.9.2, where vLLM uses the legacy flow, or before the change in #2007.... In #2007, num_current_tokens was changed to be taken from request.num_computed_tokens; before that it came from len(request_tracker.token_ids). @sammshen do you still remember why we made that change in that PR?
```python
if request := self._unfinished_requests.get(req_id):
    num_current_tokens = request.num_computed_tokens
    # Use tracker's token_ids length as the slice base instead of
    # request.num_computed_tokens, because in newer vLLM versions
```
Why would num_computed_tokens be updated before the forward pass that computes those tokens has finished?
The vLLM flow is: engine -> enter schedule -> call this LMCache function -> finish schedule -> model executor. So the forward pass has not been called yet at that point, and this is the vLLM design flow, not a vLLM bug...
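A toy sketch of that ordering (hypothetical names; it only models "the counter advances in the scheduler, the token list grows in the executor"):

```python
# Toy model of the scheduling order described above: the scheduler bumps
# num_computed_tokens and invokes the connector hook BEFORE the model
# executor appends the newly generated token to all_token_ids.

class Req:
    def __init__(self, prompt):
        self.all_token_ids = list(prompt)
        self.num_computed_tokens = len(prompt)

def scheduler_step(req, connector_hook):
    req.num_computed_tokens += 1   # scheduler counts the scheduled token
    connector_hook(req)            # build_connector_meta runs here
    # ...scheduling finishes, then the executor runs the forward pass...
    req.all_token_ids.append(42)   # the new token only appears afterwards

def build_connector_meta(req):
    # At this point the counter already leads the list:
    print(req.num_computed_tokens, len(req.all_token_ids))  # prints: 5 4

scheduler_step(Req([1, 2, 3, 4]), build_connector_meta)
```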
```python
# num_computed_tokens may already be updated before
# build_connector_meta is called, causing the slice to be empty
# during decode phase.
tracker_len = len(request_tracker.token_ids)
```
I think the upstream issue is: can we keep the request_tracker.token_ids semantics consistent? (Is it the number of computed tokens? If so, why would it be mismatched with num_computed_tokens?)
Same as above... I think LMCache can't take num_computed_tokens to index token_ids; it's unsafe per vLLM's design.
Thanks for your input @sammshen. Your points make sense to me; give me some time to deep-dive the issue. It's weird that it seems to occur only under certain conditions.
Hi @sammshen, I cleaned up my env, and on a CUDA device I can easily repro the issue with the configs below (so it's a common issue across platforms). LMCache: 3137ce0, lmcache.yaml.

The root cause is that num_computed_tokens is not updated together with request.all_token_ids: num_computed_tokens is updated in the scheduler, while the forward path has not yet updated request.all_token_ids when LMCache tries to get the new_token_ids from the list (https://github.com/LMCache/LMCache/blob/dev/lmcache/integration/vllm/vllm_v1_adapter.py#L1536). The slice is out of bounds at this point, so input_token_len never gets incremented here (https://github.com/LMCache/LMCache/blob/dev/lmcache/integration/vllm/vllm_v1_adapter.py#L317), and the request fails to hit the chunk boundary and get stored....

Let me know if you have any other questions, thanks.
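For reference, the failure is silent because Python list slices past the end return an empty list instead of raising:

```python
# Why there is no error, just a missing token: out-of-range slices on a
# Python list come back empty rather than raising IndexError.
all_token_ids = [1, 2, 3, 4, 5]
print(all_token_ids[7:8])   # [] -- a slice base past the end fails silently
print(all_token_ids[4:5])   # [5] -- a correct base still yields the new token
```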
Does this have to do with async scheduling (which is the default in the latest vLLM versions)?
@hlin99 could you share your testing procedure?
Would the correct thing to do be to update LMCache's request tracker to num_computed_tokens?
Hi @sammshen, LMCache: 3137ce0, lmcache.yaml, server launch cmd, vllm bench: you'll see whether decode save cache is triggered at the 256 chunk boundary with & without this PR.
Actually not...
Conceptually, consulting request_tracker instead of num_computed_tokens is right.
I think this solves save-decode-cache but regresses the preemption case. The correct approach is probably to take a min:

```python
num_current_tokens = request.num_computed_tokens
tracker_len = len(request_tracker.token_ids)
slice_base = min(num_current_tokens, tracker_len)
new_token_ids = request.all_token_ids[
    slice_base : slice_base + num_new_tokens
]
```

Neither one should be smaller than we expect. I made the change earlier because there were some cases (e.g. preemption) where num_computed_tokens <= len(request_tracker.token_ids), but it seems from your PR the opposite can be true too.
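A quick check that the min() above picks a safe base in both directions (toy numbers, and slice_base is a hypothetical helper for illustration, not code from the PR):

```python
# Sanity check for the min() approach in both failure directions.

def slice_base(num_computed_tokens: int, tracker_len: int) -> int:
    return min(num_computed_tokens, tracker_len)

# Decode on vLLM >= 0.9.2: the scheduler's counter runs ahead of the
# tracker, so the tracker length is the safe base.
assert slice_base(num_computed_tokens=6, tracker_len=5) == 5

# Preemption/rollback: the counter is rolled back below the tracker
# length, so the counter is the safe base.
assert slice_base(num_computed_tokens=3, tracker_len=5) == 3
print("both cases pick the smaller, in-range base")
```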
@cursor review
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Use the minimum of tracker length and computed tokens to ensure consistency. This handles two main scenarios:
1. Normal execution, where tracker_len is the current state.
2. Preemption cases, where num_computed_tokens might be smaller due to state rollback.

Signed-off-by: Tony Lin <tony.lin@intel.com>
Taking a min is a good idea! The fix has been updated; please take another look, thanks. @sammshen

Note (Medium Risk): Adjusts how new tokens are sliced during decode/cached-request scheduling; incorrect indexing could still impact cache-save behavior, but the change is localized to metadata construction.
Overview
Fixes `build_connector_meta` token slicing for scheduled cached requests by basing the slice start on LMCache's tracked token count (taking `min(tracker_len, request.num_computed_tokens)`), instead of relying solely on `request.num_computed_tokens`.

This prevents empty `new_token_ids` slices on vLLM >= 0.9.2 (and remains safer under preemption/rollback), allowing `RequestTracker.token_ids` to grow correctly so the decode-phase KV cache can be saved as intended.

Written by Cursor Bugbot for commit c608c66.