[Bugfix]: patch save_decode_cache #2929
Conversation
Two bugs in vllm_v1_adapter.py when save_decode_cache is enabled:

1. Token slicing used num_computed_tokens as the slice start, but after preemption the tracker can be longer than num_computed_tokens, causing an empty or negative slice. Fix: use min(num_computed_tokens, tracker_len) as the slice base.
2. The assertion in build_connector_meta expected num_computed_tokens to exactly equal max(lmcache_cached, vllm_cached). On a full LMCache hit, vLLM's get_num_new_matched_tokens subtracts 1, so num_computed_tokens is one less than the raw LMCache value. Fix: apply the same -1 adjustment deterministically when lmcache_cached == num_tokens and lmcache > vllm.
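For reference, a minimal sketch of the two adjustments as standalone helpers (hypothetical names; the real logic lives inline in build_connector_meta):

```python
def slice_new_token_ids(
    all_token_ids: list[int],
    num_computed_tokens: int,
    tracker_len: int,
    num_new_tokens: int,
) -> list[int]:
    # Bug 1: after preemption the tracker can hold more tokens than
    # num_computed_tokens. Clamping the slice base with min() keeps the
    # slice from starting past the intended range and coming back empty.
    start = min(num_computed_tokens, tracker_len)
    return all_token_ids[start : start + num_new_tokens]


def expected_num_computed_tokens(
    lmcache_cached: int, vllm_cached: int, num_tokens: int
) -> int:
    # Bug 2: on a full LMCache hit vLLM's get_num_new_matched_tokens
    # subtracts 1 (the last token is recomputed), so the expected value
    # mirrors that -1 deterministically.
    expected = max(lmcache_cached, vllm_cached)
    if lmcache_cached == num_tokens and lmcache_cached > vllm_cached:
        expected -= 1
    return expected
```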
Code Review
This pull request fixes token slicing and preemption assertion logic in the vLLM v1 adapter to correctly handle cases where computed tokens and tracker lengths differ, particularly during preemption or full cache hits. A new test suite is introduced to validate these scenarios. Review feedback suggests using request.num_tokens for consistency and identifies several instances in the new test file where type hints and docstring sections are missing, violating the repository's style guidelines.
```python
full_hit_adj = (
    lmcache_cached_tokens == len(request.all_token_ids)
    and lmcache_cached_tokens > load_spec.vllm_cached_tokens
)
```
For consistency with other parts of the codebase (e.g., line 1305), it is recommended to use request.num_tokens instead of len(request.all_token_ids). Both represent the total number of tokens in the request, but num_tokens is used as the primary reference for this value in the scheduler logic.
```suggestion
full_hit_adj = (
    lmcache_cached_tokens == request.num_tokens
    and lmcache_cached_tokens > load_spec.vllm_cached_tokens
)
```
```python
def compute_new_token_ids(
    request: StubRequest,
    tracker_token_ids: list[int],
    num_new_tokens: int,
) -> list[int]:
    """
    Reproduces the fixed token slicing logic from build_connector_meta.
    Uses min(num_computed_tokens, tracker_len) as slice base.
    """
```
The docstring for compute_new_token_ids is missing the Args and Returns sections required by the repository style guide (rule 25).
```python
def compute_new_token_ids(
    request: StubRequest,
    tracker_token_ids: list[int],
    num_new_tokens: int,
) -> list[int]:
    """
    Reproduces the fixed token slicing logic from build_connector_meta.
    Uses min(num_computed_tokens, tracker_len) as slice base.

    Args:
        request (StubRequest): The request object.
        tracker_token_ids (list[int]): The token IDs in the tracker.
        num_new_tokens (int): The number of new tokens.

    Returns:
        list[int]: The new token IDs.
    """
```

References
- All new public functions have docstrings (what, args, return, exceptions) (link)
```python
def check_preemption_assertion(
    request: StubRequest,
    load_spec: StubLoadSpec,
):
    """
    Reproduces the fixed assertion from build_connector_meta for preempted
    requests. On full cache hit where lmcache is dominant, expected is -1.
    """
```
The function check_preemption_assertion is missing a return type hint (-> None) and its docstring is missing the Args and Returns sections. These are required by the repository style guide (rules 24 and 25).
```suggestion
def check_preemption_assertion(
    request: StubRequest,
    load_spec: StubLoadSpec,
) -> None:
    """
    Reproduces the fixed assertion from build_connector_meta for preempted
    requests. On full cache hit where lmcache is dominant, expected is -1.

    Args:
        request (StubRequest): The request object.
        load_spec (StubLoadSpec): The load specification.

    Raises:
        AssertionError: If the num_computed_tokens does not match expected.
    """
```
```python
class TestTokenSlicing:
    """Tests for the min()-based slice base in build_connector_meta."""

    def test_sync_decode_normal(self):
```
Test methods in this file are missing return type hints (-> None). According to the repository style guide (rule 24), all new functions must have type hints for both arguments and return values.
```suggestion
    def test_sync_decode_normal(self) -> None:
```
References
- All new functions have type hints (arguments + return values) (link)
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
```python
expected = max(load_spec.lmcache_cached_tokens, load_spec.vllm_cached_tokens)
full_hit_adj = (
    load_spec.lmcache_cached_tokens == request.num_tokens
    and load_spec.lmcache_cached_tokens > load_spec.vllm_cached_tokens
)
```
Test full-hit condition diverges from production code
Medium Severity
The test helper check_preemption_assertion checks load_spec.lmcache_cached_tokens == request.num_tokens, but the production code checks lmcache_cached_tokens == len(request.all_token_ids). Nearly all TestPreemptionAssertion tests don't set _all_token_ids on StubRequest, so it defaults to an empty list. This causes the test's full_hit_adj condition to evaluate differently than production — e.g., in test_full_hit_lmcache_dominant, the test sees 4032 == 4032 (True) while production would see 4032 == 0 (False), meaning the assertion would actually fail in production with that stub. Several tests give false confidence by passing only because the reimplemented logic diverges from the real code.
Triggered by project rule: LMCache Code Review Style Guide
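One way to keep the stub and the production condition from diverging (a hypothetical sketch, not the test file's actual code: num_tokens is derived from _all_token_ids rather than stored separately):

```python
from dataclasses import dataclass, field


@dataclass
class StubRequest:
    # Deriving num_tokens from _all_token_ids means the helper's
    # `lmcache_cached_tokens == request.num_tokens` check and production's
    # `lmcache_cached_tokens == len(request.all_token_ids)` check can
    # never evaluate differently.
    _all_token_ids: list[int] = field(default_factory=list)
    num_computed_tokens: int = 0

    @property
    def all_token_ids(self) -> list[int]:
        return self._all_token_ids

    @property
    def num_tokens(self) -> int:
        return len(self._all_token_ids)
```

With num_tokens computed rather than stored, tests such as test_full_hit_lmcache_dominant would have to populate _all_token_ids, and the full_hit_adj condition would evaluate the same way in the helper and in production.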
[Bugfix] Fix save_decode_cache token slicing and preemption assertion

Two bugs in vllm_v1_adapter.py when save_decode_cache is enabled:

1. Token slicing used num_computed_tokens as the slice start, but after preemption the tracker can be longer than num_computed_tokens, causing an empty or negative slice. Fix: use min(num_computed_tokens, tracker_len).
2. The assertion in build_connector_meta expected num_computed_tokens to exactly equal max(lmcache_cached, vllm_cached). On a full LMCache hit vLLM's get_num_new_matched_tokens subtracts 1, so num_computed_tokens is one less than the raw LMCache value. Fix: apply the same -1 adjustment deterministically when lmcache_cached == num_tokens and lmcache > vllm.

Co-authored-by: Samuel Shen <slshen@uchciago.edu>
What this PR does / why we need it:

Two bugs in vllm_v1_adapter.py when save_decode_cache is enabled:

1. Token slicing used num_computed_tokens as the slice start, but after preemption the tracker can be longer than num_computed_tokens, causing an empty or negative slice. Fix: use min(num_computed_tokens, tracker_len). Original PR (@hlin99): fix: save_decode_cache is broken on vllm >= 0.9.2 #2821
2. The assertion in build_connector_meta expected num_computed_tokens to exactly equal max(lmcache_cached, vllm_cached). On a full LMCache hit, vLLM's get_num_new_matched_tokens subtracts 1, so num_computed_tokens is one less than the raw LMCache value. Fix: apply the same -1 adjustment deterministically when lmcache_cached == num_tokens and lmcache > vllm.

Special notes for your reviewers:

If applicable:
Note
Medium Risk
Touches vLLM connector scheduling/caching logic around preemption and decode cache saving; mistakes could cause incorrect KV save/load behavior or assertion crashes during high-throughput decode.
Overview
Fixes two save_decode_cache edge cases in vllm_v1_adapter.build_connector_meta.

During cached-request scheduling, new-token slicing now starts from min(request.num_computed_tokens, len(request_tracker.token_ids)) to handle both decode, where the tracker lags, and post-preemption, where the tracker may be longer, preventing empty or incorrect new_token_ids.

For preempted requests, the strict num_computed_tokens == max(lmcache_cached, vllm_cached) assertion is adjusted to allow the deterministic full-LMCache-hit off-by-one (vLLM forces last-token recomputation), avoiding false assertion failures. Adds a focused unit test suite covering normal decode, preemption, and full-hit scenarios without requiring vLLM/GPU.

Written by Cursor Bugbot for commit 733c723. This will update automatically on new commits.
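For illustration, a hedged pytest-style sketch of the post-preemption slicing case the suite targets (invented values; the PR's actual test file defines its own stubs and covers more scenarios):

```python
def test_preempted_tracker_longer_than_computed() -> None:
    # Hypothetical scenario: after preemption the tracker holds more
    # tokens (7) than the scheduler's num_computed_tokens (4). The
    # min()-clamped base keeps the slice anchored at the computed
    # position instead of past it.
    all_token_ids = list(range(10))
    num_computed_tokens = 4
    tracker_len = 7
    num_new_tokens = 3

    start = min(num_computed_tokens, tracker_len)
    assert all_token_ids[start : start + num_new_tokens] == [4, 5, 6]
```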