[Bugfix]: patch save_decode_cache #2929
Conversation
Two bugs in vllm_v1_adapter.py when save_decode_cache is enabled:

1. Token slicing used num_computed_tokens as the slice start, but after preemption the tracker can be longer than num_computed_tokens, causing an empty or negative slice. Fix: use min(num_computed_tokens, tracker_len) as the slice base.
2. The assertion in build_connector_meta expected num_computed_tokens to exactly equal max(lmcache_cached, vllm_cached). On a full LMCache hit, vLLM's get_num_new_matched_tokens subtracts 1, so num_computed_tokens is one less than the raw LMCache value. Fix: apply the same -1 adjustment deterministically when lmcache_cached == num_tokens and lmcache > vllm.
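For reference, a minimal sketch of the two adjustments as standalone helpers (hypothetical names; the real logic lives inline in build_connector_meta):

```python
def slice_new_token_ids(
    all_token_ids: list[int],
    num_computed_tokens: int,
    tracker_len: int,
    num_new_tokens: int,
) -> list[int]:
    # Bug 1: after preemption the tracker can hold more tokens than
    # num_computed_tokens. Clamping the slice base with min() keeps the
    # slice from starting past the intended range and coming back empty.
    start = min(num_computed_tokens, tracker_len)
    return all_token_ids[start : start + num_new_tokens]


def expected_num_computed_tokens(
    lmcache_cached: int, vllm_cached: int, num_tokens: int
) -> int:
    # Bug 2: on a full LMCache hit vLLM's get_num_new_matched_tokens
    # subtracts 1 (the last token is recomputed), so the expected value
    # mirrors that -1 deterministically.
    expected = max(lmcache_cached, vllm_cached)
    if lmcache_cached == num_tokens and lmcache_cached > vllm_cached:
        expected -= 1
    return expected
```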
Code Review
This pull request fixes token slicing and preemption assertion logic in the vLLM v1 adapter to correctly handle cases where computed tokens and tracker lengths differ, particularly during preemption or full cache hits. A new test suite is introduced to validate these scenarios. Review feedback suggests using request.num_tokens for consistency and identifies several instances in the new test file where type hints and docstring sections are missing, violating the repository's style guidelines.
```python
full_hit_adj = (
    lmcache_cached_tokens == len(request.all_token_ids)
    and lmcache_cached_tokens > load_spec.vllm_cached_tokens
)
```
For consistency with other parts of the codebase (e.g., line 1305), it is recommended to use request.num_tokens instead of len(request.all_token_ids). Both represent the total number of tokens in the request, but num_tokens is used as the primary reference for this value in the scheduler logic.
```suggestion
full_hit_adj = (
    lmcache_cached_tokens == request.num_tokens
    and lmcache_cached_tokens > load_spec.vllm_cached_tokens
)
```
```python
def compute_new_token_ids(
    request: StubRequest,
    tracker_token_ids: list[int],
    num_new_tokens: int,
) -> list[int]:
    """
    Reproduces the fixed token slicing logic from build_connector_meta.
    Uses min(num_computed_tokens, tracker_len) as slice base.
    """
```
The docstring for compute_new_token_ids is missing the Args and Returns sections required by the repository style guide (rule 25).
```python
def compute_new_token_ids(
    request: StubRequest,
    tracker_token_ids: list[int],
    num_new_tokens: int,
) -> list[int]:
    """
    Reproduces the fixed token slicing logic from build_connector_meta.
    Uses min(num_computed_tokens, tracker_len) as slice base.

    Args:
        request (StubRequest): The request object.
        tracker_token_ids (list[int]): The token IDs in the tracker.
        num_new_tokens (int): The number of new tokens.

    Returns:
        list[int]: The new token IDs.
    """
```

References
- All new public functions have docstrings (what, args, return, exceptions) (link)
```python
def check_preemption_assertion(
    request: StubRequest,
    load_spec: StubLoadSpec,
):
    """
    Reproduces the fixed assertion from build_connector_meta for preempted
    requests. On full cache hit where lmcache is dominant, expected is -1.
    """
```
The function check_preemption_assertion is missing a return type hint (-> None) and its docstring is missing the Args and Returns sections. These are required by the repository style guide (rules 24 and 25).
```suggestion
def check_preemption_assertion(
    request: StubRequest,
    load_spec: StubLoadSpec,
) -> None:
    """
    Reproduces the fixed assertion from build_connector_meta for preempted
    requests. On full cache hit where lmcache is dominant, expected is -1.

    Args:
        request (StubRequest): The request object.
        load_spec (StubLoadSpec): The load specification.

    Raises:
        AssertionError: If the num_computed_tokens does not match expected.
    """
```
```python
class TestTokenSlicing:
    """Tests for the min()-based slice base in build_connector_meta."""

    def test_sync_decode_normal(self):
```
Test methods in this file are missing return type hints (-> None). According to the repository style guide (rule 24), all new functions must have type hints for both arguments and return values.
```suggestion
    def test_sync_decode_normal(self) -> None:
```
References
- All new functions have type hints (arguments + return values) (link)
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
```python
expected = max(load_spec.lmcache_cached_tokens, load_spec.vllm_cached_tokens)
full_hit_adj = (
    load_spec.lmcache_cached_tokens == request.num_tokens
    and load_spec.lmcache_cached_tokens > load_spec.vllm_cached_tokens
)
```
Test full-hit condition diverges from production code
Medium Severity
The test helper check_preemption_assertion checks load_spec.lmcache_cached_tokens == request.num_tokens, but the production code checks lmcache_cached_tokens == len(request.all_token_ids). Nearly all TestPreemptionAssertion tests don't set _all_token_ids on StubRequest, so it defaults to an empty list. This causes the test's full_hit_adj condition to evaluate differently than production — e.g., in test_full_hit_lmcache_dominant, the test sees 4032 == 4032 (True) while production would see 4032 == 0 (False), meaning the assertion would actually fail in production with that stub. Several tests give false confidence by passing only because the reimplemented logic diverges from the real code.
Triggered by project rule: LMCache Code Review Style Guide
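One way to keep the stub and the production condition from diverging (a hypothetical sketch, not the test file's actual code: num_tokens is derived from _all_token_ids rather than stored separately):

```python
from dataclasses import dataclass, field


@dataclass
class StubRequest:
    # Deriving num_tokens from _all_token_ids means the helper's
    # `lmcache_cached_tokens == request.num_tokens` check and production's
    # `lmcache_cached_tokens == len(request.all_token_ids)` check can
    # never evaluate differently.
    _all_token_ids: list[int] = field(default_factory=list)
    num_computed_tokens: int = 0

    @property
    def all_token_ids(self) -> list[int]:
        return self._all_token_ids

    @property
    def num_tokens(self) -> int:
        return len(self._all_token_ids)
```

With num_tokens computed rather than stored, tests such as test_full_hit_lmcache_dominant would have to populate _all_token_ids, and the full_hit_adj condition would evaluate the same way in the helper and in production.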
[Bugfix] Fix save_decode_cache token slicing and preemption assertion

Two bugs in vllm_v1_adapter.py when save_decode_cache is enabled:

1. Token slicing used num_computed_tokens as the slice start, but after preemption the tracker can be longer than num_computed_tokens, causing an empty or negative slice. Fix: use min(num_computed_tokens, tracker_len).
2. The assertion in build_connector_meta expected num_computed_tokens to exactly equal max(lmcache_cached, vllm_cached). On a full LMCache hit vLLM's get_num_new_matched_tokens subtracts 1, so num_computed_tokens is one less than the raw LMCache value. Fix: apply the same -1 adjustment deterministically when lmcache_cached == num_tokens and lmcache > vllm.

Co-authored-by: Samuel Shen <slshen@uchciago.edu>
What this PR does / why we need it:

Two bugs in vllm_v1_adapter.py when save_decode_cache is enabled:

1. Token slicing used num_computed_tokens as the slice start, but after preemption the tracker can be longer than num_computed_tokens, causing an empty or negative slice. Fix: use min(num_computed_tokens, tracker_len). Original PR (@hlin99): fix: save_decode_cache is broken on vllm >= 0.9.2 #2821
2. The assertion in build_connector_meta expected num_computed_tokens to exactly equal max(lmcache_cached, vllm_cached). On a full LMCache hit, vLLM's get_num_new_matched_tokens subtracts 1, so num_computed_tokens is one less than the raw LMCache value. Fix: apply the same -1 adjustment deterministically when lmcache_cached == num_tokens and lmcache > vllm.

Special notes for your reviewers:

If applicable:
Note
Medium Risk
Touches vLLM connector scheduling/caching logic around preemption and decode cache saving; mistakes could cause incorrect KV save/load behavior or assertion crashes during high-throughput decode.
Overview
Fixes two save_decode_cache edge cases in vllm_v1_adapter.build_connector_meta.

During cached-request scheduling, new-token slicing now starts from min(request.num_computed_tokens, len(request_tracker.token_ids)) to handle both decode, where the tracker lags, and post-preemption, where the tracker may be longer, preventing empty or incorrect new_token_ids.

For preempted requests, the strict num_computed_tokens == max(lmcache_cached, vllm_cached) assertion is adjusted to allow the deterministic full-LMCache-hit off-by-one (vLLM forces last-token recomputation), avoiding false assertion failures. Adds a focused unit test suite covering normal decode, preemption, and full-hit scenarios without requiring vLLM/GPU.

Written by Cursor Bugbot for commit 733c723. This will update automatically on new commits.
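For illustration, a hedged pytest-style sketch of the post-preemption slicing case the suite targets (invented values; the PR's actual test file defines its own stubs and covers more scenarios):

```python
def test_preempted_tracker_longer_than_computed() -> None:
    # Hypothetical scenario: after preemption the tracker holds more
    # tokens (7) than the scheduler's num_computed_tokens (4). The
    # min()-clamped base keeps the slice anchored at the computed
    # position instead of past it.
    all_token_ids = list(range(10))
    num_computed_tokens = 4
    tracker_len = 7
    num_new_tokens = 3

    start = min(num_computed_tokens, tracker_len)
    assert all_token_ids[start : start + num_new_tokens] == [4, 5, 6]
```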