
[Bugfix]: patch save_decode_cache #2929

Merged
sammshen merged 2 commits into LMCache:dev from sammshen:preempt-and-save-decode-bugs
Apr 1, 2026

Conversation

@sammshen (Contributor) commented Apr 1, 2026

Two bugs in vllm_v1_adapter.py when save_decode_cache is enabled:

  1. Token slicing used num_computed_tokens as the slice start, but after preemption the tracker can be longer than num_computed_tokens, causing an empty or negative slice. Fix: use min(num_computed_tokens, tracker_len) as the slice base (see the sketch after this list). Original PR (@hlin99): fix: save_decode_cache is broken on vllm >= 0.9.2 #2821

  2. The assertion in build_connector_meta expected num_computed_tokens to exactly equal max(lmcache_cached, vllm_cached). On a full LMCache hit, vLLM's get_num_new_matched_tokens subtracts 1, so num_computed_tokens is one less than the raw lmcache value. Fix: apply the same -1 adjustment deterministically when lmcache_cached == num_tokens and lmcache > vllm.
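
A minimal, standalone sketch of the slice-base part of fix 1 (the helper name and the numbers in the asserts are illustrative, not from the adapter; only the min() expression reflects the actual change):

    # Hypothetical helper isolating the fixed slice base; the surrounding
    # slicing code in vllm_v1_adapter.py is not reproduced here.
    def new_token_slice_start(num_computed_tokens: int, tracker_len: int) -> int:
        # Per the fix, the slice into request.all_token_ids starts at whichever
        # of num_computed_tokens / tracker length is smaller, covering both
        # normal decode (tracker lags) and post-preemption (tracker is longer).
        return min(num_computed_tokens, tracker_len)

    assert new_token_slice_start(100, 95) == 95   # normal decode: tracker lags behind
    assert new_token_slice_start(50, 120) == 50   # resumed after preemption: tracker is longer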

If applicable:

  • this PR contains user facing changes - docs added
  • this PR contains unit tests

Note

Medium Risk
Touches vLLM connector scheduling/caching logic around preemption and decode cache saving; mistakes could cause incorrect KV save/load behavior or assertion crashes during high-throughput decode.

Overview
Fixes two save_decode_cache edge cases in vllm_v1_adapter.build_connector_meta.

During cached-request scheduling, new-token slicing now starts from min(request.num_computed_tokens, len(request_tracker.token_ids)) to handle both decode where the tracker lags and post-preemption where the tracker may be longer, preventing empty/incorrect new_token_ids.

For preempted requests, the strict num_computed_tokens == max(lmcache_cached, vllm_cached) assertion is relaxed to allow the deterministic off-by-one of a full LMCache hit (vLLM forces recomputation of the last token), avoiding false assertion failures; a sketch of the adjusted check follows below. The PR also adds a focused unit test suite covering normal decode, preemption, and full-hit scenarios without requiring vLLM or a GPU.
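
A minimal, self-contained sketch of the adjusted check. The stub classes, helper name, and example numbers are illustrative; the full-hit condition mirrors the snippets quoted in the reviews below, and the surrounding control flow is assumed:

    from dataclasses import dataclass, field

    @dataclass
    class LoadSpec:                     # assumed minimal stand-in
        lmcache_cached_tokens: int
        vllm_cached_tokens: int

    @dataclass
    class Request:                      # assumed minimal stand-in
        num_computed_tokens: int
        all_token_ids: list[int] = field(default_factory=list)

    def check_preempted_request(request: Request, load_spec: LoadSpec) -> None:
        expected = max(load_spec.lmcache_cached_tokens, load_spec.vllm_cached_tokens)
        # Full LMCache hit where LMCache dominates: vLLM's
        # get_num_new_matched_tokens subtracts one token so the last token is
        # recomputed, making num_computed_tokens deterministically expected - 1.
        full_hit_adj = (
            load_spec.lmcache_cached_tokens == len(request.all_token_ids)
            and load_spec.lmcache_cached_tokens > load_spec.vllm_cached_tokens
        )
        if full_hit_adj:
            expected -= 1
        assert request.num_computed_tokens == expected

    # Full LMCache hit on a 4032-token request: num_computed_tokens is 4031, not 4032.
    check_preempted_request(
        Request(num_computed_tokens=4031, all_token_ids=list(range(4032))),
        LoadSpec(lmcache_cached_tokens=4032, vllm_cached_tokens=0),
    )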

Written by Cursor Bugbot for commit 733c723.

@gemini-code-assist (Bot) left a comment

Code Review

This pull request fixes token slicing and preemption assertion logic in the vLLM v1 adapter to correctly handle cases where computed tokens and tracker lengths differ, particularly during preemption or full cache hits. A new test suite is introduced to validate these scenarios. Review feedback suggests using request.num_tokens for consistency and identifies several instances in the new test file where type hints and docstring sections are missing, violating the repository's style guidelines.

Comment on lines +1593 to +1596
    full_hit_adj = (
        lmcache_cached_tokens == len(request.all_token_ids)
        and lmcache_cached_tokens > load_spec.vllm_cached_tokens
    )
medium

For consistency with other parts of the codebase (e.g., line 1305), it is recommended to use request.num_tokens instead of len(request.all_token_ids). Both represent the total number of tokens in the request, but num_tokens is used as the primary reference for this value in the scheduler logic.

Suggested change

Before:

    full_hit_adj = (
        lmcache_cached_tokens == len(request.all_token_ids)
        and lmcache_cached_tokens > load_spec.vllm_cached_tokens
    )

After:

    full_hit_adj = (
        lmcache_cached_tokens == request.num_tokens
        and lmcache_cached_tokens > load_spec.vllm_cached_tokens
    )

Comment on lines +48 to +56
    def compute_new_token_ids(
        request: StubRequest,
        tracker_token_ids: list[int],
        num_new_tokens: int,
    ) -> list[int]:
        """
        Reproduces the fixed token slicing logic from build_connector_meta.
        Uses min(num_computed_tokens, tracker_len) as slice base.
        """
medium

The docstring for compute_new_token_ids is missing the Args and Returns sections required by the repository style guide (rule 25).

def compute_new_token_ids(
    request: StubRequest,
    tracker_token_ids: list[int],
    num_new_tokens: int,
) -> list[int]:
    """
    Reproduces the fixed token slicing logic from build_connector_meta.
    Uses min(num_computed_tokens, tracker_len) as slice base.

    Args:
        request (StubRequest): The request object.
        tracker_token_ids (list[int]): The token IDs in the tracker.
        num_new_tokens (int): The number of new tokens.

    Returns:
        list[int]: The new token IDs.
    """
References
  1. All new public functions have docstrings (what, args, return, exceptions) (link)

Comment on lines +63 to +70
    def check_preemption_assertion(
        request: StubRequest,
        load_spec: StubLoadSpec,
    ):
        """
        Reproduces the fixed assertion from build_connector_meta for preempted
        requests. On full cache hit where lmcache is dominant, expected is -1.
        """
medium

The function check_preemption_assertion is missing a return type hint (-> None) and its docstring is missing the Args and Returns sections. These are required by the repository style guide (rules 24 and 25).

Suggested change

Before:

    def check_preemption_assertion(
        request: StubRequest,
        load_spec: StubLoadSpec,
    ):
        """
        Reproduces the fixed assertion from build_connector_meta for preempted
        requests. On full cache hit where lmcache is dominant, expected is -1.
        """

After:

    def check_preemption_assertion(
        request: StubRequest,
        load_spec: StubLoadSpec,
    ) -> None:
        """
        Reproduces the fixed assertion from build_connector_meta for preempted
        requests. On full cache hit where lmcache is dominant, expected is -1.

        Args:
            request (StubRequest): The request object.
            load_spec (StubLoadSpec): The load specification.

        Raises:
            AssertionError: If the num_computed_tokens does not match expected.
        """
References
  1. All new functions have type hints (arguments + return values) (link)
  2. All new public functions have docstrings (what, args, return, exceptions) (link)

    class TestTokenSlicing:
        """Tests for the min()-based slice base in build_connector_meta."""

        def test_sync_decode_normal(self):
medium

Test methods in this file are missing return type hints (-> None). According to the repository style guide (rule 24), all new functions must have type hints for both arguments and return values.

Suggested change

Before:

    def test_sync_decode_normal(self):

After:

    def test_sync_decode_normal(self) -> None:
References
  1. All new functions have type hints (arguments + return values) (link)

@cursor (Bot) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.


    expected = max(load_spec.lmcache_cached_tokens, load_spec.vllm_cached_tokens)
    full_hit_adj = (
        load_spec.lmcache_cached_tokens == request.num_tokens
        and load_spec.lmcache_cached_tokens > load_spec.vllm_cached_tokens
    )

Test full-hit condition diverges from production code

Medium Severity

The test helper check_preemption_assertion checks load_spec.lmcache_cached_tokens == request.num_tokens, but the production code checks lmcache_cached_tokens == len(request.all_token_ids). Nearly all TestPreemptionAssertion tests don't set _all_token_ids on StubRequest, so it defaults to an empty list. This causes the test's full_hit_adj condition to evaluate differently than production — e.g., in test_full_hit_lmcache_dominant, the test sees 4032 == 4032 (True) while production would see 4032 == 0 (False), meaning the assertion would actually fail in production with that stub. Several tests give false confidence by passing only because the reimplemented logic diverges from the real code.
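
In a nutshell (numbers taken from the report above, variable names illustrative):

    lmcache_cached_tokens = 4032
    num_tokens = 4032        # StubRequest.num_tokens as set by the test
    all_token_ids = []       # StubRequest._all_token_ids left at its default

    print(lmcache_cached_tokens == num_tokens)            # True  -> test helper's full-hit condition
    print(lmcache_cached_tokens == len(all_token_ids))    # False -> production's condition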


Triggered by project rule: LMCache Code Review Style Guide

@sammshen requested a review from kobe0938 on April 1, 2026 at 19:04
@sammshen enabled auto-merge (squash) on April 1, 2026 at 19:04
@github-actions (Bot) added the "full" label (Run comprehensive tests on this PR) on Apr 1, 2026
@sammshen merged commit 996da47 into LMCache:dev on Apr 1, 2026
32 of 34 checks passed
jooho-XCENA pushed a commit to xcena-dev/LMCache that referenced this pull request Apr 2, 2026
[Bugfix] Fix save_decode_cache token slicing and preemption assertion

Two bugs in vllm_v1_adapter.py when save_decode_cache is enabled:

1. Token slicing used num_computed_tokens as the slice start, but after
   preemption the tracker can be longer than num_computed_tokens, causing
   an empty or negative slice. Fix: use min(num_computed_tokens, tracker_len).

2. The assertion in build_connector_meta expected num_computed_tokens to
   exactly equal max(lmcache_cached, vllm_cached). On a full LMCache hit
   vLLM's get_num_new_matched_tokens subtracts 1, so num_computed is one
   less than the raw lmcache value. Fix: apply the same -1 adjustment
   deterministically when lmcache_cached == num_tokens and lmcache > vllm.

Co-authored-by: Samuel Shen <slshen@uchciago.edu>
jooho-XCENA pushed a commit to xcena-dev/LMCache that referenced this pull request Apr 2, 2026