Skip to content

[Core][Hybrid allocator + connector 2/n] Unify remove_skipped_blocks by get_last_useful_token#25431

Merged
heheda12345 merged 9 commits into
vllm-project:mainfrom
KuntaiDu:kuntai-add-get-last-useful-token
Nov 6, 2025
Merged

[Core][Hybrid allocator + connector 2/n] Unify remove_skipped_blocks by get_last_useful_token#25431
heheda12345 merged 9 commits into
vllm-project:mainfrom
KuntaiDu:kuntai-add-get-last-useful-token

Conversation

@KuntaiDu

@KuntaiDu KuntaiDu commented Sep 22, 2025

Copy link
Copy Markdown
Collaborator

Purpose

This PR is separated from #23624 that aims to enable allocating tokens inside sliding window when long prefix is not cached in vLLM but cached in connector.

This PR unifies remove_skipped_blocks by adding a new function get_last_useful_token.

Test Plan

pytest -v -s tests/v1/core/test_single_type_kv_cache_manager.py

Test Result

image
Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

…he attention window

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the remove_skipped_blocks method by moving its implementation to the base SingleTypeKVCacheManager class, delegating attention-specific logic to a new get_last_useful_token method. This is a good simplification for most attention managers. However, this refactoring has introduced a critical issue for ChunkedLocalAttentionManager by removing logic essential for maintaining prefix cache integrity. My review focuses on this critical regression.

Comment thread vllm/v1/core/single_type_kv_cache_manager.py Outdated
Comment thread vllm/v1/core/single_type_kv_cache_manager.py
Comment thread vllm/v1/core/single_type_kv_cache_manager.py Outdated
@heheda12345

Copy link
Copy Markdown
Collaborator

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Comment thread vllm/v1/core/single_type_kv_cache_manager.py Outdated
@sarckk

sarckk commented Oct 29, 2025

Copy link
Copy Markdown
Member

@KuntaiDu hi what's the status of this PR?

@KuntaiDu

KuntaiDu commented Nov 4, 2025

Copy link
Copy Markdown
Collaborator Author

Now circle back to this PR. Was making Ray Summit talk slides lol.

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

@heheda12345 heheda12345 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Only some nits related to the function name update in the last pass.

Comment thread vllm/v1/core/single_type_kv_cache_manager.py Outdated
Comment thread vllm/v1/core/single_type_kv_cache_manager.py Outdated
Comment thread vllm/v1/core/single_type_kv_cache_manager.py Outdated
@KuntaiDu

KuntaiDu commented Nov 4, 2025

Copy link
Copy Markdown
Collaborator Author

Now working on correctness test on ChunkedLocalAttention using Llama 4. The plan is to use lm_eval on long-context dataset, and make sure the GPU memory is small enough to trigger frequent chunked block eviction.

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
@KuntaiDu

KuntaiDu commented Nov 4, 2025

Copy link
Copy Markdown
Collaborator Author

Fixed the comments. Thanks for the feedback @heheda12345

@heheda12345

Copy link
Copy Markdown
Collaborator

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread vllm/v1/core/single_type_kv_cache_manager.py

@heheda12345 heheda12345 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Can you adjust the figures a little bit?

Comment thread vllm/v1/core/single_type_kv_cache_manager.py Outdated
Comment thread vllm/v1/core/single_type_kv_cache_manager.py Outdated
Comment thread vllm/v1/core/single_type_kv_cache_manager.py Outdated
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
@KuntaiDu KuntaiDu requested a review from heheda12345 November 5, 2025 08:10

@heheda12345 heheda12345 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@heheda12345 heheda12345 enabled auto-merge (squash) November 5, 2025 19:06
@github-actions github-actions Bot added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 5, 2025
@heheda12345 heheda12345 merged commit efe73e9 into vllm-project:main Nov 6, 2025
47 checks passed
@KuntaiDu KuntaiDu deleted the kuntai-add-get-last-useful-token branch November 6, 2025 21:00
ZhengHongming888 pushed a commit to ZhengHongming888/vllm that referenced this pull request Nov 8, 2025
…` by `get_last_useful_token` (vllm-project#25431)

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
…` by `get_last_useful_token` (vllm-project#25431)

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
mystous pushed a commit to mystous/vllm_hybrid that referenced this pull request May 10, 2026
…` by `get_last_useful_token` (vllm-project#25431)

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
my-other-github-account pushed a commit to my-other-github-account/vllm that referenced this pull request May 15, 2026
…` by `get_last_useful_token` (vllm-project#25431)

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
my-other-github-account pushed a commit to my-other-github-account/vllm that referenced this pull request May 15, 2026
…` by `get_last_useful_token` (vllm-project#25431)

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
0826joyce pushed a commit to 0826joyce/vllm-serving-optimization that referenced this pull request May 19, 2026
…` by `get_last_useful_token` (vllm-project#25431)

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready ONLY add when PR is ready to merge/full CI is needed v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants