[Core][Hybrid allocator + connector 2/n] Unify remove_skipped_blocks by get_last_useful_token#25431
Conversation
…he attention window Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
There was a problem hiding this comment.
Code Review
This pull request refactors the remove_skipped_blocks method by moving its implementation to the base SingleTypeKVCacheManager class, delegating attention-specific logic to a new get_last_useful_token method. This is a good simplification for most attention managers. However, this refactoring has introduced a critical issue for ChunkedLocalAttentionManager by removing logic essential for maintaining prefix cache integrity. My review focuses on this critical regression.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
|
@KuntaiDu hi what's the status of this PR? |
|
Now circle back to this PR. Was making Ray Summit talk slides lol. |
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
heheda12345
left a comment
There was a problem hiding this comment.
LGTM! Only some nits related to the function name update in the last pass.
|
Now working on correctness test on ChunkedLocalAttention using Llama 4. The plan is to use |
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
|
Fixed the comments. Thanks for the feedback @heheda12345 |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
heheda12345
left a comment
There was a problem hiding this comment.
LGTM! Can you adjust the figures a little bit?
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
…` by `get_last_useful_token` (vllm-project#25431) Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
…` by `get_last_useful_token` (vllm-project#25431) Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
…` by `get_last_useful_token` (vllm-project#25431) Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
…` by `get_last_useful_token` (vllm-project#25431) Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
…` by `get_last_useful_token` (vllm-project#25431) Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
…` by `get_last_useful_token` (vllm-project#25431) Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Purpose
This PR is separated from #23624 that aims to enable allocating tokens inside sliding window when long prefix is not cached in vLLM but cached in connector.
This PR unifies
remove_skipped_blocksby adding a new functionget_last_useful_token.Test Plan
pytest -v -s tests/v1/core/test_single_type_kv_cache_manager.pyTest Result
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.