[Hybrid]: Decouple Kernel Block Size from KV Page Size #24486
vllm-bot merged 63 commits into vllm-project:main
Conversation
Code Review
This pull request introduces a hybrid cache architecture to decouple logical and physical block sizes, which is a significant enhancement for memory management. The changes span configuration, platform-specific code, and the core block table management. The implementation in block_table.py appears solid. However, I've identified some critical issues in the tests intended to validate this new functionality. The tests are flawed and do not correctly verify the hybrid block logic, which could mask bugs. Additionally, there's a piece of logic in the GPUModelRunner that could be made more robust. My review focuses on fixing these test and implementation issues to ensure the new feature is reliable and well-tested.
Also CC @tdoublep
heheda12345 left a comment:
Discussed with @zhiyuan1i offline. Two major concerns:
- I prefer to calculate the kernel block size for each attention backend in `gpu_model_runner`.
- It would be great if `BlockTable.block_table` and `BlockTable.physical_block_table` could be merged into one tensor.
@heheda12345 Thanks for the prompt feedback! I've addressed suggestion 2 and merged `BlockTable.block_table` and `BlockTable.physical_block_table` into a single tensor as recommended. :)
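The merged layout the discussion converges on can be sketched as follows. This is a hypothetical illustration, not the PR's actual implementation: a single table stores physical page IDs, and logical kernel-block IDs are derived arithmetically from the (assumed) page-size-to-kernel-block-size ratio, so no second tensor is needed.

```python
# Hypothetical sketch: one table of physical KV pages per request; the
# logical (kernel) block IDs are computed from it rather than stored.
blocks_per_page = 4  # assumed ratio: kv page size / kernel block size

physical_pages = [7, 2]  # pages allocated to one request (toy values)

# Each physical page p backs kernel blocks p*ratio .. p*ratio + ratio - 1,
# so the logical table never needs to be materialized separately.
logical_blocks = [
    p * blocks_per_page + i for p in physical_pages for i in range(blocks_per_page)
]
print(logical_blocks)  # [28, 29, 30, 31, 8, 9, 10, 11]
```

The design choice here is that the logical view is a pure function of the physical table, which is what makes a single-tensor representation sufficient.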
CC @gshtras @hongxiayang as this also affects ROCm.
Signed-off-by: lizhiyuan <uniartisan2017@gmail.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Documentation preview: https://vllm--24486.org.readthedocs.build/en/24486/
Signed-off-by: Zhiyuan Li <uniartisan2017@gmail.com>
heheda12345 left a comment:
LGTM! Thanks for this enhancement. Follow-ups:
- more clean-ups @heheda12345
- verify the `get_supported_kernel_block_size` of each attention backend.
```python
else:
    self.reorder_batch_threshold = reorder_batch_threshold_i

def _find_compatible_block_sizes(
```
(not a blocker) this function may be simplified.
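For readers following along, a possible shape for a helper like `_find_compatible_block_sizes` is sketched below. Only the name comes from the diff; the signature, the `MultipleOf` stand-in, and the body are assumptions for illustration, not the PR's code: it returns the kernel block sizes that both satisfy a backend's constraint list and divide the KV manager's block size evenly.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class MultipleOf:
    """Stand-in for vLLM's constraint type: admits any multiple of base."""
    base: int


def find_compatible_block_sizes(
    supported: list[Union[int, MultipleOf]],
    kv_manager_block_size: int,
) -> list[int]:
    """Kernel block sizes satisfying the backend constraint that also
    divide the KV manager block size, so pages split into whole blocks."""
    compatible = set()
    for constraint in supported:
        if isinstance(constraint, MultipleOf):
            # Every multiple of `base` up to the manager block size
            # that divides it evenly is a valid kernel block size.
            for size in range(constraint.base, kv_manager_block_size + 1, constraint.base):
                if kv_manager_block_size % size == 0:
                    compatible.add(size)
        elif kv_manager_block_size % constraint == 0:
            compatible.add(constraint)
    return sorted(compatible)


print(find_compatible_block_sizes([MultipleOf(16)], 64))  # [16, 32, 64]
```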
```python
num_blocks = raw_tensor.numel() // kv_cache_spec.page_size_bytes
if isinstance(kv_cache_spec, AttentionSpec):
    has_attn = True
    kv_manager_block_size = kv_cache_spec.block_size
```
(not a blocker) should we use the common block size of all attention groups in the same kv cache group here?
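The arithmetic in the quoted line is straightforward once the raw tensor is viewed as bytes; here is a toy-number sketch (all sizes assumed, not taken from the PR) of how the page count falls out of the raw KV buffer.

```python
# Toy numbers (assumed) illustrating num_blocks = numel // page_size_bytes,
# which holds when the raw KV tensor is viewed with a 1-byte dtype.
raw_buffer_bytes = 1 << 20  # 1 MiB raw KV cache slab (assumed)

# page bytes = 2 (K and V) * block 16 tokens * 8 heads * head dim 128 * 2 (fp16)
page_size_bytes = 2 * 16 * 8 * 128 * 2

num_blocks = raw_buffer_bytes // page_size_bytes
print(num_blocks)  # 16
```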
```python
@staticmethod
def get_supported_kernel_block_size() -> list[Union[int, MultipleOf]]:
    return [MultipleOf(16)]
```
Technically FA3 would support MultipleOf(1) while FA2 would support MultipleOf(16); I don't think it's worth handling this though.
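The FA2/FA3 distinction in this comment comes down to a membership check like the one sketched below; the `MultipleOf` semantics are inferred from the diff above, and the helper name is hypothetical.

```python
from typing import Union


class MultipleOf:
    """Stand-in for vLLM's constraint type: admits any multiple of base."""
    def __init__(self, base: int):
        self.base = base


def is_supported(candidate: int, supported: list[Union[int, MultipleOf]]) -> bool:
    """True if the candidate kernel block size satisfies any entry,
    where MultipleOf(n) admits every positive multiple of n."""
    for entry in supported:
        if isinstance(entry, MultipleOf):
            if candidate % entry.base == 0:
                return True
        elif candidate == entry:
            return True
    return False


# FA2-style constraint from the diff vs. the looser FA3-style one:
print(is_supported(48, [MultipleOf(16)]))  # True
print(is_supported(8, [MultipleOf(16)]))   # False
print(is_supported(8, [MultipleOf(1)]))    # True
```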
Purpose
This PR introduces a hybrid cache architecture that separates the logical kernel block size from the physical page size, enabling more flexible memory management.
This decoupling enables independent development of high-performance attention operators without being constrained by linear attention mechanisms like Mamba, addressing the performance bottlenecks discussed in issues #24280 and #23161.
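The core invariant behind the decoupling can be stated in a few lines. The numbers below are illustrative assumptions, not values from the PR: the KV manager picks one physical page size for all layers (e.g. sized to fit a Mamba state), while each attention backend runs its kernels on a smaller logical block size, and the only requirement is that a page holds a whole number of kernel blocks.

```python
# Assumed illustrative sizes, in tokens:
physical_page_tokens = 64  # KV-manager page size, shared across layers
kernel_block_tokens = 16   # attention backend's kernel block size

# Decoupling only requires that a physical page split evenly into
# logical kernel blocks; the two sizes are otherwise independent.
assert physical_page_tokens % kernel_block_tokens == 0
ratio = physical_page_tokens // kernel_block_tokens
print(ratio)  # each physical page backs this many kernel blocks
```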
Test Plan
Added comprehensive tests in tests/v1/worker/test_gpu_model_runner.py to verify:
Test Result
pytest tests/v1/worker/test_gpu_model_runner.py: 20 passed