[Hybrid]: Decouple Kernel Block Size from KV Page Size by zhiyuan1i · Pull Request #24486 · vllm-project/vllm

zhiyuan1i · 2025-09-09T06:52:08Z

Purpose

This PR introduces a hybrid cache architecture that separates logical kernel block size from
physical page size, enabling more flexible memory management. Key changes include:

Added kernel_block_size field to CacheConfig for logical block sizing
Enhanced platform-specific configurations for CUDA and ROCm to support hybrid blocks
Implemented block table conversion logic between physical and logical representations
Added support for different physical/logical block size ratios in V1 worker components

This decoupling lets high-performance attention operators for hybrid models be developed
independently, without being constrained by linear attention mechanisms such as Mamba. It
addresses the performance bottlenecks discussed in issues #24280 and #23161.
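The block table conversion listed above can be pictured with a minimal sketch. It assumes each physical page is split into an integer number of equally sized kernel blocks and that kernel block IDs are numbered contiguously within a page. The function name and signature are hypothetical and are not taken from block_table.py.

```python
def to_kernel_block_ids(
    page_ids: list[int], page_size: int, kernel_block_size: int
) -> list[int]:
    """Expand physical page IDs into the kernel block IDs they contain.

    Hypothetical helper for illustration only; not the vLLM implementation.
    """
    if page_size % kernel_block_size != 0:
        raise ValueError("page_size must be a multiple of kernel_block_size")
    # Number of kernel blocks that fit in one physical page.
    ratio = page_size // kernel_block_size
    # Page p covers kernel blocks [p * ratio, (p + 1) * ratio).
    return [pid * ratio + i for pid in page_ids for i in range(ratio)]


# Two 32-token pages seen by a kernel that works on 16-token blocks.
print(to_kernel_block_ids([0, 2], page_size=32, kernel_block_size=16))  # [0, 1, 4, 5]
```

When the two sizes are equal the ratio is 1 and the mapping is the identity, so a backend that does not need the split sees the same table as before.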

Test Plan

Added comprehensive tests in tests/v1/worker/test_gpu_model_runner.py to verify:

Block table conversion between physical and logical representations
Proper handling of different block size ratios
Integration with existing GPU model runner functionality
Platform-specific configurations for CUDA and ROCm

Test Result

pytest tests/v1/worker/test_gpu_model_runner.py: 20 passed

tests/v1/worker/test_gpu_model_runner.py ....................                                                                                                                        [100%]

===================================================================================== warnings summary =====================================================================================
../../../../opt/conda/envs/vllm-upstream/lib/python3.12/site-packages/torch/cuda/__init__.py:63
  /opt/conda/envs/vllm-upstream/lib/python3.12/site-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
    import pynvml  # type: ignore[import]

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
======================================================================== 20 passed, 3 warnings in 89.20s (0:01:29) =========================================================================
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results.
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

gemini-code-assist

Code Review

This pull request introduces a hybrid cache architecture to decouple logical and physical block sizes, which is a significant enhancement for memory management. The changes span configuration, platform-specific code, and the core block table management. The implementation in block_table.py appears solid. However, I've identified some critical issues in the tests intended to validate this new functionality. The tests are flawed and do not correctly verify the hybrid block logic, which could mask bugs. Additionally, there's a piece of logic in the GPUModelRunner that could be made more robust. My review focuses on fixing these test and implementation issues to ensure the new feature is reliable and well-tested.

tests/v1/worker/test_gpu_model_runner.py

vllm/v1/worker/gpu_model_runner.py

heheda12345 · 2025-09-09T07:30:44Z

Also CC @tdoublep

heheda12345

Discussed with @zhiyuan1i offline. Two major concerns:

I would prefer to calculate the kernel block size for each attention backend in gpu_model_runner.
It would be great if BlockTable.block_table and BlockTable.physical_block_table could be merged into one tensor.

zhiyuan1i · 2025-09-09T09:31:44Z

@heheda12345 Thanks for the prompt feedback! I’ve addressed the second suggestion and merged BlockTable.block_table and BlockTable.physical_block_table into a single tensor as recommended. :)

vllm/platforms/cuda.py

vllm/v1/worker/gpu_model_runner.py

vllm/v1/worker/block_table.py

tjtanaa · 2025-09-11T03:52:24Z

CC @gshtras @hongxiayang as this also affects ROCm

vllm/v1/worker/block_table.py

Signed-off-by: lizhiyuan <uniartisan2017@gmail.com>

mergify · 2025-10-09T03:43:20Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @zhiyuan1i.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify · 2025-10-09T03:46:28Z

Documentation preview: https://vllm--24486.org.readthedocs.build/en/24486/

Signed-off-by: Zhiyuan Li <uniartisan2017@gmail.com>

heheda12345

LGTM! Thanks for this enhancement. Follow-ups:

More clean-ups (@heheda12345).
Verify the get_supported_kernel_block_size of each attention backend.

vllm/model_executor/models/config.py

heheda12345 · 2025-10-09T06:16:52Z

vllm/v1/worker/gpu_model_runner.py

                else:
                    self.reorder_batch_threshold = reorder_batch_threshold_i

+    def _find_compatible_block_sizes(


(not a blocker) this function may be simplified.

heheda12345 · 2025-10-09T06:21:50Z

vllm/v1/worker/gpu_model_runner.py

                num_blocks = raw_tensor.numel() // kv_cache_spec.page_size_bytes
                if isinstance(kv_cache_spec, AttentionSpec):
                    has_attn = True
+                    kv_manager_block_size = kv_cache_spec.block_size


(not a blocker) should we use the common block size of all attention groups in the same kv cache group here?

LucasWilkinson · 2025-10-11T14:48:22Z

vllm/v1/attention/backends/flash_attn.py


+    @staticmethod
+    def get_supported_kernel_block_size() -> list[Union[int, MultipleOf]]:
+        return [MultipleOf(16)]


Technically FA3 would support MultipleOf(1) while FA2 would support MultipleOf(16); I don't think it's worth handling this, though.
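To make the role of get_supported_kernel_block_size concrete, here is a minimal sketch of how a kernel block size might be chosen for a given page size. It assumes a backend advertises either fixed sizes or a "multiple of N" constraint. The MultipleOf class below is a hypothetical stand-in, and pick_kernel_block_size is an illustrative name. This is not the PR's _find_compatible_block_sizes.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class MultipleOf:
    """Hypothetical stand-in: any block size divisible by `base` is supported."""

    base: int


def pick_kernel_block_size(
    page_size: int, supported: list[Union[int, MultipleOf]]
) -> int:
    """Return the largest supported kernel block size that divides page_size."""
    # If the backend can consume a whole page as one block, no split is needed.
    for entry in supported:
        if isinstance(entry, MultipleOf):
            if page_size % entry.base == 0:
                return page_size
        elif entry == page_size:
            return page_size
    # Otherwise fall back to the largest fixed size that evenly divides the page.
    fixed = [e for e in supported if isinstance(e, int) and page_size % e == 0]
    if not fixed:
        raise ValueError(f"no supported kernel block size divides {page_size}")
    return max(fixed)


print(pick_kernel_block_size(528, [MultipleOf(16)]))  # 528
print(pick_kernel_block_size(544, [16, 32, 64]))      # 32
```

Under these assumptions a MultipleOf(16) backend accepts any page size divisible by 16 unchanged, while a backend with only fixed sizes forces the page to be split into smaller kernel blocks.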

zhiyuan1i requested review from ProExpertProg, WoosukKwon, alexm-redhat, comaniac, hmellor, houseroad, mgoin, njhill, robertgshaw2-redhat, simon-mo, tlrmchlsmth, yewentao256, youkaichao and ywang96 as code owners September 9, 2025 06:52

mergify bot added the rocm (Related to AMD ROCm) and v1 labels Sep 9, 2025

gemini-code-assist bot reviewed Sep 9, 2025

tests/v1/worker/test_gpu_model_runner.py (outdated, resolved)

tests/v1/worker/test_gpu_model_runner.py (outdated, resolved)

vllm/v1/worker/gpu_model_runner.py (outdated, resolved)

zhiyuan1i force-pushed the hybrid-cache-groups branch from 4e3eeca to 8f2ee3d Compare September 9, 2025 06:58

heheda12345 reviewed Sep 9, 2025

zhiyuan1i force-pushed the hybrid-cache-groups branch from 954ade4 to 0e0823a Compare September 9, 2025 09:28

zhiyuan1i force-pushed the hybrid-cache-groups branch 2 times, most recently from 6d1735e to 0b544bf Compare September 9, 2025 14:43

heheda12345 reviewed Sep 10, 2025

vllm/platforms/cuda.py (outdated, resolved)

vllm/v1/worker/gpu_model_runner.py (outdated, resolved)

vllm/v1/worker/block_table.py (resolved)

vllm/v1/worker/block_table.py (outdated, resolved)

zhiyuan1i force-pushed the hybrid-cache-groups branch from 62e5072 to 55e2235 Compare September 10, 2025 07:36

heheda12345 reviewed Sep 11, 2025

zhiyuan1i force-pushed the hybrid-cache-groups branch from 55e2235 to b0e1d3b Compare September 11, 2025 06:25

zhiyuan1i requested a review from zhuohan123 as a code owner September 11, 2025 06:25

zhiyuan1i added 3 commits October 8, 2025 04:20

fix

ec1ca20

Signed-off-by: lizhiyuan <uniartisan2017@gmail.com>

Merge commit '17edd8a' into hybrid-cache-groups

66e7685

ruff

3e70aa4

Signed-off-by: lizhiyuan <uniartisan2017@gmail.com>

zhiyuan1i force-pushed the hybrid-cache-groups branch from 5e0a1a0 to 3e70aa4 Compare October 8, 2025 08:43

zhiyuan1i added 3 commits October 8, 2025 08:53

Merge commit 'd6953be' into HEAD

8db8f3f

Merge branch 'main' into hybrid-cache-groups

5c9f1ef

Merge branch 'main' into hybrid-cache-groups

c40ebc6

mergify bot removed the needs-rebase label Oct 8, 2025

zhiyuan1i added 4 commits October 8, 2025 16:05

fix tests and clean imps

ff3a7db

Signed-off-by: lizhiyuan <uniartisan2017@gmail.com>

Merge branch 'main' into hybrid-cache-groups

d419f0f

Merge branch 'main' into hybrid-cache-groups

86e414c

revert some changes

10fabbb

Signed-off-by: lizhiyuan <uniartisan2017@gmail.com>

Merge branch 'main' into hybrid-cache-groups

fba9bea

Signed-off-by: Zhiyuan Li <uniartisan2017@gmail.com>

heheda12345 approved these changes Oct 9, 2025

heheda12345 mentioned this pull request Oct 9, 2025

[Hybrid] A simpler algorithm to find kernel_block_size #26476

Merged

5 tasks

tomeras91 mentioned this pull request Oct 9, 2025

[Feature]: decouple attention backend block size from KVCacheManager block size #24280

Closed

1 task

benchislett mentioned this pull request Oct 9, 2025

[Bug]: prepare_kernel_block_sizes doesn't parse UniformTypeKVCacheSpecs #26524

Closed

1 task

heheda12345 mentioned this pull request Oct 10, 2025

[deepseek] kernel block size for UniformTypeKVCacheSpecs #26559

Merged

5 tasks

vadiklyutiy mentioned this pull request Oct 10, 2025

[BUG] Qwen3-next MTP. Fix attn metadata build bug #26564

Merged

LucasWilkinson reviewed Oct 11, 2025

vadiklyutiy mentioned this pull request Oct 27, 2025

[BUG] Fix hybrid kvcache kernel page size issue #27547

Closed

zq1997 mentioned this pull request Nov 7, 2025

[AMD] Use Decoupled Kernel Block Size to Support AITER MLA block_size=1 #27715

Merged

NickLucche mentioned this pull request Nov 13, 2025

[Bugfix][Nixl] Fix kernel physical<>logical block_size issue #28677

Merged

rebel-jaehwang mentioned this pull request Nov 24, 2025

feat: sliding window attention RBLN-SW/vllm-rbln#167

Merged

12 tasks

heheda12345 mentioned this pull request Jan 7, 2026

fix(rocm): Add get_supported_kernel_block_sizes() to ROCM_ATTN #31712

Merged

jennyyyyzhen mentioned this pull request Jan 16, 2026

[ROCM] Enable aiter attn backend for qwen3-next model #32492

Merged

5 tasks


11 participants
