[V1][Hybrid] Mamba Prefix Caching with align mode#30877
Merged
heheda12345 merged 136 commits intoJan 23, 2026
Conversation
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
tdoublep
approved these changes
Jan 23, 2026
tdoublep
left a comment
Member
There was a problem hiding this comment.
Thanks for the great work! This feature enables prefix caching for a broader set of models.
Let's fix the issues that remain for MTP as a follow-up.
Comment on lines
+283
to
+285
| # We allocate 1 block for each request now, so max_memory_usage_bytes is | ||
| # the same as page_size_bytes. | ||
| # Need to update this when supporting prefix caching. |
Member
There was a problem hiding this comment.
This comment is redundant now I think
Contributor
Author
There was a problem hiding this comment.
Thanks for pointing that out. We can remove it later.
Comment on lines
+295
to
+296
| max_model_len = vllm_config.model_config.max_model_len | ||
| return cdiv(max_model_len, self.block_size) * self.page_size_bytes |
Member
There was a problem hiding this comment.
I think that this code for "all" is wrong actually, but it is not an issue introduced by this PR. Will fix it as a follow-up.
Contributor
Author
There was a problem hiding this comment.
Agreed. It seems "all" mode performs allocation at the granularity of mamba_block_size, so we need to fix this later.
5 tasks
Member
|
This PR appears to fail pre-commit, I have a fix: #32956 |
This was referenced Jan 25, 2026
cwazai
pushed a commit
to cwazai/vllm
that referenced
this pull request
Jan 25, 2026
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com> Signed-off-by: Chen Zhang <zhangch99@outlook.com> Co-authored-by: Chen Zhang <zhangch99@outlook.com> Signed-off-by: 陈建华 <1647430658@qq.com>
lapy
pushed a commit
to lapy/vllm
that referenced
this pull request
Jan 27, 2026
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com> Signed-off-by: Chen Zhang <zhangch99@outlook.com> Co-authored-by: Chen Zhang <zhangch99@outlook.com>
5 tasks
6 tasks
MengqingCao
pushed a commit
to vllm-project/vllm-ascend
that referenced
this pull request
Mar 15, 2026
…de align` (#7103) ### What this PR does / why we need it? To support prefix cache for Qwen3.5/Next in vLLM-Ascend, this PR mainly follows the design in [#30877](vllm-project/vllm#30877) and inherits changes to functions which are overridden in vLLM-Ascend. Note: 1. `--mamba-cache-mode align` && PD disaggregation is still not supported yet in vLLM v0.17.0(see https://github.com/vllm-project/vllm/blob/main/vllm/v1/core/sched/scheduler.py#L295). 2. The current implementation of hybrid kv cache might result in a very large block_size when scheduling. For example, if we run Qwen3.5-35B-A3B with `-tp 2`, the block_size is adjusted to 2048, which means that any prefix shorter than 2048 will never be cached. Although this behavior is consistent with vLLM, it still needs improvements in the future. 3. `--mamba-cache-mode align` requires to copy mamba states during forward steps. vLLM uses a triton kernel to implement it. However, the original version run into some bugs on Ascend hardwares. Thus we patch a new triton kernel to avoid this bug. ### Does this PR introduce _any_ user-facing change? To use mamba prefix cache, set `--enable-prefix-caching` and `--mamba-cache-mode align`. Note that the mamba state copy function(see [do_mamba_copy_block](https://github.com/vllm-project/vllm/blob/main/vllm/v1/worker/mamba_utils.py#L132)) does not provide a torch native version, thus it might have trouble if users can't use triton. - vLLM version: v0.16.0 - vLLM main: vllm-project/vllm@4034c3d --------- Signed-off-by: Angazenn <supperccell@163.com>
Nagisa125
pushed a commit
to starmountain1997/vllm-ascend
that referenced
this pull request
Mar 17, 2026
…de align` (vllm-project#7103) ### What this PR does / why we need it? To support prefix cache for Qwen3.5/Next in vLLM-Ascend, this PR mainly follows the design in [#30877](vllm-project/vllm#30877) and inherits changes to functions which are overridden in vLLM-Ascend. Note: 1. `--mamba-cache-mode align` && PD disaggregation is still not supported yet in vLLM v0.17.0(see https://github.com/vllm-project/vllm/blob/main/vllm/v1/core/sched/scheduler.py#L295). 2. The current implementation of hybrid kv cache might result in a very large block_size when scheduling. For example, if we run Qwen3.5-35B-A3B with `-tp 2`, the block_size is adjusted to 2048, which means that any prefix shorter than 2048 will never be cached. Although this behavior is consistent with vLLM, it still needs improvements in the future. 3. `--mamba-cache-mode align` requires to copy mamba states during forward steps. vLLM uses a triton kernel to implement it. However, the original version run into some bugs on Ascend hardwares. Thus we patch a new triton kernel to avoid this bug. ### Does this PR introduce _any_ user-facing change? To use mamba prefix cache, set `--enable-prefix-caching` and `--mamba-cache-mode align`. Note that the mamba state copy function(see [do_mamba_copy_block](https://github.com/vllm-project/vllm/blob/main/vllm/v1/worker/mamba_utils.py#L132)) does not provide a torch native version, thus it might have trouble if users can't use triton. - vLLM version: v0.16.0 - vLLM main: vllm-project/vllm@4034c3d --------- Signed-off-by: Angazenn <supperccell@163.com>
1 task
yangzhe-2026
pushed a commit
to yangzhe-2026/vllm-ascend
that referenced
this pull request
May 6, 2026
…de align` (vllm-project#7103) ### What this PR does / why we need it? To support prefix cache for Qwen3.5/Next in vLLM-Ascend, this PR mainly follows the design in [#30877](vllm-project/vllm#30877) and inherits changes to functions which are overridden in vLLM-Ascend. Note: 1. `--mamba-cache-mode align` && PD disaggregation is still not supported yet in vLLM v0.17.0(see https://github.com/vllm-project/vllm/blob/main/vllm/v1/core/sched/scheduler.py#L295). 2. The current implementation of hybrid kv cache might result in a very large block_size when scheduling. For example, if we run Qwen3.5-35B-A3B with `-tp 2`, the block_size is adjusted to 2048, which means that any prefix shorter than 2048 will never be cached. Although this behavior is consistent with vLLM, it still needs improvements in the future. 3. `--mamba-cache-mode align` requires to copy mamba states during forward steps. vLLM uses a triton kernel to implement it. However, the original version run into some bugs on Ascend hardwares. Thus we patch a new triton kernel to avoid this bug. ### Does this PR introduce _any_ user-facing change? To use mamba prefix cache, set `--enable-prefix-caching` and `--mamba-cache-mode align`. Note that the mamba state copy function(see [do_mamba_copy_block](https://github.com/vllm-project/vllm/blob/main/vllm/v1/worker/mamba_utils.py#L132)) does not provide a torch native version, thus it might have trouble if users can't use triton. - vLLM version: v0.16.0 - vLLM main: vllm-project/vllm@4034c3d --------- Signed-off-by: Angazenn <supperccell@163.com>
mystous
pushed a commit
to mystous/vllm_hybrid
that referenced
this pull request
May 10, 2026
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com> Signed-off-by: Chen Zhang <zhangch99@outlook.com> Co-authored-by: Chen Zhang <zhangch99@outlook.com>
nanxingMy
pushed a commit
to nanxingMy/vllm-ascend
that referenced
this pull request
May 15, 2026
…de align` (vllm-project#7103) ### What this PR does / why we need it? To support prefix cache for Qwen3.5/Next in vLLM-Ascend, this PR mainly follows the design in [#30877](vllm-project/vllm#30877) and inherits changes to functions which are overridden in vLLM-Ascend. Note: 1. `--mamba-cache-mode align` && PD disaggregation is still not supported yet in vLLM v0.17.0(see https://github.com/vllm-project/vllm/blob/main/vllm/v1/core/sched/scheduler.py#L295). 2. The current implementation of hybrid kv cache might result in a very large block_size when scheduling. For example, if we run Qwen3.5-35B-A3B with `-tp 2`, the block_size is adjusted to 2048, which means that any prefix shorter than 2048 will never be cached. Although this behavior is consistent with vLLM, it still needs improvements in the future. 3. `--mamba-cache-mode align` requires to copy mamba states during forward steps. vLLM uses a triton kernel to implement it. However, the original version run into some bugs on Ascend hardwares. Thus we patch a new triton kernel to avoid this bug. ### Does this PR introduce _any_ user-facing change? To use mamba prefix cache, set `--enable-prefix-caching` and `--mamba-cache-mode align`. Note that the mamba state copy function(see [do_mamba_copy_block](https://github.com/vllm-project/vllm/blob/main/vllm/v1/worker/mamba_utils.py#L132)) does not provide a torch native version, thus it might have trouble if users can't use triton. - vLLM version: v0.16.0 - vLLM main: vllm-project/vllm@4034c3d --------- Signed-off-by: Angazenn <supperccell@163.com> Signed-off-by: nanxing <1014662416@qq.com>
my-other-github-account
pushed a commit
to my-other-github-account/vllm
that referenced
this pull request
May 15, 2026
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com> Signed-off-by: Chen Zhang <zhangch99@outlook.com> Co-authored-by: Chen Zhang <zhangch99@outlook.com>
my-other-github-account
pushed a commit
to my-other-github-account/vllm
that referenced
this pull request
May 15, 2026
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com> Signed-off-by: Chen Zhang <zhangch99@outlook.com> Co-authored-by: Chen Zhang <zhangch99@outlook.com>
0826joyce
pushed a commit
to 0826joyce/vllm-serving-optimization
that referenced
this pull request
May 19, 2026
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com> Signed-off-by: Chen Zhang <zhangch99@outlook.com> Co-authored-by: Chen Zhang <zhangch99@outlook.com>
18 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The cleaned-up version of #29272
Purpose
This PR enhances the design of #28176 , adopting the same memory layout as FullAttention while adding support for decode caching and speculative decoding.
The core idea of this Mamba Prefix-Caching implementation (referred to as LPC) is to directly cache Mamba states through block-aligned scheduling. This approach enables rapid support for Prefix-caching in Mamba models without modifications to the underlying kernel code. Furthermore, it maintains full compatibility with Speculative-Decoding/MTP/EAGLE.
Currently, this solution supports all Mamba model architectures including GDN, Mamba1, Mamba2, and Short Conv Attention, and has been adapted for relevant Mamba models such as Qwen3-Next-80B-A3B-Instruct and LFM2-700M.
Usage
To enable this feature, start the engine with the
--enable-prefix-cachingand--mamba-cache-mode alignflags.Design Details
Block-Aligned Scheduling
Following the design in #28176 , requests in the prefill phase are scheduled in multiples of
block_size. This ensures that the Mamba states can be mapped to a specific block's hash value.The prefix-caching stores variable-length chunk states—i.e., the number of tokens (or the incremental length) associated with each cached Mamba state may vary, but it is always a multiple of
block_size.Scheduler Logic with Mamba Prefix-Caching Enabled:
block_size, except for the final chunk of the request.block_size, ensuring its size is≤ block_size. This maximizes the length of the prompt that can be cached during the prefill phase.Block Allocation Design
Prefill Stage
During the prefill stage, requests are scheduled at a block-aligned chunk granularity. For a single scheduling step consisting of chunk_len tokens, the system allocates
chunk_len // block_sizeblocks:(chunk_len // block_size) - 1blocks are populated with null-blocks (placeholders).Note on Speculative Decoding (SPS): In the prefill stage with SPS enabled, the initial execution requires the allocation of

gammaadditional speculative blocks, which are subsequently reused in following steps.Decode Stage
Since only a small number of tokens are scheduled per step during decoding, the allocation logic is consistent with FullAttention, where blocks are incrementally allocated one by one.

Prefix Caching Logic
Scheduler-side Logic
Similar to the FullAttention prefix-caching logic. Only immutable blocks that store Mamba states are cached (excluding the null-blocks). And prefix matching is performed via a reverse hash lookup that requires only a single block to be matched.

Worker-side Logic
Prefill Phase:
The Preprocess stage is responsible for copying Mamba states before the model forward:
Condition 1: Copy the Mamba state from the previous step to the current step.

Condition 2: Copy the Mamba state from the prefix-cache hit block to the current step.

Decode Phase:
Without Speculative Decoding: The logic remains consistent with the standard Prefill Phase.
With Speculative Decoding:
The Preprocess stage copies Mamba states when a new block is allocated:
num_accepted_tokens.After receiving the full number of tokens corresponding to the previous block, the Post-process stage copies the Mamba state back to the previous block.

Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.