[Performance][MLA][ROCm] AITER fused QK-RoPE + KV cache + q-absorb + q-cat + q-quant for decode by xaguilar-amd · Pull Request #41839 · vllm-project/vllm

xaguilar-amd · 2026-05-06T16:11:59Z

TL;DR

Builds on top of #40392 to additionally fuse the decode-side q path
(q-absorb BMM, q-concat, FP8 q-quant) into AITER's
fused_qk_rope_concat_and_cache_mla kernel — collapsing 4 ops into 1
on every decode step on AMD MI300X / MI355X. Decode-bucket only by
construction; prefill graphs are byte-for-byte identical to #40392.
Disabled by default; opt-in via pass_config.fuse_aiter_qk_rope_kvcache_mla=True.

Purpose

After #40392, the decode hot path on ROCm + AITER looks like:

fused_rope_unified_mla_kv_cache_update(...)   # RoPE + KV cache write   (already fused by #40392)
   ↓
do_decode_q_prep(q)                            # q-absorb BMM + cat + FP8 quant   (NOT fused)
   ↓
unified_mla_attention_with_output(q_prepped, ...)

The q-prep stage is composed of four small ops launched per layer per
decode step (BMM, split, concat, FP8 quant). AITER ships a single
kernel — fused_qk_rope_concat_and_cache_mla — that does all of it,
including the RoPE+KV-cache half that #40392 already fuses. Hooking it up
needs the q-prep ops to be visible to the FX graph (currently they live
inside forward_impl).

This PR:

- Lifts q-prep into a custom op (mla_decode_q_prep) above
  unified_mla_attention_with_output, in a new compilation pass
  (MLADecodeQPrepLiftPass).
- Folds the pair (fused_rope_unified_mla_kv_cache_update, mla_decode_q_prep) into one AITER call, in a second new pass
  (MLAAiterQkRopeKVCacheFusionPass).
- Bounds memory use and preserves CUDA-graph safety by gating both passes on
  compile_range.end <= max_num_seqs × (1 + num_speculative_tokens),
  the same formula CudaGraphManager uses to classify decode-mode
  captures.

Net result: one fused decode kernel for RoPE + KV-cache + q-absorb +
q-cat + q-quant, with zero overhead on the prefill / mixed graphs.

Design choices (the parts reviewers will ask about)

1. INVARIANT 1: `mla_decode_q_prep_impl` does not lie about its shape

A previous attempt at this fusion (closed by request; it was the predecessor of #41568)
declared an mla_decode_q_prep whose fake_impl
shape was q.shape but whose real impl returned q[:num_decode].
Inductor sized downstream ops to the full T; at runtime the op returned 0 rows
during high-range CUDA-graph warmup; static_per_tensor_quant then launched
with grid_dim = T against an empty buffer, causing a null-pointer GPU fault on
the (4682, 16384) compile range. That attempt also had other design flaws.

The fix: the impl processes every row of
q, never slices on attention metadata. The fake_impl declares
[q.size(0), num_heads, kv_lora + qk_rope] and the real impl honors it.
The decode-bucket gate (next section) is what makes this allocation
free.

There's an explicit unit test —
test_mla_decode_q_prep_invariant_1 — that asserts
output.size(0) == q.size(0) for T ∈ {1, 16, 64, 256}. There's also a
CUDA-graph capture/replay regression test
(test_mla_aiter_fusion_cuda_graph_capture) that exercises both ends of
the decode bucket end-to-end.
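
The contract can be illustrated with a toy model in plain Python. This is a minimal sketch, not the PR's code: `fake_shape`, `q_prep_all_rows`, `q_prep_sliced`, and `shape_of` are hypothetical stand-ins, and nested lists stand in for tensors. It shows why an impl that slices on a runtime decode count can disagree with the declared shape, while one that processes every row cannot.

```python
# Toy model of INVARIANT 1. All names are hypothetical stand-ins; the real
# op works on torch tensors and is registered together with a fake_impl.

def fake_shape(q_shape, num_heads, kv_lora, qk_rope):
    """Shape the compiler is told to expect: one output row per input row."""
    return (q_shape[0], num_heads, kv_lora + qk_rope)

def q_prep_all_rows(q_rows, num_heads, kv_lora, qk_rope):
    """Honors the declared shape: every row of q yields one output row."""
    width = kv_lora + qk_rope
    return [[[0.0] * width for _ in range(num_heads)] for _ in q_rows]

def q_prep_sliced(q_rows, num_decode, num_heads, kv_lora, qk_rope):
    """The failure mode described above: output size depends on metadata."""
    return q_prep_all_rows(q_rows[:num_decode], num_heads, kv_lora, qk_rope)

def shape_of(out):
    """Shape of the nested-list stand-in; an empty result has shape (0,)."""
    if not out:
        return (0,)
    return (len(out), len(out[0]), len(out[0][0]))

NUM_HEADS, KV_LORA, QK_ROPE = 2, 8, 4  # small illustrative sizes

for T in (1, 16, 64, 256):
    q = [None] * T
    declared = fake_shape((T,), NUM_HEADS, KV_LORA, QK_ROPE)
    assert shape_of(q_prep_all_rows(q, NUM_HEADS, KV_LORA, QK_ROPE)) == declared

# With zero decode rows (as during warmup), the sliced variant returns an
# empty result even though downstream ops were sized for 16 rows.
mismatch = shape_of(q_prep_sliced([None] * 16, 0, NUM_HEADS, KV_LORA, QK_ROPE))
```

In the toy model the sliced variant yields an empty result for a 16-row input, which is the same disagreement between declared and actual shape that the paragraph above describes.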

2. Auto-derived decode-bucket threshold

MLADecodeQPrepLiftPass and MLAAiterQkRopeKVCacheFusionPass only fire
for compile ranges with
end <= aiter_qk_rope_kvcache_fusion_max_token_num. The default value
is auto-derived in VllmConfig._set_compile_ranges:

decode_query_len = 1 + num_speculative_tokens
max_token_num = scheduler_config.max_num_seqs * decode_query_len

This is the same formula that CudaGraphManager._init_candidates already
uses to classify decode-mode CUDA-graph captures. Keeping the pass gate
aligned with that classification avoids a footgun (the two thresholds
cannot drift apart when one of them is tuned) and removes the need
for a manual knob in most deployments. An explicit value is still
honored as an override.
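
The derivation and the gate can be written out as a small standalone sketch. The function names `derive_max_token_num` and `pass_applies` are hypothetical; per the description, the real logic lives in VllmConfig._set_compile_ranges and in each pass's is_applicable_for_range.

```python
from typing import Optional

def derive_max_token_num(max_num_seqs: int,
                         num_speculative_tokens: int = 0,
                         override: Optional[int] = None) -> int:
    """Decode-bucket threshold: an explicit override wins, else auto-derive."""
    if override is not None:
        return override
    decode_query_len = 1 + num_speculative_tokens
    return max_num_seqs * decode_query_len

def pass_applies(compile_range_end: int, max_token_num: int) -> bool:
    """The passes fire only for compile ranges inside the decode bucket."""
    return compile_range_end <= max_token_num

# Illustrative values, not taken from the PR.
threshold = derive_max_token_num(max_num_seqs=256, num_speculative_tokens=1)
decode_ok = pass_applies(512, threshold)       # range ends inside the bucket
prefill_ok = pass_applies(16384, threshold)    # large range is skipped
```

With these illustrative values the threshold is 256 × 2 = 512, so a compile range ending at 512 is fused and one ending at 16384 is left untouched.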

3. Building on top of PR #40392's fused-RoPE+KVCache

- Sequencing: MLARoPEKVCacheCatFusionPass (#40392) → MLADecodeQPrepLiftPass → MLAAiterQkRopeKVCacheFusionPass. The AITER pass matches the pair (auto_functionalized(fused_rope_unified_mla_kv_cache_update, ...), mla_decode_q_prep) keyed by layer_name and folds them into one call.
- Auto-enabling #40392: turning on fuse_aiter_qk_rope_kvcache_mla auto-enables fuse_rope_kvcache_cat_mla (it's a strict prerequisite). A clear log line is emitted.
- Cycle-breaking via _unwrap_q_orig: #40392 leaves the model's q[..., qk_nope:] = q_pe_rotated write functionalized as slice_scatter(q_orig, copy(slice_dst, getitem(frmkv, 1))). Naively reusing that as q for the new fused node closes a cycle (new_node → slice_scatter → new_q_pe = new_node[1]). We walk back to q_orig (which AITER doesn't need rotated since the kernel does RoPE itself and only consumes q_nope), breaking the cycle. There's a corresponding tweak in FixFunctionalizationPass so view_temp is not erased; it's now a live input to the new fused op.
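
The cycle and its fix can be reproduced on a toy dependency graph. This is only a sketch under stated assumptions: `Node`, `unwrap_q_orig`, and `reaches` are hypothetical helpers written for illustration, whereas the real pass operates on torch.fx graph nodes and its _unwrap_q_orig may handle more cases than a bare slice_scatter chain.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)  # identity-based comparison, like graph nodes
class Node:
    op: str
    args: list = field(default_factory=list)

def unwrap_q_orig(q: Node) -> Node:
    """Walk back through slice_scatter wrappers to the un-rotated q."""
    while q.op == "slice_scatter":
        q = q.args[0]
    return q

def reaches(start: Node, target: Node) -> bool:
    """True if `target` is reachable from `start` by following args."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node is target:
            return True
        if id(node) in seen:
            continue
        seen.add(id(node))
        stack.extend(a for a in node.args if isinstance(a, Node))
    return False

# Graph after the fused node replaces the earlier RoPE + KV-cache op:
# slice_scatter(q_orig, copy(slice_dst, getitem(new_node, 1)))
q_orig = Node("placeholder")
new_node = Node("aiter_fused")            # inputs not wired yet
new_q_pe = Node("getitem", [new_node, 1])
slice_dst = Node("slice", [q_orig])
copied = Node("copy", [slice_dst, new_q_pe])
q_scattered = Node("slice_scatter", [q_orig, copied])

# Feeding q_scattered into new_node would make new_node depend on itself.
naive_would_cycle = reaches(q_scattered, new_node)
q_in = unwrap_q_orig(q_scattered)
safe_would_cycle = reaches(q_in, new_node)
new_node.args.append(q_in)                # safe wiring
```

Here `naive_would_cycle` is True because the scattered q already depends on the fused node's output, while the unwrapped input does not.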

4. vLLM stores FP8 KV cache as `torch.uint8`

STR_DTYPE_TO_TORCH_DTYPE["fp8"] -> torch.uint8. AITER's torch→AITER
dtype mapping rejects uint8 for kv_cache (AITER_DTYPE_u8 isn't in the
kernel's whitelist) and crashes at warmup with
[AITER] kv cache data type is not supported. PR #40392's path goes
through vLLM's own _C_cache_ops.concat_and_cache_mla_rope_fused, which
takes an explicit kv_cache_dtype: str and accepts uint8, so the
issue doesn't surface there. We zero-copy-view the kv_cache as
current_platform.fp8_dtype() (float8_e4m3fn on gfx950,
float8_e4m3fnuz on gfx94) before dispatch when
is_quantized_kv_cache(kv_cache_dtype).
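
Since the cache is stored as raw bytes, the view changes only how each byte is interpreted, not the data itself. The plain-Python sketch below illustrates why the target dtype has to match the platform: the same byte decodes to different values under float8_e4m3fn (exponent bias 7) and float8_e4m3fnuz (exponent bias 8). The `decode_e4m3` helper is hypothetical and written for illustration only; it is not part of vLLM or AITER.

```python
import math

def decode_e4m3(byte: int, fnuz: bool) -> float:
    """Decode one FP8 E4M3 byte (1 sign, 4 exponent, 3 mantissa bits)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    mant = byte & 0x7
    if fnuz:
        # fnuz: bias 8, no negative zero; 0x80 is the only NaN encoding.
        if byte == 0x80:
            return math.nan
        bias = 8
    else:
        # fn: bias 7, no infinities; S.1111.111 encodes NaN.
        if exp == 0xF and mant == 0x7:
            return math.nan
        bias = 7
    if exp == 0:  # subnormal: no implicit leading one
        return sign * (mant / 8.0) * 2.0 ** (1 - bias)
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - bias)

raw = 0x38  # one byte as it would sit in a uint8 cache
as_fn = decode_e4m3(raw, fnuz=False)
as_fnuz = decode_e4m3(raw, fnuz=True)
```

The byte 0x38 reads as 1.0 in the fn format and 0.5 in the fnuz format, so reinterpreting the cache with the wrong FP8 dtype would silently rescale every value.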

Compatibility / no-effect cases

- Non-ROCm or non-AITER: __post_init__ disables
  fuse_aiter_qk_rope_kvcache_mla with a warning. Pass is never built.
- Prefill compile ranges: is_applicable_for_range returns False,
  pass is skipped, prefill graphs are unchanged from #40392.
- Default settings: the flag is opt-in. With it off, this PR is a
  pure no-op (modulo refactored q-prep helpers in mla_attention.py,
  which preserve identical behavior).
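
The flag interactions described in this PR (opt-in default, disable on unsupported platforms, auto-enable of the prerequisite) can be summarized in a short sketch. `PassFlagsSketch` and `resolve_pass_flags` are hypothetical names, and the order of the checks is an assumption; only the two flag names come from the description.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("pass_flags_sketch")

@dataclass
class PassFlagsSketch:
    # Flag names follow the PR description; both default to off here.
    fuse_rope_kvcache_cat_mla: bool = False
    fuse_aiter_qk_rope_kvcache_mla: bool = False

def resolve_pass_flags(flags: PassFlagsSketch,
                       rocm_aiter_available: bool) -> PassFlagsSketch:
    """Apply the compatibility rules; assumed ordering, not the PR's code."""
    if not flags.fuse_aiter_qk_rope_kvcache_mla:
        return flags  # opt-in flag is off: nothing changes
    if not rocm_aiter_available:
        logger.warning("AITER fusion requested without ROCm + AITER; disabling")
        flags.fuse_aiter_qk_rope_kvcache_mla = False
        return flags
    if not flags.fuse_rope_kvcache_cat_mla:
        logger.info("auto-enabling fuse_rope_kvcache_cat_mla (prerequisite)")
        flags.fuse_rope_kvcache_cat_mla = True
    return flags

default = resolve_pass_flags(PassFlagsSketch(), rocm_aiter_available=True)
enabled = resolve_pass_flags(
    PassFlagsSketch(fuse_aiter_qk_rope_kvcache_mla=True), rocm_aiter_available=True)
unsupported = resolve_pass_flags(
    PassFlagsSketch(fuse_aiter_qk_rope_kvcache_mla=True), rocm_aiter_available=False)
```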

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

Signed-off-by: Rohan Potdar <66227218+Rohan138@users.noreply.github.com>

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

…ide q path (q-absorb BMM, q-concat, FP8 q-quant) into AITER's fused_qk_rope_concat_and_cache_mla kernel Signed-off-by: Xavier Aguilar <xavier.aguilarfruto@amd.com>

gemini-code-assist

Code Review

This pull request introduces a series of compilation passes designed to optimize Multi-Head Latent Attention (MLA) on ROCm by leveraging AITER fused kernels. The changes include new passes for lifting query preparation and fusing RoPE, KV cache updates, and query concatenation into single operations. Additionally, the PR adds pattern matching support for DeepSeek-style scaling rotary embeddings and updates the MLAAttention layer to support pre-prepared query tensors. Extensive unit tests and parity checks are included to ensure correctness and CUDA graph stability. Feedback from the reviewer identifies potential safety issues in the defunctionalization logic that could lead to runtime errors and suggests ensuring tensor contiguity after transpositions in the query preparation methods to maintain compatibility with downstream kernels.

I am having trouble creating individual review comments. My feedback follows.

vllm/compilation/passes/utility/fix_functionalization.py (193-218)

This block lacks safety checks and initialization for copy_temp, slice_temp, and slice_scatter_temp. If the expected aten.copy.default or aten.slice_scatter.default nodes are not found in the graph, this will raise an UnboundLocalError or AttributeError. Additionally, it assumes that getitem_nodes contains indices 1 and 2 without checking, which could lead to a KeyError. Please follow the safer pattern used in the subsequent elif block (lines 242-269).

                getitem_nodes = self.getitem_users(node)
                if 1 in getitem_nodes:
                    q_pe_out = getitem_nodes[1]
                    copy_temp = None
                    for user in list(q_pe_out.users):
                        if is_func(user, torch.ops.aten.copy.default):
                            copy_temp = user
                            break
                    if copy_temp is not None:
                        slice_temp = copy_temp.args[0]
                        slice_scatter_temp = None
                        for user in list(copy_temp.users):
                            if is_func(user, torch.ops.aten.slice_scatter.default):
                                slice_scatter_temp = user
                                break
                        if slice_scatter_temp is not None:
                            view_temp = slice_scatter_temp.args[0]
                            view_orig = slice_temp.args[0]
                            slice_scatter_temp.replace_all_uses_with(view_orig)
                            self._remove(slice_scatter_temp)
                            self._remove(copy_temp)
                            self._remove(slice_temp)
                            self._remove(view_temp)
                    self._remove(q_pe_out)

                # defunctionalize k_pe manually; self.replace_users_with_mutated_args
                # does not support only replacing specific kwargs
                if 2 in getitem_nodes:
                    k_pe_in = node.kwargs["k_pe"]
                    k_pe_out = getitem_nodes[2]
                    k_pe_out.replace_all_uses_with(k_pe_in)
                    self._remove(k_pe_out)

vllm/model_executor/layers/attention/mla_attention.py (564)

The fallback path returns a non-contiguous tensor due to the transpose operation. Since AITER kernels and other downstream operations often expect contiguous inputs for performance and correctness, it is safer to ensure the result is contiguous.

            ql_nope = ql_nope.transpose(0, 1).contiguous()

vllm/model_executor/layers/attention/mla_attention.py (629)

The fallback path returns a non-contiguous tensor due to the transpose operation. Since AITER kernels and other downstream operations often expect contiguous inputs for performance and correctness, it is safer to ensure the result is contiguous.

            ql_nope = ql_nope.transpose(0, 1).contiguous()
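
Both contiguity suggestions rest on the fact that a transpose only swaps strides and leaves the underlying memory layout alone. The stride arithmetic can be checked without a GPU; the helpers below are hypothetical and use plain tuples in place of tensors, with illustrative sizes that are not taken from the PR.

```python
def contiguous_strides(shape):
    """Row-major strides (in elements) for a freshly allocated tensor."""
    strides, step = [], 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))

def transpose01(shape, strides):
    """Swap the first two dims: a view, so only shape and strides change."""
    new_shape = (shape[1], shape[0]) + tuple(shape[2:])
    new_strides = (strides[1], strides[0]) + tuple(strides[2:])
    return new_shape, new_strides

def is_contiguous(shape, strides):
    """Row-major contiguity check; size-1 dims are skipped."""
    expected = 1
    for dim, stride in zip(reversed(shape), reversed(strides)):
        if dim != 1 and stride != expected:
            return False
        expected *= dim
    return True

shape = (16, 4, 512)                      # (heads, tokens, latent dim)
strides = contiguous_strides(shape)       # (2048, 512, 1)
t_shape, t_strides = transpose01(shape, strides)
before = is_contiguous(t_shape, t_strides)
# A .contiguous() call copies into fresh row-major storage.
after = is_contiguous(t_shape, contiguous_strides(t_shape))
```

After the transpose the shape is (4, 16, 512) with strides (512, 2048, 1), which is not row-major; the copy restores strides of (8192, 512, 1).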

mergify · 2026-05-23T09:57:08Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @xaguilar-amd.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Rohan138 and others added 30 commits April 20, 2026 14:02

rope+kvcache+cat mla fusion squash into single commit

6791331

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

fix defaults

2f73bcb

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

fix lint and merge

42d21c0

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

fix lint and merge

e280559

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

wip

8983507

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

wip

27a66ba

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

wip

a18af5f

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

wip

57453fd

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

fix neox rope

23b0da2

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

mild name cleanup

e5677fa

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

Refactor to VllmFusionPatternMatcherPass

05ef0ff

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

rename fusion func

39677be

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

Merge branch 'main' into mla_rope_kvcache_fusion

739deec

fix defaults

a572e64

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

use get_attention_context

50cb607

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

fix csrc and kernel name

e4e638b

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

fix lint

6b918d6

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

bugfix

059bd48

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

rename to q_pe

8dd9ae5

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

Match q_pe and remove copy+scatter ops during defunctionalization

e16f8d4

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

Add fix_functionalization to the unit test

578ea91

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

Add fix_functionalization to the unit test

3f6da4c

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

Add fix_functionalization to the unit test

dff5d99

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

Add UT for input qkv tensor

d1f0219

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

Use FlexibleLayout to eliminate k_pe copy

a0d96d9

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

Merge branch 'main' into mla_rope_kvcache_fusion

1e5389c

Signed-off-by: Rohan Potdar <66227218+Rohan138@users.noreply.github.com>

fix merge and drop qkv_lora UT

b5c21b6

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

fix lint and dual_chunk_rope

de4c56d

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

Merge branch 'main' into mla_rope_kvcache_fusion

57d4fdd

add back resolve_layer_name

4b88c91

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

Rohan138 and others added 11 commits April 29, 2026 15:56

Add use_flashinfer

60d6082

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

Add query to matcher inputs

4b5cb3b

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

Change user to copy_temp

25072bb

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

gate UT behind is_cuda_alike

067e3ab

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

Merge branch 'main' into mla_rope_kvcache_fusion

7eb7b98

Merge branch 'main' into mla_rope_kvcache_fusion

dc61b57

fix lint and UT failures

039cda2

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

merge main

b0b2eb2

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

Add Eliza as coauthor Co-authored-by: ElizaWszola ewszola@redhat.com

4d8d6ef

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

Merge branch 'main' into mla_rope_kvcache_fusion

de9a2b9

Builds on top of vllm-project#40392 to additionally fuse the decode-s…

ee325e6

…ide q path (q-absorb BMM, q-concat, FP8 q-quant) into AITER's fused_qk_rope_concat_and_cache_mla kernel Signed-off-by: Xavier Aguilar <xavier.aguilarfruto@amd.com>

mergify Bot added the rocm Related to AMD ROCm label May 6, 2026

github-project-automation Bot added this to AMD May 6, 2026

github-project-automation Bot moved this to Todo in AMD May 6, 2026

gemini-code-assist Bot reviewed May 6, 2026


shanyulu mentioned this pull request May 8, 2026

[Attention][MLA] Add Triton-fused TurboQuant decode backend #41803

Open

4 tasks

xaguilar-amd mentioned this pull request May 10, 2026

[Performance][MLA] Lift decode Q-prep (q-absorb + cat + FP8 quant) out of forward_impl #41568

Open

mergify Bot added the needs-rebase label May 23, 2026

xaguilar-amd wants to merge 41 commits into vllm-project:main from xaguilar-amd:mla_qk_rope_cache_fusion
