[Performance][DSR1]: Fused RoPE+KVCache+q_concat for MLA by Rohan138 · Pull Request #40392 · vllm-project/vllm

Rohan138 · 2026-04-20T19:04:00Z

Purpose

Reland an updated version of #35245, #35879, and #38646 to fuse the MLA RoPE and KV cache ops. This adds some pattern-matching fixes and minimization on top of #35879.

Test Plan

# server startup
vllm serve amd/DeepSeek-R1-0528-MXFP4 --tensor-parallel-size=8 --gpu-memory-utilization=0.94 --dtype=auto --kv-cache-dtype=fp8 --max-num-batched-tokens=8192 --attention-backend ROCM_AITER_MLA -cc.pass_config.fuse_rope_kvcache_cat_mla=True -cc.use_inductor_graph_partition=True --no-enable-prefix-caching --disable-uvicorn-access-log

# benchmark serving
vllm bench serve --model amd/DeepSeek-R1-0528-MXFP4 --percentile-metrics tpot,itl,e2el --ignore-eos --dataset-name random --random-input-len 1024 --random-output-len 1024 --max-concurrency 32 --num-prompts 320

# benchmark accuracy
lm_eval --model local-completions --model_args model=amd/DeepSeek-R1-0528-MXFP4,base_url=http://0.0.0.0:8000/v1/completions,num_concurrent=256,max_retries=10,max_gen_toks=2048 --batch_size auto --tasks gsm8k --num_fewshot 5 --limit 200 --output_path .
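
The server command enables the fusion through dotted `-cc.` overrides. As a rough illustration of how such dotted keys nest, here is a minimal sketch, assuming `-cc.` is a prefix for compilation-config overrides. `parse_dotted_overrides` is a hypothetical helper written for this example and is not vLLM's actual argument parser.

```python
import ast


def parse_dotted_overrides(args, prefix="-cc."):
    """Collect `-cc.a.b=value` style arguments into a nested dict (hypothetical helper)."""
    config = {}
    for arg in args:
        if not arg.startswith(prefix) or "=" not in arg:
            continue  # not a dotted override; ignored in this sketch
        dotted_key, raw_value = arg[len(prefix):].split("=", 1)
        try:
            value = ast.literal_eval(raw_value)  # handles True, 8192, 0.94, ...
        except (ValueError, SyntaxError):
            value = raw_value  # anything else stays a plain string
        node = config
        *parents, leaf = dotted_key.split(".")
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = value
    return config


# The two overrides used in the server startup command above.
overrides = parse_dotted_overrides([
    "-cc.pass_config.fuse_rope_kvcache_cat_mla=True",
    "-cc.use_inductor_graph_partition=True",
])
print(overrides)
```

Under this reading, `fuse_rope_kvcache_cat_mla` sits under `pass_config`, while `use_inductor_graph_partition` is a top-level key.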

Test Result

Main:

============ Serving Benchmark Result ============
Successful requests:                     320       
Failed requests:                         0         
Maximum request concurrency:             32        
Benchmark duration (s):                  188.52    
Total input tokens:                      327680    
Total generated tokens:                  327680    
Request throughput (req/s):              1.70      
Output token throughput (tok/s):         1738.18   
Peak output token throughput (tok/s):    1952.00   
Peak concurrent requests:                64.00     
Total token throughput (tok/s):          3476.36   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          17.77     
Median TPOT (ms):                        17.64     
P99 TPOT (ms):                           19.82     
---------------Inter-token Latency----------------
Mean ITL (ms):                           17.77     
Median ITL (ms):                         16.87     
P99 ITL (ms):                            17.86     
----------------End-to-end Latency----------------
Mean E2EL (ms):                          18844.60  
Median E2EL (ms):                        18443.61  
P99 E2EL (ms):                           22806.19  
==================================================

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.95|±  |0.0154|
|     |       |strict-match    |     5|exact_match|↑  | 0.95|±  |0.0154|

Fused:

============ Serving Benchmark Result ============
Successful requests:                     320       
Failed requests:                         0         
Maximum request concurrency:             32        
Benchmark duration (s):                  185.98    
Total input tokens:                      327680    
Total generated tokens:                  327680    
Request throughput (req/s):              1.72      
Output token throughput (tok/s):         1761.94   
Peak output token throughput (tok/s):    1952.00   
Peak concurrent requests:                64.00     
Total token throughput (tok/s):          3523.89   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          17.45     
Median TPOT (ms):                        17.23     
P99 TPOT (ms):                           21.09     
---------------Inter-token Latency----------------
Mean ITL (ms):                           17.45     
Median ITL (ms):                         16.62     
P99 ITL (ms):                            17.58     
----------------End-to-end Latency----------------
Mean E2EL (ms):                          18594.01  
Median E2EL (ms):                        18189.58  
P99 E2EL (ms):                           22745.50  
==================================================
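
To make the Main and Fused serving results easier to compare, the relative changes can be computed directly from the numbers reported above. This is a small sketch over a single run of each configuration. The dictionary keys are names chosen for this example.

```python
# Values copied from the two "Serving Benchmark Result" blocks above.
main = {
    "duration_s": 188.52,
    "output_tok_per_s": 1738.18,
    "mean_tpot_ms": 17.77,
    "median_tpot_ms": 17.64,
    "p99_tpot_ms": 19.82,
    "p99_itl_ms": 17.86,
    "mean_e2el_ms": 18844.60,
}
fused = {
    "duration_s": 185.98,
    "output_tok_per_s": 1761.94,
    "mean_tpot_ms": 17.45,
    "median_tpot_ms": 17.23,
    "p99_tpot_ms": 21.09,
    "p99_itl_ms": 17.58,
    "mean_e2el_ms": 18594.01,
}


def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


changes = {name: pct_change(main[name], fused[name]) for name in main}
for name, delta in changes.items():
    print(f"{name:>18}: {delta:+.2f}%")
```

On these numbers, output throughput is roughly 1.4% higher and mean TPOT roughly 1.8% lower with the fusion, while P99 TPOT is about 6.4% higher. With one run per configuration, differences of this size may be within run-to-run noise.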

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.97|±  |0.0121|
|     |       |strict-match    |     5|exact_match|↑  | 0.00|±  |0.0000|
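
For the flexible-extract rows, a quick check of whether 0.95 versus 0.97 is a meaningful difference can be sketched as follows. It assumes the reported stderr is the sample standard error of a proportion over the 200 examples selected by `--limit 200`, and that the two runs are independent. Both are assumptions of this example, although the first reproduces the stderr values in the tables.

```python
import math


def proportion_stderr(p, n):
    """Sample standard error of a proportion, with n - 1 in the denominator."""
    return math.sqrt(p * (1.0 - p) / (n - 1))


n = 200                            # from --limit 200 in the test plan
main_acc, fused_acc = 0.95, 0.97   # flexible-extract exact_match

main_se = proportion_stderr(main_acc, n)
fused_se = proportion_stderr(fused_acc, n)

# Standard error of the difference, assuming the two runs are independent.
diff = fused_acc - main_acc
diff_se = math.sqrt(main_se**2 + fused_se**2)
z = diff / diff_se

print(f"main stderr  ~ {main_se:.4f}")
print(f"fused stderr ~ {fused_se:.4f}")
print(f"difference   = {diff:.2f} +/- {diff_se:.4f} (z ~ {z:.2f})")
```

Under these assumptions the difference is about one standard error, so the flexible-extract scores are statistically indistinguishable at this sample size. The strict-match rows are not covered by this calculation.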

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

gemini-code-assist

Code Review

This pull request introduces the MLARoPEKVCacheCatFusionPass to optimize MLA RoPE KV cache updates by fusing concatenation and caching operations. The implementation includes updates to the CUDA kernels for flexible data type support, new pattern matchers for DeepSeek scaling and standard RoPE, and integration into the compilation pipeline. Feedback identifies a typo in a test import path and a configuration error where the fusion was enabled for the O0 optimization level, which should remain unoptimized.
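
As a toy reference for what the fusion combines, the sketch below runs the three steps named in the PR title (RoPE, KV cache write, q concat) separately and then in a single pass, and checks that both give the same result. Everything here is an assumption made for illustration: the function names, the tiny dimensions, the NeoX-style rotation, the dict used as a cache, and the absence of batching and FP8 quantization. It is not the kernel or the pattern matcher from this PR.

```python
import math


def rope_neox(x, pos, base=10000.0):
    """Rotate one even-length vector by position `pos` (NeoX-style halves)."""
    dim, half = len(x), len(x) // 2
    out = [0.0] * dim
    for i in range(half):
        theta = pos * base ** (-2.0 * i / dim)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + half] * s
        out[i + half] = x[i + half] * c + x[i] * s
    return out


def unfused(q_nope, q_pe, kv_c, k_pe, pos, slot, kv_cache):
    """Three separate steps, each producing an intermediate list."""
    q_pe_rot = rope_neox(q_pe, pos)    # step 1: RoPE on the rotary parts
    k_pe_rot = rope_neox(k_pe, pos)
    kv_cache[slot] = kv_c + k_pe_rot   # step 2: concat and write to the cache
    return q_nope + q_pe_rot           # step 3: q concat


def fused(q_nope, q_pe, kv_c, k_pe, pos, slot, kv_cache, base=10000.0):
    """Same result in one pass, writing rotated values straight into the outputs."""
    dim = len(q_pe)                    # q_pe and k_pe share the rotary dimension
    half = dim // 2
    q_out = list(q_nope) + [0.0] * dim
    row = list(kv_c) + [0.0] * dim
    for i in range(half):
        theta = pos * base ** (-2.0 * i / dim)
        c, s = math.cos(theta), math.sin(theta)
        for src, dst, off in ((q_pe, q_out, len(q_nope)), (k_pe, row, len(kv_c))):
            dst[off + i] = src[i] * c - src[i + half] * s
            dst[off + i + half] = src[i + half] * c + src[i] * s
    kv_cache[slot] = row
    return q_out


q_nope, q_pe = [1.0, 2.0, 3.0], [0.5, -0.5, 1.5, 2.5]
kv_c, k_pe = [4.0, 5.0], [1.0, 0.0, -1.0, 2.0]
cache_a, cache_b = {}, {}
q_a = unfused(q_nope, q_pe, kv_c, k_pe, pos=7, slot=3, kv_cache=cache_a)
q_b = fused(q_nope, q_pe, kv_c, k_pe, pos=7, slot=3, kv_cache=cache_b)
print(q_a == q_b, cache_a == cache_b)
```

The fused variant avoids materializing the rotated `q_pe` and `k_pe` as separate intermediates, which is the general motivation for fusing memory-bound ops like these.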

ProExpertProg

Looks good, thanks for this work! Can we add this fusion to the E2E tests as well?

ElizaWszola · 2026-05-07T11:03:32Z

+
+class MLARoPEKVCacheCatFusionPass(VllmFusionPatternMatcherPass):
+    def __init__(self, config: VllmConfig) -> None:
+        super().__init__(config, "mla_rope_kv_cache_fusion_pass")


nit: can you make this name consistent with other passes?

Rohan138 · May 7, 2026

Do you mean camel case or something else? I was trying to keep it consistent with MLAAttentionQuantFusionPass and RopeKVCacheFusionPass.

ElizaWszola · May 7, 2026

Yes, this is what I mean. Would it make sense to make MLAAttentionQuantFusionPass camel case as well so all pass names are consistent?

Rohan138 · May 8, 2026

It's already camel case: MLAAttnQuantFusionPass

tjtanaa · 2026-05-08T03:50:44Z

@Rohan138 can you try to address this issue?

rope+kvcache+cat mla fusion squash into single commit

6791331

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

gemini-code-assist Bot reviewed Apr 20, 2026

Comment thread tests/compile/passes/test_mla_rope_kvcache_cat_fusion.py Outdated

Comment thread vllm/config/vllm.py Outdated

Rohan138 added 15 commits April 20, 2026 14:13

fix defaults

2f73bcb

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

fix lint and merge

42d21c0

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

fix lint and merge

e280559

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

wip

8983507

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

wip

27a66ba

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

wip

a18af5f

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

wip

57453fd

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

fix neox rope

23b0da2

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

mild name cleanup

e5677fa

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

Refactor to VllmFusionPatternMatcherPass

05ef0ff

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

rename fusion func

39677be

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

Merge branch 'main' into mla_rope_kvcache_fusion

739deec

fix defaults

a572e64

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

use get_attention_context

50cb607

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

fix csrc and kernel name

e4e638b

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

Rohan138 changed the title ~~[WIP] Fused RoPE+KVCache+q_concat for MLA~~ [Performance][DSR1]: Fused RoPE+KVCache+q_concat for MLA Apr 21, 2026

Rohan138 marked this pull request as ready for review April 21, 2026 23:49

Rohan138 requested review from LucasWilkinson, MatthewBonanni, ProExpertProg, WoosukKwon, hmellor, houseroad, mgoin, robertgshaw2-redhat, tlrmchlsmth, yewentao256 and youkaichao as code owners April 21, 2026 23:49

Rohan138 added 2 commits May 1, 2026 17:59

Merge branch 'main' into mla_rope_kvcache_fusion

dc61b57

fix lint and UT failures

039cda2

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

xaguilar-amd mentioned this pull request May 3, 2026

[Performance][MLA] Lift decode Q-prep (q-absorb + cat + FP8 quant) out of forward_impl #41568 (Open)

Rohan138 and others added 2 commits May 4, 2026 16:27

merge main

b0b2eb2

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

Add Eliza as coauthor

Co-authored-by: ElizaWszola <ewszola@redhat.com>

4d8d6ef

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

Rohan138 force-pushed the mla_rope_kvcache_fusion branch from 4bb9f57 to 4d8d6ef Compare May 4, 2026 17:20

ProExpertProg enabled auto-merge (squash) May 4, 2026 18:54

ProExpertProg disabled auto-merge May 4, 2026 18:54

Merge branch 'main' into mla_rope_kvcache_fusion

de9a2b9

xaguilar-amd mentioned this pull request May 6, 2026

[Performance][MLA][ROCm] AITER fused QK-RoPE + KV cache + q-absorb + q-cat + q-quant for decode #41839 (Draft)

Merge branch 'main' into mla_rope_kvcache_fusion

c94d290

ProExpertProg approved these changes May 6, 2026

Comment thread tests/compile/passes/test_mla_rope_kvcache_cat_fusion.py

Comment thread tests/compile/passes/test_mla_rope_kvcache_cat_fusion.py Outdated

ElizaWszola reviewed May 7, 2026

Rohan138 added 2 commits May 8, 2026 05:46

remove TODO

d1f8c0f

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

Merge branch 'main' into mla_rope_kvcache_fusion

9afcdeb

ProExpertProg enabled auto-merge (squash) May 8, 2026 07:00

dllehr-amd and others added 2 commits May 8, 2026 08:18

Merge branch 'main' into mla_rope_kvcache_fusion

792f2ea

Merge branch 'main' into mla_rope_kvcache_fusion

287b221

ProExpertProg merged commit a51376b into vllm-project:main May 11, 2026
159 checks passed

functionstackx mentioned this pull request May 19, 2026

[Bug][Perf] MiniMax-M2.5 FP8 on MI325X: ~38% throughput regression between vLLM ROCm v0.18.0 and v0.21.0 #43029 (Closed)

Rohan138 deleted the mla_rope_kvcache_fusion branch May 31, 2026 16:35

Franky2679 mentioned this pull request Jun 11, 2026

[Fix] Handle auto_functionalized for DeepSeek-V4 MLA fused kernel #43058 (Open)

ProExpertProg merged 45 commits into vllm-project:main from ROCm:mla_rope_kvcache_fusion

7 participants
