[Performance][DSR1]: Fused RoPE+KVCache+q_concat for MLA #40392

Merged
ProExpertProg merged 45 commits into vllm-project:main from ROCm:mla_rope_kvcache_fusion on May 11, 2026

@Rohan138 Rohan138 commented Apr 20, 2026

Contributor

Purpose

Relands an updated version of #35245, #35879, and #38646 to fuse the MLA RoPE and KV cache ops. Adds pattern-matching fixes and minimization on top of #35879.

Test Plan

# server startup
vllm serve amd/DeepSeek-R1-0528-MXFP4 --tensor-parallel-size=8 --gpu-memory-utilization=0.94 --dtype=auto --kv-cache-dtype=fp8 --max-num-batched-tokens=8192 --attention-backend ROCM_AITER_MLA -cc.pass_config.fuse_rope_kvcache_cat_mla=True -cc.use_inductor_graph_partition=True --no-enable-prefix-caching --disable-uvicorn-access-log

# benchmark serving
vllm bench serve --model amd/DeepSeek-R1-0528-MXFP4 --percentile-metrics tpot,itl,e2el --ignore-eos --dataset-name random --random-input-len 1024 --random-output-len 1024 --max-concurrency 32 --num-prompts 320

# benchmark accuracy
lm_eval --model local-completions --model_args model=amd/DeepSeek-R1-0528-MXFP4,base_url=http://0.0.0.0:8000/v1/completions,num_concurrent=256,max_retries=10,max_gen_toks=2048 --batch_size auto --tasks gsm8k --num_fewshot 5 --limit 200 --output_path .
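
The server startup command above enables the fusion through dotted `-cc.` overrides (`-cc.pass_config.fuse_rope_kvcache_cat_mla=True`, `-cc.use_inductor_graph_partition=True`), which appear to address nested compilation-config fields. As a rough illustration of how dotted keys of this shape map onto a nested structure, here is a minimal sketch. The helper `apply_dotted_override` is hypothetical and is not vLLM's argument parser; only the two key names are taken from the command.

```python
def apply_dotted_override(config: dict, override: str) -> dict:
    """Apply one 'a.b.c=value' override to a nested dict, in place.

    Hypothetical helper for illustration only; not part of vLLM.
    """
    path, _, raw = override.partition("=")
    keys = path.split(".")
    node = config
    # Walk (and create) the intermediate levels named by the dotted path.
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    # Interpret the usual boolean spellings; keep anything else as a string.
    node[keys[-1]] = {"true": True, "false": False}.get(raw.lower(), raw)
    return config


compilation_config: dict = {}
for flag in (
    "pass_config.fuse_rope_kvcache_cat_mla=True",
    "use_inductor_graph_partition=True",
):
    apply_dotted_override(compilation_config, flag)

print(compilation_config)
```

Under this reading, the fusion toggle lives one level down, inside `pass_config`, while the graph-partition toggle sits at the top level of the compilation config.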

Test Result

Main:

============ Serving Benchmark Result ============
Successful requests:                     320       
Failed requests:                         0         
Maximum request concurrency:             32        
Benchmark duration (s):                  188.52    
Total input tokens:                      327680    
Total generated tokens:                  327680    
Request throughput (req/s):              1.70      
Output token throughput (tok/s):         1738.18   
Peak output token throughput (tok/s):    1952.00   
Peak concurrent requests:                64.00     
Total token throughput (tok/s):          3476.36   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          17.77     
Median TPOT (ms):                        17.64     
P99 TPOT (ms):                           19.82     
---------------Inter-token Latency----------------
Mean ITL (ms):                           17.77     
Median ITL (ms):                         16.87     
P99 ITL (ms):                            17.86     
----------------End-to-end Latency----------------
Mean E2EL (ms):                          18844.60  
Median E2EL (ms):                        18443.61  
P99 E2EL (ms):                           22806.19  
==================================================

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.95|±  |0.0154|
|     |       |strict-match    |     5|exact_match|↑  | 0.95|±  |0.0154|

Fused:

============ Serving Benchmark Result ============
Successful requests:                     320       
Failed requests:                         0         
Maximum request concurrency:             32        
Benchmark duration (s):                  185.98    
Total input tokens:                      327680    
Total generated tokens:                  327680    
Request throughput (req/s):              1.72      
Output token throughput (tok/s):         1761.94   
Peak output token throughput (tok/s):    1952.00   
Peak concurrent requests:                64.00     
Total token throughput (tok/s):          3523.89   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          17.45     
Median TPOT (ms):                        17.23     
P99 TPOT (ms):                           21.09     
---------------Inter-token Latency----------------
Mean ITL (ms):                           17.45     
Median ITL (ms):                         16.62     
P99 ITL (ms):                            17.58     
----------------End-to-end Latency----------------
Mean E2EL (ms):                          18594.01  
Median E2EL (ms):                        18189.58  
P99 E2EL (ms):                           22745.50  
==================================================

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.97|±  |0.0121|
|     |       |strict-match    |     5|exact_match|↑  | 0.00|±  |0.0000|
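
To make the Main versus Fused comparison easier to read, the relative changes can be computed directly from the figures reported above. This is a small sketch; the dictionary keys are labels chosen here for illustration, and the values are copied from the two benchmark tables.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Values copied from the "Main" and "Fused" serving benchmark results.
main = {
    "output_tok_s": 1738.18,
    "mean_tpot_ms": 17.77,
    "p99_tpot_ms": 19.82,
    "mean_e2el_ms": 18844.60,
}
fused = {
    "output_tok_s": 1761.94,
    "mean_tpot_ms": 17.45,
    "p99_tpot_ms": 21.09,
    "mean_e2el_ms": 18594.01,
}

deltas = {name: pct_change(main[name], fused[name]) for name in main}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}%")
```

Output throughput rises and mean TPOT and mean E2EL fall by a little over one percent each, while P99 TPOT is higher in the fused run. Each configuration was measured once here, so differences of this size may partly reflect run-to-run variation.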

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces the MLARoPEKVCacheCatFusionPass to optimize MLA RoPE KV cache updates by fusing concatenation and caching operations. The implementation includes updates to the CUDA kernels for flexible data type support, new pattern matchers for DeepSeek scaling and standard RoPE, and integration into the compilation pipeline. Feedback identifies a typo in a test import path and a configuration error where the fusion was enabled for the O0 optimization level, which should remain unoptimized.
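
As a conceptual illustration of what fusing RoPE, the query concatenation, and the KV cache write means, the sketch below compares a three-pass reference with a single-pass version on plain Python lists. It is not the pass or the kernel from this PR: it assumes a single attention head, uses a rotate-half RoPE variant chosen for brevity (the real kernel may use a different layout), and all function and variable names are hypothetical.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate-half RoPE on one even-length vector (illustrative variant)."""
    half = len(vec) // 2
    out = [0.0] * len(vec)
    for i in range(half):
        theta = pos * base ** (-i / half)
        c, s = math.cos(theta), math.sin(theta)
        x1, x2 = vec[i], vec[i + half]
        out[i] = x1 * c - x2 * s
        out[i + half] = x2 * c + x1 * s
    return out


def unfused(q_nope, q_pe, kv_c, k_pe, positions, slot_mapping, cache):
    """Three separate passes: RoPE, query concat, then cache write."""
    q_pe_rot = [rope(v, p) for v, p in zip(q_pe, positions)]
    k_pe_rot = [rope(v, p) for v, p in zip(k_pe, positions)]
    q = [nope + pe for nope, pe in zip(q_nope, q_pe_rot)]
    for t, slot in enumerate(slot_mapping):
        cache[slot] = kv_c[t] + k_pe_rot[t]
    return q


def fused(q_nope, q_pe, kv_c, k_pe, positions, slot_mapping, cache):
    """One pass per token: rotate, concatenate, and write the cache slot."""
    q = []
    for t, (pos, slot) in enumerate(zip(positions, slot_mapping)):
        q.append(q_nope[t] + rope(q_pe[t], pos))
        cache[slot] = kv_c[t] + rope(k_pe[t], pos)
    return q


# Toy inputs: 3 tokens, written to non-contiguous cache slots.
positions = [0, 1, 2]
slot_mapping = [5, 2, 7]
q_nope = [[0.1 * t, 0.2] for t in range(3)]
q_pe = [[1.0, 2.0, 3.0, 4.0] for _ in range(3)]
kv_c = [[float(t), -1.0, 0.5] for t in range(3)]
k_pe = [[0.5, -0.5, 1.5, 2.5] for _ in range(3)]

cache_a = [None] * 8
cache_b = [None] * 8
q_a = unfused(q_nope, q_pe, kv_c, k_pe, positions, slot_mapping, cache_a)
q_b = fused(q_nope, q_pe, kv_c, k_pe, positions, slot_mapping, cache_b)
print(q_a == q_b and cache_a == cache_b)
```

Both versions produce the same queries and cache contents. The fused form simply avoids materializing the rotated intermediates between steps, which is the kind of saving a fused kernel aims for on real hardware.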

Comment thread tests/compile/passes/test_mla_rope_kvcache_cat_fusion.py Outdated
Comment thread vllm/config/vllm.py Outdated
Rohan138 added 15 commits April 20, 2026 14:13
@Rohan138 Rohan138 changed the title [WIP] Fused RoPE+KVCache+q_concat for MLA [Performance][DSR1]: Fused RoPE+KVCache+q_concat for MLA Apr 21, 2026
@Rohan138 Rohan138 marked this pull request as ready for review April 21, 2026 23:49
Rohan138 added 2 commits May 1, 2026 17:59
Rohan138 and others added 2 commits May 4, 2026 16:27
@Rohan138 Rohan138 force-pushed the mla_rope_kvcache_fusion branch from 4bb9f57 to 4d8d6ef Compare May 4, 2026 17:20
@ProExpertProg ProExpertProg enabled auto-merge (squash) May 4, 2026 18:54
@ProExpertProg ProExpertProg disabled auto-merge May 4, 2026 18:54

@ProExpertProg ProExpertProg left a comment

Collaborator

Looks good, thanks for this work! Can we add this fusion to the E2E tests as well?

Comment thread tests/compile/passes/test_mla_rope_kvcache_cat_fusion.py
Comment thread tests/compile/passes/test_mla_rope_kvcache_cat_fusion.py Outdated

class MLARoPEKVCacheCatFusionPass(VllmFusionPatternMatcherPass):
    def __init__(self, config: VllmConfig) -> None:
        super().__init__(config, "mla_rope_kv_cache_fusion_pass")

Contributor

nit: can you make this name consistent with other passes?

Contributor Author

Do you mean camel case or something else? I was trying to keep it consistent with MLAAttentionQuantFusionPass and RopeKVCacheFusionPass

Contributor

Yes, this is what I mean -- would it make sense to make MLAAttentionQuantFusionPass camel case as well so all pass names are consistent?

Contributor Author

It's already camelcase: MLAAttnQuantFusionPass

tjtanaa commented May 8, 2026

Member

@Rohan138 can you try to address this issue?

Rohan138 added 2 commits May 8, 2026 05:46
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>
@ProExpertProg ProExpertProg enabled auto-merge (squash) May 8, 2026 07:00
@ProExpertProg ProExpertProg merged commit a51376b into vllm-project:main May 11, 2026
159 checks passed
hissu-hyvarinen pushed a commit to hissu-hyvarinen/vllm that referenced this pull request May 11, 2026
…t#40392)

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>
Signed-off-by: Rohan Potdar <66227218+Rohan138@users.noreply.github.com>
Co-authored-by: ElizaWszola <ewszola@redhat.com>
Signed-off-by: Hissu Hyvarinen <hissu.hyvarinen@amd.com>
@Rohan138 Rohan138 deleted the mla_rope_kvcache_fusion branch May 31, 2026 16:35

Labels

ready: ONLY add when PR is ready to merge/full CI is needed

7 participants