[ROCm]: Enable customop and rope+kvcache fusion for AITER RoPE by Rohan138 · Pull Request #35180 · vllm-project/vllm

Rohan138 · 2026-02-24T07:21:21Z

Purpose

Follow-up to #25135 and #33443 to fix and enable the AITER RoPE custom op by default, and enable RoPE+KVCache fusion. During prefill (q.shape[0] > 256), we thus use the AITER unfused RoPE kernel instead of the vllm native custom op, which gives another 1% uplift on gpt-oss.

This is also a mild prerequisite for MLA RoPE+KVCache fusion on ROCm, since there currently isn't a vllm native custom op for DeepseekScalingRotaryEmbedding.

Test Plan

Before:

============ Serving Benchmark Result ============
Successful requests:                     320       
Benchmark duration (s):                  118.01    
Total input tokens:                      294646    
Total generated tokens:                  295581    
Request throughput (req/s):              2.71      
Output token throughput (tok/s):         2504.76   
Total Token throughput (tok/s):          5001.60  

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.90|±  |0.0213|
|     |       |strict-match    |     5|exact_match|↑  | 0.17|±  |0.0266|

Test Result

After:

============ Serving Benchmark Result ============
Successful requests:                     320       
Benchmark duration (s):                  116.47    
Total input tokens:                      294646    
Total generated tokens:                  295581    
Request throughput (req/s):              2.75      
Output token throughput (tok/s):         2537.77   
Total Token throughput (tok/s):          5067.52

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.90|±  |0.0213|
|     |       |strict-match    |     5|exact_match|↑  | 0.19|±  |0.0278|
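As a quick sanity check on the "1% uplift" claim, the relative changes can be recomputed from the Before/After numbers above. This is only an illustrative sketch: the helper names are made up for this example, and the z-score assumes the two gsm8k runs are independent samples, which the PR does not state.

```python
import math


def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


def z_score(value_a: float, stderr_a: float, value_b: float, stderr_b: float) -> float:
    """Difference between two scores in units of their combined standard error.

    Assumes the two measurements are independent (an assumption of this sketch).
    """
    return (value_b - value_a) / math.hypot(stderr_a, stderr_b)


# Values copied from the two "Serving Benchmark Result" blocks above.
before = {
    "duration_s": 118.01,
    "req_per_s": 2.71,
    "output_tok_per_s": 2504.76,
    "total_tok_per_s": 5001.60,
}
after = {
    "duration_s": 116.47,
    "req_per_s": 2.75,
    "output_tok_per_s": 2537.77,
    "total_tok_per_s": 5067.52,
}

changes = {name: pct_change(before[name], after[name]) for name in before}
for name, delta in changes.items():
    print(f"{name}: {delta:+.2f}%")

# gsm8k strict-match: 0.17 +/- 0.0266 before, 0.19 +/- 0.0278 after.
strict_match_z = z_score(0.17, 0.0266, 0.19, 0.0278)
print(f"strict-match z-score: {strict_match_z:.2f}")
```

A z-score well below 1 would suggest the strict-match difference between the two runs is within noise, while the flexible-extract score is identical in both runs.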


Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

gemini-code-assist

Code Review

This pull request enables the AITER RoPE custom op and RoPE+KVCache fusion for ROCm. The changes look good overall, enabling the new custom op and updating relevant parts of the code, including tests and configuration. However, I found a critical issue in the configuration that prevents the new RoPE+KVCache fusion from being enabled by default, which contradicts the goal of this PR. Please see the detailed comment.

gemini-code-assist · 2026-02-24T07:23:12Z

+    return (
+        cfg.compilation_config.is_custom_op_enabled("rotary_embedding")
+        and cfg.compilation_config.use_inductor_graph_partition
+    )


The extra clause "and cfg.compilation_config.use_inductor_graph_partition" will cause this function to return False for the default optimization levels (O2, O3), as use_inductor_graph_partition is set to False for them. This effectively disables the fuse_rope_kvcache feature that this pull request intends to enable. To ensure the fusion is enabled by default as intended, this clause should be removed:

return cfg.compilation_config.is_custom_op_enabled("rotary_embedding")


mergify · 2026-02-24T20:22:25Z

Hi @Rohan138, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?

mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint


ProExpertProg · 2026-02-24T20:43:40Z

+    return (
+        rocm_aiter_ops.is_enabled()
+        and cfg.compilation_config.is_custom_op_enabled("rotary_embedding")
+        and cfg.compilation_config.use_inductor_graph_partition


So are you guys using inductor graph partition on ROCm by default? Otherwise we should also return true here if dynamo partition is used and the kv cache op is not in splitting ops.

Rohan138 · 2026-02-24

(Somehow GH ate my original PR comment that explained this)

This PR is necessary but not sufficient to actually enable this fusion by default. We also need:

[ROCm] Add extra step in config initialization to populate custom ops before compilation config init #34848 to actually enable the rotary_embedding custom op before this check

[Feature]: Remove attention layer name from unified_kv_cache_update #33267 to remove layer_name from unified_kv_cache_update (and thus remove the op from splitting_ops if using dynamo partition).

We need some offline discussion on improving our pass management depending on platform, AITER/FI enabled, etc.

return true if dynamo partition and kv cache op not in splitting ops

https://github.com/vllm-project/vllm/blob/main/vllm/config/compilation.py#1001 is called in https://github.com/vllm-project/vllm/blob/main/vllm/config/vllm.py#L961 after the defaults are set in https://github.com/vllm-project/vllm/blob/main/vllm/config/vllm.py#L807. So if inductor partition is not enabled, we would return true for this, then append kv cache to splitting ops, which would silently break the fusion.

ProExpertProg · 2026-02-26

Links are broken but I know what you mean. But if splitting_ops=[] is passed, kvcache won't be added, so it should still work. So this check should be: if inductor_partition or len(splitting_ops)==0
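The gating condition discussed in this thread can be sketched as a small predicate. This is a minimal illustration, not the merged vLLM code: CompilationConfigSketch, its simplified is_custom_op_enabled, the function name rope_kvcache_fusion_enabled, and the "kv_cache_update" op name are hypothetical stand-ins. Only the shape of the check (AITER enabled, rotary_embedding custom op enabled, and either inductor graph partition or empty splitting_ops) is taken from the comments above.

```python
from dataclasses import dataclass, field


@dataclass
class CompilationConfigSketch:
    """Hypothetical stand-in for the compilation config fields named in the thread."""

    custom_ops: list = field(default_factory=list)
    use_inductor_graph_partition: bool = False
    splitting_ops: list = field(default_factory=list)

    def is_custom_op_enabled(self, name: str) -> bool:
        # Simplified assumption: an op counts as enabled only if "+name" is listed.
        return f"+{name}" in self.custom_ops


def rope_kvcache_fusion_enabled(aiter_enabled: bool, cfg: CompilationConfigSketch) -> bool:
    """Return True only when the fusion could survive graph splitting."""
    return (
        aiter_enabled
        and cfg.is_custom_op_enabled("rotary_embedding")
        # Either inductor partitions the graph, or no ops are split out at all,
        # so the kv cache op is not separated from RoPE before the fusion pass.
        and (cfg.use_inductor_graph_partition or len(cfg.splitting_ops) == 0)
    )


# Usage: dynamo partition with a split-out kv cache op disables the fusion.
split_cfg = CompilationConfigSketch(
    custom_ops=["+rotary_embedding"],
    splitting_ops=["kv_cache_update"],  # hypothetical op name
)
empty_split_cfg = CompilationConfigSketch(custom_ops=["+rotary_embedding"])
inductor_cfg = CompilationConfigSketch(
    custom_ops=["+rotary_embedding"],
    use_inductor_graph_partition=True,
    splitting_ops=["kv_cache_update"],
)

print(rope_kvcache_fusion_enabled(True, split_cfg))
print(rope_kvcache_fusion_enabled(True, empty_split_cfg))
print(rope_kvcache_fusion_enabled(True, inductor_cfg))
```

Checking len(splitting_ops) == 0 rather than membership of one specific op follows the last suggestion in the thread; a membership test on the kv cache op would be the variant proposed in the earlier comment.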


tjtanaa · 2026-03-05T09:45:28Z

@Rohan138 is the lm eval score from DeepSeek-R1? The GSM8K score should be at 0.95. 0.9 seems low.


Enable rope+kvcache fusion for AITER rope; turn AITER rope on by default

e0013f6

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

Rohan138 requested review from ProExpertProg, WoosukKwon, hmellor, houseroad, mgoin, robertgshaw2-redhat, tjtanaa, tlrmchlsmth, yewentao256, youkaichao and zou3519 as code owners February 24, 2026 07:21

mergify Bot added the rocm Related to AMD ROCm label Feb 24, 2026

github-project-automation Bot added this to AMD Feb 24, 2026

github-project-automation Bot moved this to Todo in AMD Feb 24, 2026

gemini-code-assist Bot reviewed Feb 24, 2026

View reviewed changes

Rohan138 added 4 commits February 24, 2026 13:21

Merge branch 'main' into fused_aiter_triton_rope_kvcache

9624497

Enable by default (conditions apply)

ca9ec3f

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

refactor

cfa149e

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

put +rotary_embedding back

83ff4bd

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

Add +rotary_embedding to rocm.py

8a2c1ff

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

Merge branch 'main' into fused_aiter_triton_rope_kvcache

e512759

ProExpertProg approved these changes Feb 24, 2026

View reviewed changes

Rohan138 added 3 commits February 24, 2026 15:31

add back qknorm rope fusion flag

b5f2bb8

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

Merge branch 'fused_aiter_triton_rope_kvcache' of https://github.com/…

1d9ab8e

…ROCm/vllm into fused_aiter_triton_rope_kvcache

fix _aiter_ops import

1139789

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

gshtras enabled auto-merge (squash) February 24, 2026 22:57

github-actions Bot added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 24, 2026

Merge branch 'main' into fused_aiter_triton_rope_kvcache

11177e8

Rohan138 mentioned this pull request Feb 24, 2026

[ROCm][WIP]: Fused aiter rope kvcache mla #35245

Closed

5 tasks

gshtras merged commit f38f8c9 into vllm-project:main Feb 25, 2026
67 of 68 checks passed

github-project-automation Bot moved this from Todo to Done in AMD Feb 25, 2026

Rohan138 mentioned this pull request Feb 27, 2026

[ROCm]: fix aiter rope functionalization #35533

Merged

5 tasks

Rohan138 deleted the fused_aiter_triton_rope_kvcache branch February 27, 2026 21:19

Rohan138 mentioned this pull request Feb 28, 2026

[ROCm][Bugfix]: Disable AITER Triton ROPE by default #35601

Merged

5 tasks

jennyyyyzhen mentioned this pull request Mar 3, 2026

[Bug]: Qwen3-Next accuracy regression on AMD #35828

Closed

Rohan138 mentioned this pull request Mar 5, 2026

vllm 0.17.0 gpt-oss updates SemiAnalysisAI/InferenceX#867

Closed

cquil11 mentioned this pull request Mar 7, 2026

[AMD] GPT-OSS vLLM 0.17.0 AMD update SemiAnalysisAI/InferenceX#889

Merged

Copilot AI pushed a commit to machov/vllm that referenced this pull request Mar 10, 2026

[ROCm]: Enable customop and rope+kvcache fusion for AITER RoPE (vllm-…

4aa7079

…project#35180) Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

jiangkuaixue123 pushed a commit to jiangkuaixue123/vllm that referenced this pull request Apr 28, 2026

[ROCm]: Enable customop and rope+kvcache fusion for AITER RoPE (vllm-…

acd7e33

…project#35180) Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

mystous pushed a commit to mystous/vllm_hybrid that referenced this pull request May 10, 2026

[ROCm]: Enable customop and rope+kvcache fusion for AITER RoPE (vllm-…

2a8ed69

…project#35180) Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

my-other-github-account pushed a commit to my-other-github-account/vllm that referenced this pull request May 15, 2026

[ROCm]: Enable customop and rope+kvcache fusion for AITER RoPE (vllm-…

4186b0c

…project#35180) Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

my-other-github-account pushed a commit to my-other-github-account/vllm that referenced this pull request May 15, 2026

[ROCm]: Enable customop and rope+kvcache fusion for AITER RoPE (vllm-…

6a5f46a

…project#35180) Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

0826joyce pushed a commit to 0826joyce/vllm-serving-optimization that referenced this pull request May 19, 2026

[ROCm]: Enable customop and rope+kvcache fusion for AITER RoPE (vllm-…

63a6a6e

…project#35180) Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

gshtras merged 11 commits into vllm-project:main from ROCm:fused_aiter_triton_rope_kvcache
