Skip to content

[ROCm]: Enable customop and rope+kvcache fusion for AITER RoPE#35180

Merged
gshtras merged 11 commits into
vllm-project:mainfrom
ROCm:fused_aiter_triton_rope_kvcache
Feb 25, 2026
Merged

[ROCm]: Enable customop and rope+kvcache fusion for AITER RoPE#35180
gshtras merged 11 commits into
vllm-project:mainfrom
ROCm:fused_aiter_triton_rope_kvcache

Conversation

@Rohan138

@Rohan138 Rohan138 commented Feb 24, 2026

Copy link
Copy Markdown
Contributor

Purpose

Follow-up to #25135 and #33443 to fix and enable the AITER RoPE custom op by default, and enable RoPE+KVCache fusion. During prefill (q.shape[0] > 256), we thus use the AITER unfused RoPE kernel instead of the vllm native custom op, which gives another 1% uplift on gpt-oss.

This is also a mild prerequisite for MLA RoPE+KVCache fusion on ROCm, since there currently isn't a vllm native custom op for DeepseekScalingRotaryEmbedding.

Test Plan

Before:

============ Serving Benchmark Result ============
Successful requests:                     320       
Benchmark duration (s):                  118.01    
Total input tokens:                      294646    
Total generated tokens:                  295581    
Request throughput (req/s):              2.71      
Output token throughput (tok/s):         2504.76   
Total Token throughput (tok/s):          5001.60  

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.90|±  |0.0213|
|     |       |strict-match    |     5|exact_match|↑  | 0.17|±  |0.0266|

Test Result

After:

============ Serving Benchmark Result ============
Successful requests:                     320       
Benchmark duration (s):                  116.47    
Total input tokens:                      294646    
Total generated tokens:                  295581    
Request throughput (req/s):              2.75      
Output token throughput (tok/s):         2537.77   
Total Token throughput (tok/s):          5067.52

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.90|±  |0.0213|
|     |       |strict-match    |     5|exact_match|↑  | 0.19|±  |0.0278|

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enables the AITER RoPE custom op and RoPE+KVCache fusion for ROCm. The changes look good overall, enabling the new custom op and updating relevant parts of the code, including tests and configuration. However, I found a critical issue in the configuration that prevents the new RoPE+KVCache fusion from being enabled by default, which contradicts the goal of this PR. Please see the detailed comment.

Comment thread vllm/config/vllm.py Outdated
Comment on lines +112 to +115
return (
cfg.compilation_config.is_custom_op_enabled("rotary_embedding")
and cfg.compilation_config.use_inductor_graph_partition
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The condition and cfg.compilation_config.use_inductor_graph_partition will cause this function to return False for the default optimization levels (O2, O3), as use_inductor_graph_partition is set to False for them. This effectively disables the fuse_rope_kvcache feature that this pull request intends to enable. To ensure the fusion is enabled by default as intended, this condition should be removed.

    return cfg.compilation_config.is_custom_op_enabled("rotary_embedding")

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>
@mergify

mergify Bot commented Feb 24, 2026

Copy link
Copy Markdown
Contributor

Hi @Rohan138, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>
@mergify

mergify Bot commented Feb 24, 2026

Copy link
Copy Markdown
Contributor

Hi @Rohan138, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@mergify

mergify Bot commented Feb 24, 2026

Copy link
Copy Markdown
Contributor

Hi @Rohan138, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Comment thread vllm/config/compilation.py
Comment thread vllm/config/vllm.py
return (
rocm_aiter_ops.is_enabled()
and cfg.compilation_config.is_custom_op_enabled("rotary_embedding")
and cfg.compilation_config.use_inductor_graph_partition

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So are you guys using inductor graph partition on rocm by default? Otherwise we should also return true here I'd dynamo partition and kv cache op not in splitting ops

@Rohan138 Rohan138 Feb 24, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Somehow GH ate my original PR comment that explained this)

This PR is necessary but not sufficient to actually enable this fusion by default. We also need:

return true if dynamo partition and kv cache op not in splitting ops

https://github.com/vllm-project/vllm/blob/main/vllm/config/compilation.py#1001 is called in https://github.com/vllm-project/vllm/blob/main/vllm/config/vllm.py#L961 after the defaults are set in https://github.com/vllm-project/vllm/blob/main/vllm/config/vllm.py#L807. So if inductor partition is not enabled, we would return true for this, then append kv cache to splitting ops, which would silently break the fusion.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Links are broken but I know what you mean - but if splitting_ops=[] is passed kvcache won't be added so it should still work. So this check should be if inductor_partition or len(splitting_ops)==0

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>
@gshtras gshtras enabled auto-merge (squash) February 24, 2026 22:57
@github-actions github-actions Bot added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 24, 2026
@gshtras gshtras merged commit f38f8c9 into vllm-project:main Feb 25, 2026
67 of 68 checks passed
@github-project-automation github-project-automation Bot moved this from Todo to Done in AMD Feb 25, 2026
@Rohan138 Rohan138 deleted the fused_aiter_triton_rope_kvcache branch February 27, 2026 21:19
@tjtanaa

tjtanaa commented Mar 5, 2026

Copy link
Copy Markdown
Member

@Rohan138 is the lm eval score from DeepSeek-R1? The GSM8K score should be at 0.95. 0.9 seems low.

Copilot AI pushed a commit to machov/vllm that referenced this pull request Mar 10, 2026
jiangkuaixue123 pushed a commit to jiangkuaixue123/vllm that referenced this pull request Apr 28, 2026
mystous pushed a commit to mystous/vllm_hybrid that referenced this pull request May 10, 2026
my-other-github-account pushed a commit to my-other-github-account/vllm that referenced this pull request May 15, 2026
my-other-github-account pushed a commit to my-other-github-account/vllm that referenced this pull request May 15, 2026
0826joyce pushed a commit to 0826joyce/vllm-serving-optimization that referenced this pull request May 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready ONLY add when PR is ready to merge/full CI is needed rocm Related to AMD ROCm

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants