Skip to content

[vLLM IR][Rope] Port RotaryEmbedding and DeepseekScalingRotaryEmbedding to IR Ops#39488

Closed
wxsIcey wants to merge 15 commits into
vllm-project:luka/vllm-ir/rms-norm-inplacefrom
wxsIcey:wxs/vllm-ir/rotary-embedding
Closed

[vLLM IR][Rope] Port RotaryEmbedding and DeepseekScalingRotaryEmbedding to IR Ops#39488
wxsIcey wants to merge 15 commits into
vllm-project:luka/vllm-ir/rms-norm-inplacefrom
wxsIcey:wxs/vllm-ir/rotary-embedding

Conversation

@wxsIcey

@wxsIcey wxsIcey commented Apr 10, 2026

Copy link
Copy Markdown
Contributor

Purpose

Port RotaryEmbedding and DeepseekScalingRotaryEmbedding to vLLM IR. For the base RotaryEmbedding, keep the key=None path out of the IR maybe_inplace call and fall back to the existing static implementation.

This preserves cross-layer KV sharing behavior and avoids torch.library schema inference failures on optional returns.

Test Plan

To be tested

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify Bot added deepseek Related to DeepSeek models nvidia rocm Related to AMD ROCm cpu Related to CPU backends labels Apr 10, 2026
@github-project-automation github-project-automation Bot moved this to Todo in AMD Apr 10, 2026
@wxsIcey

wxsIcey commented Apr 10, 2026

Copy link
Copy Markdown
Contributor Author

Some test files need to be modified.

sin,
is_neox_style,
)
elif cos_sin_format == "deepseek":

@wxsIcey wxsIcey Apr 10, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unsure if registering with the same IR ops is the best approach.

cos_sin_cache = self._match_cos_sin_cache_dtype(query)
return self.forward_static(
if key is None:
return self.forward_static(

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PyTorch custom op registration does not support None (i.e., Optional[Tensor]) in return types. So when key is None, I fall it to static implement.

(EngineCore_DP0 pid=115375)   File "/home/wangxiaoshuang/vllm/.venv/lib/python3.11/site-packages/torch/_library/infer_schema.py", line 71, in error_fn
(EngineCore_DP0 pid=115375)     raise ValueError(f"infer_schema(func): {what} Got func with signature {sig})")
(EngineCore_DP0 pid=115375) ValueError: infer_schema(func): Return has unsupported type tuple[torch.Tensor, torch.Tensor | None]. The valid types are: {<class 'torch.Tensor'>: 'Tensor', typing.List[torch.Tensor]: 'Tensor[]', list[torch.Tensor]: 'Tensor[]', <class 'int'>: 'SymInt', <class 'float'>: 'float', <class 'bool'>: 'bool', int | float | bool: 'Scalar'}. Got func with signature (positions: torch.Tensor, query: torch.Tensor, key: torch.Tensor | None, head_size: int, cos_sin_cache: torch.Tensor, is_neox_style: bool) -> tuple[torch.Tensor, torch.Tensor | None])
[rank0]:[W318 06:36:34.959129788 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a unified rotary_embedding IR operation to replace various ad-hoc implementations across the codebase. It refactors fusion passes and platform-specific kernels (AITER, VLLM C, XPU) to register and use this new operation, while also updating platform configurations to manage operation priorities. Feedback was provided regarding missing validation checks in the kernel implementations for XPU, AITER, and VLLM C, specifically concerning the handling of offsets and partial rotations where rotary_dim might differ from head_size.

Comment thread vllm/kernels/xpu_ops.py
Comment on lines +57 to +66
if cos_sin_format == "standard":
torch.ops._C.rotary_embedding(
positions,
query,
key,
head_size,
cos_sin_cache,
is_neox_style,
)
return query, key

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The standard format implementation for XPU kernels does not check if offsets is None. If offsets is provided with the standard format, it will be silently ignored by the underlying torch.ops._C.rotary_embedding call, leading to incorrect results. Please add a check for offsets is None in the supports_args or within the implementation.

Suggested change
if cos_sin_format == "standard":
torch.ops._C.rotary_embedding(
positions,
query,
key,
head_size,
cos_sin_cache,
is_neox_style,
)
return query, key
if cos_sin_format == "standard":
assert offsets is None, "Standard RoPE with offsets is not supported on XPU"
torch.ops._C.rotary_embedding(
positions,
query,
key,
head_size,
cos_sin_cache,
is_neox_style,
)
return query, key

Comment thread vllm/kernels/aiter_ops.py
Comment on lines +83 to +94
def rotary_no_offsets_16bit_only(
positions: Tensor,
query: Tensor,
key: Tensor,
head_size: int,
rotary_dim: int,
cos_sin_cache: Tensor,
is_neox_style: bool,
offsets: Tensor | None = None,
cos_sin_format: str = "standard",
) -> bool:
return offsets is None and cos_sin_format == "standard"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The rotary_no_offsets_16bit_only check does not verify that rotary_dim == head_size. The rocm_aiter_triton_rotary_embedding kernel is called with head_size but without rotary_dim, which suggests it might not support partial rotation (where rotary_dim < head_size). If this is the case, the implementation should only be selected when rotary_dim == head_size to avoid incorrect results for models with partial RoPE.

Suggested change
def rotary_no_offsets_16bit_only(
positions: Tensor,
query: Tensor,
key: Tensor,
head_size: int,
rotary_dim: int,
cos_sin_cache: Tensor,
is_neox_style: bool,
offsets: Tensor | None = None,
cos_sin_format: str = "standard",
) -> bool:
return offsets is None and cos_sin_format == "standard"
def rotary_no_offsets_16bit_only(
positions: Tensor,
query: Tensor,
key: Tensor,
head_size: int,
rotary_dim: int,
cos_sin_cache: Tensor,
is_neox_style: bool,
offsets: Tensor | None = None,
cos_sin_format: str = "standard",
) -> bool:
return offsets is None and cos_sin_format == "standard" and rotary_dim == head_size

Comment thread vllm/kernels/vllm_c.py
Comment on lines +64 to +75
def rotary_no_offsets(
positions: Tensor,
query: Tensor,
key: Tensor,
head_size: int,
rotary_dim: int,
cos_sin_cache: Tensor,
is_neox_style: bool,
offsets: Tensor | None = None,
cos_sin_format: str = "standard",
) -> bool:
return offsets is None and cos_sin_format == "standard"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The rotary_no_offsets check should verify that rotary_dim == head_size if the underlying torch.ops._C.rotary_embedding kernel does not support partial rotation. Since the kernel is called with head_size as the 4th argument and no rotary_dim is provided, it is likely to rotate the entire head. This would be incorrect for models where rotary_dim < head_size.

Suggested change
def rotary_no_offsets(
positions: Tensor,
query: Tensor,
key: Tensor,
head_size: int,
rotary_dim: int,
cos_sin_cache: Tensor,
is_neox_style: bool,
offsets: Tensor | None = None,
cos_sin_format: str = "standard",
) -> bool:
return offsets is None and cos_sin_format == "standard"
def rotary_no_offsets(
positions: Tensor,
query: Tensor,
key: Tensor,
head_size: int,
rotary_dim: int,
cos_sin_cache: Tensor,
is_neox_style: bool,
offsets: Tensor | None = None,
cos_sin_format: str = "standard",
) -> bool:
return offsets is None and cos_sin_format == "standard" and rotary_dim == head_size

@ProExpertProg ProExpertProg force-pushed the luka/vllm-ir/rms-norm-inplace branch from 8b5e95e to 6b2a07f Compare April 18, 2026 01:53
@mergify mergify Bot added the intel-gpu Related to Intel GPU label Apr 18, 2026
@mergify

mergify Bot commented Apr 18, 2026

Copy link
Copy Markdown
Contributor

Documentation preview: https://vllm--39488.org.readthedocs.build/en/39488/

@mergify mergify Bot added the documentation Improvements or additions to documentation label Apr 18, 2026
@mergify

mergify Bot commented Apr 18, 2026

Copy link
Copy Markdown
Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @wxsIcey.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
@ProExpertProg ProExpertProg force-pushed the luka/vllm-ir/rms-norm-inplace branch from 7194c47 to 978ffe0 Compare April 27, 2026 16:01
ProExpertProg and others added 3 commits April 28, 2026 01:44
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
ProExpertProg and others added 4 commits April 28, 2026 01:58
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
…ng to vLLM IR Ops

Signed-off-by: Icey <1790571317@qq.com>
@wxsIcey wxsIcey force-pushed the wxs/vllm-ir/rotary-embedding branch from e7aa987 to 1553195 Compare April 28, 2026 02:22
@mergify mergify Bot removed the needs-rebase label Apr 28, 2026
Signed-off-by: Icey <1790571317@qq.com>
@ProExpertProg ProExpertProg force-pushed the luka/vllm-ir/rms-norm-inplace branch 2 times, most recently from cbdcc4f to e71bea3 Compare April 29, 2026 18:41
@ProExpertProg ProExpertProg deleted the branch vllm-project:luka/vllm-ir/rms-norm-inplace May 2, 2026 03:41
@github-project-automation github-project-automation Bot moved this from Todo to Done in AMD May 2, 2026
@github-project-automation github-project-automation Bot moved this to Done in NVIDIA May 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cpu Related to CPU backends deepseek Related to DeepSeek models documentation Improvements or additions to documentation intel-gpu Related to Intel GPU nvidia rocm Related to AMD ROCm

Projects

Status: Done
Status: Done

Development

Successfully merging this pull request may close these issues.

2 participants