[vLLM IR][Rope] Port RotaryEmbedding and DeepseekScalingRotaryEmbedding to IR Ops#39488
[vLLM IR][Rope] Port RotaryEmbedding and DeepseekScalingRotaryEmbedding to IR Ops#39488wxsIcey wants to merge 15 commits into
Conversation
|
Some test files need to be modified. |
| sin, | ||
| is_neox_style, | ||
| ) | ||
| elif cos_sin_format == "deepseek": |
There was a problem hiding this comment.
I'm unsure if registering with the same IR ops is the best approach.
| cos_sin_cache = self._match_cos_sin_cache_dtype(query) | ||
| return self.forward_static( | ||
| if key is None: | ||
| return self.forward_static( |
There was a problem hiding this comment.
PyTorch custom op registration does not support None (i.e., Optional[Tensor]) in return types. So when key is None, I fall it to static implement.
(EngineCore_DP0 pid=115375) File "/home/wangxiaoshuang/vllm/.venv/lib/python3.11/site-packages/torch/_library/infer_schema.py", line 71, in error_fn
(EngineCore_DP0 pid=115375) raise ValueError(f"infer_schema(func): {what} Got func with signature {sig})")
(EngineCore_DP0 pid=115375) ValueError: infer_schema(func): Return has unsupported type tuple[torch.Tensor, torch.Tensor | None]. The valid types are: {<class 'torch.Tensor'>: 'Tensor', typing.List[torch.Tensor]: 'Tensor[]', list[torch.Tensor]: 'Tensor[]', <class 'int'>: 'SymInt', <class 'float'>: 'float', <class 'bool'>: 'bool', int | float | bool: 'Scalar'}. Got func with signature (positions: torch.Tensor, query: torch.Tensor, key: torch.Tensor | None, head_size: int, cos_sin_cache: torch.Tensor, is_neox_style: bool) -> tuple[torch.Tensor, torch.Tensor | None])
[rank0]:[W318 06:36:34.959129788 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
There was a problem hiding this comment.
Code Review
This pull request introduces a unified rotary_embedding IR operation to replace various ad-hoc implementations across the codebase. It refactors fusion passes and platform-specific kernels (AITER, VLLM C, XPU) to register and use this new operation, while also updating platform configurations to manage operation priorities. Feedback was provided regarding missing validation checks in the kernel implementations for XPU, AITER, and VLLM C, specifically concerning the handling of offsets and partial rotations where rotary_dim might differ from head_size.
| if cos_sin_format == "standard": | ||
| torch.ops._C.rotary_embedding( | ||
| positions, | ||
| query, | ||
| key, | ||
| head_size, | ||
| cos_sin_cache, | ||
| is_neox_style, | ||
| ) | ||
| return query, key |
There was a problem hiding this comment.
The standard format implementation for XPU kernels does not check if offsets is None. If offsets is provided with the standard format, it will be silently ignored by the underlying torch.ops._C.rotary_embedding call, leading to incorrect results. Please add a check for offsets is None in the supports_args or within the implementation.
| if cos_sin_format == "standard": | |
| torch.ops._C.rotary_embedding( | |
| positions, | |
| query, | |
| key, | |
| head_size, | |
| cos_sin_cache, | |
| is_neox_style, | |
| ) | |
| return query, key | |
| if cos_sin_format == "standard": | |
| assert offsets is None, "Standard RoPE with offsets is not supported on XPU" | |
| torch.ops._C.rotary_embedding( | |
| positions, | |
| query, | |
| key, | |
| head_size, | |
| cos_sin_cache, | |
| is_neox_style, | |
| ) | |
| return query, key |
| def rotary_no_offsets_16bit_only( | ||
| positions: Tensor, | ||
| query: Tensor, | ||
| key: Tensor, | ||
| head_size: int, | ||
| rotary_dim: int, | ||
| cos_sin_cache: Tensor, | ||
| is_neox_style: bool, | ||
| offsets: Tensor | None = None, | ||
| cos_sin_format: str = "standard", | ||
| ) -> bool: | ||
| return offsets is None and cos_sin_format == "standard" |
There was a problem hiding this comment.
The rotary_no_offsets_16bit_only check does not verify that rotary_dim == head_size. The rocm_aiter_triton_rotary_embedding kernel is called with head_size but without rotary_dim, which suggests it might not support partial rotation (where rotary_dim < head_size). If this is the case, the implementation should only be selected when rotary_dim == head_size to avoid incorrect results for models with partial RoPE.
| def rotary_no_offsets_16bit_only( | |
| positions: Tensor, | |
| query: Tensor, | |
| key: Tensor, | |
| head_size: int, | |
| rotary_dim: int, | |
| cos_sin_cache: Tensor, | |
| is_neox_style: bool, | |
| offsets: Tensor | None = None, | |
| cos_sin_format: str = "standard", | |
| ) -> bool: | |
| return offsets is None and cos_sin_format == "standard" | |
| def rotary_no_offsets_16bit_only( | |
| positions: Tensor, | |
| query: Tensor, | |
| key: Tensor, | |
| head_size: int, | |
| rotary_dim: int, | |
| cos_sin_cache: Tensor, | |
| is_neox_style: bool, | |
| offsets: Tensor | None = None, | |
| cos_sin_format: str = "standard", | |
| ) -> bool: | |
| return offsets is None and cos_sin_format == "standard" and rotary_dim == head_size |
| def rotary_no_offsets( | ||
| positions: Tensor, | ||
| query: Tensor, | ||
| key: Tensor, | ||
| head_size: int, | ||
| rotary_dim: int, | ||
| cos_sin_cache: Tensor, | ||
| is_neox_style: bool, | ||
| offsets: Tensor | None = None, | ||
| cos_sin_format: str = "standard", | ||
| ) -> bool: | ||
| return offsets is None and cos_sin_format == "standard" |
There was a problem hiding this comment.
The rotary_no_offsets check should verify that rotary_dim == head_size if the underlying torch.ops._C.rotary_embedding kernel does not support partial rotation. Since the kernel is called with head_size as the 4th argument and no rotary_dim is provided, it is likely to rotate the entire head. This would be incorrect for models where rotary_dim < head_size.
| def rotary_no_offsets( | |
| positions: Tensor, | |
| query: Tensor, | |
| key: Tensor, | |
| head_size: int, | |
| rotary_dim: int, | |
| cos_sin_cache: Tensor, | |
| is_neox_style: bool, | |
| offsets: Tensor | None = None, | |
| cos_sin_format: str = "standard", | |
| ) -> bool: | |
| return offsets is None and cos_sin_format == "standard" | |
| def rotary_no_offsets( | |
| positions: Tensor, | |
| query: Tensor, | |
| key: Tensor, | |
| head_size: int, | |
| rotary_dim: int, | |
| cos_sin_cache: Tensor, | |
| is_neox_style: bool, | |
| offsets: Tensor | None = None, | |
| cos_sin_format: str = "standard", | |
| ) -> bool: | |
| return offsets is None and cos_sin_format == "standard" and rotary_dim == head_size |
8b5e95e to
6b2a07f
Compare
|
Documentation preview: https://vllm--39488.org.readthedocs.build/en/39488/ |
|
This pull request has merge conflicts that must be resolved before it can be |
6b2a07f to
7194c47
Compare
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
7194c47 to
978ffe0
Compare
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
…ng to vLLM IR Ops Signed-off-by: Icey <1790571317@qq.com>
e7aa987 to
1553195
Compare
Signed-off-by: Icey <1790571317@qq.com>
cbdcc4f to
e71bea3
Compare
Purpose
Port
RotaryEmbeddingandDeepseekScalingRotaryEmbeddingto vLLM IR. For the baseRotaryEmbedding, keep thekey=Nonepath out of the IRmaybe_inplacecall and fall back to the existing static implementation.This preserves cross-layer KV sharing behavior and avoids
torch.libraryschema inference failures on optional returns.Test Plan
To be tested
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.