[VLM] Optimize Ernie4.5-VL rotary embedding with fused triton kernel#18856
Conversation
Summary of ChangesHello @yuan-luo, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request delivers a substantial performance enhancement for the Ernie4.5-VL model by optimizing its rotary embedding mechanism. It achieves this by replacing a series of fragmented PyTorch operations with a highly efficient, fused Triton kernel. This integration streamlines the computation of rotary positional embeddings for both query and key tensors, leading to faster inference and more efficient GPU utilization. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
|
/tag-and-rerun-ci |
There was a problem hiding this comment.
Code Review
This pull request introduces a fused Triton kernel to optimize the Ernie4.5-VL rotary embedding, which significantly improves performance by reducing kernel launches and intermediate tensor allocations. The changes are well-motivated and the performance gains are substantial. I have a few suggestions to improve code clarity and maintainability by removing redundant code and correcting type hints.
|
/rerun-failed-ci |
2 similar comments
|
/rerun-failed-ci |
|
/rerun-failed-ci |
|
/rerun-failed-ci |
…gl-project#18856) Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
…gl-project#18856) Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Motivation
Current Ernie4.5VL MRoPE occupies a major portion in inference time. It has many small ops which introduces quite a lot of GPU bubbles.
This PR is to introduce a fused Triton kernel for Ernie4.5 VL MRoPE (THW) rotary embedding. It applies rotary positional embeddings to both Q and K in-place for Ernie4.5 VL’s 3D MRoPE layout. The kernel fuses two previously separate steps:
One of the tricky parts of this enhancement is Ernie adopts specific frequency reordering ([h, w, h, w, ..., t, t, t]) from THW positions, which was previously implemented via multiple PyTorch ops (index_select/chunk/stack/reshape/cat) and materialized intermediate tensors. So we can't leverage the existing fused rotary embedding triton kernel for the traditional [t, h, w, t, h, w, ... ] layout.
Instead of constructing (cos, sin) tensors on the Python side, the kernel selects the appropriate position (h vs w interleaved for the spatial section, and t for the temporal tail) per rotary pair index and directly gathers cos/sin from the cache. This eliminates intermediate allocations, reduces kernel launches, and improves memory locality by keeping the entire operation within a single Triton launch.
Benchmarking and Profiling
In our profiling, the fused kernel reduces the Ernie4.5 VL rotary embedding time from 670µs to 123µs, RotaryEmbedding kernel wise 5.4x speedup. (≈17% end-to-end improvement for the measured workload)
Before PR:

After PR:


zoom in:
Without PR:
root@c7e9bb6a6789:/sgl-workspace/bench_script# bash bench_n_image.sh
{"id":"804d220f0f3a43feb9de208b67289ed9","object":"chat.completion","created":1771144593,"model":"auto","choices":[{"index":0,"message":{"role":"assistant","content":"图中植物是刺芹,属于伞形科刺芹属的多年生草本植物。其茎直立,有刺,叶片羽状分裂,边缘有尖锐的刺齿。刺芹常见于野外,具有一定的观赏价值,同时它也是一些动物的食物来源。\n刺芹含有挥发油等化学成分,具有一定的香气。需要注意的是,虽然刺芹本身不是传统意义上的有毒植物,但其茎叶上的刺可能对皮肤造成机械性刺激或损伤,接触后可能会引起不适。此外,对于某些特定人群(如过敏体质者)来说,接触或食用刺芹可能会引发过敏反应。\n在野外遇到刺芹时,建议不要随意采摘或食用,以免发生意外。如果需要了解某种植物是否可食用或具有药用价值,最好咨询专业的植物学家或相关领域的专家。","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":2}],"usage":{"prompt_tokens":973,"total_tokens":1143,"completion_tokens":170,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
real 0m1.317s
user 0m0.002s
sys 0m0.004s
With PR:
root@c7e9bb6a6789:/sgl-workspace/bench_script# bash bench_n_image.sh
{"id":"5c2afe89093346e48bfc153baeafbf56","object":"chat.completion","created":1771145589,"model":"auto","choices":[{"index":0,"message":{"role":"assistant","content":"图中植物是刺芹,属于伞形科刺芹属的多年生草本植物。其茎直立,有刺,叶片羽状分裂,边缘有尖锐的刺齿。刺芹常见于野外,具有一定的观赏价值,同时它也是一些动物的食物来源。\n刺芹含有挥发油等化学成分,具有一定的香气。需要注意的是,虽然刺芹本身不是传统意义上的有毒植物,但其茎叶上的刺可能会对皮肤造成机械性损伤,接触后可能引起不适。此外,对于某些特定人群(如过敏体质者)来说,接触或误食刺芹可能会引发过敏反应或其他不良反应。因此,在野外遇到刺芹时,应避免随意触摸或采摘食用。","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":2}],"usage":{"prompt_tokens":973,"total_tokens":1119,"completion_tokens":146,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
real 0m1.085s
user 0m0.003s
sys 0m0.002s
Modifications
Accuracy Tests
Main:
PR:
Checklist
Review Process
/tag-run-ci-label,/rerun-failed-ci,/tag-and-rerun-ci