
[VLM] Optimize Ernie4.5-VL rotary embedding with fused triton kernel#18856

Merged
BBuf merged 1 commit into sgl-project:main from
antgroup:optimize_ernie_vl_rotary_embedding
Feb 16, 2026

Conversation

@yuan-luo
Collaborator

@yuan-luo yuan-luo commented Feb 15, 2026

Motivation

The current Ernie4.5-VL MRoPE accounts for a major portion of inference time. It is composed of many small ops, which introduce a lot of GPU bubbles.

python3 -m sglang.launch_server --model-path baidu/ERNIE-4.5-VL-28B-A3B-PT \
  --served-model-name ERNIE-45-VL-28B \
  --port 30000 \
  --trust-remote-code
[profiler timeline screenshots showing the small-op GPU bubbles]

This PR introduces a fused Triton kernel for the Ernie4.5-VL MRoPE (THW) rotary embedding. It applies rotary positional embeddings to both Q and K in place for Ernie4.5-VL's 3D MRoPE layout. The kernel fuses two previously separate steps: the Ernie-specific frequency reordering and the rotary rotation itself.
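For reference, the rotary rotation step on its own looks like the following sketch. NumPy is used so the snippet is self-contained; the half-split (GPT-NeoX-style) rotation shown here is an assumption for illustration, not a claim about Ernie's exact variant.

```python
import numpy as np

def apply_rotary(x, cos, sin):
    """Reference rotary application (half-split variant, assumed for
    illustration). x: (num_tokens, rotary_dim);
    cos/sin: (num_tokens, rotary_dim // 2)."""
    d = x.shape[-1] // 2
    x1, x2 = x[..., :d], x[..., d:]
    # Rotate each (x1, x2) pair by the angle encoded in (cos, sin).
    return np.concatenate([x1 * cos - x2 * sin, x2 * cos + x1 * sin], axis=-1)
```

With cos = 1 and sin = 0 the rotation is the identity, which makes a quick sanity check easy.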

One tricky part of this enhancement is that Ernie adopts a specific frequency reordering ([h, w, h, w, ..., t, t, t]) of the THW positions, which was previously implemented via multiple PyTorch ops (index_select/chunk/stack/reshape/cat) that materialized intermediate tensors. As a result, we can't reuse the existing fused rotary embedding Triton kernel, which assumes the traditional [t, h, w, t, h, w, ...] layout.
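The unfused construction being replaced can be sketched as follows. This is a NumPy reference for the interleave-then-concat structure only; the function name, the `mrope_section` tuple, and the exact cache-column mapping are illustrative assumptions, not SGLang's actual API.

```python
import numpy as np

def build_ernie_cos_sin(cos_cache, sin_cache, positions, mrope_section):
    """Reference (unfused) construction of cos/sin for Ernie4.5-VL's
    [h, w, h, w, ..., t, t, t] frequency layout.

    Assumed shapes (illustrative):
      positions: (3, num_tokens) holding t/h/w position ids
      cos_cache/sin_cache: (max_pos, rotary_dim // 2)
      mrope_section: (t_len, h_len, w_len) pair counts
    """
    t_len, h_len, w_len = mrope_section
    t_pos, h_pos, w_pos = positions

    def gather(cache):
        t, h, w = cache[t_pos], cache[h_pos], cache[w_pos]
        # Interleave h/w pairs for the spatial section: h0, w0, h1, w1, ...
        hw = np.stack([h[:, :h_len], w[:, :w_len]], axis=-1).reshape(h.shape[0], -1)
        # ...then append the temporal tail from the remaining columns.
        return np.concatenate([hw, t[:, h_len + w_len:]], axis=-1)

    return gather(cos_cache), gather(sin_cache)
```

Every `stack`/`reshape`/`concatenate` here materializes an intermediate tensor, which is exactly the overhead the fused kernel removes.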

Instead of constructing (cos, sin) tensors on the Python side, the kernel selects the appropriate position (h vs w interleaved for the spatial section, and t for the temporal tail) per rotary pair index and directly gathers cos/sin from the cache. This eliminates intermediate allocations, reduces kernel launches, and improves memory locality by keeping the entire operation within a single Triton launch.
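The per-pair selection the kernel performs can be expressed in a few lines of pure Python. `position_for_pair` is a hypothetical helper that mirrors the index arithmetic described above; it is not SGLang code.

```python
def position_for_pair(i, h_len, w_len, t_pos, h_pos, w_pos):
    """For rotary pair index i under the [h, w, h, w, ..., t, t, t]
    layout, return which position id cos/sin is gathered from."""
    spatial = h_len + w_len  # length of the interleaved h/w section
    if i < spatial:
        # Even pair indices use the h position, odd ones the w position.
        return h_pos if i % 2 == 0 else w_pos
    return t_pos  # temporal tail
```

In the Triton kernel this branch becomes a masked select over a vector of pair indices, so the whole gather stays inside one launch.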

Benchmarking and Profiling

In our profiling, the fused kernel reduces the Ernie4.5-VL rotary embedding time from 670µs to 123µs, a 5.4x kernel-level speedup for RotaryEmbedding (≈17% end-to-end improvement for the measured workload).
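The quoted kernel speedup follows directly from the two timings:

```python
before_us, after_us = 670.0, 123.0
speedup = before_us / after_us
print(f"RotaryEmbedding kernel speedup: {speedup:.1f}x")  # 5.4x
```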

Before PR:
[profiler timeline screenshot]

After PR:
[profiler timeline screenshot]

Zoom in:
[profiler timeline screenshot]

Without PR:
root@c7e9bb6a6789:/sgl-workspace/bench_script# bash bench_n_image.sh
{"id":"804d220f0f3a43feb9de208b67289ed9","object":"chat.completion","created":1771144593,"model":"auto","choices":[{"index":0,"message":{"role":"assistant","content":"图中植物是刺芹,属于伞形科刺芹属的多年生草本植物。其茎直立,有刺,叶片羽状分裂,边缘有尖锐的刺齿。刺芹常见于野外,具有一定的观赏价值,同时它也是一些动物的食物来源。\n刺芹含有挥发油等化学成分,具有一定的香气。需要注意的是,虽然刺芹本身不是传统意义上的有毒植物,但其茎叶上的刺可能对皮肤造成机械性刺激或损伤,接触后可能会引起不适。此外,对于某些特定人群(如过敏体质者)来说,接触或食用刺芹可能会引发过敏反应。\n在野外遇到刺芹时,建议不要随意采摘或食用,以免发生意外。如果需要了解某种植物是否可食用或具有药用价值,最好咨询专业的植物学家或相关领域的专家。","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":2}],"usage":{"prompt_tokens":973,"total_tokens":1143,"completion_tokens":170,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
real 0m1.317s
user 0m0.002s
sys 0m0.004s

With PR:
root@c7e9bb6a6789:/sgl-workspace/bench_script# bash bench_n_image.sh
{"id":"5c2afe89093346e48bfc153baeafbf56","object":"chat.completion","created":1771145589,"model":"auto","choices":[{"index":0,"message":{"role":"assistant","content":"图中植物是刺芹,属于伞形科刺芹属的多年生草本植物。其茎直立,有刺,叶片羽状分裂,边缘有尖锐的刺齿。刺芹常见于野外,具有一定的观赏价值,同时它也是一些动物的食物来源。\n刺芹含有挥发油等化学成分,具有一定的香气。需要注意的是,虽然刺芹本身不是传统意义上的有毒植物,但其茎叶上的刺可能会对皮肤造成机械性损伤,接触后可能引起不适。此外,对于某些特定人群(如过敏体质者)来说,接触或误食刺芹可能会引发过敏反应或其他不良反应。因此,在野外遇到刺芹时,应避免随意触摸或采摘食用。","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":2}],"usage":{"prompt_tokens":973,"total_tokens":1119,"completion_tokens":146,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
real 0m1.085s
user 0m0.003s
sys 0m0.002s

Modifications

Accuracy Tests

Main:

root@c7e9bb6a6789:/sgl-workspace/sglang# python3 -m lmms_eval --model openai_compatible   --model_args model_version=baidu/ERNIE-4.5-VL-28B-A3B-PT   --tasks mmmu_val   --batch_size 16
2026-02-15 10:53:36 | INFO     | __main__:cli_evaluate:311 - Verbosity set to INFO
2026-02-15 10:53:38 | INFO     | __main__:cli_evaluate_single:400 - Evaluation tracker args: {'token': 'hf_***REDACTED***'}
2026-02-15 10:53:38 | INFO     | __main__:cli_evaluate_single:480 - Selected Tasks: ['mmmu_val']
2026-02-15 10:53:38 | INFO     | lmms_eval.evaluator:simple_evaluate:162 - Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2026-02-15 10:53:41 | INFO     | lmms_eval.evaluator:evaluate:403 - Running on rank 0 (local rank 0)
2026-02-15 10:53:41 | INFO     | lmms_eval.api.task:build_all_requests:431 - Building contexts for mmmu_val on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 900/900 [00:00<00:00, 14077.68it/s]
2026-02-15 10:53:41 | INFO     | lmms_eval.evaluator:evaluate:496 - Running generate_until requests
Model Responding: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 57/57 [11:47<00:00,  9.84s/it]2026-02-15 11:05:29 | INFO     | lmms_eval.models.model_utils.gen_metrics:log_metrics:48 - Metric summary - Total time: 8564.083s, Total tokens: 112885, Avg speed: 13.2 tokens/s
Model Responding: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 57/57 [11:47<00:00, 12.42s/it]
Postprocessing: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 900/900 [00:00<00:00, 9288.82it/s]
{'Overall-Art and Design': {'num': 120, 'acc': 0.26667}, 'Art': {'num': 30, 'acc': 0.23333}, 'Art_Theory': {'num': 30, 'acc': 0.23333}, 'Design': {'num': 30, 'acc': 0.36667}, 'Music': {'num': 30, 'acc': 0.23333}, 'Overall-Business': {'num': 150, 'acc': 0.22667}, 'Accounting': {'num': 30, 'acc': 0.16667}, 'Economics': {'num': 30, 'acc': 0.26667}, 'Finance': {'num': 30, 'acc': 0.23333}, 'Manage': {'num': 30, 'acc': 0.2}, 'Marketing': {'num': 30, 'acc': 0.26667}, 'Overall-Science': {'num': 150, 'acc': 0.27333}, 'Biology': {'num': 30, 'acc': 0.33333}, 'Chemistry': {'num': 30, 'acc': 0.13333}, 'Geography': {'num': 30, 'acc': 0.26667}, 'Math': {'num': 30, 'acc': 0.36667}, 'Physics': {'num': 30, 'acc': 0.26667}, 'Overall-Health and Medicine': {'num': 150, 'acc': 0.26667}, 'Basic_Medical_Science': {'num': 30, 'acc': 0.3}, 'Clinical_Medicine': {'num': 30, 'acc': 0.23333}, 'Diagnostics_and_Laboratory_Medicine': {'num': 30, 'acc': 0.26667}, 'Pharmacy': {'num': 30, 'acc': 0.23333}, 'Public_Health': {'num': 30, 'acc': 0.3}, 'Overall-Humanities and Social Science': {'num': 120, 'acc': 0.275}, 'History': {'num': 30, 'acc': 0.5}, 'Literature': {'num': 30, 'acc': 0.4}, 'Sociology': {'num': 30, 'acc': 0.13333}, 'Psychology': {'num': 30, 'acc': 0.06667}, 'Overall-Tech and Engineering': {'num': 210, 'acc': 0.27619}, 'Agriculture': {'num': 30, 'acc': 0.33333}, 'Architecture_and_Engineering': {'num': 30, 'acc': 0.2}, 'Computer_Science': {'num': 30, 'acc': 0.3}, 'Electronics': {'num': 30, 'acc': 0.23333}, 'Energy_and_Power': {'num': 30, 'acc': 0.3}, 'Materials': {'num': 30, 'acc': 0.23333}, 'Mechanical_Engineering': {'num': 30, 'acc': 0.33333}, 'Overall': {'num': 900, 'acc': 0.26444}}
2026-02-15 11:05:30 | INFO     | lmms_eval.loggers.evaluation_tracker:save_results_aggregated:239 - Output path not provided, skipping saving results aggregated
openai_compatible (model_version=baidu/ERNIE-4.5-VL-28B-A3B-PT), gen_kwargs: (), limit: None, num_fewshot: None, batch_size: 16
| Tasks  |Version|Filter|n-shot| Metric |   |Value |   |Stderr|Stderr_CLT|Stderr_Clustered|
|--------|------:|------|-----:|--------|---|-----:|---|------|----------|----------------|
|mmmu_val|      0|none  |     0|mmmu_acc|↑  |0.2644|±  |N/A   |N/A       |N/A             |

PR:

root@c7e9bb6a6789:/sgl-workspace/bench_script# python3 -m lmms_eval --model openai_compatible   --model_args model_version=baidu/ERNIE-4.5-VL-28B-A3B-PT   --tasks mmmu_val   --batch_size 16
2026-02-16 02:20:12 | INFO     | __main__:cli_evaluate:311 - Verbosity set to INFO
2026-02-16 02:20:14 | INFO     | __main__:cli_evaluate_single:400 - Evaluation tracker args: {'token': 'hf_***REDACTED***'}
2026-02-16 02:20:14 | INFO     | __main__:cli_evaluate_single:480 - Selected Tasks: ['mmmu_val']
2026-02-16 02:20:14 | INFO     | lmms_eval.evaluator:simple_evaluate:162 - Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2026-02-16 02:20:17 | INFO     | lmms_eval.evaluator:evaluate:403 - Running on rank 0 (local rank 0)
2026-02-16 02:20:17 | INFO     | lmms_eval.api.task:build_all_requests:431 - Building contexts for mmmu_val on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 900/900 [00:00<00:00, 13926.80it/s]
2026-02-16 02:20:17 | INFO     | lmms_eval.evaluator:evaluate:496 - Running generate_until requests
Model Responding: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 57/57 [12:38<00:00, 10.69s/it]2026-02-16 02:32:56 | INFO     | lmms_eval.models.model_utils.gen_metrics:log_metrics:48 - Metric summary - Total time: 10344.727s, Total tokens: 112510, Avg speed: 10.9 tokens/s
Model Responding: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 57/57 [12:38<00:00, 13.31s/it]
Postprocessing: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 900/900 [00:00<00:00, 9167.37it/s]
{'Overall-Art and Design': {'num': 120, 'acc': 0.25833}, 'Art': {'num': 30, 'acc': 0.16667}, 'Art_Theory': {'num': 30, 'acc': 0.26667}, 'Design': {'num': 30, 'acc': 0.43333}, 'Music': {'num': 30, 'acc': 0.16667}, 'Overall-Business': {'num': 150, 'acc': 0.18667}, 'Accounting': {'num': 30, 'acc': 0.13333}, 'Economics': {'num': 30, 'acc': 0.16667}, 'Finance': {'num': 30, 'acc': 0.16667}, 'Manage': {'num': 30, 'acc': 0.23333}, 'Marketing': {'num': 30, 'acc': 0.23333}, 'Overall-Science': {'num': 150, 'acc': 0.29333}, 'Biology': {'num': 30, 'acc': 0.3}, 'Chemistry': {'num': 30, 'acc': 0.3}, 'Geography': {'num': 30, 'acc': 0.26667}, 'Math': {'num': 30, 'acc': 0.26667}, 'Physics': {'num': 30, 'acc': 0.33333}, 'Overall-Health and Medicine': {'num': 150, 'acc': 0.32667}, 'Basic_Medical_Science': {'num': 30, 'acc': 0.4}, 'Clinical_Medicine': {'num': 30, 'acc': 0.3}, 'Diagnostics_and_Laboratory_Medicine': {'num': 30, 'acc': 0.26667}, 'Pharmacy': {'num': 30, 'acc': 0.26667}, 'Public_Health': {'num': 30, 'acc': 0.4}, 'Overall-Humanities and Social Science': {'num': 120, 'acc': 0.25}, 'History': {'num': 30, 'acc': 0.33333}, 'Literature': {'num': 30, 'acc': 0.43333}, 'Sociology': {'num': 30, 'acc': 0.13333}, 'Psychology': {'num': 30, 'acc': 0.1}, 'Overall-Tech and Engineering': {'num': 210, 'acc': 0.25714}, 'Agriculture': {'num': 30, 'acc': 0.2}, 'Architecture_and_Engineering': {'num': 30, 'acc': 0.16667}, 'Computer_Science': {'num': 30, 'acc': 0.26667}, 'Electronics': {'num': 30, 'acc': 0.1}, 'Energy_and_Power': {'num': 30, 'acc': 0.6}, 'Materials': {'num': 30, 'acc': 0.2}, 'Mechanical_Engineering': {'num': 30, 'acc': 0.26667}, 'Overall': {'num': 900, 'acc': 0.26222}}
2026-02-16 02:32:57 | INFO     | lmms_eval.loggers.evaluation_tracker:save_results_aggregated:239 - Output path not provided, skipping saving results aggregated
openai_compatible (model_version=baidu/ERNIE-4.5-VL-28B-A3B-PT), gen_kwargs: (), limit: None, num_fewshot: None, batch_size: 16
| Tasks  |Version|Filter|n-shot| Metric |   |Value |   |Stderr|Stderr_CLT|Stderr_Clustered|
|--------|------:|------|-----:|--------|---|-----:|---|------|----------|----------------|
|mmmu_val|      0|none  |     0|mmmu_acc|↑  |0.2622|±  |N/A   |N/A       |N/A             |

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist
Contributor

Summary of Changes

Hello @yuan-luo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request delivers a substantial performance enhancement for the Ernie4.5-VL model by optimizing its rotary embedding mechanism. It achieves this by replacing a series of fragmented PyTorch operations with a highly efficient, fused Triton kernel. This integration streamlines the computation of rotary positional embeddings for both query and key tensors, leading to faster inference and more efficient GPU utilization.

Highlights

  • Performance Optimization: Introduced a fused Triton kernel for Ernie4.5 VL MRoPE, significantly reducing inference time from ~670µs to ~123µs, leading to an overall 17% end-to-end speedup.
  • Kernel Fusion: Fused Ernie-specific frequency reordering and rotary rotation into a single Triton kernel, eliminating multiple PyTorch operations and intermediate tensor materializations.
  • Memory Efficiency: Improved memory locality and reduced kernel launches by directly gathering cos/sin values from cache within the Triton kernel, avoiding the construction of separate cos/sin tensors on the Python side.


Changelog
  • python/sglang/srt/layers/rotary_embedding.py
    • Added _triton_ernie45_rope_qk_fused Triton kernel for in-place rotary embedding application.
    • Introduced triton_ernie45_rope_fused_inplace Python function to orchestrate the Triton kernel launch.
    • Updated Ernie4_5_VLRotaryEmbedding.__init__ to store additional parameters and compile _apply_rotary_emb.
    • Modified Ernie4_5_VLRotaryEmbedding.forward_native to utilize the newly compiled _apply_rotary_emb_wrapped.
    • Enhanced Ernie4_5_VLRotaryEmbedding.forward_cuda to dispatch to the new fused Triton kernel for 2D positions, while retaining existing CUDA or native fallbacks for 1D positions.
    • Changed Ernie4_5_VLRotaryEmbedding.forward to directly call forward_cuda.
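The 2D-vs-1D dispatch described in the changelog can be sketched as below. The function and stub names are placeholders for illustration, not SGLang's actual code; only the shape-based branching reflects the change.

```python
import numpy as np

def fused_rope(positions, q, k):
    """Stand-in for the fused Triton path (placeholder)."""
    return "fused-triton"

def native_rope(positions, q, k):
    """Stand-in for the existing 1D fallback (placeholder)."""
    return "native"

def dispatch_rope(positions, q, k):
    # Per the changelog: 2D (3, num_tokens) multimodal positions take the
    # fused Triton kernel; 1D text-only positions keep the existing path.
    if positions.ndim == 2:
        return fused_rope(positions, q, k)
    return native_rope(positions, q, k)
```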
Activity
  • No specific activity (comments, reviews, progress updates) was provided in the context.

@yuan-luo
Collaborator Author

/tag-and-rerun-ci

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a fused Triton kernel to optimize the Ernie4.5-VL rotary embedding, which significantly improves performance by reducing kernel launches and intermediate tensor allocations. The changes are well-motivated and the performance gains are substantial. I have a few suggestions to improve code clarity and maintainability by removing redundant code and correcting type hints.

Comment thread python/sglang/srt/layers/rotary_embedding.py
Comment thread python/sglang/srt/layers/rotary_embedding.py
Comment thread python/sglang/srt/layers/rotary_embedding.py
@yuan-luo yuan-luo changed the title [VLM] Optimize Ernie4.5-VL rotary embedding with fused triton kernel [WIP][VLM] Optimize Ernie4.5-VL rotary embedding with fused triton kernel Feb 15, 2026
@yuan-luo yuan-luo changed the title [WIP][VLM] Optimize Ernie4.5-VL rotary embedding with fused triton kernel [VLM] Optimize Ernie4.5-VL rotary embedding with fused triton kernel Feb 15, 2026
@yuan-luo
Collaborator Author

/rerun-failed-ci

2 similar comments
@yuan-luo
Collaborator Author

/rerun-failed-ci

@yuan-luo
Collaborator Author

/rerun-failed-ci

Collaborator

@BBuf BBuf left a comment


LGTM. Wait for ci.

@BBuf
Collaborator

BBuf commented Feb 15, 2026

/rerun-failed-ci

@BBuf
Collaborator

BBuf commented Feb 16, 2026

@BBuf BBuf merged commit 8a82c70 into sgl-project:main Feb 16, 2026
526 of 566 checks passed
@yuan-luo yuan-luo deleted the optimize_ernie_vl_rotary_embedding branch February 18, 2026 14:04
magicYang1573 pushed a commit to magicYang1573/sglang that referenced this pull request Mar 9, 2026
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026