[Feature] Fuse mrope all in 1 kernel#14906
Conversation
Summary of Changes (Gemini Code Assist): This pull request introduces a significant optimization to the multimodal Rotary Positional Embedding (mRoPE) implementation by consolidating several operations into a single fused Triton kernel. The change aims to improve inference efficiency, particularly decode throughput, by reducing kernel-launch overhead and improving data locality on the GPU, and it simplifies the underlying API.
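For readers unfamiliar with the operation being fused: rotary embedding rotates pairs of query/key channels by position-dependent angles. A minimal NumPy reference sketch of that math (the function name and shapes here are illustrative, not the PR's Triton kernel):

```python
import numpy as np

def apply_rope(x, pos, theta=10000.0):
    """Rotary embedding on x of shape (seq, dim), rotate-half layout.

    Illustrative reference only -- the PR fuses this (plus the
    multimodal position handling) into one Triton kernel.
    """
    seq, dim = x.shape
    half = dim // 2
    inv_freq = 1.0 / (theta ** (np.arange(half) / half))   # (half,)
    angles = pos[:, None] * inv_freq[None, :]              # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
q_rot = apply_rope(q, np.arange(4))
print(q_rot.shape)  # (4, 8)
```

At position 0 the rotation is the identity, and the rotation preserves vector norms, which makes this easy to sanity-check numerically.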
Code Review
This pull request refactors the multimodal rotary positional embedding (mrope) operation by fusing it into a single Triton kernel. This change simplifies the Python code by offloading the complex logic to the kernel and, as shown in the benchmarks, improves performance. The changes are well-implemented. I have a couple of minor suggestions to further improve code quality and readability.
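The "complex logic" being moved into the kernel is the multimodal part of mrope: the head dimension is split into sections, and each section reads its position from a different stream (temporal/height/width, Qwen2-VL style). A rough NumPy sketch of the angle computation under that assumption — names and section sizes are illustrative, not the PR's actual kernel:

```python
import numpy as np

def mrope_angles(position_ids, half_dim, sections, theta=10000.0):
    """Per-frequency rotation angles for multimodal RoPE.

    position_ids: (3, seq) -- temporal/height/width position streams
    (Qwen2-VL style). sections: how many of the half_dim frequencies
    each stream owns, e.g. [2, 3, 3]. Illustrative sketch only; the
    PR's fused Triton kernel computes this inline.
    """
    assert sum(sections) == half_dim
    inv_freq = 1.0 / (theta ** (np.arange(half_dim) / half_dim))
    # Map each frequency index to its position stream: 0=t, 1=h, 2=w.
    stream = np.repeat(np.arange(len(sections)), sections)  # (half_dim,)
    pos = position_ids[stream, :].T                         # (seq, half_dim)
    return pos * inv_freq[None, :]

# With identical t/h/w streams this degenerates to ordinary RoPE angles.
pos_ids = np.tile(np.arange(5), (3, 1))
angles = mrope_angles(pos_ids, 8, [2, 3, 3])
print(angles.shape)  # (5, 8)
```

Doing this section gather, the cos/sin, and the rotation in one kernel avoids materializing the intermediate position/angle tensors, which is where the data-locality win comes from.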
Could you please paste the test_mrope.py result?

All passed @yuan-luo.

Awesome. Thanks.
@DarkSharpness Can you please take a look at this failure?
Fixed by bypassing the shape check. Local test passes on H200:

```
python3 -m sglang.launch_server --model-path "Qwen/Qwen2.5-VL-7B-Instruct" \
    --enable-piecewise-cuda-graph --piecewise-cuda-graph-compiler eager \
    --disable-radix-cache --load-format dummy
```


Motivation
Same as #13199.
Modifications
Further fuse all mrope ops into one kernel. Not 100% sure we have the best possible performance, but it should be better than the original implementation.

Accuracy Tests
Benchmarking and Profiling
Before:
Decode throughput: 177 token/s

After:
Decode throughput: 189 token/s
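The decode-throughput numbers above correspond to roughly a 6.8% improvement:

```python
# Decode throughput before/after, in token/s, from the benchmark above.
before, after = 177, 189
gain = after / before - 1
print(f"{gain:.1%}")  # prints "6.8%"
```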

Checklist