[sgl-kernel] add rotary embed kernel for trivial head_sizes by mickqian · Pull Request #6530 · sgl-project/sglang

mickqian · 2025-05-22T12:02:29Z

Motivation

Previously, for attention with a head_size not in [64, 128, 256, 512] (which is common in multimodal attention), sgl fell back to the rotary_embedding kernel from vllm.

This PR copies and adapts that kernel, with minor improvements.

Modifications

Add a rotary-embedding kernel for head_sizes outside [64, 128, 256, 512]
Add corresponding tests
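
For readers unfamiliar with the operation, here is a minimal pure-Python sketch of a NeoX-style rotary embedding applied to a single head vector. It is a reference illustration only, not the CUDA kernel added by this PR. The function name `rotary_embed_neox`, the choice of NeoX-style pairing, and the values `head_size = 80` and `base = 1000000.0` are assumptions, loosely inspired by the benchmark IDs, whose fields are not labeled in the PR.

```python
import math


def rotary_embed_neox(x, position, rotary_dim, base=1_000_000.0):
    """Rotate the first `rotary_dim` entries of one head vector.

    Hypothetical reference helper: element i is paired with element
    i + rotary_dim // 2 (NeoX-style), and each pair is rotated by an
    angle that depends on the token position and the pair index.
    """
    half = rotary_dim // 2
    out = list(x)  # entries beyond rotary_dim pass through unchanged
    for i in range(half):
        # Angle for pair i: position * base^(-2i / rotary_dim)
        theta = position * base ** (-2.0 * i / rotary_dim)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + half]
        out[i] = a * c - b * s
        out[i + half] = b * c + a * s
    return out


# A head_size outside [64, 128, 256, 512], as discussed in the Motivation.
head_size = 80
q = [float(i % 7) - 3.0 for i in range(head_size)]
q_rot = rotary_embed_neox(q, position=5, rotary_dim=head_size)
```

Because each pair undergoes a plain 2D rotation, the vector norm is preserved and position 0 leaves the input unchanged, which makes for two cheap sanity checks when testing a fused kernel against a reference.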

Benchmark

test_rotary_embedding_benchmark[80-80-1000000.0-1000000.0-False-dtype1-cuda-1-4000-16-16]

Implementation	Min	Max	Mean	StdDev	Median	IQR	Outliers	OPS	Rounds	Iterations
native	274.0989 (1.00)	561.6210 (1.00)	278.5268 (1.00)	2.5004 (1.00)	278.5940 (1.00)	1.7020 (1.00)	892;200	3,590.3191 (1.00)	20000 (1.00)	1 (1.00)
original (vllm)	117.0728 (0.43)	195.9121 (0.35)	119.9629 (0.43)	1.1579 (0.46)	119.9138 (0.43)	0.8049 (0.47)	1595;562	8,335.9090 (2.32)	20000 (1.00)	1 (1.00)
sgl	29.7832 (0.11)	731.3141 (1.30)	32.2445 (0.12)	5.2406 (2.10)	33.1532 (0.12)	3.0030 (1.76)	25;21	31.0130 (0.01)	20000 (1.00)	1 (1.00)

test_rotary_embedding_benchmark[80-80-1000000.0-1000000.0-False-dtype0-cuda-1-8840-16-16]

Implementation	Min	Max	Mean	StdDev	Median	IQR	Outliers	OPS	Rounds	Iterations
native	556.5220 (1.00)	1,655.5311 (1.00)	560.8802 (1.00)	12.6544 (1.00)	560.6054 (1.00)	1.5302 (1.00)	19;177	1,782.9120 (1.00)	20000 (1.00)	1 (1.00)
original (vllm)	237.4239 (0.43)	1,746.1870 (1.05)	240.1866 (0.43)	16.5433 (1.31)	239.7769 (0.43)	1.1339 (0.74)	14;272	4,163.4303 (2.33)	20000 (1.00)	1 (1.00)
sgl	92.4878 (0.17)	874.3163 (0.53)	94.0537 (0.17)	5.6077 (0.44)	93.9691 (0.17)	0.5364 (0.35)	19;445	10.6322 (0.01)	20000 (1.00)	1 (1.00)

test_rotary_embedding_benchmark[80-80-1000000.0-1000000.0-True-dtype2-cuda-8-8840-16-16]

Implementation	Min	Max	Mean	StdDev	Median	IQR	Outliers	OPS	Rounds	Iterations
native	3,992.0881 (1.00)	4,380.6531 (1.00)	4,000.0806 (1.00)	4.0606 (1.00)	3,999.8861 (1.00)	2.7446 (1.00)	1060;247	249.9950 (1.00)	20000 (1.00)	1 (1.00)
original (vllm)	1,640.7282 (0.41)	3,510.4910 (0.80)	1,645.7016 (0.41)	16.5359 (4.07)	1,645.3189 (0.41)	1.7250 (0.63)	31;195	607.6436 (2.43)	20000 (1.00)	1 (1.00)
sgl	642.2000 (0.16)	1,061.5052 (0.24)	645.0266 (0.16)	5.0471 (1.24)	644.9120 (0.16)	0.9649 (0.35)	33;254	1.5503 (0.01)	20000 (1.00)	1 (1.00)
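
The tables do not state their units. Assuming the timing columns are in microseconds (an assumption, not stated in the PR), the native rows' OPS values agree with 1e6 / Mean, while the sgl rows' OPS values agree only when read as thousands of operations per second. The (0.01) OPS ratios on the sgl rows should therefore be read with care. The short script below recomputes OPS and speedups from the Mean column; the case labels and helper names are made up for illustration.

```python
# Mean latencies copied from the three tables above (unit assumed: microseconds).
means_us = {
    "dtype1, 1 x 4000": {"native": 278.5268, "vllm": 119.9629, "sgl": 32.2445},
    "dtype0, 1 x 8840": {"native": 560.8802, "vllm": 240.1866, "sgl": 94.0537},
    "dtype2, 8 x 8840": {"native": 4000.0806, "vllm": 1645.7016, "sgl": 645.0266},
}


def ops_per_second(mean_us):
    """Operations per second implied by a mean latency in microseconds."""
    return 1e6 / mean_us


def speedup(row, baseline):
    """How many times faster sgl is than the given baseline, by mean latency."""
    return row[baseline] / row["sgl"]


for case, row in means_us.items():
    print(
        f"{case}: sgl {ops_per_second(row['sgl']):.1f} ops/s, "
        f"{speedup(row, 'native'):.2f}x vs native, "
        f"{speedup(row, 'vllm'):.2f}x vs vllm"
    )
```

Comparing mean latencies directly avoids the OPS column altogether, so the speedup figures do not depend on how that column was scaled.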

Checklist

Format your code according to the Code Formatting with Pre-Commit.
Add unit tests as outlined in the Running Unit Tests.
Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.

FlamingoPg · 2025-05-22T12:07:57Z

Others LGTM

Swipe4057 · 2025-06-25T06:39:07Z

I wanted to test your kernel but it seems there are conflicts with the current version main. Are there any plans to update or do I need to roll back?

JustinTong0323 · 2025-06-28T00:49:03Z

> I wanted to test your kernel but it seems there are conflicts with the current version main. Are there any plans to update or do I need to roll back?

Updated.


yuan-luo · 2025-12-12T16:01:54Z

@mickqian May I know why this PR was closed?

mickqian · 2025-12-12T16:08:57Z

> @mickqian May I know why this PR was closed?

Because I don't have enough time to refine this kernel

mickqian added 9 commits May 22, 2025 19:14

initial

9a823c5

add test

86ac8eb

test passed

2c7c71c

cherry-pick update-test

64e45e7

cleanups

4cf093d

update test

2b958b3

update test

226c210

update kernel with sram

68b6844

reduce dynamic cast

a783bbb

mickqian requested review from BBuf, ByronHsu, FlamingoPg, HaiShaw, HandH1998, Ying1123, ch-wan, hnyls2002, ispobock, merrymercy, yizhang2077, zhaochenyang20 and zhyncs as code owners May 22, 2025 12:02

remove python parts

de64889

mickqian changed the title ~~add rotary embed kernel for common head_sizes~~ [sgl-kernel] add rotary embed kernel for common head_sizes May 22, 2025

mickqian changed the title ~~[sgl-kernel] add rotary embed kernel for common head_sizes~~ [sgl-kernel] add rotary embed kernel for trivial head_sizes May 22, 2025

mickqian mentioned this pull request Jun 24, 2025

[Bug] Gemma3 throughput is 2x lower than vLLM #7471

Closed

5 tasks

Merge branch 'main' into rotary_emb_kernel

052cf93

AlienKevin mentioned this pull request Sep 16, 2025

Fast rotary embedding #10527

Closed

This was referenced Nov 10, 2025

[Roadmap] Diffusion (2025 Q4) #12799

Closed

diffusion: rotary embedding kernel #12985

Closed

RubiaCx added a commit to RubiaCx/sglang that referenced this pull request Nov 12, 2025

Import rotary embedding kernel from PR sgl-project#6530 and add standalone testing

1dda361

mickqian closed this Nov 13, 2025

RubiaCx mentioned this pull request Dec 2, 2025

diffusion: rotary embedding kernel #14302

Open

6 tasks

mickqian wants to merge 11 commits into sgl-project:main from mickqian:rotary_emb_kernel
