Opt jit qknorm_across_heads cuda kernel by BBuf · Pull Request #21503 · sgl-project/sglang

BBuf · 2026-03-27T02:42:20Z

Motivation

Follow-up to #18073.

The old kernel handled both q and k inside one CTA, which kept too much
state live at the same time:

- `q`
- `k`
- `q_weight`
- `k_weight`
- separate output vectors
- dual reduction buffers in shared memory

The new kernel still runs in a single launch, but splits the work across CTAs
with `grid.y = 2`:

- `blockIdx.y == 0`: normalize `q`
- `blockIdx.y == 1`: normalize `k`

Because each CTA now handles only one of q or k, this reduces per-thread live
state and shrinks the shared reduction buffer from two lanes to one.
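
The dispatch described above can be sketched on the host side. This is a plain-Python simulation of the launch grid, not the CUDA kernel from this PR. It assumes the normalization is an RMSNorm over the full hidden dimension, applied in place, with an `eps` of 1e-6; none of those details are stated in the PR text, and the names `rms_norm_row`, `block_body`, and `launch` are hypothetical.

```python
import math


def rms_norm_row(row, weight, eps=1e-6):
    """ASSUMPTION: RMSNorm over the whole row (all heads), scaled by weight."""
    mean_sq = sum(x * x for x in row) / len(row)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [x * inv_rms * w for x, w in zip(row, weight)]


def block_body(block_idx_x, block_idx_y, q, k, q_weight, k_weight, eps):
    """Work done by one simulated CTA: one row of either q or k, never both."""
    if block_idx_y == 0:
        tensor, weight = q, q_weight  # blockIdx.y == 0: normalize q
    else:
        tensor, weight = k, k_weight  # blockIdx.y == 1: normalize k
    # Only one input row and one weight vector are live in this "CTA".
    tensor[block_idx_x] = rms_norm_row(tensor[block_idx_x], weight, eps)


def launch(q, k, q_weight, k_weight, eps=1e-6):
    """Simulate a single launch with grid = (batch_size, 2)."""
    grid_x, grid_y = len(q), 2
    for block_idx_y in range(grid_y):
        for block_idx_x in range(grid_x):
            block_body(block_idx_x, block_idx_y, q, k, q_weight, k_weight, eps)


if __name__ == "__main__":
    q = [[3.0, 4.0]]
    k = [[1.0, 1.0]]
    launch(q, k, q_weight=[1.0, 1.0], k_weight=[1.0, 1.0])
    print(q, k)
```

The point of the sketch is that the branch on `block_idx_y` is uniform within a CTA, so each CTA carries the state for a single tensor while the host still issues one launch.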

On H200 with shape (batch_size=2048, hidden_dim=8192):

- registers/thread: 48 -> 26
- static shared memory/block: 256 B -> 128 B
- theoretical occupancy: 50% -> 100%
- achieved occupancy: 45.25% -> 88.17%
- achieved active warps/SM: 28.96 -> 56.43
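
As a rough cross-check, the occupancy figures above can be reproduced with simple arithmetic. The sketch below assumes a block size of 1024 threads, which the PR does not state, and uses the per-SM limits for compute capability 9.0 as I understand them (65536 registers, 64 resident warps, registers allocated per warp in units of 256). Shared memory is ignored because 128 to 256 B per block is far from limiting. The function name `theoretical_occupancy` is hypothetical.

```python
REGS_PER_SM = 65536          # 32-bit registers per SM (compute capability 9.0)
MAX_WARPS_PER_SM = 64        # resident warp limit per SM
WARP_SIZE = 32
REG_ALLOC_UNIT = 256         # registers are allocated per warp in units of 256
THREADS_PER_BLOCK = 1024     # ASSUMPTION: block size is not given in the PR


def ceil_div(a, b):
    return -(-a // b)


def theoretical_occupancy(regs_per_thread, threads_per_block=THREADS_PER_BLOCK):
    """Register- and warp-limited occupancy; shared memory is not modeled."""
    warps_per_block = ceil_div(threads_per_block, WARP_SIZE)
    # Round the per-warp register demand up to the allocation unit.
    regs_per_warp = ceil_div(regs_per_thread * WARP_SIZE, REG_ALLOC_UNIT) * REG_ALLOC_UNIT
    blocks_by_regs = REGS_PER_SM // (regs_per_warp * warps_per_block)
    blocks_by_warps = MAX_WARPS_PER_SM // warps_per_block
    resident_blocks = min(blocks_by_regs, blocks_by_warps)
    return resident_blocks * warps_per_block / MAX_WARPS_PER_SM


if __name__ == "__main__":
    for regs in (48, 26):
        print(regs, "regs/thread ->", theoretical_occupancy(regs))
    # Achieved occupancy is achieved active warps divided by the warp limit.
    for warps in (28.96, 56.43):
        print(warps, "warps/SM ->", round(warps / MAX_WARPS_PER_SM * 100, 2), "%")
```

Under these assumptions, 48 registers/thread lets only one 1024-thread block fit per SM (32 of 64 warps), while 26 registers/thread lets two fit, which matches the reported jump from 50% to 100% theoretical occupancy.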

Microbench results

H200, bf16:

| Shape | Baseline | Optimized | Speedup |
| --- | --- | --- | --- |
| `(256, 1024)` | `0.0207 ms` | `0.0194 ms` | `1.0711x` |
| `(1024, 4096)` | `0.0688 ms` | `0.0598 ms` | `1.1506x` |
| `(2048, 8192)` | `0.1786 ms` | `0.1592 ms` | `1.1218x` |

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

- Format your code according to "Format code with pre-commit".
- Add unit tests according to "Run and add unit tests".
- Update documentation according to "Write documentations".
- Provide accuracy and speed benchmark results according to "Test the accuracy" and "Benchmark the speed".
- Follow the SGLang code style guidance.

Review Process

Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
After green CI and required approvals, ask Merge Oncalls to merge.

gemini-code-assist · 2026-03-27T02:42:23Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

BBuf · 2026-03-27T02:42:50Z

/tag-and-rerun-ci

ud

513c62e

BBuf requested review from DarkSharpness, HydraQYH, celve and yuan-luo as code owners March 27, 2026 02:42

github-actions Bot added the jit-kernel label Mar 27, 2026

github-actions Bot added the run-ci label Mar 27, 2026

HydraQYH reviewed Mar 27, 2026

Comment thread python/sglang/jit_kernel/csrc/elementwise/qknorm_across_heads.cuh Outdated

ud

70a8533

HydraQYH approved these changes Mar 27, 2026

DarkSharpness approved these changes Mar 27, 2026

BBuf merged commit e8d46f1 into main Mar 27, 2026
40 of 70 checks passed

BBuf deleted the opt_qknorm_across_heads branch March 27, 2026 05:30

schetlur-nv mentioned this pull request Apr 2, 2026

[Feature] VisualGen: Add qknorm + rope fuse kernel for cross-head norm (Wan/LTX-2) NVIDIA/TensorRT-LLM#12716 (Open, 1 task)

satyamk7054 pushed a commit to satyamk7054/sglang that referenced this pull request Apr 3, 2026

Opt jit qknorm_across_heads cuda kernel (sgl-project#21503)

6e23694

JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026

Opt jit qknorm_across_heads cuda kernel (sgl-project#21503)

b367364

yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026

Opt jit qknorm_across_heads cuda kernel (sgl-project#21503)

293efb1

BBuf mentioned this pull request Apr 29, 2026

SGLang AI Agent Performance Optimization PRs (2026-01-29 to 2026-04-29) BBuf/AI-Infra-Auto-Driven-SKILLS#46 (Open)

BBuf merged 2 commits into main from opt_qknorm_across_heads
3 participants