Fix(jit): support rmsnorm for hidden_size in {64, 128, 256} #20661
BBuf merged 5 commits into sgl-project:main
Conversation
output_ptr = pointer::offset<Float>(output, i * output_stride);
output_vec = norm::apply_norm_warp<kDim>(input_vec, weight_vec, eps);
}
gmem.store(output_ptr, output_vec);
This gmem.store(output_ptr, output_vec); should be inside a for loop, and the if statement inside the for loop is meaningless.
Perhaps this is actually correct: each loop iteration writes back the token processed in the previous iteration, so the token from the last iteration must be written back separately after the loop. However, this makes the code less readable, and I think it would be better to process a whole token within a single iteration.
Key takeaways
- hidden_size=8192: Sequential is consistently and significantly faster — up to 19% at large batch. This is the most impactful regime (e.g. Llama/Qwen models with hidden_size=8192).
- hidden_size 3072/5120: Results are mixed and small (±3–5%), largely within noise margins.
- Small batch (≤128): Pipeline wins by ~1–3% — negligible in practice since absolute latency is already ~4µs.
- The pipeline pattern adds register pressure (keeping output_ptr/output_vec live across the loop + a conditional branch), which clearly hurts occupancy at hidden_size=8192.
So replace the pipeline kernels with the sequential variants. The sequential pattern is simpler, more readable, and strictly better for the most common large-model configurations.
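To make the comparison concrete, here is a minimal, illustrative sketch of the two loop shapes being discussed. This is not the PR's kernel (the real code lives in csrc/elementwise/rmsnorm.cuh and uses vectorized gmem/pointer/norm helpers); it assumes plain float data and one thread per token purely so the difference in store placement is easy to see.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Hypothetical helper: RMS-normalize one token of kDim elements into out_regs.
template <int kDim>
__device__ void rms_norm_token(const float* in, const float* weight, float eps,
                               float* out_regs) {
  float ss = 0.f;
  for (int i = 0; i < kDim; ++i) ss += in[i] * in[i];
  float scale = rsqrtf(ss / kDim + eps);
  for (int i = 0; i < kDim; ++i) out_regs[i] = in[i] * scale * weight[i];
}

// (a) Sequential: load, normalize, and store a token within one iteration.
template <int kDim>
__global__ void rmsnorm_sequential(const float* in, const float* w, float* out,
                                   int num_tokens, float eps) {
  float regs[kDim];
  for (int t = blockIdx.x * blockDim.x + threadIdx.x; t < num_tokens;
       t += gridDim.x * blockDim.x) {
    rms_norm_token<kDim>(in + (long)t * kDim, w, eps, regs);
    for (int i = 0; i < kDim; ++i) out[(long)t * kDim + i] = regs[i];
  }
}

// (b) Pipelined: iteration i writes back the token normalized in iteration
// i-1, so the final token must be flushed after the loop -- the store that
// the review comments above are about. The pointer kept live across
// iterations plus the extra branch is where the added register pressure
// comes from.
template <int kDim>
__global__ void rmsnorm_pipelined(const float* in, const float* w, float* out,
                                  int num_tokens, float eps) {
  float regs[kDim];
  float* pending = nullptr;
  for (int t = blockIdx.x * blockDim.x + threadIdx.x; t < num_tokens;
       t += gridDim.x * blockDim.x) {
    if (pending != nullptr) {
      for (int i = 0; i < kDim; ++i) pending[i] = regs[i];
    }
    rms_norm_token<kDim>(in + (long)t * kDim, w, eps, regs);
    pending = out + (long)t * kDim;
  }
  if (pending != nullptr) {  // flush the last token processed by this thread
    for (int i = 0; i < kDim; ++i) pending[i] = regs[i];
  }
}
```

Both variants produce identical output; the pipelining only changes when the store is issued relative to the next load.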
Please provide complete unit test results (screenshots or logs) once the above comments are resolved.
Yes, it did as
Force-pushed from b62da97 to 672df76.
HydraQYH left a comment
The kernel implementation looks OK. I don't think some of these unit tests are strictly necessary, but adding them won't hurt; if you feel they are needed, you can keep them.
torch.testing.assert_close(r_jit, r_ref, rtol=1e-2, atol=1e-2)
...
@pytest.mark.parametrize(
There's only one question: are these new unit tests really necessary?
I will remove that
/tag-and-rerun-ci



Motivation
jit_rmsnorm silently failed for hidden_size ∈ {64, 128, 256} and for hidden_size = 16384 during benchmarking on B200. Both cases caused nvcc to exit with status 2, flooding stderr with full compilation logs for every failing (hidden_size, batch_size, dtype) combination.
Modifications
- csrc/elementwise/rmsnorm.cuh
- python/sglang/jit_kernel/norm.py
- python/sglang/jit_kernel/tests/test_norm_jit.py
Accuracy Tests
test_rmsnorm_jit validates correctness against the flashinfer reference for all supported hidden sizes, including the newly added {64, 128, 256}, across bf16/fp16 and batch_size ∈ {1, 19, 99, 989}.
Benchmarking and Profiling
Measured on B200 (hidden_size=64, batch_size=1, bf16), jit_rmsnorm with the new warp norm kernel is the fastest among all providers:

Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci