
Enable mixed type LayerNorm kernel for NSA indexer#12044

Merged
Fridge003 merged 4 commits into sgl-project:main from akhilg-nv:layer_norm on Nov 4, 2025

Conversation

@akhilg-nv (Contributor) commented Oct 24, 2025

Motivation

Currently we cast the LayerNorm input to fp32, run native Torch LayerNorm, and cast the result back. Instead, we pull in a more efficient TRT-LLM kernel (via FlashInfer) that supports mixed-precision inputs directly.
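For context, a minimal sketch of the cast-based path being replaced (the helper name layernorm_fp32_cast is illustrative only, not the actual SGLang code):

import torch
import torch.nn.functional as F

def layernorm_fp32_cast(x, weight, bias, eps=1e-6):
    # Old path: upcast bf16 activations to fp32, run native LayerNorm, then
    # downcast the result. The extra casts add elementwise kernels around the
    # LayerNorm itself; the FlashInfer/TRT-LLM kernel consumes bf16 activations
    # with fp32 weights directly and skips the round trip.
    orig_dtype = x.dtype
    out = F.layer_norm(x.float(), (x.shape[-1],), weight=weight, bias=bias, eps=eps)
    return out.to(orig_dtype)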

Modifications

Add the FlashInfer LayerNorm kernel and apply it in the NSA indexer.

Accuracy Tests

python -m sglang.launch_server --model-path model_fp4/ --tp 4 --dp 4 --enable-dp-attention --reasoning-parser deepseek-v3

python3 -m sglang.test.run_eval --port 30000 --eval-name gpqa --num-examples 198 --max-tokens 120000 --repeat 8 --thinking-mode deepseek-v3

sample results

# with old layernorm
Repeat: 8, mean: 0.783
Scores: ['0.788', '0.793', '0.808', '0.783', '0.742', '0.788', '0.783', '0.778']
Writing report to /tmp/gpqa_model_fp4_.html
{'chars': np.float64(1464.3333333333333), 'chars:std': np.float64(356.735535690296), 'score:std': np.float64(0.41573970964154905), 'score': np.float64(0.7777777777777778)}
Writing results to /tmp/gpqa_model_fp4_.json
Total latency: 1422.309 s
Score: 0.778

# with new layernorm

Repeat: 8, mean: 0.785
Scores: ['0.758', '0.763', '0.788', '0.753', '0.823', '0.808', '0.793', '0.793']
Writing report to /tmp/gpqa_model_fp4_.html
{'chars': np.float64(1474.1515151515152), 'chars:std': np.float64(341.7505399975447), 'score:std': np.float64(0.40520665017240837), 'score': np.float64(0.7929292929292929)}
Writing results to /tmp/gpqa_model_fp4_.json
Total latency: 1290.363 s
Score: 0.793

nsys trace shows the new kernel being used:

[image: nsys timeline showing the new FlashInfer LayerNorm kernel in use]

Benchmarking and Profiling

Checklist


akhilg-nv changed the title from "Enable mixed type LayerNorm kernel for NSA" to "Enable mixed type LayerNorm kernel for NSA indexer" on Oct 24, 2025
@hlu1 (Collaborator) commented Oct 24, 2025

@akhilg-nv Please add the nsys profile result showing how much time the three kernels were using before applying the change

Comment on lines +315 to +318
if not self.elementwise_affine:
return self.forward_native(x)

if _flashinfer_layernorm_available and x.dtype == torch.bfloat16 and self.dtype == torch.float32:
Collaborator: Combine the two if branches

Contributor Author (akhilg-nv):
I did some benchmarking and found that for most cases I looked at, using the flashinfer kernel with weight = ones and bias = zeros is still faster than torch or torch.compile. So I combined these execution paths (by default I now initialize weight/bias to ones/zeros). Ideally we could make the affine transform optional in the flashinfer kernel, but this shouldn't affect DSv3.2 perf anyway.
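A rough sketch of the combined dispatch described above (a sketch only; flashinfer_layernorm stands in for the FlashInfer wrapper, and the real SGLang code may differ):

def forward_cuda(self, x: torch.Tensor) -> torch.Tensor:
    # Single branch: weight/bias are initialized to ones/zeros when
    # elementwise_affine is False, so the affine and non-affine cases can
    # share the fused kernel path.
    if (
        _flashinfer_layernorm_available
        and x.dtype == torch.bfloat16
        and self.dtype == torch.float32
    ):
        return flashinfer_layernorm(x, self.weight, self.bias, self.eps)
    return self.forward_native(x)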

@akhilg-nv (Contributor Author)

@Fridge003 Could you provide insight on resolving the failing CI/CD tests?

I see errors that seem unrelated to my changes, like:

  File "/sglang-checkout/python/sglang/srt/models/deepseek_v2.py", line 2098, in forward_absorb_fused_mla_rope_prepare
    forward_batch.attn_backend.forward_metadata
AttributeError: 'HybridAttnBackend' object has no attribute 'forward_metadata'. Did you mean: 'init_forward_metadata'?

and

  File "/sglang-checkout/python/sglang/srt/models/deepseek_v2.py", line 2097, in forward_absorb_fused_mla_rope_prepare
    attn_logits, _, kv_indptr, kv_indices, _, _, _ = (
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: cannot unpack non-iterable ForwardMetadata object

@Fridge003 (Collaborator)

@akhilg-nv That's unrelated to this PR

Fridge003 merged commit e607850 into sgl-project:main on Nov 4, 2025
60 of 73 checks passed