
CUTLASS FP8 Blockwise GEMM improvement of SM120#20887

Merged
BBuf merged 1 commit into main from brayden/optimize-fp8-gemm-sm120 on Mar 22, 2026
Conversation


@b8zhong b8zhong commented Mar 18, 2026

Motivation

The SM120 FP8 blockwise GEMM kernel was using KernelScheduleAuto as the schedule, which on SM120 happens to select only the cooperative kernel. This single-kernel approach misses a performance opportunity: the pingpong schedule is about 2x faster than cooperative for small M. I adapted the example from the CUTLASS repo.

Modifications

sgl-kernel/csrc/gemm/fp8_blockwise_gemm_kernel.cu:

  • Replaced KernelScheduleAuto with KernelScheduleSm120Blockwise for the cooperative path (used when M > 64) to avoid a specific CUTLASS issue: with the auto schedule, the refcheck output explodes in relative error. I'm not sure of the exact cause, but it appears to be a CUTLASS library issue.
  • Added a pingpong path using KernelTmaWarpSpecializedBlockwisePingpongSm120 with a 64x128x128 tile shape for M ≤ 64.
  • Added an M ≤ 64 runtime check to pick between the two paths.
  • Refactored the kernel setup a bit.

Accuracy and UT


Benchmarked on RTX 5090 (SM120) against FlashInfer and Triton across Qwen/Qwen3.5-27B-FP8 shapes for M from 1 to 512:

Performance (RTX 5090, N=1536, K=5120)

+-----+----------+------------+----------+
|  M  | this PR  | FlashInfer |  Triton  |
+-----+----------+------------+----------+
|  8  | 0.034 ms |  0.063 ms  | 0.041 ms |
| 64  | 0.034 ms |  0.063 ms  | 0.041 ms |
| 128 | 0.063 ms |  0.063 ms  | 0.042 ms |
| 512 | 0.063 ms |  0.063 ms  | 0.043 ms |
+-----+----------+------------+----------+


E2E Accuracy:

python -m sglang.test.run_eval --base-url http://localhost:30000 --eval-name gsm8k --num-examples 200 --max-tokens 16000 --repeat 5 --num-threads 48 --num-shots 5 --temperature 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 --chat-template-kwargs '{"enable_thinking": true}'

Before:
{'score:std': np.float64(0.07053367989832945), 'scores': ['0.995', '0.975', '0.980', '0.990', '0.995'], 'mean_score': np.float64(0.9870000000000001)}

After:
{'score:std': np.float64(0.12155245781143219), 'scores': ['0.990', '0.985', '0.995', '0.995', '0.985'], 'mean_score': np.float64(0.99)}

BS = 1 speed:

Before:
(It will use the cooperative schedule)

+-------------+--------+------------+-----------------+
| Latency (s) | Tokens | Acc Length | Speed (token/s) |
+-------------+--------+------------+-----------------+
|   29.996    |  1024  |   1.000    |      34.14      |
+-------------+--------+------------+-----------------+

After:
(Profile zoomed in far enough to show the pingpong schedule's kernel name.)

+-------------+--------+------------+-----------------+
| Latency (s) | Tokens | Acc Length | Speed (token/s) |
+-------------+--------+------------+-----------------+
|   19.652    |  1024  |   1.000    |      52.11      |
+-------------+--------+------------+-----------------+

I think we can change the default GEMM backend on SM120 in a later PR.

Checklist

  • Format your code with pre-commit.
  • Add unit tests.
  • Update documentation.
  • Provide accuracy and speed benchmark results.



b8zhong commented Mar 20, 2026

I also attached some Nsight Compute (NCU) reports:

Before (Cooperative Mainloop / 2-Stage)

--------------------------------------
Kernel: device_kernel<...MainloopSm120TmaWarpSpecializedBlockwiseScaling<2, 2...>>
Duration:               89.06 us
Compute (SM) Throughput: 3.24 %
Memory Throughput:      5.08 %
Memory Bandwidth:       89.61 Gbyte/s
L2 Hit Rate:            50.77 %
Local Memory Spilling:  0 requests

After (Pingpong Mainloop / 3-Stage)

--------------------------------------
Kernel: device_kernel<...MainloopSm120TmaWarpSpecializedBlockwiseScaling<3, 2...>>
Duration:               47.65 us
Compute (SM) Throughput: 3.01 %
Memory Throughput:      9.51 %
Memory Bandwidth:       167.55 Gbyte/s
L2 Hit Rate:            34.99 %
Local Memory Spilling:  252 requests

Memory bandwidth increased by around 80% for M = 16. There is still a lot of room to improve its performance, but this should be an okay first step.


BBuf commented Mar 20, 2026

/tag-and-rerun-ci again

@BBuf BBuf merged commit 009eee8 into main Mar 22, 2026
247 of 327 checks passed
@BBuf BBuf deleted the brayden/optimize-fp8-gemm-sm120 branch March 22, 2026 09:55