
Fix result writer in tuning_block_wise_kernel.py, and add FP8 kernel config for L40 #20368

Merged
BBuf merged 2 commits into sgl-project:main from cs-cat:main on Mar 20, 2026

Conversation

@cs-cat (Contributor) commented Mar 11, 2026

Motivation

This PR fixes #20366.

Modifications

  1. Fix the result writer in tuning_block_wise_kernel.py so that tuned configs are written out correctly.
  2. Add an FP8 block-wise kernel config for the NVIDIA L40.

Benchmarking and Profiling

Inference on a single L40, using python3 -m sglang.bench_offline_throughput --model-path /models/Qwen3-30B-A3B-Instruct-2507-FP8/ --num-prompts ${N} --mem-fraction-static 0.85 --context-length 65536 --reasoning-parser qwen3 --tool-call-parser qwen3_coder --max-running-requests ${P} --kv-cache-dtype fp8_e4m3:

| Total throughput | baseline (tuned MoE) | baseline + buggy FP8 tuning | baseline + tuned FP8 (this PR) | Improvement due to bugfix |
|---|---|---|---|---|
| N=32, P=1 | 245.80 | 164.77 (-32.9%) | 304.41 (+23.8%) | 84.7% |
| N=256, P=8 | 832.57 | 783.07 (-5.9%) | 977.99 (+17.4%) | 24.8% |
| N=256, P=256 | 2436.34 | 2352.88 (-3.4%) | 2702.24 (+10.9%) | 14.8% |
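The regression column shows how much a faulty result writer can cost. As a purely hypothetical illustration (not the actual sglang code, whose exact bug is described in #20366), a block-wise tuning script has to keep the best measured config per batch size M and dump a single JSON object keyed by M; a sketch of that writing step:

```python
import json
import os
import tempfile

def write_tuned_configs(results, path):
    """Hypothetical result writer for a block-wise GEMM tuner.

    results: {M: (config_dict, latency_us)} -- the best measured config
    per batch size M. Writes one JSON object keyed by str(M), the layout
    the N=...,K=...,dtype=fp8_w8a8 config files appear to use.
    """
    # Keep only the winning config per M; drop the latency used to pick it.
    payload = {str(m): cfg for m, (cfg, _lat) in sorted(results.items())}
    with open(path, "w") as f:
        json.dump(payload, f, indent=4)
    return payload

# Usage sketch with made-up tuning results (field names are illustrative):
results = {
    1: ({"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128,
         "GROUP_SIZE_M": 1, "num_warps": 4, "num_stages": 3}, 42.0),
    64: ({"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128,
          "GROUP_SIZE_M": 8, "num_warps": 8, "num_stages": 4}, 55.0),
}
path = os.path.join(tempfile.gettempdir(), "fp8_tuning_example.json")
payload = write_tuned_configs(results, path)
```

The point of the sketch: each M must map to its own winning config; accidentally flattening or overwriting this dict yields configs tuned for the wrong batch size, which matches the throughput losses seen above.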

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

cs-cat added 2 commits March 11, 2026 21:09
Signed-off-by: cs-cat <118669451+cs-cat@users.noreply.github.com>
Signed-off-by: cs-cat <118669451+cs-cat@users.noreply.github.com>
@cs-cat (Contributor, Author) commented Mar 11, 2026

The files listed below should be updated as well. However, I do not have the corresponding GPUs, so I cannot regenerate their results.

python/sglang/srt/layers/quantization/configs/N=1280,K=5120,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=1536,K=1536,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=1536,K=7168,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=2048,K=512,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=2304,K=7168,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=24576,K=7168,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=256,K=7168,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=32768,K=512,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=5120,K=1024,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=5120,K=3200,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=576,K=7168,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=6400,K=5120,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=7168,K=1024,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=7168,K=1152,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=7168,K=128,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=7168,K=16384,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=7168,K=18432,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128].json
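For context, each of these files appears to map a batch size M to a Triton kernel launch config for the given (N, K, device, dtype, block shape) combination. An illustrative sketch of the expected shape, with made-up values (the real files are produced by the tuning script, not written by hand):

```json
{
    "1": {
        "BLOCK_SIZE_M": 16,
        "BLOCK_SIZE_N": 64,
        "BLOCK_SIZE_K": 128,
        "GROUP_SIZE_M": 1,
        "num_warps": 4,
        "num_stages": 3
    },
    "16": {
        "BLOCK_SIZE_M": 16,
        "BLOCK_SIZE_N": 128,
        "BLOCK_SIZE_K": 128,
        "GROUP_SIZE_M": 8,
        "num_warps": 4,
        "num_stages": 4
    }
}
```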

@BBuf (Collaborator) commented Mar 17, 2026

/tag-and-rerun-ci again

@b8zhong (Collaborator) commented Mar 19, 2026

/rerun-failed-ci

@BBuf BBuf merged commit 22e378a into sgl-project:main Mar 20, 2026
268 of 293 checks passed
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
…config for L40 (sgl-project#20368)

Signed-off-by: cs-cat <118669451+cs-cat@users.noreply.github.com>
0-693 pushed a commit to 0-693/sglang that referenced this pull request Mar 25, 2026
…config for L40 (sgl-project#20368)

Signed-off-by: cs-cat <118669451+cs-cat@users.noreply.github.com>
dutsc pushed a commit to dutsc/sglang that referenced this pull request Mar 30, 2026
…config for L40 (sgl-project#20368)

Signed-off-by: cs-cat <118669451+cs-cat@users.noreply.github.com>
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
…config for L40 (sgl-project#20368)

Signed-off-by: cs-cat <118669451+cs-cat@users.noreply.github.com>
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
…config for L40 (sgl-project#20368)

Signed-off-by: cs-cat <118669451+cs-cat@users.noreply.github.com>

Successfully merging this pull request may close these issues.

[Bug] FP8 performance regression due to incorrect result produced by tuning_block_wise_kernel.py
