
Fix result writer in tuning_block_wise_kernel.py, and add FP8 kernel config for L40 #20368

Merged
BBuf merged 2 commits into sgl-project:main from cs-cat:main on Mar 20, 2026

Conversation

@cs-cat (Contributor) commented Mar 11, 2026

Motivation

This PR fixes #20366.

Modifications

  1. Fix the result writer in tuning_block_wise_kernel.py so that tuned configs are written out correctly.
  2. Add an FP8 block-wise kernel config for the NVIDIA L40.

Benchmarking and Profiling

Inference on a single L40, using python3 -m sglang.bench_offline_throughput --model-path /models/Qwen3-30B-A3B-Instruct-2507-FP8/ --num-prompts ${N} --mem-fraction-static 0.85 --context-length 65536 --reasoning-parser qwen3 --tool-call-parser qwen3_coder --max-running-requests ${P} --kv-cache-dtype fp8_e4m3:

| Total throughput | baseline (tuned MoE) | baseline + buggy FP8 tuning | baseline + tuned FP8 (this PR) | Improvement due to bugfix |
|---|---|---|---|---|
| N=32, P=1 | 245.80 | 164.77 (-32.9%) | 304.41 (+23.8%) | 84.7% |
| N=256, P=8 | 832.57 | 783.07 (-5.9%) | 977.99 (+17.4%) | 24.8% |
| N=256, P=256 | 2436.34 | 2352.88 (-3.4%) | 2702.24 (+10.9%) | 14.8% |
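The regression column shows how much a faulty result writer can cost. As a purely hypothetical illustration (not the actual sglang code, whose exact bug is described in #20366), a block-wise tuning script has to keep the best measured config per batch size M and dump a single JSON object keyed by M; a sketch of that writing step:

```python
import json
import os
import tempfile

def write_tuned_configs(results, path):
    """Hypothetical result writer for a block-wise GEMM tuner.

    results: {M: (config_dict, latency_us)} -- the best measured config
    per batch size M. Writes one JSON object keyed by str(M), the layout
    the N=...,K=...,dtype=fp8_w8a8 config files appear to use.
    """
    # Keep only the winning config per M; drop the latency used to pick it.
    payload = {str(m): cfg for m, (cfg, _lat) in sorted(results.items())}
    with open(path, "w") as f:
        json.dump(payload, f, indent=4)
    return payload

# Usage sketch with made-up tuning results (field names are illustrative):
results = {
    1: ({"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128,
         "GROUP_SIZE_M": 1, "num_warps": 4, "num_stages": 3}, 42.0),
    64: ({"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128,
          "GROUP_SIZE_M": 8, "num_warps": 8, "num_stages": 4}, 55.0),
}
path = os.path.join(tempfile.gettempdir(), "fp8_tuning_example.json")
payload = write_tuned_configs(results, path)
```

The point of the sketch: each M must map to its own winning config; accidentally flattening or overwriting this dict yields configs tuned for the wrong batch size, which matches the throughput losses seen above.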

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

cs-cat added 2 commits March 11, 2026 21:09
Signed-off-by: cs-cat <118669451+cs-cat@users.noreply.github.com>
Signed-off-by: cs-cat <118669451+cs-cat@users.noreply.github.com>
@cs-cat (Contributor, Author) commented Mar 11, 2026

The files listed below should be updated as well. However, I do not have the corresponding GPUs, so I cannot regenerate their results.

python/sglang/srt/layers/quantization/configs/N=1280,K=5120,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=1536,K=1536,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=1536,K=7168,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=2048,K=512,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=2304,K=7168,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=24576,K=7168,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=256,K=7168,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=32768,K=512,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=5120,K=1024,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=5120,K=3200,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=576,K=7168,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=6400,K=5120,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=7168,K=1024,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=7168,K=1152,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=7168,K=128,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=7168,K=16384,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=7168,K=18432,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128].json
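For context, each of these files appears to map a batch size M to a Triton kernel launch config for the given (N, K, device, dtype, block shape) combination. An illustrative sketch of the expected shape, with made-up values (the real files are produced by the tuning script, not written by hand):

```json
{
    "1": {
        "BLOCK_SIZE_M": 16,
        "BLOCK_SIZE_N": 64,
        "BLOCK_SIZE_K": 128,
        "GROUP_SIZE_M": 1,
        "num_warps": 4,
        "num_stages": 3
    },
    "16": {
        "BLOCK_SIZE_M": 16,
        "BLOCK_SIZE_N": 128,
        "BLOCK_SIZE_K": 128,
        "GROUP_SIZE_M": 8,
        "num_warps": 4,
        "num_stages": 4
    }
}
```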

@BBuf (Collaborator) commented Mar 17, 2026

/tag-and-rerun-ci again

@b8zhong (Collaborator) commented Mar 19, 2026

/rerun-failed-ci

@BBuf BBuf merged commit 22e378a into sgl-project:main Mar 20, 2026
268 of 293 checks passed
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
…config for L40 (sgl-project#20368)

Signed-off-by: cs-cat <118669451+cs-cat@users.noreply.github.com>
0-693 pushed a commit to 0-693/sglang that referenced this pull request Mar 25, 2026
…config for L40 (sgl-project#20368)

Signed-off-by: cs-cat <118669451+cs-cat@users.noreply.github.com>
dutsc pushed a commit to dutsc/sglang that referenced this pull request Mar 30, 2026
…config for L40 (sgl-project#20368)

Signed-off-by: cs-cat <118669451+cs-cat@users.noreply.github.com>
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
…config for L40 (sgl-project#20368)

Signed-off-by: cs-cat <118669451+cs-cat@users.noreply.github.com>
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
…config for L40 (sgl-project#20368)

Signed-off-by: cs-cat <118669451+cs-cat@users.noreply.github.com>

Successfully merging this pull request may close these issues.

[Bug] FP8 performance regression due to incorrect result produced by tuning_block_wise_kernel.py
