
[Bug] FP8 performance regression due to incorrect result produced by tuning_block_wise_kernel.py #20366

@cs-cat

Description


Checklist

  • I searched related issues but found no solution.
  • The bug persists in the latest version.
  • Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
  • If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
  • Please use English. Otherwise, it will be closed.

Describe the bug

tuning_block_wise_kernel.py is a utility in SGLang used to determine the optimal configuration for the FP8 block-wise kernel. However, it has a bug when run in a multi-GPU environment: it generates an incomplete optimal configuration, which causes a significant inference performance drop.
Moreover, 17 kernel configs shipped with sglang are affected by this bug (a quick way to spot an affected file is sketched after the list):

python/sglang/srt/layers/quantization/configs/N=1280,K=5120,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=1536,K=1536,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=1536,K=7168,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=2048,K=512,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=2304,K=7168,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=24576,K=7168,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=256,K=7168,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=32768,K=512,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=5120,K=1024,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=5120,K=3200,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=576,K=7168,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=6400,K=5120,device_name=NVIDIA_B200,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=7168,K=1024,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=7168,K=1152,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=7168,K=128,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=7168,K=16384,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128].json
python/sglang/srt/layers/quantization/configs/N=7168,K=18432,device_name=NVIDIA_L20Y,dtype=fp8_w8a8,block_shape=[128, 128].json
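These config files map each tuned batch size (M) to a set of Triton kernel parameters, so an affected file can be spotted by comparing its keys against the full batch-size sweep. A quick check along these lines (the file name and the expected batch-size list are illustrative assumptions, not taken from the tuning script):

```python
import json

# Illustrative sweep; the tuner's actual batch-size list may differ.
EXPECTED_BATCH_SIZES = [1, 2, 4, 8, 16, 24, 32, 48, 64, 96, 128,
                        256, 512, 1024, 1536, 2048, 3072, 4096]

with open(
    "N=1536,K=7168,device_name=NVIDIA_L20Y,"
    "dtype=fp8_w8a8,block_shape=[128, 128].json"
) as f:
    config = json.load(f)

present = sorted(int(m) for m in config)  # keys are batch sizes as strings
missing = sorted(set(EXPECTED_BATCH_SIZES) - set(present))
print("present:", present)
print("missing:", missing)  # non-empty for an affected file
```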

Reproduction

In a 4 × L40 environment, tune the FP8 kernel for Qwen3-30B-A3B:

python benchmark/kernels/quantization/tuning_block_wise_kernel.py --N 2048 --K 4096 -tp 1 --out-dtype bfloat16
python benchmark/kernels/quantization/tuning_block_wise_kernel.py --N 5120 --K 2048 -tp 1 --out-dtype bfloat16

For each (N, K) pair, the last worker to finish overwrites the final result file, so the saved configuration contains tuning data for only a portion of the batch sizes.
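The failure mode is a classic last-writer-wins race. A minimal sketch of the pattern, assuming each GPU worker tunes its own slice of batch sizes and then serializes the shared result file itself (the file name and function names below are illustrative, not the script's actual API):

```python
import json
import os

CONFIG_PATH = (
    "N=2048,K=4096,device_name=NVIDIA_L40,"
    "dtype=fp8_w8a8,block_shape=[128, 128].json"
)

def save_partial_buggy(worker_results: dict) -> None:
    # Buggy pattern: every worker truncates and rewrites the shared file
    # with only ITS batch sizes. Whichever worker finishes last determines
    # the final contents; the other workers' slices are silently lost.
    with open(CONFIG_PATH, "w") as f:
        json.dump(worker_results, f, indent=4)

def save_partial_merged(worker_results: dict) -> None:
    # One possible fix: merge with entries earlier workers already wrote.
    # Strictly speaking, concurrent read-modify-write still needs a file
    # lock; gathering all worker results in the launcher process and
    # writing the file exactly once is the most robust variant.
    merged = {}
    if os.path.exists(CONFIG_PATH):
        with open(CONFIG_PATH) as f:
            merged.update(json.load(f))
    merged.update(worker_results)
    with open(CONFIG_PATH, "w") as f:
        json.dump(merged, f, indent=4)
```

With the buggy pattern, a file tuned on four GPUs ends up holding roughly a quarter of the batch-size entries, and at serving time the missing batch sizes fall back to configs tuned for other shapes (or defaults), which is consistent with the regression measured below.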
Micro-throughput tests on a single L40 show a significant performance regression. Benchmark command:

python3 -m sglang.bench_offline_throughput --model-path /models/Qwen3-30B-A3B-Instruct-2507-FP8/ --num-prompts ${N} --mem-fraction-static 0.85 --context-length 65536 --reasoning-parser qwen3 --tool-call-parser qwen3_coder --max-running-requests ${P} --kv-cache-dtype fp8_e4m3

| Total throughput | baseline (tuned MoE) | baseline + buggy FP8 tuning | baseline + tuned FP8 (bug fixed) | Improvement from bugfix |
|---|---|---|---|---|
| N=32, P=1 | 245.80 | 164.77 (-32.9%) | 304.41 (+23.8%) | 84.7% |
| N=256, P=8 | 832.57 | 783.07 (-5.9%) | 977.99 (+17.4%) | 24.8% |
| N=256, P=256 | 2436.34 | 2352.88 (-3.4%) | 2702.24 (+10.9%) | 14.8% |

Environment

Python: 3.12.3 (main, Jan 22 2026, 20:57:42) [GCC 13.3.0]
CUDA available: True
GPU 0,1,2,3: NVIDIA L40
GPU 0,1,2,3 Compute Capability: 8.9
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 13.0, V13.0.88
CUDA Driver Version: 580.82.07
PyTorch: 2.9.1+cu130
sglang: 0.0.0.dev1+gd28f35240
sgl_kernel: 0.3.21
flashinfer_python: 0.6.4
flashinfer_cubin: 0.6.4
flashinfer_jit_cache: 0.6.4+cu130
triton: 3.5.1
transformers: 4.57.1
torchao: 0.9.0
numpy: 2.4.2
aiohttp: 3.13.3
fastapi: 0.135.1
hf_transfer: 0.1.9
huggingface_hub: 0.36.2
interegular: 0.3.3
modelscope: 1.34.0
orjson: 3.11.7
outlines: 0.1.11
packaging: 26.0
psutil: 7.2.2
pydantic: 2.12.5
python-multipart: 0.0.22
pyzmq: 27.1.0
uvicorn: 0.41.0
uvloop: 0.22.1
vllm: Module Not Found
xgrammar: 0.1.27
openai: 2.6.1
tiktoken: 0.12.0
anthropic: 0.84.0
litellm: Module Not Found
decord2: 3.0.0
NVIDIA Topology:
      GPU0  GPU1  GPU2  GPU3  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0  X     NODE  SYS   SYS   0,2,4,6,8,10  0             N/A
GPU1  NODE  X     SYS   SYS   0,2,4,6,8,10  0             N/A
GPU2  SYS   SYS   X     NODE  1,3,5,7,9,11  1             N/A
GPU3  SYS   SYS   NODE  X     1,3,5,7,9,11  1             N/A

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

ulimit soft: 1048576
