
Fix scale_step_k computation in the fp8_kernel#20819

Merged
BBuf merged 2 commits into sgl-project:main from Muqi1029:bugfix/fp8_kernel
Mar 20, 2026

Conversation

@Muqi1029
Contributor

@Muqi1029 Muqi1029 commented Mar 18, 2026

Motivation

According to the kernel design, group_k is expected to be divisible by BLOCK_SIZE_K. However, when BLOCK_SIZE_K is smaller than group_k, scale_step_k is always computed as 0, which prevents the scaling pointer from advancing.

For example, with BLOCK_SIZE_K = 64 and group_k = 128, the current implementation results in scale_step_k = 0.

This fix ensures the kernel correctly handles such cases by properly updating the scaling pointer.
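The failure mode can be reproduced with plain integer arithmetic (a minimal illustration; the exact kernel expression is assumed from the description, not quoted from the source):

```python
# Hypothetical reproduction of the bug described above: if scale_step_k
# is derived by integer-dividing BLOCK_SIZE_K by group_k, the result
# truncates to 0 whenever BLOCK_SIZE_K < group_k, so the scale pointer
# is advanced by 0 on every K iteration.
BLOCK_SIZE_K = 64
group_k = 128

scale_step_k = BLOCK_SIZE_K // group_k
print(scale_step_k)  # 0 -> the scaling pointer never moves
```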

Modifications

  1. Compute the number of blocks within each group_k (i.e., how many blocks share the same scaling parameters).
  2. Update scale_step_k to 1 only after the last block in the group has consumed the shared scaling parameters, ensuring the pointer advances correctly.
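The two steps above can be sketched in plain Python (a hypothetical simulation of the pointer-advance logic, not the actual Triton kernel code; function and variable names are illustrative):

```python
def scale_indices(num_k_blocks: int, block_size_k: int, group_k: int):
    """Simulate which scale entry each K block reads when the scale
    pointer advances only after the last block of each group has
    consumed the shared scaling parameters (the fix described here)."""
    # Step 1: number of blocks that share one set of scaling parameters.
    blocks_per_group = group_k // block_size_k  # e.g. 128 // 64 = 2
    indices = []
    scale_ptr = 0
    for kb in range(num_k_blocks):
        indices.append(scale_ptr)
        # Step 2: advance by 1 only once the whole group is consumed.
        if (kb + 1) % blocks_per_group == 0:
            scale_ptr += 1
    return indices

print(scale_indices(6, 64, 128))   # [0, 0, 1, 1, 2, 2]
print(scale_indices(4, 128, 128))  # [0, 1, 2, 3]
```

With `BLOCK_SIZE_K = 64` and `group_k = 128`, consecutive pairs of K blocks now correctly share one scale entry instead of all blocks reading entry 0.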

Accuracy Tests

Server Launching Scripts:

# before: using the default fp8 config, which means BLOCK_SIZE_K is always equal to group_k
python -m sglang.launch_server --model-path /models/Qwen3.5-35B-A3B-FP8/0b2752837483aa34b3db6e83e151b150c0e00e49 --tp-size 1 --reasoning-parser qwen3 --tool-call-parser qwen3_coder --port 8888 --mem-fraction-static 0.9


# after: using the tuned fp8 config
python -m sglang.launch_server --model-path /models/Qwen3.5-35B-A3B-FP8/0b2752837483aa34b3db6e83e151b150c0e00e49 --tp-size 1 --reasoning-parser qwen3 --tool-call-parser qwen3_coder --port 8889 --mem-fraction-static 0.9

E2E Accuracy Tests

MMLU

# before
python3 -m sglang.test.run_eval --port 8888 --eval-name mmlu --num-examples 200 --repeat 3 --temperature 1.0 --top-p 0.95 --top-k 20 --min-p 0 --max-tokens 4096
[screenshot: MMLU results before the fix]
# after
python3 -m sglang.test.run_eval --port 8889 --eval-name mmlu --num-examples 200 --repeat 3 --temperature 1.0 --top-p 0.95 --top-k 20 --min-p 0 --max-tokens 4096
[screenshot: MMLU results after the fix]

GSM8k

[screenshot: GSM8K results]

Kernel Accuracy Tests
[screenshot: kernel accuracy comparison]

The maximum element-wise difference of 1 can be attributed to accumulated numerical error when BLOCK_SIZE_K = 64 is used. In this configuration, the unfixed kernel is more prone to error accumulation, which can lead to noticeable inaccuracies.

Benchmarking and Profiling

Kernel Performance Comparison
[screenshot: kernel performance comparison]
This PR introduces a tiny (microsecond-level) overhead, which is acceptable given that it makes the computation logic correct.

E2E Benchmark
Before:

| random_input_len | random_output_len | request_rate | max_concurrency | mean_ttft_ms | p99_ttft_ms | mean_tpot_ms | p99_tpot_ms | mean_itl_ms | p99_itl_ms | mean_e2e_latency_ms | p99_e2e_latency_ms | output_throughput |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1200.00 | 800.00 | 1.00 | 1.00 | 117.80 | 144.34 | 10.36 | 10.45 | 10.37 | 10.77 | 8394.76 | 8463.76 | 95.29 |
| 1200.00 | 800.00 | 2.00 | 2.00 | 118.03 | 130.97 | 12.50 | 12.55 | 12.50 | 12.73 | 10102.16 | 10148.59 | 158.29 |
| 1200.00 | 800.00 | 4.00 | 4.00 | 141.86 | 167.82 | 15.57 | 15.76 | 15.58 | 15.91 | 12585.41 | 12714.18 | 253.86 |
| 1200.00 | 800.00 | 8.00 | 8.00 | 439.05 | 716.62 | 21.40 | 22.05 | 21.44 | 21.94 | 17538.00 | 17783.10 | 364.42 |
| 1200.00 | 800.00 | 16.00 | 16.00 | 5689.39 | 23048.81 | 27.21 | 28.40 | 27.24 | 28.15 | 27433.15 | 45576.83 | 450.99 |

After:

| random_input_len | random_output_len | request_rate | max_concurrency | mean_ttft_ms | p99_ttft_ms | mean_tpot_ms | p99_tpot_ms | mean_itl_ms | p99_itl_ms | mean_e2e_latency_ms | p99_e2e_latency_ms | output_throughput |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1200.00 | 800.00 | 1.00 | 1.00 | 116.68 | 140.34 | 8.84 | 8.90 | 8.87 | 9.20 | 7176.15 | 7237.19 | 111.47 |
| 1200.00 | 800.00 | 2.00 | 2.00 | 114.51 | 124.02 | 11.54 | 11.63 | 11.55 | 11.89 | 9334.51 | 9405.63 | 171.30 |
| 1200.00 | 800.00 | 4.00 | 4.00 | 142.74 | 220.35 | 14.48 | 14.70 | 14.49 | 14.99 | 11713.21 | 11863.03 | 272.73 |
| 1200.00 | 800.00 | 8.00 | 8.00 | 454.02 | 733.78 | 20.75 | 21.51 | 20.94 | 23.94 | 17035.50 | 17333.18 | 375.16 |
| 1200.00 | 800.00 | 16.00 | 16.00 | 5575.56 | 22461.90 | 26.45 | 27.59 | 26.52 | 27.48 | 26711.13 | 44461.92 | 463.62 |

The kernel correctness check and performance benchmark can be found at https://github.com/Muqi1029/Awesome-LLM-Training-Serving/blob/main/tutorial/triton/benchmark/fp8_kernel.py

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.


@BBuf
Collaborator

BBuf commented Mar 18, 2026

/tag-and-rerun-ci

@BBuf
Collaborator

BBuf commented Mar 19, 2026

/rerun-failed-ci

@BBuf BBuf merged commit 2099943 into sgl-project:main Mar 20, 2026
234 of 261 checks passed
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
@Muqi1029 Muqi1029 deleted the bugfix/fp8_kernel branch March 23, 2026 02:58
0-693 pushed a commit to 0-693/sglang that referenced this pull request Mar 25, 2026
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
dutsc pushed a commit to dutsc/sglang that referenced this pull request Mar 30, 2026
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
