Fix scale_step_k computation in the fp8_kernel #20819
Merged
BBuf merged 2 commits into sgl-project:main, Mar 20, 2026
BBuf
approved these changes
Mar 18, 2026
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request, Mar 21, 2026
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
Motivation
According to the kernel design, group_k is expected to be divisible by BLOCK_SIZE_K. However, when BLOCK_SIZE_K is smaller than group_k, scale_step_k is always computed as 0, which prevents the scaling pointer from advancing.
For example, with BLOCK_SIZE_K = 64 and group_k = 128, the current implementation results in scale_step_k = 0.
This fix ensures the kernel correctly handles such cases by properly updating the scaling pointer.
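To make the failure concrete, here is a plain-Python sketch of the pointer arithmetic only; it is not the Triton kernel. It assumes the old step came from the integer division BLOCK_SIZE_K // group_k, which is consistent with the value 0 quoted above but is not confirmed by this description. The function names and blocks_per_group are hypothetical, and the second function only mimics the counter-based advance described under Modifications.

```python
def scale_indices_naive(K, block_size_k, group_k):
    """Scale index read by each K block when the pointer moves by a fixed step."""
    # Assumed old formula: evaluates to 0 whenever block_size_k < group_k.
    step = block_size_k // group_k
    num_blocks = -(-K // block_size_k)  # ceiling division
    idx, seen = 0, []
    for _ in range(num_blocks):
        seen.append(idx)
        idx += step
    return seen


def scale_indices_fixed(K, block_size_k, group_k):
    """Advance the scale index by 1 only after the last block of each group."""
    assert group_k % block_size_k == 0, "group_k must be divisible by BLOCK_SIZE_K"
    blocks_per_group = group_k // block_size_k  # hypothetical name
    num_blocks = -(-K // block_size_k)
    idx, seen = 0, []
    for k in range(num_blocks):
        seen.append(idx)
        if (k + 1) % blocks_per_group == 0:
            idx += 1
    return seen


def scale_indices_reference(K, block_size_k, group_k):
    """Ground truth: the group that contains the first element of each K block."""
    num_blocks = -(-K // block_size_k)
    return [(k * block_size_k) // group_k for k in range(num_blocks)]


if __name__ == "__main__":
    print(scale_indices_naive(512, 64, 128))
    print(scale_indices_fixed(512, 64, 128))
```

With K = 512, BLOCK_SIZE_K = 64 and group_k = 128, the naive variant reads scale index 0 for all eight blocks, while the counter-based variant matches the reference mapping of two blocks per scale.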
Modifications
Compute the number of blocks per group_k (i.e., how many blocks share the same scaling parameters).
Set scale_step_k to 1 only after the last block in the group has consumed the shared scaling parameters, ensuring the pointer advances correctly.
Accuracy Tests
Server Launching Scripts:
E2E Accuracy Tests
MMLU
GSM8k
Kernel Accuracy Tests

The maximum element-wise difference of 1 can be attributed to accumulated numerical errors when using BLOCK_SIZE_K = 64. However, in this configuration the current kernel (without this fix) is more prone to error accumulation, which can lead to noticeable inaccuracies.
Benchmarking and Profiling
Kernel Performance Comparison

This PR adds a tiny overhead (on the order of microseconds), which is an acceptable cost for making the computation logic correct.
E2E Benchmark
Before:
After:
The kernel correctness check and performance benchmark are available at https://github.com/Muqi1029/Awesome-LLM-Training-Serving/blob/main/tutorial/triton/benchmark/fp8_kernel.py