[DeepseekV32] Enable flashmla_prefill kernel with fp8 kvcache #11655
Merged
Fridge003 merged 4 commits into sgl-project:main (Oct 28, 2025)
Conversation
Contributor
Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
Fridge003 reviewed Oct 25, 2025
Collaborator
@hlu1 Please fix the bug here https://github.com/sgl-project/sglang/actions/runs/18812434633/job/53676478829?pr=11655

Author
Will do.
Fixed.
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
Motivation
- Add logic to dequantize the kvcache from fp8 to bf16 in a separate kernel (`dequantize_k_cache_paged`) and use the `flashmla_prefill` kernel with fp8 kvcache.
- Add a `flashmla_auto` mode and use it as the default mode for prefill when fp8 kvcache is enabled.
- Use `flashmla_kv` with fp8 kvcache when spec decoding is detected.

Attention kernels used:
- `flashmla_decode` (before)
- `flashmla_prefill` with no kvcache reuse or chunked prefill (after)
- `flashmla_prefill` with kvcache reuse or chunked prefill (after)

(profiler screenshots omitted)
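As a rough illustration of the dequant step, here is a pure-Python sketch of dequantizing a paged K cache with per-block scales. The function name echoes `dequantize_k_cache_paged`, but the layout, signature, and the int-plus-scale stand-in for fp8 storage are all assumptions for illustration, not the actual kernel:

```python
def dequantize_k_cache_paged_sketch(k_cache_q, k_scales, block_table, seq_len):
    """Gather one sequence's pages from a paged, quantized K cache and
    dequantize them to floats (illustrative sketch only).

    k_cache_q:   list of physical blocks; each block is a list of rows,
                 each row a list of quantized ints (stand-in for fp8 values).
    k_scales:    per-block dequantization scale (stand-in for fp8 scales).
    block_table: physical block ids of this sequence's pages, in order.
    seq_len:     number of valid tokens (the last page may be partial).
    """
    out = []
    for blk in block_table:
        scale = k_scales[blk]
        for row in k_cache_q[blk]:
            if len(out) == seq_len:  # stop inside a partial last page
                return out
            # dequant: quantized value times the block's scale
            out.append([v * scale for v in row])
    return out
```

In the real kernel this is a GPU pass producing bf16; the point is only that dequantization is a gather-plus-scale step separate from the attention kernel itself.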
Accuracy Tests
gpqa (with an early fp4 checkpoint)
with fp8 kvcache
- before: ['0.768', '0.823', '0.773', '0.783']
- after: ['0.788', '0.798', '0.773', '0.798']
with bf16 kvcache
- after: ['0.818', '0.818', '0.828', '0.758']
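The mode selection described in the motivation (default to `flashmla_prefill` with fp8 kvcache, but use `flashmla_kv` when spec decoding is detected) can be sketched as a simple dispatch. The function name and string flags here are illustrative assumptions, not sglang's actual API:

```python
def choose_prefill_attention_mode(kv_cache_dtype: str, is_spec_decoding: bool) -> str:
    """Illustrative sketch of the flashmla_auto prefill dispatch (hypothetical API)."""
    if kv_cache_dtype == "fp8":
        if is_spec_decoding:
            # spec decoding detected: use flashmla_kv with the fp8 kvcache directly
            return "flashmla_kv"
        # default for fp8 kvcache: dequantize to bf16, then run flashmla_prefill
        return "flashmla_prefill"
    # bf16 kvcache: flashmla_prefill runs directly, no dequant pass needed
    return "flashmla_prefill"
```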
Benchmarking and Profiling
With fp4 checkpoint:
- before: input throughput 12536.68 tok/s
- after: input throughput 14610.87 tok/s
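For reference, the reported input-throughput numbers work out to roughly a 16.5% improvement:

```python
before_tok_s = 12536.68  # input throughput before, from the numbers above
after_tok_s = 14610.87   # input throughput after
speedup = after_tok_s / before_tok_s
improvement_pct = (speedup - 1.0) * 100
# roughly 1.17x, i.e. about 16.5% higher input throughput
```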
Checklist