[DeepseekV32] Enable flashmla_prefill kernel with fp8 kvcache#11655

Merged
Fridge003 merged 4 commits into sgl-project:main from hlu1:dsv32
Oct 28, 2025

Conversation

@hlu1 (Collaborator) commented Oct 15, 2025

Motivation

Add logic to dequantize the KV cache from fp8 to bf16 in a separate kernel, and use the flashmla_prefill kernel with an fp8 KV cache.

  • Verify accuracy and performance on both H200 and B200.
  • Fuse the top-k transformations into the top-k kernel.
  • Add a new fused Triton kernel, dequantize_k_cache_paged.
  • Add heuristics to switch between flashmla_prefill and flashmla_decode automatically. Also add a flashmla_auto mode and use it as the default prefill mode when the fp8 KV cache is enabled.
  • Adjust default settings for both fp8 and bf16 KV caches on Blackwell.
  • Make it compatible with MTP. Currently, prefill is dispatched to flashmla_kv with an fp8 KV cache when speculative decoding is detected.
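The dequantization step above can be sketched as a NumPy reference (illustrative only: the names, tensor layout, and per-token scale scheme here are assumptions, and int8 symmetric quantization stands in for fp8 since NumPy has no fp8 dtype; the actual dequantize_k_cache_paged is a fused Triton kernel producing bf16):

```python
import numpy as np

def dequantize_k_cache_paged_ref(k_cache_q, scales, block_table, seq_len, page_size):
    """Reference for dequantizing a paged, quantized K cache.

    k_cache_q  : (num_pages, page_size, head_dim) int8, quantized cache
    scales     : (num_pages, page_size) float32, per-token scales
    block_table: (max_pages,) int32, logical -> physical page mapping
    Returns a contiguous (seq_len, head_dim) float32 tensor
    (the real kernel writes bf16).
    """
    head_dim = k_cache_q.shape[-1]
    num_pages_needed = (seq_len + page_size - 1) // page_size
    out = np.empty((num_pages_needed * page_size, head_dim), dtype=np.float32)
    for logical in range(num_pages_needed):
        phys = block_table[logical]                 # gather the physical page
        page = k_cache_q[phys].astype(np.float32)   # widen quantized values
        page *= scales[phys][:, None]               # apply per-token scale
        out[logical * page_size:(logical + 1) * page_size] = page
    return out[:seq_len]                            # drop padding in last page
```

A Triton version would parallelize over pages and fuse the gather, widen, and scale steps into one pass over the cache.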

flashmla_decode (before): [profiler screenshot]

flashmla_prefill with no KV cache reuse or chunked prefill (after): [profiler screenshot]

flashmla_prefill with KV cache reuse or chunked prefill (after): [profiler screenshot]
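The kernel selection described in the bullets above can be pictured as a small dispatch function (a minimal sketch with hypothetical names; the actual selection logic in SGLang may consider more signals, such as sequence lengths):

```python
from enum import Enum

class NsaPrefillMode(Enum):
    FLASHMLA_PREFILL = "flashmla_prefill"
    FLASHMLA_DECODE = "flashmla_decode"
    FLASHMLA_KV = "flashmla_kv"

def pick_prefill_kernel(kv_cache_dtype, is_spec_decoding, mode="flashmla_auto"):
    """Hypothetical sketch of the flashmla_auto heuristic.

    Mirrors the PR description: with speculative decoding and an fp8 KV
    cache, prefill goes to flashmla_kv; otherwise an fp8 KV cache prefers
    flashmla_prefill (after dequantizing the cache to bf16).
    """
    if mode != "flashmla_auto":
        return NsaPrefillMode(mode)       # explicit mode overrides heuristic
    if kv_cache_dtype == "fp8":
        if is_spec_decoding:
            return NsaPrefillMode.FLASHMLA_KV
        return NsaPrefillMode.FLASHMLA_PREFILL
    return NsaPrefillMode.FLASHMLA_DECODE
```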

Accuracy Tests

gpqa (with an early fp4 checkpoint)
with fp8 kvcache
before: ['0.768', '0.823', '0.773', '0.783']
after: ['0.788', '0.798', '0.773', '0.798']

with bf16 kvcache
after: ['0.818', '0.818', '0.828', '0.758']

Benchmarking and Profiling

With fp4 checkpoint:

python -m sglang.bench_one_batch_server --model-path $MODEL --tp 4 --dp 4 --enable-dp-attention --batch 64 --input-len 8192 --output-len 1  --nsa-prefill flashmla_prefill

before:
input throughput: 12536.68 tok/s

after:
input throughput: 14610.87 tok/s (~16.5% improvement)

Checklist

@hlu1 hlu1 marked this pull request as draft October 15, 2025 06:55
@hlu1 hlu1 force-pushed the dsv32 branch 2 times, most recently from a113313 to 264ce30, October 18, 2025 23:35
@hlu1 hlu1 marked this pull request as ready for review October 18, 2025 23:41
@hlu1 hlu1 self-assigned this Oct 19, 2025
@hlu1 hlu1 added the run-ci label Oct 19, 2025
@hlu1 hlu1 force-pushed the dsv32 branch 2 times, most recently from a926a96 to d514e1f, October 22, 2025 19:42
@hlu1 hlu1 force-pushed the dsv32 branch 3 times, most recently from f8974a1 to 00a769d, October 23, 2025 05:43
@hlu1 hlu1 force-pushed the dsv32 branch 3 times, most recently from 6f3c28d to aae926a, October 26, 2025 01:53
@Fridge003 (Collaborator) commented

@hlu1 The Configuration Tips section of the documentation needs to be updated in a follow-up PR after this PR is merged.

@Fridge003 (Collaborator) left a comment

Great work!

@hlu1 (Collaborator, Author) commented Oct 26, 2025

> @hlu1 The Configuration Tips section of the documentation needs to be updated in a follow-up PR after this PR is merged.

Will do.

> @hlu1 Please fix the bug here https://github.com/sgl-project/sglang/actions/runs/18812434633/job/53676478829?pr=11655

Fixed.

@Fridge003 (Collaborator) left a comment

Wonderful work!

@Fridge003 Fridge003 merged commit 81a632a into sgl-project:main Oct 28, 2025
90 of 108 checks passed
@hlu1 hlu1 deleted the dsv32 branch November 14, 2025 21:59