Fix chunked prefix cache for nvfp4 by wenscarl · Pull Request #10180 · sgl-project/sglang

wenscarl · 2025-09-08T21:17:31Z

co-authored by @elfiegg

Motivation

Addresses issue #9806 (accuracy drop for nvidia/DeepSeek-R1-0528-FP4 with dp attention).

Modifications

Accuracy Tests

server.sh
# Run once per backend: set --attention-backend to trtllm_mla or flashinfer.
python3 -m sglang.launch_server \
  --model-path nvidia/DeepSeek-R1-0528-FP4 \
  --trust-remote-code \
  --attention-backend trtllm_mla \
  --disable-radix-cache \
  --max-running-requests 256 \
  --chunked-prefill-size 2048 \
  --mem-fraction-static 0.89 \
  --max-prefill-tokens 32768 \
  --disable-cuda-graph \
  --tp 8 \
  --dp 8 \
  --enable-dp-attention \
  --quantization modelopt_fp4

client.sh
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1310 --parallel 1310

flashinfer attention backend
before:
Accuracy: 0.510
Invalid: 0.480

after:
Accuracy: 0.948
Invalid: 0.000


trtllm_mla attention backend
before:
Accuracy: 0.421
Invalid: 0.429

after:
Accuracy: 0.952
Invalid: 0.001
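
For readers interpreting the two numbers above, here is a minimal sketch of how metrics of this kind are typically computed for GSM8K. It assumes that "Accuracy" is the fraction of questions whose extracted numeric answer matches the label and that "Invalid" is the fraction of outputs from which no number could be extracted. The helper names (`extract_answer`, `score`) and the sample outputs are hypothetical and are not taken from benchmark/gsm8k/bench_sglang.py.

```python
import re

# Sentinel for "no number found in the model output" (assumed convention).
INVALID = -9999999


def extract_answer(text: str) -> int:
    """Return the last integer in the text, or INVALID if there is none."""
    numbers = re.findall(r"-?\d+", text.replace(",", ""))
    return int(numbers[-1]) if numbers else INVALID


def score(outputs, labels):
    """Compute (accuracy, invalid) as fractions over all questions."""
    preds = [extract_answer(o) for o in outputs]
    accuracy = sum(p == l for p, l in zip(preds, labels)) / len(labels)
    invalid = sum(p == INVALID for p in preds) / len(preds)
    return accuracy, invalid


# Hypothetical outputs: one correct, one garbled (no number), one correct.
outputs = ["The answer is 18.", "!!!!!!!!", "So she pays 1,200 dollars."]
labels = [18, 3, 1200]
accuracy, invalid = score(outputs, labels)
print(f"Accuracy: {accuracy:.3f}")
print(f"Invalid: {invalid:.3f}")
```

Under this reading, an "Invalid" rate near 0.48 before the fix means that roughly half of the generations contained no parseable answer at all, which points to corrupted output rather than ordinary reasoning mistakes.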

Benchmarking and Profiling

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.

wenscarl

Work around cutlass kernel for chunked prefix

Revert "[NVIDIA] disable chunked prefix cache when dp and blackwell is used (sgl-project#9861)" This reverts commit 90dfe3d.

Fridge003 · 2025-09-09T21:49:32Z

@wenscarl Will the accuracy of DeepSeek-R1 NVFP4 return to normal after switching the flashinfer kernel to the FA2 version?

wenscarl · 2025-09-09T23:12:51Z

> @wenscarl Will the accuracy of DeepSeek-R1 NVFP4 return to normal after switching the flashinfer kernel to the FA2 version?

Yes. The results are in the PR description.

Fridge003

LGTM

elfiegg · 2025-09-10T16:42:04Z

Can we temporarily route the logic based on quantization? The FP8 model works well with the cutlass backend, and with FA2 its performance would drop to about half.
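
A minimal sketch of the kind of quantization-based routing suggested here, for illustration only. It is not the code merged in this PR: the enum, the function name, and the set of FP4 method names are hypothetical, and only the `modelopt_fp4` string comes from the launch command in the description.

```python
from enum import Enum
from typing import Optional


class PrefixKernel(Enum):
    """Hypothetical labels for the two kernel choices discussed in this thread."""
    CUTLASS = "cutlass"
    FA2 = "fa2"


# Assumption: only FP4 checkpoints need the workaround; FP8 and unquantized
# models keep the faster cutlass path.
FP4_QUANT_METHODS = {"modelopt_fp4"}


def choose_prefix_kernel(quantization: Optional[str]) -> PrefixKernel:
    """Pick the kernel for chunked prefix attention from the quantization method."""
    if quantization in FP4_QUANT_METHODS:
        return PrefixKernel.FA2
    return PrefixKernel.CUTLASS


print(choose_prefix_kernel("modelopt_fp4").value)
print(choose_prefix_kernel("fp8").value)
```

Routing this way would confine the FA2 slowdown to the models that show the accuracy regression, at the cost of a quantization-specific branch that would need to be removed once the cutlass kernel is fixed.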

elfiegg

Thanks for the quick fix!

kushanam · 2025-09-11T03:23:27Z

This PR already includes the changes from #10178.

Fridge003 · 2025-09-11T23:47:30Z

@wenscarl Please fix lint with

pre-commit run --all-files

fzyzcjy · 2025-09-12T09:52:49Z

looking forward to this

wenscarl force-pushed the chunked_prefix_fix branch from 2de5c0a to d944356 September 8, 2025 21:33

Fridge003 reviewed Sep 9, 2025

Comment thread python/sglang/srt/models/deepseek_v2.py Outdated

wenscarl commented Sep 9, 2025

wenscarl marked this pull request as ready for review September 9, 2025 19:04

wenscarl requested review from BBuf, Edwardf0t1, HaiShaw, Ying1123, ch-wan, hnyls2002, ispobock, kushanam, merrymercy and zhyncs as code owners September 9, 2025 19:04

wenscarl requested a review from Fridge003 September 9, 2025 19:04

wenscarl added 2 commits September 9, 2025 14:06

temp fix

5fcba1d

Revert "[NVIDIA] disable chunked prefix cache when dp and blackwell is used (sgl-project#9861)" This reverts commit 90dfe3d.

Work around cutlass kernel for chunked prefix

5074cbd

wenscarl force-pushed the chunked_prefix_fix branch from 80c7b3e to 5074cbd September 9, 2025 19:06

zhyncs reviewed Sep 9, 2025

Comment thread python/sglang/srt/layers/attention/flashinfer_mla_backend.py Outdated

zhyncs assigned yzh119 Sep 9, 2025

zhyncs added the high priority label Sep 9, 2025

Merge branch 'main' into chunked_prefix_fix

c19fd27

wenscarl requested a review from zhyncs September 10, 2025 02:32

Fridge003 approved these changes Sep 10, 2025

Fridge003 mentioned this pull request Sep 10, 2025

Add disable_chunked_prefix_cache feature to TRTLLM MLA #10178

Closed

4 tasks

yzh119 reviewed Sep 10, 2025

Comment thread python/sglang/srt/layers/attention/flashinfer_mla_backend.py Outdated

wenscarl force-pushed the chunked_prefix_fix branch from 704817d to 694443d September 10, 2025 19:48

wenscarl added 2 commits September 10, 2025 21:07

Fix lint

9385028

Merge remote-tracking branch 'origin/main' into chunked_prefix_fix

fdf2fd9

Fridge003 reviewed Sep 10, 2025

Comment thread python/sglang/srt/models/deepseek_v2.py Outdated

Dispatch to AttnForwardMethod.MHA

959d4a8

wenscarl force-pushed the chunked_prefix_fix branch from e1c1ee3 to 959d4a8 September 10, 2025 22:27

elfiegg reviewed Sep 10, 2025

Comment thread python/sglang/srt/models/deepseek_v2.py

elfiegg reviewed Sep 10, 2025

Comment thread python/sglang/srt/models/deepseek_v2.py Outdated

Fallback to dispatch_mla

9c8501a

elfiegg reviewed Sep 10, 2025

Comment thread python/sglang/srt/models/deepseek_v2.py Outdated

wenscarl and others added 3 commits September 10, 2025 23:53

Remove fp4 quantization checks and reroute for trtllm_mla backend

98199f4

add trtllm support

e86874c

Upd

d757d8c

Fridge003 added 2 commits September 10, 2025 20:58

Merge branch 'main' into chunked_prefix_fix

41ee1c8

Merge branch 'main' into chunked_prefix_fix

d1005dd

Fridge003 mentioned this pull request Sep 11, 2025

Potential accuracy regressions about Blackwell #10054

Closed

Fridge003 reviewed Sep 11, 2025

Comment thread python/sglang/srt/models/deepseek_v2.py Outdated

Comment thread python/sglang/srt/models/deepseek_v2.py Outdated

Comment thread python/sglang/srt/models/deepseek_v2.py Outdated

wenscarl and others added 2 commits September 11, 2025 22:58

Address comment

ab652f3

Merge branch 'main' into chunked_prefix_fix

d1189ee

Fridge003 added the ready-to-merge label (the PR is ready to merge after the CI is green) Sep 11, 2025

Fridge003 self-assigned this Sep 11, 2025

wenscarl added 2 commits September 12, 2025 04:32

fix trtllm_mla

aaae347

fix lint

fb63dbb

wenscarl requested review from Fridge003 September 12, 2025 04:38

Merge branch 'main' into chunked_prefix_fix

1d01fb4

zhyncs merged commit 36acd2f into sgl-project:main Sep 12, 2025
176 of 193 checks passed

rainj-me mentioned this pull request Sep 12, 2025

[Bug] Accuracy drop for nvidia/DeepSeek-R1-0528-FP4 with dp attention #9806

Closed

5 tasks

7 participants
