Skip to content

Fix chunked prefix cache for nvfp4#10180

Merged
zhyncs merged 19 commits intosgl-project:mainfrom
wenscarl:chunked_prefix_fix
Sep 12, 2025
Merged

Fix chunked prefix cache for nvfp4#10180
zhyncs merged 19 commits intosgl-project:mainfrom
wenscarl:chunked_prefix_fix

Conversation

@wenscarl
Copy link
Copy Markdown
Collaborator

@wenscarl wenscarl commented Sep 8, 2025

co-authored by @elfiegg

Motivation

To address issue 9806

Modifications

Accuracy Tests

server.sh
python3 -m sglang.launch_server \
  --model-path nvidia/DeepSeek-R1-0528-FP4 \
  --trust-remote-code \
  --attention-backend trtllm_mla \ or flashinfer \
  --disable-radix-cache \
  --max-running-requests 256 \
  --chunked-prefill-size 2048 \
  --mem-fraction-static 0.89 \
  --max-prefill-tokens 32768 \
  --disable-cuda-graph \
  --tp 8 \
  --dp 8 \
  --enable-dp-attention \
  --quantization modelopt_fp4 

client.sh
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1310 --parallel 1310

flashinfer attention backend
before:
Accuracy: 0.510
Invalid: 0.480

after:
Accuracy: 0.948
Invalid: 0.000


trtllm_mla attention backend
before:
Accuracy: 0.421
Invalid: 0.429

after:
Accuracy: 0.952
Invalid: 0.001

Benchmarking and Profiling

Checklist

Comment thread python/sglang/srt/models/deepseek_v2.py Outdated
Copy link
Copy Markdown
Collaborator Author

@wenscarl wenscarl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Work around cutlass kernel for chunked prefix

Revert "[NVIDIA] disable chunked prefix cache when dp and blackwell is used (sgl-project#9861)"

This reverts commit 90dfe3d.
Comment thread python/sglang/srt/layers/attention/flashinfer_mla_backend.py Outdated
@Fridge003
Copy link
Copy Markdown
Collaborator

@wenscarl Will the accuracy of dpsk-r1 nvfp4 be back to normal after changing flashinfer kernel to fa2 version?

@wenscarl
Copy link
Copy Markdown
Collaborator Author

wenscarl commented Sep 9, 2025

@wenscarl Will the accuracy of dpsk-r1 nvfp4 be back to normal after changing flashinfer kernel to fa2 version?
Yes. It's in the description.

@wenscarl wenscarl requested a review from zhyncs September 10, 2025 02:32
Copy link
Copy Markdown
Collaborator

@Fridge003 Fridge003 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment thread python/sglang/srt/layers/attention/flashinfer_mla_backend.py Outdated
@elfiegg
Copy link
Copy Markdown
Collaborator

elfiegg commented Sep 10, 2025

can we route the logic based on quantization temporarily ? FP8 model works well with cutlass backend, and using FA2 the perf would drop to 1/2.

Comment thread python/sglang/srt/models/deepseek_v2.py Outdated
Copy link
Copy Markdown
Collaborator

@elfiegg elfiegg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick fix!

Comment thread python/sglang/srt/models/deepseek_v2.py
Comment thread python/sglang/srt/models/deepseek_v2.py Outdated
Comment thread python/sglang/srt/models/deepseek_v2.py Outdated
@kushanam
Copy link
Copy Markdown
Collaborator

This PR already includes the changes from #10178

Comment thread python/sglang/srt/models/deepseek_v2.py Outdated
Comment thread python/sglang/srt/models/deepseek_v2.py Outdated
Comment thread python/sglang/srt/models/deepseek_v2.py Outdated
@Fridge003 Fridge003 added the ready-to-merge The PR is ready to merge after the CI is green. label Sep 11, 2025
@Fridge003 Fridge003 self-assigned this Sep 11, 2025
@Fridge003
Copy link
Copy Markdown
Collaborator

@wenscarl Please fix lint with

pre-commit run --all-files

@fzyzcjy
Copy link
Copy Markdown
Collaborator

fzyzcjy commented Sep 12, 2025

looking forward to this

@zhyncs zhyncs merged commit 36acd2f into sgl-project:main Sep 12, 2025
176 of 193 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

high priority ready-to-merge The PR is ready to merge after the CI is green.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants