
[Bug] B200 DP Attention OOMs with FP4 Flashinfer cutlass moe at high concurrencies and has lower accuracy than TP #8942

@pavanimajety

Description


Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

When we run DP attention with the FlashInfer cutlass MoE, we hit an out-of-memory error at high concurrencies. The issue is not reproducible with --enable-flashinfer-trtllm-moe, which can process larger batches, and it is also not reproducible with DP attention off, most likely because the per-forward batch sizes are smaller in that case.
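To make the batch-size effect concrete, here is a back-of-the-envelope sketch (not sglang code) of why the MoE backend sees larger batches under DP attention: my understanding is that tokens from all DP ranks are gathered before the MoE layers, so the MoE forward processes roughly dp_size times the per-rank token count. The function name and the numbers are illustrative assumptions, not measured values.

```python
# Illustrative sketch only: with DP attention, activations are gathered
# across DP ranks before the MoE layers, so the MoE forward (and its
# workspace buffers) scale with dp_size * per-rank tokens.

def moe_batch_tokens(per_rank_running_tokens: int, dp_size: int) -> int:
    """Tokens the MoE layer processes per forward, assuming activations
    are gathered across all DP ranks (hypothetical helper)."""
    return per_rank_running_tokens * dp_size

# TP-only: one shared batch, MoE sees the same token count as attention.
tp_only = moe_batch_tokens(1024, dp_size=1)

# DP attention with dp=4: each rank admits its own batch, so the gathered
# MoE batch can be up to 4x larger than in the TP-only case.
dp_attention = moe_batch_tokens(1024, dp_size=4)

assert dp_attention == 4 * tp_only
```

If the cutlass MoE path sizes its workspaces for this gathered batch while the trtllm MoE path handles larger batches more gracefully, that would be consistent with the OOM appearing only in the DP + cutlass configuration.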

Additionally, accuracy with DP attention is lower:

| Attn Backend | Batch Size | Page Size | Accuracy | toks/s | Latency | Enable DP Attention |
|---|---|---|---|---|---|---|
| flashinfer + EFTM | 512 | 16 | 0.918 | 1851 | 87.9 | 1 |
| flashinfer + EFTM | 512 | 1 | 0.91 | 1631 | 97 | 1 |
| flashinfer + EFTM | 1319 | 1 | 0.95 | 1677 | 82 | 1 |
| flashinfer + EFCM | 512 | 16 | 0.963 | 1362 | 98 | 0 (TP) |
| flashinfer + EFCM | 1024 | 16 | 0.958 | 1417 | 94 | 0 (TP) |
| flashinfer + EFCM | 512 | 1 | - | - | - | 0 (TP) |
| flashinfer + EFCM | 1024 | 1 | 0.955 | 1379 | 98 | 0 (TP) |

Confirmed that the accuracy drop is not caused by the MoE backend itself, since a few DP attention cases still reach high accuracy.

EFTM: `--enable-flashinfer-trtllm-moe`
EFCM: `--enable-flashinfer-cutlass-moe`

Reproduction

Server

```shell
CUDA_VISIBLE_DEVICES=4,5,6,7 python3 -m sglang.launch_server --trust-remote-code --enable-dp-attention --disable-radix-cache --max-running-requests 1024 --chunked-prefill-size 32768 --mem-fraction-static 0.85 \
    --cuda-graph-max-bs 1024 --max-prefill-tokens 32768 --attention-backend flashinfer --model-path=/tmp/DeepSeek-R1-FP4/snapshots/574fdb8a5347fdbc06b2c18488699c0c17d71e05/ --host 0.0.0.0 --port 8000 --tensor-parallel-size=4 \
    --data-parallel-size=4 --enable-flashinfer-trtllm-moe --quantization modelopt_fp4 --page-size 16
```

Client:

```shell
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319 --port=8000
```

Environment

B200, lmsysorg/sglang:b200-cu129, and top of tree. I will update the environment details once I get another lease.
