Checklist
Describe the bug
When we run DP Attention with the FlashInfer CUTLASS MoE backend, we hit an out-of-memory error. This is not reproducible with --enable-flashinfer-trtllm-moe, since that backend can process larger batches. It is also not reproducible when DP is off, most likely because the batch sizes are smaller.
Additionally, accuracy with DP Attention is lower:
| Attn Backend | Batch Size | Page Size | Accuracy | toks/s | Latency | Enable DP Attention |
|---|---|---|---|---|---|---|
| flashinfer + EFTM | 512 | 16 | 0.918 | 1851 | 87.9 | 1 |
| flashinfer + EFTM | 512 | 1 | 0.91 | 1631 | 97 | 1 |
| flashinfer + EFTM | 1319 | 1 | 0.95 | 1677 | 82 | 1 |
| flashinfer + EFCM | 512 | 16 | 0.963 | 1362 | 98 | 0 - TP |
| flashinfer + EFCM | 1024 | 16 | 0.958 | 1417 | 94 | 0 - TP |
| flashinfer + EFCM | 512 | 1 | - | - | - | 0 - TP |
| flashinfer + EFCM | 1024 | 1 | 0.955 | 1379 | 98 | 0 - TP |
Confirmed that the accuracy drop is not caused by the MoE backend itself, since a few DP cases still reach high accuracy.
EFTM: `--enable-flashinfer-trtllm-moe`
EFCM: `--enable-flashinfer-cutlass-moe`
Reproduction
Server
```shell
CUDA_VISIBLE_DEVICES=4,5,6,7 python3 -m sglang.launch_server --trust-remote-code --enable-dp-attention --disable-radix-cache --max-running-requests 1024 --chunked-prefill-size 32768 --mem-fraction-static 0.85 \
  --cuda-graph-max-bs 1024 --max-prefill-tokens 32768 --attention-backend flashinfer --model-path=/tmp/DeepSeek-R1-FP4/snapshots/574fdb8a5347fdbc06b2c18488699c0c17d71e05/ --host 0.0.0.0 --port 8000 --tensor-parallel-size=4 \
  --data-parallel-size=4 --enable-flashinfer-trtllm-moe --quantization modelopt_fp4 --page-size 16
```
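Note that the command above launches the TRT-LLM MoE backend. To reproduce the OOM described in this report, presumably the only change needed is swapping the MoE flag to CUTLASS while keeping everything else the same (a sketch; flag placement is my assumption):

```shell
# Same launch as above, but with the CUTLASS MoE backend that triggers the OOM.
CUDA_VISIBLE_DEVICES=4,5,6,7 python3 -m sglang.launch_server --trust-remote-code --enable-dp-attention --disable-radix-cache --max-running-requests 1024 --chunked-prefill-size 32768 --mem-fraction-static 0.85 \
  --cuda-graph-max-bs 1024 --max-prefill-tokens 32768 --attention-backend flashinfer --model-path=/tmp/DeepSeek-R1-FP4/snapshots/574fdb8a5347fdbc06b2c18488699c0c17d71e05/ --host 0.0.0.0 --port 8000 --tensor-parallel-size=4 \
  --data-parallel-size=4 --enable-flashinfer-cutlass-moe --quantization modelopt_fp4 --page-size 16
```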
Client:
```shell
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319 --port=8000
```
Environment
B200, lmsysorg/sglang:b200-cu129
Also needs re-testing at top of tree; I will update the environment details once I get another lease.