
[Bug] B200 DP Attention OOMs with FP4 Flashinfer cutlass moe at high concurrencies and has lower accuracy than TP #8942

@pavanimajety

Description


Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

When we run DP attention with the FlashInfer cutlass MoE, we hit an out-of-memory error at high concurrencies. The issue is not reproducible with --enable-flashinfer-trtllm-moe, which can process larger batches, and it is also not reproducible with DP attention off, most likely because the per-forward batch sizes are smaller in that case.
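To make the batch-size effect concrete, here is a back-of-the-envelope sketch (not sglang code) of why the MoE backend sees larger batches under DP attention: my understanding is that tokens from all DP ranks are gathered before the MoE layers, so the MoE forward processes roughly dp_size times the per-rank token count. The function name and the numbers are illustrative assumptions, not measured values.

```python
# Illustrative sketch only: with DP attention, activations are gathered
# across DP ranks before the MoE layers, so the MoE forward (and its
# workspace buffers) scale with dp_size * per-rank tokens.

def moe_batch_tokens(per_rank_running_tokens: int, dp_size: int) -> int:
    """Tokens the MoE layer processes per forward, assuming activations
    are gathered across all DP ranks (hypothetical helper)."""
    return per_rank_running_tokens * dp_size

# TP-only: one shared batch, MoE sees the same token count as attention.
tp_only = moe_batch_tokens(1024, dp_size=1)

# DP attention with dp=4: each rank admits its own batch, so the gathered
# MoE batch can be up to 4x larger than in the TP-only case.
dp_attention = moe_batch_tokens(1024, dp_size=4)

assert dp_attention == 4 * tp_only
```

If the cutlass MoE path sizes its workspaces for this gathered batch while the trtllm MoE path handles larger batches more gracefully, that would be consistent with the OOM appearing only in the DP + cutlass configuration.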

Additionally, accuracy with DP attention is lower:

| Attn Backend | Batch Size | Page Size | Accuracy | toks/s | Latency | Enable DP Attention |
|---|---|---|---|---|---|---|
| flashinfer + EFTM | 512 | 16 | 0.918 | 1851 | 87.9 | 1 |
| flashinfer + EFTM | 512 | 1 | 0.91 | 1631 | 97 | 1 |
| flashinfer + EFTM | 1319 | 1 | 0.95 | 1677 | 82 | 1 |
| flashinfer + EFCM | 512 | 16 | 0.963 | 1362 | 98 | 0 (TP) |
| flashinfer + EFCM | 1024 | 16 | 0.958 | 1417 | 94 | 0 (TP) |
| flashinfer + EFCM | 512 | 1 | - | - | - | 0 (TP) |
| flashinfer + EFCM | 1024 | 1 | 0.955 | 1379 | 98 | 0 (TP) |

Confirmed that the accuracy drop is not caused by the MoE backend itself, since a few DP attention cases still reach high accuracy.

EFTM: `--enable-flashinfer-trtllm-moe`
EFCM: `--enable-flashinfer-cutlass-moe`

Reproduction

Server

```shell
CUDA_VISIBLE_DEVICES=4,5,6,7 python3 -m sglang.launch_server --trust-remote-code --enable-dp-attention --disable-radix-cache --max-running-requests 1024 --chunked-prefill-size 32768 --mem-fraction-static 0.85 \
    --cuda-graph-max-bs 1024 --max-prefill-tokens 32768 --attention-backend flashinfer --model-path=/tmp/DeepSeek-R1-FP4/snapshots/574fdb8a5347fdbc06b2c18488699c0c17d71e05/ --host 0.0.0.0 --port 8000 --tensor-parallel-size=4 \
    --data-parallel-size=4 --enable-flashinfer-trtllm-moe --quantization modelopt_fp4 --page-size 16
```

Client:

```shell
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319 --port=8000
```

Environment

B200, lmsysorg/sglang:b200-cu129, and top of tree. I will update the environment details once I get another lease.
