### Describe the bug
I tried to run the DSR1 FP4 model on 8xB200 and found that when CUDA graph and attention DP are both enabled, the input tensor for each MoE layer is padded to the global batch size. For example, with a global batch size of 4096 and attention DP of 8, each rank should handle 512 decode requests, so the input tensor's M dimension on each rank should be 512.
However, profiling shows that with CUDA graph enabled, each rank's input M dimension is 4096, not 512. With CUDA graph disabled, each rank's input M dimension is 512, which looks correct.
Is this a known limitation or a bug?
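For reference, the M dimension I expect per rank is just the global batch size divided by the attention DP size. A minimal sketch (the function name and variables are illustrative, not SGLang internals):

```python
# Expected MoE input M dimension per rank under attention DP during decode.
# Illustrative helper, not an actual SGLang internal.
def expected_local_m(global_bs: int, attn_dp_size: int) -> int:
    assert global_bs % attn_dp_size == 0, "global batch must divide evenly across DP ranks"
    return global_bs // attn_dp_size

print(expected_local_m(4096, 8))  # 512, matching the M dim observed without cudagraph
```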
Without cudagraph: (profiler screenshot)
With cudagraph: (profiler screenshot)
### Reproduction
**Server:**
```shell
python3 -m sglang.launch_server \
  --model-path nvidia/DeepSeek-R1-0528-FP4 \
  --trust-remote-code \
  --quantization modelopt_fp4 \
  --dp-size 8 --enable-dp-attention --enable-dp-lm-head \
  --tp-size 8 \
  --attention-backend cutlass_mla \
  --enable-ep-moe \
  --enable-flashinfer-moe \
  --cuda-graph-bs 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 \
  --chunked-prefill-size 16384 \
  --mem-fraction-static 0.85 \
  --max-running-requests 4096 \
  --stream-interval 5
```
**Client:**
`benchmark_serving.py` with ISL/OSL 1024/1024 and concurrency 4096.
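A concrete client invocation might look like the following. The flag names here are an assumption based on common serving-benchmark scripts and may differ from the exact `benchmark_serving.py` version used; treat this as a sketch, not the exact command:

```shell
python3 benchmark_serving.py \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --max-concurrency 4096
```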
### Environment
Latest main.