
[Bug] Tensor shape is wrong when cudagraph+enable_dp_attention #7951

@lingjiew


Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please start a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

I tried to run the DeepSeek-R1 FP4 model (DSR1) on 8xB200 and hit an issue when CUDA graph and DP attention are both enabled: the input tensor of every MoE layer is padded to the global batch size. For example, with a global batch size of 4096 and an attention DP size of 8, each rank should hold 512 decode requests, so the M dimension of the local MoE input should be 512.
When profiling, I found that with CUDA graph on, the MoE input on each rank has M = 4096, not 512. With CUDA graph off, the MoE input on each rank has M = 512, which is what I expect.
Is this a known limitation or a bug?
Profiler trace without CUDA graph: [screenshot not reproduced]

Profiler trace with CUDA graph: [screenshot not reproduced]
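To make the expected shapes concrete, here is a minimal Python sketch of the arithmetic above. It assumes one decode token per request and that a CUDA-graph replay pads the batch to the smallest size in `--cuda-graph-bs` that fits. The helper names `expected_local_m` and `padded_graph_bs` are hypothetical and are not SGLang functions. Padding per rank would give M = 512, whereas padding the global batch gives the observed M = 4096.

```python
import bisect

# Values from this report: global decode batch 4096, attention DP size 8.
GLOBAL_BS = 4096
DP_SIZE = 8

# The --cuda-graph-bs list from the reproduction command (must be sorted).
CUDA_GRAPH_BS = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096]


def expected_local_m(global_bs: int, dp_size: int) -> int:
    """Decode tokens each attention-DP rank should hold (one token per request)."""
    if global_bs % dp_size != 0:
        raise ValueError("global batch is not evenly divisible across DP ranks")
    return global_bs // dp_size


def padded_graph_bs(num_tokens: int, capture_bs: list[int]) -> int:
    """Hypothetical: smallest captured CUDA-graph batch size that fits num_tokens."""
    idx = bisect.bisect_left(capture_bs, num_tokens)
    if idx == len(capture_bs):
        raise ValueError("batch exceeds the largest captured CUDA graph")
    return capture_bs[idx]


local_m = expected_local_m(GLOBAL_BS, DP_SIZE)
print("expected local M:", local_m)
print("padded per rank:", padded_graph_bs(local_m, CUDA_GRAPH_BS))
print("padded to global batch:", padded_graph_bs(GLOBAL_BS, CUDA_GRAPH_BS))
```

Under these assumptions, a per-rank padded shape of 512 would be the expected outcome, so an MoE input of M = 4096 on every rank means each rank is doing 8x the expected MoE work.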

Reproduction

Server:
```shell
python3 -m sglang.launch_server \
  --model-path nvidia/DeepSeek-R1-0528-FP4 \
  --trust-remote-code \
  --quantization modelopt_fp4 \
  --dp-size 8 --enable-dp-attention --enable-dp-lm-head \
  --tp-size 8 \
  --attention-backend cutlass_mla \
  --enable-ep-moe \
  --enable-flashinfer-moe \
  --cuda-graph-bs 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 \
  --chunked-prefill-size 16384 \
  --mem-fraction-static 0.85 \
  --max-running-requests 4096 \
  --stream-interval 5
```

Client:
`benchmark_serving.py` with input/output sequence lengths (ISL/OSL) of 1024/1024 and a concurrency of 4096.

Environment

Latest `main` branch.
