### Describe the bug
I tried to run the DSR1 FP4 model on 8xB200 and found that when CUDA graph and attention DP are both enabled, the input tensor for each MoE layer is padded to the global batch size. For example, with a global batch size of 4096 and attention DP of 8, each rank should handle 512 decode requests, so the input tensor's M dimension on each rank should be 512.
However, profiling shows that with CUDA graph enabled, each rank's input M dimension is 4096, not 512. With CUDA graph disabled, each rank's input M dimension is 512, which looks correct.
Is this a known limitation or a bug?
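For reference, the M dimension I expect per rank is just the global batch size divided by the attention DP size. A minimal sketch (the function name and variables are illustrative, not SGLang internals):

```python
# Expected MoE input M dimension per rank under attention DP during decode.
# Illustrative helper, not an actual SGLang internal.
def expected_local_m(global_bs: int, attn_dp_size: int) -> int:
    assert global_bs % attn_dp_size == 0, "global batch must divide evenly across DP ranks"
    return global_bs // attn_dp_size

print(expected_local_m(4096, 8))  # 512, matching the M dim observed without cudagraph
```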
Without cudagraph: (profiler screenshot)
With cudagraph: (profiler screenshot)
### Reproduction
**Server:**
```shell
python3 -m sglang.launch_server \
  --model-path nvidia/DeepSeek-R1-0528-FP4 \
  --trust-remote-code \
  --quantization modelopt_fp4 \
  --dp-size 8 --enable-dp-attention --enable-dp-lm-head \
  --tp-size 8 \
  --attention-backend cutlass_mla \
  --enable-ep-moe \
  --enable-flashinfer-moe \
  --cuda-graph-bs 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 \
  --chunked-prefill-size 16384 \
  --mem-fraction-static 0.85 \
  --max-running-requests 4096 \
  --stream-interval 5
```
**Client:**
`benchmark_serving.py` with ISL/OSL 1024/1024 and concurrency 4096.
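A concrete client invocation might look like the following. The flag names here are an assumption based on common serving-benchmark scripts and may differ from the exact `benchmark_serving.py` version used; treat this as a sketch, not the exact command:

```shell
python3 benchmark_serving.py \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --max-concurrency 4096
```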
### Environment
Latest main.