Checklist
Describe the bug
Hi Team,
I am testing DeepSeek-R1 (dsr1) with DP attention and the cutlass_mla attention backend. Prefill works fine, but the server crashes as soon as it enters the decode phase.
The error is:
  File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/cutlass_mla_backend.py", line 274, in forward_decode
    o = cutlass_mla_decode(
  File "/usr/local/lib/python3.10/dist-packages/sgl_kernel/attention.py", line 95, in cutlass_mla_decode
    assert B_block_table == B_q
AssertionError
Can someone take a look at this? Below is the full error log.
sglang_node0.log
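For anyone triaging: going by the names in the traceback, the failing assertion appears to compare the batch dimension of the block table with the batch dimension of the query tensor. The sketch below only illustrates that kind of check. It is not the sgl_kernel code; the helper name `check_decode_batch` and the example shapes are hypothetical, and reading `B_q` and `B_block_table` as leading (batch) dimensions is my assumption.

```python
from typing import Optional, Sequence


def check_decode_batch(
    q_shape: Sequence[int], block_table_shape: Sequence[int]
) -> Optional[str]:
    """Hypothetical helper mirroring `assert B_block_table == B_q`.

    Assumes the leading dimension of each shape is the batch size.
    Returns None when the batch sizes agree, else a diagnostic message.
    """
    B_q = q_shape[0]
    B_block_table = block_table_shape[0]
    if B_block_table != B_q:
        return (
            f"batch mismatch: q has {B_q} rows, "
            f"block_table has {B_block_table} rows"
        )
    return None


# Illustrative shapes only, not taken from the actual run.
print(check_decode_batch((4, 16, 64), (4, 8)))
print(check_decode_batch((8, 16, 64), (4, 8)))
```

Logging the two shapes just before the `cutlass_mla_decode` call in `forward_decode` would show which side has the unexpected batch size under DP attention.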
Reproduction
python3 -m sglang.launch_server \
  --tokenizer-path nvidia/DeepSeek-R1-0528-FP4 \
  --trust-remote-code \
  --enable-dp-attention \
  --enable-dp-lm-head \
  --disable-radix-cache \
  --enable-flashinfer-cutlass-moe \
  --enable-ep-moe \
  --moe-dense-tp-size 1 \
  --max-running-requests 2048 \
  --chunked-prefill-size 16384 \
  --mem-fraction-static 0.85 \
  --disable-cuda-graph \
  --cuda-graph-bs 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 \
  --quantization modelopt_fp4 \
  --attention-backend cutlass_mla \
  --stream-interval 10 \
  --model-path=nvidia/DeepSeek-R1-0528-FP4 \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size=8 \
  --data-parallel-size=8
Environment
Latest main branch.