Describe the bug
When using --attention-backend flashinfer with piecewise CUDA graph enabled (default), the server crashes during replay with:
ValueError: q.shape[0] (8) does not match qo_indptr[-1] (6).
For paged prefill, q must have shape [total_tokens, num_heads, head_dim]
where total_tokens = qo_indptr[-1].
FlashInfer PR #2801 (merged 2026-03-23) added explicit shape validation in prefill.run() to catch what was previously a silent out-of-bounds read. The validation now raises ValueError when q.shape[0] != qo_indptr[-1].
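To illustrate the invariant the new check enforces, here is a minimal, hypothetical sketch in PyTorch (not FlashInfer's actual implementation; the head count, head dim, and tensor values are made up to mirror the numbers in the error). The assumed trigger is that piecewise CUDA graph replay pads q to the captured graph size (8) while qo_indptr still reflects the real token count (6):

import torch

# qo_indptr is the ragged-batch offset array: qo_indptr[-1] is the total
# number of real query tokens across all requests in the batch.
qo_indptr = torch.tensor([0, 2, 6], dtype=torch.int32)  # 2 requests, 6 real tokens

# Under piecewise CUDA graph replay, q is (assumed to be) padded to the
# captured graph size of 8 rows; shapes [tokens, num_heads, head_dim].
q = torch.randn(8, 32, 128)

total_tokens = int(qo_indptr[-1])  # 6
if q.shape[0] != total_tokens:     # 8 != 6 -> the ValueError seen above
    raise ValueError(
        f"q.shape[0] ({q.shape[0]}) does not match qo_indptr[-1] ({total_tokens}). "
        f"For paged prefill, q must have shape [total_tokens, num_heads, head_dim] "
        f"where total_tokens = qo_indptr[-1]."
    )

Before PR #2801, the same mismatch would read past the intended rows of q instead of raising.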
As a workaround for now, pass --disable-piecewise-cuda-graph (already suggested in SGLang's own error message).
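For example, assuming the same launch flags as the reproduction below:

python -m sglang.launch_server \
    --model-path Qwen/Qwen3-14B \
    --attention-backend flashinfer \
    --disable-piecewise-cuda-graph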
Reproduction
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-14B \
    --attention-backend flashinfer \
    --disable-cuda-graph
Environment
SGLang: latest main
FlashInfer: latest main
GPU: 4× NVIDIA B200 (SM100, Compute 10.0)
PyTorch: 2.9.1+cu128
Model: Qwen/Qwen3-14B, tp=1