[Bug] Piecewise CUDA graph replay crashes with FlashInfer ≥0.6.6: q.shape[0] does not match qo_indptr[-1] in paged prefill #21218

@yyihuang

Description

Checklist

  • I searched related issues but found no solution.
  • The bug persists in the latest version.
  • Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
  • If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
  • Please use English. Otherwise, it will be closed.

Describe the bug

When using --attention-backend flashinfer with piecewise CUDA graph enabled (default), the server crashes during replay with:

ValueError: q.shape[0] (8) does not match qo_indptr[-1] (6).
For paged prefill, q must have shape [total_tokens, num_heads, head_dim]
where total_tokens = qo_indptr[-1].

FlashInfer PR #2801 (merged 2026-03-23) added explicit shape validation in prefill.run() to catch what was previously a silent out-of-bounds read. The validation now raises ValueError when q.shape[0] != qo_indptr[-1].
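The invariant the new validation enforces can be illustrated with a minimal sketch (this is not FlashInfer's actual implementation): `qo_indptr` is a CSR-style offset array over the batch, so `qo_indptr[-1]` must equal the total number of query tokens packed into `q`. A CUDA-graph replay that pads `q` to a fixed size without updating `qo_indptr` trips the check, which matches the crash above.

```python
# Illustrative sketch of the shape invariant (not FlashInfer's real code).
import numpy as np

def check_prefill_shapes(q: np.ndarray, qo_indptr: np.ndarray) -> None:
    # For paged prefill, q is [total_tokens, num_heads, head_dim]
    # where total_tokens = qo_indptr[-1].
    total_tokens = int(qo_indptr[-1])
    if q.shape[0] != total_tokens:
        raise ValueError(
            f"q.shape[0] ({q.shape[0]}) does not match "
            f"qo_indptr[-1] ({total_tokens})."
        )

# Two sequences of lengths 2 and 4 -> qo_indptr = [0, 2, 6], 6 total tokens.
qo_indptr = np.array([0, 2, 6])
check_prefill_shapes(np.zeros((6, 8, 128)), qo_indptr)  # OK

# Padding q to 8 tokens without updating qo_indptr raises ValueError,
# mirroring the "(8) does not match (6)" error in this report.
try:
    check_prefill_shapes(np.zeros((8, 8, 128)), qo_indptr)
except ValueError:
    pass
```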

Workaround: pass --disable-piecewise-cuda-graph (the flag SGLang's own error message already suggests) until the shape mismatch is fixed.
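For reference, a launch command with the workaround applied might look like the following (a sketch; all flags other than the workaround are taken from the reproduction below):

```shell
# Workaround sketch: launch with piecewise CUDA graph disabled.
python -m sglang.launch_server \
  --model-path Qwen/Qwen3-14B \
  --attention-backend flashinfer \
  --disable-piecewise-cuda-graph
```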

Reproduction

python -m sglang.launch_server \
  --model-path Qwen/Qwen3-14B \
  --attention-backend flashinfer \
  --disable-cuda-graph

Environment

SGLang: latest main
FlashInfer: latest main
GPU: 4× NVIDIA B200 (SM100, Compute 10.0)
PyTorch: 2.9.1+cu128
Model: Qwen/Qwen3-14B, tp=1
