[Bug] Flashinfer MLA batch size mismatch #8774

@trevor-m

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

While running the GSM8K benchmark from the repro steps, the server crashed during decode with this error:

[2025-08-04 15:32:00 DP1 TP1] TpModelWorkerClient hit an exception: Traceback (most recent call last):
  File "/trevor/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 141, in forward_thread_func
    self.forward_thread_func_()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/trevor/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 176, in forward_thread_func_
    self.worker.forward_batch_generation(
  File "/trevor/sglang/python/sglang/srt/managers/tp_worker.py", line 238, in forward_batch_generation
    logits_output, can_run_cuda_graph = self.model_runner.forward(
  File "/trevor/sglang/python/sglang/srt/model_executor/model_runner.py", line 1625, in forward
    output = self._forward_raw(
  File "/trevor/sglang/python/sglang/srt/model_executor/model_runner.py", line 1664, in _forward_raw
    ret = self.forward_decode(
  File "/trevor/sglang/python/sglang/srt/model_executor/model_runner.py", line 1542, in forward_decode
    self.attn_backend.init_forward_metadata(forward_batch)
  File "/trevor/sglang/python/sglang/srt/layers/attention/flashinfer_mla_backend.py", line 151, in init_forward_metadata
    self.indices_updater_decode.update(
  File "/trevor/sglang/python/sglang/srt/layers/attention/flashinfer_mla_backend.py", line 531, in update
    self.call_begin_forward(
  File "/trevor/sglang/python/sglang/srt/layers/attention/flashinfer_mla_backend.py", line 560, in call_begin_forward
    kv_indptr[1 : bs + 1] = torch.cumsum(paged_kernel_lens, dim=0)
RuntimeError: The expanded size of the tensor (1024) must match the existing size (2603) at non-singleton dimension 0.  Target sizes: [1024].  Tensor sizes: [2603]
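
For context, the failing line writes the running sum of per-request KV lengths into a slice of a preallocated index-pointer buffer. The sketch below illustrates that invariant with plain Python lists instead of PyTorch tensors. `build_kv_indptr` is a hypothetical helper, not sglang code, and the 1025-entry buffer is an assumption inferred from `Target sizes: [1024]` in the error (it would match `--max-running-requests 1024`, but I have not confirmed how the buffer is allocated).

```python
from itertools import accumulate


def build_kv_indptr(kv_indptr, paged_kernel_lens):
    """Fill kv_indptr[1 : bs + 1] with the running sum of per-request KV lengths.

    Hypothetical helper: mirrors the failing assignment with plain lists and
    checks the size invariant explicitly before writing.
    """
    bs = len(paged_kernel_lens)
    # Slicing clamps silently: a 1025-entry buffer yields at most 1024 slots.
    slots = len(kv_indptr[1 : bs + 1])
    if slots != bs:
        raise ValueError(
            f"kv_indptr has room for {slots} requests, but batch has {bs}"
        )
    kv_indptr[1 : bs + 1] = list(accumulate(paged_kernel_lens))
    return kv_indptr


# Fits: 3 requests into a buffer sized for 4.
print(build_kv_indptr([0] * 5, [2, 3, 1]))  # [0, 2, 5, 6, 0]

# Assumed sizes from the log: buffer for 1024 requests, batch of 2603.
try:
    build_kv_indptr([0] * 1025, [1] * 2603)
except ValueError as err:
    print(err)
```

If this reading is right, the decode batch seen by this DP rank (2603) is larger than what the index buffer was sized for (1024), so the slice on the left-hand side is clamped while the cumulative sum on the right-hand side is not.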

Reproduction

Repro steps:

SGLANG_CUTLASS_MOE=True SGL_ENABLE_JIT_DEEPGEMM=False CUDA_LAUNCH_BLOCKING=1 python3 -m sglang.launch_server --tokenizer-path deepseek-ai/DeepSeek-R1-0528 --trust-remote-code --enable-dp-attention --disable-radix-cache --max-running-requests 1024 --chunked-prefill-size 32768 --mem-fraction-static 0.85 --cuda-graph-max-bs 1024 --max-prefill-tokens 32768 --attention-backend flashinfer --model-path=deepseek-ai/DeepSeek-R1-0528 --host 0.0.0.0 --port 8000 --tensor-parallel-size=8 --data-parallel-size=8 

python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319 --port=30000

Environment

Python: 3.10.12 (main, May 27 2025, 17:12:29) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA B200
GPU 0,1,2,3,4,5,6,7 Compute Capability: 10.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.8, V12.8.93
CUDA Driver Version: 580.65.01
PyTorch: 2.7.1+cu128
sglang: 0.4.10.post2
sgl_kernel: 0.3.0
flashinfer_python: 0.2.9rc2
triton: 3.3.1
transformers: 4.54.1
torchao: 0.9.0
numpy: 2.2.6
aiohttp: 3.12.15
fastapi: 0.116.1
hf_transfer: 0.1.9
huggingface_hub: 0.34.3
interegular: 0.3.3
modelscope: 1.28.1
orjson: 3.11.1
outlines: 0.1.11
packaging: 25.0
psutil: 7.0.0
pydantic: 2.11.7
python-multipart: 0.0.20
pyzmq: 27.0.1
uvicorn: 0.35.0
uvloop: 0.21.0
vllm: Module Not Found
xgrammar: 0.1.22
openai: 1.98.0
tiktoken: 0.9.0
anthropic: Module Not Found
litellm: Module Not Found
decord: Module Not Found
