Describe the bug

Today, I got this error:

[2025-08-04 15:32:00 DP1 TP1] TpModelWorkerClient hit an exception: Traceback (most recent call last):
File "/trevor/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 141, in forward_thread_func
self.forward_thread_func_()
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/trevor/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 176, in forward_thread_func_
self.worker.forward_batch_generation(
File "/trevor/sglang/python/sglang/srt/managers/tp_worker.py", line 238, in forward_batch_generation
logits_output, can_run_cuda_graph = self.model_runner.forward(
File "/trevor/sglang/python/sglang/srt/model_executor/model_runner.py", line 1625, in forward
output = self._forward_raw(
File "/trevor/sglang/python/sglang/srt/model_executor/model_runner.py", line 1664, in _forward_raw
ret = self.forward_decode(
File "/trevor/sglang/python/sglang/srt/model_executor/model_runner.py", line 1542, in forward_decode
self.attn_backend.init_forward_metadata(forward_batch)
File "/trevor/sglang/python/sglang/srt/layers/attention/flashinfer_mla_backend.py", line 151, in init_forward_metadata
self.indices_updater_decode.update(
File "/trevor/sglang/python/sglang/srt/layers/attention/flashinfer_mla_backend.py", line 531, in update
self.call_begin_forward(
File "/trevor/sglang/python/sglang/srt/layers/attention/flashinfer_mla_backend.py", line 560, in call_begin_forward
kv_indptr[1 : bs + 1] = torch.cumsum(paged_kernel_lens, dim=0)
RuntimeError: The expanded size of the tensor (1024) must match the existing size (2603) at non-singleton dimension 0. Target sizes: [1024]. Tensor sizes: [2603]
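
The failing line builds a CSR-style `kv_indptr` by writing the cumulative sum of the per-request KV lengths into a slice of length `bs`. The message indicates `paged_kernel_lens` carried 2603 entries while the destination slice only holds 1024 (which matches the `--cuda-graph-max-bs` / `--max-running-requests` values in the reproduction command below). A minimal standalone sketch of the same mismatch, with the sizes taken from the error message; the variable names mirror the traceback for readability, but this is illustrative code, not SGLang internals:

```python
import torch

# Sizes taken from the error message above. That the batch size and the
# number of per-request KV lengths diverged like this is an assumption
# about the bug, inferred from the target/tensor sizes in the message.
bs = 1024        # kv_indptr[1 : bs + 1] exposes 1024 slots
num_lens = 2603  # paged_kernel_lens unexpectedly has 2603 entries

kv_indptr = torch.zeros(bs + 1, dtype=torch.int32)
paged_kernel_lens = torch.ones(num_lens, dtype=torch.int32)

# Same pattern as flashinfer_mla_backend.py:560 -- a CSR-style indptr
# built from the cumulative sum of per-request KV lengths. Assigning a
# length-2603 result into a length-1024 slice raises:
#   RuntimeError: The expanded size of the tensor (1024) must match the
#   existing size (2603) at non-singleton dimension 0.
kv_indptr[1 : bs + 1] = torch.cumsum(paged_kernel_lens, dim=0)
```
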
Reproduction

Repro steps:

SGLANG_CUTLASS_MOE=True SGL_ENABLE_JIT_DEEPGEMM=False CUDA_LAUNCH_BLOCKING=1 python3 -m sglang.launch_server --tokenizer-path deepseek-ai/DeepSeek-R1-0528 --trust-remote-code --enable-dp-attention --disable-radix-cache --max-running-requests 1024 --chunked-prefill-size 32768 --mem-fraction-static 0.85 --cuda-graph-max-bs 1024 --max-prefill-tokens 32768 --attention-backend flashinfer --model-path=deepseek-ai/DeepSeek-R1-0528 --host 0.0.0.0 --port 8000 --tensor-parallel-size=8 --data-parallel-size=8
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319 --port=30000
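
As a quick sanity check before the full GSM8K run, a single request against SGLang's OpenAI-compatible endpoint confirms the server is up. A minimal sketch, assuming the server launched above is reachable on port 8000; note it does not by itself reproduce the crash, which occurred under the benchmark's high concurrency:

```python
import requests

# Single chat completion against the server launched above.
# The host/port are assumptions taken from the launch command
# (--host 0.0.0.0 --port 8000); adjust if your deployment differs.
resp = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    json={
        "model": "deepseek-ai/DeepSeek-R1-0528",
        "messages": [{"role": "user", "content": "What is 8 * 7?"}],
        "max_tokens": 32,
    },
    timeout=120,
)
print(resp.status_code)
print(resp.json())
```
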
Environment

Python: 3.10.12 (main, May 27 2025, 17:12:29) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA B200
GPU 0,1,2,3,4,5,6,7 Compute Capability: 10.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.8, V12.8.93
CUDA Driver Version: 580.65.01
PyTorch: 2.7.1+cu128
sglang: 0.4.10.post2
sgl_kernel: 0.3.0
flashinfer_python: 0.2.9rc2
triton: 3.3.1
transformers: 4.54.1
torchao: 0.9.0
numpy: 2.2.6
aiohttp: 3.12.15
fastapi: 0.116.1
hf_transfer: 0.1.9
huggingface_hub: 0.34.3
interegular: 0.3.3
modelscope: 1.28.1
orjson: 3.11.1
outlines: 0.1.11
packaging: 25.0
psutil: 7.0.0
pydantic: 2.11.7
python-multipart: 0.0.20
pyzmq: 27.0.1
uvicorn: 0.35.0
uvloop: 0.21.0
vllm: Module Not Found
xgrammar: 0.1.22
openai: 1.98.0
tiktoken: 0.9.0
anthropic: Module Not Found
litellm: Module Not Found
decord: Module Not Found