Checklist
Describe the bug
I tried to get multiple generation outputs from a single POST to the server, but I found that when input_embeds is used, generation fails and the error message is misleading.
The documentation states that input_embeds can be a List[List[List[float]]], but any 3D tensor causes a 400 response: "The engine initialized with skip_tokenizer_init=True cannot accept text prompts. Please provide input_ids or re-initialize the engine with skip_tokenizer_init=False." This happens even with shape (1, 1, dim), and the request contains no text prompts at all.
Setting n > 1 in sampling_params also causes a 400 response with the same error message.
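For reference, this is the kind of batched payload that the documented List[List[List[float]]] type suggests should be accepted (the endpoint, field names, and the 896 hidden size are taken from the reproduction below; the two-prompt batch itself is only a sketch of the intended usage, not something that currently works):

import torch
import requests

# Sketch of the batched request implied by the documented
# List[List[List[float]]] type: a batch of two prompts, each given as a
# (seq_len, hidden_dim) list of embedding vectors. hidden_dim=896 matches
# qwen2.5-0.5b-instruct; the sequence lengths here are arbitrary.
batched_embeds = [
    torch.zeros((4, 896)).tolist(),  # prompt 1: 4 embedding vectors
    torch.zeros((7, 896)).tolist(),  # prompt 2: 7 embedding vectors
]
payload = {
    "model": "qwen/qwen2.5-0.5b-instruct",
    "input_embeds": batched_embeds,
    "sampling_params": {"max_new_tokens": 100, "temperature": 0},
}
# Given the behavior described above, I would expect this to hit the same
# 400 "cannot accept text prompts" error rather than return two generations.
response = requests.post("http://localhost:30000/generate", json=payload)
print(response.status_code, response.json())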
Reproduction
sglang==0.4.9
a single L40 GPU
First, launch the server:
python -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --port 30000 --disable-radix --dtype bfloat16 --skip-tokenizer-init
Then run:
import torch
import requests
if __name__ == '__main__':
    # Case 1: 2D input_embeds (seq_len, dim) with n=1 -- works.
    payload = {
        "model": "qwen/qwen2.5-0.5b-instruct",
        "input_embeds": torch.zeros((1, 896)).tolist(),
        "sampling_params": {
            "max_new_tokens": 100,
            "temperature": 0,
            "n": 1,
        }
    }
    response = requests.post(
        "http://localhost:30000/generate",
        json=payload,
    )
    # expect: {'output_ids': [11, 20396, 128547, ...], ...}
    print(response.json())

    # Case 2: same 2D input_embeds but n=2 -- fails with a 400 and a
    # misleading "text prompts" error.
    payload = {
        "model": "qwen/qwen2.5-0.5b-instruct",
        "input_embeds": torch.zeros((1, 896)).tolist(),
        "sampling_params": {
            "max_new_tokens": 100,
            "temperature": 0,
            "n": 2,
        }
    }
    response = requests.post(
        "http://localhost:30000/generate",
        json=payload,
    )
    # expect: {'error': {'message': 'The engine initialized with skip_tokenizer_init=True cannot accept text prompts. Please provide input_ids or re-initialize the engine with skip_tokenizer_init=False.'}}
    print(response.json())

    # Case 3: 3D input_embeds (batch, seq_len, dim) with n=1 -- same 400 error.
    payload = {
        "model": "qwen/qwen2.5-0.5b-instruct",
        "input_embeds": torch.zeros((1, 1, 896)).tolist(),
        "sampling_params": {
            "max_new_tokens": 100,
            "temperature": 0,
            "n": 1,
        }
    }
    response = requests.post(
        "http://localhost:30000/generate",
        json=payload,
    )
    # expect: {'error': {'message': 'The engine initialized with skip_tokenizer_init=True cannot accept text prompts. Please provide input_ids or re-initialize the engine with skip_tokenizer_init=False.'}}
    print(response.json())
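As a client-side stopgap, the only way I have found to get multiple generations is to loop over single n=1 requests, which loses the benefit of batching. This is just a sketch that assumes the 2D, n=1 path keeps working; generate_n is a hypothetical helper, not part of sglang:

import torch
import requests

# Hypothetical fallback: since n > 1 and 3D input_embeds are rejected,
# issue one 2D-embedding request per desired sample.
def generate_n(embeds_2d, n, url="http://localhost:30000/generate"):
    outputs = []
    for _ in range(n):
        payload = {
            "model": "qwen/qwen2.5-0.5b-instruct",
            "input_embeds": embeds_2d,
            "sampling_params": {
                "max_new_tokens": 100,
                "temperature": 0.7,  # non-zero so repeated samples can differ
                "n": 1,
            },
        }
        outputs.append(requests.post(url, json=payload).json())
    return outputs

print(generate_n(torch.zeros((1, 896)).tolist(), n=2))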
Environment
Python: 3.11.13 (main, Jun 5 2025, 13:12:00) [GCC 11.2.0]
CUDA available: True
GPU 0: NVIDIA L40
GPU 0 Compute Capability: 8.9
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.3, V12.3.107
CUDA Driver Version: 550.144.03
PyTorch: 2.7.1+cu126
sglang: 0.4.9
sgl_kernel: 0.2.4
flashinfer_python: 0.2.7.post1
triton: 3.3.1
transformers: 4.53.0
torchao: 0.9.0
numpy: 2.2.6
aiohttp: 3.12.13
fastapi: 0.115.14
hf_transfer: 0.1.9
huggingface_hub: 0.33.2
interegular: 0.3.3
modelscope: 1.27.1
orjson: 3.10.18
outlines: 0.1.11
packaging: 25.0
psutil: 7.0.0
pydantic: 2.11.7
python-multipart: 0.0.20
pyzmq: 27.0.0
uvicorn: 0.35.0
uvloop: 0.21.0
vllm: Module Not Found
xgrammar: 0.1.19
openai: 1.93.0
tiktoken: 0.9.0
anthropic: 0.57.1
litellm: 1.74.0
decord: 0.6.0
NVIDIA Topology:
GPU0 NIC0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X SYS 32-63,96-127 1 N/A
NIC0 SYS X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_15
ulimit soft: 1000000