Your current environment
The output of python collect_env.py
Collecting environment information...
==============================
System Info
==============================
OS : * (Final) (x86_64)
GCC version : (GCC) 9.4.0
Clang version : 18.1.8 (Red Hat 18.1.8-1.module+el8.10.0+703+ec7b33ba)
CMake version : version 4.1.0
Libc version : glibc-2.28
==============================
PyTorch Info
==============================
PyTorch version : 2.8.0+cu128
Is debug build : False
CUDA used to build PyTorch : 12.8
ROCM used to build PyTorch : N/A
==============================
Python Environment
==============================
Python version : 3.12.9 | packaged by Anaconda, Inc. | (main, Feb 6 2025, 18:56:27) [GCC 11.2.0] (64-bit runtime)
Python platform : Linux-5.4.119-19.0009.56-x86_64-with-glibc2.28
==============================
CUDA / GPU Info
==============================
Is CUDA available : True
CUDA runtime version : Could not collect
CUDA_MODULE_LOADING set to : LAZY
GPU models and configuration :
GPU 0: NVIDIA H20
GPU 1: NVIDIA H20
GPU 2: NVIDIA H20
GPU 3: NVIDIA H20
GPU 4: NVIDIA H20
GPU 5: NVIDIA H20
GPU 6: NVIDIA H20
GPU 7: NVIDIA H20
Nvidia driver version : 570.158.01
cuDNN version : Probably one of the following:
/usr/local/cuda-12.8/targets/x86_64-linux/lib/libcudnn.so.9
HIP runtime version : N/A
MIOpen runtime version : N/A
Is XNNPACK available : True
==============================
CPU Info
==============================
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 384
On-line CPU(s) list: 0-383
Thread(s) per core: 2
Core(s) per socket: 96
Socket(s): 2
NUMA node(s): 2
Vendor ID: AuthenticAMD
CPU family: 25
Model: 17
Model name: AMD EPYC 9K84 96-Core Processor
Stepping: 1
CPU MHz: 3687.441
CPU max MHz: 2600.0000
CPU min MHz: 1500.0000
BogoMIPS: 5200.42
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 32768K
NUMA node0 CPU(s): 0-95,192-287
NUMA node1 CPU(s): 96-191,288-383
==============================
Versions of relevant libraries
==============================
[pip3] numpy==2.2.6
[pip3] torch==2.8.0
[pip3] transformers==4.57.0.dev0
[pip3] triton==3.4.0
==============================
vLLM Info
==============================
ROCM Version : Could not collect
vLLM Version : 0.10.2 (also tested 0.11.0)
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled
==============================
Environment Variables
==============================
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY
🐛 Describe the bug
When running inference with the Qwen3-Next-80B-A3B-Instruct model on the vLLM V1 engine, a TypeError is raised during token generation:
TypeError: argument 'id': StreamInput must be either an integer or a list of integers
Critical constraint: the Qwen3-Next model requires the V1 engine (vLLM asserts with AssertionError: Qwen3Next requires VLLM_USE_V1), so the V0 engine cannot be used as a workaround.
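For reference, a minimal reproduction sketch (the model name is from this report; tensor_parallel_size is an assumption based on the 8x H20 GPUs listed above, not a verified configuration):

# Hedged repro sketch: any generate() call against this model on the V1
# engine reaches the detokenizer path that fails below.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",
    tensor_parallel_size=8,  # assumption: shard the 80B model across the 8 H20 GPUs above
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)  # not reached; the TypeError is raised during detokenization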
Error Location: /vllm/v1/engine/detokenizer.py, Line 237
Full Stack Trace:
Traceback (most recent call last):
File "/vllm/v1/engine/output_processor.py", line 420, in process_outputs
stop_string = req_state.detokenizer.update(
File "/vllm/v1/engine/detokenizer.py", line 119, in update
self.output_text += self.decode_next(new_token_id)
File "/vllm/v1/engine/detokenizer.py", line 219, in decode_next
token = self._protected_step(next_token_id)
File "/vllm/v1/engine/detokenizer.py", line 237, in _protected_step
token = self.stream.step(self.tokenizer, next_token_id)
TypeError: argument 'id': StreamInput must be either an integer or a list of integers
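The names in the traceback suggest that stream is a tokenizers.decoders.DecodeStream, so the failing call can be exercised outside of vLLM. A sketch under that assumption (not a confirmed reduction):

# Isolation sketch: drive DecodeStream.step() directly, assuming vLLM's
# detokenizer wraps tokenizers.decoders.DecodeStream as the names suggest.
from tokenizers import Tokenizer
from tokenizers.decoders import DecodeStream

tok = Tokenizer.from_pretrained("Qwen/Qwen3-Next-80B-A3B-Instruct")
stream = DecodeStream(skip_special_tokens=True)
for token_id in tok.encode("Hello world").ids:
    piece = stream.step(tok, token_id)  # the same call that raises in _protected_step
    if piece is not None:
        print(piece, end="")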
Investigation & Debug Findings
Debug Output: logging was added to check the actual runtime type of next_token_id:
print(f"Type: {type(next_token_id)}, isinstance(int): {isinstance(next_token_id, int)}")
# Output: Type: <class 'int'>, isinstance(int): True
Puzzling finding: the value is already a native Python int, yet stream.step() still rejects it with the same TypeError.
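One further check worth adding (hypothetical diagnostics, not part of the original logging): isinstance() also accepts int subclasses, so logging the exact class would rule out a subclass masquerading as a plain int:

# Hypothetical extra diagnostics: isinstance(x, int) is True for any int
# subclass as well, so record the concrete type and its MRO to rule that out.
print(f"exact type: {type(next_token_id)!r}")
print(f"type(...) is int: {type(next_token_id) is int}")
print(f"MRO: {type(next_token_id).__mro__}")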
Attempted Fixes (All Failed):
- Type conversion with .item():

  if hasattr(next_token_id, 'item'):
      next_token_id = int(next_token_id.item())

- Explicit int() conversion:

  next_token_id = int(next_token_id)

- Using operator.index():

  import operator
  next_token_id = operator.index(next_token_id)

All attempts failed with the same error.
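Since every Python-side coercion fails, the error likely originates in the compiled tokenizers extension rather than in Python-level typing, which points at a possible mismatch between the installed tokenizers wheel and what this vLLM build expects. A quick check (reporting only; no known-good version pin is implied):

# Sketch: record the exact wheels in play, since the TypeError is raised by
# the compiled tokenizers binding and not by Python code.
import tokenizers
import transformers
import vllm

print("vllm        :", vllm.__version__)
print("tokenizers  :", tokenizers.__version__)
print("transformers:", transformers.__version__)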
Related Issues
Possibly related to: