[Bug]: [CPU Backend] Engine crashed due to error on flashinfer op registration #32840

@fadara01

Your current environment

The output of python collect_env.py
Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (aarch64)
GCC version                  : (Ubuntu 12.3.0-1ubuntu1~22.04.2) 12.3.0
Clang version                : 16.0.6 (++20231112100510+7cbf1a259152-1~exp1~20231112100554.106)
CMake version                : version 4.2.1
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.9.1+cpu
Is debug build               : False
CUDA used to build PyTorch   : None
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.10.12 (main, Jan  8 2026, 06:52:19) [GCC 11.4.0] (64-bit runtime)
Python platform              : Linux-6.8.0-1044-aws-aarch64-with-glibc2.35

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : False
CUDA runtime version         : No CUDA
CUDA_MODULE_LOADING set to   : N/A
GPU models and configuration : No CUDA
Nvidia driver version        : No CUDA
cuDNN version                : No CUDA
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                         aarch64
CPU op-mode(s):                       64-bit
Byte Order:                           Little Endian
CPU(s):                               192
On-line CPU(s) list:                  0-191
Vendor ID:                            ARM
Model name:                           Neoverse-V2
Model:                                1
Thread(s) per core:                   1
Core(s) per socket:                   96
Socket(s):                            2
Stepping:                             r0p1
BogoMIPS:                             2000.00
Flags:                                fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                            12 MiB (192 instances)
L1i cache:                            12 MiB (192 instances)
L2 cache:                             384 MiB (192 instances)
L3 cache:                             72 MiB (2 instances)
NUMA node(s):                         2
NUMA node0 CPU(s):                    0-95
NUMA node1 CPU(s):                    96-191
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; __user pointer sanitization
Vulnerability Spectre v2:             Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected
Vulnerability Vmscape:                Not affected

==============================
Versions of relevant libraries
==============================
[pip3] numpy==2.2.6
[pip3] pyzmq==27.1.0
[pip3] torch==2.9.1
[pip3] torchaudio==2.9.1
[pip3] torchvision==0.24.1
[pip3] transformers==4.57.6
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.14.0rc2.dev234+g8ebf271bb (git sha: 8ebf271bb)
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
  Could not collect

==============================
     Environment Variables
==============================
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1

🐛 Describe the bug

Build vLLM on a CPU-only machine, e.g. VLLM_TARGET_DEVICE=cpu python3 setup.py bdist_wheel
Then run vllm bench throughput and you'll hit this error:

(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935] EngineCore failed to start.
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935] Traceback (most recent call last):
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935]   File "/home/fadara01/vllm-torch-update/venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 926, in run_engine_core
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935]   File "/home/fadara01/vllm-torch-update/venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 691, in __init__
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935]     super().__init__(
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935]   File "/home/fadara01/vllm-torch-update/venv/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 105, in __init__
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935]     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935]   File "/home/fadara01/vllm-torch-update/venv/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935]     self._init_executor()
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935]   File "/home/fadara01/vllm-torch-update/venv/lib/python3.10/site-packages/vllm/v1/executor/uniproc_executor.py", line 46, in _init_executor
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935]     self.driver_worker.init_worker(all_kwargs=[kwargs])
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935]   File "/home/fadara01/vllm-torch-update/venv/lib/python3.10/site-packages/vllm/v1/worker/worker_base.py", line 252, in init_worker
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935]     worker_class: type[WorkerBase] = resolve_obj_by_qualname(
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935]   File "/home/fadara01/vllm-torch-update/venv/lib/python3.10/site-packages/vllm/utils/import_utils.py", line 111, in resolve_obj_by_qualname
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935]     module = importlib.import_module(module_name)
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935]   File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935]     return _bootstrap._gcd_import(name[level:], package, level)
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935]   File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935]   File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935]   File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935]   File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935]   File "<frozen importlib._bootstrap_external>", line 883, in exec_module
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935]   File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935]   File "/home/fadara01/vllm-torch-update/venv/lib/python3.10/site-packages/vllm/v1/worker/cpu_worker.py", line 18, in <module>
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935]     from vllm.v1.worker.gpu_worker import Worker, init_worker_distributed_environment
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935]   File "/home/fadara01/vllm-torch-update/venv/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 38, in <module>
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935]     from vllm.model_executor.warmup.kernel_warmup import kernel_warmup
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935]   File "/home/fadara01/vllm-torch-update/venv/lib/python3.10/site-packages/vllm/model_executor/warmup/kernel_warmup.py", line 15, in <module>
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935]     from vllm.model_executor.warmup.deep_gemm_warmup import deep_gemm_warmup
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935]   File "/home/fadara01/vllm-torch-update/venv/lib/python3.10/site-packages/vllm/model_executor/warmup/deep_gemm_warmup.py", line 22, in <module>
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935]     from vllm.model_executor.layers.quantization.fp8 import Fp8LinearMethod
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935]   File "/home/fadara01/vllm-torch-update/venv/lib/python3.10/site-packages/vllm/model_executor/layers/quantization/fp8.py", line 33, in <module>
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935]     from vllm.model_executor.layers.fused_moe.oracle.fp8 import (
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935]   File "/home/fadara01/vllm-torch-update/venv/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/oracle/fp8.py", line 17, in <module>
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935]     from vllm.model_executor.layers.fused_moe.flashinfer_trtllm_moe import (
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935]   File "/home/fadara01/vllm-torch-update/venv/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/flashinfer_trtllm_moe.py", line 195, in <module>
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935]     direct_register_custom_op(
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935]   File "/home/fadara01/vllm-torch-update/venv/lib/python3.10/site-packages/vllm/utils/torch_utils.py", line 753, in direct_register_custom_op
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935]     my_lib.define(op_name + schema_str, tags=tags)
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935]   File "/home/fadara01/vllm-torch-update/venv/lib/python3.10/site-packages/torch/library.py", line 172, in define
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935]     result = self.m.define(schema, alias_analysis, tuple(tags))
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935] RuntimeError: 
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935] invalid numeric default value:
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935] flashinfer_fused_moe_blockscale_fp8(Tensor routing_logits, Tensor routing_bias, Tensor x, Tensor w13_weight, Tensor w13_weight_scale_inv, Tensor w2_weight, Tensor w2_weight_scale_inv, SymInt global_num_experts, SymInt top_k, SymInt? num_expert_group, SymInt? topk_group, SymInt intermediate_size, SymInt expert_offset, SymInt local_num_experts, SymInt[] block_shape, SymInt routing_method_type=RoutingMethodType.DeepSeekV3, float? routed_scaling=1.0) -> Tensor
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935]                                                                                                                                                                                                                                                                                                                                                                                                                            ~ <--- HERE
(EngineCore_DP0 pid=29197) ERROR 01-22 09:14:05 [core.py:935] 
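
The schema in the error shows the default for routing_method_type rendered as RoutingMethodType.DeepSeekV3, i.e. an enum member's string form where the schema parser expects a numeric literal. The sketch below reproduces that kind of mismatch with the standard library only. It is not vLLM or PyTorch code: the RoutingMethodType class here is a hypothetical stand-in with an illustrative member value, and schema_default / numeric_schema_default are made-up helper names. It only illustrates how a str()-based default ends up non-numeric and how unwrapping the enum first avoids that.

```python
import enum
import re


class RoutingMethodType(enum.Enum):
    # Hypothetical stand-in for the real enum; the value 2 is illustrative only.
    DeepSeekV3 = 2


def schema_default(value):
    """Render a default the way a naive str()-based schema builder would."""
    return str(value)


def numeric_schema_default(value):
    """Unwrap enum members to their underlying value before rendering."""
    if isinstance(value, enum.Enum):
        value = value.value
    return str(value)


# A simple check for "looks like a numeric literal" (assumption: ints/floats only).
NUMERIC = re.compile(r"^-?\d+(\.\d+)?$")

member = RoutingMethodType.DeepSeekV3
broken = f"SymInt routing_method_type={schema_default(member)}"
fixed = f"SymInt routing_method_type={numeric_schema_default(member)}"

print(broken)  # default is the enum's name, not a number
print(fixed)   # default is the enum's underlying value
print(bool(NUMERIC.match(schema_default(member))))
print(bool(NUMERIC.match(numeric_schema_default(member))))
```

Whether the actual fix belongs in the op's signature, in the schema inference, or in skipping this registration on CPU builds is not something the traceback alone settles.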


Labels: bug (Something isn't working), cpu (Related to CPU backends)
Status: Done