(APIServer pid=1554) INFO: 172.16.33.39:51958 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1554) INFO: 172.16.33.39:51958 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1554) INFO: 172.16.93.20:37418 - "GET /health HTTP/1.1" 200 OK
(APIServer pid=1554) INFO 11-11 23:20:12 [logger.py:37] Request cmpl-2f596feab8044210b24f979c15ed186d-0 details: prompt: '안녕?', prompt_token_ids: [0, 31404, 11939, 246, 33], prompt_embeds shape: None.
(APIServer pid=1554) INFO 11-11 23:20:12 [logger.py:47] Received request cmpl-2f596feab8044210b24f979c15ed186d-0: params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=0.95, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=5, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, structured_outputs=None, extra_args=None), lora_request: None.
(APIServer pid=1554) INFO 11-11 23:20:12 [async_llm.py:343] Added request cmpl-2f596feab8044210b24f979c15ed186d-0.
(APIServer pid=1554) INFO: 127.0.0.1:35728 - "POST /v1/completions HTTP/1.1" 200 OK
(APIServer pid=1554) INFO 11-11 23:20:14 [loggers.py:221] Engine 000: Avg prompt throughput: 1.0 tokens/s, Avg generation throughput: 1.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1554) INFO: 172.16.33.39:51958 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1554) INFO 11-11 23:20:16 [logger.py:37] Request chatcmpl-2085e3182f394a2a8554f00028f4143d details: prompt: '\n<|begin▁of▁sentence|>You are a helpful assistant.<|User|>JSON data:\n{"203c8d7a-22ef-47d6-a453-63141a73bdc3": "0ff12ff1-508e-4b1b-a252-412e64d36bf3", ..., "e24301db-d210-4abf-88e7-7301223bfcac": "c6e22ba7-775c-4b53-a811-50ea7c7751d2", "ee2790ec-ded8-421d-b7e8-bd963ee7ff34": "04bc055a-8a47-4bd8-8eb6-304df99df5fa", "04b15c99-70fd-479a-9f64-8e77d4f1423e": "9b2d546b-022c-400a-aa64-50d110571e59"}\nQ: \nKey: "e8fa7242-5c2e-46f9-ad70-1d4aaf037243"\nThe value associated with the specified key is: <|Assistant|>', prompt_token_ids: None, prompt_embeds shape: None.
(APIServer pid=1554) INFO 11-11 23:20:16 [logger.py:47] Received request chatcmpl-2085e3182f394a2a8554f00028f4143d: params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=0.95, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=7760, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, structured_outputs=None, extra_args=None), lora_request: None.
(APIServer pid=1554) INFO: 127.0.0.1:35732 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1554) INFO 11-11 23:20:16 [async_llm.py:343] Added request chatcmpl-2085e3182f394a2a8554f00028f4143d.
(APIServer pid=1554) INFO 11-11 23:20:19 [loggers.py:221] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 0.0%
(APIServer pid=1554) INFO: 172.16.33.39:51958 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1554) INFO: 172.16.33.39:51958 - "GET /metrics HTTP/1.1" 200 OK
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.11.1rc7.dev20+g9973e6e04) with config: model='/mnt/models', speculative_config=None, tokenizer='/mnt/models', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=16, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint='grpc://localhost:4317', collect_detailed_traces=None), seed=0, served_model_name=deepseek-ai/DeepSeek-V3, enable_prefix_caching=True, chunked_prefill_enabled=True, pooler_config=None, compilation_config={'level': None, 'mode': 3, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none', '+quant_fp8', '+quant_fp8'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': True, 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.PIECEWISE: 1>, 'use_cudagraph': True, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64], 'cudagraph_copy_inputs': False, 'full_cuda_graph': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 64, 'local_cache_dir': None},
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=['chatcmpl-2085e3182f394a2a8554f00028f4143d'], resumed_req_ids=[], new_token_ids=[], all_token_ids={}, new_block_ids=[[[241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256]]], num_computed_tokens=[81920], num_output_tokens=[0]), num_scheduled_tokens={chatcmpl-2085e3182f394a2a8554f00028f4143d: 16384}, total_num_scheduled_tokens=16384, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[96], finished_req_ids=[], free_encoder_mm_hashes=[], pending_structured_output_tokens=false, kv_connector_metadata=null)
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.014667685255920548, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={})
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857] Traceback (most recent call last):
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 848, in run_engine_core
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]     engine_core.run_busy_loop()
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 875, in run_busy_loop
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]     self._process_engine_step()
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 904, in _process_engine_step
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]     outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 334, in step
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]     model_output = self.model_executor.sample_tokens(grammar_output)
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/ray_executor.py", line 434, in sample_tokens
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]     return refs[0].get()
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]            ^^^^^^^^^^^^^
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]   File "/usr/local/lib/python3.12/dist-packages/ray/experimental/compiled_dag_ref.py", line 150, in get
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]     return _process_return_vals(return_vals, True)
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]   File "/usr/local/lib/python3.12/dist-packages/ray/experimental/compiled_dag_ref.py", line 27, in _process_return_vals
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]     raise val.as_instanceof_cause()
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857] ray.exceptions.RayTaskError(AssertionError): ray::RayWorkerWrapper.__ray_call__() (pid=2071, ip=172.16.93.47)
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/ray_utils.py", line 103, in execute_model_ray
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]     output = self.worker.model_runner.execute_model(
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]     return func(*args, **kwargs)
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2653, in execute_model
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]     model_output = self._model_forward(
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]                    ^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2511, in _model_forward
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]     return self.model(
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]            ^^^^^^^^^^^
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 1402, in forward
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]     hidden_states = self.model(
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]                     ^^^^^^^^^^^
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 477, in __call__
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]     model_output = self.forward(*args, **kwargs)
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 1242, in forward
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]     def forward(
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]     return fn(*args, **kwargs)
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/caching.py", line 53, in __call__
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]     return self.optimized_call(*args, **kwargs)
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 837, in call_wrapped
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]     return self._wrapped_call(self, *args, **kwargs)
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 413, in __call__
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]     raise e
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 400, in __call__
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]   File "<eval_with_key>.139", line 638, in forward
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]     submod_1 = self.submod_1(getitem, s72, getitem_1, getitem_2, getitem_3); getitem = getitem_1 = getitem_2 = submod_1 = None
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 837, in call_wrapped
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]     return self._wrapped_call(self, *args, **kwargs)
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 413, in __call__
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]     raise e
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 400, in __call__
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]   File "<eval_with_key>.17", line 5, in forward
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]     unified_mla_attention_with_output = torch.ops.vllm.unified_mla_attention_with_output(q, x_11, key_rot_1, output_3, 'model.layers.0.self_attn.attn'); q = x_11 = key_rot_1 = output_3 = unified_mla_attention_with_output = None
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1255, in __call__
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]     return self._op(*args, **kwargs)
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]   File "/usr/local/lib/python3.12/dist-packages/vllm/attention/layer.py", line 1045, in unified_mla_attention_with_output
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]     self.impl.forward(
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/mla/common.py", line 1930, in forward
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]     output[num_decode_tokens:] = self._forward_prefill(
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]                                  ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/mla/common.py", line 1809, in _forward_prefill
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]     self._context_parallel_compute_prefill_context(
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/mla/common.py", line 1731, in _context_parallel_compute_prefill_context
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]     kv_c_normed, k_pe = reorg_kvcache(
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]                         ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/mla/common.py", line 1045, in reorg_kvcache
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]     assert reorganized_kv_c_normed.shape[0] == sum_seq_len
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1880) ERROR 11-11 23:20:26 [core.py:857] AssertionError
(EngineCore_DP0 pid=1880) INFO 11-11 23:20:26 [ray_executor.py:116] Shutting down Ray distributed executor. If you see error log from logging.cc regarding SIGTERM received, please ignore because this is the expected termination process in Ray.
(EngineCore_DP0 pid=1880) 2025-11-11 23:20:26,093 INFO compiled_dag_node.py:2171 -- Tearing down compiled DAG
(APIServer pid=1554) ERROR 11-11 23:20:26 [async_llm.py:524] AsyncLLM output_handler failed.
(APIServer pid=1554) ERROR 11-11 23:20:26 [async_llm.py:524] Traceback (most recent call last):
(APIServer pid=1554) ERROR 11-11 23:20:26 [async_llm.py:524]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 476, in output_handler
(APIServer pid=1554) ERROR 11-11 23:20:26 [async_llm.py:524]     outputs = await engine_core.get_output_async()
(APIServer pid=1554) ERROR 11-11 23:20:26 [async_llm.py:524]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1554) ERROR 11-11 23:20:26 [async_llm.py:524]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 883, in get_output_async
(APIServer pid=1554) ERROR 11-11 23:20:26 [async_llm.py:524]     raise self._format_exception(outputs) from None
(APIServer pid=1554) ERROR 11-11 23:20:26 [async_llm.py:524] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=1554) INFO 11-11 23:20:26 [async_llm.py:442] Request chatcmpl-2085e3182f394a2a8554f00028f4143d failed (engine dead).
Your current environment

The output of python collect_env.py

🐛 Describe the bug
VLLM_MOE_DEEP_GEMM=0 vllm serve /mnt/models --served-model-name deepseek-ai/DeepSeek-V3 -tp 16 -dcp 16 --enable-expert-parallel --all2all-backend deepep_low_latency --distributed-executor-backend ray --async-scheduling --max-model-len 128K --max-num-batched-tokens 16K --max-num-seqs 32

Error traceback: see the engine log at the top of this report.
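For reference, a request of the same shape as the failing one can be generated with the snippet below. This is a minimal sketch against the OpenAI-compatible endpoint; the base URL, API key, and pair count are assumptions reconstructed from the logged prompt (a large JSON mapping of UUID pairs, long enough to span several 16K-token chunked-prefill steps), not the original client script.

```python
# Hypothetical reproducer, reconstructed from the request logged above.
import json
import uuid

from openai import OpenAI

# Assumed endpoint; adjust to wherever vllm serve is listening.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# The logged prompt was a JSON object mapping random UUIDs to random UUIDs.
# ~2,500 pairs is a guess aimed at a prompt far beyond the 16K-token
# --max-num-batched-tokens budget, forcing multiple prefill chunks.
pairs = {str(uuid.uuid4()): str(uuid.uuid4()) for _ in range(2500)}
key = next(iter(pairs))

prompt = (
    f"JSON data:\n{json.dumps(pairs)}\n"
    f'Q: \nKey: "{key}"\n'
    "The value associated with the specified key is: "
)

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt},
    ],
    temperature=0.6,
    top_p=0.95,
)
print(resp.choices[0].message.content)
```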
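Where the assertion fires: reorg_kvcache in vllm/v1/attention/backends/mla/common.py reassembles the KV cache gathered across the context-parallel (-dcp 16) ranks before running prefill attention over the cached context, and it asserts that the number of gathered KV rows equals the sum of the scheduled context lengths. The sketch below illustrates only that invariant; the function signature and tensor layout are simplified assumptions, not the real implementation.

```python
# Simplified sketch of the invariant that trips in reorg_kvcache.
# All names here are illustrative.
import torch


def reorg_kvcache_sketch(
    kv_chunks: list[torch.Tensor],  # per-request KV slices gathered from the DCP ranks
    context_lens: list[int],        # the scheduler's view of each request's cached context
) -> torch.Tensor:
    """Reassemble gathered KV-cache slices into one contiguous tensor."""
    reorganized = torch.cat(kv_chunks, dim=0)
    sum_seq_len = sum(context_lens)
    # This is the check that fails in the traceback: the rows actually
    # gathered no longer match the scheduler's bookkeeping for this chunk.
    assert reorganized.shape[0] == sum_seq_len, (
        f"gathered {reorganized.shape[0]} KV rows, expected {sum_seq_len}"
    )
    return reorganized
```

If that reading is right, the mismatch only appears once a prompt needs multiple prefill chunks under -dcp 16: the short 5-token completion request succeeds, and the engine dies on the chunked-prefill step that starts at 81,920 computed tokens (the scheduler output dumped above schedules 16,384 more).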