tests/v1/e2e/spec_decode/test_lora_with_spec_decode.py::test_batch_inference_correctness[model_setup0]
Capturing CUDA graphs (PIECEWISE): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 13.36it/s]
Capturing CUDA graphs (FULL): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 16.54it/s]
(EngineCore pid=3374923) Process EngineCore:
(EngineCore pid=3374923) Traceback (most recent call last):
(EngineCore pid=3374923) File "/usr/lib64/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=3374923) self.run()
(EngineCore pid=3374923) File "/usr/lib64/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore pid=3374923) self._target(*self._args, **self._kwargs)
(EngineCore pid=3374923) File "/home/sagemoore/git/nm-vllm/vllm/v1/engine/core.py", line 1163, in run_engine_core
(EngineCore pid=3374923) raise e
(EngineCore pid=3374923) File "/home/sagemoore/git/nm-vllm/vllm/v1/engine/core.py", line 1133, in run_engine_core
(EngineCore pid=3374923) engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=3374923) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923) File "/home/sagemoore/git/nm-vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=3374923) return func(*args, **kwargs)
(EngineCore pid=3374923) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923) File "/home/sagemoore/git/nm-vllm/vllm/v1/engine/core.py", line 899, in __init__
(EngineCore pid=3374923) super().__init__(
(EngineCore pid=3374923) File "/home/sagemoore/git/nm-vllm/vllm/v1/engine/core.py", line 128, in __init__
(EngineCore pid=3374923) kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=3374923) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923) File "/home/sagemoore/git/nm-vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=3374923) return func(*args, **kwargs)
(EngineCore pid=3374923) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923) File "/home/sagemoore/git/nm-vllm/vllm/v1/engine/core.py", line 283, in _initialize_kv_caches
(EngineCore pid=3374923) self.model_executor.initialize_from_config(kv_cache_configs)
(EngineCore pid=3374923) File "/home/sagemoore/git/nm-vllm/vllm/v1/executor/abstract.py", line 124, in initialize_from_config
(EngineCore pid=3374923) compilation_times: list[CompilationTimes] = self.collective_rpc(
(EngineCore pid=3374923) ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923) File "/home/sagemoore/git/nm-vllm/vllm/v1/executor/uniproc_executor.py", line 93, in collective_rpc
(EngineCore pid=3374923) result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=3374923) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923) File "/home/sagemoore/git/nm-vllm/vllm/v1/serial_utils.py", line 510, in run_method
(EngineCore pid=3374923) return func(*args, **kwargs)
(EngineCore pid=3374923) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923) File "/home/sagemoore/git/nm-vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=3374923) return func(*args, **kwargs)
(EngineCore pid=3374923) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923) File "/home/sagemoore/git/nm-vllm/vllm/v1/worker/gpu_worker.py", line 689, in compile_or_warm_up_model
(EngineCore pid=3374923) warmup_kernels(self.model_runner, self.execute_model, self.sample_tokens)
(EngineCore pid=3374923) File "/home/sagemoore/git/nm-vllm/.venv/lib64/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=3374923) return func(*args, **kwargs)
(EngineCore pid=3374923) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923) File "/home/sagemoore/git/nm-vllm/vllm/v1/worker/gpu/warmup.py", line 100, in warmup_kernels
(EngineCore pid=3374923) worker_execute_model(prefill_output)
(EngineCore pid=3374923) File "/home/sagemoore/git/nm-vllm/.venv/lib64/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=3374923) return func(*args, **kwargs)
(EngineCore pid=3374923) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923) File "/home/sagemoore/git/nm-vllm/vllm/v1/worker/gpu_worker.py", line 843, in execute_model
(EngineCore pid=3374923) output = self.model_runner.execute_model(
(EngineCore pid=3374923) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923) File "/home/sagemoore/git/nm-vllm/.venv/lib64/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=3374923) return func(*args, **kwargs)
(EngineCore pid=3374923) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923) File "/home/sagemoore/git/nm-vllm/vllm/v1/worker/gpu/model_runner.py", line 1167, in execute_model
(EngineCore pid=3374923) model_output = self.model(**model_inputs)
(EngineCore pid=3374923) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923) File "/home/sagemoore/git/nm-vllm/.venv/lib64/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
(EngineCore pid=3374923) return self._call_impl(*args, **kwargs)
(EngineCore pid=3374923) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923) File "/home/sagemoore/git/nm-vllm/.venv/lib64/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
(EngineCore pid=3374923) return forward_call(*args, **kwargs)
(EngineCore pid=3374923) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923) File "/home/sagemoore/git/nm-vllm/vllm/model_executor/models/qwen3.py", line 323, in forward
(EngineCore pid=3374923) hidden_states = self.model(
(EngineCore pid=3374923) ^^^^^^^^^^^
(EngineCore pid=3374923) File "/home/sagemoore/git/nm-vllm/vllm/compilation/decorators.py", line 520, in __call__
(EngineCore pid=3374923) return self.aot_compiled_fn(self, *args, **kwargs)
(EngineCore pid=3374923) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923) File "/home/sagemoore/git/nm-vllm/.venv/lib64/python3.12/site-packages/torch/_dynamo/aot_compile.py", line 224, in __call__
(EngineCore pid=3374923) return self.fn(*args, **kwargs)
(EngineCore pid=3374923) ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923) File "/home/sagemoore/git/nm-vllm/vllm/model_executor/models/qwen2.py", line 389, in forward
(EngineCore pid=3374923) def forward(
(EngineCore pid=3374923) File "/home/sagemoore/git/nm-vllm/vllm/compilation/caching.py", line 217, in __call__
(EngineCore pid=3374923) return self.optimized_call(*args, **kwargs)
(EngineCore pid=3374923) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923) File "<string>", line 145, in execution_fn
(EngineCore pid=3374923) File "/home/sagemoore/git/nm-vllm/vllm/compilation/cuda_graph.py", line 313, in __call__
(EngineCore pid=3374923) with torch.cuda.graph(
(EngineCore pid=3374923) ^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923) File "/home/sagemoore/git/nm-vllm/.venv/lib64/python3.12/site-packages/torch/cuda/graphs.py", line 257, in __enter__
(EngineCore pid=3374923) self.cuda_graph.capture_begin(
(EngineCore pid=3374923) File "/home/sagemoore/git/nm-vllm/.venv/lib64/python3.12/site-packages/torch/cuda/graphs.py", line 115, in capture_begin
(EngineCore pid=3374923) super().capture_begin(pool=pool, capture_error_mode=capture_error_mode)
(EngineCore pid=3374923) RuntimeError: CUDA graphs must be captured on a non-default stream. (However, after capture, it's ok to replay them on the default stream.)
Name of failing test
tests/v1/e2e/spec_decode/test_lora_with_spec_decode.py::test_batch_inference_correctness[model_setup0]
Basic information
🧪 Describe the failing test
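Here's the full call stack (the traceback above). The EngineCore process dies during startup: after the PIECEWISE and FULL CUDA graph captures report 100%, compile_or_warm_up_model calls warmup_kernels, and the forward pass reaches vllm/compilation/cuda_graph.py, where entering torch.cuda.graph raises "RuntimeError: CUDA graphs must be captured on a non-default stream."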
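For readers unfamiliar with the constraint named in the RuntimeError, here is a minimal sketch of capturing a CUDA graph on a side stream in plain PyTorch and then replaying it from the default stream. It is not vLLM code and does not diagnose this failure; `capture_and_replay` is a hypothetical helper, and the import guard and CUDA check are only there (as an assumption about the reader's environment) so the snippet does nothing on machines without PyTorch or a GPU.

```python
try:
    import torch
except ImportError:  # assumption: allow the sketch to be imported without PyTorch
    torch = None


def capture_and_replay(n: int = 8):
    """Capture `x * 2 + 1` into a CUDA graph on a side stream, then replay it.

    Returns the replayed output on the CPU, or None when PyTorch or CUDA
    is unavailable.
    """
    if torch is None or not torch.cuda.is_available():
        return None

    # Graph inputs must live in static buffers that are refilled in place.
    static_in = torch.zeros(n, device="cuda")

    # Warm up on a side stream before capture so lazy initialization is done.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        for _ in range(3):
            static_out = static_in * 2 + 1
    torch.cuda.current_stream().wait_stream(side)

    # Capture on the non-default stream `side`. Passing the default stream
    # here is what the error message in the traceback forbids.
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph, stream=side):
        static_out = static_in * 2 + 1

    # Replay is allowed on the default stream, as the error message notes.
    static_in.fill_(3.0)
    graph.replay()
    torch.cuda.synchronize()
    return static_out.cpu()
```

In the traceback, the capture is entered from vllm/compilation/cuda_graph.py during kernel warmup, so the stream that is current at that point is worth checking when bisecting from f887aa1a53.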
📝 History of failing test
According to the CI dashboard, this test has been failing 100% of the time since f887aa1a53.
https://buildkite.com/vllm/ci/builds/66298
CC List.
@MatthewBonanni @LucasWilkinson @benchislett