[CI Failure]: Spec Decode Draft Model fails during graph capture #42999

@SageMoore

Name of failing test

tests/v1/e2e/spec_decode/test_lora_with_spec_decode.py::test_batch_inference_correctness[model_setup0]

Basic information

  • Flaky test
  • Can reproduce locally
  • Caused by external libraries (e.g. bug in transformers)

🧪 Describe the failing test

Here's the full call stack:

Capturing CUDA graphs (PIECEWISE): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 13.36it/s]
Capturing CUDA graphs (FULL): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 16.54it/s]
(EngineCore pid=3374923) Process EngineCore:
(EngineCore pid=3374923) Traceback (most recent call last):
(EngineCore pid=3374923)   File "/usr/lib64/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=3374923)     self.run()
(EngineCore pid=3374923)   File "/usr/lib64/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore pid=3374923)     self._target(*self._args, **self._kwargs)
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/vllm/v1/engine/core.py", line 1163, in run_engine_core
(EngineCore pid=3374923)     raise e
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/vllm/v1/engine/core.py", line 1133, in run_engine_core
(EngineCore pid=3374923)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=3374923)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=3374923)     return func(*args, **kwargs)
(EngineCore pid=3374923)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/vllm/v1/engine/core.py", line 899, in __init__
(EngineCore pid=3374923)     super().__init__(
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/vllm/v1/engine/core.py", line 128, in __init__
(EngineCore pid=3374923)     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=3374923)                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=3374923)     return func(*args, **kwargs)
(EngineCore pid=3374923)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/vllm/v1/engine/core.py", line 283, in _initialize_kv_caches
(EngineCore pid=3374923)     self.model_executor.initialize_from_config(kv_cache_configs)
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/vllm/v1/executor/abstract.py", line 124, in initialize_from_config
(EngineCore pid=3374923)     compilation_times: list[CompilationTimes] = self.collective_rpc(
(EngineCore pid=3374923)                                                 ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/vllm/v1/executor/uniproc_executor.py", line 93, in collective_rpc
(EngineCore pid=3374923)     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=3374923)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/vllm/v1/serial_utils.py", line 510, in run_method
(EngineCore pid=3374923)     return func(*args, **kwargs)
(EngineCore pid=3374923)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=3374923)     return func(*args, **kwargs)
(EngineCore pid=3374923)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/vllm/v1/worker/gpu_worker.py", line 689, in compile_or_warm_up_model
(EngineCore pid=3374923)     warmup_kernels(self.model_runner, self.execute_model, self.sample_tokens)
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/.venv/lib64/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=3374923)     return func(*args, **kwargs)
(EngineCore pid=3374923)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/vllm/v1/worker/gpu/warmup.py", line 100, in warmup_kernels
(EngineCore pid=3374923)     worker_execute_model(prefill_output)
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/.venv/lib64/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=3374923)     return func(*args, **kwargs)
(EngineCore pid=3374923)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/vllm/v1/worker/gpu_worker.py", line 843, in execute_model
(EngineCore pid=3374923)     output = self.model_runner.execute_model(
(EngineCore pid=3374923)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/.venv/lib64/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=3374923)     return func(*args, **kwargs)
(EngineCore pid=3374923)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/vllm/v1/worker/gpu/model_runner.py", line 1167, in execute_model
(EngineCore pid=3374923)     model_output = self.model(**model_inputs)
(EngineCore pid=3374923)                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/.venv/lib64/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
(EngineCore pid=3374923)     return self._call_impl(*args, **kwargs)
(EngineCore pid=3374923)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/.venv/lib64/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
(EngineCore pid=3374923)     return forward_call(*args, **kwargs)
(EngineCore pid=3374923)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/vllm/model_executor/models/qwen3.py", line 323, in forward
(EngineCore pid=3374923)     hidden_states = self.model(
(EngineCore pid=3374923)                     ^^^^^^^^^^^
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/vllm/compilation/decorators.py", line 520, in __call__
(EngineCore pid=3374923)     return self.aot_compiled_fn(self, *args, **kwargs)
(EngineCore pid=3374923)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/.venv/lib64/python3.12/site-packages/torch/_dynamo/aot_compile.py", line 224, in __call__
(EngineCore pid=3374923)     return self.fn(*args, **kwargs)
(EngineCore pid=3374923)            ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/vllm/model_executor/models/qwen2.py", line 389, in forward
(EngineCore pid=3374923)     def forward(
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/vllm/compilation/caching.py", line 217, in __call__
(EngineCore pid=3374923)     return self.optimized_call(*args, **kwargs)
(EngineCore pid=3374923)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923)   File "<string>", line 145, in execution_fn
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/vllm/compilation/cuda_graph.py", line 313, in __call__
(EngineCore pid=3374923)     with torch.cuda.graph(
(EngineCore pid=3374923)          ^^^^^^^^^^^^^^^^^
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/.venv/lib64/python3.12/site-packages/torch/cuda/graphs.py", line 257, in __enter__
(EngineCore pid=3374923)     self.cuda_graph.capture_begin(
(EngineCore pid=3374923)   File "/home/sagemoore/git/nm-vllm/.venv/lib64/python3.12/site-packages/torch/cuda/graphs.py", line 115, in capture_begin
(EngineCore pid=3374923)     super().capture_begin(pool=pool, capture_error_mode=capture_error_mode)
(EngineCore pid=3374923) RuntimeError: CUDA graphs must be captured on a non-default stream. (However, after capture, it's ok to replay them on the default stream.)
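The final error states a PyTorch constraint: graph capture has to begin while a non-default CUDA stream is current. As a minimal sketch of that constraint only (this is not vLLM's capture code, and `capture_and_replay` is a hypothetical helper name), the snippet below captures a trivial op on a side stream and replays it. It assumes PyTorch is installed and returns `None` when no GPU is available.

```python
import torch


def capture_and_replay(n: int = 4):
    """Capture `y = x * 2` into a CUDA graph on a side stream, then replay it."""
    if not torch.cuda.is_available():
        return None  # no GPU available: nothing to demonstrate

    static_x = torch.ones(n, device="cuda")

    # Warm up on a side stream before capturing, as the PyTorch docs suggest.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        static_y = static_x * 2
    torch.cuda.current_stream().wait_stream(side)

    graph = torch.cuda.CUDAGraph()
    # Capture with a non-default stream current. torch.cuda.graph also
    # selects a side stream on its own when `stream` is omitted.
    with torch.cuda.graph(graph, stream=side):
        static_y = static_x * 2

    # Update the static input in place, then replay the captured work.
    static_x.fill_(3.0)
    graph.replay()
    torch.cuda.synchronize()
    return static_y.cpu().tolist()


if __name__ == "__main__":
    print(capture_and_replay())
```

If the stream that is current at capture time turns out to be the default stream, `capture_begin` raises the `RuntimeError` shown in the traceback. Whether that is what happens in the draft-model path here is an assumption that still needs confirming.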

📝 History of failing test

According to the CI dashboard, this test has been failing 100% of the time since f887aa1a53.

https://buildkite.com/vllm/ci/builds/66298

CC List.

@MatthewBonanni @LucasWilkinson @benchislett

Metadata

    Labels: ci-failure (Issue about an unexpected test failure in CI)

    Projects

    Status
    Done
