Closed
Labels
bug, data, llm, release-test, serve, stability, triage
Description
What happened + What you expected to happen
AssertionError: failed to get the hash of the compiled graph when running the reproducer configuration below. Likely downstream of vllm-project/vllm#18851, but not certain.
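In case it helps triage: the failure looks consistent with a stale or corrupted entry under ~/.cache/vllm/torch_compile_cache on the GPU node, which is why the reproduction script below deletes that cache on the worker before running. The same pre-flight step, sketched standalone for local debugging (hypothetical snippet, it only mirrors what the script's helper does):

import os
import shutil

# Hypothetical pre-flight step for local debugging only: remove the torch.compile
# cache that vLLM writes under ~/.cache/vllm before re-running the reproducer.
cache_dir = os.path.expanduser("~/.cache/vllm/torch_compile_cache")
if os.path.isdir(cache_dir):
    shutil.rmtree(cache_dir)
    print(f"Removed {cache_dir}")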
(base) ray@ip-10-0-13-54:~/default/work/ray/release/llm_tests/batch$ pytest -vs /home/ray/default/work/ray/release/llm_tests/batch/test_batch_vllm.py::test_vllm_vision_language_models
================================ test session starts ================================
platform linux -- Python 3.11.11, pytest-8.4.0, pluggy-1.5.0 -- /home/ray/anaconda3/bin/python
cachedir: .pytest_cache
rootdir: /home/ray/default/work/ray
configfile: pytest.ini
plugins: anyio-3.7.1
collected 1 item
test_batch_vllm.py::test_vllm_vision_language_models[mistral-community/pixtral-12b-2-1-2-60] 2025-06-13 17:30:36,155 INFO worker.py:1736 -- Connecting to existing Ray cluster at address: 10.0.13.54:6379...
2025-06-13 17:30:36,165 INFO worker.py:1907 -- Connected to Ray cluster. View the dashboard at https://session-cjxl8nq2k9baamrb4gpslw58iy.i.anyscaleuserdata-staging.com
2025-06-13 17:30:36,167 INFO packaging.py:380 -- Pushing file package 'gcs://_ray_pkg_dc51d41baf2dee7dc24a5935404938beb5eb9dbf.zip' (0.03MiB) to Ray cluster...
2025-06-13 17:30:36,167 INFO packaging.py:393 -- Successfully pushed file package 'gcs://_ray_pkg_dc51d41baf2dee7dc24a5935404938beb5eb9dbf.zip'.
2025-06-13 17:30:37,241 INFO worker.py:1736 -- Connecting to existing Ray cluster at address: 10.0.13.54:6379...
2025-06-13 17:30:37,241 INFO worker.py:1754 -- Calling ray.init() again after it has already been called.
No cloud storage mirror configured
2025-06-13 17:30:37,752 WARNING util.py:596 -- The argument ``compute`` is deprecated in Ray 2.9. Please specify argument ``concurrency`` instead. For more information, see https://docs.ray.io/en/master/data/transforming-data.html#stateful-transforms.
2025-06-13 17:30:37,755 INFO logging.py:295 -- Registered dataset logger for dataset dataset_55_0
2025-06-13 17:30:37,769 INFO streaming_executor.py:117 -- Starting execution of Dataset dataset_55_0. Full logs are in /tmp/ray/session_2025-06-13_16-49-27_643862_2297/logs/ray-data
2025-06-13 17:30:37,770 INFO streaming_executor.py:118 -- Execution plan of Dataset dataset_55_0: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadRange] -> ActorPoolMapOperator[Map(<lambda>)->Map(_preprocess)->MapBatches(PrepareImageUDF)] -> ActorPoolMapOperator[MapBatches(ChatTemplateUDF)] -> ActorPoolMapOperator[MapBatches(vLLMEngineStageUDF)] -> TaskPoolMapOperator[Map(_postprocess)]
Running 0: 0.00 row [00:00, ? row/s]2025-06-13 17:30:37,844 INFO actor_pool_map_operator.py:630 -- Scaling up actor pool by 1 (reason=scaling to min size, running=0, restarting=0, pending=0)
2025-06-13 17:30:39,636 INFO actor_pool_map_operator.py:630 -- Scaling up actor pool by 1 (reason=scaling to min size, running=0, restarting=0, pending=0)
(_MapWorker pid=15582, ip=10.0.149.9) No cloud storage mirror configured
(_MapWorker pid=15582, ip=10.0.149.9) Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Running 0: 0.00 row [00:06, ? row/s]2025-06-13 17:30:45,254 INFO actor_pool_map_operator.py:630 -- Scaling up actor pool by 2 (reason=scaling to min size, running=0, restarting=0, pending=0)
(_MapWorker pid=15668, ip=10.0.149.9) Max pending requests is set to 141
(_MapWorker pid=15668, ip=10.0.149.9) INFO 06-13 17:30:49 [__init__.py:243] Automatically detected platform cuda.
(_MapWorker pid=15669, ip=10.0.149.9) INFO 06-13 17:30:51 [__init__.py:31] Available plugins for group vllm.general_plugins:
(_MapWorker pid=15669, ip=10.0.149.9) INFO 06-13 17:30:51 [__init__.py:33] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
(_MapWorker pid=15669, ip=10.0.149.9) INFO 06-13 17:30:51 [__init__.py:36] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
(_MapWorker pid=15669, ip=10.0.149.9) INFO 06-13 17:30:59 [config.py:2118] Chunked prefill is enabled with max_num_batched_tokens=5120.
(_MapWorker pid=15669, ip=10.0.149.9) INFO 06-13 17:30:49 [__init__.py:243] Automatically detected platform cuda.
(_MapWorker pid=15668, ip=10.0.149.9) INFO 06-13 17:30:51 [__init__.py:31] Available plugins for group vllm.general_plugins:
(_MapWorker pid=15668, ip=10.0.149.9) INFO 06-13 17:30:51 [__init__.py:33] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
(_MapWorker pid=15668, ip=10.0.149.9) INFO 06-13 17:30:51 [__init__.py:36] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
(_MapWorker pid=15669, ip=10.0.149.9) WARNING 06-13 17:31:01 [utils.py:2531] We must use the `spawn` multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/usage/troubleshooting.html#python-multiprocessing for more information. Reason: In a Ray actor and can only be spawned
(_MapWorker pid=15669, ip=10.0.149.9) INFO 06-13 17:31:04 [__init__.py:243] Automatically detected platform cuda.
(_MapWorker pid=15668, ip=10.0.149.9) INFO 06-13 17:31:00 [config.py:2118] Chunked prefill is enabled with max_num_batched_tokens=5120. [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(_MapWorker pid=15669, ip=10.0.149.9) INFO 06-13 17:31:07 [core.py:438] Waiting for init message from front-end.
(_MapWorker pid=15668, ip=10.0.149.9) WARNING 06-13 17:31:01 [utils.py:2531] We must use the `spawn` multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/usage/troubleshooting.html#python-multiprocessing for more information. Reason: In a Ray actor and can only be spawned
(_MapWorker pid=15669, ip=10.0.149.9) INFO 06-13 17:31:07 [__init__.py:31] Available plugins for group vllm.general_plugins:
(_MapWorker pid=15669, ip=10.0.149.9) INFO 06-13 17:31:07 [__init__.py:33] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
(_MapWorker pid=15669, ip=10.0.149.9) INFO 06-13 17:31:07 [__init__.py:36] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
(_MapWorker pid=15669, ip=10.0.149.9) INFO 06-13 17:31:07 [core.py:65] Initializing a V1 LLM engine (v0.9.0.1) with config: model='mistral-community/pixtral-12b', speculative_config=None, tokenizer='mistral-community/pixtral-12b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=mistral-community/pixtral-12b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level": 3, "custom_ops": ["none"], "splitting_ops": ["vllm.unified_attention", "vllm.unified_attention_with_output"], "compile_sizes": [], "inductor_compile_config": {"enable_auto_functionalized_v2": false}, "use_cudagraph": true, "cudagraph_num_of_warmups": 1, "cudagraph_capture_sizes": [512, 504, 496, 488, 480, 472, 464, 456, 448, 440, 432, 424, 416, 408, 400, 392, 384, 376, 368, 360, 352, 344, 336, 328, 320, 312, 304, 296, 288, 280, 272, 264, 256, 248, 240, 232, 224, 216, 208, 200, 192, 184, 176, 168, 160, 152, 144, 136, 128, 120, 112, 104, 96, 88, 80, 72, 64, 56, 48, 40, 32, 24, 16, 8, 4, 2, 1], "max_capture_size": 512}
(_MapWorker pid=15669, ip=10.0.149.9) INFO 06-13 17:31:07 [shm_broadcast.py:250] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1], buffer_handle=(2, 10485760, 10, 'psm_dc570d4b'), local_subscribe_addr='ipc:///tmp/bfa38bda-2d50-4e85-89d6-b22ae41764f3', remote_subscribe_addr=None, remote_addr_ipv6=False)
(_MapWorker pid=15669, ip=10.0.149.9) INFO 06-13 17:31:10 [__init__.py:243] Automatically detected platform cuda. [repeated 2x across cluster]
(_MapWorker pid=15668, ip=10.0.149.9) INFO 06-13 17:31:07 [core.py:438] Waiting for init message from front-end.
(_MapWorker pid=15669, ip=10.0.149.9) INFO 06-13 17:31:13 [__init__.py:31] Available plugins for group vllm.general_plugins: [repeated 2x across cluster]
(_MapWorker pid=15669, ip=10.0.149.9) INFO 06-13 17:31:13 [__init__.py:33] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver [repeated 2x across cluster]
(_MapWorker pid=15669, ip=10.0.149.9) INFO 06-13 17:31:13 [__init__.py:36] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load. [repeated 2x across cluster]
(_MapWorker pid=15668, ip=10.0.149.9) INFO 06-13 17:31:07 [core.py:65] Initializing a V1 LLM engine (v0.9.0.1) with config: model='mistral-community/pixtral-12b', speculative_config=None, tokenizer='mistral-community/pixtral-12b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=mistral-community/pixtral-12b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level": 3, "custom_ops": ["none"], "splitting_ops": ["vllm.unified_attention", "vllm.unified_attention_with_output"], "compile_sizes": [], "inductor_compile_config": {"enable_auto_functionalized_v2": false}, "use_cudagraph": true, "cudagraph_num_of_warmups": 1, "cudagraph_capture_sizes": [512, 504, 496, 488, 480, 472, 464, 456, 448, 440, 432, 424, 416, 408, 400, 392, 384, 376, 368, 360, 352, 344, 336, 328, 320, 312, 304, 296, 288, 280, 272, 264, 256, 248, 240, 232, 224, 216, 208, 200, 192, 184, 176, 168, 160, 152, 144, 136, 128, 120, 112, 104, 96, 88, 80, 72, 64, 56, 48, 40, 32, 24, 16, 8, 4, 2, 1], "max_capture_size": 512}
(_MapWorker pid=15668, ip=10.0.149.9) INFO 06-13 17:31:07 [shm_broadcast.py:250] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1], buffer_handle=(2, 10485760, 10, 'psm_48bdd67d'), local_subscribe_addr='ipc:///tmp/c8eb3019-c658-4cd3-abf0-32bbe9adcfe8', remote_subscribe_addr=None, remote_addr_ipv6=False)
(_MapWorker pid=15669, ip=10.0.149.9) WARNING 06-13 17:31:13 [utils.py:2671] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7c218caa9190>
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) INFO 06-13 17:31:13 [shm_broadcast.py:250] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_d2c7c052'), local_subscribe_addr='ipc:///tmp/0041824d-a4bb-404c-a9e7-3c215af0a0af', remote_subscribe_addr=None, remote_addr_ipv6=False)
(_MapWorker pid=15669, ip=10.0.149.9) WARNING 06-13 17:31:14 [utils.py:2671] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x76c6ebda1050>
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=0 pid=15910) INFO 06-13 17:31:14 [shm_broadcast.py:250] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_25225be8'), local_subscribe_addr='ipc:///tmp/bf132730-18c9-48f9-a412-a334ac185c67', remote_subscribe_addr=None, remote_addr_ipv6=False)
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=0 pid=15910) INFO 06-13 17:31:15 [utils.py:1077] Found nccl from library libnccl.so.2
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) INFO 06-13 17:31:15 [utils.py:1077] Found nccl from library libnccl.so.2
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=0 pid=15910) INFO 06-13 17:31:15 [pynccl.py:69] vLLM is using nccl==2.26.2
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) INFO 06-13 17:31:15 [pynccl.py:69] vLLM is using nccl==2.26.2
(_MapWorker pid=15668, ip=10.0.149.9) INFO 06-13 17:31:10 [__init__.py:243] Automatically detected platform cuda. [repeated 3x across cluster]
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=0 pid=15910) INFO 06-13 17:31:15 [custom_all_reduce_utils.py:245] reading GPU P2P access cache from /home/ray/.cache/vllm/gpu_p2p_access_cache_for_2,3.json
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) INFO 06-13 17:31:15 [custom_all_reduce_utils.py:245] reading GPU P2P access cache from /home/ray/.cache/vllm/gpu_p2p_access_cache_for_2,3.json
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=0 pid=15910) WARNING 06-13 17:31:15 [custom_all_reduce.py:146] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) WARNING 06-13 17:31:15 [custom_all_reduce.py:146] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) INFO 06-13 17:31:15 [parallel_state.py:1064] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=0 pid=15910) INFO 06-13 17:31:15 [parallel_state.py:1064] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(_MapWorker pid=15669, ip=10.0.149.9) No cloud storage mirror configured [repeated 2x across cluster]
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=0 pid=15910) Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`. [repeated 2x across cluster]
(_MapWorker pid=15669, ip=10.0.149.9) Max pending requests is set to 141
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) WARNING 06-13 17:31:18 [topk_topp_sampler.py:58] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) INFO 06-13 17:31:18 [gpu_model_runner.py:1531] Starting to load model mistral-community/pixtral-12b...
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=0 pid=15910) WARNING 06-13 17:31:18 [topk_topp_sampler.py:58] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=0 pid=15910) INFO 06-13 17:31:18 [gpu_model_runner.py:1531] Starting to load model mistral-community/pixtral-12b...
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) INFO 06-13 17:31:18 [cuda.py:217] Using Flash Attention backend on V1 engine.
(_MapWorker pid=15668, ip=10.0.149.9) INFO 06-13 17:31:14 [__init__.py:31] Available plugins for group vllm.general_plugins: [repeated 3x across cluster]
(_MapWorker pid=15668, ip=10.0.149.9) INFO 06-13 17:31:14 [__init__.py:33] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver [repeated 3x across cluster]
(_MapWorker pid=15668, ip=10.0.149.9) INFO 06-13 17:31:14 [__init__.py:36] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load. [repeated 3x across cluster]
(_MapWorker pid=15668, ip=10.0.149.9) WARNING 06-13 17:31:14 [utils.py:2671] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x78034e3355d0> [repeated 2x across cluster]
(_MapWorker pid=15668, ip=10.0.149.9) (VllmWorker rank=0 pid=15916) INFO 06-13 17:31:15 [shm_broadcast.py:250] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_c05053a1'), local_subscribe_addr='ipc:///tmp/9c5bbd66-4a2f-42e0-a41d-a3d1faa84d13', remote_subscribe_addr=None, remote_addr_ipv6=False) [repeated 4x across cluster]
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=0 pid=15910) INFO 06-13 17:31:18 [cuda.py:217] Using Flash Attention backend on V1 engine.
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) INFO 06-13 17:31:19 [backends.py:35] Using InductorAdaptor
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=0 pid=15910) INFO 06-13 17:31:19 [backends.py:35] Using InductorAdaptor
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=0 pid=15910) INFO 06-13 17:31:19 [weight_utils.py:291] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/6 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 17% Completed | 1/6 [00:00<00:02, 1.69it/s]
(_MapWorker pid=15668, ip=10.0.149.9) (VllmWorker rank=0 pid=15916) Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`. [repeated 2x across cluster]
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=0 pid=15910)
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=0 pid=15910) INFO 06-13 17:31:23 [default_loader.py:280] Loading weights took 4.11 seconds
(_MapWorker pid=15668, ip=10.0.149.9) (VllmWorker rank=0 pid=15916) INFO 06-13 17:31:15 [utils.py:1077] Found nccl from library libnccl.so.2 [repeated 2x across cluster]
(_MapWorker pid=15668, ip=10.0.149.9) (VllmWorker rank=0 pid=15916) INFO 06-13 17:31:15 [pynccl.py:69] vLLM is using nccl==2.26.2 [repeated 2x across cluster]
(_MapWorker pid=15668, ip=10.0.149.9) (VllmWorker rank=0 pid=15916) INFO 06-13 17:31:15 [custom_all_reduce_utils.py:245] reading GPU P2P access cache from /home/ray/.cache/vllm/gpu_p2p_access_cache_for_0,1.json [repeated 2x across cluster]
(_MapWorker pid=15668, ip=10.0.149.9) (VllmWorker rank=0 pid=15916) WARNING 06-13 17:31:15 [custom_all_reduce.py:146] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly. [repeated 2x across cluster]
(_MapWorker pid=15668, ip=10.0.149.9) (VllmWorker rank=1 pid=15917) INFO 06-13 17:31:15 [parallel_state.py:1064] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1 [repeated 2x across cluster]
(_MapWorker pid=15668, ip=10.0.149.9) (VllmWorker rank=0 pid=15916) WARNING 06-13 17:31:19 [topk_topp_sampler.py:58] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer. [repeated 2x across cluster]
(_MapWorker pid=15668, ip=10.0.149.9) (VllmWorker rank=0 pid=15916) INFO 06-13 17:31:19 [gpu_model_runner.py:1531] Starting to load model mistral-community/pixtral-12b... [repeated 2x across cluster]
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=0 pid=15910) INFO 06-13 17:31:23 [gpu_model_runner.py:1549] Model loading took 12.0733 GiB and 4.761811 seconds
(_MapWorker pid=15668, ip=10.0.149.9) (VllmWorker rank=0 pid=15916) INFO 06-13 17:31:19 [cuda.py:217] Using Flash Attention backend on V1 engine. [repeated 2x across cluster]
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) INFO 06-13 17:31:24 [gpu_model_runner.py:1863] Encoder cache will be initialized with a budget of 5120 tokens, and profiled with 2 image items of the maximum feature size.
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=0 pid=15910) INFO 06-13 17:31:24 [gpu_model_runner.py:1863] Encoder cache will be initialized with a budget of 5120 tokens, and profiled with 2 image items of the maximum feature size.
(_MapWorker pid=15668, ip=10.0.149.9) (VllmWorker rank=0 pid=15916) INFO 06-13 17:31:19 [backends.py:35] Using InductorAdaptor [repeated 2x across cluster]
(_MapWorker pid=15668, ip=10.0.149.9) (VllmWorker rank=0 pid=15916)
(_MapWorker pid=15668, ip=10.0.149.9) (VllmWorker rank=0 pid=15916) INFO 06-13 17:31:19 [weight_utils.py:291] Using model weights format ['*.safetensors'] [repeated 3x across cluster]
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=0 pid=15910) INFO 06-13 17:31:32 [backends.py:459] Using cache directory: /home/ray/.cache/vllm/torch_compile_cache/c5b6593c95/rank_0_0 for vLLM's torch.compile
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=0 pid=15910) INFO 06-13 17:31:32 [backends.py:469] Dynamo bytecode transform time: 7.78 s
(_MapWorker pid=15668, ip=10.0.149.9) (VllmWorker rank=0 pid=15916) INFO 06-13 17:31:24 [default_loader.py:280] Loading weights took 4.27 seconds [repeated 3x across cluster]
(_MapWorker pid=15668, ip=10.0.149.9) (VllmWorker rank=0 pid=15916) INFO 06-13 17:31:24 [gpu_model_runner.py:1549] Model loading took 12.0733 GiB and 5.113338 seconds [repeated 3x across cluster]
(_MapWorker pid=15668, ip=10.0.149.9) (VllmWorker rank=1 pid=15917) INFO 06-13 17:31:24 [gpu_model_runner.py:1863] Encoder cache will be initialized with a budget of 5120 tokens, and profiled with 2 image items of the maximum feature size. [repeated 2x across cluster]
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) INFO 06-13 17:31:32 [backends.py:459] Using cache directory: /home/ray/.cache/vllm/torch_compile_cache/c5b6593c95/rank_1_0 for vLLM's torch.compile
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) INFO 06-13 17:31:32 [backends.py:469] Dynamo bytecode transform time: 7.87 s
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=0 pid=15910) [rank0]:W0613 17:31:33.461000 15910 site-packages/torch/_inductor/utils.py:1250] [0/0] Not enough SMs to use max_autotune_gemm mode
Loading safetensors checkpoint shards: 0% Completed | 0/6 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 6/6 [00:04<00:00, 1.44it/s] [repeated 13x across cluster]
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) [rank1]:W0613 17:31:33.546000 15911 site-packages/torch/_inductor/utils.py:1250] [0/0] Not enough SMs to use max_autotune_gemm mode
(_MapWorker pid=15668, ip=10.0.149.9) (VllmWorker rank=0 pid=15916) INFO 06-13 17:31:34 [backends.py:158] Cache the graph of shape None for later use
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] WorkerProc hit an exception.
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] Traceback (most recent call last):
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/v1/executor/multiproc_executor.py", line 517, in worker_busy_loop
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] output = func(*args, **kwargs)
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] return func(*args, **kwargs)
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 185, in determine_available_memory
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] self.model_runner.profile_run()
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1897, in profile_run
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] hidden_states = self._dummy_run(self.max_num_tokens)
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] return func(*args, **kwargs)
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1732, in _dummy_run
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] outputs = model(
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] ^^^^^^
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] return self._call_impl(*args, **kwargs)
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] return forward_call(*args, **kwargs)
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/model_executor/models/llava.py", line 738, in forward
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] hidden_states = self.language_model.model(input_ids,
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/compilation/decorators.py", line 238, in __call__
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] output = self.compiled_callable(*args, **kwargs)
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 663, in _fn
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] raise e.remove_dynamo_frames() from None # see TORCHDYNAMO_VERBOSE=1
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/torch/_dynamo/output_graph.py", line 1544, in _call_user_compiler
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] raise BackendCompilerFailed(
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/torch/_dynamo/output_graph.py", line 1519, in _call_user_compiler
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] compiled_fn = compiler_fn(gm, self.example_inputs())
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/torch/_dynamo/repro/after_dynamo.py", line 150, in __call__
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] compiled_gm = compiler_fn(gm, example_inputs)
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/torch/_dynamo/repro/after_dynamo.py", line 150, in __call__
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] compiled_gm = compiler_fn(gm, example_inputs)
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/torch/__init__.py", line 2392, in __call__
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] return self.compiler_fn(model_, inputs_, **self.kwargs)
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/compilation/backends.py", line 498, in __call__
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] PiecewiseCompileInterpreter(self.split_gm, submod_names_to_compile,
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/compilation/backends.py", line 273, in run
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] return super().run(*fake_args)
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^^^
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/torch/fx/interpreter.py", line 171, in run
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] self.env[node] = self.run_node(node)
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/torch/fx/interpreter.py", line 240, in run_node
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] return getattr(self, n.op)(n.target, args, kwargs)
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/compilation/backends.py", line 289, in call_module
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] compiler_manager.compile(
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] ^^^^^^^^
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/compilation/backends.py", line 145, in compile
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] compiled_graph, handle = self.compiler.compile(
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^^
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/compilation/compiler_interface.py", line 415, in compile
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] assert hash_str is not None, (
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] torch._dynamo.exc.BackendCompilerFailed: backend='<vllm.compilation.backends.VllmBackend object at 0x7c208c122390>' raised:
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] AssertionError: failed to get the hash of the compiled graph
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522]
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522]
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] Traceback (most recent call last):
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/v1/executor/multiproc_executor.py", line 517, in worker_busy_loop
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] output = func(*args, **kwargs)
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] return func(*args, **kwargs)
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 185, in determine_available_memory
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] self.model_runner.profile_run()
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1897, in profile_run
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] hidden_states = self._dummy_run(self.max_num_tokens)
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] return func(*args, **kwargs)
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1732, in _dummy_run
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] outputs = model(
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] ^^^^^^
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] return self._call_impl(*args, **kwargs)
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] return forward_call(*args, **kwargs)
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/model_executor/models/llava.py", line 738, in forward
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] hidden_states = self.language_model.model(input_ids,
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/compilation/decorators.py", line 238, in __call__
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] output = self.compiled_callable(*args, **kwargs)
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 663, in _fn
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] raise e.remove_dynamo_frames() from None # see TORCHDYNAMO_VERBOSE=1
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/torch/_dynamo/output_graph.py", line 1544, in _call_user_compiler
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] raise BackendCompilerFailed(
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/torch/_dynamo/output_graph.py", line 1519, in _call_user_compiler
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] compiled_fn = compiler_fn(gm, self.example_inputs())
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/torch/_dynamo/repro/after_dynamo.py", line 150, in __call__
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] compiled_gm = compiler_fn(gm, example_inputs)
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/torch/_dynamo/repro/after_dynamo.py", line 150, in __call__
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] compiled_gm = compiler_fn(gm, example_inputs)
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/torch/__init__.py", line 2392, in __call__
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] return self.compiler_fn(model_, inputs_, **self.kwargs)
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/compilation/backends.py", line 498, in __call__
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] PiecewiseCompileInterpreter(self.split_gm, submod_names_to_compile,
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/compilation/backends.py", line 273, in run
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] return super().run(*fake_args)
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^^^
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/torch/fx/interpreter.py", line 171, in run
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] self.env[node] = self.run_node(node)
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/torch/fx/interpreter.py", line 240, in run_node
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] return getattr(self, n.op)(n.target, args, kwargs)
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/compilation/backends.py", line 289, in call_module
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] compiler_manager.compile(
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] ^^^^^^^^
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/compilation/backends.py", line 145, in compile
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] compiled_graph, handle = self.compiler.compile(
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^^^
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/compilation/compiler_interface.py", line 415, in compile
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] assert hash_str is not None, (
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] ^^^^^^^^^^^^^^^^^^^^
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] torch._dynamo.exc.BackendCompilerFailed: backend='<vllm.compilation.backends.VllmBackend object at 0x7c208c122390>' raised:
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] AssertionError: failed to get the hash of the compiled graph
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522]
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522] Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522]
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) ERROR 06-13 17:32:02 [multiproc_executor.py:522]
(_MapWorker pid=15668, ip=10.0.149.9) (VllmWorker rank=0 pid=15916) INFO 06-13 17:31:33 [backends.py:459] Using cache directory: /home/ray/.cache/vllm/torch_compile_cache/c5b6593c95/rank_0_0 for vLLM's torch.compile [repeated 2x across cluster]
(_MapWorker pid=15668, ip=10.0.149.9) (VllmWorker rank=0 pid=15916) INFO 06-13 17:31:33 [backends.py:469] Dynamo bytecode transform time: 7.85 s [repeated 2x across cluster]
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=1 pid=15911) INFO 06-13 17:31:34 [backends.py:158] Cache the graph of shape None for later use [repeated 3x across cluster]
(_MapWorker pid=15668, ip=10.0.149.9) (VllmWorker rank=0 pid=15916) INFO 06-13 17:32:04 [backends.py:170] Compiling a graph for general shape takes 31.02 s
(_MapWorker pid=15668, ip=10.0.149.9) (VllmWorker rank=1 pid=15917) INFO 06-13 17:32:04 [backends.py:170] Compiling a graph for general shape takes 31.21 s
(_MapWorker pid=15668, ip=10.0.149.9) (VllmWorker rank=0 pid=15916) INFO 06-13 17:32:24 [monitor.py:33] torch.compile takes 38.87 s in total
(_MapWorker pid=15668, ip=10.0.149.9) (VllmWorker rank=1 pid=15917) INFO 06-13 17:32:24 [monitor.py:33] torch.compile takes 38.88 s in total
(_MapWorker pid=15669, ip=10.0.149.9) (VllmWorker rank=0 pid=15910) INFO 06-13 17:32:04 [backends.py:170] Compiling a graph for general shape takes 31.81 s
(_MapWorker pid=15668, ip=10.0.149.9) INFO 06-13 17:32:25 [kv_cache_utils.py:637] GPU KV cache size: 78,112 tokens
(_MapWorker pid=15668, ip=10.0.149.9) INFO 06-13 17:32:25 [kv_cache_utils.py:640] Maximum concurrency for 4,096 tokens per request: 19.07x
(_MapWorker pid=15668, ip=10.0.149.9) INFO 06-13 17:32:25 [kv_cache_utils.py:637] GPU KV cache size: 78,112 tokens
(_MapWorker pid=15668, ip=10.0.149.9) INFO 06-13 17:32:25 [kv_cache_utils.py:640] Maximum concurrency for 4,096 tokens per request: 19.07x
Running 0: 0.00 row [01:47, ? row/s]^C
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! KeyboardInterrupt !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
python/ray/includes/common.pxi:83: KeyboardInterrupt
(to show a full traceback on KeyboardInterrupt use --full-trace)
============================================================================== no tests ran in 120.67s (0:02:00) ==============================================================================
(_MapWorker pid=15668, ip=10.0.149.9) (VllmWorker rank=0 pid=15916) [rank0]:W0613 17:31:34.083000 15916 site-packages/torch/_inductor/utils.py:1250] [0/0] Not enough SMs to use max_autotune_gemm mode [repeated 2x across cluster]
Running 0: 0.00 row [01:57, ? row/s]
Versions / Dependencies
anyscale/ray-llm:nightly-py311-cu124, vllm 0.9.0.1
Reproduction script
import shutil
import sys
import os
import logging

import pytest

import ray
from ray.data.llm import build_llm_processor, vLLMEngineProcessorConfig

logger = logging.getLogger(__name__)


@ray.remote(num_gpus=1)
def delete_torch_compile_cache_on_worker(path: str = "~/.cache/vllm/torch_compile_cache"):
    """Delete the torch compile cache on a worker node.

    Avoids AssertionError due to torch compile cache corruption.
    TODO(seiji): check if this is still needed after https://github.com/vllm-project/vllm/issues/18851 is fixed
    """
    torch_compile_cache_path = os.path.expanduser(path)
    if os.path.exists(torch_compile_cache_path):
        shutil.rmtree(torch_compile_cache_path)
        logger.warning(f"Deleted torch compile cache at {torch_compile_cache_path}")


@pytest.mark.parametrize(
    "model_source,tp_size,pp_size,concurrency,sample_size",
    [
        # LLaVA model with TP=1, PP=1, concurrency=1
        # ("llava-hf/llava-1.5-7b-hf", 1, 1, 1, 60),
        # Pixtral model with TP=2, PP=1, concurrency=2
        ("mistral-community/pixtral-12b", 2, 1, 2, 60),
    ],
)
def test_vllm_vision_language_models(
    model_source, tp_size, pp_size, concurrency, sample_size
):
    """Test vLLM with vision language models using different configurations."""
    cache_dir_template = "./torch_compile_cache/${HOSTNAME:-$(hostname)}/${CUDA_VISIBLE_DEVICES%,*}"
    ray.get(delete_torch_compile_cache_on_worker.remote(cache_dir_template))

    # vLLM v1 does not support a decoupled tokenizer, but since the tokenizer
    # runs in a separate process, the overhead should be moderate.
    tokenize = False
    detokenize = False

    # TODO(seiji): see if we can remove this once https://github.com/vllm-project/vllm/issues/18851 is fixed
    # Use shell variable expansion to ensure each GPU gets its own cache directory.
    # Similar to: export TORCHINDUCTOR_CACHE_DIR=./torch_compile_cache/${HOSTNAME}/${dev}
    processor_config = vLLMEngineProcessorConfig(
        model_source=model_source,
        task_type="generate",
        engine_kwargs=dict(
            tensor_parallel_size=tp_size,
            pipeline_parallel_size=pp_size,
            max_model_len=4096,
            enable_chunked_prefill=True,
        ),
        apply_chat_template=True,
        tokenize=tokenize,
        detokenize=detokenize,
        batch_size=16,
        concurrency=concurrency,
        has_image=True,
    )
    processor = build_llm_processor(
        processor_config,
        preprocess=lambda row: dict(
            model=model_source,
            messages=[
                {"role": "system", "content": "You are an assistant"},
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": f"Say {row['id']} words about this image.",
                        },
                        {
                            "type": "image",
                            "image": "https://vllm-public-assets.s3.us-west-2.amazonaws.com/vision_model_images/cherry_blossom.jpg",
                        },
                    ],
                },
            ],
            sampling_params=dict(
                temperature=0.3,
                max_tokens=50,
            ),
        ),
        postprocess=lambda row: {
            "resp": row["generated_text"],
        },
    )

    ds = ray.data.range(sample_size)
    ds = ds.map(lambda x: {"id": x["id"], "val": x["id"] + 5})
    ds = processor(ds)
    ds = ds.materialize()

    outs = ds.take_all()
    assert len(outs) == sample_size
    assert all("resp" in out for out in outs)
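For reference, the per-GPU cache isolation that the comments in the script allude to (export TORCHINDUCTOR_CACHE_DIR=./torch_compile_cache/${HOSTNAME}/${dev}) can also be sketched in Python. This is only an illustrative sketch, assuming that setting TORCHINDUCTOR_CACHE_DIR in the worker process before engine startup is sufficient; the helper name below is made up, not a Ray or vLLM API:

import os
import socket

def per_gpu_inductor_cache_dir(base: str = "./torch_compile_cache") -> str:
    # Hypothetical helper: build a cache path unique to this host and the first
    # visible GPU, mirroring the shell expansion in the comment above.
    host = os.environ.get("HOSTNAME") or socket.gethostname()
    dev = os.environ.get("CUDA_VISIBLE_DEVICES", "0").split(",")[0]
    return os.path.join(base, host, dev)

# Must be set in the worker process before torch.compile / vLLM engine init runs.
os.environ["TORCHINDUCTOR_CACHE_DIR"] = per_gpu_inductor_cache_dir()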
Issue Severity
None