
[Serve][LLM] Fix inconsistent v0/v1 config passed to vLLM#52185

Merged
richardliaw merged 2 commits into ray-project:master from ruisearch42:v0v1
Apr 9, 2025

Conversation

@ruisearch42
Contributor

Why are these changes needed?

We observed the following error when deploying Ray LLM V0:


2025-04-09, 10:50:12.511 | worker | ip-10-0-97-109:

```
    return run_method(self, method, args, kwargs)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/utils.py", line 2255, in run_method
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 604, in init_device
    self.worker.init_device()  # type: ignore
  File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 120, in init_device
    self.model_runner: GPUModelRunner = GPUModelRunner(
  File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 110, in __init__
    self.attn_backend = get_attn_backend(
  File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/attention/selector.py", line 95, in get_attn_backend
    return _cached_get_attn_backend(
  File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/attention/selector.py", line 148, in _cached_get_attn_backend
    attention_cls = current_platform.get_attn_backend_cls(
  File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/platforms/cuda.py", line 270, in get_attn_backend_cls
    and kv_cache_dtype.startswith("fp8"))
AttributeError: 'torch.dtype' object has no attribute 'startswith'
```

This happens because vLLM handles `kv_cache_dtype` inconsistently: in V0 it is a string, while in V1 it is a `torch.dtype`. When the V0 and V1 configs get mixed, this issue manifests.
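The failing check can be sketched as follows. This is an illustrative reduction, not vLLM's actual code; the stand-in class below mimics `torch.dtype`, which likewise has no `startswith` method:

```python
# V0 configs carry kv_cache_dtype as a string (e.g. "fp8_e4m3" or "auto");
# V1 configs may carry an already-resolved torch.dtype.
def uses_fp8_kv_cache(kv_cache_dtype) -> bool:
    # V0-style string check, as in the traceback's cuda.py frame
    return kv_cache_dtype.startswith("fp8")


class DtypeStandIn:
    """Stand-in for torch.dtype: an object without a startswith() method."""


print(uses_fp8_kv_cache("fp8_e4m3"))  # V0-style string config works: True

try:
    uses_fp8_kv_cache(DtypeStandIn())  # V1-style dtype object fails
except AttributeError as err:
    print(type(err).__name__)  # AttributeError, as in the traceback above
```

Passing a V1-resolved dtype through the V0 string check reproduces exactly the `AttributeError` in the log.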

Another recent change in vLLM is that it automatically tries V1 and uses V1 configs when possible. Therefore, when the VLLM_USE_V1 environment variable is not set, Ray LLM defaults to V0 while vLLM may pick V1. This PR fixes the inconsistency by always explicitly passing the environment variable.
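The shape of the fix can be sketched like this. The `VLLM_USE_V1` variable name comes from the description above; the helper function is hypothetical, not Ray Serve LLM's actual API:

```python
import os


def engine_env_vars(use_v1: bool) -> dict:
    """Build the environment for the vLLM engine process.

    VLLM_USE_V1 is always set explicitly ("1" or "0") so vLLM cannot
    auto-select an engine version that disagrees with Ray LLM's config.
    (Illustrative helper; the real change lives in Ray's serve.llm code.)
    """
    env = dict(os.environ)
    env["VLLM_USE_V1"] = "1" if use_v1 else "0"
    return env


print(engine_env_vars(use_v1=False)["VLLM_USE_V1"])  # "0"
print(engine_env_vars(use_v1=True)["VLLM_USE_V1"])   # "1"
```

With the variable always set, both sides resolve to the same engine version, so a V1 `torch.dtype` can no longer reach a V0 string check.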

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
@ruisearch42 ruisearch42 requested a review from a team as a code owner April 9, 2025 19:38
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
@richardliaw richardliaw added the "go" label (add ONLY when ready to merge, run all tests) Apr 9, 2025
@richardliaw richardliaw merged commit 788e968 into ray-project:master Apr 9, 2025
6 checks passed
han-steve pushed a commit to han-steve/ray that referenced this pull request Apr 11, 2025
…t#52185)

---------

Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Signed-off-by: Steve Han <stevehan2001@gmail.com>

Labels

community-backlog, go (add ONLY when ready to merge, run all tests)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants