[llm] upgrade vllm to 0.12.0#58026
Conversation
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
…emp-vllm-0.12.0 Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
# Conflicts:
#   python/deplocks/llm/rayllm_py311_cpu.lock
#   python/deplocks/llm/rayllm_py311_cu128.lock
#   python/deplocks/llm/rayllm_test_py311_cpu.lock
#   python/deplocks/llm/rayllm_test_py311_cu128.lock
Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
…rgument 'tokens_only'

Addresses ray-project#58973
- vLLM release 0.11.1 introduces a `tokens_only` argument to both FrontendArgs and EngineArgs. VLLMEngine.start() gathers arguments from both of them, which raises errors when collisions occur.
- Allow different argument sets to define the same argument by name, give precedence to the engine args in case of collisions, then merge the dicts.

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
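The collision-tolerant merge described above can be sketched as follows. This is a minimal illustration of the precedence rule, not the actual VLLMEngine code; the helper name and the sample argument dicts are assumptions:

```python
def merge_arg_dicts(frontend_args: dict, engine_args: dict) -> dict:
    """Merge two argument dicts that may define the same key by name.

    With dict unpacking, later entries override earlier ones, so
    engine_args takes precedence on collisions (e.g. 'tokens_only').
    """
    return {**frontend_args, **engine_args}


# Hypothetical FrontendArgs-style and EngineArgs-style dicts that both
# define 'tokens_only'; the engine's value wins in the merged result.
merged = merge_arg_dicts(
    {"host": "0.0.0.0", "tokens_only": False},
    {"model": "Qwen/Qwen2.5-0.5B", "tokens_only": True},
)
```

The design choice here is to resolve the collision silently by a fixed precedence order rather than raising, since both argument sets come from the same vLLM release and agree on the argument's meaning.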
- fixing ci error with `//ci/raydepsets:raydepsets -- build --all-configs` Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
release tests all passing except for `llm_batch_vllm_multi_node` (jailed) - tracked here #58062 (comment)
PR to fix (by Rui) - #58866
nrghosh left a comment
python/ray/llm/tests/batch/gpu/processor/test_vllm_engine_proc.py::test_embedding_model is failing due to
RuntimeError: flashinfer-cubin version (0.5.3) does not match flashinfer version (0.5.2)
there is a mismatch between flashinfer-python (0.5.2) and flashinfer-cubin (0.5.3). The lock file pins flashinfer-python==0.5.2, but then vLLM 0.11.2 pulls in flashinfer-cubin==0.5.3 as a transitive dependency.
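The failing runtime check above boils down to an equality test between the two wheel versions. A minimal sketch of that kind of check, with the version strings passed in directly (the function name is an assumption, not flashinfer's actual code):

```python
def check_flashinfer_versions(py_ver: str, cubin_ver: str) -> None:
    """Raise if flashinfer-python and flashinfer-cubin are out of sync.

    Mirrors the error seen in CI: the two wheels must be pinned to the
    same version, or the import-time check fails.
    """
    if py_ver != cubin_ver:
        raise RuntimeError(
            f"flashinfer-cubin version ({cubin_ver}) does not match "
            f"flashinfer version ({py_ver})"
        )
```

The practical fix on the lock-file side is to pin both packages to the same version explicitly, so a transitive bump of one cannot drift away from the other.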
…n required by vllm 0.11.2 Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
test failure for /doc:source/data/doc_code/working-with-llms/embedding_example
msgspec.ValidationError: Expected `bool`, got `None` - at `$[4][10]`
issue is in vllm - fixed here (merged): vllm-project/vllm#29364 - so pushing to 0.12.0
Signed-off-by: Nikhil G <nrghosh@users.noreply.github.com>
Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Signed-off-by: Nikhil G <nrghosh@users.noreply.github.com>
Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
vLLM renamed guided_decoding to structured_outputs and changed the embedding API:
- SamplingParams: GuidedDecodingParams -> StructuredOutputsParams, guided_decoding -> structured_outputs (vllm-project/vllm#22772, vllm-project/vllm#29326)
- Embedding: use encode(pooling_params=...) instead of generate(sampling_params=...) for pooling tasks (vllm-project/vllm#16188, vllm-project/vllm#25524)
- EngineArgs: guided_decoding_backend -> structured_outputs_config

The user-facing "guided_decoding" key in the sampling_params dict is preserved for backwards compatibility.

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
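The backwards-compatibility shim for the renamed key can be sketched as a small dict translation. This is an illustrative helper under the renames listed above, not the actual Ray code; the function name is an assumption:

```python
def translate_sampling_params(params: dict) -> dict:
    """Map the legacy 'guided_decoding' key to the new vLLM name.

    Users may still pass 'guided_decoding' in their sampling_params
    dict; internally it is forwarded under 'structured_outputs'.
    The input dict is copied, not mutated.
    """
    params = dict(params)
    if "guided_decoding" in params and "structured_outputs" not in params:
        params["structured_outputs"] = params.pop("guided_decoding")
    return params
```

Keeping the translation at the dict boundary means user-facing configs keep working unchanged while the engine only ever sees the new key.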
after this merges, will undo transformers pin
ctx: #58980 (review)
eicherseiji left a comment
Ideally we can drop the transformers pin in the next vLLM version bump
cc @eicherseiji the multi-gpu tests (dp_pd_example and dp_basic_example.py) are OOMing on warmup (at first glance) - is reducing sequence length ok, or did something change with the compute configs? They both run on this branch on a workspace with 4x L4.
Yeah @nrghosh we can reduce
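Reducing the sequence length for the OOMing examples would look roughly like the config fragment below. The exact Ray config keys and values here are assumptions for illustration; `max_model_len` and `gpu_memory_utilization` are standard vLLM engine arguments:

```python
# Hypothetical engine config tweak to lower warmup memory on 4x L4:
# cap the max sequence length below the model default so the KV-cache
# and warmup buffers fit.
llm_config = {
    "engine_kwargs": {
        "max_model_len": 8192,        # reduced from the model's default
        "gpu_memory_utilization": 0.85,
    },
}
```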
Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
nrghosh left a comment
looks good @kouroshHakha
@aslonnie @richardliaw can you guys approve?
Related prs that we should review when upgrading fully:
- ray-project#58820
- Note from Rui: when we bump to a new vllm version, we should go with 0.11.2 instead of 0.11.1, which fixes a Ray multi-node PP regression that was introduced when adding torch-based PP: https://github.com/vllm-project/vllm/releases/tag/v0.11.2

Issues:
- closes ray-project#58937
- closes ray-project#58973
- closes ray-project#58702

---------

Signed-off-by: Kourosh Hakhamaneshi <Kourosh@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Signed-off-by: Nikhil G <nrghosh@users.noreply.github.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Co-authored-by: Seiji Eicher <seiji@anyscale.com>
Co-authored-by: Nikhil Ghosh <nikhil@anyscale.com>
Co-authored-by: Nikhil G <nrghosh@users.noreply.github.com>
Co-authored-by: elliot-barn <elliot.barnwell@anyscale.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>