[ModelRunnerV2] Support prompt embeds #42963
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@yewentao256 @njhill Could you please take a look? Thx!
Code Review
This pull request implements support for prompt embeddings in the V1 GPU worker. Key changes include updating the model runner to handle prompt_embeds during request addition and prompt length calculation, and modifying the model state to store and apply these embeddings during the model execution phase. Additionally, the logic for preparing input embeddings was refactored to accommodate both multi-modal inputs and prompt embeddings. A high-severity issue was identified in the _remove_request method, where the current order of operations could lead to a race condition if a request index is reused by a concurrent process before its associated model state is fully cleared.
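The ordering concern raised in the review can be illustrated with a minimal sketch. Everything here (`ToyModelState`, `ToyRunner`, the slot free list) is hypothetical and is not vLLM's actual code; it only shows the idea of clearing per-slot state before the slot index becomes available for reuse.

```python
class ToyModelState:
    """Hypothetical per-slot storage for prompt embeddings."""

    def __init__(self, num_slots):
        self.prompt_embeds = [None] * num_slots

    def remove_request(self, idx):
        # Drop any embeddings still attached to this slot.
        self.prompt_embeds[idx] = None


class ToyRunner:
    """Hypothetical runner that maps request IDs to reusable slot indices."""

    def __init__(self, num_slots):
        self.free_slots = list(range(num_slots))
        self.req_to_slot = {}
        self.state = ToyModelState(num_slots)

    def add_request(self, req_id, prompt_embeds=None):
        idx = self.free_slots.pop()
        self.req_to_slot[req_id] = idx
        self.state.prompt_embeds[idx] = prompt_embeds
        return idx

    def remove_request(self, req_id):
        idx = self.req_to_slot.pop(req_id)
        # Clear the per-slot state first, and only then return the slot
        # to the free pool, so a new request never sees stale embeddings.
        self.state.remove_request(idx)
        self.free_slots.append(idx)
```

With this ordering, a request that reuses the slot always starts from a cleared entry, whether or not it supplies its own embeddings.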
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
yewentao256
left a comment
Thanks for the work! Could you add a description of the specific issue this PR solves?
E.g.,
VLLM_USE_V2_MODEL_RUNNER=1 pytest tests/basic_correctness/test_cumem.py::test_deep_sleep
Originally
(EngineCore pid=2663266) ERROR 05-14 19:13:43 [core.py:1360] File "/home/yewentao256/vllm-source/vllm/v1/executor/uniproc_executor.py", line 93, in collective_rpc
(EngineCore pid=2663266) ERROR 05-14 19:13:43 [core.py:1360] result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=2663266) ERROR 05-14 19:13:43 [core.py:1360] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2663266) ERROR 05-14 19:13:43 [core.py:1360] File "/home/yewentao256/vllm-source/vllm/v1/serial_utils.py", line 510, in run_method
(EngineCore pid=2663266) ERROR 05-14 19:13:43 [core.py:1360] return func(*args, **kwargs)
(EngineCore pid=2663266) ERROR 05-14 19:13:43 [core.py:1360] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2663266) ERROR 05-14 19:13:43 [core.py:1360] File "/home/yewentao256/vllm-source/vllm/v1/worker/gpu_worker.py", line 351, in reload_weights
(EngineCore pid=2663266) ERROR 05-14 19:13:43 [core.py:1360] self.model_runner.reload_weights(*args, **kwargs)
(EngineCore pid=2663266) ERROR 05-14 19:13:43 [core.py:1360] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2663266) ERROR 05-14 19:13:43 [core.py:1360] AttributeError: 'GPUModelRunner' object has no attribute 'reload_weights'
Now
======================================== 1 passed, 17 warnings in 45.19s =======================================
Sure. Done now.
yewentao256
left a comment
Thanks for the work! Could you also simplify the PR diff?
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Signed-off-by: Canlin Guo <961750412@qq.com>
Sure. Thanks for cleaning!
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Would it be better to rename this method to prepare_inputs_embeds, so that the change is less intrusive to the model runner backbone? The method would then cover both mm embeds and prompt embeds. cc @WoosukKwon
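A minimal sketch of what a single method covering both embedding sources could look like. The signature, the position-keyed dictionaries, and the placeholder token ID are assumptions made for illustration only; the actual method in the PR operates on tensors and may be structured differently.

```python
def prepare_inputs_embeds(token_ids, embed_table, mm_embeds=None, prompt_embeds=None):
    """Hypothetical helper: return one embedding vector per input position.

    token_ids:     list of token IDs (placeholder IDs where embeds are supplied)
    embed_table:   mapping from token ID to embedding vector
    mm_embeds:     optional {position: vector} from a multi-modal encoder
    prompt_embeds: optional {position: vector} supplied directly by the user
    """
    # Start from the ordinary token embedding lookup.
    out = [list(embed_table[t]) for t in token_ids]
    # Overwrite positions covered by either embedding source.
    for source in (mm_embeds, prompt_embeds):
        if source:
            for pos, vec in source.items():
                out[pos] = list(vec)
    return out
```

Keeping both sources behind one entry point means the caller only asks for "the input embeddings" and does not need to know which source produced each position.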
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
yewentao256
left a comment
request = self.requests[req_id]
if request.prompt_token_ids is None:
    # Prompt logprobs is incompatible with prompt embeddings
    continue
We have this in v1, should we add it as well?
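For context, the guard quoted above can be exercised in a self-contained sketch. `ToyRequest`, `wants_prompt_logprobs`, and `requests_needing_prompt_logprobs` are hypothetical names, not the runner's real data structures; the assumption is that a request submitted as prompt embeddings has no prompt token IDs, so there is nothing to report prompt logprobs against.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyRequest:
    """Hypothetical stand-in for a cached request."""
    prompt_token_ids: Optional[list]
    wants_prompt_logprobs: bool = True


def requests_needing_prompt_logprobs(requests):
    """Return the IDs of requests for which prompt logprobs can be computed."""
    selected = []
    for req_id, request in requests.items():
        if not request.wants_prompt_logprobs:
            continue
        if request.prompt_token_ids is None:
            # Prompt logprobs is incompatible with prompt embeddings
            continue
        selected.append(req_id)
    return selected
```

A request that carries only embeddings is skipped rather than raising, which matches the `continue` in the quoted snippet.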
Thanks @gcanlin. But are we sure we want to support this in MRV2? We should not just port everything over blindly; we would like to deprecate as much as possible.
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Yes. I added the guard before
Agree. But I'm not sure how to decide whether a given feature is needed. Do we have any target or plan for MRV2?
This pull request has merge conflicts that must be resolved before it can be merged.
@njhill I for one would be greatly disappointed if prompt_embeds support was deprecated. My business is built on it, and if it were removed, I'd have to fork vLLM to continue doing business.
It has a variety of use cases. The most common is to train custom MM encoders for modalities not supported natively in vLLM or for models that don't have natively trained MM encoders. A classic example is training a vision encoder that outputs in the token embedding space of a pure-text model, like Nemotron 3 Super. A user can encode their images directly to prompt embeddings, and send those embeddings either alongside text embeddings or as a
Another use case I've seen is using prompt embeds to compress the number of tokens in a request. With a clever encoder you can compress dozens of text tokens into a single prompt embed.
My particular use case creates a "privacy encoder" (in some sense). We train an encoder that takes in text and outputs a sequence of prompt embeddings that the language model still natively understands, but that do not correspond to text in any meaningfully reversible sense without that target model. https://protopia.ai/stained-glass-transform/
People do use this feature, even if it's not super common, as evidenced by the issues and PRs that open around it every so often.
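The token-compression use case can be pictured with a toy sketch. Mean pooling here is only a stand-in chosen for illustration: a trained encoder, as described in the comment, would learn its own mapping, and `mean_pool` is a hypothetical helper, not part of vLLM.

```python
def mean_pool(token_embeds, group_size):
    """Toy 'encoder': average each run of group_size vectors into one vector."""
    pooled = []
    for start in range(0, len(token_embeds), group_size):
        group = token_embeds[start:start + group_size]
        dim = len(group[0])
        # Average each dimension across the vectors in this group.
        pooled.append([sum(vec[d] for vec in group) / len(group) for d in range(dim)])
    return pooled
```

The output has the same embedding width as the input but fewer positions, which is the property that lets a shorter sequence of prompt embeds stand in for a longer sequence of tokens.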
Purpose
Support prompt embeds for ModelRunnerV2.
Test Plan
VLLM_USE_V2_MODEL_RUNNER=1 pytest -sv tests/basic_correctness/test_basic_correctness.py::test_models -k "True-uni or True-mp"
Before
After
Test Result