[Core] Whisper support torch.compile#30385
Conversation
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
There was a problem hiding this comment.
Code Review
This pull request introduces torch.compile support for the Whisper model's decoder, aimed at improving performance during the decoding phase. The implementation cleverly compiles only the decoding steps from the second step onwards, correctly identifying that the first step has a different computation graph due to cross-attention key-value cache generation. This is achieved by adding a force_eager flag to the _model_forward method in GPUModelRunner, which is conditionally set based on the presence of encoder inputs. The changes are well-designed, backward-compatible, and the generic approach in GPUModelRunner could be beneficial for other encoder-decoder models in the future. The code appears to be correct and I could not identify any issues of high or critical severity.
DarkLight1337
left a comment
There was a problem hiding this comment.
I think we should have at least one test that uses Whisper with CUDA graph
|
Changes look fine to me. All you are doing is adding a gated option to run eager mode in model_forward. Any downstream consumers who are subclassing just have to add that variable now. LGTM |
|
note: this generates incorrect answers |
b211ce8 to
3922aaa
Compare
|
This pull request has merge conflicts that must be resolved before it can be |
3922aaa to
44b7a00
Compare
|
@robertgshaw2-redhat reviving this PR, are there any blockers? Happy to look at accuracy issues if any, can't spot them from usual tests |
|
a noob question: I noticed that this only enable Decoder part. Is there any blocker to enable torch.compile + Encoder part? just like mm + compile suport https://docs.vllm.ai/en/latest/design/torch_compile_multimodal/ |
|
Let's wait for @robertgshaw2-redhat to elaborate on the accuracy issues first |
| @@ -2919,6 +2920,17 @@ def _model_forward( | |||
| Returns: | |||
| Model output tensor | |||
| """ | |||
|
|
|||
| if force_eager: | |||
There was a problem hiding this comment.
Let's just add force_eager to ForwardContext and read that in the compile decorator?
There was a problem hiding this comment.
There is an option for enable_if in the support_torch_compile decorator - perhaps we can leverage that?
see
vllm/vllm/model_executor/models/qwen2_5_vl.py
Line 522 in 3a61232
There was a problem hiding this comment.
that flag is for optional compilation, here I need to always compile, but optionally do eager (aka not call the compiled graph)
|
I've implemented @ProExpertProg suggested approach and updated the description. |
5e45c90 to
60c956c
Compare
| @@ -156,7 +156,9 @@ def test_wer_correctness( | |||
| model_name, dataset_repo, expected_wer, n_examples=-1, max_concurrent_request=None | |||
| ): | |||
| # TODO refactor to use `ASRDataset` | |||
| with RemoteOpenAIServer(model_name, ["--enforce-eager"]) as remote_server: | |||
| with RemoteOpenAIServer( | |||
There was a problem hiding this comment.
-cc.mode=NONE if you want to just disable compilation but keep cuda graphs
There was a problem hiding this comment.
will follow up with more cg+compilation tests (which is the default vllm serve setup)
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
0b4ba1b to
3366e2e
Compare
Signed-off-by: NickLucche <nlucches@redhat.com>
### What this PR does / why we need it? 1. ✅ Upgrade vllm commit to: 0115 (8471b27) Modify import paths due to the refactors: vllm-project/vllm#32245 vllm-project/vllm#32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21034239336/job/60490156965?pr=5913 2. ✅Upgrade vllm commit to: 0119 (9a1f16d) Fix `WorkerProc.__init__() missing 1 required positional argument: 'is_driver_worker'` due to vllm-project/vllm#28506 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21156263050/job/60841668755?5569 3. ✅Upgrade vllm commit to: 0120(148117e) 1. Add `skip_compiled` param in `set_forward_context` due to vllm-project/vllm#30385 2. Modify `tests/ut/spec_decode/test_eagle_proposer.py` due to vllm-project/vllm#24322 change `self.max_num_tokens = vllm_config.scheduler_config.max_num_batched_tokens + max_batch_size` 3. Modify UT import paths due to the refactors:vllm-project/vllm#32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21204851770/job/60999046946 4. ✅Upgrade vllm commit to: 0121(f23fb5a) 1. vLLM switched `uses_mrope` from target to draft model config, making `positions`/`mrope_positions` mutually exclusive, breaking vllm-ascend's direct self.positions access and tests missing `draft_model_config.uses_mrope`. vllm-project/vllm#32048 2. Moved bs_to_padded_graph_size from CompilationConfig to CudagraphDispatcher due to the refactor vllm-project/vllm#30143 3. Remove unused `maybe_setup_kv_connector` due to vllm-project/vllm#32077 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21217728738/job/61043738834 6. ✅Upgrade vllm commit to: 0122(8ebf271) Updating FusedMoEParallelConfig (added enable_eplb) and FusedMoEConfig due to vllm-project/vllm#32414 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21249922546/job/61148613054 8. ✅Upgrade vllm commit to: 0123(dc917cc) Setting temperature=0.0 due to the removal of the default temperature value in vllm-project/vllm#32723 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21280796875 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: vllm-project/vllm@d682094 --------- Signed-off-by: wjunLu <wjunlu217@gmail.com> Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com> Co-authored-by: wjunLu <wjunlu217@gmail.com>
### What this PR does / why we need it? 1. ✅ Upgrade vllm commit to: 0115 (8471b27) Modify import paths due to the refactors: vllm-project/vllm#32245 vllm-project/vllm#32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21034239336/job/60490156965?pr=5913 2. ✅Upgrade vllm commit to: 0119 (9a1f16d) Fix `WorkerProc.__init__() missing 1 required positional argument: 'is_driver_worker'` due to vllm-project/vllm#28506 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21156263050/job/60841668755?5569 3. ✅Upgrade vllm commit to: 0120(148117e) 1. Add `skip_compiled` param in `set_forward_context` due to vllm-project/vllm#30385 2. Modify `tests/ut/spec_decode/test_eagle_proposer.py` due to vllm-project/vllm#24322 change `self.max_num_tokens = vllm_config.scheduler_config.max_num_batched_tokens + max_batch_size` 3. Modify UT import paths due to the refactors:vllm-project/vllm#32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21204851770/job/60999046946 4. ✅Upgrade vllm commit to: 0121(f23fb5a) 1. vLLM switched `uses_mrope` from target to draft model config, making `positions`/`mrope_positions` mutually exclusive, breaking vllm-ascend's direct self.positions access and tests missing `draft_model_config.uses_mrope`. vllm-project/vllm#32048 2. Moved bs_to_padded_graph_size from CompilationConfig to CudagraphDispatcher due to the refactor vllm-project/vllm#30143 3. Remove unused `maybe_setup_kv_connector` due to vllm-project/vllm#32077 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21217728738/job/61043738834 6. ✅Upgrade vllm commit to: 0122(8ebf271) Updating FusedMoEParallelConfig (added enable_eplb) and FusedMoEConfig due to vllm-project/vllm#32414 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21249922546/job/61148613054 8. ✅Upgrade vllm commit to: 0123(dc917cc) Setting temperature=0.0 due to the removal of the default temperature value in vllm-project/vllm#32723 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21280796875 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: vllm-project/vllm@d682094 --------- Signed-off-by: wjunLu <wjunlu217@gmail.com> Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com> Co-authored-by: wjunLu <wjunlu217@gmail.com>
### What this PR does / why we need it? 1. ✅ Upgrade vllm commit to: 0115 (8471b27) Modify import paths due to the refactors: vllm-project/vllm#32245 vllm-project/vllm#32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21034239336/job/60490156965?pr=5913 2. ✅Upgrade vllm commit to: 0119 (9a1f16d) Fix `WorkerProc.__init__() missing 1 required positional argument: 'is_driver_worker'` due to vllm-project/vllm#28506 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21156263050/job/60841668755?5569 3. ✅Upgrade vllm commit to: 0120(148117e) 1. Add `skip_compiled` param in `set_forward_context` due to vllm-project/vllm#30385 2. Modify `tests/ut/spec_decode/test_eagle_proposer.py` due to vllm-project/vllm#24322 change `self.max_num_tokens = vllm_config.scheduler_config.max_num_batched_tokens + max_batch_size` 3. Modify UT import paths due to the refactors:vllm-project/vllm#32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21204851770/job/60999046946 4. ✅Upgrade vllm commit to: 0121(f23fb5a) 1. vLLM switched `uses_mrope` from target to draft model config, making `positions`/`mrope_positions` mutually exclusive, breaking vllm-ascend's direct self.positions access and tests missing `draft_model_config.uses_mrope`. vllm-project/vllm#32048 2. Moved bs_to_padded_graph_size from CompilationConfig to CudagraphDispatcher due to the refactor vllm-project/vllm#30143 3. Remove unused `maybe_setup_kv_connector` due to vllm-project/vllm#32077 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21217728738/job/61043738834 6. ✅Upgrade vllm commit to: 0122(8ebf271) Updating FusedMoEParallelConfig (added enable_eplb) and FusedMoEConfig due to vllm-project/vllm#32414 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21249922546/job/61148613054 8. ✅Upgrade vllm commit to: 0123(dc917cc) Setting temperature=0.0 due to the removal of the default temperature value in vllm-project/vllm#32723 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21280796875 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: vllm-project/vllm@d682094 --------- Signed-off-by: wjunLu <wjunlu217@gmail.com> Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com> Co-authored-by: wjunLu <wjunlu217@gmail.com>
### What this PR does / why we need it? 1. ✅ Upgrade vllm commit to: 0115 (8471b27) Modify import paths due to the refactors: vllm-project/vllm#32245 vllm-project/vllm#32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21034239336/job/60490156965?pr=5913 2. ✅Upgrade vllm commit to: 0119 (9a1f16d) Fix `WorkerProc.__init__() missing 1 required positional argument: 'is_driver_worker'` due to vllm-project/vllm#28506 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21156263050/job/60841668755?5569 3. ✅Upgrade vllm commit to: 0120(148117e) 1. Add `skip_compiled` param in `set_forward_context` due to vllm-project/vllm#30385 2. Modify `tests/ut/spec_decode/test_eagle_proposer.py` due to vllm-project/vllm#24322 change `self.max_num_tokens = vllm_config.scheduler_config.max_num_batched_tokens + max_batch_size` 3. Modify UT import paths due to the refactors:vllm-project/vllm#32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21204851770/job/60999046946 4. ✅Upgrade vllm commit to: 0121(f23fb5a) 1. vLLM switched `uses_mrope` from target to draft model config, making `positions`/`mrope_positions` mutually exclusive, breaking vllm-ascend's direct self.positions access and tests missing `draft_model_config.uses_mrope`. vllm-project/vllm#32048 2. Moved bs_to_padded_graph_size from CompilationConfig to CudagraphDispatcher due to the refactor vllm-project/vllm#30143 3. Remove unused `maybe_setup_kv_connector` due to vllm-project/vllm#32077 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21217728738/job/61043738834 6. ✅Upgrade vllm commit to: 0122(8ebf271) Updating FusedMoEParallelConfig (added enable_eplb) and FusedMoEConfig due to vllm-project/vllm#32414 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21249922546/job/61148613054 8. ✅Upgrade vllm commit to: 0123(dc917cc) Setting temperature=0.0 due to the removal of the default temperature value in vllm-project/vllm#32723 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21280796875 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: vllm-project/vllm@d682094 --------- Signed-off-by: wjunLu <wjunlu217@gmail.com> Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com> Co-authored-by: wjunLu <wjunlu217@gmail.com> Signed-off-by: momochenchuw <chenchuw@huawei.com>
### What this PR does / why we need it? 1. ✅ Upgrade vllm commit to: 0115 (8471b27) Modify import paths due to the refactors: vllm-project/vllm#32245 vllm-project/vllm#32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21034239336/job/60490156965?pr=5913 2. ✅Upgrade vllm commit to: 0119 (9a1f16d) Fix `WorkerProc.__init__() missing 1 required positional argument: 'is_driver_worker'` due to vllm-project/vllm#28506 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21156263050/job/60841668755?5569 3. ✅Upgrade vllm commit to: 0120(148117e) 1. Add `skip_compiled` param in `set_forward_context` due to vllm-project/vllm#30385 2. Modify `tests/ut/spec_decode/test_eagle_proposer.py` due to vllm-project/vllm#24322 change `self.max_num_tokens = vllm_config.scheduler_config.max_num_batched_tokens + max_batch_size` 3. Modify UT import paths due to the refactors:vllm-project/vllm#32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21204851770/job/60999046946 4. ✅Upgrade vllm commit to: 0121(f23fb5a) 1. vLLM switched `uses_mrope` from target to draft model config, making `positions`/`mrope_positions` mutually exclusive, breaking vllm-ascend's direct self.positions access and tests missing `draft_model_config.uses_mrope`. vllm-project/vllm#32048 2. Moved bs_to_padded_graph_size from CompilationConfig to CudagraphDispatcher due to the refactor vllm-project/vllm#30143 3. Remove unused `maybe_setup_kv_connector` due to vllm-project/vllm#32077 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21217728738/job/61043738834 6. ✅Upgrade vllm commit to: 0122(8ebf271) Updating FusedMoEParallelConfig (added enable_eplb) and FusedMoEConfig due to vllm-project/vllm#32414 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21249922546/job/61148613054 8. ✅Upgrade vllm commit to: 0123(dc917cc) Setting temperature=0.0 due to the removal of the default temperature value in vllm-project/vllm#32723 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21280796875 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: vllm-project/vllm@d682094 --------- Signed-off-by: wjunLu <wjunlu217@gmail.com> Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com> Co-authored-by: wjunLu <wjunlu217@gmail.com> Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
### What this PR does / why we need it? 1. ✅ Upgrade vllm commit to: 0115 (8471b27) Modify import paths due to the refactors: vllm-project/vllm#32245 vllm-project/vllm#32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21034239336/job/60490156965?pr=5913 2. ✅Upgrade vllm commit to: 0119 (9a1f16d) Fix `WorkerProc.__init__() missing 1 required positional argument: 'is_driver_worker'` due to vllm-project/vllm#28506 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21156263050/job/60841668755?5569 3. ✅Upgrade vllm commit to: 0120(148117e) 1. Add `skip_compiled` param in `set_forward_context` due to vllm-project/vllm#30385 2. Modify `tests/ut/spec_decode/test_eagle_proposer.py` due to vllm-project/vllm#24322 change `self.max_num_tokens = vllm_config.scheduler_config.max_num_batched_tokens + max_batch_size` 3. Modify UT import paths due to the refactors:vllm-project/vllm#32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21204851770/job/60999046946 4. ✅Upgrade vllm commit to: 0121(f23fb5a) 1. vLLM switched `uses_mrope` from target to draft model config, making `positions`/`mrope_positions` mutually exclusive, breaking vllm-ascend's direct self.positions access and tests missing `draft_model_config.uses_mrope`. vllm-project/vllm#32048 2. Moved bs_to_padded_graph_size from CompilationConfig to CudagraphDispatcher due to the refactor vllm-project/vllm#30143 3. Remove unused `maybe_setup_kv_connector` due to vllm-project/vllm#32077 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21217728738/job/61043738834 6. ✅Upgrade vllm commit to: 0122(8ebf271) Updating FusedMoEParallelConfig (added enable_eplb) and FusedMoEConfig due to vllm-project/vllm#32414 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21249922546/job/61148613054 8. ✅Upgrade vllm commit to: 0123(dc917cc) Setting temperature=0.0 due to the removal of the default temperature value in vllm-project/vllm#32723 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21280796875 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: vllm-project/vllm@d682094 --------- Signed-off-by: wjunLu <wjunlu217@gmail.com> Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com> Co-authored-by: wjunLu <wjunlu217@gmail.com>
### What this PR does / why we need it? 1. ✅ Upgrade vllm commit to: 0115 (8471b27) Modify import paths due to the refactors: vllm-project/vllm#32245 vllm-project/vllm#32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21034239336/job/60490156965?pr=5913 2. ✅Upgrade vllm commit to: 0119 (9a1f16d) Fix `WorkerProc.__init__() missing 1 required positional argument: 'is_driver_worker'` due to vllm-project/vllm#28506 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21156263050/job/60841668755?5569 3. ✅Upgrade vllm commit to: 0120(148117e) 1. Add `skip_compiled` param in `set_forward_context` due to vllm-project/vllm#30385 2. Modify `tests/ut/spec_decode/test_eagle_proposer.py` due to vllm-project/vllm#24322 change `self.max_num_tokens = vllm_config.scheduler_config.max_num_batched_tokens + max_batch_size` 3. Modify UT import paths due to the refactors:vllm-project/vllm#32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21204851770/job/60999046946 4. ✅Upgrade vllm commit to: 0121(f23fb5a) 1. vLLM switched `uses_mrope` from target to draft model config, making `positions`/`mrope_positions` mutually exclusive, breaking vllm-ascend's direct self.positions access and tests missing `draft_model_config.uses_mrope`. vllm-project/vllm#32048 2. Moved bs_to_padded_graph_size from CompilationConfig to CudagraphDispatcher due to the refactor vllm-project/vllm#30143 3. Remove unused `maybe_setup_kv_connector` due to vllm-project/vllm#32077 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21217728738/job/61043738834 6. ✅Upgrade vllm commit to: 0122(8ebf271) Updating FusedMoEParallelConfig (added enable_eplb) and FusedMoEConfig due to vllm-project/vllm#32414 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21249922546/job/61148613054 8. ✅Upgrade vllm commit to: 0123(dc917cc) Setting temperature=0.0 due to the removal of the default temperature value in vllm-project/vllm#32723 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21280796875 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: vllm-project/vllm@d682094 --------- Signed-off-by: wjunLu <wjunlu217@gmail.com> Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com> Co-authored-by: wjunLu <wjunlu217@gmail.com> Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
### What this PR does / why we need it? 1. ✅ Upgrade vllm commit to: 0115 (8471b27) Modify import paths due to the refactors: vllm-project/vllm#32245 vllm-project/vllm#32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21034239336/job/60490156965?pr=5913 2. ✅Upgrade vllm commit to: 0119 (9a1f16d) Fix `WorkerProc.__init__() missing 1 required positional argument: 'is_driver_worker'` due to vllm-project/vllm#28506 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21156263050/job/60841668755?5569 3. ✅Upgrade vllm commit to: 0120(148117e) 1. Add `skip_compiled` param in `set_forward_context` due to vllm-project/vllm#30385 2. Modify `tests/ut/spec_decode/test_eagle_proposer.py` due to vllm-project/vllm#24322 change `self.max_num_tokens = vllm_config.scheduler_config.max_num_batched_tokens + max_batch_size` 3. Modify UT import paths due to the refactors:vllm-project/vllm#32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21204851770/job/60999046946 4. ✅Upgrade vllm commit to: 0121(f23fb5a) 1. vLLM switched `uses_mrope` from target to draft model config, making `positions`/`mrope_positions` mutually exclusive, breaking vllm-ascend's direct self.positions access and tests missing `draft_model_config.uses_mrope`. vllm-project/vllm#32048 2. Moved bs_to_padded_graph_size from CompilationConfig to CudagraphDispatcher due to the refactor vllm-project/vllm#30143 3. Remove unused `maybe_setup_kv_connector` due to vllm-project/vllm#32077 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21217728738/job/61043738834 6. ✅Upgrade vllm commit to: 0122(8ebf271) Updating FusedMoEParallelConfig (added enable_eplb) and FusedMoEConfig due to vllm-project/vllm#32414 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21249922546/job/61148613054 8. ✅Upgrade vllm commit to: 0123(dc917cc) Setting temperature=0.0 due to the removal of the default temperature value in vllm-project/vllm#32723 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21280796875 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: vllm-project/vllm@d682094 --------- Signed-off-by: wjunLu <wjunlu217@gmail.com> Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com> Co-authored-by: wjunLu <wjunlu217@gmail.com>
### What this PR does / why we need it? 1. ✅ Upgrade vllm commit to: 0115 (8471b27) Modify import paths due to the refactors: vllm-project/vllm#32245 vllm-project/vllm#32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21034239336/job/60490156965?pr=5913 2. ✅Upgrade vllm commit to: 0119 (9a1f16d) Fix `WorkerProc.__init__() missing 1 required positional argument: 'is_driver_worker'` due to vllm-project/vllm#28506 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21156263050/job/60841668755?5569 3. ✅Upgrade vllm commit to: 0120(148117e) 1. Add `skip_compiled` param in `set_forward_context` due to vllm-project/vllm#30385 2. Modify `tests/ut/spec_decode/test_eagle_proposer.py` due to vllm-project/vllm#24322 change `self.max_num_tokens = vllm_config.scheduler_config.max_num_batched_tokens + max_batch_size` 3. Modify UT import paths due to the refactors:vllm-project/vllm#32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21204851770/job/60999046946 4. ✅Upgrade vllm commit to: 0121(f23fb5a) 1. vLLM switched `uses_mrope` from target to draft model config, making `positions`/`mrope_positions` mutually exclusive, breaking vllm-ascend's direct self.positions access and tests missing `draft_model_config.uses_mrope`. vllm-project/vllm#32048 2. Moved bs_to_padded_graph_size from CompilationConfig to CudagraphDispatcher due to the refactor vllm-project/vllm#30143 3. Remove unused `maybe_setup_kv_connector` due to vllm-project/vllm#32077 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21217728738/job/61043738834 6. ✅Upgrade vllm commit to: 0122(8ebf271) Updating FusedMoEParallelConfig (added enable_eplb) and FusedMoEConfig due to vllm-project/vllm#32414 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21249922546/job/61148613054 8. ✅Upgrade vllm commit to: 0123(dc917cc) Setting temperature=0.0 due to the removal of the default temperature value in vllm-project/vllm#32723 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21280796875 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: vllm-project/vllm@d682094 --------- Signed-off-by: wjunLu <wjunlu217@gmail.com> Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com> Co-authored-by: wjunLu <wjunlu217@gmail.com>
### What this PR does / why we need it? 1. ✅ Upgrade vllm commit to: 0115 (8471b27) Modify import paths due to the refactors: vllm-project/vllm#32245 vllm-project/vllm#32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21034239336/job/60490156965?pr=5913 2. ✅Upgrade vllm commit to: 0119 (9a1f16d) Fix `WorkerProc.__init__() missing 1 required positional argument: 'is_driver_worker'` due to vllm-project/vllm#28506 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21156263050/job/60841668755?5569 3. ✅Upgrade vllm commit to: 0120(148117e) 1. Add `skip_compiled` param in `set_forward_context` due to vllm-project/vllm#30385 2. Modify `tests/ut/spec_decode/test_eagle_proposer.py` due to vllm-project/vllm#24322 change `self.max_num_tokens = vllm_config.scheduler_config.max_num_batched_tokens + max_batch_size` 3. Modify UT import paths due to the refactors:vllm-project/vllm#32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21204851770/job/60999046946 4. ✅Upgrade vllm commit to: 0121(f23fb5a) 1. vLLM switched `uses_mrope` from target to draft model config, making `positions`/`mrope_positions` mutually exclusive, breaking vllm-ascend's direct self.positions access and tests missing `draft_model_config.uses_mrope`. vllm-project/vllm#32048 2. Moved bs_to_padded_graph_size from CompilationConfig to CudagraphDispatcher due to the refactor vllm-project/vllm#30143 3. Remove unused `maybe_setup_kv_connector` due to vllm-project/vllm#32077 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21217728738/job/61043738834 6. ✅Upgrade vllm commit to: 0122(8ebf271) Updating FusedMoEParallelConfig (added enable_eplb) and FusedMoEConfig due to vllm-project/vllm#32414 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21249922546/job/61148613054 8. ✅Upgrade vllm commit to: 0123(dc917cc) Setting temperature=0.0 due to the removal of the default temperature value in vllm-project/vllm#32723 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21280796875 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: vllm-project/vllm@d682094 --------- Signed-off-by: wjunLu <wjunlu217@gmail.com> Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com> Co-authored-by: wjunLu <wjunlu217@gmail.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
### What this PR does / why we need it? 1. ✅ Upgrade vllm commit to: 0115 (8471b27) Modify import paths due to the refactors: vllm-project/vllm#32245 vllm-project/vllm#32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21034239336/job/60490156965?pr=5913 2. ✅Upgrade vllm commit to: 0119 (9a1f16d) Fix `WorkerProc.__init__() missing 1 required positional argument: 'is_driver_worker'` due to vllm-project/vllm#28506 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21156263050/job/60841668755?5569 3. ✅Upgrade vllm commit to: 0120(148117e) 1. Add `skip_compiled` param in `set_forward_context` due to vllm-project/vllm#30385 2. Modify `tests/ut/spec_decode/test_eagle_proposer.py` due to vllm-project/vllm#24322 change `self.max_num_tokens = vllm_config.scheduler_config.max_num_batched_tokens + max_batch_size` 3. Modify UT import paths due to the refactors:vllm-project/vllm#32060 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21204851770/job/60999046946 4. ✅Upgrade vllm commit to: 0121(f23fb5a) 1. vLLM switched `uses_mrope` from target to draft model config, making `positions`/`mrope_positions` mutually exclusive, breaking vllm-ascend's direct self.positions access and tests missing `draft_model_config.uses_mrope`. vllm-project/vllm#32048 2. Moved bs_to_padded_graph_size from CompilationConfig to CudagraphDispatcher due to the refactor vllm-project/vllm#30143 3. Remove unused `maybe_setup_kv_connector` due to vllm-project/vllm#32077 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21217728738/job/61043738834 6. ✅Upgrade vllm commit to: 0122(8ebf271) Updating FusedMoEParallelConfig (added enable_eplb) and FusedMoEConfig due to vllm-project/vllm#32414 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21249922546/job/61148613054 8. ✅Upgrade vllm commit to: 0123(dc917cc) Setting temperature=0.0 due to the removal of the default temperature value in vllm-project/vllm#32723 Test result: https://github.com/vllm-project/vllm-ascend/actions/runs/21280796875 ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.14.0 - vLLM main: vllm-project/vllm@d682094 --------- Signed-off-by: wjunLu <wjunlu217@gmail.com> Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com> Co-authored-by: wjunLu <wjunlu217@gmail.com> Signed-off-by: nanxing <1014662416@qq.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
This PR is yet another Whisper performance optimization, adding support for torch.compile during decoding step.
It follows a very similar approach to #30072 (and should also land after that to ensure best defaults) in which only the 2nd decoder steps onward are compiled.
This is due to the fact that step0 in enc-dec models computes and caches crossattn KVs, requiring encoder_output as additional input and hence generating a different graph from the other steps.
Updated profiling:

Considerations
I've attempted to add the "2nd decoder step" selection logic directly in the model runner (
_model_forward).I am aware that
_model_forwardis currently used by OOT runners (#25084), although no "official" runner interface contract is maintained to ensure compatibility (unlike connectors to name one), which makes maintaining these kinds of methods without breaking external usage quite hairy.As this change may affect those runners I am also pinging @patrick-toulme .
Happy to change to a less invasive logic if we find a cleaner way to do it @LucasWilkinson .
Other options are adding the flag to attn_metadata and then retrieving the metadata from the support_torch_compile wrapper OR hacking the WhisperDecoder
__call__(definitely not nice).UPDATE:
Following @ProExpertProg suggestion, I've moved to implementing the alternative option in which the
skip_compiledlogic is generically handled inside the compile decorator.So no change to
_model_forwardis actually needed.Related PRs #29421 #30072
Test with
Compilation is enabled by default:
cc @DarkLight1337 @robertgshaw2-redhat