[Core] Whisper Enable Encoder Batching#29421
will address failure related to models using whisper encoder only
"Encoder-decoder model detected: setting "
"`max_num_encoder_input_tokens` to encoder length (%s)",
self.scheduler_config.max_num_encoder_input_tokens,
if (
why not just set the multiproc method to spawn?
Not sure. There was some history here, but we should check with @russellb https://vllm-dev.slack.com/archives/C07QCGVDNUF/p1760579949992319
@DarkLight1337 I really can't seem to get a fully green CI on this PR; it's still blocked by an unrelated failure. Are we already tracking this test?
It seems to be just flaky; it passes decently often on main.
4aca083 to 9842860
This pull request has merge conflicts that must be resolved before it can be
f651ce7 to 26b193b
Signed-off-by: NickLucche <nlucches@redhat.com>
04c3f0e to 7379be9
self.original_max_model_len = self.max_model_len
self.max_model_len = self.get_and_verify_max_len(self.max_model_len)

if self.is_encoder_decoder:
note: I don't think this is actually enough. I still need to specify this
This setting does not apply to the MM feature cache in the model runner.
Signed-off-by: Ubuntu <mjtaheri68@gmail.com>
Can anyone explain why the EPD-Disaggregation is not applied here to optimize Whisper?
This PR addresses an important performance limitation of our current Whisper implementation: the encoder runs only one request at a time, instead of scheduling multiple audios and batching them in a single (encoder) forward pass.
This is particularly bad for high-memory, high-request-rate deployments.
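As a rough illustration of why this matters, here is a toy sketch (not vLLM code) that counts how many encoder forward passes are needed to drain a queue of audio requests under a given encoder token budget. The function names are hypothetical, and the per-audio cost of 1500 encoder positions is an assumption based on Whisper's fixed 30-second input window.

```python
# Toy model only: names are illustrative and are not part of vLLM's API.
# Assumption: every audio costs 1500 encoder positions (one 30 s Whisper window).
ENCODER_TOKENS_PER_AUDIO = 1500


def schedule_encoder_inputs(waiting, encoder_budget):
    """Greedily admit waiting requests until the encoder token budget is used up."""
    scheduled = []
    remaining = encoder_budget
    for request_id in waiting:
        if ENCODER_TOKENS_PER_AUDIO > remaining:
            break
        scheduled.append(request_id)
        remaining -= ENCODER_TOKENS_PER_AUDIO
    return scheduled


def count_encoder_forwards(num_requests, encoder_budget):
    """Return how many scheduling steps (encoder forwards) drain the queue."""
    waiting = list(range(num_requests))
    steps = 0
    while waiting:
        batch = schedule_encoder_inputs(waiting, encoder_budget)
        if not batch:
            raise ValueError("encoder budget is smaller than a single audio")
        waiting = waiting[len(batch):]
        steps += 1
    return steps
```

With a budget equal to one item, eight requests need eight encoder forwards; with a budget of four items, the same queue drains in two batched forwards.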
To summarize changes with some examples:
These changes build on top of this PR #29268 to generalize seq_lens.
Changes here are confined to:
- Whisper: just batching hidden states properly before MHA.
- Scheduling: avoid constraining `self.scheduler_config.max_num_encoder_input_tokens` to the max size of one item. That constraint effectively prevents the EncoderCache from allowing multiple items in a scheduling step.
- More scheduling: there's an issue related to skipping `check_and_update_cache` for requests with the same input. The problem is that `EncoderCacheManager.allocate()` always decrements slots, even when reusing an entry from `freeable`. This causes `num_freeable_slots` and `freeable` to get out of sync, as `check_and_update_cache()` is always skipped for encoder-decoder models. What I did in this PR is to have encoder-decoder models call `check_and_update_cache`, but still schedule the encoder input even if cached. This keeps the cache state in sync while retaining the previous behavior.

Future work can focus on enabling the cache for Whisper, now that the flow is getting more and more aligned with MM models.
EDIT: I have instead provided a much simpler alternative cache that clearly highlights the workflow of enc-dec models (for scheduling only).
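The accounting drift described above can be reproduced with a deliberately simplified toy cache. This is a sketch under assumptions, not vLLM's `EncoderCacheManager`: the method names mirror the ones mentioned in the description, but the bodies, the `free` helper, and the consistency invariant are illustrative guesses at the bookkeeping.

```python
class ToyEncoderCache:
    """Simplified stand-in for an encoder cache; not the real vLLM class."""

    def __init__(self, capacity):
        self.num_free_slots = capacity      # slots not occupied by any entry
        self.num_freeable_slots = capacity  # free slots plus reclaimable ones
        self.cached = {}    # key -> set of request ids referencing the entry
        self.freeable = {}  # key -> size, for entries with no references

    def check_and_update_cache(self, key, request_id):
        """Reuse an existing entry, pulling it out of `freeable` if needed."""
        if key not in self.cached:
            return False
        if not self.cached[key]:
            size = self.freeable.pop(key)
            self.num_freeable_slots -= size
        self.cached[key].add(request_id)
        return True

    def allocate(self, key, size, request_id):
        """Always decrements slots, even if `key` is already cached."""
        self.cached.setdefault(key, set()).add(request_id)
        self.num_free_slots -= size
        self.num_freeable_slots -= size

    def free(self, key, size, request_id):
        """Drop a reference; unreferenced entries become reclaimable."""
        refs = self.cached[key]
        refs.discard(request_id)
        if not refs:
            self.freeable[key] = size
            self.num_freeable_slots += size

    def is_consistent(self):
        # Assumed invariant: reclaimable capacity equals free slots plus
        # the sizes of all entries currently listed in `freeable`.
        return self.num_freeable_slots == (
            self.num_free_slots + sum(self.freeable.values())
        )


def run(skip_check):
    """Two requests with the same input, served one after the other."""
    cache = ToyEncoderCache(capacity=3000)
    cache.allocate("audio-a", 1500, "r1")
    cache.free("audio-a", 1500, "r1")
    # Second request: either consult the cache first, or allocate blindly.
    if skip_check or not cache.check_and_update_cache("audio-a", "r2"):
        cache.allocate("audio-a", 1500, "r2")
    cache.free("audio-a", 1500, "r2")
    return cache.is_consistent()
```

In this toy, `run(skip_check=True)` leaves the counters and the `freeable` map disagreeing, while `run(skip_check=False)` keeps them aligned, which is the motivation for routing encoder-decoder requests through the check.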
Results
cc @DarkLight1337 @russellb