[Core] Whisper Enable Encoder Batching#29421

Merged
NickLucche merged 14 commits into
vllm-project:mainfrom
NickLucche:whisper-enable-encoder-batching
Dec 11, 2025
@NickLucche NickLucche commented Nov 25, 2025

Member

This PR addresses an important performance limitation of our current Whisper implementation: the encoder runs only one request at a time, instead of scheduling multiple audio inputs and batching them in a single encoder forward pass.
This is particularly costly in high-memory, high-request-rate deployments.

To summarize changes with some examples:

# MAIN

(EngineCore_DP0 pid=2996843) INFO 11-25 18:02:30 [gpu_model_runner.py:4220] Encoder cache will be initialized with a budget of 1500 tokens, and profiled with 1 audio items of the maximum feature size.

# Extra-logging added for debugging 
***ENCODER INPUT HIDDEN STATES*** torch.Size([1500, 1280])


# PR

Encoder cache will be initialized with a budget of 32768 tokens, and profiled with 21 audio items of the maximum feature size.

# Extra-logging added for debugging 
***ENCODER INPUT HIDDEN STATES*** torch.Size([12, 1500, 1280])
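
The two log excerpts above are consistent with a simple budget rule: an encoder step can take as many whole audio items as fit in the token budget. The following is a minimal sketch of that rule, assuming each Whisper audio item costs 1500 encoder tokens (the sequence length in the logged shape). `pack_encoder_inputs` and `TOKENS_PER_AUDIO` are hypothetical names used for illustration only; they are not vLLM functions or constants.

```python
from collections import deque

# Assumption: encoder tokens per audio item, taken from the logged shape.
TOKENS_PER_AUDIO = 1500


def pack_encoder_inputs(pending: deque, budget: int) -> list:
    """Greedily pop request ids from `pending` while whole items fit in `budget`."""
    scheduled = []
    remaining = budget
    while pending and remaining >= TOKENS_PER_AUDIO:
        scheduled.append(pending.popleft())
        remaining -= TOKENS_PER_AUDIO
    return scheduled


# Budget equal to one item (the MAIN log): one request per encoder step.
main_batch = pack_encoder_inputs(deque(range(30)), budget=1500)
# Larger budget (the PR log): several requests share one encoder forward.
pr_batch = pack_encoder_inputs(deque(range(30)), budget=32768)
print(len(main_batch), len(pr_batch))
```

With 30 pending requests, the sketch schedules 1 item under the 1500-token budget and 21 items under the 32768-token budget, which matches the item counts in the two profiling log lines. The actual batch size in a given step also depends on how many requests are waiting.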

These changes build on top of PR #29268 to generalize seq_lens.

Changes here are confined to:

Whisper: just batching hidden states properly before MHA

Scheduling: avoid constraining self.scheduler_config.max_num_encoder_input_tokens to the maximum size of one item. That constraint effectively prevents the EncoderCache from admitting multiple items in a scheduling step.

More scheduling: there's an issue related to skipping check_and_update_cache for requests with the same input.
The problem is that EncoderCacheManager.allocate() always decrements slots, even when reusing an entry from freeable. This causes num_freeable_slots and freeable to get out of sync, as check_and_update_cache() is always skipped for encoder-decoder.

What I did in this PR is to have encoder-decoder models call check_and_update_cache, but still schedule the encoder input even if it is cached. This keeps the cache state in sync while retaining the previous behavior.
Future work can focus on enabling the cache for Whisper, now that the flow is getting more and more aligned with MM models.

EDIT: I have instead provided a much simpler alternative cache that clearly highlights the workflow of enc-dec models (for scheduling only).
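
The accounting problem described above (before the EDIT) can be reproduced in a toy model. The sketch below is a simplified stand-in and not vLLM's `EncoderCacheManager`: only the names `allocate`, `check_and_update_cache`, `freeable` and `num_freeable_slots` come from the description, while `ToyEncoderCache`, `cached`, `free`, `in_sync` and `run` are assumptions made for illustration.

```python
from collections import OrderedDict


class ToyEncoderCache:
    """Toy slot accounting; a simplified stand-in, not vLLM's EncoderCacheManager."""

    def __init__(self, cache_size: int) -> None:
        self.num_freeable_slots = cache_size  # slots that are free or reclaimable
        self.cached: dict[str, set[str]] = {}  # input key -> referencing request ids
        self.freeable: OrderedDict[str, int] = OrderedDict()  # unreferenced key -> size

    def check_and_update_cache(self, req_id: str, key: str) -> bool:
        """Return True on a hit, taking the entry back out of `freeable` if needed."""
        if key not in self.cached:
            return False
        if not self.cached[key]:
            # The entry is in use again, so its slots are no longer reclaimable.
            self.num_freeable_slots -= self.freeable.pop(key)
        self.cached[key].add(req_id)
        return True

    def allocate(self, req_id: str, key: str, num_tokens: int) -> None:
        """Reserve slots unconditionally, even if `key` is still in `freeable`."""
        self.cached.setdefault(key, set()).add(req_id)
        self.num_freeable_slots -= num_tokens

    def free(self, req_id: str, key: str, num_tokens: int) -> None:
        """Drop one reference; an unreferenced entry becomes reclaimable."""
        self.cached[key].discard(req_id)
        if not self.cached[key]:
            self.freeable[key] = num_tokens
            self.num_freeable_slots += num_tokens

    def in_sync(self) -> bool:
        """No entry may be both referenced and listed as reclaimable."""
        return all(not self.cached[key] for key in self.freeable)


def run(call_check: bool) -> bool:
    cache = ToyEncoderCache(cache_size=3000)
    cache.allocate("req-a", "audio-x", 1500)
    cache.free("req-a", "audio-x", 1500)
    # A second request arrives with the same audio input.
    hit = call_check and cache.check_and_update_cache("req-b", "audio-x")
    if not hit:
        cache.allocate("req-b", "audio-x", 1500)
    return cache.in_sync()


print(run(call_check=False), run(call_check=True))
```

When the check is skipped, the entry stays in `freeable` although a live request references it, so the bookkeeping drifts. Calling `check_and_update_cache` first keeps the two structures consistent. The toy tracks accounting only; whether the encoder input is still scheduled on a hit is a separate decision.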

Results

# MAIN


# 10
RESULTS SUMMARY
================================================================================
Total samples: 10
Successful: 10
Failed: 0
Total time: 0.39s
Average latency: 0.19s
Throughput: 25.96 requests/s

# 50

================================================================================
RESULTS SUMMARY
================================================================================
Total samples: 73
Successful: 73
Failed: 0
Total time: 1.35s
Average latency: 0.63s
Throughput: 54.27 requests/s

==========================================================================
# This PR

# 10
================================================================================
RESULTS SUMMARY
================================================================================
Total samples: 10
Successful: 10
Failed: 0
Total time: 0.35s
Average latency: 0.15s
Throughput: 28.83 requests/s

# 50
================================================================================
RESULTS SUMMARY
================================================================================
Total samples: 73
Successful: 73
Failed: 0
Total time: 0.98s
Average latency: 0.49s
Throughput: 74.32 requests/s
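
To make the comparison easier to read, the relative changes can be computed directly from the summaries above. This is a small worked calculation using only the reported figures; `pct_change` and `runs` are names introduced here for illustration.

```python
# Figures copied from the result summaries above, as (main, this PR).
runs = {
    "10": {"throughput": (25.96, 28.83), "latency": (0.19, 0.15)},
    "50": {"throughput": (54.27, 74.32), "latency": (0.63, 0.49)},
}


def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


for label, metrics in runs.items():
    throughput = pct_change(*metrics["throughput"])
    latency = pct_change(*metrics["latency"])
    print(f"# {label}: throughput {throughput:+.1f}%, average latency {latency:+.1f}%")
```

For the run labelled `# 10` this gives roughly +11% throughput and -21% average latency; for the run labelled `# 50`, roughly +37% throughput and -22% average latency.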

cc @DarkLight1337 @russellb

@DarkLight1337 DarkLight1337 left a comment

Member

Thanks, LGTM

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) November 26, 2025 04:42
@github-actions github-actions Bot added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 26, 2025
@NickLucche

Member Author

will address failure related to models using whisper encoder only

@NickLucche NickLucche disabled auto-merge November 26, 2025 14:33
Comment thread vllm/config/vllm.py Outdated
"Encoder-decoder model detected: setting "
"`max_num_encoder_input_tokens` to encoder length (%s)",
self.scheduler_config.max_num_encoder_input_tokens,
if (

Collaborator

why not just set the multiproc method to spawn?

Member Author

not sure there was history here but should check with @russellb https://vllm-dev.slack.com/archives/C07QCGVDNUF/p1760579949992319

Comment thread vllm/v1/core/sched/scheduler.py Outdated
@NickLucche NickLucche enabled auto-merge (squash) November 27, 2025 11:31
@NickLucche

Member Author

@DarkLight1337 I really can't seem to be able to get a full green CI on this PR, still blocked by unrelated


entrypoints/openai/test_response_api_with_harmony.py::test_function_calling_with_stream[openai/gpt-oss-20b] - httpx.RemoteProtocolError: peer closed connection without sending complete message body (incomplete chunked read)
--

are we already tracking this test?

@DarkLight1337

Member

It seems to be just flaky, it passes decently often on main

@NickLucche NickLucche force-pushed the whisper-enable-encoder-batching branch from 4aca083 to 9842860 Compare November 27, 2025 16:08
@mergify

mergify Bot commented Nov 28, 2025

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @NickLucche.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Nov 28, 2025
@NickLucche NickLucche force-pushed the whisper-enable-encoder-batching branch from f651ce7 to 26b193b Compare November 28, 2025 09:24
Signed-off-by: NickLucche <nlucches@redhat.com>
@NickLucche NickLucche force-pushed the whisper-enable-encoder-batching branch from 04c3f0e to 7379be9 Compare December 11, 2025 18:41
@NickLucche NickLucche merged commit 0efd9f8 into vllm-project:main Dec 11, 2025
57 checks passed
Comment thread vllm/config/model.py
self.original_max_model_len = self.max_model_len
self.max_model_len = self.get_and_verify_max_len(self.max_model_len)

if self.is_encoder_decoder:

Collaborator

Note: I don't think this is actually enough. I still need to specify this

Member

This setting does not apply to the MM feature cache in the model runner.

Lucaskabela pushed a commit to Lucaskabela/vllm that referenced this pull request Dec 15, 2025
Signed-off-by: NickLucche <nlucches@redhat.com>
Majid-Taheri pushed a commit to Majid-Taheri/vllm that referenced this pull request Dec 23, 2025
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: Ubuntu <mjtaheri68@gmail.com>
@npuichigo


Can anyone explain why the EPD-Disaggregation is not applied here to optimize Whisper?

mystous pushed a commit to mystous/vllm_hybrid that referenced this pull request May 10, 2026
Signed-off-by: NickLucche <nlucches@redhat.com>
my-other-github-account pushed a commit to my-other-github-account/vllm that referenced this pull request May 15, 2026
Signed-off-by: NickLucche <nlucches@redhat.com>
my-other-github-account pushed a commit to my-other-github-account/vllm that referenced this pull request May 15, 2026
Signed-off-by: NickLucche <nlucches@redhat.com>
0826joyce pushed a commit to 0826joyce/vllm-serving-optimization that referenced this pull request May 19, 2026
Signed-off-by: NickLucche <nlucches@redhat.com>
Labels

ready ONLY add when PR is ready to merge/full CI is needed v1

7 participants