[Core] Whisper Enable Encoder Batching by NickLucche · Pull Request #29421 · vllm-project/vllm

NickLucche · 2025-11-25T16:40:07Z

This PR addresses an important performance limitation of our current Whisper implementation: the encoder runs only one request at a time, instead of scheduling multiple audio inputs and batching them in a single encoder forward pass.
This is particularly costly in high-memory, high-request-rate deployments.

To summarize changes with some examples:

# MAIN

(EngineCore_DP0 pid=2996843) INFO 11-25 18:02:30 [gpu_model_runner.py:4220] Encoder cache will be initialized with a budget of 1500 tokens, and profiled with 1 audio items of the maximum feature size.

# Extra-logging added for debugging 
***ENCODER INPUT HIDDEN STATES*** torch.Size([1500, 1280])

# PR

Encoder cache will be initialized with a budget of 32768 tokens, and profiled with 21 audio items of the maximum feature size.

# Extra-logging added for debugging 
***ENCODER INPUT HIDDEN STATES*** torch.Size([12, 1500, 1280])
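The item counts in the two log lines are consistent with dividing the encoder token budget by the size of one maximum-size audio item (1500 tokens). A minimal sketch of that arithmetic, assuming plain integer division; `max_items_in_budget` is a hypothetical helper name used only for illustration, not a vLLM function:

```python
def max_items_in_budget(budget_tokens: int, tokens_per_item: int) -> int:
    """Number of maximum-size encoder items that fit in the token budget."""
    if tokens_per_item <= 0:
        raise ValueError("tokens_per_item must be positive")
    return budget_tokens // tokens_per_item


# Budgets and item size are taken from the log lines above.
main_items = max_items_in_budget(1500, 1500)
pr_items = max_items_in_budget(32768, 1500)
print(main_items, pr_items)
```

With the budget from main only one audio item fits, while the larger budget in this PR leaves room for 21, matching the profiling messages.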

These changes build on top of PR #29268 to generalize seq_lens.

Changes here are confined to:

Whisper: batch hidden states properly before MHA.

Scheduling: avoid constraining self.scheduler_config.max_num_encoder_input_tokens to the maximum size of one item. That constraint effectively prevents the EncoderCache from admitting multiple items in a scheduling step.

More scheduling: there's an issue related to skipping check_and_update_cache for requests with the same input.
The problem is that EncoderCacheManager.allocate() always decrements slots, even when reusing an entry from freeable. This causes num_freeable_slots and freeable to get out of sync, as check_and_update_cache() is always skipped for encoder-decoder models.
What I did in this PR is have encoder-decoder models call check_and_update_cache, but still schedule the encoder input even if it is cached. This keeps the cache state in sync while retaining the previous behavior.
Future work can focus on enabling the cache for Whisper, now that the flow is getting more and more aligned with MM models.
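The accounting drift described above can be reproduced in a toy model. This is a sketch under assumed bookkeeping rules, not vLLM's EncoderCacheManager: the class `ToyEncoderCache`, its `num_free_slots` counter, the `free` method, and the `in_sync` invariant are all hypothetical. Only the names `allocate`, `check_and_update_cache`, `num_freeable_slots`, and `freeable` come from the description.

```python
class ToyEncoderCache:
    """Toy slot accounting for illustration; not vLLM's EncoderCacheManager."""

    def __init__(self, num_slots: int) -> None:
        self.num_free_slots = num_slots      # slots holding no entry at all
        self.num_freeable_slots = num_slots  # free slots + reclaimable slots
        self.freeable: dict[str, int] = {}   # unreferenced entries -> size

    def allocate(self, key: str, size: int) -> None:
        # Always decrements both counters, even if `key` is still in `freeable`.
        self.num_free_slots -= size
        self.num_freeable_slots -= size

    def check_and_update_cache(self, key: str) -> bool:
        # Revive an unreferenced entry so it no longer counts as reclaimable.
        if key not in self.freeable:
            return False
        self.num_freeable_slots -= self.freeable.pop(key)
        return True

    def free(self, key: str, size: int) -> None:
        # The entry stays stored but becomes reclaimable.
        self.freeable[key] = size
        self.num_freeable_slots += size

    def in_sync(self) -> bool:
        # Assumed invariant: reclaimable capacity = free slots + freeable entries.
        expected = self.num_free_slots + sum(self.freeable.values())
        return self.num_freeable_slots == expected


def run(call_check: bool) -> bool:
    cache = ToyEncoderCache(num_slots=10)
    cache.allocate("audio-a", 4)
    cache.free("audio-a", 4)
    # The same input arrives again.
    hit = call_check and cache.check_and_update_cache("audio-a")
    if not hit:
        cache.allocate("audio-a", 4)
    # In both flows the encoder input would still be scheduled;
    # only the bookkeeping differs.
    cache.free("audio-a", 4)
    return cache.in_sync()


skipped = run(call_check=False)
checked = run(call_check=True)
print(skipped, checked)
```

In this toy model, skipping the check double-counts the reused entry and breaks the invariant, while calling it first keeps the counter and the `freeable` map consistent.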

EDIT: I have instead provided a much simpler alternative cache that clearly highlights the workflow of enc-dec models (for scheduling only).
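To illustrate the scheduling change, here is a simplified sketch of how the encoder token budget bounds the number of audio items per step. The function `plan_encoder_steps` is hypothetical, and it ignores everything else the real scheduler accounts for (decoder token budget, sequence limits, cache state):

```python
def plan_encoder_steps(item_sizes: list[int], budget: int) -> list[list[int]]:
    """Greedily group encoder inputs, in arrival order, into per-step batches."""
    steps: list[list[int]] = []
    current: list[int] = []
    used = 0
    for size in item_sizes:
        if size > budget:
            raise ValueError("a single item exceeds the encoder budget")
        if used + size > budget:
            # Budget exhausted: close this step and start a new one.
            steps.append(current)
            current, used = [], 0
        current.append(size)
        used += size
    if current:
        steps.append(current)
    return steps


# Twelve maximum-size audio items of 1500 encoder tokens each.
waiting = [1500] * 12
steps_main = plan_encoder_steps(waiting, budget=1500)
steps_pr = plan_encoder_steps(waiting, budget=32768)
print(len(steps_main), len(steps_pr))
```

With a budget equal to one item, each item needs its own encoder step. With the larger budget, all twelve fit in one step, which is the situation reflected by the `[12, 1500, 1280]` hidden-state shape in the debug log.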

Results

# MAIN

# 10
================================================================================
RESULTS SUMMARY
================================================================================
Total samples: 10
Successful: 10
Failed: 0
Total time: 0.39s
Average latency: 0.19s
Throughput: 25.96 requests/s

# 50

================================================================================
RESULTS SUMMARY
================================================================================
Total samples: 73
Successful: 73
Failed: 0
Total time: 1.35s
Average latency: 0.63s
Throughput: 54.27 requests/s

==========================================================================

# This PR

# 10
================================================================================
RESULTS SUMMARY
================================================================================
Total samples: 10
Successful: 10
Failed: 0
Total time: 0.35s
Average latency: 0.15s
Throughput: 28.83 requests/s

# 50
================================================================================
RESULTS SUMMARY
================================================================================
Total samples: 73
Successful: 73
Failed: 0
Total time: 0.98s
Average latency: 0.49s
Throughput: 74.32 requests/s
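For a quick comparison, the throughput figures above work out to roughly 1.11x for the run labeled 10 and 1.37x for the run labeled 50. A small sketch of the calculation, using only the numbers reported in the summaries:

```python
# Throughput in requests/s, copied from the result summaries above.
throughput = {
    "10": {"main": 25.96, "pr": 28.83},
    "50": {"main": 54.27, "pr": 74.32},
}

# Relative throughput of this PR over main for each run.
speedups = {
    label: values["pr"] / values["main"] for label, values in throughput.items()
}

for label, speedup in speedups.items():
    print(f"run {label}: {speedup:.2f}x throughput vs main")
```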

cc @DarkLight1337 @russellb

DarkLight1337

Thanks, LGTM

NickLucche · 2025-11-26T13:32:51Z

will address failure related to models using whisper encoder only

robertgshaw2-redhat · 2025-11-26T14:40:31Z

-                    "Encoder-decoder model detected: setting "
-                    "`max_num_encoder_input_tokens` to encoder length (%s)",
-                    self.scheduler_config.max_num_encoder_input_tokens,
+            if (


why not just set the multiproc method to spawn?

not sure there was history here but should check with @russellb https://vllm-dev.slack.com/archives/C07QCGVDNUF/p1760579949992319

NickLucche · 2025-11-27T14:32:17Z

@DarkLight1337 I can't seem to get a fully green CI on this PR; it is still blocked by an unrelated failure:


entrypoints/openai/test_response_api_with_harmony.py::test_function_calling_with_stream[openai/gpt-oss-20b] - httpx.RemoteProtocolError: peer closed connection without sending complete message body (incomplete chunked read)
--

are we already tracking this test?

DarkLight1337 · 2025-11-27T14:41:35Z

It seems to be just flaky, it passes decently often on main

mergify · 2025-11-28T06:50:11Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @NickLucche.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


robertgshaw2-redhat · 2025-12-12T02:47:57Z

        self.original_max_model_len = self.max_model_len
        self.max_model_len = self.get_and_verify_max_len(self.max_model_len)
+
+        if self.is_encoder_decoder:


note: I don't think this is actually enough. I still need to specify this

This setting does not apply to the MM feature cache in the model runner.


npuichigo · 2025-12-26T11:09:58Z

Can anyone explain why the EPD-Disaggregation is not applied here to optimize Whisper?


mergify Bot added the v1 label Nov 25, 2025

NickLucche marked this pull request as ready for review November 25, 2025 21:05

NickLucche requested review from ApostaC, ProExpertProg, WoosukKwon, alexm-redhat, heheda12345, hmellor, houseroad, mgoin, njhill, robertgshaw2-redhat, tlrmchlsmth, yewentao256, youkaichao and ywang96 as code owners November 25, 2025 21:05

DarkLight1337 approved these changes Nov 26, 2025


DarkLight1337 enabled auto-merge (squash) November 26, 2025 04:42

github-actions Bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Nov 26, 2025

NickLucche disabled auto-merge November 26, 2025 14:33

robertgshaw2-redhat reviewed Nov 26, 2025


Comment thread vllm/v1/core/sched/scheduler.py Outdated

NickLucche enabled auto-merge (squash) November 27, 2025 11:31

NickLucche force-pushed the whisper-enable-encoder-batching branch from 4aca083 to 9842860 Compare November 27, 2025 16:08

mergify Bot added the needs-rebase label Nov 28, 2025

NickLucche force-pushed the whisper-enable-encoder-batching branch from f651ce7 to 26b193b Compare November 28, 2025 09:24

NickLucche added 11 commits December 11, 2025 18:41

fix encoder scheduling

d770f71

Signed-off-by: NickLucche <nlucches@redhat.com>

finalize config changes

2a77d48

Signed-off-by: NickLucche <nlucches@redhat.com>

do not stack for models handling batching externally

35a6314

Signed-off-by: NickLucche <nlucches@redhat.com>

fix tests

345b4ef

Signed-off-by: NickLucche <nlucches@redhat.com>

fix tests

568bbcb

Signed-off-by: NickLucche <nlucches@redhat.com>

fix tests

e16d6ca

Signed-off-by: NickLucche <nlucches@redhat.com>

timeout already set

53d224d

Signed-off-by: NickLucche <nlucches@redhat.com>

skip test

2932999

Signed-off-by: NickLucche <nlucches@redhat.com>

revert test changes

cb3f171

Signed-off-by: NickLucche <nlucches@redhat.com>

enc-dec basic cache manager

0c6d82b

Signed-off-by: NickLucche <nlucches@redhat.com>

revert test change

7379be9

Signed-off-by: NickLucche <nlucches@redhat.com>

NickLucche force-pushed the whisper-enable-encoder-batching branch from 04c3f0e to 7379be9 Compare December 11, 2025 18:41

NickLucche merged commit 0efd9f8 into vllm-project:main Dec 11, 2025
57 checks passed

robertgshaw2-redhat reviewed Dec 12, 2025


Lucaskabela pushed a commit to Lucaskabela/vllm that referenced this pull request Dec 15, 2025

[Core] Whisper Enable Encoder Batching (vllm-project#29421)

913aea6

Signed-off-by: NickLucche <nlucches@redhat.com>

This was referenced Dec 16, 2025

[Core] WhisperEncoder support torch.compile #30549

Open

[Bugfix] Whisper fix number of allocated CrossAttn blocks per-request #30772

Merged

Majid-Taheri pushed a commit to Majid-Taheri/vllm that referenced this pull request Dec 23, 2025

[Core] Whisper Enable Encoder Batching (vllm-project#29421)

75f625a

Signed-off-by: NickLucche <nlucches@redhat.com> Signed-off-by: Ubuntu <mjtaheri68@gmail.com>

mystous pushed a commit to mystous/vllm_hybrid that referenced this pull request May 10, 2026

[Core] Whisper Enable Encoder Batching (vllm-project#29421)

7068190

Signed-off-by: NickLucche <nlucches@redhat.com>

my-other-github-account pushed a commit to my-other-github-account/vllm that referenced this pull request May 15, 2026

[Core] Whisper Enable Encoder Batching (vllm-project#29421)

ca31b98

Signed-off-by: NickLucche <nlucches@redhat.com>

my-other-github-account pushed a commit to my-other-github-account/vllm that referenced this pull request May 15, 2026

[Core] Whisper Enable Encoder Batching (vllm-project#29421)

c6c6027

Signed-off-by: NickLucche <nlucches@redhat.com>

0826joyce pushed a commit to 0826joyce/vllm-serving-optimization that referenced this pull request May 19, 2026

[Core] Whisper Enable Encoder Batching (vllm-project#29421)

ccdc950

Signed-off-by: NickLucche <nlucches@redhat.com>

NickLucche merged 14 commits into vllm-project:main from NickLucche:whisper-enable-encoder-batching