[Core] Make Whisper work with b200 + flashinfer #25098
russellb wants to merge 1 commit into vllm-project:main
Conversation
Code Review
This pull request enables Whisper support on B200 with the flashinfer backend by allowing ENCODER_DECODER attention and handling potential None values for keys and values in cross-attention. The changes are generally correct, but I've identified a couple of areas for improvement. The error message in flashinfer.py has become outdated and could be misleading. Additionally, the logic for creating dummy encoder_seq_lens in gpu_model_runner.py for warmup/profiling is both too broad in its condition and too narrow in its application, which could lead to incorrect behavior or incomplete warmup for batched cross-attention. I have provided suggestions to address these points.
LucasWilkinson
left a comment
Overall this looks pretty good; my one question is about modifying the forward signature for only some backends. I can't think of a great way around it, though, so this is probably fine until we can figure out whether there's something better we can do 👍
Only real issue is the removal of dcp_local_seq_lens
This pull request has merge conflicts that must be resolved before it can be merged.
Could you implement `supports_attn_type` in `FlashInferBackend`? That will enable the selector to pick FlashInfer automatically.
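A minimal sketch of how a selector could use such a classmethod. `AttentionType` and the backend class names mirror this PR, but the selector loop and the `FlashAttentionBackend` behavior shown here are illustrative assumptions, not vLLM's actual code.

```python
# Hypothetical sketch: a backend selector consulting supports_attn_type().
# Names mirror the PR discussion; the logic is illustrative, not vLLM's.

class AttentionType:
    DECODER = "decoder"
    ENCODER = "encoder"
    ENCODER_DECODER = "encoder_decoder"

class FlashInferBackend:
    @classmethod
    def supports_attn_type(cls, attn_type: str) -> bool:
        # The change in this PR: report ENCODER_DECODER as supported too.
        return attn_type in (AttentionType.DECODER, AttentionType.ENCODER_DECODER)

class FlashAttentionBackend:
    @classmethod
    def supports_attn_type(cls, attn_type: str) -> bool:
        return True  # assumption for illustration only

def select_backend(candidates: list, attn_type: str):
    # Pick the first candidate that declares support for this attention type.
    for backend in candidates:
        if backend.supports_attn_type(attn_type):
            return backend
    raise ValueError(f"no backend supports {attn_type}")
```

With this in place, a selector trying FlashInfer first would pick it automatically for cross-attention layers instead of rejecting them.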
Force-pushed from 1e8de6f to ef384cf
Hi @russellb, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
Force-pushed from ef384cf to d74f856
Documentation preview: https://vllm--25098.org.readthedocs.build/en/25098/
Force-pushed from d74f856 to 441e0f5
@LucasWilkinson can you take one more look? I rebased this again to fix a bunch of conflicts.
NickLucche
left a comment
We need tests for this, IMO. If this is adding FlashInfer support regardless of SM arch, we can easily test on CI by adding a parametrization on the attention backend.
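The parametrization idea can be sketched as a plain loop; in vLLM's test suite this would presumably be a `pytest.mark.parametrize` over the `VLLM_ATTENTION_BACKEND` environment variable. The backend names are real vLLM values, but the smoke-test body here is a placeholder, not an actual Whisper test.

```python
import os

# Backends to parametrize over; FLASHINFER is the one this PR enables
# for Whisper's cross-attention.
BACKENDS = ["FLASH_ATTN", "FLASHINFER"]

def run_whisper_smoke_test(backend: str) -> bool:
    # Select the attention backend under test; vLLM reads this env var.
    os.environ["VLLM_ATTENTION_BACKEND"] = backend
    # A real test would load Whisper here and transcribe a short clip.
    return True  # placeholder standing in for "transcription succeeded"

results = {b: run_whisper_smoke_test(b) for b in BACKENDS}
```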
Yeah, good point. I haven't actually run this in months ...
Force-pushed from 441e0f5 to 4618256
```python
@classmethod
def supports_attn_type(cls, attn_type: str) -> bool:
    return attn_type in (
        AttentionType.DECODER,
        AttentionType.ENCODER_DECODER,
    )
```
Have you tested that both FlashInfer and TRTLLM backends support both?
```diff
-if attn_type != AttentionType.DECODER:
+if attn_type not in (AttentionType.DECODER, AttentionType.ENCODER_DECODER):
     raise NotImplementedError(
         "Encoder self-attention and "
         "encoder/decoder cross-attention "
```
The error message needs to be updated
```python
assert key is not None
assert value is not None
```
I think we also need this check for the trtllm backend?
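The behavior under discussion can be sketched with a plain dict standing in for the paged KV cache: with `ENCODER_DECODER` attention, `key`/`value` arrive only on a request's first decoder pass and are read back from the cache on later passes, which is why the asserts above would reject valid calls. This is an illustrative model of the pattern, not the backend's actual code.

```python
# Toy KV store keyed by request id; the real backend uses a paged KV cache.
kv_cache: dict[str, tuple[list[float], list[float]]] = {}

def cross_attn_kv(request_id: str, key, value):
    if key is not None and value is not None:
        # First decoder pass for this request: write the encoder K/V once.
        kv_cache[request_id] = (key, value)
    # Later passes call with key=value=None and reuse the cached tensors,
    # so an unconditional `assert key is not None` would fail here.
    return kv_cache[request_id]
```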
These changes were necessary to get Whisper working on a B200 machine with the flashinfer attention backend:

1. Make flashinfer not reject `ENCODER_DECODER` attention.
2. Make flashinfer handle the case where `key` and `value` are None. With cross attention (`ENCODER_DECODER`), `key` and `value` are only set on the first pass through the decoder for a given request. They are then cached in the KV cache for subsequent passes.
3. Update type hints for key/value in the FlashAttention and FlashInfer backends to reflect that they may be None.
4. Add a `supports_attn_type` method to the flashinfer backend to properly report supported attention types.

Signed-off-by: Russell Bryant <rbryant@redhat.com>
Force-pushed from 4618256 to 4c09a8d
I let this bitrot for so long that it's probably worth starting fresh if this is actually a problem that still needs addressing.
These changes were necessary to get Whisper working on a B200 machine with the flashinfer attention backend. There are three changes:

1. Make flashinfer not reject `ENCODER_DECODER` attention.
2. Make flashinfer handle the case where `key` and `value` are None. With cross attention (`ENCODER_DECODER`), `key` and `value` are only set on the first pass through the decoder for a given request. They are then cached in the KV cache for subsequent passes.
3. In the GPU model runner, this configuration enabled a code path where `force_attention` was set to `True` in `_dummy_run()`. We need to pass a non-None `encoder_seq_lens` to the cross attention metadata builder.
Signed-off-by: Russell Bryant <rbryant@redhat.com>
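The third change above (passing a non-None `encoder_seq_lens` during warmup) can be sketched as follows. The helper name is hypothetical, not the model runner's actual code; for Whisper, the encoder produces a fixed 1500 frames per 30-second window, which is why a full-length dummy value covers the worst case.

```python
def make_dummy_encoder_seq_lens(num_reqs: int, max_encoder_len: int) -> list[int]:
    # During _dummy_run() warmup/profiling, assume every request uses the
    # full encoder length so cross-attention metadata is built for the
    # worst case rather than being left as None.
    return [max_encoder_len] * num_reqs
```

For example, `make_dummy_encoder_seq_lens(2, 1500)` yields `[1500, 1500]`.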