[Core] Make Whisper work with b200 + flashinfer #25098
russellb wants to merge 1 commit into vllm-project:main
Conversation
Code Review
This pull request enables Whisper support on B200 with the flashinfer backend by allowing ENCODER_DECODER attention and handling potential None values for keys and values in cross-attention. The changes are generally correct, but I've identified a couple of areas for improvement. The error message in flashinfer.py has become outdated and could be misleading. Additionally, the logic for creating dummy encoder_seq_lens in gpu_model_runner.py for warmup/profiling is both too broad in its condition and too narrow in its application, which could lead to incorrect behavior or incomplete warmup for batched cross-attention. I have provided suggestions to address these points.
LucasWilkinson
left a comment
Overall this looks pretty good; my one question is about modifying the forward signature for only some backends. I can't think of a great way around it, though, so this is probably fine until we can figure out whether there's something better we can do 👍
Only real issue is the removal of dcp_local_seq_lens
This pull request has merge conflicts that must be resolved before it can be merged.
Could you implement `supports_attn_type` in `FlashInferBackend`? That will enable the selector to pick FlashInfer automatically.
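A minimal sketch of how a selector could use such a classmethod. `AttentionType` and the backend class names mirror this PR, but the selector loop and the `FlashAttentionBackend` behavior shown here are illustrative assumptions, not vLLM's actual code.

```python
# Hypothetical sketch: a backend selector consulting supports_attn_type().
# Names mirror the PR discussion; the logic is illustrative, not vLLM's.

class AttentionType:
    DECODER = "decoder"
    ENCODER = "encoder"
    ENCODER_DECODER = "encoder_decoder"

class FlashInferBackend:
    @classmethod
    def supports_attn_type(cls, attn_type: str) -> bool:
        # The change in this PR: report ENCODER_DECODER as supported too.
        return attn_type in (AttentionType.DECODER, AttentionType.ENCODER_DECODER)

class FlashAttentionBackend:
    @classmethod
    def supports_attn_type(cls, attn_type: str) -> bool:
        return True  # assumption for illustration only

def select_backend(candidates: list, attn_type: str):
    # Pick the first candidate that declares support for this attention type.
    for backend in candidates:
        if backend.supports_attn_type(attn_type):
            return backend
    raise ValueError(f"no backend supports {attn_type}")
```

With this in place, a selector trying FlashInfer first would pick it automatically for cross-attention layers instead of rejecting them.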
Force-pushed from 1e8de6f to ef384cf
Hi @russellb, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
Force-pushed from ef384cf to d74f856
Documentation preview: https://vllm--25098.org.readthedocs.build/en/25098/
Force-pushed from d74f856 to 441e0f5
@LucasWilkinson can you take one more look? I rebased this again to fix a bunch of conflicts.
NickLucche
left a comment
We need tests for this, IMO. If this is adding FlashInfer support regardless of SM arch, we can easily test on CI by adding a parametrization on the attention backend.
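The parametrization idea can be sketched as a plain loop; in vLLM's test suite this would presumably be a `pytest.mark.parametrize` over the `VLLM_ATTENTION_BACKEND` environment variable. The backend names are real vLLM values, but the smoke-test body here is a placeholder, not an actual Whisper test.

```python
import os

# Backends to parametrize over; FLASHINFER is the one this PR enables
# for Whisper's cross-attention.
BACKENDS = ["FLASH_ATTN", "FLASHINFER"]

def run_whisper_smoke_test(backend: str) -> bool:
    # Select the attention backend under test; vLLM reads this env var.
    os.environ["VLLM_ATTENTION_BACKEND"] = backend
    # A real test would load Whisper here and transcribe a short clip.
    return True  # placeholder standing in for "transcription succeeded"

results = {b: run_whisper_smoke_test(b) for b in BACKENDS}
```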
Yeah, good point. I haven't actually run this in months ...
Force-pushed from 441e0f5 to 4618256
```python
@classmethod
def supports_attn_type(cls, attn_type: str) -> bool:
    return attn_type in (
        AttentionType.DECODER,
        AttentionType.ENCODER_DECODER,
    )
```
Have you tested that both FlashInfer and TRTLLM backends support both?
```diff
-if attn_type != AttentionType.DECODER:
+if attn_type not in (AttentionType.DECODER, AttentionType.ENCODER_DECODER):
     raise NotImplementedError(
         "Encoder self-attention and "
         "encoder/decoder cross-attention "
```
The error message needs to be updated
```python
assert key is not None
assert value is not None
```
I think we also need this check for the trtllm backend?
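The behavior under discussion can be sketched with a plain dict standing in for the paged KV cache: with `ENCODER_DECODER` attention, `key`/`value` arrive only on a request's first decoder pass and are read back from the cache on later passes, which is why the asserts above would reject valid calls. This is an illustrative model of the pattern, not the backend's actual code.

```python
# Toy KV store keyed by request id; the real backend uses a paged KV cache.
kv_cache: dict[str, tuple[list[float], list[float]]] = {}

def cross_attn_kv(request_id: str, key, value):
    if key is not None and value is not None:
        # First decoder pass for this request: write the encoder K/V once.
        kv_cache[request_id] = (key, value)
    # Later passes call with key=value=None and reuse the cached tensors,
    # so an unconditional `assert key is not None` would fail here.
    return kv_cache[request_id]
```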
These changes were necessary to get Whisper working on a B200 machine with the flashinfer attention backend:

1. Make flashinfer not reject `ENCODER_DECODER` attention.
2. Make flashinfer handle the case where `key` and `value` are None. With cross attention (`ENCODER_DECODER`), `key` and `value` are only set on the first pass through the decoder for a given request. They are then cached in the KV cache for subsequent passes.
3. Update type hints for key/value in the FlashAttention and FlashInfer backends to reflect that they may be None.
4. Add a `supports_attn_type` method to the flashinfer backend to properly report supported attention types.

Signed-off-by: Russell Bryant <rbryant@redhat.com>
Force-pushed from 4618256 to 4c09a8d
I let this bitrot for so long that it's probably worth starting fresh if this is actually a problem that still needs addressing.
These changes were necessary to get Whisper working on a B200 machine with the flashinfer attention backend. There are three changes:

1. Make flashinfer not reject `ENCODER_DECODER` attention.
2. Make flashinfer handle the case where `key` and `value` are None. With cross attention (`ENCODER_DECODER`), `key` and `value` are only set on the first pass through the decoder for a given request. They are then cached in the KV cache for subsequent passes.
3. In the GPU model runner, this configuration enabled a code path where `force_attention` was set to `True` in `_dummy_run()`. We need to pass a non-None `encoder_seq_lens` to the cross attention metadata builder.
Signed-off-by: Russell Bryant <rbryant@redhat.com>
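The third change above (passing a non-None `encoder_seq_lens` during warmup) can be sketched as follows. The helper name is hypothetical, not the model runner's actual code; for Whisper, the encoder produces a fixed 1500 frames per 30-second window, which is why a full-length dummy value covers the worst case.

```python
def make_dummy_encoder_seq_lens(num_reqs: int, max_encoder_len: int) -> list[int]:
    # During _dummy_run() warmup/profiling, assume every request uses the
    # full encoder length so cross-attention metadata is built for the
    # worst case rather than being left as None.
    return [max_encoder_len] * num_reqs
```

For example, `make_dummy_encoder_seq_lens(2, 1500)` yields `[1500, 1500]`.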