Bump Flashinfer to v0.6.1 #30993
Conversation
Code Review
This pull request bumps the Flashinfer version to v0.6.0rc1. The changes are consistent across the Dockerfiles, requirements, and source code. The main code change is the removal of the tile_tokens_dim argument from all TRTLLM MoE kernel calls, which is in line with the API changes in the new Flashinfer version as stated in the pull request description. The related helper functions for calculating this dimension have also been correctly removed. The changes appear correct and complete for this version bump. I have not found any issues of high or critical severity.
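For illustration only (this is not code from the PR), a compatibility shim for the removed argument could look like the following sketch, where `kernel` stands in for any TRTLLM MoE entry point and all names are hypothetical:

```python
import inspect

def call_trtllm_moe(kernel, *args, tile_tokens_dim=None, **kwargs):
    """Hypothetical shim: forward tile_tokens_dim only if the installed
    FlashInfer kernel still accepts it (the argument was removed in v0.6.x)."""
    params = inspect.signature(kernel).parameters
    if tile_tokens_dim is not None and "tile_tokens_dim" in params:
        kwargs["tile_tokens_dim"] = tile_tokens_dim
    return kernel(*args, **kwargs)

# Toy stand-ins for old- and new-style kernel signatures:
def new_kernel(x):                      # v0.6.x style: no tile_tokens_dim
    return x * 2

def old_kernel(x, tile_tokens_dim=8):   # pre-0.6.x style
    return x * tile_tokens_dim

print(call_trtllm_moe(new_kernel, 3, tile_tokens_dim=8))  # → 6 (arg dropped)
print(call_trtllm_moe(old_kernel, 3, tile_tokens_dim=8))  # → 24 (arg forwarded)
```

The PR itself simply deletes the argument at every call site rather than gating on the signature, which is the right choice once the minimum FlashInfer version is pinned.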
yewentao256
left a comment
Thanks for the work!
I am wondering if we could wait a bit until 0.6.0 is formally out.
elvischenv
left a comment
@yewentao256 Will update to 0.6.0 when it is released. Thanks.
Just FYI: there is a compilation error with GCC 11. If you update the version, please update at least
I am in favor of adding the ready label to see if there are other failures in the CI before we switch to 0.6.0.
I think it's worth running CI early because there might be some failures.
njhill
left a comment
Just adding this to block merging until we update to 0.6.0
Resolved comment thread (outdated) on vllm/model_executor/layers/quantization/utils/flashinfer_fp4_moe.py
Force-pushed 08dbcc4 to 97e86fd
Blocked by a FlashInfer TopK functional issue: flashinfer-ai/flashinfer#2320. FlashInfer already has a fix: flashinfer-ai/flashinfer#2325.
This does not need to block us; we don't use the flashinfer sampler. We can just disable that test.
Hi @njhill, we found that 0.6.1 only fixed the sampler issue on B200 but not on L4, which is used in vLLM CI. We'd like to skip it to move forward. Skipped the test by |
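A hedged sketch of what such a hardware-conditional CI guard could look like (the function name and device strings are assumptions for illustration, not the actual vLLM test code):

```python
def should_skip_sampler_test(device_name: str, flashinfer_version: tuple) -> bool:
    """Hypothetical guard: the FlashInfer TopK fix shipped in 0.6.1
    reportedly covers B200 but not L4, so the L4 GPUs used in vLLM CI
    are skipped unconditionally; other devices gate on the version."""
    if "L4" in device_name:
        return True
    return flashinfer_version < (0, 6, 1)

print(should_skip_sampler_test("NVIDIA L4", (0, 6, 1)))    # → True (still broken on L4)
print(should_skip_sampler_test("NVIDIA B200", (0, 6, 1)))  # → False (fixed in 0.6.1)
```

In practice this kind of predicate would back a `pytest.mark.skipif` on the sampler test.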
Commit history (all commits Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com> unless noted):
- Update docker/Dockerfile (Signed-off-by: Pavani Majety <pavanimajety@gmail.com>)
- Update to v0.6.0rc2 (Co-authored-by: Pavani Majety <pavanimajety@gmail.com>) ×3
- update to 0.6.0
- update to 0.6.1
- remove tile_tokens_dim
- fix lack of o_data_type of plan()
- fix fa2/fa3 API breakage
Force-pushed 3af5dad to 61cef9d
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Force-pushed 61cef9d to c4a5b24
yewentao256
left a comment
LGTM, just a question.
Also may need an approval from @njhill
```python
if self._backend == "fa2":
    args.append(fixed_split_size)
    args.append(disable_split_kv)
    args.append(0)  # num_colocated_ctas
```
So FA3 doesn't support fixed_split_size?
This is from flashinfer decode.py#L1065-L1089.
@nvpohanh Do you know why FA3 does not need these arguments?
Yes, FA3 doesn't need them; those arguments are designed for batch invariance.
Will this break the current batch invariance test?
@yewentao256 I don't think this PR will break batch invariance test because:
- If the test was originally using FA2 backend, then it still uses FA2 backend and nothing is changed.
- FA3 backend is enabled to support FP8 kv-cache on Hopper GPUs. Previously, we could not run FP8 kv-cache on Hopper GPUs at all.
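To make the batch-invariance point concrete: split-KV decoding changes the order in which partial attention sums are reduced, and floating-point addition is not associative, so a split count that varies with batch size can perturb the last bits of the output. A minimal pure-Python illustration of the effect (not FlashInfer code; `split_sum` is a toy stand-in for the split-KV reduction):

```python
import random

def split_sum(values, num_splits):
    """Sum `values` in `num_splits` contiguous chunks, then combine the
    partial sums -- mimicking how split-KV changes the reduction order."""
    chunk = max(1, (len(values) + num_splits - 1) // num_splits)
    partials = [sum(values[i:i + chunk]) for i in range(0, len(values), chunk)]
    return sum(partials)

random.seed(0)
vals = [random.uniform(-1.0, 1.0) for _ in range(10_000)]
one_split = split_sum(vals, 1)
many_splits = split_sum(vals, 7)
# The two results agree only approximately; the reduction order differs.
# Pinning the split strategy (fixed_split_size / disable_split_kv) pins
# the reduction order, which is what makes FA2 outputs batch-invariant.
print(abs(one_split - many_splits))
```

FA3 derives its schedule differently, which is why (per the discussion above) it does not take these arguments.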
mgoin
left a comment
LGTM, triggering more blackwell CI
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com> Signed-off-by: mohammad najafi <mohammad.najafi@amd.com>
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com> Signed-off-by: 陈建华 <1647430658@qq.com>
Purpose
Bump Flashinfer to v0.6.1 when it is released.
API change: argument `tile_tokens_dim` has been removed from all TRTLLM MoE kernels.

Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.
Note
Upgrade FlashInfer to v0.6.0
Pins FlashInfer to `0.6.0` in `docker/Dockerfile` and `Dockerfile.nightly_torch` (source build pinned to `v0.6.0`), and in `requirements/cuda.txt`.

API updates for MoE kernels
Removes `tile_tokens_dim` from all TRTLLM MoE call sites and related helpers; deletes `calculate_tile_tokens_dim` and associated imports/usages across `flashinfer_trtllm_moe.py`, `trtllm_moe.py`, `mxfp4.py`, `flashinfer_fp4_moe.py`, and tests.

Attention backend adjustments
Passes `o_data_type` through FlashInfer prefill/decode wrappers; updates the fast-plan call to handle backend-specific arg lists (adds conditional args for `fa2`).

Tests
Written by Cursor Bugbot for commit 100b3744ddd31ce849b0bae40a87e2dbe53107e9. This will update automatically on new commits.
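The backend-specific plan-argument handling discussed in this PR (always passing `o_data_type`, with split-KV controls only for `fa2`) could be sketched as follows; this is a hypothetical helper for illustration, not the actual vLLM code:

```python
def build_plan_args(backend: str, q_dtype: str, o_dtype: str,
                    fixed_split_size=None,
                    disable_split_kv: bool = False) -> dict:
    """Hypothetical sketch: assemble plan() kwargs per attention backend.
    o_data_type is always passed; the split-KV / batch-invariance controls
    are fa2-only, since fa3 derives its schedule internally."""
    args = {"q_data_type": q_dtype, "o_data_type": o_dtype}
    if backend == "fa2":
        args["fixed_split_size"] = fixed_split_size
        args["disable_split_kv"] = disable_split_kv
        args["num_colocated_ctas"] = 0
    return args

print(build_plan_args("fa3", "fp8", "bf16"))
print(build_plan_args("fa2", "fp16", "fp16", fixed_split_size=128))
```

This mirrors the conditional `args.append(...)` snippet reviewed in the conversation above, just expressed as a keyword dictionary for readability.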