Bump Flashinfer to v0.6.1 #30993
Conversation
Code Review
This pull request bumps the Flashinfer version to v0.6.0rc1. The changes are consistent across the Dockerfiles, requirements, and source code. The main code change is the removal of the tile_tokens_dim argument from all TRTLLM MoE kernel calls, which is in line with the API changes in the new Flashinfer version as stated in the pull request description. The related helper functions for calculating this dimension have also been correctly removed. The changes appear correct and complete for this version bump. I have not found any issues of high or critical severity.
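For illustration only (this is not code from the PR), a compatibility shim for the removed argument could look like the following sketch, where `kernel` stands in for any TRTLLM MoE entry point and all names are hypothetical:

```python
import inspect

def call_trtllm_moe(kernel, *args, tile_tokens_dim=None, **kwargs):
    """Hypothetical shim: forward tile_tokens_dim only if the installed
    FlashInfer kernel still accepts it (the argument was removed in v0.6.x)."""
    params = inspect.signature(kernel).parameters
    if tile_tokens_dim is not None and "tile_tokens_dim" in params:
        kwargs["tile_tokens_dim"] = tile_tokens_dim
    return kernel(*args, **kwargs)

# Toy stand-ins for old- and new-style kernel signatures:
def new_kernel(x):                      # v0.6.x style: no tile_tokens_dim
    return x * 2

def old_kernel(x, tile_tokens_dim=8):   # pre-0.6.x style
    return x * tile_tokens_dim

print(call_trtllm_moe(new_kernel, 3, tile_tokens_dim=8))  # → 6 (arg dropped)
print(call_trtllm_moe(old_kernel, 3, tile_tokens_dim=8))  # → 24 (arg forwarded)
```

The PR itself simply deletes the argument at every call site rather than gating on the signature, which is the right choice once the minimum FlashInfer version is pinned.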
yewentao256
left a comment
Thanks for the work!
I am wondering if we could wait a bit until 0.6.0 is formally out.
elvischenv
left a comment
@yewentao256 Will update to 0.6.0 when it is released. Thanks.
Just FYI: there is a compilation error with GCC 11. If you update the version, please update at least
I am in favor of adding the ready label to see if there are other failures in the CI before we switch to 0.6.0.
I think it's worth running CI early because there might be some failures.
njhill
left a comment
Just adding this to block merging until we update to 0.6.0
Resolved comment thread (outdated) on vllm/model_executor/layers/quantization/utils/flashinfer_fp4_moe.py
Force-pushed 08dbcc4 to 97e86fd
Blocked by a FlashInfer TopK functional issue: flashinfer-ai/flashinfer#2320. FlashInfer already has a fix: flashinfer-ai/flashinfer#2325.
This does not need to block us; we don't use the flashinfer sampler. We can just disable that test.
Hi @njhill, we found that 0.6.1 only fixed the sampler issue on B200 but not on L4, which is used in vLLM CI. We'd like to skip it to move forward. Skipped the test by |
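A hedged sketch of what such a hardware-conditional CI guard could look like (the function name and device strings are assumptions for illustration, not the actual vLLM test code):

```python
def should_skip_sampler_test(device_name: str, flashinfer_version: tuple) -> bool:
    """Hypothetical guard: the FlashInfer TopK fix shipped in 0.6.1
    reportedly covers B200 but not L4, so the L4 GPUs used in vLLM CI
    are skipped unconditionally; other devices gate on the version."""
    if "L4" in device_name:
        return True
    return flashinfer_version < (0, 6, 1)

print(should_skip_sampler_test("NVIDIA L4", (0, 6, 1)))    # → True (still broken on L4)
print(should_skip_sampler_test("NVIDIA B200", (0, 6, 1)))  # → False (fixed in 0.6.1)
```

In practice this kind of predicate would back a `pytest.mark.skipif` on the sampler test.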
Commit history (all commits Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com> unless noted):
- Update docker/Dockerfile (Signed-off-by: Pavani Majety <pavanimajety@gmail.com>)
- Update to v0.6.0rc2 (Co-authored-by: Pavani Majety <pavanimajety@gmail.com>) ×3
- update to 0.6.0
- update to 0.6.1
- remove tile_tokens_dim
- fix lack of o_data_type of plan()
- fix fa2/fa3 API breakage
Force-pushed 3af5dad to 61cef9d
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Force-pushed 61cef9d to c4a5b24
yewentao256
left a comment
LGTM, just a question.
Also may need an approval from @njhill
```python
if self._backend == "fa2":
    args.append(fixed_split_size)
    args.append(disable_split_kv)
    args.append(0)  # num_colocated_ctas
```
So FA3 doesn't support fixed_split_size?
This is from flashinfer decode.py#L1065-L1089.
@nvpohanh Do you know why FA3 does not need these arguments?
Yes, FA3 doesn't need them; those arguments are designed for batch invariance.
Will this break the current batch invariance test?
@yewentao256 I don't think this PR will break batch invariance test because:
- If the test was originally using FA2 backend, then it still uses FA2 backend and nothing is changed.
- FA3 backend is enabled to support FP8 kv-cache on Hopper GPUs. Previously, we could not run FP8 kv-cache on Hopper GPUs at all.
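To make the batch-invariance point concrete: split-KV decoding changes the order in which partial attention sums are reduced, and floating-point addition is not associative, so a split count that varies with batch size can perturb the last bits of the output. A minimal pure-Python illustration of the effect (not FlashInfer code; `split_sum` is a toy stand-in for the split-KV reduction):

```python
import random

def split_sum(values, num_splits):
    """Sum `values` in `num_splits` contiguous chunks, then combine the
    partial sums -- mimicking how split-KV changes the reduction order."""
    chunk = max(1, (len(values) + num_splits - 1) // num_splits)
    partials = [sum(values[i:i + chunk]) for i in range(0, len(values), chunk)]
    return sum(partials)

random.seed(0)
vals = [random.uniform(-1.0, 1.0) for _ in range(10_000)]
one_split = split_sum(vals, 1)
many_splits = split_sum(vals, 7)
# The two results agree only approximately; the reduction order differs.
# Pinning the split strategy (fixed_split_size / disable_split_kv) pins
# the reduction order, which is what makes FA2 outputs batch-invariant.
print(abs(one_split - many_splits))
```

FA3 derives its schedule differently, which is why (per the discussion above) it does not take these arguments.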
mgoin
left a comment
LGTM, triggering more blackwell CI
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com> Signed-off-by: mohammad najafi <mohammad.najafi@amd.com>
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com> Signed-off-by: 陈建华 <1647430658@qq.com>
Purpose
Bump Flashinfer to v0.6.1 when it is released.
API change: argument `tile_tokens_dim` has been removed from all TRTLLM MoE kernels.

Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.
Note
Upgrade FlashInfer to v0.6.0
Pins FlashInfer to `0.6.0` in `docker/Dockerfile` and `Dockerfile.nightly_torch` (source build pinned to `v0.6.0`), and in `requirements/cuda.txt`.

API updates for MoE kernels
Removes `tile_tokens_dim` from all TRTLLM MoE call sites and related helpers; deletes `calculate_tile_tokens_dim` and associated imports/usages across `flashinfer_trtllm_moe.py`, `trtllm_moe.py`, `mxfp4.py`, `flashinfer_fp4_moe.py`, and tests.

Attention backend adjustments
Passes `o_data_type` through FlashInfer prefill/decode wrappers; updates the fast-plan call to handle backend-specific arg lists (adds conditional args for `fa2`).

Tests
Written by Cursor Bugbot for commit 100b3744ddd31ce849b0bae40a87e2dbe53107e9. This will update automatically on new commits.
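The backend-specific plan-argument handling discussed in this PR (always passing `o_data_type`, with split-KV controls only for `fa2`) could be sketched as follows; this is a hypothetical helper for illustration, not the actual vLLM code:

```python
def build_plan_args(backend: str, q_dtype: str, o_dtype: str,
                    fixed_split_size=None,
                    disable_split_kv: bool = False) -> dict:
    """Hypothetical sketch: assemble plan() kwargs per attention backend.
    o_data_type is always passed; the split-KV / batch-invariance controls
    are fa2-only, since fa3 derives its schedule internally."""
    args = {"q_data_type": q_dtype, "o_data_type": o_dtype}
    if backend == "fa2":
        args["fixed_split_size"] = fixed_split_size
        args["disable_split_kv"] = disable_split_kv
        args["num_colocated_ctas"] = 0
    return args

print(build_plan_args("fa3", "fp8", "bf16"))
print(build_plan_args("fa2", "fp16", "fp16", fixed_split_size=128))
```

This mirrors the conditional `args.append(...)` snippet reviewed in the conversation above, just expressed as a keyword dictionary for readability.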