Update Flashinfer to 0.2.14.post1 #23537
Conversation
Code Review
This pull request updates FlashInfer to version 0.2.14.post1, which addresses a performance issue in the allreduce fusion kernel and incorporates API changes. The changes also enable FlashInfer autotuning before CUDA graph capture for better performance. My review found a critical typo in a variable name (max_captute_size instead of max_capture_size) in mxfp4.py, which would lead to a runtime error. I've provided suggestions to fix this.
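A minimal sketch of the failure mode and fix, assuming the value is read off vLLM's compilation config; the actual mxfp4.py code is not shown in this thread, so the stand-in class and default below are illustrative only:

```python
# Illustrative stand-in for vLLM's compilation config; the real class
# and its defaults are not reproduced here.
from dataclasses import dataclass

@dataclass
class CompilationConfig:
    max_capture_size: int = 512  # assumed default, for illustration only

config = CompilationConfig()

# Before the fix: the misspelled attribute raises AttributeError at runtime.
#   num_tokens = config.max_captute_size
# After the fix: the correctly spelled attribute resolves.
num_tokens = config.max_capture_size
print(num_tokens)
```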
yewentao256
left a comment
Thanks for the work!
Could you also report vllm bench results so we can see whether E2E throughput improves?
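For reference, a throughput run with vLLM's built-in benchmark CLI might look like the sketch below; the model and request shape are placeholders rather than the configuration used in this PR, so check `vllm bench throughput --help` for the exact flags in your build:

```bash
# Hypothetical benchmark invocation; model and sizes are placeholders.
vllm bench throughput \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --input-len 1024 \
  --output-len 128 \
  --num-prompts 200
```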
mgoin
left a comment
LGTM, thanks for putting everything together. Let's see the CI
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com>
Signed-off-by: siyuanf <siyuanf@nvidia.com>
Signed-off-by: Weiliang Liu <weiliangl@nvidia.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Siyuan Fu <siyuanf@nvidia.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Purpose
Update FlashInfer to 0.2.14.post1, which addresses a performance issue in the allreduce fusion kernel and includes API changes, and enable FlashInfer autotuning before CUDA graph capture.
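For anyone reproducing the bump locally, the release is available as the flashinfer-python wheel on PyPI; where vLLM pins the version in-tree (Dockerfile vs. requirements files) is not shown in this thread:

```bash
# Install the updated wheel directly (version taken from this PR's title).
pip install flashinfer-python==0.2.14.post1
```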
Test Plan
lm_eval on llama3
gpt-oss/eval test on gpt-oss
python3 -m gpt_oss.evals --sampler chat_completions --model gpt-oss-120b --reasoning-effort low,medium --n-threads 512 --eval gpqa
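For the "lm_eval on llama3" item above, a representative invocation against a vLLM backend might look like the following sketch; the model ID and task are assumptions inferred from the "FP8 tp2" label in the results below, not the exact command used:

```bash
# Hypothetical lm_eval run; pretrained model and task are placeholders.
lm_eval --model vllm \
  --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct,tensor_parallel_size=2 \
  --tasks gsm8k \
  --batch_size auto
```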
Test Result
llama3 FP8 tp2
llama3 FP4 tp1
gpt-oss tp1:
[{'eval_name': 'gpqa', 'model_name': 'gpt-oss-120b-low_temp1.0_20250825_021150', 'metric': 0.6414141414141414}, {'eval_name': 'gpqa', 'model_name': 'gpt-oss-120b-medium_temp1.0_20250825_021150', 'metric': 0.711489898989899}]