
Bump flashinfer version to 0.6.7 #38188

Closed
wzhao18 wants to merge 4 commits into vllm-project:main from wzhao18:wzhao/bump-fi-0.6.7

Conversation

@wzhao18 (Contributor) commented Mar 26, 2026

Purpose

Bump flashinfer version to 0.6.7

Test Plan

Test Result



@claude (Bot) left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@gemini-code-assist (Bot) left a comment


Code Review

This pull request updates the FlashInfer library version from 0.6.6 to 0.6.7 across Dockerfiles, version configuration, and Python requirements. A review comment suggests re-evaluating the version constraint for the transitive dependency nvidia-cudnn-frontend to ensure compatibility with FlashInfer 0.6.7 and prevent potential build or runtime issues.

Comment thread on requirements/cuda.txt, lines +12 to +13:

flashinfer-python==0.6.7
flashinfer-cubin==0.6.7

critical

This change updates flashinfer to 0.6.7, but does not update the version constraint for its transitive dependency nvidia-cudnn-frontend on line 16. The existing cap <1.19.0 was likely added for a previous version of flashinfer and may be incompatible with 0.6.7, potentially causing build failures or runtime errors. This constraint should be re-evaluated based on the requirements of flashinfer==0.6.7.
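
If it helps, a quick sanity check of that constraint is possible from Python: a minimal sketch, assuming the 0.6.7 wheel is installed and ships standard dependency metadata.

# Sanity-check sketch: print what flashinfer-python 0.6.7 itself
# declares for nvidia-cudnn-frontend.
from importlib.metadata import requires, version

print(version("flashinfer-python"))  # expect 0.6.7
for req in requires("flashinfer-python") or []:
    if "cudnn" in req.lower():
        print(req)  # compare against the <1.19.0 cap in requirements/cuda.txt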

@zyongye added the ready-run-all-tests (Trigger CI with all tests for wide-ranging PRs) label on Mar 26, 2026
@yewentao256 (Member) left a comment


Are the CI failures related?

@wzhao18 (Contributor, Author) commented Mar 26, 2026

@yewentao256 checking right now.

@wzhao18 (Contributor, Author) commented Mar 26, 2026

@yewentao256 There seem to be issues with the new version.

I tried Nemotron locally on GB300 and it produces repetitive output:

vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
    --enforce-eager \
    --max-model-len 4096 \
    --trust-remote-code \
    --tensor-parallel-size 2 \
    --enable-expert-parallel \
    --speculative-config '{"method":"mtp","num_speculative_tokens":5}'
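
For anyone reproducing this, a minimal client sketch against the server started above; port 8000 is vLLM's default for its OpenAI-compatible API, and the prompt here is illustrative:

# Send one greedy completion request and inspect the output for
# degenerate repetition.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4",
    prompt="Explain speculative decoding in one paragraph.",
    max_tokens=256,
    temperature=0,
)
print(resp.choices[0].text)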

@yewentao256 (Member)

> @yewentao256 There seem to be issues with the new version.

Yeah, please take a further look; we need to solve these before merging this PR.

@wzhao18 (Contributor, Author) commented Mar 26, 2026

@yewentao256 Yes of course.

@wzhao18 force-pushed the wzhao/bump-fi-0.6.7 branch from 894a10e to 456be52 on March 29, 2026 00:04
@wzhao18 (Contributor, Author) commented Mar 29, 2026

@yewentao256 Fixed the LM eval CI failures; the problem was related to routing bias in the trtllm-gen MoE kernels.

The remaining CI failures also occur on the main branch; buildkite/ci/pr/model-runner-v2-distributed-2-gpus seems unrelated/flaky.
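
For context, a minimal sketch of the kind of cast being described; the function and tensor names are illustrative assumptions, not the actual vLLM code:

import torch

def apply_routing_bias(router_logits: torch.Tensor,
                       routing_bias: torch.Tensor) -> torch.Tensor:
    # Illustrative only: upcast the routing bias to the logits' dtype
    # before it reaches the fused trtllm-gen routing kernel, so the
    # kernel never sees a mismatched-dtype bias.
    return router_logits + routing_bias.to(router_logits.dtype)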

@yewentao256 (Member) left a comment


LGTM, thanks for the work!
Also CC @mgoin

@github-project-automation (Bot) moved this to Ready in NVIDIA on Mar 29, 2026
@wzhao18 (Contributor, Author) commented Mar 29, 2026

@robertgshaw2-redhat It seems we need to add the cast for the routing bias back.

@wzhao18 (Contributor, Author) commented Mar 30, 2026

The routing bias cast was added because CI showed a GSM8k accuracy collapse with nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 and nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 under the new flashinfer release.
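
For anyone re-running that check locally, a sketch via lm-evaluation-harness' Python API; the vllm backend and gsm8k task are real, but the exact model_args are assumptions mirroring the smaller of the two models:

# Evaluate GSM8k through lm-eval's vLLM backend and compare the score
# before and after the flashinfer bump.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8,"
        "trust_remote_code=True"
    ),
    tasks=["gsm8k"],
    num_fewshot=5,
)
print(results["results"]["gsm8k"])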

@mgoin (Member) commented Mar 30, 2026

I'm planning to get this change in #38423

@wzhao18 (Contributor, Author) commented Mar 30, 2026

@mgoin Sounds good. Will close this one.

@wzhao18 closed this on Mar 30, 2026
@github-project-automation (Bot) moved this from Ready to Done in NVIDIA on Mar 30, 2026

Labels

ci/build, nvidia, ready-run-all-tests (Trigger CI with all tests for wide-ranging PRs)

Projects

Status: Done


4 participants