[Do Not Merge] Fix after flashinfer fp4 autotuner PR #23209
IwakuraRein wants to merge 10 commits into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Why do we need this flag and to run twice? I think we can just move this before cuda graphs, like in your original commit.
Because I was thinking that auto-tuning and warm-up serve two different purposes here. Auto-tuning is meant to store the best kernel function index, so I placed it before cuda graph capture to make sure the cuda graph sees the correct kernel. Warm-up is a dry run before the actual job starts, so I added it right before the real execution, just like the original code. Based on your earlier comment, I thought you were suggesting that warm-up is necessary (maybe DeepGEMM requires it?). Please correct me if I've misunderstood.
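For reference, a minimal sketch of the ordering being discussed. Every function below is a stand-in, not the actual vLLM or flashinfer implementation; it only illustrates autotune-before-capture followed by a plain warm-up:

```python
from contextlib import contextmanager

@contextmanager
def autotune():
    # Stand-in for the flashinfer autotuning context from PR #23209;
    # while enabled, the real context records the best kernel indices.
    yield

def kernel_warmup():
    # Stand-in for vLLM's kernel_warmup helper referenced in this PR.
    print("warm-up / autotune pass")

def capture_model():
    # Stand-in for self.model_runner.capture_model().
    print("capturing CUDA graphs")

# 1) Autotune first so the captured graphs see the tuned kernel choices.
with autotune():
    kernel_warmup()
# 2) Capture CUDA graphs after tuning.
capture_model()
# 3) Keep a plain dry-run warm-up right before real execution, as before.
kernel_warmup()
```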
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from c0b3809 to 7e1fb28
yewentao256 left a comment
Looks good, could you also add an E2E accuracy test using lm-eval?
Hi @yewentao256. I have experimented with
ProExpertProg left a comment
Looks good, except please address the TODO.
This pull request has merge conflicts that must be resolved before it can be merged.
yewentao256 left a comment
LGTM, thanks for the work!
Closed after #23537
Can be merged after flashinfer addresses the AOT installation.
Purpose
The flashinfer fp4 autotuner has been merged, so the API call in the mxfp4 MoE needs to be updated. This PR:
- Fixes the `x_scale` shape and uses a hardcoded maximum number of tokens for tuning in the mxfp4 MoE.
- Moves `kernel_warmup` above the `self.model_runner.capture_model()` call (see the ordering sketch in the review discussion above).
- Bumps the flashinfer tag to 0.2.13.

Test Plan
Test Result
On B200, with `VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1`:

without autotuner

with autotuner
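An E2E accuracy check of the kind requested in review could look roughly like the following. This is a minimal sketch assuming lm-eval's `simple_evaluate` entry point and its `vllm` backend; `<model-id>`, the task choice, and the extra model arguments are placeholders, not the actual test configuration used here:

```python
import os
import lm_eval

# Exercise the FlashInfer mxfp4/mxfp8 MoE path, matching the results above.
os.environ["VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8"] = "1"

# Run a small accuracy benchmark through lm-eval's vLLM backend.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=<model-id>,tensor_parallel_size=1,gpu_memory_utilization=0.8",
    tasks=["gsm8k"],
)
print(results["results"]["gsm8k"])
```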
(Optional) Documentation Update
Essential Elements of an Effective PR Description Checklist
Update `supported_models.md` and `examples` for a new model.