Prevent MoE autotuner buffer overflow on large token buckets #3025
Conversation
No actionable comments were generated in the recent review. 🎉
📝 Walkthrough

Moved MoE autotuner config to per-run instances, tightened CUDA-graph preallocation gating to respect the incoming token count and tile size, and made dummy expert-ID initialization dynamic to cover the configured number of experts at runtime.
Code Review

This pull request adds a check so that pre-allocated buffers are only used when the current batch size fits within the pre-allocated capacity, preventing potential buffer overflows. It also refactors the tuning configuration in CuteDslFusedMoENvfp4Runner from a class-level to an instance-level attribute, which allows for more accurate profiling by using the actual number of experts instead of a hardcoded value.
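As a rough illustration of that refactor, here is a minimal sketch; `TuningConfig` and `FusedMoERunner` are stand-in names for this example, not FlashInfer's actual API:

```python
from dataclasses import dataclass


@dataclass
class TuningConfig:
    # Stand-in for the runner's real tuning configuration.
    num_experts: int


class FusedMoERunner:
    # Before: a class-level attribute baked in at import time with a
    # hardcoded expert count, shared by every instance:
    #   tuning_config = TuningConfig(num_experts=8)

    def __init__(self, num_experts: int):
        # After: built per instance from the actual expert count, so
        # autotuner profiling matches the deployed model.
        self.tuning_config = TuningConfig(num_experts=num_experts)
```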
/bot run
Force-pushed from 97bbb5e to 4bb6303.
/bot cancel
Unknown command. See the bot command usage for supported commands.
/bot stop
The GitLab CI pipeline #48146400 has been cancelled.
/bot run
```python
),
tensor_initializers=[
    # 0: x — FP4 quantized input (uint8 packed)
    lambda shapes, dtype, device: torch.randint(
```
Do we need to fix the random seed to guarantee consistency here?
No, since these are throwaway dummy tensors for timing kernel execution during profiling. The values don't affect tactic selection.
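For context, a hedged sketch of what such throwaway initializers can look like; the shapes, dtypes, and `num_experts` value here are illustrative, not the PR's exact code:

```python
import torch

num_experts = 256  # illustrative; the fix reads this from the runner's config


def init_x(shapes, dtype, device):
    # FP4 values are packed two per byte, so any uint8 pattern is a
    # valid dummy input for timing the kernel.
    return torch.randint(0, 256, shapes[0], dtype=torch.uint8, device=device)


def init_expert_ids(shapes, dtype, device):
    # Spread dummy routing across all local experts rather than a
    # hardcoded 8; concentrated routing inflates permutation buffers.
    return torch.randint(0, num_experts, shapes[1], dtype=torch.int32, device=device)
```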
Force-pushed from 4bb6303 to 127868d.
@leejnau Thanks for fixing this!
📌 Description
`CuteDslMoEWrapper` pre-allocates intermediate buffers sized for `max_num_tokens`, but the autotuner can probe buckets larger than that (e.g. 8192 tokens vs a 2048 max). The GEMM kernels then write past the buffer bounds, silently corrupting model weights and eventually triggering `cudaErrorIllegalAddress`.
Two fixes:

1. Only take the pre-allocated path when the incoming token count fits the pre-allocated buffers; fall back to dynamic allocation when it is exceeded (see the sketch below).
2. Initialize dummy expert IDs across all local experts (`randint(0, num_experts)`) instead of a hardcoded 8, which concentrated routing and inflated permutation buffer sizes.
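A minimal sketch of the gating in fix 1, assuming a simple workspace abstraction; the class and method names are illustrative, not FlashInfer's actual code:

```python
import torch


class MoEWorkspace:
    """Illustrative pre-allocated workspace with a capacity-gated getter."""

    def __init__(self, max_num_tokens: int, hidden_size: int, device: str = "cuda"):
        self.max_num_tokens = max_num_tokens
        self.buf = torch.empty(max_num_tokens, hidden_size, device=device)

    def get(self, num_tokens: int) -> torch.Tensor:
        if num_tokens <= self.max_num_tokens:
            # Fits: hand out a view of the pre-allocated buffer.
            return self.buf[:num_tokens]
        # Autotuner bucket exceeds capacity (e.g. 8192 vs 2048 tokens):
        # allocate dynamically instead of writing past the buffer.
        return torch.empty(num_tokens, self.buf.shape[1], device=self.buf.device)
```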
🔍 Related Issues
feat: cuteDSL fp4 moe for better DSR1 performance.
🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

Reviewer Notes
Summary by CodeRabbit

- Performance Improvements: dummy expert IDs during profiling now cover the configured number of experts, avoiding artificially concentrated routing and inflated permutation buffers.
- Stability & Tuning: pre-allocated buffers are reused only when the incoming token count fits their capacity, with dynamic allocation as a fallback, and the tuning configuration is now built per run.