Undo fix to AutoTuner find_nearest_profile #2697
Conversation
📝 Walkthrough
This pull request reverts linked-dimension value propagation in `_find_nearest_profile`.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Summary of Changes (Gemini Code Assist)
This pull request addresses an issue in the autotuner by reverting a prior change within `_find_nearest_profile`.
Code Review
This pull request reverts a previous fix in _find_nearest_profile, which appears to re-introduce a bug. The change removes logic that updates all linked dimensions in a DynamicTensorSpec, instead only updating the first one. This breaks the intended behavior of linked dimensions and will likely cause tests to fail. I've provided a critical review comment recommending that this change be reverted and the original logic restored.
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/autotuner/test_autotuner_core.py (1)
134-139: Add a concrete tracking reference to these temporary skips. The skip reasons are clear, but at Line 136, Line 166, and Line 190 they don't include an issue/PR pointer for restoration. Please append a concrete reference (for example, the draft fix PR) so these don't silently become permanent.
Suggested tweak
```diff
 @pytest.mark.skip(
     reason=(
         "_find_nearest_profile linked-dimension mapping was reverted; "
-        "re-enable when linked-dim bucket propagation is restored."
+        "re-enable when linked-dim bucket propagation is restored (see PR #2695)."
     )
 )
```

Also applies to: 164-169, 188-193
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/autotuner/test_autotuner_core.py` around lines 134 - 139, The skip decorators using pytest.mark.skip that cite "_find_nearest_profile linked-dimension mapping was reverted; re-enable when linked-dim bucket propagation is restored." need a concrete tracking reference appended to their reason strings (e.g., "see PR `#1234`" or an issue ID) so the temporary skip can be traced; update each pytest.mark.skip instance in tests/autotuner/test_autotuner_core.py that mentions _find_nearest_profile (the three occurrences) to append a stable issue/PR pointer to the reason text.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@flashinfer/autotuner.py`:
- Around line 773-776: The code in _find_nearest_profile() currently maps only
the first linked dimension (using spec.input_idx[0] / spec.dim_idx[0]) which
diverges from _generate_optimization_profiles() that materializes buckets for
all linked dimensions; update _find_nearest_profile() to iterate over all linked
dimensions in spec.dim_idx / spec.input_idx and call
spec.map_to_tuning_buckets(...) for each corresponding
base_profile[input_idx][dim_idx] entry so the tuned profile keys match the
generated optimization profiles (alternatively, if this behavior should be
limited, gate the existing single-dimension mapping behind the specific
kernel/path check used by the TRTLLM MoE code path).
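For illustration, here is a minimal Python sketch of what this prompt asks for. It assumes `base_profile` is a list of per-input shape lists and that `spec.input_idx`, `spec.dim_idx`, and `spec.map_to_tuning_buckets` behave as the review text describes; the names come from the prompt, not from a verified reading of the flashinfer sources:

```python
# Sketch only, not flashinfer's actual code: propagate the tuning bucket to
# every linked dimension rather than only the first one, so the cache key
# matches the profiles that _generate_optimization_profiles materializes.
def apply_linked_dim_buckets(base_profile, spec):
    # spec.input_idx and spec.dim_idx are assumed to be parallel sequences
    # identifying each linked (input, dimension) pair.
    for input_idx, dim_idx in zip(spec.input_idx, spec.dim_idx):
        value = base_profile[input_idx][dim_idx]
        base_profile[input_idx][dim_idx] = spec.map_to_tuning_buckets(value)
```

The reverted code instead touched only `spec.input_idx[0]` / `spec.dim_idx[0]`, which is why tuned-profile keys can diverge from the generated optimization profiles.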
---
Nitpick comments:
In `@tests/autotuner/test_autotuner_core.py`:
- Around line 134-139: The skip decorators using pytest.mark.skip that cite
"_find_nearest_profile linked-dimension mapping was reverted; re-enable when
linked-dim bucket propagation is restored." need a concrete tracking reference
appended to their reason strings (e.g., "see PR `#1234`" or an issue ID) so the
temporary skip can be traced; update each pytest.mark.skip instance in
tests/autotuner/test_autotuner_core.py that mentions _find_nearest_profile (the
three occurrences) to append a stable issue/PR pointer to the reason text.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 00156c80-048e-4cb1-add9-75ab48a8992d
📒 Files selected for processing (2)
flashinfer/autotuner.py
tests/autotuner/test_autotuner_core.py
/bot run

[SUCCESS] Pipeline #45441918: 9/20 passed

bot run tests are clean

Can confirm SGLang + v0.6.5 also hits the issue.
…e launchers for all supported tileN in trtllm fused MoE (#2821)

## 📌 Description

It fixes two autotuner-related bugs:

1. Re-apply the autotuner fix that was reverted in #2697.
2. Fix the issue that #2697 revealed: the trtllm fused MoE kernel launcher crashes when it receives a tileN that is supported but was filtered out by `computeSelectedTileN`. The fix is to create kernel launchers for all supported tileN values.

This PR continues the work in #2695 by @danisereb to re-apply fix 1 and to fix bug 2.

More technical details:

### Bug 1

When given a num_tokens that isn't a power of 2, the autotuner (Python side) fails to find its appropriate entry in the autotuner cache, so it falls back to the default, which means passing `[-1, -1]` as the `(tileN, tactic)` to the C++ side. It was fixed in [this PR](https://github.com/flashinfer-ai/flashinfer/pull/2617/changes#diff-1964ab957d8185d04b0d5f0cb02d0c7c0a3260ac0a6c573167af6875ab0b0e87L729-L734) but soon after merge it was reverted [here](#2697), as it exposed the next bug.

### Bug 2 (exposed after fixing bug 1)

Crash in the fused MoE kernel launcher on the forward pass for some values of num_tokens. The crash is at `launchers_map.at(tile_N)` in `trtllm_fused_moe_kernel_launcher.cu`. It happens because:

The Python side of the autotuner profiles num_tokens values that are powers of 2, and each such value represents the range up to the next power of 2. E.g., the profile for the range `[2048, 4095]` is done with num_tokens=2048.

The `computeSelectedTileN` function in `trtllm_fused_moe_kernel_launcher.cu` reduces the set of supported tileN values (to shrink the autotuner's search space) by choosing specific values from the sorted list of supported tileN values: `roundUpToPowerOfTwo(num_tokens * topK / numExperts)`, the value before it, and the next 2 values (max value is 256). So values in the same range can get different sets of tileN values. For example, on Nemotron 3 Super NVFP4:

- `num_tokens=2048` -> `2048*22/512 = 88`, which rounds up to 128, so the tileN set is `(64, 128, 256)`
- `num_tokens=3003` -> `3003*22/512 = 129.03`, which rounds up to 256, so the tileN set is `(128, 256)`

If `tileN=64` was found to be the fastest at `num_tokens=2048` for the range `[2048, 4095]`, then when given `num_tokens=3003` the Python side would pass `[64, someTactic]` to the C++ side, but for `num_tokens=3003` there is no launcher for `tileN=64`, as `computeSelectedTileN` filtered it out.

## Summary by CodeRabbit

* **Bug Fixes**
  * Stricter MoE tile validation and ensured all supported tiles are available at launch to avoid missing kernel configurations.
  * Autotuner mapping for linked dynamic dimensions now yields consistent cached bucket values.
* **Tests**
  * Added SM100 MoE autotuner integration tests (including invalid-cached-tactic checks).
  * Re-enabled and expanded autotuner unit tests and added a test utility to reset the autotuner.

Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
Co-authored-by: Daniel Serebrenik <daserebrenik@nvidia.com>
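To make the filtering concrete, here is a Python re-statement of the selection rule exactly as the description above gives it; the real `computeSelectedTileN` is C++ code in `trtllm_fused_moe_kernel_launcher.cu` and may differ in edge cases, and the supported-tileN list below is a hypothetical example:

```python
import math

def round_up_to_power_of_two(x: float) -> int:
    return 1 << max(0, math.ceil(math.log2(x)))

def selected_tile_n(supported, num_tokens, top_k, num_experts, cap=256):
    # Target tileN per the description: roundUpToPowerOfTwo(num_tokens * topK
    # / numExperts), capped at 256; keep the target, its predecessor, and the
    # next two values from the sorted supported list.
    target = min(round_up_to_power_of_two(num_tokens * top_k / num_experts), cap)
    supported = sorted(t for t in supported if t <= cap)
    i = supported.index(target) if target in supported else len(supported) - 1
    return supported[max(0, i - 1) : i + 3]

# Nemotron 3 Super NVFP4 numbers from the description (supported list assumed):
print(selected_tile_n([8, 16, 32, 64, 128, 256], 2048, 22, 512))  # [64, 128, 256]
print(selected_tile_n([8, 16, 32, 64, 128, 256], 3003, 22, 512))  # [128, 256]
```

Both calls fall in the same profiled bucket `[2048, 4095]`, yet they yield different tileN sets, which is exactly the mismatch that made `launchers_map.at(tile_N)` throw.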
📌 Description
PR #2617 added a fix that solves "using fallback tactic" for TRTLLM MoE kernels.
But after running more tests (`lm_eval`) with flashinfer v0.6.5, another issue was found: an error from the C++ file `csrc/trtllm_fused_moe_kernel_launcher.cu` (key not found in `launchers_map.at(tile_N)`). Fixing this is probably not simple; more details are in this draft PR (NOT for v0.6.6): #2695
In order to prevent the crash, the change in `_find_nearest_profile` will be reverted (to match flashinfer v0.6.4). The relevant AutoTuner tests were marked with "skip".
The rest of the AutoTuner tests all pass.
Using this branch, the failure from `trtllm_fused_moe_kernel_launcher.cu` does not happen. vLLM main still uses flashinfer v0.6.4 (which does not include PR #2617).
This change should be included in flashinfer v0.6.6 (for use by vLLM).
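To see why the original bug produces the fallback tactic, here is a minimal sketch of a bucket-keyed lookup; the cache contents and all names here are hypothetical, with only the power-of-2 bucketing and the `[-1, -1]` default taken from the descriptions above:

```python
# The autotuner cache is keyed by the power-of-2 bucket that was profiled,
# so looking up a raw num_tokens such as 3003 misses unless the value is
# first rounded down to its bucket (2048 covers the range [2048, 4095]).
def bucket_of(num_tokens: int) -> int:
    return 1 << (num_tokens.bit_length() - 1)  # round down to a power of 2

FALLBACK = (-1, -1)      # default (tileN, tactic) passed to the C++ side
cache = {2048: (64, 3)}  # hypothetical tuned entry, keyed by the bucket

print(cache.get(3003, FALLBACK))             # (-1, -1): raw value misses
print(cache.get(bucket_of(3003), FALLBACK))  # (64, 3): bucketed value hits
```

The fix in PR #2617 made the lookup use the bucketed value; reverting it restores the fallback behavior but avoids the `launchers_map.at(tile_N)` crash.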
🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- I have installed the hooks with `pre-commit install`.
- I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests

- All tests are passing (`unittest`, etc.).

Reviewer Notes