fix(cute_dsl/moe): make autotuner bucket configuration adapt to runtime input #3216
Conversation
📝 Walkthrough: The CuteDSL NVFP4 MoE tuner's dynamic token bucketing is changed to use an uncapped hybrid bucket generator and mapping.
Code Review
This pull request updates the autotuner configuration in flashinfer/fused_moe/cute_dsl/tuner.py to use dynamic bucket generation by passing bare callables, enabling the tuner to adapt to varying input dimensions without hardcoded caps. Additionally, it introduces a comprehensive test suite in tests/moe/test_cute_dsl_fused_moe.py to validate the structural integrity and behavior of the bucket configuration. A review comment identifies a potential ImportError in the new tests caused by a missing utility function.
    power-of-2 boundary up to the input dim.
    """
    from flashinfer.fused_moe.utils import (
        get_last_power_of_2_num_tokens_buckets,
The function get_last_power_of_2_num_tokens_buckets is imported here but does not appear to be defined in flashinfer/fused_moe/utils.py. This will cause the test test_gen_tuning_buckets_covers_trtllm_power_of_2_points to fail with an ImportError. Please verify if this function was intended to be added to utils.py in this PR or if it should be replaced with an existing function.
…2026-05-01) The final-state entries at line 15, line 51, and the line-4343 PR queue annotation now show the bucket-cap-fix's upstream state: opened as Draft PR flashinfer-ai#3216 on the post-flashinfer-ai#3171 main rebase, HEAD `1e3e217b` = tests on top of `1058280b` = fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fix(cute_dsl/moe): make autotuner bucket configuration adapt to runtime input
The autotuner's `DynamicTensorSpec` at
`flashinfer/fused_moe/cute_dsl/tuner.py` declared a dynamic-token-count
spec with `gen_tuning_buckets` as a pre-computed tuple
(`get_hybrid_num_tokens_buckets(8192)`) and `map_to_tuning_buckets` as
a lambda that capped at 8192 (`lambda x: map_to_hybrid_bucket(x,
8192)`). So when a model serves at num_tokens > 8192 — the DeepSeek-V3
prefill case at N=16384, for example — the runtime input mapped to
bucket=8192 and used the cached tactic that was profiled at half the
per-expert workload.
This produced a profile-shape vs runtime-shape mismatch: at
profile-time bucket=8192 with EP=8 the per-expert work is ~256 tokens,
where tile_size=128 wins by a tight ~0.58% margin over tile_size=256.
At runtime N=16384 the per-expert work doubles to ~512 tokens and
tile_size=256 wins more decisively. The cached choice from the
8192-shape profile was suboptimal for the larger runtime workload.
This change replaces the pre-computed-tuple form with the bare-callable
form, and switches `map_to_tuning_buckets` to the uncapped variant
`map_to_hybrid_bucket_uncapped` that was added alongside the hybrid-
bucket scheme exactly for this case. flashinfer's autotuner already
supports this: at `flashinfer/autotuner.py:1024` it inspects
`gen_tuning_buckets` and invokes it with the actual input dim at
autotune time when the value is a function. With the bare callable,
the bucket set adapts to the workload — no hardcoded cap, no magic
number, future-proof at any N.
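The capped-vs-uncapped difference can be sketched as follows. The helpers below are toy stand-ins for `get_hybrid_num_tokens_buckets` / `map_to_hybrid_bucket` (a pure power-of-2 scheme rather than the actual hybrid one), so only the capped-vs-uncapped behavior is faithful to the change described above:

```python
# Toy sketch, not flashinfer's actual code: how a pre-computed bucket
# tuple caps the mapping while an uncapped mapping tracks the input dim.
def gen_power_of_2_buckets(max_n):
    """Stand-in bucket generator: powers of 2 up to max_n."""
    buckets, b = [], 1
    while b <= max_n:
        buckets.append(b)
        b *= 2
    return tuple(buckets)

def map_capped(x, cap):
    """Old form: clamp to the cap, then round down to a bucket."""
    x = min(x, cap)
    return max(b for b in gen_power_of_2_buckets(cap) if b <= x)

def map_uncapped(x):
    """New form: round down to the nearest bucket, no cap."""
    b = 1
    while b * 2 <= x:
        b *= 2
    return b

# Old behavior: N=16384 collapses onto the 8192 bucket and reuses its tactic.
assert map_capped(16384, 8192) == 8192
# New behavior: the bucket follows the actual input dim.
assert map_uncapped(16384) == 16384
```

With the bare-callable form the autotuner generates the bucket set from the actual input dim at autotune time, so the uncapped mapping always has a profiled bucket to land on.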
This matches:
- TRT-LLM's pattern at
`tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py:2390-2391` and
`2700-2703` (CuteDSLFp8BlackwellRunner / -BmmRunner).
- flashinfer's own pattern at
`gemm/gemm_base.py:_FP8_GEMM_SM100_TUNING_CONFIG` and six other
callsites in `gemm_base.py` and `trtllm_low_latency_gemm.py`. The
uncapped helper `map_to_hybrid_bucket_uncapped` was introduced in
that same code area for exactly this purpose; the CuteDSL MoE's
tuner was the one place that wasn't migrated to use it.
Empirical impact at N=16384 with --num-iters 100 --warmup 10 (3 runs
each, on a Blackwell B200):
Without this fix (prealloc-fix only — see related side branch
cute-dsl-moe-wrapper-prealloc-bias-fix):
EP=8: Δ% = +8.4% / -2.7% / +7.5% (mean +4.4%, spread 11pp)
EP=16: Δ% = +8.5% / -2.0% / +0.6% (mean +2.4%, spread 10.5pp)
With this fix + prealloc-fix:
EP=8: Δ% = -0.7% / -1.4% / -0.3% (mean -0.8%, spread 1.1pp)
EP=16: Δ% = -9.6% / +0.3% / -7.6% (mean -5.6%, spread 10pp)
(Δ% measured by `benchmarks/bench_cute_dsl_port_parity.py` against
TensorRT-LLM 1.3.0rc5.post2 in the same process.)
The bucket-cap mismatch was the load-bearing cause of the EP>1 perf
gap at N=16384; removing the hardcoded cap closes it. The remaining
EP=16 10pp spread under "both fixes" is trt's autotune coin-flip on
its own 0.08% tile=128 vs tile=256 profile margin, not a fi-side issue.
At all N ≤ the autotune-time max-N: identical to the previous (capped)
form — same buckets, same cache lookup, same tactic selection.
At N > autotune-time max-N (cache miss case): the previous form mapped
to the cap (8192) and reused that bucket's tactic; this form returns
the actual N. In practice the user calls autotune warmup at the
maximum expected N (`CuteDslMoEWrapper` standard usage), so cache
misses shouldn't occur.
Fully observable perf impact requires the wrapper prealloc-bias fix
(side branch `cute-dsl-moe-wrapper-prealloc-bias-fix`) to be applied
as well — that fix removes the autotune bias that locks fi to
tile=128 in 14/14 cache entries. Without it, this patch is a no-op
since fi can't pick tile=256 even when its profile shape suggests it.
The two patches are independent and can land in either order. PR
flashinfer-ai#3171 (the prerequisite gemm2 tactic enumeration fix that addresses
issue flashinfer-ai#3067) has already merged into main as commit `070fabf0`, so
tile_size=256 is correctly enumerated and this patch is unblocked.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nfiguration
Adds six no-GPU pytest cases at
`tests/moe/test_cute_dsl_fused_moe.py::TestAutotunerBucketConfig`
guarding the autotuner bucket-cap fix and locking in the load-bearing
behavioral parity with TRT-LLM's pattern at
`cute_dsl_custom_ops.py:2390-2391` and `2700-2703`.
Three "no hardcoded cap" regression guards (the load-bearing
property of the fix):
1. `test_gen_tuning_buckets_is_callable_not_static_tuple` — pins
`gen_tuning_buckets` on the runner's `tuning_config` to be a bare
callable, not a pre-computed tuple.
2. `test_gen_tuning_buckets_no_hardcoded_8192_cap` — verifies that
calling the configured `gen_tuning_buckets` with input dims 8192,
16384, and 32768 produces bucket sets whose maximum reflects the
input value.
3. `test_map_to_tuning_buckets_above_8192_not_capped` — verifies
that `map_to_tuning_buckets(x)` for x ∈ {16384, 32768, 65536}
doesn't cap at 8192. Ensures we use `map_to_hybrid_bucket_uncapped`
instead of `lambda x: map_to_hybrid_bucket(x, 8192)`.
Three TRT-LLM-parity regression guards (lock in the
behavioral-equivalence-where-achievable):
4. `test_map_to_tuning_buckets_phase1_matches_trtllm_at_powers_of_2` —
pins fi/trt-llm parity at power-of-2 inputs ≤ 256 (hybrid Phase 1,
where pure power-of-2 spacing is preserved). At these inputs,
fi's `map_to_tuning_buckets(x)` must equal x and equal
`last_positive_power_of_2(x)` (TRT-LLM's pattern).
5. `test_map_to_tuning_buckets_is_monotonic` — pins monotonic
non-decreasing behavior across hybrid Phases 1-4. TRT-LLM's
`last_positive_power_of_2` and fi's `map_to_hybrid_bucket_uncapped`
both satisfy this; catches a regression that would introduce
non-monotonic mapping.
6. `test_gen_tuning_buckets_covers_trtllm_power_of_2_points` — pins
that fi's hybrid bucket set is a SUPERSET of TRT-LLM's power-of-2
bucket set at every max_n tested. The hybrid scheme intentionally
adds intermediate linear-step buckets in Phase 2/3 (per PR flashinfer-ai#3115's
perf rationale) but must preserve the coarse-grained power-of-2
coverage TRT-LLM has.
These six tests together enforce: (a) no hardcoded cap, (b) callable
form, (c) TRT-LLM-equivalence at power-of-2 probe points, (d)
monotonicity, (e) coarse-grained coverage parity with TRT-LLM. The
hybrid-vs-power-of-2 deviation in Phase 2/3/4 is intentional and
documented (PR flashinfer-ai#3115); the tests don't enforce parity in those phases
because that would regress fi's deliberate perf optimization.
All tests are pure-Python and run without a GPU. They construct a
`CuteDslFusedMoENvfp4Runner` with a no-op `forward_impl` to inspect
its `tuning_config`; no GPU, no CuteDSL kernel binaries, no autotune
side effects.
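Guard (5) above can be sketched as a pure-Python test; `map_to_bucket` is a hypothetical stand-in (the real test calls the `map_to_tuning_buckets` configured on the runner's `tuning_config`):

```python
# Hypothetical sketch of the monotonicity guard: the bucket mapping must
# be non-decreasing across ordered probe points spanning all phases.
def map_to_bucket(x):
    # Stand-in mapping: round down to the nearest power of 2.
    b = 1
    while b * 2 <= x:
        b *= 2
    return b

def test_map_is_monotonic():
    xs = [1, 3, 8, 100, 300, 5000, 8192, 16384, 65536]
    ys = [map_to_bucket(x) for x in xs]
    # Pairwise check over adjacent probe points.
    for prev_y, curr_y in zip(ys, ys[1:]):
        assert prev_y <= curr_y, "mapping must be non-decreasing"

test_map_is_monotonic()
```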
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address gemini-code-assist review on PR flashinfer-ai#3216: the test was importing `get_last_power_of_2_num_tokens_buckets` from `flashinfer.fused_moe.utils`, but PR flashinfer-ai#3115 (merged 2026-04-24) removed that function in favor of the hybrid bucket scheme. The import would have caused an ImportError when the test was collected.

Replace the call with an equivalent inline construction that mirrors TRT-LLM's `get_last_power_of_2_num_tokens_buckets` (in `tensorrt_llm/_torch/utils.py:291`): powers of 2 from 1 up to `last_positive_power_of_2(max_n)`. `last_positive_power_of_2` is still available in `flashinfer.fused_moe.utils`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
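The inline construction the commit above describes can be sketched as follows; the names mirror the commit, and `last_positive_power_of_2` is assumed to mean "largest power of 2 ≤ n":

```python
# Hedged sketch of the suggested replacement: build TRT-LLM's power-of-2
# bucket list inline instead of importing the removed helper.
def last_positive_power_of_2(n):
    # Assumed semantics: largest power of 2 <= n, for n >= 1.
    p = 1
    while p * 2 <= n:
        p *= 2
    return p

def power_of_2_buckets(max_n):
    # Powers of 2 from 1 up to last_positive_power_of_2(max_n).
    buckets, b = [], 1
    while b <= last_positive_power_of_2(max_n):
        buckets.append(b)
        b *= 2
    return tuple(buckets)

assert power_of_2_buckets(10) == (1, 2, 4, 8)
```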
🧹 Nitpick comments (1)

tests/moe/test_cute_dsl_fused_moe.py, lines 569-577: ⚡ Quick win. Remove the redundant `strict=False` keyword arguments from the `zip()` calls. `strict=False` is `zip()`'s default behavior and adds no functional value; removing it simplifies the code. Note that this is a code-quality improvement, not a compatibility fix: the project requires Python 3.10+, where `strict` is available.

Suggested fix:

    - for prev_x, prev_y, curr_x, curr_y in zip(
    -     test_xs, results, test_xs[1:], results[1:], strict=False
    - ):
    + for prev_x, prev_y, curr_x, curr_y in zip(
    +     test_xs, results, test_xs[1:], results[1:]
    + ):
          assert prev_y <= curr_y, (
              f"map_to_tuning_buckets must be monotonically "
              f"non-decreasing; got map({prev_x})={prev_y} > "
              f"map({curr_x})={curr_y}. Full mapping at probe "
    -         f"points: {list(zip(test_xs, results, strict=False))}."
    +         f"points: {list(zip(test_xs, results))}."
          )
📥 Commits
Reviewing files that changed from the base of the PR and between 68d2b66 and a0af65aa2d4cfe39aca217aaa9fa5af1617627d3.
📒 Files selected for processing (2)

flashinfer/fused_moe/cute_dsl/tuner.py
tests/moe/test_cute_dsl_fused_moe.py
/bot run
Updates the three audit-doc references to PR flashinfer-ai#3216's draft status (line 15, line 51, line 4369 PR queue annotation) to reflect the promotion. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
    dim_idx=(0, 0, 0, 0, 0),
    gen_tuning_buckets=get_hybrid_num_tokens_buckets(8192),
    map_to_tuning_buckets=lambda x: map_to_hybrid_bucket(x, 8192),
    # Pass bucket generators as bare callables (matching
The TRT-LLM line numbers (2390-2391, 2700-2703) will go stale, and the code is self-explanatory given the function names. I'd trim to 1-2 lines max, e.g.:

    # bare callables, autotuner adapts bucket set to actual input dim
    # (matches gemm_base.py _FP8_GEMM_SM100_TUNING_CONFIG pattern).
    tuple/sequence that bakes in a hardcoded cap.
    """
    runner = self._make_runner()
    spec = runner.tuning_config.dynamic_tensor_specs[0]
If `gen_tuning_buckets` is a tuple, `callable(tuple_instance)` is already False, so the first assertion fails before the second is ever reached; the second assertion is dead. Could you check? Maybe either remove it or swap the order.
qiching left a comment

Every test method calls `self._make_runner()` independently. Since the runner is stateless for these checks, I'd recommend a `@pytest.fixture` to reduce boilerplate and test runtime.
Three changes in response to qiching's review:

1. `tuner.py`: trim the verbose bucket-config comment. Drop the TRT-LLM line numbers that will go stale; keep a one-line pointer to flashinfer's own `_FP8_GEMM_SM100_TUNING_CONFIG` pattern in `gemm_base.py`.

2. `tests/moe/test_cute_dsl_fused_moe.py`: collapse the dead second assertion in `test_gen_tuning_buckets_is_callable_not_static_tuple`. `callable(tuple_instance)` is already `False`, so the `not isinstance(..., tuple)` check was unreachable. A single `callable()` check now carries the full message (including the "pre-computed sequence likely indicates a hardcoded cap" hint).

3. `tests/moe/test_cute_dsl_fused_moe.py`: replace the `_make_runner` static method + per-test reconstruction with a module-scoped `bucket_spec` pytest fixture. Reduces boilerplate and avoids reconstructing the runner once per test method (the runner is stateless for these checks).

Also genericized two stale TRT-LLM line-number references in test docstrings (`cute_dsl_custom_ops.py:2390-2391`), the same staleness concern as item 1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
qiching left a comment

Good: the module-scoped `bucket_spec` pytest fixture that replaces `_make_runner` and the per-test runner construction is an improvement. Module scope means the runner is constructed only once for the entire test module, making the suite more efficient.
aleozlx left a comment

Reviewed; will wait for the bot run.
`CuteDslMoEWrapper.__init__` pre-allocates `_gemm1_output`,
`_gemm1_output_scale`, and `_moe_sort_buffers` sized for
`self.tile_size` only. The `use_prealloc` gate in
`_forward_with_tactic` honors prealloc only when the probed tactic's
`tile_size` matches `self.tile_size`:
use_prealloc = (
self.use_cuda_graph
and tile_size == self.tile_size
and num_tokens <= self.max_num_tokens
)
During autotune profiling, mismatched tactics fall through to dynamic
`torch.empty()` per-call allocation. The autotuner is then comparing
tactic latencies that include asymmetric allocation overhead — tactics
matching `self.tile_size` run on the prealloc, others pay the alloc
cost — so it consistently picks the matching `tile_size` even when
intrinsic kernel performance favors the other.
Empirical signature pre-fix at EP=8/16, N=16384: fi locks to
`tile_size=128` in 14 of 14 autotune cache entries. TRT-LLM at the
same shapes picks `tile_size=256` more often, producing a +5-9%
headline gap from the tactic mismatch.
Fix — three coordinated changes:
1. `tuner.py`: lift the hardcoded `[128, 256]` tile_size list to a
module-level `VALID_TILE_SIZES` tuple. Single source of truth for
tactic enumeration AND prealloc sizing. Adding a new tile_size
here automatically widens the prealloc.
2. `fused_moe.py:_allocate_buffers`: size buffers to fit any
`tile_size in VALID_TILE_SIZES`. `max_num_permuted_tokens` is
monotonically increasing in `tile_size` (use
`max(VALID_TILE_SIZES)`); `max_num_tiles` is monotonically
decreasing (use `min(VALID_TILE_SIZES)`). Override
`out_permuted_idx_to_expanded_idx` independently to fit the
largest tile's `max_num_permuted_tokens`.
3. `fused_moe.py:_forward_with_tactic`: change the prealloc gate from
`tile_size == self.tile_size` to `tile_size in VALID_TILE_SIZES`.
Both tactic groups now reuse the prealloc; profiling is unbiased.
Net: post-fix, the autotuner picks the higher-throughput tactic at
each shape on its merits, matching TRT-LLM's choice at large N.
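The sizing argument in change (2) can be illustrated with a worked sketch. The formulas below are assumptions for illustration, not flashinfer's exact code (the real `_allocate_buffers` derives these from the model config); only the two monotonicity claims and the resulting max/min choices are the point:

```python
# Illustrative sketch: sizing prealloc buffers to fit every tile_size.
import math

VALID_TILE_SIZES = (128, 256)

def max_num_permuted_tokens(num_tokens, num_experts, tile_size):
    # Worst case: each expert's token slice pads up to a full tile, so
    # the padded total GROWS with tile_size (monotonically increasing).
    return num_tokens + num_experts * (tile_size - 1)

def max_num_tiles(num_tokens, num_experts, tile_size):
    # Larger tiles mean FEWER tiles (monotonically decreasing).
    permuted = max_num_permuted_tokens(num_tokens, num_experts, tile_size)
    return math.ceil(permuted / tile_size)

# Permuted-token-indexed buffers: size for the LARGEST tile_size.
permuted_cap = max_num_permuted_tokens(16384, 32, max(VALID_TILE_SIZES))
# Tile-count-indexed buffers: size for the SMALLEST tile_size.
tile_cap = max_num_tiles(16384, 32, min(VALID_TILE_SIZES))

# The monotonicity that justifies max()/min(): every other tile_size fits.
assert max_num_permuted_tokens(16384, 32, 128) <= permuted_cap
assert max_num_tiles(16384, 32, 256) <= tile_cap
```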
## Tests
Adds two test classes in `tests/moe/test_cute_dsl_fused_moe.py`:
- `TestPreallocSizingFormula` (3 tests, no-GPU): pins the math that
justifies the fix. Guards (1) `VALID_TILE_SIZES` has more than one
entry (otherwise the bias-prevention is moot), (2) the
monotonicity of `max_num_permuted_tokens` in `tile_size` (justifies
`max(VALID_TILE_SIZES)` for the permuted-token-indexed buffers),
and (3) the opposite monotonicity of `max_num_tiles` (justifies
`min(VALID_TILE_SIZES)` for tile-count-indexed buffers). 5
parametrized shape configurations covering DeepSeek-V3 EP=1/8/16/32
+ a generic mid-size shape.
- `TestPreallocBuffersIntegration` (2 tests, GPU/SM100 required):
constructs a real `CuteDslMoEWrapper(use_cuda_graph=True)`. The
first test verifies the prealloc'd buffers fit the workload at
every `tile_size in VALID_TILE_SIZES` (not just the
constructor-time `self.tile_size`). The second test monkey-patches
the module-level `_moe_core_impl` to capture the buffer-passing
decision and verifies the `use_prealloc` gate honors every
`tile_size in VALID_TILE_SIZES`, not just `self.tile_size` —
directly pinning the load-bearing property of the fix.
## Pairs with PR flashinfer-ai#3216
Pairs with PR flashinfer-ai#3216 (autotuner bucket-cap fix). Both required to
fully close the EP>1 perf gap empirically; validated at
`--num-iters 100` on B200 across EP=8/16, N=16384.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/bot run
PR flashinfer-ai#3216 merged 2026-05-06 22:09 UTC as squash-commit `e6ac7cc2`. Replaces the pre-computed-tuple bucket cap with a bare-callable form that adapts to runtime input dim. Pairs with the prealloc-fix (now rebased onto post-flashinfer-ai#3216 main as HEAD `c7a81fdb`, ready to PR). Updates the line-15 final-state entry, the line-51 perf-investigation-closed paragraph, and the line-4369 PR queue annotation in follow-up flashinfer-ai#12. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`CuteDslMoEWrapper.__init__` pre-allocates `_gemm1_output`,
`_gemm1_output_scale`, and `_moe_sort_buffers` sized for
`self.tile_size` only. The `use_prealloc` gate in
`_forward_with_tactic` honors prealloc only when the probed tactic's
`tile_size` matches `self.tile_size`:
use_prealloc = (
self.use_cuda_graph
and tile_size == self.tile_size
and num_tokens <= self.max_num_tokens
)
During autotune profiling, mismatched tactics fall through to dynamic
`torch.empty()` per-call allocation. The autotuner is then comparing
tactic latencies that include asymmetric allocation overhead — tactics
matching `self.tile_size` run on the prealloc, others pay the alloc
cost — so it consistently picks the matching `tile_size` even when
intrinsic kernel performance favors the other.
Empirical signature pre-fix at EP=8/16, N=16384: fi locks to
`tile_size=128` in 14 of 14 autotune cache entries. TRT-LLM at the
same shapes picks `tile_size=256` more often, producing a +5-9%
headline gap from the tactic mismatch.
Fix — three coordinated changes:
1. `tuner.py`: lift the hardcoded `[128, 256]` tile_size list to a
module-level `VALID_TILE_SIZES` tuple. Single source of truth for
tactic enumeration AND prealloc sizing. Adding a new tile_size
here automatically widens the prealloc.
2. `fused_moe.py:_allocate_buffers`: size buffers to fit any
`tile_size in VALID_TILE_SIZES`. `max_num_permuted_tokens` is
monotonically increasing in `tile_size` (use
`max(VALID_TILE_SIZES)`); `max_num_tiles` is monotonically
decreasing (use `min(VALID_TILE_SIZES)`). Override
`out_permuted_idx_to_expanded_idx` independently to fit the
largest tile's `max_num_permuted_tokens`.
3. `fused_moe.py:_forward_with_tactic`: change the prealloc gate from
`tile_size == self.tile_size` to `tile_size in VALID_TILE_SIZES`.
Both tactic groups now reuse the prealloc; profiling is unbiased.
Net: post-fix, the autotuner picks the higher-throughput tactic at
each shape on its merits, matching TRT-LLM's choice at large N.
## Tests
Adds two test classes in `tests/moe/test_cute_dsl_fused_moe.py`:
- `TestPreallocStaticInvariants` (1 test, no-GPU): pins
`VALID_TILE_SIZES` to enumerate more than one tile_size. Catches
the orthogonal failure mode where a future refactor reduces the
constant to a single entry — in that case the GPU integration
tests below would pass trivially (no max/min divergence, only one
tile_size to gate-check) and the bias-prevention silently
disappears.
- `TestPreallocBuffersIntegration` (2 tests, GPU/SM100 required):
constructs a real `CuteDslMoEWrapper(use_cuda_graph=True)`. The
first test verifies the prealloc'd buffer shapes fit the workload
at every `tile_size in VALID_TILE_SIZES` — directly empirically
pinning the buffer-sizing contract. The second test
monkey-patches the module-level `_moe_core_impl` to capture the
buffer-passing decision and verifies the `use_prealloc` gate
honors every `tile_size in VALID_TILE_SIZES`, not just
`self.tile_size` — directly pinning the load-bearing property of
the fix.
## Pairs with PR flashinfer-ai#3216
Pairs with PR flashinfer-ai#3216 (autotuner bucket-cap fix, merged 2026-05-06).
Both required to fully close the EP>1 perf gap empirically;
validated at `--num-iters 100` on B200 across EP=8/16, N=16384.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extend the `use_prealloc` gate in `_forward_with_tactic` with `not AutoTuner.get().is_tuning_mode` so the wrapper bypasses its preallocated buffers during autotune profiling. All tactics then see the same per-call `torch.empty()` allocation overhead and the autotuner's tactic comparison is unbiased; outside the `autotune(True)` context the gate behaves as before: prealloc when `tile_size == self.tile_size`, fall through otherwise.

This replaces an earlier approach in this branch that widened the preallocated buffers to fit every valid tile_size; the new approach decouples `self.tile_size` from autotune-time allocation without expanding the prealloc layout, so `tuner.py` is left untouched and `_allocate_buffers` reverts to its pre-PR shape.

Pairs with PR flashinfer-ai#3216 (autotuner bucket-cap fix). Both required to fully close the EP>1 perf gap empirically; validated at `--num-iters 100` on B200 across EP=8/16, N=16384.

## Tests

Replaces the prior buffer-shape and structural-invariant tests with a single GPU/SM100 test (`TestPreallocGateUnderTuning`) that constructs a real `CuteDslMoEWrapper(use_cuda_graph=True)`, monkey-patches the module-level `_moe_core_impl` to capture the `moe_sort_buffers` argument across {inside `autotune(True)`, outside} × {`tile_size == self.tile_size`, mismatch}, and asserts:

- Inside `autotune(True)`: the gate skips prealloc for every tactic, pinning the unbiased-measurement property.
- Outside `autotune(True)`: the gate uses prealloc when the picked tactic's `tile_size` matches `self.tile_size`; skips it otherwise (the prealloc layout would be wrong for the other tile_size).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the wrapper's `not AutoTuner.get().is_tuning_mode` gate clause with `not is_in_profile_measurement()`. The new signal is strictly narrower than `is_tuning_mode`: it is True only on the calling thread, and only inside the autotuner's per-tactic measurement window (warmup + timed run inside `_profile_single_kernel`). It is False during cache lookups, `do_preparation` calls, the runner invocation immediately after `choose_one` returns, and other threads' inference, all of which the broader `is_tuning_mode` flag swept up incorrectly.

## Why narrower

`AutoTuner.is_tuning_mode` is True for the whole `autotune(True)` context, regardless of whether the autotuner is actively timing a specific tactic. Reading it from the wrapper meant that:

1. Cache hits for ops already tuned (where no measurement happens) bypassed prealloc anyway.
2. The runner invocation that uses the chosen tactic immediately after `choose_one` returns, still inside the `with autotune(True):` block, bypassed prealloc.
3. Concurrent threads doing inference while another thread held the tuning context bypassed prealloc.
4. CUDA-graph capture happening inside an `autotune(True)` block would record per-call `torch.empty()` calls instead of preallocs.

None of these are situations where unbiased measurement matters; they all benefit from prealloc. `is_in_profile_measurement()` excludes them while still serving the original intent: during the actual measurement window, every tactic sees the same per-call allocation overhead and the autotuner's tactic comparison is unbiased.

## What changes in autotuner.py

- New module-level `_profile_measurement_thread_local` (a `threading.local`).
- New `_profile_measurement_scope` context manager (private; sets the thread-local on entry, restores the prior value on exit, supports nesting).
- New `is_in_profile_measurement()` accessor (public; reads the thread-local; returns False on threads that never entered the scope).
- `AutoTuner._profile_single_kernel` wraps its warmup + timed run with `_profile_measurement_scope()` so every runner invocation inside the measurement function sees the flag True; runner invocations elsewhere in `choose_one` (cache search, `do_preparation`, the post-loop `search_cache` call) see it False.

The change is purely additive: the new helpers don't alter `AutoTuner`'s class state, and no other autotune callers are affected.

## Tests

Replaces the prior `TestPreallocGateUnderTuning` test with a broader contract check that exercises three contexts:

1. Inside `autotune(True)` AND inside `_profile_measurement_scope()` (simulating a tactic measurement): the gate must skip prealloc for every tactic, regardless of tile_size match.
2. Inside `autotune(True)` but OUTSIDE the measurement scope (simulating a cache hit, the `do_preparation` call, or the post-`choose_one` runner invocation): the gate must use prealloc when `tile_size == self.tile_size`, skip otherwise. This is the property that distinguishes the narrow signal from the broad one.
3. Outside any tuning context (plain inference): same as case 2.

Pairs with PR flashinfer-ai#3216 (autotuner bucket-cap fix). Both required to fully close the EP>1 perf gap empirically; validated at `--num-iters 100` on B200 across EP=8/16, N=16384.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Builds on the narrow ``is_in_profile_measurement()`` gate from the
prior commit by also expanding ``CuteDslMoEWrapper._allocate_buffers``
to size kernel-output buffers for *any* ``tile_size in
VALID_TILE_SIZES``, not just the constructor-time ``self.tile_size``.
## Why
Now that the autotuner profiles tactics unbiasedly (the prior commit's
narrower gate), it can fairly pick a tactic with ``tile_size != self.tile_size``
when that's the higher-throughput choice — at large N this is the
common case (``tile_size=256`` typically wins). But with the prior
commit alone, the wrapper's prealloc was still sized for
``self.tile_size`` and the gate fell through to per-call
``torch.empty()`` whenever the picked tactic mismatched. Two real
problems:
1. **CUDA-graph contract**: the wrapper's ``run()`` is documented as
graph-safe with ``use_cuda_graph=True``, but per-call ``torch.empty``
means captured graphs record the alloc inside the graph instead of
binding to the prealloc. PyTorch's graph private memory pool
accommodates this since 1.10, but it's not what the contract
promises.
2. **Buffer-overflow correctness**: ``max_num_permuted_tokens`` is
monotonically increasing in ``tile_size``, so a tile_size=128-sized
buffer is too small for a tile_size=256 tactic. The fall-through to
per-call alloc isn't a perf-hygiene choice — it's required for
correctness given the smaller sizing.
## Fix — three coordinated changes
1. ``tuner.py``: lift the hardcoded ``[128, 256]`` tile_size list to
a module-level ``VALID_TILE_SIZES`` tuple. Single source of truth
for tactic enumeration AND prealloc sizing.
2. ``fused_moe.py:_allocate_buffers``: size buffers to fit any
``tile_size in VALID_TILE_SIZES``. ``max_num_permuted_tokens`` is
monotonically increasing in ``tile_size`` (use
``max(VALID_TILE_SIZES)``); ``max_num_tiles`` is monotonically
decreasing (use ``min(VALID_TILE_SIZES)``). Override
``out_permuted_idx_to_expanded_idx`` independently to fit the
largest tile's ``max_num_permuted_tokens``.
3. ``fused_moe.py:_forward_with_tactic``: drop the
``tile_size == self.tile_size`` check from the gate. Replace with
``tile_size in VALID_TILE_SIZES`` (defensive; should never fail for
tactics drawn from ``ALL_MOE_TACTICS``). The narrow
``is_in_profile_measurement()`` check from the prior commit is
retained.
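The sizing rule in change 2 can be illustrated with stand-in formulas. The real formulas live in `fused_moe.py` and these are not them; only their monotonicity in `tile_size` matters, and the workload numbers are arbitrary:

```python
# Stand-in formulas: only their monotonicity in tile_size matters.
VALID_TILE_SIZES = (128, 256)

def max_num_permuted_tokens(max_tokens, top_k, num_experts, tile_size):
    # Worst case pads every expert's token slice up to a full tile:
    # monotonically increasing in tile_size.
    return max_tokens * top_k + num_experts * (tile_size - 1)

def max_num_tiles(max_tokens, top_k, num_experts, tile_size):
    # Bigger tiles cover the same work in fewer tiles:
    # monotonically decreasing in tile_size.
    return (max_tokens * top_k) // tile_size + num_experts

T, K, E = 16384, 8, 32  # illustrative workload, not the real defaults
permuted_cap = max_num_permuted_tokens(T, K, E, max(VALID_TILE_SIZES))
tiles_cap = max_num_tiles(T, K, E, min(VALID_TILE_SIZES))

# Each cap now covers every valid tile_size:
for ts in VALID_TILE_SIZES:
    assert max_num_permuted_tokens(T, K, E, ts) <= permuted_cap
    assert max_num_tiles(T, K, E, ts) <= tiles_cap
```

Taking `max(VALID_TILE_SIZES)` for the increasing quantity and `min(VALID_TILE_SIZES)` for the decreasing one sizes each buffer at its worst case across all valid tile sizes.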
## Resulting gate semantics
```
use_prealloc = use_cuda_graph
    AND not is_in_profile_measurement()
    AND tile_size in VALID_TILE_SIZES
    AND num_tokens <= max_num_tokens
```
- During autotune profiling: ``is_in_profile_measurement`` is True →
prealloc bypassed for every tactic → unbiased measurement.
- During cache-lookup / do-preparation / post-``choose_one`` /
concurrent-thread inference: prealloc used regardless of tile_size,
preserving the wrapper's CUDA-graph contract.
- During plain inference: same — prealloc used for whichever tactic
the autotuner picked, including ``tile_size != self.tile_size``.
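The resulting gate condenses into a toy predicate; here `in_profile_measurement` is a plain parameter standing in for the thread-local accessor, and the numeric arguments are illustrative:

```python
VALID_TILE_SIZES = (128, 256)

def use_prealloc(use_cuda_graph, in_profile_measurement,
                 tile_size, num_tokens, max_num_tokens):
    # `in_profile_measurement` stands in for the thread-local accessor
    # is_in_profile_measurement() introduced in the prior commit.
    return (
        use_cuda_graph
        and not in_profile_measurement
        and tile_size in VALID_TILE_SIZES   # defensive membership check
        and num_tokens <= max_num_tokens
    )

# Autotune profiling: bypassed for every tactic, so timing is unbiased.
print(use_prealloc(True, True, 256, 4096, 8192))   # False
# Plain inference with a tactic whose tile_size differs from the
# constructor-time value: prealloc still used (CUDA-graph contract holds).
print(use_prealloc(True, False, 256, 4096, 8192))  # True
```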
## Tests
Three test classes in ``tests/moe/test_cute_dsl_fused_moe.py``:
- ``TestPreallocStaticInvariants`` (1 no-GPU test): pin
``VALID_TILE_SIZES`` non-trivial. Catches accidental reduction to
a single entry which would defeat the whole purpose.
- ``TestPreallocBuffersIntegration`` (1 GPU/SM100 test): construct a
real ``CuteDslMoEWrapper(use_cuda_graph=True)``, verify
``_gemm1_output``, ``_gemm1_output_scale``, and the moe_sort buffers
fit the workload at every ``tile_size in VALID_TILE_SIZES``. Pins
the buffer-sizing contract empirically.
- ``TestPreallocGateUnderTuning`` (1 GPU/SM100 test, updated):
monkey-patches ``_moe_core_impl`` and exercises three contexts × two
tile_sizes:
- measurement scope (inside ``_profile_measurement_scope``):
gate must skip prealloc for every tactic.
- inside ``autotune(True)`` but outside the measurement scope
(cached call, post-choose_one): gate must use prealloc for
every valid tile_size.
- outside any tuning context (plain inference): gate must use
prealloc for every valid tile_size.
The latter two assertions are what's strengthened in this commit:
the gate no longer requires ``tile_size == self.tile_size``.
Pairs with PR flashinfer-ai#3216 (autotuner bucket-cap fix). Both required to
fully close the EP>1 perf gap empirically.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
📌 Description
The autotuner's `DynamicTensorSpec` in `flashinfer/fused_moe/cute_dsl/tuner.py` declared `gen_tuning_buckets` as the pre-computed tuple `get_hybrid_num_tokens_buckets(8192)` and `map_to_tuning_buckets` as `lambda x: map_to_hybrid_bucket(x, 8192)`. The hardcoded 8192 cap silently clamped any runtime workload larger than that to the 8192 bucket's cached tactic — at DeepSeek-V3 prefill (N=16384) flashinfer profiled at half the per-expert workload and used a tactic optimized for the wrong shape.

This PR replaces the pre-computed tuple with the bare callable form (`get_hybrid_num_tokens_buckets`) and switches the mapper to the uncapped variant `map_to_hybrid_bucket_uncapped` (added alongside the hybrid-bucket scheme for exactly this case). The autotuner now invokes them with the actual input dim at autotune time, matching TRT-LLM's pattern at `cute_dsl_custom_ops.py:2390-2391` and flashinfer's own pattern at `gemm/gemm_base.py:_FP8_GEMM_SM100_TUNING_CONFIG`.
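A hypothetical sketch of the callable-bucket pattern. The real `get_hybrid_num_tokens_buckets` and `map_to_hybrid_bucket_uncapped` live in `flashinfer.fused_moe.utils` and may differ in bucket layout and signature; this only illustrates why the uncapped variant stops clamping large workloads:

```python
# Hypothetical reimplementation: fine-grained small buckets, then
# power-of-2 growth up to the runtime input dim -- no fixed 8192 cap.
def get_hybrid_num_tokens_buckets(max_num_tokens):
    buckets = [1, 2, 4, 8, 16, 32, 64, 128]
    b = 256
    while b <= max_num_tokens:
        buckets.append(b)
        b *= 2
    if buckets[-1] < max_num_tokens:
        buckets.append(max_num_tokens)
    return tuple(buckets)

def map_to_hybrid_bucket_uncapped(x, buckets):
    # Round up to the nearest bucket; beyond the largest bucket return x
    # itself instead of clamping (the "uncapped" behavior).
    for b in buckets:
        if x <= b:
            return b
    return x

buckets = get_hybrid_num_tokens_buckets(16384)
print(map_to_hybrid_bucket_uncapped(100, buckets))    # 128
print(map_to_hybrid_bucket_uncapped(16384, buckets))  # 16384, no 8192 clamp
```

Passing the generator as a bare callable lets the autotuner call it with the actual input dim, so the largest bucket tracks the runtime workload instead of a build-time constant.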
#3171
#3198
#3115
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks with `pre-commit run --all-files` and fixed any reported issues.
- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

Reviewer Notes