fix(cute_dsl/moe): make autotuner bucket configuration adapt to runtime input #3216
Conversation
📝 Walkthrough: The CuteDSL NVFP4 MoE tuner's dynamic token bucketing is changed to use an uncapped hybrid bucket generator and mapping.
Code Review
This pull request updates the autotuner configuration in flashinfer/fused_moe/cute_dsl/tuner.py to use dynamic bucket generation by passing bare callables, enabling the tuner to adapt to varying input dimensions without hardcoded caps. Additionally, it introduces a comprehensive test suite in tests/moe/test_cute_dsl_fused_moe.py to validate the structural integrity and behavior of the bucket configuration. A review comment identifies a potential ImportError in the new tests caused by a missing utility function.
    power-of-2 boundary up to the input dim.
    """
    from flashinfer.fused_moe.utils import (
        get_last_power_of_2_num_tokens_buckets,
The function get_last_power_of_2_num_tokens_buckets is imported here but does not appear to be defined in flashinfer/fused_moe/utils.py. This will cause the test test_gen_tuning_buckets_covers_trtllm_power_of_2_points to fail with an ImportError. Please verify if this function was intended to be added to utils.py in this PR or if it should be replaced with an existing function.
…2026-05-01) The final-state entries at line 15, line 51, and the line-4343 PR queue annotation now show the bucket-cap-fix's upstream state: opened as Draft PR flashinfer-ai#3216 on the post-flashinfer-ai#3171 main rebase, HEAD `1e3e217b` = tests on top of `1058280b` = fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fix(cute_dsl/moe): make autotuner bucket configuration adapt to runtime input
The autotuner's `DynamicTensorSpec` at
`flashinfer/fused_moe/cute_dsl/tuner.py` declared a dynamic-token-count
spec with `gen_tuning_buckets` as a pre-computed tuple
(`get_hybrid_num_tokens_buckets(8192)`) and `map_to_tuning_buckets` as
a lambda that capped at 8192 (`lambda x: map_to_hybrid_bucket(x,
8192)`). So when a model serves at num_tokens > 8192 — the DeepSeek-V3
prefill case at N=16384, for example — the runtime input mapped to
bucket=8192 and used the cached tactic that was profiled at half the
per-expert workload.
This produced a profile-shape vs runtime-shape mismatch: at
profile-time bucket=8192 with EP=8 the per-expert work is ~256 tokens,
where tile_size=128 wins by a tight ~0.58% margin over tile_size=256.
At runtime N=16384 the per-expert work doubles to ~512 tokens and
tile_size=256 wins more decisively. The cached choice from the
8192-shape profile was suboptimal for the larger runtime workload.
This change replaces the pre-computed-tuple form with the bare-callable
form, and switches `map_to_tuning_buckets` to the uncapped variant
`map_to_hybrid_bucket_uncapped` that was added alongside the hybrid-
bucket scheme exactly for this case. flashinfer's autotuner already
supports this: at `flashinfer/autotuner.py:1024` it inspects
`gen_tuning_buckets` and invokes it with the actual input dim at
autotune time when the value is a function. With the bare callable,
the bucket set adapts to the workload — no hardcoded cap, no magic
number, future-proof at any N.
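The capped-vs-uncapped difference can be sketched as follows. The helpers below are toy stand-ins for `get_hybrid_num_tokens_buckets` / `map_to_hybrid_bucket` (a pure power-of-2 scheme rather than the actual hybrid one), so only the capped-vs-uncapped behavior is faithful to the change described above:

```python
# Toy sketch, not flashinfer's actual code: how a pre-computed bucket
# tuple caps the mapping while an uncapped mapping tracks the input dim.
def gen_power_of_2_buckets(max_n):
    """Stand-in bucket generator: powers of 2 up to max_n."""
    buckets, b = [], 1
    while b <= max_n:
        buckets.append(b)
        b *= 2
    return tuple(buckets)

def map_capped(x, cap):
    """Old form: clamp to the cap, then round down to a bucket."""
    x = min(x, cap)
    return max(b for b in gen_power_of_2_buckets(cap) if b <= x)

def map_uncapped(x):
    """New form: round down to the nearest bucket, no cap."""
    b = 1
    while b * 2 <= x:
        b *= 2
    return b

# Old behavior: N=16384 collapses onto the 8192 bucket and reuses its tactic.
assert map_capped(16384, 8192) == 8192
# New behavior: the bucket follows the actual input dim.
assert map_uncapped(16384) == 16384
```

With the bare-callable form the autotuner generates the bucket set from the actual input dim at autotune time, so the uncapped mapping always has a profiled bucket to land on.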
This matches:
- TRT-LLM's pattern at
`tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py:2390-2391` and
`2700-2703` (CuteDSLFp8BlackwellRunner / -BmmRunner).
- flashinfer's own pattern at
`gemm/gemm_base.py:_FP8_GEMM_SM100_TUNING_CONFIG` and six other
callsites in `gemm_base.py` and `trtllm_low_latency_gemm.py`. The
uncapped helper `map_to_hybrid_bucket_uncapped` was introduced in
that same code area for exactly this purpose; the CuteDSL MoE's
tuner was the one place that wasn't migrated to use it.
Empirical impact at N=16384 with --num-iters 100 --warmup 10 (3 runs
each, on a Blackwell B200):
Without this fix (prealloc-fix only — see related side branch
cute-dsl-moe-wrapper-prealloc-bias-fix):
EP=8: Δ% = +8.4% / -2.7% / +7.5% (mean +4.4%, spread 11pp)
EP=16: Δ% = +8.5% / -2.0% / +0.6% (mean +2.4%, spread 10.5pp)
With this fix + prealloc-fix:
EP=8: Δ% = -0.7% / -1.4% / -0.3% (mean -0.8%, spread 1.1pp)
EP=16: Δ% = -9.6% / +0.3% / -7.6% (mean -5.6%, spread 10pp)
(Δ% measured by `benchmarks/bench_cute_dsl_port_parity.py` against
TensorRT-LLM 1.3.0rc5.post2 in the same process.)
The bucket-cap mismatch was the load-bearing cause of the EP>1 perf
gap at N=16384; removing the hardcoded cap closes it. The remaining
EP=16 10pp spread under "both fixes" is trt's autotune coin-flip on
its own 0.08% tile=128 vs tile=256 profile margin, not a fi-side issue.
At all N ≤ the autotune-time max-N: identical to the previous (capped)
form — same buckets, same cache lookup, same tactic selection.
At N > autotune-time max-N (cache miss case): the previous form mapped
to the cap (8192) and reused that bucket's tactic; this form returns
the actual N. In practice the user calls autotune warmup at the
maximum expected N (`CuteDslMoEWrapper` standard usage), so cache
misses shouldn't occur.
Fully observable perf impact requires the wrapper prealloc-bias fix
(side branch `cute-dsl-moe-wrapper-prealloc-bias-fix`) to be applied
as well — that fix removes the autotune bias that locks fi to
tile=128 in 14/14 cache entries. Without it, this patch is a no-op
since fi can't pick tile=256 even when its profile shape suggests it.
The two patches are independent and can land in either order. PR
flashinfer-ai#3171 (the prerequisite gemm2 tactic enumeration fix that addresses
issue flashinfer-ai#3067) has already merged into main as commit `070fabf0`, so
tile_size=256 is correctly enumerated and this patch is unblocked.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nfiguration
Adds six no-GPU pytest cases at
`tests/moe/test_cute_dsl_fused_moe.py::TestAutotunerBucketConfig`
guarding the autotuner bucket-cap fix and locking in the load-bearing
behavioral parity with TRT-LLM's pattern at
`cute_dsl_custom_ops.py:2390-2391` and `2700-2703`.
Three "no hardcoded cap" regression guards (the load-bearing
property of the fix):
1. `test_gen_tuning_buckets_is_callable_not_static_tuple` — pins
`gen_tuning_buckets` on the runner's `tuning_config` to be a bare
callable, not a pre-computed tuple.
2. `test_gen_tuning_buckets_no_hardcoded_8192_cap` — verifies that
calling the configured `gen_tuning_buckets` with input dims 8192,
16384, and 32768 produces bucket sets whose maximum reflects the
input value.
3. `test_map_to_tuning_buckets_above_8192_not_capped` — verifies
that `map_to_tuning_buckets(x)` for x ∈ {16384, 32768, 65536}
doesn't cap at 8192. Ensures we use `map_to_hybrid_bucket_uncapped`
instead of `lambda x: map_to_hybrid_bucket(x, 8192)`.
Three TRT-LLM-parity regression guards (lock in the
behavioral-equivalence-where-achievable):
4. `test_map_to_tuning_buckets_phase1_matches_trtllm_at_powers_of_2` —
pins fi/trt-llm parity at power-of-2 inputs ≤ 256 (hybrid Phase 1,
where pure power-of-2 spacing is preserved). At these inputs,
fi's `map_to_tuning_buckets(x)` must equal x and equal
`last_positive_power_of_2(x)` (TRT-LLM's pattern).
5. `test_map_to_tuning_buckets_is_monotonic` — pins monotonic
non-decreasing behavior across hybrid Phases 1-4. TRT-LLM's
`last_positive_power_of_2` and fi's `map_to_hybrid_bucket_uncapped`
both satisfy this; catches a regression that would introduce
non-monotonic mapping.
6. `test_gen_tuning_buckets_covers_trtllm_power_of_2_points` — pins
that fi's hybrid bucket set is a SUPERSET of TRT-LLM's power-of-2
bucket set at every max_n tested. The hybrid scheme intentionally
adds intermediate linear-step buckets in Phase 2/3 (per PR flashinfer-ai#3115's
perf rationale) but must preserve the coarse-grained power-of-2
coverage TRT-LLM has.
These six tests together enforce: (a) no hardcoded cap, (b) callable
form, (c) TRT-LLM-equivalence at power-of-2 probe points, (d)
monotonicity, (e) coarse-grained coverage parity with TRT-LLM. The
hybrid-vs-power-of-2 deviation in Phase 2/3/4 is intentional and
documented (PR flashinfer-ai#3115); the tests don't enforce parity in those phases
because that would regress fi's deliberate perf optimization.
All tests are pure-Python and run without a GPU. They construct a
`CuteDslFusedMoENvfp4Runner` with a no-op `forward_impl` to inspect
its `tuning_config`; no GPU, no CuteDSL kernel binaries, no autotune
side effects.
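Guard (5) above can be sketched as a pure-Python test; `map_to_bucket` is a hypothetical stand-in (the real test calls the `map_to_tuning_buckets` configured on the runner's `tuning_config`):

```python
# Hypothetical sketch of the monotonicity guard: the bucket mapping must
# be non-decreasing across ordered probe points spanning all phases.
def map_to_bucket(x):
    # Stand-in mapping: round down to the nearest power of 2.
    b = 1
    while b * 2 <= x:
        b *= 2
    return b

def test_map_is_monotonic():
    xs = [1, 3, 8, 100, 300, 5000, 8192, 16384, 65536]
    ys = [map_to_bucket(x) for x in xs]
    # Pairwise check over adjacent probe points.
    for prev_y, curr_y in zip(ys, ys[1:]):
        assert prev_y <= curr_y, "mapping must be non-decreasing"

test_map_is_monotonic()
```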
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address gemini-code-assist review on PR flashinfer-ai#3216: the test was importing `get_last_power_of_2_num_tokens_buckets` from `flashinfer.fused_moe.utils`, but PR flashinfer-ai#3115 (merged 2026-04-24) removed that function in favor of the hybrid bucket scheme. The import would have caused an ImportError when the test was collected.

Replace the call with an equivalent inline construction that mirrors TRT-LLM's `get_last_power_of_2_num_tokens_buckets` (in `tensorrt_llm/_torch/utils.py:291`): powers of 2 from 1 up to `last_positive_power_of_2(max_n)`. `last_positive_power_of_2` is still available in `flashinfer.fused_moe.utils`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
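The inline construction the commit above describes can be sketched as follows; the names mirror the commit, and `last_positive_power_of_2` is assumed to mean "largest power of 2 ≤ n":

```python
# Hedged sketch of the suggested replacement: build TRT-LLM's power-of-2
# bucket list inline instead of importing the removed helper.
def last_positive_power_of_2(n):
    # Assumed semantics: largest power of 2 <= n, for n >= 1.
    p = 1
    while p * 2 <= n:
        p *= 2
    return p

def power_of_2_buckets(max_n):
    # Powers of 2 from 1 up to last_positive_power_of_2(max_n).
    buckets, b = [], 1
    while b <= last_positive_power_of_2(max_n):
        buckets.append(b)
        b *= 2
    return tuple(buckets)

assert power_of_2_buckets(10) == (1, 2, 4, 8)
```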
🧹 Nitpick comments (1)

tests/moe/test_cute_dsl_fused_moe.py, lines 569-577: ⚡ Quick win. Remove the redundant `strict=False` keyword arguments from the `zip()` calls. `strict=False` is `zip()`'s default behavior and adds no functional value; removing it simplifies the code. Note that this is a code-quality improvement, not a compatibility fix: the project requires Python 3.10+, where `strict` is available.

Suggested fix:

    - for prev_x, prev_y, curr_x, curr_y in zip(
    -     test_xs, results, test_xs[1:], results[1:], strict=False
    - ):
    + for prev_x, prev_y, curr_x, curr_y in zip(
    +     test_xs, results, test_xs[1:], results[1:]
    + ):
          assert prev_y <= curr_y, (
              f"map_to_tuning_buckets must be monotonically "
              f"non-decreasing; got map({prev_x})={prev_y} > "
              f"map({curr_x})={curr_y}. Full mapping at probe "
    -         f"points: {list(zip(test_xs, results, strict=False))}."
    +         f"points: {list(zip(test_xs, results))}."
          )
📥 Commits
Reviewing files that changed from the base of the PR and between 68d2b66 and a0af65aa2d4cfe39aca217aaa9fa5af1617627d3.
📒 Files selected for processing (2)

flashinfer/fused_moe/cute_dsl/tuner.py
tests/moe/test_cute_dsl_fused_moe.py
/bot run
Updates the three audit-doc references to PR flashinfer-ai#3216's draft status (line 15, line 51, line 4369 PR queue annotation) to reflect the promotion. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
    dim_idx=(0, 0, 0, 0, 0),
    gen_tuning_buckets=get_hybrid_num_tokens_buckets(8192),
    map_to_tuning_buckets=lambda x: map_to_hybrid_bucket(x, 8192),
    # Pass bucket generators as bare callables (matching
The TRT-LLM line numbers (2390-2391, 2700-2703) will go stale, and the code is self-explanatory given the function names. I'd trim to 1-2 lines max, e.g.:

    # bare callables, autotuner adapts bucket set to actual input dim
    # (matches gemm_base.py _FP8_GEMM_SM100_TUNING_CONFIG pattern).
    tuple/sequence that bakes in a hardcoded cap.
    """
    runner = self._make_runner()
    spec = runner.tuning_config.dynamic_tensor_specs[0]
If `gen_tuning_buckets` is a tuple, `callable(tuple_instance)` is already False, so the first assertion fails before the second is ever reached; the second assertion is dead. Could you check? Maybe either remove it or swap the order.
qiching left a comment

Every test method calls `self._make_runner()` independently. Since the runner is stateless for these checks, I'd recommend a `@pytest.fixture` to reduce boilerplate and test runtime.
Three changes in response to qiching's review:

1. `tuner.py`: trim the verbose bucket-config comment. Drop the TRT-LLM line numbers that will go stale; keep a one-line pointer to flashinfer's own `_FP8_GEMM_SM100_TUNING_CONFIG` pattern in `gemm_base.py`.

2. `tests/moe/test_cute_dsl_fused_moe.py`: collapse the dead second assertion in `test_gen_tuning_buckets_is_callable_not_static_tuple`. `callable(tuple_instance)` is already `False`, so the `not isinstance(..., tuple)` check was unreachable. A single `callable()` check now carries the full message (including the "pre-computed sequence likely indicates a hardcoded cap" hint).

3. `tests/moe/test_cute_dsl_fused_moe.py`: replace the `_make_runner` static method + per-test reconstruction with a module-scoped `bucket_spec` pytest fixture. Reduces boilerplate and avoids reconstructing the runner once per test method (the runner is stateless for these checks).

Also genericized two stale TRT-LLM line-number references in test docstrings (`cute_dsl_custom_ops.py:2390-2391`), the same staleness concern as item 1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
qiching left a comment

Good: the module-scoped `bucket_spec` pytest fixture that replaces `_make_runner` and the per-test runner construction is an improvement. Module scope means the runner is constructed only once for the entire test module, making the suite more efficient.
aleozlx left a comment

Reviewed; will wait for the bot run.
`CuteDslMoEWrapper.__init__` pre-allocates `_gemm1_output`,
`_gemm1_output_scale`, and `_moe_sort_buffers` sized for
`self.tile_size` only. The `use_prealloc` gate in
`_forward_with_tactic` honors prealloc only when the probed tactic's
`tile_size` matches `self.tile_size`:
use_prealloc = (
self.use_cuda_graph
and tile_size == self.tile_size
and num_tokens <= self.max_num_tokens
)
During autotune profiling, mismatched tactics fall through to dynamic
`torch.empty()` per-call allocation. The autotuner is then comparing
tactic latencies that include asymmetric allocation overhead — tactics
matching `self.tile_size` run on the prealloc, others pay the alloc
cost — so it consistently picks the matching `tile_size` even when
intrinsic kernel performance favors the other.
Empirical signature pre-fix at EP=8/16, N=16384: fi locks to
`tile_size=128` in 14 of 14 autotune cache entries. TRT-LLM at the
same shapes picks `tile_size=256` more often, producing a +5-9%
headline gap from the tactic mismatch.
Fix — three coordinated changes:
1. `tuner.py`: lift the hardcoded `[128, 256]` tile_size list to a
module-level `VALID_TILE_SIZES` tuple. Single source of truth for
tactic enumeration AND prealloc sizing. Adding a new tile_size
here automatically widens the prealloc.
2. `fused_moe.py:_allocate_buffers`: size buffers to fit any
`tile_size in VALID_TILE_SIZES`. `max_num_permuted_tokens` is
monotonically increasing in `tile_size` (use
`max(VALID_TILE_SIZES)`); `max_num_tiles` is monotonically
decreasing (use `min(VALID_TILE_SIZES)`). Override
`out_permuted_idx_to_expanded_idx` independently to fit the
largest tile's `max_num_permuted_tokens`.
3. `fused_moe.py:_forward_with_tactic`: change the prealloc gate from
`tile_size == self.tile_size` to `tile_size in VALID_TILE_SIZES`.
Both tactic groups now reuse the prealloc; profiling is unbiased.
Net: post-fix, the autotuner picks the higher-throughput tactic at
each shape on its merits, matching TRT-LLM's choice at large N.
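The sizing argument in change (2) can be illustrated with a worked sketch. The formulas below are assumptions for illustration, not flashinfer's exact code (the real `_allocate_buffers` derives these from the model config); only the two monotonicity claims and the resulting max/min choices are the point:

```python
# Illustrative sketch: sizing prealloc buffers to fit every tile_size.
import math

VALID_TILE_SIZES = (128, 256)

def max_num_permuted_tokens(num_tokens, num_experts, tile_size):
    # Worst case: each expert's token slice pads up to a full tile, so
    # the padded total GROWS with tile_size (monotonically increasing).
    return num_tokens + num_experts * (tile_size - 1)

def max_num_tiles(num_tokens, num_experts, tile_size):
    # Larger tiles mean FEWER tiles (monotonically decreasing).
    permuted = max_num_permuted_tokens(num_tokens, num_experts, tile_size)
    return math.ceil(permuted / tile_size)

# Permuted-token-indexed buffers: size for the LARGEST tile_size.
permuted_cap = max_num_permuted_tokens(16384, 32, max(VALID_TILE_SIZES))
# Tile-count-indexed buffers: size for the SMALLEST tile_size.
tile_cap = max_num_tiles(16384, 32, min(VALID_TILE_SIZES))

# The monotonicity that justifies max()/min(): every other tile_size fits.
assert max_num_permuted_tokens(16384, 32, 128) <= permuted_cap
assert max_num_tiles(16384, 32, 256) <= tile_cap
```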
## Tests
Adds two test classes in `tests/moe/test_cute_dsl_fused_moe.py`:
- `TestPreallocSizingFormula` (3 tests, no-GPU): pins the math that
justifies the fix. Guards (1) `VALID_TILE_SIZES` has more than one
entry (otherwise the bias-prevention is moot), (2) the
monotonicity of `max_num_permuted_tokens` in `tile_size` (justifies
`max(VALID_TILE_SIZES)` for the permuted-token-indexed buffers),
and (3) the opposite monotonicity of `max_num_tiles` (justifies
`min(VALID_TILE_SIZES)` for tile-count-indexed buffers). 5
parametrized shape configurations covering DeepSeek-V3 EP=1/8/16/32
+ a generic mid-size shape.
- `TestPreallocBuffersIntegration` (2 tests, GPU/SM100 required):
constructs a real `CuteDslMoEWrapper(use_cuda_graph=True)`. The
first test verifies the prealloc'd buffers fit the workload at
every `tile_size in VALID_TILE_SIZES` (not just the
constructor-time `self.tile_size`). The second test monkey-patches
the module-level `_moe_core_impl` to capture the buffer-passing
decision and verifies the `use_prealloc` gate honors every
`tile_size in VALID_TILE_SIZES`, not just `self.tile_size` —
directly pinning the load-bearing property of the fix.
## Pairs with PR flashinfer-ai#3216
Pairs with PR flashinfer-ai#3216 (autotuner bucket-cap fix). Both required to
fully close the EP>1 perf gap empirically; validated at
`--num-iters 100` on B200 across EP=8/16, N=16384.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/bot run
PR flashinfer-ai#3216 merged 2026-05-06 22:09 UTC as squash-commit `e6ac7cc2`. Replaces the pre-computed-tuple bucket cap with a bare-callable form that adapts to runtime input dim. Pairs with the prealloc-fix (now rebased onto post-flashinfer-ai#3216 main as HEAD `c7a81fdb`, ready to PR). Updates the line-15 final-state entry, the line-51 perf-investigation-closed paragraph, and the line-4369 PR queue annotation in follow-up flashinfer-ai#12. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`CuteDslMoEWrapper.__init__` pre-allocates `_gemm1_output`,
`_gemm1_output_scale`, and `_moe_sort_buffers` sized for
`self.tile_size` only. The `use_prealloc` gate in
`_forward_with_tactic` honors prealloc only when the probed tactic's
`tile_size` matches `self.tile_size`:
use_prealloc = (
self.use_cuda_graph
and tile_size == self.tile_size
and num_tokens <= self.max_num_tokens
)
During autotune profiling, mismatched tactics fall through to dynamic
`torch.empty()` per-call allocation. The autotuner is then comparing
tactic latencies that include asymmetric allocation overhead — tactics
matching `self.tile_size` run on the prealloc, others pay the alloc
cost — so it consistently picks the matching `tile_size` even when
intrinsic kernel performance favors the other.
Empirical signature pre-fix at EP=8/16, N=16384: fi locks to
`tile_size=128` in 14 of 14 autotune cache entries. TRT-LLM at the
same shapes picks `tile_size=256` more often, producing a +5-9%
headline gap from the tactic mismatch.
Fix — three coordinated changes:
1. `tuner.py`: lift the hardcoded `[128, 256]` tile_size list to a
module-level `VALID_TILE_SIZES` tuple. Single source of truth for
tactic enumeration AND prealloc sizing. Adding a new tile_size
here automatically widens the prealloc.
2. `fused_moe.py:_allocate_buffers`: size buffers to fit any
`tile_size in VALID_TILE_SIZES`. `max_num_permuted_tokens` is
monotonically increasing in `tile_size` (use
`max(VALID_TILE_SIZES)`); `max_num_tiles` is monotonically
decreasing (use `min(VALID_TILE_SIZES)`). Override
`out_permuted_idx_to_expanded_idx` independently to fit the
largest tile's `max_num_permuted_tokens`.
3. `fused_moe.py:_forward_with_tactic`: change the prealloc gate from
`tile_size == self.tile_size` to `tile_size in VALID_TILE_SIZES`.
Both tactic groups now reuse the prealloc; profiling is unbiased.
Net: post-fix, the autotuner picks the higher-throughput tactic at
each shape on its merits, matching TRT-LLM's choice at large N.
## Tests
Adds two test classes in `tests/moe/test_cute_dsl_fused_moe.py`:
- `TestPreallocStaticInvariants` (1 test, no-GPU): pins
`VALID_TILE_SIZES` to enumerate more than one tile_size. Catches
the orthogonal failure mode where a future refactor reduces the
constant to a single entry — in that case the GPU integration
tests below would pass trivially (no max/min divergence, only one
tile_size to gate-check) and the bias-prevention silently
disappears.
- `TestPreallocBuffersIntegration` (2 tests, GPU/SM100 required):
constructs a real `CuteDslMoEWrapper(use_cuda_graph=True)`. The
first test verifies the prealloc'd buffer shapes fit the workload
at every `tile_size in VALID_TILE_SIZES` — directly empirically
pinning the buffer-sizing contract. The second test
monkey-patches the module-level `_moe_core_impl` to capture the
buffer-passing decision and verifies the `use_prealloc` gate
honors every `tile_size in VALID_TILE_SIZES`, not just
`self.tile_size` — directly pinning the load-bearing property of
the fix.
## Pairs with PR flashinfer-ai#3216
Pairs with PR flashinfer-ai#3216 (autotuner bucket-cap fix, merged 2026-05-06).
Both required to fully close the EP>1 perf gap empirically;
validated at `--num-iters 100` on B200 across EP=8/16, N=16384.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extend the `use_prealloc` gate in `_forward_with_tactic` with `not AutoTuner.get().is_tuning_mode` so the wrapper bypasses its preallocated buffers during autotune profiling. All tactics then see the same per-call `torch.empty()` allocation overhead and the autotuner's tactic comparison is unbiased; outside the `autotune(True)` context the gate behaves as before: prealloc when `tile_size == self.tile_size`, fall through otherwise.

This replaces an earlier approach in this branch that widened the preallocated buffers to fit every valid tile_size; the new approach decouples `self.tile_size` from autotune-time allocation without expanding the prealloc layout, so `tuner.py` is left untouched and `_allocate_buffers` reverts to its pre-PR shape.

Pairs with PR flashinfer-ai#3216 (autotuner bucket-cap fix). Both required to fully close the EP>1 perf gap empirically; validated at `--num-iters 100` on B200 across EP=8/16, N=16384.

## Tests

Replaces the prior buffer-shape and structural-invariant tests with a single GPU/SM100 test (`TestPreallocGateUnderTuning`) that constructs a real `CuteDslMoEWrapper(use_cuda_graph=True)`, monkey-patches the module-level `_moe_core_impl` to capture the `moe_sort_buffers` argument across {inside `autotune(True)`, outside} × {`tile_size == self.tile_size`, mismatch}, and asserts:

- Inside `autotune(True)`: the gate skips prealloc for every tactic, pinning the unbiased-measurement property.
- Outside `autotune(True)`: the gate uses prealloc when the picked tactic's `tile_size` matches `self.tile_size`; skips it otherwise (the prealloc layout would be wrong for the other tile_size).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the wrapper's `not AutoTuner.get().is_tuning_mode` gate clause with `not is_in_profile_measurement()`. The new signal is strictly narrower than `is_tuning_mode`: it is True only on the calling thread, and only inside the autotuner's per-tactic measurement window (warmup + timed run inside `_profile_single_kernel`). It is False during cache lookups, `do_preparation` calls, the runner invocation immediately after `choose_one` returns, and other threads' inference, all of which the broader `is_tuning_mode` flag swept up incorrectly.

## Why narrower

`AutoTuner.is_tuning_mode` is True for the whole `autotune(True)` context, regardless of whether the autotuner is actively timing a specific tactic. Reading it from the wrapper meant that:

1. Cache hits for ops already tuned (where no measurement happens) bypassed prealloc anyway.
2. The runner invocation that uses the chosen tactic immediately after `choose_one` returns, still inside the `with autotune(True):` block, bypassed prealloc.
3. Concurrent threads doing inference while another thread held the tuning context bypassed prealloc.
4. CUDA-graph capture happening inside an `autotune(True)` block would record per-call `torch.empty()` calls instead of preallocs.

None of these are situations where unbiased measurement matters; they all benefit from prealloc. `is_in_profile_measurement()` excludes them while still serving the original intent: during the actual measurement window, every tactic sees the same per-call allocation overhead and the autotuner's tactic comparison is unbiased.

## What changes in autotuner.py

- New module-level `_profile_measurement_thread_local` (a `threading.local`).
- New `_profile_measurement_scope` context manager (private; sets the thread-local on entry, restores the prior value on exit, supports nesting).
- New `is_in_profile_measurement()` accessor (public; reads the thread-local; returns False on threads that never entered the scope).
- `AutoTuner._profile_single_kernel` wraps its warmup + timed run with `_profile_measurement_scope()` so every runner invocation inside the measurement function sees the flag True; runner invocations elsewhere in `choose_one` (cache search, `do_preparation`, the post-loop `search_cache` call) see it False.

The change is purely additive: the new helpers don't alter `AutoTuner`'s class state, and no other autotune callers are affected.

## Tests

Replaces the prior `TestPreallocGateUnderTuning` test with a broader contract check that exercises three contexts:

1. Inside `autotune(True)` AND inside `_profile_measurement_scope()` (simulating a tactic measurement): the gate must skip prealloc for every tactic, regardless of tile_size match.
2. Inside `autotune(True)` but OUTSIDE the measurement scope (simulating a cache hit, the `do_preparation` call, or the post-`choose_one` runner invocation): the gate must use prealloc when `tile_size == self.tile_size`, skip otherwise. This is the property that distinguishes the narrow signal from the broad one.
3. Outside any tuning context (plain inference): same as case 2.

Pairs with PR flashinfer-ai#3216 (autotuner bucket-cap fix). Both required to fully close the EP>1 perf gap empirically; validated at `--num-iters 100` on B200 across EP=8/16, N=16384.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Builds on the narrow ``is_in_profile_measurement()`` gate from the
prior commit by also expanding ``CuteDslMoEWrapper._allocate_buffers``
to size kernel-output buffers for *any* ``tile_size in
VALID_TILE_SIZES``, not just the constructor-time ``self.tile_size``.
## Why
Now that the autotuner profiles tactics unbiasedly (the prior commit's
narrower gate), it can fairly pick a tactic with ``tile_size != self.tile_size``
when that's the higher-throughput choice — at large N this is the
common case (``tile_size=256`` typically wins). But with the prior
commit alone, the wrapper's prealloc was still sized for
``self.tile_size`` and the gate fell through to per-call
``torch.empty()`` whenever the picked tactic mismatched. Two real
problems:
1. **CUDA-graph contract**: the wrapper's ``run()`` is documented as
graph-safe with ``use_cuda_graph=True``, but per-call ``torch.empty``
means captured graphs record the alloc inside the graph instead of
binding to the prealloc. PyTorch's graph private memory pool
accommodates this since 1.10, but it's not what the contract
promises.
2. **Buffer-overflow correctness**: ``max_num_permuted_tokens`` is
monotonically increasing in ``tile_size``, so a tile_size=128-sized
buffer is too small for a tile_size=256 tactic. The fall-through to
per-call alloc isn't a perf-hygiene choice — it's required for
correctness given the smaller sizing.
## Fix — three coordinated changes
1. ``tuner.py``: lift the hardcoded ``[128, 256]`` tile_size list to
a module-level ``VALID_TILE_SIZES`` tuple. Single source of truth
for tactic enumeration AND prealloc sizing.
2. ``fused_moe.py:_allocate_buffers``: size buffers to fit any
``tile_size in VALID_TILE_SIZES``. ``max_num_permuted_tokens`` is
monotonically increasing in ``tile_size`` (use
``max(VALID_TILE_SIZES)``); ``max_num_tiles`` is monotonically
decreasing (use ``min(VALID_TILE_SIZES)``). Override
``out_permuted_idx_to_expanded_idx`` independently to fit the
largest tile's ``max_num_permuted_tokens``.
3. ``fused_moe.py:_forward_with_tactic``: drop the
``tile_size == self.tile_size`` check from the gate. Replace with
``tile_size in VALID_TILE_SIZES`` (defensive; should never fail for
tactics drawn from ``ALL_MOE_TACTICS``). The narrow
``is_in_profile_measurement()`` check from the prior commit is
retained.
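The sizing rule in change 2 can be illustrated with stand-in formulas. The real formulas live in `fused_moe.py` and these are not them; only their monotonicity in `tile_size` matters, and the workload numbers are arbitrary:

```python
# Stand-in formulas: only their monotonicity in tile_size matters.
VALID_TILE_SIZES = (128, 256)

def max_num_permuted_tokens(max_tokens, top_k, num_experts, tile_size):
    # Worst case pads every expert's token slice up to a full tile:
    # monotonically increasing in tile_size.
    return max_tokens * top_k + num_experts * (tile_size - 1)

def max_num_tiles(max_tokens, top_k, num_experts, tile_size):
    # Bigger tiles cover the same work in fewer tiles:
    # monotonically decreasing in tile_size.
    return (max_tokens * top_k) // tile_size + num_experts

T, K, E = 16384, 8, 32  # illustrative workload, not the real defaults
permuted_cap = max_num_permuted_tokens(T, K, E, max(VALID_TILE_SIZES))
tiles_cap = max_num_tiles(T, K, E, min(VALID_TILE_SIZES))

# Each cap now covers every valid tile_size:
for ts in VALID_TILE_SIZES:
    assert max_num_permuted_tokens(T, K, E, ts) <= permuted_cap
    assert max_num_tiles(T, K, E, ts) <= tiles_cap
```

Taking `max(VALID_TILE_SIZES)` for the increasing quantity and `min(VALID_TILE_SIZES)` for the decreasing one sizes each buffer at its worst case across all valid tile sizes.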
## Resulting gate semantics
```
use_prealloc = use_cuda_graph
    AND not is_in_profile_measurement()
    AND tile_size in VALID_TILE_SIZES
    AND num_tokens <= max_num_tokens
```
- During autotune profiling: ``is_in_profile_measurement`` is True →
prealloc bypassed for every tactic → unbiased measurement.
- During cache-lookup / do-preparation / post-``choose_one`` /
concurrent-thread inference: prealloc used regardless of tile_size,
preserving the wrapper's CUDA-graph contract.
- During plain inference: same — prealloc used for whichever tactic
the autotuner picked, including ``tile_size != self.tile_size``.
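The resulting gate condenses into a toy predicate; here `in_profile_measurement` is a plain parameter standing in for the thread-local accessor, and the numeric arguments are illustrative:

```python
VALID_TILE_SIZES = (128, 256)

def use_prealloc(use_cuda_graph, in_profile_measurement,
                 tile_size, num_tokens, max_num_tokens):
    # `in_profile_measurement` stands in for the thread-local accessor
    # is_in_profile_measurement() introduced in the prior commit.
    return (
        use_cuda_graph
        and not in_profile_measurement
        and tile_size in VALID_TILE_SIZES   # defensive membership check
        and num_tokens <= max_num_tokens
    )

# Autotune profiling: bypassed for every tactic, so timing is unbiased.
print(use_prealloc(True, True, 256, 4096, 8192))   # False
# Plain inference with a tactic whose tile_size differs from the
# constructor-time value: prealloc still used (CUDA-graph contract holds).
print(use_prealloc(True, False, 256, 4096, 8192))  # True
```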
## Tests
Three test classes in ``tests/moe/test_cute_dsl_fused_moe.py``:
- ``TestPreallocStaticInvariants`` (1 no-GPU test): pin
``VALID_TILE_SIZES`` non-trivial. Catches accidental reduction to
a single entry which would defeat the whole purpose.
- ``TestPreallocBuffersIntegration`` (1 GPU/SM100 test): construct a
real ``CuteDslMoEWrapper(use_cuda_graph=True)``, verify
``_gemm1_output``, ``_gemm1_output_scale``, and the moe_sort buffers
fit the workload at every ``tile_size in VALID_TILE_SIZES``. Pins
the buffer-sizing contract empirically.
- ``TestPreallocGateUnderTuning`` (1 GPU/SM100 test, updated):
monkey-patches ``_moe_core_impl`` and exercises three contexts × two
tile_sizes:
- measurement scope (inside ``_profile_measurement_scope``):
gate must skip prealloc for every tactic.
- inside ``autotune(True)`` but outside the measurement scope
(cached call, post-choose_one): gate must use prealloc for
every valid tile_size.
- outside any tuning context (plain inference): gate must use
prealloc for every valid tile_size.
The latter two assertions are what's strengthened in this commit:
the gate no longer requires ``tile_size == self.tile_size``.
Pairs with PR flashinfer-ai#3216 (autotuner bucket-cap fix). Both required to
fully close the EP>1 perf gap empirically.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
📌 Description
The autotuner's `DynamicTensorSpec` in `flashinfer/fused_moe/cute_dsl/tuner.py` declared `gen_tuning_buckets` as the pre-computed tuple `get_hybrid_num_tokens_buckets(8192)` and `map_to_tuning_buckets` as `lambda x: map_to_hybrid_bucket(x, 8192)`. The hardcoded 8192 cap silently clamped any runtime workload larger than that to the 8192 bucket's cached tactic — at DeepSeek-V3 prefill (N=16384) flashinfer profiled at half the per-expert workload and used a tactic optimized for the wrong shape.

This PR replaces the pre-computed tuple with the bare callable form (`get_hybrid_num_tokens_buckets`) and switches the mapper to the uncapped variant `map_to_hybrid_bucket_uncapped` (added alongside the hybrid-bucket scheme for exactly this case). The autotuner now invokes them with the actual input dim at autotune time, matching TRT-LLM's pattern at `cute_dsl_custom_ops.py:2390-2391` and flashinfer's own pattern at `gemm/gemm_base.py:_FP8_GEMM_SM100_TUNING_CONFIG`.
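A hypothetical sketch of the callable-bucket pattern. The real `get_hybrid_num_tokens_buckets` and `map_to_hybrid_bucket_uncapped` live in `flashinfer.fused_moe.utils` and may differ in bucket layout and signature; this only illustrates why the uncapped variant stops clamping large workloads:

```python
# Hypothetical reimplementation: fine-grained small buckets, then
# power-of-2 growth up to the runtime input dim -- no fixed 8192 cap.
def get_hybrid_num_tokens_buckets(max_num_tokens):
    buckets = [1, 2, 4, 8, 16, 32, 64, 128]
    b = 256
    while b <= max_num_tokens:
        buckets.append(b)
        b *= 2
    if buckets[-1] < max_num_tokens:
        buckets.append(max_num_tokens)
    return tuple(buckets)

def map_to_hybrid_bucket_uncapped(x, buckets):
    # Round up to the nearest bucket; beyond the largest bucket return x
    # itself instead of clamping (the "uncapped" behavior).
    for b in buckets:
        if x <= b:
            return b
    return x

buckets = get_hybrid_num_tokens_buckets(16384)
print(map_to_hybrid_bucket_uncapped(100, buckets))    # 128
print(map_to_hybrid_bucket_uncapped(16384, buckets))  # 16384, no 8192 clamp
```

Passing the generator as a bare callable lets the autotuner call it with the actual input dim, so the largest bucket tracks the runtime workload instead of a build-time constant.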
#3171
#3198
#3115
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks with `pre-commit run --all-files` and fixed any reported issues.
- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

Reviewer Notes