
fix(cute_dsl/moe): make autotuner bucket configuration adapt to runtime input#3216

Merged
aleozlx merged 4 commits into flashinfer-ai:main from leejnau:fix-cute_dsl-moe-autotuner-bucket-cap
May 6, 2026

Conversation

@leejnau
Contributor

@leejnau leejnau commented May 1, 2026

📌 Description

The autotuner's DynamicTensorSpec in flashinfer/fused_moe/cute_dsl/tuner.py declared gen_tuning_buckets as the pre-computed tuple get_hybrid_num_tokens_buckets(8192) and map_to_tuning_buckets as lambda x: map_to_hybrid_bucket(x, 8192). The hardcoded 8192 cap silently clamped any runtime workload larger than that to the 8192 bucket's cached tactic: at DeepSeek-V3 prefill (N=16384), FlashInfer profiled at half the per-expert workload and used a tactic optimized for the wrong shape.

This PR replaces the pre-computed tuple with the bare callable form (get_hybrid_num_tokens_buckets) and switches the mapper to the uncapped variant map_to_hybrid_bucket_uncapped (added alongside the hybrid-bucket scheme for exactly this case). The autotuner now invokes them with the actual input dim at autotune time, matching TRT-LLM's pattern at cute_dsl_custom_ops.py:2390-2391 and flashinfer's own pattern at gemm/gemm_base.py:_FP8_GEMM_SM100_TUNING_CONFIG.
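
For reference, a minimal sketch of the wiring change in tuner.py (other DynamicTensorSpec arguments are elided here; see the diff for the full call):

    # Before: bucket set and mapping baked in at import time with a hardcoded cap.
    DynamicTensorSpec(
        # ... other arguments unchanged ...
        gen_tuning_buckets=get_hybrid_num_tokens_buckets(8192),         # pre-computed tuple
        map_to_tuning_buckets=lambda x: map_to_hybrid_bucket(x, 8192),  # clamps at 8192
    )

    # After: bare callables; the autotuner invokes them with the actual input dim.
    DynamicTensorSpec(
        # ... other arguments unchanged ...
        gen_tuning_buckets=get_hybrid_num_tokens_buckets,
        map_to_tuning_buckets=map_to_hybrid_bucket_uncapped,
    )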

🔍 Related Issues

#3171
#3198
#3115

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Bug Fixes

    • MoE autotuner now uses uncapped dynamic hybrid bucket mapping instead of a fixed-bounded set, improving adaptation to varying input token counts.
  • Tests

    • Added offline tests validating autotuner bucket configuration: dynamic bucket generation, responsiveness to input size, monotonic mapping behavior, large-input scaling, and alignment with expected power-of-2 bucket values.

@coderabbitai
Contributor

coderabbitai Bot commented May 1, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3a91e88e-ea1e-4d7e-83c6-709044357760

📥 Commits

Reviewing files that changed from the base of the PR and between 61724e1 and 96e1d09.

📒 Files selected for processing (2)
  • flashinfer/fused_moe/cute_dsl/tuner.py
  • tests/moe/test_cute_dsl_fused_moe.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • flashinfer/fused_moe/cute_dsl/tuner.py
  • tests/moe/test_cute_dsl_fused_moe.py

📝 Walkthrough

Walkthrough

The CuteDSL NVFP4 MoE tuner’s dynamic token bucketing is changed to use an uncapped hybrid bucket generator and mapping: gen_tuning_buckets is passed as the callable get_hybrid_num_tokens_buckets (not a precomputed bounded tuple) and map_to_tuning_buckets now uses map_to_hybrid_bucket_uncapped. Tests validating bucket generation, scaling, monotonicity, and power-of-2 coverage were added.

Changes

Uncapped Hybrid Bucket Mapping

  • Data Shape / Wiring (flashinfer/fused_moe/cute_dsl/tuner.py): Replace the prior capped wiring (get_hybrid_num_tokens_buckets(8192) and lambda x: map_to_hybrid_bucket(x, 8192)) with gen_tuning_buckets=get_hybrid_num_tokens_buckets (callable) and map_to_tuning_buckets=map_to_hybrid_bucket_uncapped.
  • Behavior / Mapping (flashinfer/fused_moe/cute_dsl/tuner.py): Switch bucket mapping from a fixed 8192-capped hybrid mapping to an uncapped hybrid mapping, changing how autotuning buckets derive from input token dims.
  • Tests / Validation (tests/moe/test_cute_dsl_fused_moe.py): Add a bucket_spec fixture and TestAutotunerBucketConfig asserting that gen_tuning_buckets is callable, that bucket maxima increase with input dim, that map_to_tuning_buckets scales for large inputs and matches power-of-2 values for small ones, that it is monotonically non-decreasing, and that gen_tuning_buckets(max_n) covers TRT-LLM power-of-2 points up to the max.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • aleozlx
  • yzh119
  • samuellees
  • IwakuraRein
  • jiahanc
  • nv-yunzheq

Poem

🐰 I hopped from caps to open skies so wide,
Buckets now grow with every token tide.
Callable kernels hum and maps unbound,
Power-of-two friends are still around.
Tune on, small rabbit, in the uncapped ground.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Title check: ✅ Passed. The title clearly and concisely summarizes the main change: making the autotuner bucket configuration adapt to runtime input by removing the hardcoded 8192 cap, which is the primary goal of this PR.
  • Description check: ✅ Passed. The PR description comprehensively addresses the template requirements with a detailed explanation of the problem, solution, related issues, and completed pre-commit and test checklists.
  • Docstring coverage: ✅ Passed. Docstring coverage is 88.89%, above the required threshold of 80.00%.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request updates the autotuner configuration in flashinfer/fused_moe/cute_dsl/tuner.py to use dynamic bucket generation by passing bare callables, enabling the tuner to adapt to varying input dimensions without hardcoded caps. Additionally, it introduces a comprehensive test suite in tests/moe/test_cute_dsl_fused_moe.py to validate the structural integrity and behavior of the bucket configuration. A review comment identifies a potential ImportError in the new tests caused by a missing utility function.

Comment thread tests/moe/test_cute_dsl_fused_moe.py Outdated
power-of-2 boundary up to the input dim.
"""
from flashinfer.fused_moe.utils import (
get_last_power_of_2_num_tokens_buckets,
Contributor


medium

The function get_last_power_of_2_num_tokens_buckets is imported here but does not appear to be defined in flashinfer/fused_moe/utils.py. This will cause the test test_gen_tuning_buckets_covers_trtllm_power_of_2_points to fail with an ImportError. Please verify if this function was intended to be added to utils.py in this PR or if it should be replaced with an existing function.

leejnau added a commit to leejnau/flashinfer that referenced this pull request May 1, 2026
…2026-05-01)

Final-state line 15 + line 51 + line 4343 PR queue annotation now show
the bucket-cap-fix's upstream state: opened as Draft PR flashinfer-ai#3216 on the
post-flashinfer-ai#3171 main rebase, HEAD `1e3e217b` = tests on top of `1058280b`
= fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@leejnau leejnau marked this pull request as ready for review May 4, 2026 19:29
leejnau and others added 3 commits May 4, 2026 12:34
…me input

The autotuner's `DynamicTensorSpec` at
`flashinfer/fused_moe/cute_dsl/tuner.py` declared a dynamic-token-count
spec with `gen_tuning_buckets` as a pre-computed tuple
(`get_hybrid_num_tokens_buckets(8192)`) and `map_to_tuning_buckets` as
a lambda that capped at 8192 (`lambda x: map_to_hybrid_bucket(x,
8192)`). So when a model serves at num_tokens > 8192 — the DeepSeek-V3
prefill case at N=16384, for example — the runtime input mapped to
bucket=8192 and used the cached tactic that was profiled at half the
per-expert workload.

This produced a profile-shape vs runtime-shape mismatch: at
profile-time bucket=8192 with EP=8 the per-expert work is ~256 tokens,
where tile_size=128 wins by a tight ~0.58% margin over tile_size=256.
At runtime N=16384 the per-expert work doubles to ~512 tokens and
tile_size=256 wins more decisively. The cached choice from the
8192-shape profile was suboptimal for the larger runtime workload.

This change replaces the pre-computed-tuple form with the bare-callable
form, and switches `map_to_tuning_buckets` to the uncapped variant
`map_to_hybrid_bucket_uncapped` that was added alongside the hybrid-
bucket scheme exactly for this case. flashinfer's autotuner already
supports this: at `flashinfer/autotuner.py:1024` it inspects
`gen_tuning_buckets` and invokes it with the actual input dim at
autotune time when the value is a function. With the bare callable,
the bucket set adapts to the workload — no hardcoded cap, no magic
number, future-proof at any N.

This matches:
- TRT-LLM's pattern at
  `tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py:2390-2391` and
  `2700-2703` (CuteDSLFp8BlackwellRunner / -BmmRunner).
- flashinfer's own pattern at
  `gemm/gemm_base.py:_FP8_GEMM_SM100_TUNING_CONFIG` and six other
  callsites in `gemm_base.py` and `trtllm_low_latency_gemm.py`. The
  uncapped helper `map_to_hybrid_bucket_uncapped` was introduced in
  that same code area for exactly this purpose; the CuteDSL MoE's
  tuner was the one place that wasn't migrated to use it.

Empirical impact at N=16384 with --num-iters 100 --warmup 10 (3 runs
each, on a Blackwell B200):

  Without this fix (prealloc-fix only — see related side branch
  cute-dsl-moe-wrapper-prealloc-bias-fix):
    EP=8:  Δ% = +8.4% / -2.7% / +7.5%   (mean +4.4%, spread 11pp)
    EP=16: Δ% = +8.5% / -2.0% / +0.6%   (mean +2.4%, spread 10.5pp)

  With this fix + prealloc-fix:
    EP=8:  Δ% = -0.7% / -1.4% / -0.3%   (mean -0.8%, spread 1.1pp)
    EP=16: Δ% = -9.6% / +0.3% / -7.6%   (mean -5.6%, spread 10pp)

(Δ% measured by `benchmarks/bench_cute_dsl_port_parity.py` against
TensorRT-LLM 1.3.0rc5.post2 in the same process.)

The bucket-cap mismatch was the load-bearing cause of the EP>1 perf
gap at N=16384; removing the hardcoded cap closes it. The remaining
EP=16 10pp spread under "both fixes" is trt's autotune coin-flip on
its own 0.08% tile=128 vs tile=256 profile margin, not a fi-side issue.

At all N ≤ the autotune-time max-N: identical to the previous (capped)
form — same buckets, same cache lookup, same tactic selection.

At N > autotune-time max-N (cache miss case): the previous form mapped
to the cap (8192) and reused that bucket's tactic; this form returns
the actual N. In practice the user calls autotune warmup at the
maximum expected N (`CuteDslMoEWrapper` standard usage), so cache
misses shouldn't occur.

Fully observable perf impact requires the wrapper prealloc-bias fix
(side branch `cute-dsl-moe-wrapper-prealloc-bias-fix`) to be applied
as well — that fix removes the autotune bias that locks fi to
tile=128 in 14/14 cache entries. Without it, this patch is a no-op
since fi can't pick tile=256 even when its profile shape suggests it.
The two patches are independent and can land in either order. PR
flashinfer-ai#3171 (the prerequisite gemm2 tactic enumeration fix that addresses
issue flashinfer-ai#3067) has already merged into main as commit `070fabf0`, so
tile_size=256 is correctly enumerated and this patch is unblocked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
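
As a rough illustration of the callable-vs-tuple dispatch the commit message refers to (illustrative only; the real logic in flashinfer/autotuner.py differs in detail):

    # Sketch of how an autotuner can treat gen_tuning_buckets, not the actual
    # flashinfer/autotuner.py code:
    def resolve_buckets(gen_tuning_buckets, input_dim):
        if callable(gen_tuning_buckets):
            # Bare-callable form: the bucket set adapts to the runtime input dim.
            return tuple(gen_tuning_buckets(input_dim))
        # Pre-computed tuple form: fixed bucket set, independent of input_dim.
        return tuple(gen_tuning_buckets)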
…nfiguration

Adds six no-GPU pytest cases at
`tests/moe/test_cute_dsl_fused_moe.py::TestAutotunerBucketConfig`
guarding the autotuner bucket-cap fix and locking in the load-bearing
behavioral parity with TRT-LLM's pattern at
`cute_dsl_custom_ops.py:2390-2391` and `2700-2703`.

Three "no hardcoded cap" regression guards (the load-bearing
property of the fix):

1. `test_gen_tuning_buckets_is_callable_not_static_tuple` — pins
   `gen_tuning_buckets` on the runner's `tuning_config` to be a bare
   callable, not a pre-computed tuple.

2. `test_gen_tuning_buckets_no_hardcoded_8192_cap` — verifies that
   calling the configured `gen_tuning_buckets` with input dims 8192,
   16384, and 32768 produces bucket sets whose maximum reflects the
   input value.

3. `test_map_to_tuning_buckets_above_8192_not_capped` — verifies
   that `map_to_tuning_buckets(x)` for x ∈ {16384, 32768, 65536}
   doesn't cap at 8192. Ensures we use `map_to_hybrid_bucket_uncapped`
   instead of `lambda x: map_to_hybrid_bucket(x, 8192)`.

Three TRT-LLM-parity regression guards (lock in the
behavioral-equivalence-where-achievable):

4. `test_map_to_tuning_buckets_phase1_matches_trtllm_at_powers_of_2` —
   pins fi/trt-llm parity at power-of-2 inputs ≤ 256 (hybrid Phase 1,
   where pure power-of-2 spacing is preserved). At these inputs,
   fi's `map_to_tuning_buckets(x)` must equal x and equal
   `last_positive_power_of_2(x)` (TRT-LLM's pattern).

5. `test_map_to_tuning_buckets_is_monotonic` — pins monotonic
   non-decreasing behavior across hybrid Phases 1-4. TRT-LLM's
   `last_positive_power_of_2` and fi's `map_to_hybrid_bucket_uncapped`
   both satisfy this; catches a regression that would introduce
   non-monotonic mapping.

6. `test_gen_tuning_buckets_covers_trtllm_power_of_2_points` — pins
   that fi's hybrid bucket set is a SUPERSET of TRT-LLM's power-of-2
   bucket set at every max_n tested. The hybrid scheme intentionally
   adds intermediate linear-step buckets in Phase 2/3 (per PR flashinfer-ai#3115's
   perf rationale) but must preserve the coarse-grained power-of-2
   coverage TRT-LLM has.

These six tests together enforce: (a) no hardcoded cap, (b) callable
form, (c) TRT-LLM-equivalence at power-of-2 probe points, (d)
monotonicity, (e) coarse-grained coverage parity with TRT-LLM. The
hybrid-vs-power-of-2 deviation in Phase 2/3/4 is intentional and
documented (PR flashinfer-ai#3115); the tests don't enforce parity in those phases
because that would regress fi's deliberate perf optimization.

All tests are pure-Python and run without a GPU. They construct a
`CuteDslFusedMoENvfp4Runner` with a no-op `forward_impl` to inspect
its `tuning_config`; no GPU, no CuteDSL kernel binaries, no autotune
side effects.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
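
As a rough illustration of guard 3 above (bucket_spec stands in for the dynamic-tensor spec under test; the actual fixture wiring and assertion text in the PR differ):

    # Illustrative shape of the "not capped at 8192" guard, not the verbatim test body.
    def test_map_to_tuning_buckets_above_8192_not_capped(bucket_spec):
        for x in (16384, 32768, 65536):
            mapped = bucket_spec.map_to_tuning_buckets(x)
            # The old lambda x: map_to_hybrid_bucket(x, 8192) would clamp these to 8192.
            assert mapped > 8192, f"map_to_tuning_buckets({x}) clamped to {mapped}"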
Address gemini-code-assist review on PR flashinfer-ai#3216: the test was importing
`get_last_power_of_2_num_tokens_buckets` from
`flashinfer.fused_moe.utils`, but PR flashinfer-ai#3115 (merged 2026-04-24) removed
that function in favor of the hybrid bucket scheme. The import would
have caused an ImportError when the test was collected.

Replace the call with an equivalent inline construction that mirrors
TRT-LLM's `get_last_power_of_2_num_tokens_buckets` (in
`tensorrt_llm/_torch/utils.py:291`): powers of 2 from 1 up to
`last_positive_power_of_2(max_n)`. `last_positive_power_of_2` is
still available in `flashinfer.fused_moe.utils`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
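
A minimal sketch of the inline construction the commit describes, assuming last_positive_power_of_2 behaves as its name suggests (the actual test code may differ):

    from flashinfer.fused_moe.utils import last_positive_power_of_2

    def powers_of_2_buckets(max_n):
        # Powers of 2 from 1 up to last_positive_power_of_2(max_n),
        # mirroring TRT-LLM's get_last_power_of_2_num_tokens_buckets.
        top = last_positive_power_of_2(max_n)
        buckets, b = [], 1
        while b <= top:
            buckets.append(b)
            b *= 2
        return buckets

    # e.g. powers_of_2_buckets(8192) -> [1, 2, 4, ..., 8192]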
@leejnau leejnau force-pushed the fix-cute_dsl-moe-autotuner-bucket-cap branch from a0af65a to 61724e1 May 4, 2026 19:35
Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
tests/moe/test_cute_dsl_fused_moe.py (1)

569-577: ⚡ Quick win

Remove redundant strict=False keyword arguments from zip() calls.

strict=False is already zip()'s default behavior, so passing it explicitly adds no functional value. Removing it simplifies the code. Note that this is a code-quality improvement, not a compatibility fix; the project requires Python 3.10+, where strict is available.

Suggested fix
-        for prev_x, prev_y, curr_x, curr_y in zip(
-            test_xs, results, test_xs[1:], results[1:], strict=False
-        ):
+        for prev_x, prev_y, curr_x, curr_y in zip(
+            test_xs, results, test_xs[1:], results[1:]
+        ):
             assert prev_y <= curr_y, (
                 f"map_to_tuning_buckets must be monotonically "
                 f"non-decreasing; got map({prev_x})={prev_y} > "
                 f"map({curr_x})={curr_y}. Full mapping at probe "
-                f"points: {list(zip(test_xs, results, strict=False))}."
+                f"points: {list(zip(test_xs, results))}."
             )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/moe/test_cute_dsl_fused_moe.py` around lines 569 - 577, The zip() calls
in the loop that iterates prev_x, prev_y, curr_x, curr_y (the line starting with
"for prev_x, prev_y, curr_x, curr_y in zip(") include the redundant keyword
argument strict=False; remove strict=False from both zip() usages so the calls
simply use zip(test_xs, results, test_xs[1:], results[1:]) and zip(test_xs,
results) in the f-string construction, keeping the logic unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 34f13e0e-aab4-45e6-b5b4-08544ef362f0

📥 Commits

Reviewing files that changed from the base of the PR and between 68d2b66 and a0af65aa2d4cfe39aca217aaa9fa5af1617627d3.

📒 Files selected for processing (2)
  • flashinfer/fused_moe/cute_dsl/tuner.py
  • tests/moe/test_cute_dsl_fused_moe.py

@nv-yunzheq
Collaborator

/bot run

@nv-yunzheq nv-yunzheq added the run-ci and v0.6.11 release blocker labels May 4, 2026
@flashinfer-bot
Collaborator

GitLab MR !624 has been created, and the CI pipeline #50247376 is currently running. I'll report back once the pipeline job completes.

leejnau added a commit to leejnau/flashinfer that referenced this pull request May 4, 2026
Updates the three audit-doc references to PR flashinfer-ai#3216's draft status
(line 15, line 51, line 4369 PR queue annotation) to reflect the
promotion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread flashinfer/fused_moe/cute_dsl/tuner.py Outdated
dim_idx=(0, 0, 0, 0, 0),
gen_tuning_buckets=get_hybrid_num_tokens_buckets(8192),
map_to_tuning_buckets=lambda x: map_to_hybrid_bucket(x, 8192),
# Pass bucket generators as bare callables (matching
Collaborator


The TRT-LLM line numbers (2390-2391, 2700-2703) will go stale, and the code is self-explanatory given the function names. I'd trim to 1-2 lines max, e.g.:

# bare callables, autotuner adapts bucket set to actual input dim
# (matches gemm_base.py _FP8_GEMM_SM100_TUNING_CONFIG pattern).

Comment thread tests/moe/test_cute_dsl_fused_moe.py Outdated
tuple/sequence that bakes in a hardcoded cap.
"""
runner = self._make_runner()
spec = runner.tuning_config.dynamic_tensor_specs[0]
Collaborator


If gen_tuning_buckets is a tuple, callable(tuple_instance) is already False, so the first assertion fails before the second is ever reached; the second assertion is dead code. Could you check? Maybe either remove it or swap the order.
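
For concreteness, a sketch of what collapsing to a single assertion could look like (illustrative; the actual wording of the test message differs):

    # A single callable() check already rules out the tuple case, so a separate
    # isinstance(..., tuple) assertion afterwards would be unreachable.
    assert callable(spec.gen_tuning_buckets), (
        "gen_tuning_buckets must be a bare callable; a pre-computed "
        "sequence likely indicates a hardcoded cap."
    )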

Collaborator

@qiching qiching left a comment


Every test method calls self._make_runner() independently. Since the runner is stateless for these checks, I'd recommend a @pytest.fixture, which would reduce boilerplate and test runtime.

Three changes in response to qiching's review:

1. `tuner.py`: trim the verbose bucket-config comment. Drop the
   TRT-LLM line numbers that will go stale; keep a one-line pointer
   to flashinfer's own `_FP8_GEMM_SM100_TUNING_CONFIG` pattern in
   `gemm_base.py`.

2. `tests/moe/test_cute_dsl_fused_moe.py`: collapse the dead
   second assertion in `test_gen_tuning_buckets_is_callable_not_static_tuple`.
   `callable(tuple_instance)` is already `False`, so the
   `not isinstance(..., tuple)` check was unreachable. Single
   `callable()` check now carries the full message (including the
   "pre-computed sequence likely indicates a hardcoded cap" hint).

3. `tests/moe/test_cute_dsl_fused_moe.py`: replace the
   `_make_runner` static method + per-test reconstruction with a
   module-scoped `bucket_spec` pytest fixture. Reduces boilerplate
   and avoids reconstructing the runner once per test method (the
   runner is stateless for these checks).

Also genericized two stale TRT-LLM line-number references in test
docstrings (`cute_dsl_custom_ops.py:2390-2391`) — same staleness
concern as flashinfer-ai#1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
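
A minimal sketch of such a module-scoped fixture (illustrative; make_noop_runner is a hypothetical stand-in for the real construction of a CuteDslFusedMoENvfp4Runner with a no-op forward_impl):

    import pytest

    @pytest.fixture(scope="module")
    def bucket_spec():
        # Built once per test module; the runner is stateless for these checks.
        runner = make_noop_runner()  # hypothetical helper, not the actual code
        return runner.tuning_config.dynamic_tensor_specs[0]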
Collaborator

@qiching qiching left a comment


Good! The module-scoped bucket_spec pytest fixture replacing _make_runner and the per-test reconstruction of the runner is better: the module scope means the runner is constructed only once for the entire test module, which is more efficient.

Collaborator

@aleozlx aleozlx left a comment


reviewed. will wait for bot run

leejnau added a commit to leejnau/flashinfer that referenced this pull request May 5, 2026
`CuteDslMoEWrapper.__init__` pre-allocates `_gemm1_output`,
`_gemm1_output_scale`, and `_moe_sort_buffers` sized for
`self.tile_size` only. The `use_prealloc` gate in
`_forward_with_tactic` honors prealloc only when the probed tactic's
`tile_size` matches `self.tile_size`:

    use_prealloc = (
        self.use_cuda_graph
        and tile_size == self.tile_size
        and num_tokens <= self.max_num_tokens
    )

During autotune profiling, mismatched tactics fall through to dynamic
`torch.empty()` per-call allocation. The autotuner is then comparing
tactic latencies that include asymmetric allocation overhead — tactics
matching `self.tile_size` run on the prealloc, others pay the alloc
cost — so it consistently picks the matching `tile_size` even when
intrinsic kernel performance favors the other.

Empirical signature pre-fix at EP=8/16, N=16384: fi locks to
`tile_size=128` in 14 of 14 autotune cache entries. TRT-LLM at the
same shapes picks `tile_size=256` more often, producing a +5-9%
headline gap from the tactic mismatch.

Fix — three coordinated changes:

1. `tuner.py`: lift the hardcoded `[128, 256]` tile_size list to a
   module-level `VALID_TILE_SIZES` tuple. Single source of truth for
   tactic enumeration AND prealloc sizing. Adding a new tile_size
   here automatically widens the prealloc.

2. `fused_moe.py:_allocate_buffers`: size buffers to fit any
   `tile_size in VALID_TILE_SIZES`. `max_num_permuted_tokens` is
   monotonically increasing in `tile_size` (use
   `max(VALID_TILE_SIZES)`); `max_num_tiles` is monotonically
   decreasing (use `min(VALID_TILE_SIZES)`). Override
   `out_permuted_idx_to_expanded_idx` independently to fit the
   largest tile's `max_num_permuted_tokens`.

3. `fused_moe.py:_forward_with_tactic`: change the prealloc gate from
   `tile_size == self.tile_size` to `tile_size in VALID_TILE_SIZES`.
   Both tactic groups now reuse the prealloc; profiling is unbiased.

Net: post-fix, the autotuner picks the higher-throughput tactic at
each shape on its merits, matching TRT-LLM's choice at large N.

## Tests

Adds two test classes in `tests/moe/test_cute_dsl_fused_moe.py`:

- `TestPreallocSizingFormula` (3 tests, no-GPU): pins the math that
  justifies the fix. Guards (1) `VALID_TILE_SIZES` has more than one
  entry (otherwise the bias-prevention is moot), (2) the
  monotonicity of `max_num_permuted_tokens` in `tile_size` (justifies
  `max(VALID_TILE_SIZES)` for the permuted-token-indexed buffers),
  and (3) the opposite monotonicity of `max_num_tiles` (justifies
  `min(VALID_TILE_SIZES)` for tile-count-indexed buffers). 5
  parametrized shape configurations covering DeepSeek-V3 EP=1/8/16/32
  + a generic mid-size shape.

- `TestPreallocBuffersIntegration` (2 tests, GPU/SM100 required):
  constructs a real `CuteDslMoEWrapper(use_cuda_graph=True)`. The
  first test verifies the prealloc'd buffers fit the workload at
  every `tile_size in VALID_TILE_SIZES` (not just the
  constructor-time `self.tile_size`). The second test monkey-patches
  the module-level `_moe_core_impl` to capture the buffer-passing
  decision and verifies the `use_prealloc` gate honors every
  `tile_size in VALID_TILE_SIZES`, not just `self.tile_size` —
  directly pinning the load-bearing property of the fix.

## Pairs with PR flashinfer-ai#3216

Pairs with PR flashinfer-ai#3216 (autotuner bucket-cap fix). Both required to
fully close the EP>1 perf gap empirically; validated at
`--num-iters 100` on B200 across EP=8/16, N=16384.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
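
A compact illustration of the sizing rule in change 2 (the requirement functions below are placeholders, not the real formulas in _allocate_buffers; only the monotonic directions matter):

    VALID_TILE_SIZES = (128, 256)

    def permuted_tokens_needed(tile_size):
        # placeholder: grows with tile_size (per-expert padding up to the tile boundary)
        return 131072 + 256 * tile_size

    def tiles_needed(tile_size):
        # placeholder: shrinks with tile_size (fewer, larger tiles)
        return 131072 // tile_size + 256

    # Size shared buffers for the worst case over every tile_size a tactic may pick,
    # not just self.tile_size:
    permuted_capacity = max(permuted_tokens_needed(t) for t in VALID_TILE_SIZES)
    tile_capacity = max(tiles_needed(t) for t in VALID_TILE_SIZES)
    # Given the monotonicities above, these equal permuted_tokens_needed(max(...))
    # and tiles_needed(min(...)), which is what the commit uses.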
@aleozlx
Collaborator

aleozlx commented May 5, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !624 has been updated with latest changes, and the CI pipeline #50336929 is currently running. I'll report back once the pipeline job completes.

@aleozlx aleozlx enabled auto-merge (squash) May 6, 2026 07:22
leejnau added a commit to leejnau/flashinfer that referenced this pull request May 6, 2026
@yongwww yongwww added run-ci and removed run-ci labels May 6, 2026
@aleozlx aleozlx merged commit e6ac7cc into flashinfer-ai:main May 6, 2026
76 of 92 checks passed
leejnau added a commit to leejnau/flashinfer that referenced this pull request May 6, 2026
@leejnau leejnau deleted the fix-cute_dsl-moe-autotuner-bucket-cap branch May 6, 2026 22:20
leejnau added a commit to leejnau/flashinfer that referenced this pull request May 6, 2026
PR flashinfer-ai#3216 merged 2026-05-06 22:09 UTC as squash-commit `e6ac7cc2`.
Replaces the pre-computed-tuple bucket cap with a bare-callable form
that adapts to runtime input dim. Pairs with the prealloc-fix
(now rebased onto post-flashinfer-ai#3216 main as HEAD `c7a81fdb`, ready to PR).

Updates the line-15 final-state entry, line-51 perf-investigation-
closed paragraph, and line-4369 PR queue annotation in follow-up flashinfer-ai#12.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request May 6, 2026
`CuteDslMoEWrapper.__init__` pre-allocates `_gemm1_output`,
`_gemm1_output_scale`, and `_moe_sort_buffers` sized for
`self.tile_size` only. The `use_prealloc` gate in
`_forward_with_tactic` honors prealloc only when the probed tactic's
`tile_size` matches `self.tile_size`:

    use_prealloc = (
        self.use_cuda_graph
        and tile_size == self.tile_size
        and num_tokens <= self.max_num_tokens
    )

During autotune profiling, mismatched tactics fall through to dynamic
`torch.empty()` per-call allocation. The autotuner is then comparing
tactic latencies that include asymmetric allocation overhead — tactics
matching `self.tile_size` run on the prealloc, others pay the alloc
cost — so it consistently picks the matching `tile_size` even when
intrinsic kernel performance favors the other.

Empirical signature pre-fix at EP=8/16, N=16384: fi locks to
`tile_size=128` in 14 of 14 autotune cache entries. TRT-LLM at the
same shapes picks `tile_size=256` more often, producing a +5-9%
headline gap from the tactic mismatch.

Fix — three coordinated changes:

1. `tuner.py`: lift the hardcoded `[128, 256]` tile_size list to a
   module-level `VALID_TILE_SIZES` tuple. Single source of truth for
   tactic enumeration AND prealloc sizing. Adding a new tile_size
   here automatically widens the prealloc.

2. `fused_moe.py:_allocate_buffers`: size buffers to fit any
   `tile_size in VALID_TILE_SIZES`. `max_num_permuted_tokens` is
   monotonically increasing in `tile_size` (use
   `max(VALID_TILE_SIZES)`); `max_num_tiles` is monotonically
   decreasing (use `min(VALID_TILE_SIZES)`). Override
   `out_permuted_idx_to_expanded_idx` independently to fit the
   largest tile's `max_num_permuted_tokens`.

3. `fused_moe.py:_forward_with_tactic`: change the prealloc gate from
   `tile_size == self.tile_size` to `tile_size in VALID_TILE_SIZES`.
   Both tactic groups now reuse the prealloc; profiling is unbiased.

Net: post-fix, the autotuner picks the higher-throughput tactic at
each shape on its merits, matching TRT-LLM's choice at large N.

## Tests

Adds two test classes in `tests/moe/test_cute_dsl_fused_moe.py`:

- `TestPreallocStaticInvariants` (1 test, no-GPU): pins
  `VALID_TILE_SIZES` to enumerate more than one tile_size. Catches
  the orthogonal failure mode where a future refactor reduces the
  constant to a single entry — in that case the GPU integration
  tests below would pass trivially (no max/min divergence, only one
  tile_size to gate-check) and the bias-prevention silently
  disappears.

- `TestPreallocBuffersIntegration` (2 tests, GPU/SM100 required):
  constructs a real `CuteDslMoEWrapper(use_cuda_graph=True)`. The
  first test verifies the prealloc'd buffer shapes fit the workload
  at every `tile_size in VALID_TILE_SIZES` — directly empirically
  pinning the buffer-sizing contract. The second test
  monkey-patches the module-level `_moe_core_impl` to capture the
  buffer-passing decision and verifies the `use_prealloc` gate
  honors every `tile_size in VALID_TILE_SIZES`, not just
  `self.tile_size` — directly pinning the load-bearing property of
  the fix.

## Pairs with PR flashinfer-ai#3216

Pairs with PR flashinfer-ai#3216 (autotuner bucket-cap fix, merged 2026-05-06).
Both required to fully close the EP>1 perf gap empirically;
validated at `--num-iters 100` on B200 across EP=8/16, N=16384.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
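
A minimal sketch of the static invariant described above, assuming VALID_TILE_SIZES is importable from the CuteDSL tuner module (the real test may phrase it differently):

    from flashinfer.fused_moe.cute_dsl.tuner import VALID_TILE_SIZES

    def test_valid_tile_sizes_enumerates_multiple_entries():
        # With only one tile_size, the prealloc-bias prevention is moot.
        assert len(VALID_TILE_SIZES) > 1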
leejnau added a commit to leejnau/flashinfer that referenced this pull request May 7, 2026
Extend the `use_prealloc` gate in `_forward_with_tactic` with `not
AutoTuner.get().is_tuning_mode` so the wrapper bypasses its
preallocated buffers during autotune profiling. All tactics then
see the same per-call `torch.empty()` allocation overhead and the
autotuner's tactic comparison is unbiased; outside the
`autotune(True)` context the gate behaves as before — prealloc
when `tile_size == self.tile_size`, fall through otherwise.

This replaces an earlier approach in this branch that widened the
preallocated buffers to fit every valid tile_size; the new
approach decouples `self.tile_size` from autotune-time allocation
without expanding the prealloc layout, so `tuner.py` is left
untouched and `_allocate_buffers` reverts to its pre-PR shape.

Pairs with PR flashinfer-ai#3216 (autotuner bucket-cap fix). Both required to
fully close the EP>1 perf gap empirically; validated at
`--num-iters 100` on B200 across EP=8/16, N=16384.

## Tests

Replaces the prior buffer-shape and structural-invariant tests
with a single GPU/SM100 test (`TestPreallocGateUnderTuning`) that
constructs a real `CuteDslMoEWrapper(use_cuda_graph=True)`,
monkey-patches the module-level `_moe_core_impl` to capture the
`moe_sort_buffers` argument across {inside `autotune(True)`,
outside} × {`tile_size == self.tile_size`, mismatch}, and asserts:

- Inside `autotune(True)`: gate skips prealloc for every tactic —
  pinning the unbias property.
- Outside `autotune(True)`: gate uses prealloc when the picked
  tactic's `tile_size` matches `self.tile_size`; skips it
  otherwise (the prealloc layout would be wrong for the other
  tile_size).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
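
Reading the description above, the gate presumably becomes roughly the following (a sketch, not the verbatim diff):

    use_prealloc = (
        self.use_cuda_graph
        and not AutoTuner.get().is_tuning_mode   # added: bypass prealloc during tuning
        and tile_size == self.tile_size
        and num_tokens <= self.max_num_tokens
    )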
leejnau added a commit to leejnau/flashinfer that referenced this pull request May 7, 2026
Replaces the wrapper's ``not AutoTuner.get().is_tuning_mode`` gate
clause with ``not is_in_profile_measurement()``.  The new signal is
strictly narrower than ``is_tuning_mode``: it is True only on the
calling thread, and only inside the autotuner's per-tactic measurement
window (warmup + timed run inside ``_profile_single_kernel``).  It is
False during cache lookups, ``do_preparation`` calls, the runner
invocation immediately after ``choose_one`` returns, and other
threads' inference -- all of which the broader ``is_tuning_mode`` flag
swept up incorrectly.

## Why narrower

``AutoTuner.is_tuning_mode`` is True for the whole ``autotune(True)``
context, regardless of whether the autotuner is actively timing a
specific tactic.  Reading it from the wrapper meant that:

1. Cache hits for ops already tuned (where no measurement happens)
   bypassed prealloc anyway.
2. The runner invocation that uses the chosen tactic immediately after
   ``choose_one`` returns -- still inside the ``with autotune(True):``
   block -- bypassed prealloc.
3. Concurrent threads doing inference while another thread held the
   tuning context bypassed prealloc.
4. CUDA-graph capture happening inside an ``autotune(True)`` block
   would record per-call ``torch.empty()`` calls instead of preallocs.

None of these are situations where unbiased measurement matters; they
all benefit from prealloc.  ``is_in_profile_measurement()`` excludes
them while still serving the original intent: during the actual
measurement window, every tactic sees the same per-call allocation
overhead and the autotuner's tactic comparison is unbiased.

## What changes in autotuner.py

- New module-level ``_profile_measurement_thread_local`` (a
  ``threading.local``).
- New ``_profile_measurement_scope`` context manager (private; sets
  the thread-local on entry, restores prior value on exit, supports
  nesting).
- New ``is_in_profile_measurement()`` accessor (public; reads the
  thread-local; returns False on threads that never entered the scope).
- ``AutoTuner._profile_single_kernel`` wraps its warmup + timed run
  with ``_profile_measurement_scope()`` so every runner invocation
  inside the measurement function sees the flag True; runner
  invocations elsewhere in ``choose_one`` (cache search,
  ``do_preparation``, the post-loop ``search_cache`` call) see it
  False.

The change is purely additive: the new helpers don't alter
``AutoTuner``'s class state, no other autotune callers are affected.

## Tests

Replaces the prior ``TestPreallocGateUnderTuning`` test with a
broader contract check that exercises three contexts:

1. Inside ``autotune(True)`` AND inside ``_profile_measurement_scope()``
   (simulating a tactic measurement) -- gate must skip prealloc for
   every tactic, regardless of tile_size match.
2. Inside ``autotune(True)`` but OUTSIDE the measurement scope
   (simulating a cache hit, the do_preparation call, or the
   post-``choose_one`` runner invocation) -- gate must use prealloc
   when ``tile_size == self.tile_size``, skip otherwise.  This is the
   property that distinguishes the narrow signal from the broad one.
3. Outside any tuning context (plain inference) -- same as case 2.

Pairs with PR flashinfer-ai#3216 (autotuner bucket-cap fix).  Both required to
fully close the EP>1 perf gap empirically; validated at
``--num-iters 100`` on B200 across EP=8/16, N=16384.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
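
A minimal sketch consistent with the helpers described above (the attribute name inside the threading.local is illustrative; the actual autotuner.py code may differ in detail):

    import contextlib
    import threading

    _profile_measurement_thread_local = threading.local()

    @contextlib.contextmanager
    def _profile_measurement_scope():
        # Set the per-thread flag on entry, restore the prior value on exit (nestable).
        prev = getattr(_profile_measurement_thread_local, "active", False)
        _profile_measurement_thread_local.active = True
        try:
            yield
        finally:
            _profile_measurement_thread_local.active = prev

    def is_in_profile_measurement() -> bool:
        # False on threads that never entered the scope.
        return getattr(_profile_measurement_thread_local, "active", False)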
leejnau added a commit to leejnau/flashinfer that referenced this pull request May 7, 2026
Builds on the narrow ``is_in_profile_measurement()`` gate from the
prior commit by also expanding ``CuteDslMoEWrapper._allocate_buffers``
to size kernel-output buffers for *any* ``tile_size in
VALID_TILE_SIZES``, not just the constructor-time ``self.tile_size``.

## Why

Now that the autotuner profiles tactics unbiasedly (the prior commit's
narrower gate), it can fairly pick a tactic with ``tile_size != self.tile_size``
when that's the higher-throughput choice — at large N this is the
common case (``tile_size=256`` typically wins).  But with the prior
commit alone, the wrapper's prealloc was still sized for
``self.tile_size`` and the gate fell through to per-call
``torch.empty()`` whenever the picked tactic mismatched.  Two real
problems:

1. **CUDA-graph contract**: the wrapper's ``run()`` is documented as
   graph-safe with ``use_cuda_graph=True``, but per-call ``torch.empty``
   means captured graphs record the alloc inside the graph instead of
   binding to the prealloc.  PyTorch's graph private memory pool
   accommodates this since 1.10, but it's not what the contract
   promises.
2. **Buffer-overflow correctness**: ``max_num_permuted_tokens`` is
   monotonically increasing in ``tile_size``, so a tile_size=128-sized
   buffer is too small for a tile_size=256 tactic.  The fall-through to
   per-call alloc isn't a perf-hygiene choice — it's required for
   correctness given the smaller sizing.

## Fix — three coordinated changes

1. ``tuner.py``: lift the hardcoded ``[128, 256]`` tile_size list to
   a module-level ``VALID_TILE_SIZES`` tuple.  Single source of truth
   for tactic enumeration AND prealloc sizing.

2. ``fused_moe.py:_allocate_buffers``: size buffers to fit any
   ``tile_size in VALID_TILE_SIZES``.  ``max_num_permuted_tokens`` is
   monotonically increasing in ``tile_size`` (use
   ``max(VALID_TILE_SIZES)``); ``max_num_tiles`` is monotonically
   decreasing (use ``min(VALID_TILE_SIZES)``).  Override
   ``out_permuted_idx_to_expanded_idx`` independently to fit the
   largest tile's ``max_num_permuted_tokens``.

3. ``fused_moe.py:_forward_with_tactic``: drop the
   ``tile_size == self.tile_size`` check from the gate.  Replace with
   ``tile_size in VALID_TILE_SIZES`` (defensive; should never fail for
   tactics drawn from ``ALL_MOE_TACTICS``).  The narrow
   ``is_in_profile_measurement()`` check from the prior commit is
   retained.

## Resulting gate semantics

``use_prealloc = use_cuda_graph
                 AND not is_in_profile_measurement()
                 AND tile_size in VALID_TILE_SIZES
                 AND num_tokens <= max_num_tokens``

- During autotune profiling: ``is_in_profile_measurement`` is True →
  prealloc bypassed for every tactic → unbiased measurement.
- During cache-lookup / do-preparation / post-``choose_one`` /
  concurrent-thread inference: prealloc used regardless of tile_size,
  preserving the wrapper's CUDA-graph contract.
- During plain inference: same — prealloc used for whichever tactic
  the autotuner picked, including ``tile_size != self.tile_size``.

## Tests

Three test classes in ``tests/moe/test_cute_dsl_fused_moe.py``:

- ``TestPreallocStaticInvariants`` (1 no-GPU test): pin
  ``VALID_TILE_SIZES`` non-trivial.  Catches accidental reduction to
  a single entry which would defeat the whole purpose.

- ``TestPreallocBuffersIntegration`` (1 GPU/SM100 test): construct a
  real ``CuteDslMoEWrapper(use_cuda_graph=True)``, verify
  ``_gemm1_output``, ``_gemm1_output_scale``, and the moe_sort buffers
  fit the workload at every ``tile_size in VALID_TILE_SIZES``.  Pins
  the buffer-sizing contract empirically.

- ``TestPreallocGateUnderTuning`` (1 GPU/SM100 test, updated):
  monkey-patches ``_moe_core_impl`` and exercises three contexts × two
  tile_sizes:
    - measurement scope (inside ``_profile_measurement_scope``):
      gate must skip prealloc for every tactic.
    - inside ``autotune(True)`` but outside the measurement scope
      (cached call, post-choose_one): gate must use prealloc for
      every valid tile_size.
    - outside any tuning context (plain inference): gate must use
      prealloc for every valid tile_size.

  The latter two assertions are what's strengthened in this commit:
  the gate no longer requires ``tile_size == self.tile_size``.

Pairs with PR flashinfer-ai#3216 (autotuner bucket-cap fix).  Both required to
fully close the EP>1 perf gap empirically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Labels

op: moe, run-ci, v0.6.11 release blocker

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants