
fix(cute_dsl/moe): correct off-by-one in get_max_num_tiles to match TRT-LLM #3198

Merged
nv-yunzheq merged 2 commits into flashinfer-ai:main from
leejnau:fix-cute_dsl-moe-get-max-num-tiles-off-by-one
May 4, 2026

Conversation

@leejnau
Contributor

@leejnau leejnau commented Apr 28, 2026

📌 Description

Replace get_max_num_tiles in flashinfer/fused_moe/cute_dsl/moe_utils.py with TRT-LLM's compact closed-form. The previous formula over-allocated by 1 tile whenever (E - L) % T != 0, where E = num_tokens * top_k, L = num_local_experts, T = tile_size.

Fix

return (num_expanded_tokens + (tile_size - 1) * num_local_experts) // tile_size

This is provably equivalent to L + floor((E - L) / T), the tight worst-case bound on sum_e ceil(K_e / T), and matches TRT-LLM's tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py.

Concrete examples at top_k=8, num_tokens=16384:

  • num_local_experts=32, tile_size=128: 1056 → 1055
  • num_local_experts=32, tile_size=256: 544 → 543
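
A quick standalone sanity check of the two examples (old_bound / new_bound re-implement the before/after formulas from this description; they are illustrative helpers, not the actual flashinfer functions):

```python
def old_bound(E: int, L: int, T: int) -> int:
    # previous formula: L + ceil((E - L) / T), using exact integer ceil division
    return L + -(-(E - L) // T)

def new_bound(E: int, L: int, T: int) -> int:
    # TRT-LLM's compact closed form: (E + (T - 1) * L) // T
    return (E + (T - 1) * L) // T

E = 16384 * 8  # num_tokens * top_k
for L, T in [(32, 128), (32, 256)]:
    print(f"L={L} T={T}: {old_bound(E, L, T)} -> {new_bound(E, L, T)}")
# L=32 T=128: 1056 -> 1055
# L=32 T=256: 544 -> 543
```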

🔍 Related Issues

#3171

#2398

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Refactor

    • Improved MOE tile computation to use a single tight closed-form bound, simplifying logic and ensuring consistent worst-case tiling.
  • Tests

    • Added comprehensive tests validating the new tile-bound across edge cases (including zero tokens and threshold cases), monotonic behavior, and exact tightness between tiled and permuted token counts.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 28, 2026

📝 Walkthrough

Refactors get_max_num_tiles to a single closed-form worst-case bound and updates its docstring. Adds unit tests that assert the bound's tightness, edge cases, monotonicity, and consistency between get_max_num_permuted_tokens and get_max_num_tiles.

Changes

Core Logic Update (flashinfer/fused_moe/cute_dsl/moe_utils.py):
Replaced prior conditional/greedy logic with a unified closed-form formula, (num_expanded_tokens + (tile_size - 1) * num_local_experts) // tile_size, for get_max_num_tiles. Docstring rewritten to explain the tight-bound rationale and the tile_size correspondence to moe_sort parameters.

Test Coverage (tests/moe/test_cute_dsl_fused_moe.py):
Added tests validating get_max_num_tiles against worst-case per-expert distributions, boundary cases (including zero tokens and the threshold where expanded ≤ local experts), monotonicity over num_tokens, and exact equality between get_max_num_permuted_tokens and get_max_num_tiles * tile_size.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

run-ci

Suggested reviewers

  • nv-yunzheq
  • yzh119
  • cyx-6
  • djmmoss
  • samuellees

Poem

🐰 I hopped through tiles and counts so neat,
Replaced the branches with one tidy beat,
Tests sniffed out edges, tight as a seam,
Now bounds are snug — a rabbit's dream,
🌿✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Title check (✅ Passed): The title accurately summarizes the main change: fixing an off-by-one error in get_max_num_tiles to align with TRT-LLM's implementation, which is the core objective of the PR.
  • Description check (✅ Passed): The PR description is comprehensive, including a clear explanation of the fix, the mathematical rationale, concrete examples, related issues, and completed checklist items for pre-commit checks and tests.
  • Docstring Coverage (✅ Passed): Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.
  • Linked Issues check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): Check skipped because no linked issues were found for this pull request.


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request updates the MoE utility functions and tuner to support 2-CTA variants with a tile size of 256, ensuring layout consistency between GEMM1 and GEMM2. The get_max_num_tiles function has been refactored to use a compact closed-form expression that provides a tight upper bound on the number of tiles, mirroring TRT-LLM's implementation. Additionally, new unit tests have been introduced to verify tactic enumeration invariants and the accuracy of the tile count calculations. I have no feedback to provide.

leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 28, 2026
…mula divergence

Updates the addendum's "+1 tile bucketing-formula divergence" subsection
to reference upstream PR flashinfer-ai#3198, which fixes get_max_num_tiles by
replacing fi's L + ceil((E - L) / T) with TRT-LLM's compact closed-form
(E + (T - 1) * L) // T. PR is awaiting review and is based on flashinfer-ai#3067.

Also adds a parenthetical clarifying that the perf impact of the +1
over-allocation was effectively zero at runtime (since the kernel
work is bounded by num_non_exiting_tiles, not buffer capacity), to
prevent future readers from over-attributing the EP=8 gap to this
formula bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 28, 2026
… bucket-cap fixes

Updates the post-close addendum and the flashinfer-ai#9 sub-section to reflect the
2026-04-28 root-cause work. The audit's headline +7-9% EP perf gaps at
N=16384 (EP=8 +8.6% kernel, EP=16 +13.7% kernel) are driven by two
paired CuteDSL MoE autotune bugs:

1. CuteDslMoEWrapper prealloc-bias (correctness): wrapper prealloc'd
   buffers sized for self.tile_size only, biasing autotune profiling
   against the other tile_size. Without the fix, fi is locked to
   tile_size=128 in 14/14 cache entries. Side branch
   cute-dsl-moe-wrapper-prealloc-bias-fix at commit a4137e6.
2. Autotuner bucket-cap mismatch (headline-perf): tuner.py:281 caps
   profile buckets at 8192 while runtime can be N=16384. fi profiles
   at half the per-expert workload, autotune-noise on tight margins
   (0.58% at EP=8 bucket=8192) destabilizes the cached choice. 1-line
   fix: bump cap to 16384.

Validation at --num-iters 100 --warmup 10 (3 runs each):

  no fixes:           EP=8 +5.5%/9.5pp  EP=16 +1.8%/9.7pp
  prealloc-fix only:  EP=8 +4.4%/11pp   EP=16 +2.4%/10.5pp
  both fixes:         EP=8 -0.8%/1.1pp  EP=16 -5.6%/10pp

Prealloc-fix alone does not close the headline gap. The bucket-cap-fix
is what closes it. Both fixes are required.

The EP=16 10pp spread under both-fixes is trt's autotune coin-flip
on its 0.08% tile=128 vs tile=256 profile-time margin -- fi's chosen
tactic is stable across all 3 runs; trt's is the noise source.

Several earlier 2026-04-28 revisions to this addendum had wrong
framings (EP=16 fully closed by prealloc-fix alone at --num-iters 1
was a lucky 2-run sample; necessary-but-not-sufficient at EP=8 was
partly true at --num-iters 1 but the residual was the bucket-cap
mismatch, not an unfixable deeper issue). The final state captured
here supersedes all earlier revisions in this same audit doc.

Neither fix is upstream as of 2026-04-28 -- held pending flashinfer-ai#3067
(fix-cutedsl-moe-gemm2-tactic-enum-tile256) and flashinfer-ai#3198
(get_max_num_tiles off-by-one) merging first. Thin-adapter refactor
remains the recommended medium-term direction; both short-term fixes
patch symptoms in code the refactor would delete.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 29, 2026
…, mark superseded sub-sections

Lee asked for a thorough audit-of-the-audit pass: look for inconsistencies,
discrepancies, redundancies, and opportunities for compaction/simplification.
After reading the full 3952-line doc, the following systematic edits were
applied:

Top-of-doc:

- Added a "Final state (2026-04-29) — read this first" section at the very
  top (immediately after the title, before the chronological correction
  log). Captures: static audit summary, both upstream PRs (flashinfer-ai#3067, flashinfer-ai#3198),
  the two paired wrapper bugs and their side branches, the comprehensive
  30-cell validation matrix, the three-regime classification (large-N
  closure / medium-N wrapper-overhead-residual / small-N noise), the EP=1
  + version-skew explanation, and the closure of the empirical thread.
  This section is now authoritative; the rest of the doc is the
  chronological investigation record leading to it.

- Updated the title's "last updated" tag to 2026-04-29.

Stale references fixed:

- The flashinfer-ai#3067 fix branch was referred to via the stale commit hashes
  d291d17e and e2330e73/f0cf8cd0 in three places; updated to the current
  57de5f8 + 38293a0 (with notes that the original commit-hash references
  are preserved for traceability).
- The "Tunable runner" row in the Summary table marked tile_size=256 as
  "gated off"; updated to "fixed via PR flashinfer-ai#3067 (pending review)".
- Item 3 of the corrections list: "Recommended fix (drafted, not yet
  applied)" → "Fix landed and shipped as upstream PR flashinfer-ai#3067".

Duplicate content removed:

- Two "What still stands (unaffected by the autotune-context bug)"
  blocks were nearly identical; the first was removed and replaced
  with a one-line pointer to the consolidated version below.

Executive verdict updated:

- Rewritten to reflect current state: now references both PRs, the
  paired-fix story, the comprehensive 30-cell sweep matrix, and the
  +13.7% kernel-level closure. Previously it described "Leading
  hypothesis (NOT YET EMPIRICALLY VERIFIED)" framings that have since
  been confirmed and shipped.

Superseded sub-sections marked:

- "EP=8 vs EP=16: the fix has different impact at each" — was an
  early "necessary but not sufficient" framing under prealloc-only
  state; replaced with a SUPERSEDED note pointing to the paired-fix
  story below (the bucket-cap-fix discovery is what closed both EPs).
- "EP=16 confirmation (2026-04-28)" header clarified to indicate
  it's a single-iter prealloc-only snapshot; the H_A/H_B
  hypothesis-discrimination logic and trt 0.08% margin observation
  are still load-bearing and preserved.
- "Final cross-N validation (2026-04-28)" renamed to "Initial 6-cell
  cross-N spot-check" with a pointer to the comprehensive 30-cell
  sweep that supersedes it.
- Status sub-section: removed outdated EP=8 partial-closure / EP=16
  full-closure bullets (those were prealloc-only state); replaced
  with concise paired-fix branch summary and pointer to comprehensive
  sweep for per-cell results.
- Implications for the audit's earlier claims: updated the
  "dominant factor is the prealloc bias" bullet to the paired-fix
  framing (prealloc + bucket-cap, both required).

Follow-ups updated:

- Open follow-ups remaining: appended a "Update 2026-04-29 (final):
  flashinfer-ai#9 also closed" note documenting that the EP=8/EP=16 perf gap was
  root-caused to two paired wrapper bugs, with the empirical
  investigation thread now closed.
- Item flashinfer-ai#9 in the chronological "Follow-ups queued for later sessions"
  list: added a RESOLVED 2026-04-29 header with pointer to the
  paired-fix story, marking the 9a / 9b / 9c sub-mechanism candidates
  as pre-resolution candidate hypotheses (preserved as record of the
  search space, not blocking work).

What was preserved as historical record (not removed):

- Full "2026-04-24/25/28 correction" chronological log
- Per-component static audit detail
- Issue flashinfer-ai#3067 narrative including the original "Leading hypothesis"
  framing and the post-fix verification table
- The 22-day candidate-root-cause investigation that ruled out 5
  mechanisms before landing on the paired wrapper bugs
- Part 2 baseline numbers (with retraction annotations preserved
  as-is)
- Probes, MoE kernel coverage, reproduction runbook

Net effect: doc is more navigable, internally consistent, and the
authoritative summary is at the top. Body content preserves the
investigative reasoning. Diff: +241 / -217 lines, net +24 lines
(intentional — the new Final state section adds explicit
authoritative content).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 30, 2026
…g→functional-API migration (NOT committed)

Captures the analysis surfaced 2026-04-30 around an alternative
to the medium-term thin-adapter refactor: migrate SGLang from
`CuteDslMoEWrapper.run(...)` to `cute_dsl_fused_moe_nvfp4(...)`
directly, then deprecate the wrapper. Lee maintains both fi and
SGLang sides, so the cross-project coordination cost normally
faced by such a migration is essentially zero — but the question
of whether to pursue is not decided.

Heavily caveated as a future consideration that **may not be
pursued**. Priority is explicitly lower than the existing PR queue
(PR flashinfer-ai#3171, flashinfer-ai#3198, bucket-cap-fix, prealloc-fix, convergence) and
Phase 1 in follow-up flashinfer-ai#11.

Captured for two reasons:

1. The 5-minute pass on SGLang's `flashinfer_cutedsl.py:ensure_cutedsl_wrapper()`
   surfaces specific gaps in the functional API that would need to
   be closed before any migration. Documenting them here means
   future-Lee (or future-someone) doesn't have to re-derive them.

2. The thin-adapter refactor and the deprecate-the-wrapper migration
   are alternatives, not sequential. If migration is pursued, the
   thin-adapter is redundant. If it's not, the thin-adapter is the
   recommended cleanup. The relationship needed to be on the record.

Captured fi-side prerequisites (if pursued):

- Add `moe_sort_buffers`, `gemm1_out`, `gemm1_out_scale` parameters
  to `cute_dsl_fused_moe_nvfp4`. Forward to `_moe_core_impl` (which
  already accepts them). ~15 lines, additive (existing wrapper
  keeps working).
- Verify `CuteDslFusedMoENvfp4Runner` per-call construction doesn't
  bust the autotune cache or add overhead.
- CUDA-graph capture/replay validation of `cute_dsl_fused_moe_nvfp4`
  under SGLang-style preallocation.

Captured benefits / risks / decision criteria. Status: NOT decided;
revisit only after the existing PR queue settles.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nv-yunzheq nv-yunzheq added the "v0.6.11 release blocker label for 0.6.11" label Apr 30, 2026
@leejnau leejnau marked this pull request as ready for review April 30, 2026 19:01
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@flashinfer/fused_moe/cute_dsl/tuner.py`:
- Around line 186-188: Replace the Unicode multiplication sign "×" with a plain
ASCII "x" in the comment lines that read "# tile_size=128: 2 GEMM1 tactics × 4
GEMM2 tactics = 8" and "# tile_size=256: 2 GEMM1 tactics × 4 GEMM2 tactics = 8"
(the comment block that also notes "Total: 16 tactics"); edit those exact
comment strings so they use "x" (e.g., "2 GEMM1 tactics x 4 GEMM2 tactics") to
satisfy Ruff RUF003 while keeping the meaning unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c899f78f-8552-4de8-9897-5d3c33d45cf0

📥 Commits

Reviewing files that changed from the base of the PR and between 4e64219 and ee7ae1f40503288825eba9f00fd735eccf9ae7af.

📒 Files selected for processing (3)
  • flashinfer/fused_moe/cute_dsl/moe_utils.py
  • flashinfer/fused_moe/cute_dsl/tuner.py
  • tests/moe/test_cute_dsl_fused_moe.py

Comment on lines 186 to +188
# tile_size=128: 2 GEMM1 tactics × 4 GEMM2 tactics = 8
# Total: 8 tactics
# tile_size=256: 2 GEMM1 tactics × 4 GEMM2 tactics = 8
# Total: 16 tactics

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Replace the Unicode multiplication sign in this comment.

Ruff flags × here as ambiguous (RUF003). Swapping it to plain x keeps the file lint-clean.

Suggested edit
-# tile_size=128: 2 GEMM1 tactics × 4 GEMM2 tactics = 8
-# tile_size=256: 2 GEMM1 tactics × 4 GEMM2 tactics = 8
+# tile_size=128: 2 GEMM1 tactics x 4 GEMM2 tactics = 8
+# tile_size=256: 2 GEMM1 tactics x 4 GEMM2 tactics = 8
🧰 Tools
🪛 Ruff (0.15.12)

[warning] 186-186: Comment contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF003)


[warning] 187-187: Comment contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF003)


@leejnau leejnau force-pushed the fix-cute_dsl-moe-get-max-num-tiles-off-by-one branch from ee7ae1f to 7492ff1 on May 1, 2026 00:37
leejnau added a commit to leejnau/flashinfer that referenced this pull request May 1, 2026
…r-ai#3198 rebased onto main

PR flashinfer-ai#3171 (`fix(cute_dsl/moe): correct tile_size=256 gemm2 tactic
enumeration`) merged into `flashinfer-ai/flashinfer:main`
2026-04-30 as squash-merge commit `070fabf0`.

PR flashinfer-ai#3198 (`fix(cute_dsl/moe): correct off-by-one in get_max_num_tiles
to match TRT-LLM`) was originally stacked on PR flashinfer-ai#3171 (3 commits
on top of `ede2225a`). After flashinfer-ai#3171 merged as a squash-merge, PR
flashinfer-ai#3198 had a test-file conflict against main because git couldn't
reconcile the unsquashed pre-merge commits (`e2330e73` +
`f0cf8cd0`) with the squashed `070fabf0` on main. Rebased PR flashinfer-ai#3198
onto upstream main directly via
`git rebase --onto FETCH_HEAD f0cf8cd0` — drops the now-redundant
PR flashinfer-ai#3171 commits and replays just the off-by-one fix. New HEAD:
`7492ff11`. Single-commit clean branch off main, force-pushed to
fork. PR is now mergeable on GitHub (status went from CONFLICTING
to MERGEABLE / BLOCKED-on-checks).

Audit-doc updates:

- Final state (lines 10-11): PR flashinfer-ai#3171 marked MERGED; PR flashinfer-ai#3198
  reframed as rebased onto main, single clean commit.
- Per-component table (line 451): PR flashinfer-ai#3171 status flipped from
  "pending review" to "MERGED 2026-04-30 as `070fabf0`".
- Issue flashinfer-ai#3067 narrative (line 962): status updated to merged.
- Off-by-one PR section (line 1651): describes the rebase.
- Wrapper-fix gating note (line 1306): added 2026-04-30 update
  that the flashinfer-ai#3171 gating condition is removed.

Memory entries updated separately (closed-audit, prealloc-bias,
PR flashinfer-ai#3198 dedicated, MEMORY.md hook for flashinfer-ai#3198).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…RT-LLM

The previous formula in flashinfer/fused_moe/cute_dsl/moe_utils.py
computed max_num_tiles as L + ceil((E - L) / T), where L =
num_local_experts, T = tile_size, and E = num_tokens * top_k. This
over-allocated by 1 tile whenever (E - L) was not divisible by T.

The moe_sort routing kernel
(include/flashinfer/trtllm/fused_moe/RoutingKernel.cuh) writes exactly
sum_e ceil(K_e / T) entries to tile_idx_to_expert_idx and
tile_idx_to_mn_limit, where K_e is the per-local-expert token count
and sum_e K_e = E. The tight worst-case upper bound on this sum is
(L - 1) + ceil((E - L + 1) / T), achieved when one expert receives
(E - L + 1) tokens and the others each receive 1. By the integer
identity ceil((X+1)/T) = floor(X/T) + 1 (for non-negative X), this
simplifies to L + floor((E - L) / T) = (E + (T - 1) * L) // T, which
is the compact form used in TRT-LLM's
tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py. fi's old
formula exceeded this by 1 whenever (E - L) % T != 0.

The over-allocation never caused correctness issues — the routing
kernel populates a runtime num_non_exiting_tiles count, and downstream
gemm1/gemm2 kernels strictly bound their reads by it (see
blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py and
blockscaled_contiguous_grouped_gemm_finalize_fusion.py). But it
wasted ~1 tile of buffer space per call and produced shape divergence
vs TRT-LLM at the same configuration.

Concrete examples (DeepSeek-V3, top_k=8, num_tokens=16384):

  num_local_experts=32 tile_size=128: 1056 -> 1055
  num_local_experts=32 tile_size=256:  544 ->  543
  num_local_experts=16 tile_size=128: 1040 -> 1039
  num_local_experts=16 tile_size=256:  528 ->  527
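
As a sanity check of the tightness claim above, the closed form can be verified by brute force on small shapes. This is an illustrative sketch only; it re-derives the bound locally rather than calling the flashinfer helper:

```python
# Verify that (E + (T - 1) * L) // T equals the maximum of sum_e ceil(K_e / T)
# over every way of splitting E expanded tokens across L local experts,
# for E > L (the E <= L case is handled separately in the real code).
from math import ceil

def closed_form(E: int, L: int, T: int) -> int:
    return (E + (T - 1) * L) // T

def brute_force_max_tiles(E: int, L: int, T: int) -> int:
    best = 0
    def rec(remaining: int, experts_left: int, acc: int) -> None:
        nonlocal best
        if experts_left == 1:
            best = max(best, acc + ceil(remaining / T))
            return
        for k in range(remaining + 1):
            rec(remaining - k, experts_left - 1, acc + ceil(k / T))
    rec(E, L, 0)
    return best

for E in range(2, 13):
    for L in range(1, min(E, 5)):      # keep E > L
        for T in (2, 3, 4):
            assert brute_force_max_tiles(E, L, T) == closed_form(E, L, T), (E, L, T)
print("closed form matches the brute-force worst case on all small shapes")
```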

## Tests

Adds two no-GPU test classes in tests/moe/test_cute_dsl_fused_moe.py
covering DeepSeek-V3 production shapes at every supported EP
partition (EP=1, 8, 16, 32) × every tile_size (128, 256), generic
MoE shapes representing other model families (Mixtral / DBRX /
mid-large / high-top_k), and edge cases (boundary values, divisibility).

TestGetMaxNumTiles (5 methods):
  test_matches_trtllm_compact_formula (24 parametrized cases): pin
    fi to TRT-LLM's compact closed-form across the full DeepSeek-V3
    EP × tile_size matrix, generic MoE shapes, and edge cases.
  test_worst_case_construction_is_tight (24 parametrized cases):
    build the worst-case routing distribution (one expert gets
    E - L + 1 tokens, the others each get 1) and verify fi's bound
    equals the actual sum_e ceil(K_e / T).
  test_zero_tokens, test_below_or_at_local_experts_threshold:
    early-return-path edge cases.
  test_monotonic_in_num_tokens (11 parametrized cases): sweep
    monotonicity check across the same EP × tile_size matrix and
    generic-model shapes; catches sign errors and off-by-one in
    future refactors.

TestGetMaxNumPermutedTokens (1 method, 24 parametrized cases):
  test_consistent_with_get_max_num_tiles: result equals
    get_max_num_tiles * tile_size exactly, across the same shape
    coverage.

Generic-model shapes parametrized:
  - num_experts=8, top_k=2 (Mixtral-8x7B family) at EP=1
  - num_experts=16, top_k=4 (DBRX family) at EP=1, EP=4
  - num_experts=64, top_k=6 at EP=1, EP=8
  - num_experts=32 (post-EP), top_k=16 at EP=4 (high-top_k regime)

Edge cases parametrized:
  - num_expanded == num_local_experts (early-return boundary)
  - num_expanded just past num_local_experts
  - (E - L) exactly divisible by tile_size
  - (E - L) one short of divisibility
  - num_tokens=1, top_k=1 (smallest meaningful inputs)
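
A minimal pytest-style sketch of the worst-case construction described above (it re-implements the formula locally; the helper name, signature, and parametrization are illustrative, not the actual test code):

```python
import math
import pytest

def max_num_tiles(num_expanded_tokens: int, num_local_experts: int, tile_size: int) -> int:
    # Piecewise bound: for E <= L the tight bound is simply E;
    # beyond that, TRT-LLM's compact closed form applies.
    if num_expanded_tokens <= num_local_experts:
        return num_expanded_tokens
    return (num_expanded_tokens + (tile_size - 1) * num_local_experts) // tile_size

@pytest.mark.parametrize("num_tokens,top_k,num_local_experts,tile_size",
                         [(16384, 8, 32, 128), (16384, 8, 32, 256),
                          (16384, 8, 16, 128), (16384, 8, 16, 256)])
def test_worst_case_distribution_is_tight(num_tokens, top_k, num_local_experts, tile_size):
    E = num_tokens * top_k
    # Worst case: one expert receives E - L + 1 tokens, the other L - 1 get 1 each.
    counts = [E - num_local_experts + 1] + [1] * (num_local_experts - 1)
    actual_tiles = sum(math.ceil(k / tile_size) for k in counts)
    assert actual_tiles == max_num_tiles(E, num_local_experts, tile_size)
```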

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@leejnau leejnau force-pushed the fix-cute_dsl-moe-get-max-num-tiles-off-by-one branch from 7492ff1 to 557ad5a on May 1, 2026 16:18
Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (3)
tests/moe/test_cute_dsl_fused_moe.py (2)

473-523: ⚡ Quick win

Make the parametrization tables immutable.

These shape matrices are class-level lists, so Ruff will flag them with RUF012 as mutable class attributes. Since they're constants used only for parametrization, tuples avoid the lint noise and accidental mutation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/moe/test_cute_dsl_fused_moe.py` around lines 473 - 523, The three
class-level mutable lists _DEEPSEEK_V3_SHAPES, _GENERIC_MOE_SHAPES, and
_EDGE_CASE_SHAPES should be made immutable to avoid RUF012; replace each list
literal with a tuple literal containing the pytest.param entries (i.e., change
[...] to (...)) so the parametrization constants cannot be mutated while
preserving their contents and ids.

471-471: ⚡ Quick win

Replace the Unicode multiplication sign in test prose.

Ruff flags × here (RUF002/RUF003). Using plain x keeps the file lint-clean.

Also applies to: 541-541, 627-627

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/moe/test_cute_dsl_fused_moe.py` at line 471, Replace the Unicode
multiplication sign "×" with a plain ASCII "x" in the test prose comments to
satisfy Ruff (RUF002/RUF003); locate the occurrences such as the comment line "#
num_tokens=16384 (prefill) — full EP × tile_size matrix." (and the similar
occurrences noted later) and update them to "# num_tokens=16384 (prefill) — full
EP x tile_size matrix." so only the character changes and no other text or
formatting is modified.
flashinfer/fused_moe/cute_dsl/moe_utils.py (1)

48-59: ⚡ Quick win

Document the piecewise bound explicitly.

The docstring reads as if (E + (T - 1) * L) // T is the tight bound for all inputs, but the implementation still needs the num_expanded_tokens <= num_local_experts branch. For E <= L, the compact form overestimates (for example E=16, L=32, T=128 gives 31 instead of 16), so it would help to call out that the closed form only applies once E > L.

📝 Suggested doc tweak
-    Mirrors TRT-LLM's ``GroupedGemmInputsHelper.get_max_num_tiles()`` in
-    ``tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py``. The compact
-    closed-form expression is the tight worst-case bound on
+    Mirrors TRT-LLM's ``GroupedGemmInputsHelper.get_max_num_tiles()`` in
+    ``tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py``. For ``E <= L``,
+    the tight bound is simply ``E``. Once ``E > L``, the compact
+    closed-form expression is the tight worst-case bound on

Also applies to: 75-76

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/fused_moe/cute_dsl/moe_utils.py` around lines 48 - 59, Update the
docstring for the max-tiles calculation (the function mirroring TRT-LLM's
get_max_num_tiles) to explicitly state the piecewise nature: when
num_expanded_tokens (E) <= num_local_experts (L) the tight bound is E (i.e., no
extra padding), and only when E > L does the compact closed-form (E + (T - 1) *
L) // T apply; mention the existing branch that checks num_expanded_tokens <=
num_local_experts and add the same clarification for the duplicate text near
lines 75-76 so readers know why the implementation keeps that branch.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f974c850-dc14-4cad-a91d-ca022431bd8e

📥 Commits

Reviewing files that changed from the base of the PR and between 7492ff11d4ec855a235d8bfcef092c1a6ac2a8da and 557ad5a.

📒 Files selected for processing (2)
  • flashinfer/fused_moe/cute_dsl/moe_utils.py
  • tests/moe/test_cute_dsl_fused_moe.py

@nv-yunzheq
Collaborator

/bot run

@flashinfer-bot
Collaborator

GitLab MR !621 has been created, and the CI pipeline #50037953 is currently running. I'll report back once the pipeline job completes.

Collaborator

@nv-yunzheq nv-yunzheq left a comment


Approved as the unit test passed on GB300

@nv-yunzheq nv-yunzheq merged commit 393e83e into flashinfer-ai:main May 4, 2026
61 of 68 checks passed
@leejnau leejnau deleted the fix-cute_dsl-moe-get-max-num-tiles-off-by-one branch May 4, 2026 19:32
leejnau added a commit to leejnau/flashinfer that referenced this pull request May 4, 2026
…erged 2026-05-04

PR flashinfer-ai#3198 merged 2026-05-04 18:58 UTC as squash-merge commit `393e83ea`.
fi's `get_max_num_tiles` formula now matches TRT-LLM's compact closed-
form, eliminating the +1 tile over-allocation when (E - L) % T != 0.

Updates the line-11 final-state entry and the line-4369 PR queue
annotation in follow-up flashinfer-ai#12.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
aleozlx pushed a commit that referenced this pull request May 6, 2026
…me input (#3216)


## 📌 Description

The autotuner's `DynamicTensorSpec` in
`flashinfer/fused_moe/cute_dsl/tuner.py` declared `gen_tuning_buckets`
as the pre-computed tuple `get_hybrid_num_tokens_buckets(8192)` and
`map_to_tuning_buckets` as `lambda x: map_to_hybrid_bucket(x,8192)`. The
hardcoded 8192 cap silently clamped any runtime workload larger than
that to the 8192-bucket's cached tactic — at DeepSeek-V3 prefill
(N=16384) fi profiled at half the per-expert workload and used a tactic
optimized for the wrong shape.

This PR replaces the pre-computed tuple with the bare callable form
(`get_hybrid_num_tokens_buckets`) and switches the mapper to the
uncapped variant `map_to_hybrid_bucket_uncapped` (added alongside the
hybrid-bucket scheme for exactly this case). The autotuner now invokes
them with the actual input dim at autotune time, matching TRT-LLM's
pattern at `cute_dsl_custom_ops.py:2390-2391` and flashinfer's own
pattern at `gemm/gemm_base.py:_FP8_GEMM_SM100_TUNING_CONFIG`.
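
A self-contained toy illustration of the capping behavior (these helpers only mimic power-of-2 bucketing; they are not the real `get_hybrid_num_tokens_buckets` / `map_to_hybrid_bucket_uncapped` implementations):

```python
def next_pow2(n: int) -> int:
    b = 1
    while b < n:
        b *= 2
    return b

def map_capped(n: int, cap: int = 8192) -> int:
    # old behaviour: anything above the cap collapses onto the cap bucket
    return min(next_pow2(n), cap)

def map_uncapped(n: int) -> int:
    # new behaviour: the bucket tracks the actual input size
    return next_pow2(n)

print(map_capped(16384))    # 8192  -> profiled at half the real workload
print(map_uncapped(16384))  # 16384 -> profiled at the real shape
```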

## 🔍 Related Issues

#3171
#3198
#3115

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes



## Summary by CodeRabbit

* **Bug Fixes**
  * MoE autotuner now uses uncapped dynamic hybrid bucket mapping instead of a fixed-bounded set, improving adaptation to varying input token sizes.

* **Tests**
  * Added offline tests validating autotuner bucket configuration: dynamic bucket generation, responsiveness to input size, monotonic mapping behavior, large-input scaling, and alignment with expected power-of-2 bucket values.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Labels

op: moe, run-ci, v0.6.11 release blocker label for 0.6.11
