fix(cute_dsl/moe): correct off-by-one in get_max_num_tiles to match TRT-LLM#3198
Conversation
Code Review
This pull request updates the MoE utility functions and tuner to support 2-CTA variants with a tile size of 256, ensuring layout consistency between GEMM1 and GEMM2. The get_max_num_tiles function has been refactored to use a compact closed-form expression that provides a tight upper bound on the number of tiles, mirroring TRT-LLM's implementation. Additionally, new unit tests have been introduced to verify tactic enumeration invariants and the accuracy of the tile count calculations. I have no feedback to provide.
…mula divergence Updates the addendum's "+1 tile bucketing-formula divergence" subsection to reference upstream PR flashinfer-ai#3198, which fixes get_max_num_tiles by replacing fi's L + ceil((E - L) / T) with TRT-LLM's compact closed-form (E + (T - 1) * L) // T. PR is awaiting review and is based on flashinfer-ai#3067. Also adds a parenthetical clarifying that the perf impact of the +1 over-allocation was effectively zero at runtime (since the kernel work is bounded by num_non_exiting_tiles, not buffer capacity), to prevent future readers from over-attributing the EP=8 gap to this formula bug. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… bucket-cap fixes

Updates the post-close addendum and the flashinfer-ai#9 sub-section to reflect the 2026-04-28 root-cause work. The audit's headline +7-9% EP perf gaps at N=16384 (EP=8 +8.6% kernel, EP=16 +13.7% kernel) are driven by two paired CuteDSL MoE autotune bugs:

1. CuteDslMoEWrapper prealloc-bias (correctness): the wrapper prealloc'd buffers sized for self.tile_size only, biasing autotune profiling against the other tile_size. Without the fix, fi is locked to tile_size=128 in 14/14 cache entries. Side branch cute-dsl-moe-wrapper-prealloc-bias-fix at commit a4137e6.

2. Autotuner bucket-cap mismatch (headline-perf): tuner.py:281 caps profile buckets at 8192 while runtime can be N=16384. fi profiles at half the per-expert workload, and autotune noise on tight margins (0.58% at EP=8, bucket=8192) destabilizes the cached choice. 1-line fix: bump the cap to 16384.

Validation at --num-iters 100 --warmup 10 (3 runs each):
- no fixes: EP=8 +5.5%/9.5pp, EP=16 +1.8%/9.7pp
- prealloc-fix only: EP=8 +4.4%/11pp, EP=16 +2.4%/10.5pp
- both fixes: EP=8 -0.8%/1.1pp, EP=16 -5.6%/10pp

Prealloc-fix alone does not close the headline gap; the bucket-cap fix is what closes it. Both fixes are required. The EP=16 10pp spread under both fixes is trt's autotune coin-flip on its 0.08% tile=128 vs tile=256 profile-time margin -- fi's chosen tactic is stable across all 3 runs; trt's is the noise source.

Several earlier 2026-04-28 revisions of this addendum had wrong framings (EP=16 fully closed by prealloc-fix alone at --num-iters 1 was a lucky 2-run sample; necessary-but-not-sufficient at EP=8 was partly true at --num-iters 1, but the residual was the bucket-cap mismatch, not an unfixable deeper issue). The final state captured here supersedes all earlier revisions in this same audit doc.

Neither fix is upstream as of 2026-04-28 -- held pending flashinfer-ai#3067 (fix-cutedsl-moe-gemm2-tactic-enum-tile256) and flashinfer-ai#3198 (get_max_num_tiles off-by-one) merging first.

Thin-adapter refactor remains the recommended medium-term direction; both short-term fixes patch symptoms in code the refactor would delete.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…, mark superseded sub-sections

Lee asked for a thorough audit-of-the-audit pass: look for inconsistencies, discrepancies, redundancies, and opportunities for compaction/simplification. After reading the full 3952-line doc, the following systematic edits were applied:

Top-of-doc:
- Added a "Final state (2026-04-29) — read this first" section at the very top (immediately after the title, before the chronological correction log). Captures: static audit summary, both upstream PRs (flashinfer-ai#3067, flashinfer-ai#3198), the two paired wrapper bugs and their side branches, the comprehensive 30-cell validation matrix, the three-regime classification (large-N closure / medium-N wrapper-overhead-residual / small-N noise), the EP=1 + version-skew explanation, and the closure of the empirical thread. This section is now authoritative; the rest of the doc is the chronological investigation record leading to it.
- Updated the title's "last updated" tag to 2026-04-29.

Stale references fixed:
- The flashinfer-ai#3067 fix branch was referred to via the stale commit hashes d291d17e and e2330e73/f0cf8cd0 in three places; updated to the current 57de5f8 + 38293a0 (with notes that the original commit-hash references are preserved for traceability).
- The "Tunable runner" row in the Summary table marked tile_size=256 as "gated off"; updated to "fixed via PR flashinfer-ai#3067 (pending review)".
- Item 3 of the corrections list: "Recommended fix (drafted, not yet applied)" → "Fix landed and shipped as upstream PR flashinfer-ai#3067".

Duplicate content removed:
- Two "What still stands (unaffected by the autotune-context bug)" blocks were nearly identical; the first was removed and replaced with a one-line pointer to the consolidated version below.

Executive verdict updated:
- Rewritten to reflect current state: now references both PRs, the paired-fix story, the comprehensive 30-cell sweep matrix, and the +13.7% kernel-level closure. Previously it described "Leading hypothesis (NOT YET EMPIRICALLY VERIFIED)" framings that have since been confirmed and shipped.

Superseded sub-sections marked:
- "EP=8 vs EP=16: the fix has different impact at each" — was an early "necessary but not sufficient" framing under prealloc-only state; replaced with a SUPERSEDED note pointing to the paired-fix story below (the bucket-cap-fix discovery is what closed both EPs).
- "EP=16 confirmation (2026-04-28)" header clarified to indicate it's a single-iter prealloc-only snapshot; the H_A/H_B hypothesis-discrimination logic and trt 0.08% margin observation are still load-bearing and preserved.
- "Final cross-N validation (2026-04-28)" renamed to "Initial 6-cell cross-N spot-check" with a pointer to the comprehensive 30-cell sweep that supersedes it.
- Status sub-section: removed outdated EP=8 partial-closure / EP=16 full-closure bullets (those were prealloc-only state); replaced with a concise paired-fix branch summary and a pointer to the comprehensive sweep for per-cell results.
- Implications for the audit's earlier claims: updated the "dominant factor is the prealloc bias" bullet to the paired-fix framing (prealloc + bucket-cap, both required).

Follow-ups updated:
- Open follow-ups remaining: appended an "Update 2026-04-29 (final): flashinfer-ai#9 also closed" note documenting that the EP=8/EP=16 perf gap was root-caused to two paired wrapper bugs, with the empirical investigation thread now closed.
- Item flashinfer-ai#9 in the chronological "Follow-ups queued for later sessions" list: added a RESOLVED 2026-04-29 header with pointer to the paired-fix story, marking the 9a / 9b / 9c sub-mechanism candidates as pre-resolution candidate hypotheses (preserved as record of the search space, not blocking work).

What was preserved as historical record (not removed):
- Full "2026-04-24/25/28 correction" chronological log
- Per-component static audit detail
- Issue flashinfer-ai#3067 narrative, including the original "Leading hypothesis" framing and the post-fix verification table
- The 22-day candidate-root-cause investigation that ruled out 5 mechanisms before landing on the paired wrapper bugs
- Part 2 baseline numbers (with retraction annotations preserved as-is)
- Probes, MoE kernel coverage, reproduction runbook

Net effect: the doc is more navigable and internally consistent, and the authoritative summary is at the top. Body content preserves the investigative reasoning. Diff: +241 / -217 lines, net +24 lines (intentional — the new Final state section adds explicit authoritative content).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…g→functional-API migration (NOT committed)

Captures the analysis surfaced 2026-04-30 around an alternative to the medium-term thin-adapter refactor: migrate SGLang from `CuteDslMoEWrapper.run(...)` to `cute_dsl_fused_moe_nvfp4(...)` directly, then deprecate the wrapper. Lee maintains both the fi and SGLang sides, so the cross-project coordination cost normally faced by such a migration is essentially zero — but whether to pursue it is not decided.

Heavily caveated as a future consideration that **may not be pursued**. Priority is explicitly lower than the existing PR queue (PR flashinfer-ai#3171, flashinfer-ai#3198, bucket-cap-fix, prealloc-fix, convergence) and Phase 1 in follow-up flashinfer-ai#11.

Captured for two reasons:
1. The 5-minute pass on SGLang's `flashinfer_cutedsl.py:ensure_cutedsl_wrapper()` surfaces specific gaps in the functional API that would need to be closed before any migration. Documenting them here means future-Lee (or future-someone) doesn't have to re-derive them.
2. The thin-adapter refactor and the deprecate-the-wrapper migration are alternatives, not sequential. If migration is pursued, the thin-adapter is redundant. If it's not, the thin-adapter is the recommended cleanup. The relationship needed to be on the record.

Captured fi-side prerequisites (if pursued):
- Add `moe_sort_buffers`, `gemm1_out`, `gemm1_out_scale` parameters to `cute_dsl_fused_moe_nvfp4`. Forward to `_moe_core_impl` (which already accepts them). ~15 lines, additive (existing wrapper keeps working).
- Verify `CuteDslFusedMoENvfp4Runner` per-call construction doesn't bust the autotune cache or add overhead.
- CUDA-graph capture/replay validation of `cute_dsl_fused_moe_nvfp4` under SGLang-style preallocation.

Captured benefits / risks / decision criteria. Status: NOT decided; revisit only after the existing PR queue settles.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@flashinfer/fused_moe/cute_dsl/tuner.py`:
- Around line 186-188: Replace the Unicode multiplication sign "×" with a plain
ASCII "x" in the comment lines that read "# tile_size=128: 2 GEMM1 tactics × 4
GEMM2 tactics = 8" and "# tile_size=256: 2 GEMM1 tactics × 4 GEMM2 tactics = 8"
(the comment block that also notes "Total: 16 tactics"); edit those exact
comment strings so they use "x" (e.g., "2 GEMM1 tactics x 4 GEMM2 tactics") to
satisfy Ruff RUF003 while keeping the meaning unchanged.
📥 Commits
Reviewing files that changed from the base of the PR and between 4e64219 and ee7ae1f40503288825eba9f00fd735eccf9ae7af.
📒 Files selected for processing (3)
- flashinfer/fused_moe/cute_dsl/moe_utils.py
- flashinfer/fused_moe/cute_dsl/tuner.py
- tests/moe/test_cute_dsl_fused_moe.py
# tile_size=128: 2 GEMM1 tactics × 4 GEMM2 tactics = 8
# Total: 8 tactics
# tile_size=256: 2 GEMM1 tactics × 4 GEMM2 tactics = 8
# Total: 16 tactics
Replace the Unicode multiplication sign in this comment.
Ruff flags × here as ambiguous (RUF003). Swapping it to plain x keeps the file lint-clean.
Suggested edit

-# tile_size=128: 2 GEMM1 tactics × 4 GEMM2 tactics = 8
-# tile_size=256: 2 GEMM1 tactics × 4 GEMM2 tactics = 8
+# tile_size=128: 2 GEMM1 tactics x 4 GEMM2 tactics = 8
+# tile_size=256: 2 GEMM1 tactics x 4 GEMM2 tactics = 8
🧰 Tools
🪛 Ruff (0.15.12)
[warning] 186-186: Comment contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?
(RUF003)
[warning] 187-187: Comment contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?
(RUF003)
Force-pushed: ee7ae1f → 7492ff1
…r-ai#3198 rebased onto main

PR flashinfer-ai#3171 (`fix(cute_dsl/moe): correct tile_size=256 gemm2 tactic enumeration`) merged into `flashinfer-ai/flashinfer:main` 2026-04-30 as squash-merge commit `070fabf0`.

PR flashinfer-ai#3198 (`fix(cute_dsl/moe): correct off-by-one in get_max_num_tiles to match TRT-LLM`) was originally stacked on PR flashinfer-ai#3171 (3 commits on top of `ede2225a`). After flashinfer-ai#3171 merged as a squash-merge, PR flashinfer-ai#3198 had a test-file conflict against main because git couldn't reconcile the unsquashed pre-merge commits (`e2330e73` + `f0cf8cd0`) with the squashed `070fabf0` on main.

Rebased PR flashinfer-ai#3198 onto upstream main directly via `git rebase --onto FETCH_HEAD f0cf8cd0` — drops the now-redundant PR flashinfer-ai#3171 commits and replays just the off-by-one fix. New HEAD: `7492ff11`. Single-commit clean branch off main, force-pushed to fork. PR is now mergeable on GitHub (status went from CONFLICTING to MERGEABLE / BLOCKED-on-checks).

Audit-doc updates:
- Final state (lines 10-11): PR flashinfer-ai#3171 marked MERGED; PR flashinfer-ai#3198 reframed as rebased onto main, single clean commit.
- Per-component table (line 451): PR flashinfer-ai#3171 status flipped from "pending review" to "MERGED 2026-04-30 as `070fabf0`".
- Issue flashinfer-ai#3067 narrative (line 962): status updated to merged.
- Off-by-one PR section (line 1651): describes the rebase.
- Wrapper-fix gating note (line 1306): added 2026-04-30 update that the flashinfer-ai#3171 gating condition is removed.

Memory entries updated separately (closed-audit, prealloc-bias, PR flashinfer-ai#3198 dedicated, MEMORY.md hook for flashinfer-ai#3198).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…RT-LLM
The previous formula in flashinfer/fused_moe/cute_dsl/moe_utils.py
computed max_num_tiles as L + ceil((E - L) / T), where L =
num_local_experts, T = tile_size, and E = num_tokens * top_k. This
over-allocated by 1 tile whenever (E - L) was not divisible by T.
The moe_sort routing kernel
(include/flashinfer/trtllm/fused_moe/RoutingKernel.cuh) writes exactly
sum_e ceil(K_e / T) entries to tile_idx_to_expert_idx and
tile_idx_to_mn_limit, where K_e is the per-local-expert token count
and sum_e K_e = E. The tight worst-case upper bound on this sum is
(L - 1) + ceil((E - L + 1) / T), achieved when one expert receives
(E - L + 1) tokens and the others each receive 1. By the integer
identity ceil((X+1)/T) = floor(X/T) + 1 (for non-negative X), this
simplifies to L + floor((E - L) / T) = (E + (T - 1) * L) // T, which
is the compact form used in TRT-LLM's
tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py. fi's old
formula exceeded this by 1 whenever (E - L) % T != 0.
The over-allocation never caused correctness issues — the routing
kernel populates a runtime num_non_exiting_tiles count, and downstream
gemm1/gemm2 kernels strictly bound their reads by it (see
blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py and
blockscaled_contiguous_grouped_gemm_finalize_fusion.py). But it
wasted ~1 tile of buffer space per call and produced shape divergence
vs TRT-LLM at the same configuration.
Concrete examples (DeepSeek-V3, top_k=8, num_tokens=16384):
num_local_experts=32 tile_size=128: 1056 -> 1055
num_local_experts=32 tile_size=256: 544 -> 543
num_local_experts=16 tile_size=128: 1040 -> 1039
num_local_experts=16 tile_size=256: 528 -> 527
## Tests
Adds two no-GPU test classes in tests/moe/test_cute_dsl_fused_moe.py
covering DeepSeek-V3 production shapes at every supported EP
partition (EP=1, 8, 16, 32) × every tile_size (128, 256), generic
MoE shapes representing other model families (Mixtral / DBRX /
mid-large / high-top_k), and edge cases (boundary values, divisibility).
TestGetMaxNumTiles (5 methods):
test_matches_trtllm_compact_formula (24 parametrized cases): pin
fi to TRT-LLM's compact closed-form across the full DeepSeek-V3
EP × tile_size matrix, generic MoE shapes, and edge cases.
test_worst_case_construction_is_tight (24 parametrized cases):
build the worst-case routing distribution (one expert gets
E - L + 1 tokens, the others each get 1) and verify fi's bound
equals the actual sum_e ceil(K_e / T).
test_zero_tokens, test_below_or_at_local_experts_threshold:
early-return-path edge cases.
test_monotonic_in_num_tokens (11 parametrized cases): sweep
monotonicity check across the same EP × tile_size matrix and
generic-model shapes; catches sign errors and off-by-one in
future refactors.
TestGetMaxNumPermutedTokens (1 method, 24 parametrized cases):
test_consistent_with_get_max_num_tiles: result equals
get_max_num_tiles * tile_size exactly, across the same shape
coverage.
Generic-model shapes parametrized:
- num_experts=8, top_k=2 (Mixtral-8x7B family) at EP=1
- num_experts=16, top_k=4 (DBRX family) at EP=1, EP=4
- num_experts=64, top_k=6 at EP=1, EP=8
- num_experts=32 (post-EP), top_k=16 at EP=4 (high-top_k regime)
Edge cases parametrized:
- num_expanded == num_local_experts (early-return boundary)
- num_expanded just past num_local_experts
- (E - L) exactly divisible by tile_size
- (E - L) one short of divisibility
- num_tokens=1, top_k=1 (smallest meaningful inputs)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
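The tightness argument behind test_worst_case_construction_is_tight can be sketched independently of the test file (a minimal illustration of the adversarial routing described above; names are illustrative, not the real test helpers):

```python
import math


def max_num_tiles(num_tokens, top_k, num_local_experts, tile_size):
    # compact closed form; the early return handles E <= L
    e = num_tokens * top_k
    if e <= num_local_experts:
        return e
    return (e + (tile_size - 1) * num_local_experts) // tile_size


def worst_case_tiles(num_tokens, top_k, num_local_experts, tile_size):
    # adversarial routing: one expert receives E - L + 1 tokens,
    # each remaining expert receives exactly 1 token
    e = num_tokens * top_k
    per_expert = [e - num_local_experts + 1] + [1] * (num_local_experts - 1)
    assert sum(per_expert) == e
    # the routing kernel writes sum_e ceil(K_e / T) tile entries
    return sum(math.ceil(k / tile_size) for k in per_expert)


# The bound equals the actual worst-case tile count across the
# DeepSeek-V3 EP x tile_size matrix used in the tests.
for l_ in (16, 32):
    for t in (128, 256):
        assert worst_case_tiles(16384, 8, l_, t) == max_num_tiles(16384, 8, l_, t)
```

This mirrors the identity in the commit message: (L - 1) + ceil((E - L + 1) / T) collapses to L + floor((E - L) / T), so the closed form is achieved, not just an upper bound.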
Force-pushed: 7492ff1 → 557ad5a
🧹 Nitpick comments (3)
tests/moe/test_cute_dsl_fused_moe.py (2)
473-523: ⚡ Quick win: Make the parametrization tables immutable.
These shape matrices are class-level lists, so Ruff will flag them with `RUF012` as mutable class attributes. Since they're constants used only for parametrization, tuples avoid the lint noise and accidental mutation.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/moe/test_cute_dsl_fused_moe.py` around lines 473 - 523, The three class-level mutable lists _DEEPSEEK_V3_SHAPES, _GENERIC_MOE_SHAPES, and _EDGE_CASE_SHAPES should be made immutable to avoid RUF012; replace each list literal with a tuple literal containing the pytest.param entries (i.e., change [...] to (...)) so the parametrization constants cannot be mutated while preserving their contents and ids.
471-471: ⚡ Quick win: Replace the Unicode multiplication sign in test prose.
Ruff flags `×` here (RUF002/RUF003). Using plain `x` keeps the file lint-clean. Also applies to: 541-541, 627-627
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/moe/test_cute_dsl_fused_moe.py` at line 471, Replace the Unicode multiplication sign "×" with a plain ASCII "x" in the test prose comments to satisfy Ruff (RUF002/RUF003); locate the occurrences such as the comment line "# num_tokens=16384 (prefill) — full EP × tile_size matrix." (and the similar occurrences noted later) and update them to "# num_tokens=16384 (prefill) — full EP x tile_size matrix." so only the character changes and no other text or formatting is modified.
flashinfer/fused_moe/cute_dsl/moe_utils.py (1)
48-59: ⚡ Quick win: Document the piecewise bound explicitly.
The docstring reads as if `(E + (T - 1) * L) // T` is the tight bound for all inputs, but the implementation still needs the `num_expanded_tokens <= num_local_experts` branch. For `E <= L`, the compact form overestimates (for example `E=16, L=32, T=128` gives `31` instead of `16`), so it would help to call out that the closed form only applies once `E > L`.
📝 Suggested doc tweak

-    Mirrors TRT-LLM's ``GroupedGemmInputsHelper.get_max_num_tiles()`` in
-    ``tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py``. The compact
-    closed-form expression is the tight worst-case bound on
+    Mirrors TRT-LLM's ``GroupedGemmInputsHelper.get_max_num_tiles()`` in
+    ``tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py``. For ``E <= L``,
+    the tight bound is simply ``E``. Once ``E > L``, the compact
+    closed-form expression is the tight worst-case bound on

Also applies to: 75-76
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/fused_moe/cute_dsl/moe_utils.py` around lines 48 - 59, Update the docstring for the max-tiles calculation (the function mirroring TRT-LLM's get_max_num_tiles) to explicitly state the piecewise nature: when num_expanded_tokens (E) <= num_local_experts (L) the tight bound is E (i.e., no extra padding), and only when E > L does the compact closed-form (E + (T - 1) * L) // T apply; mention the existing branch that checks num_expanded_tokens <= num_local_experts and add the same clarification for the duplicate text near lines 75-76 so readers know why the implementation keeps that branch.
📥 Commits
Reviewing files that changed from the base of the PR and between 7492ff11d4ec855a235d8bfcef092c1a6ac2a8da and 557ad5a.
📒 Files selected for processing (2)
- flashinfer/fused_moe/cute_dsl/moe_utils.py
- tests/moe/test_cute_dsl_fused_moe.py
/bot run
nv-yunzheq
left a comment
Approved, as the unit tests passed on GB300
…erged 2026-05-04

PR flashinfer-ai#3198 merged 2026-05-04 18:58 UTC as squash-merge commit `393e83ea`. fi's `get_max_num_tiles` formula now matches TRT-LLM's compact closed-form, eliminating the +1 tile over-allocation when (E - L) % T != 0.

Updates the line-11 final-state entry and the line-4369 PR queue annotation in follow-up flashinfer-ai#12.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…me input (#3216)

## 📌 Description

The autotuner's `DynamicTensorSpec` in `flashinfer/fused_moe/cute_dsl/tuner.py` declared `gen_tuning_buckets` as the pre-computed tuple `get_hybrid_num_tokens_buckets(8192)` and `map_to_tuning_buckets` as `lambda x: map_to_hybrid_bucket(x, 8192)`. The hardcoded 8192 cap silently clamped any runtime workload larger than that to the 8192-bucket's cached tactic — at DeepSeek-V3 prefill (N=16384), fi profiled at half the per-expert workload and used a tactic optimized for the wrong shape.

This PR replaces the pre-computed tuple with the bare callable form (`get_hybrid_num_tokens_buckets`) and switches the mapper to the uncapped variant `map_to_hybrid_bucket_uncapped` (added alongside the hybrid-bucket scheme for exactly this case). The autotuner now invokes them with the actual input dim at autotune time, matching TRT-LLM's pattern at `cute_dsl_custom_ops.py:2390-2391` and flashinfer's own pattern at `gemm/gemm_base.py:_FP8_GEMM_SM100_TUNING_CONFIG`.

## 🔍 Related Issues

#3171 #3198 #3115

## 🚀 Pull Request Checklist

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Summary by CodeRabbit

* **Bug Fixes**
  * MoE autotuner now uses uncapped dynamic hybrid bucket mapping instead of a fixed-bounded set, improving adaptation to varying input token sizes.
* **Tests**
  * Added offline tests validating autotuner bucket configuration: dynamic bucket generation, responsiveness to input size, monotonic mapping behavior, large-input scaling, and alignment with expected power-of-2 bucket values.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
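The effect of the hardcoded cap can be illustrated with a simplified power-of-2 bucket mapper (a hypothetical sketch only; flashinfer's actual `get_hybrid_num_tokens_buckets` / `map_to_hybrid_bucket` use a hybrid scheme whose details are not reproduced here):

```python
def map_to_bucket_capped(num_tokens: int, cap: int) -> int:
    # round num_tokens up to the next power of two, then clamp to cap
    bucket = 1
    while bucket < num_tokens:
        bucket *= 2
    return min(bucket, cap)


def map_to_bucket_uncapped(num_tokens: int) -> int:
    # same rounding, but no clamp: the bucket tracks the real workload
    bucket = 1
    while bucket < num_tokens:
        bucket *= 2
    return bucket


# With a hardcoded 8192 cap, a DeepSeek-V3 prefill at N=16384 is
# profiled at the 8192 bucket -- half the real per-expert workload --
# so the cached tactic is tuned for the wrong shape.
assert map_to_bucket_capped(16384, 8192) == 8192
assert map_to_bucket_uncapped(16384) == 16384
```

Passing the bare callable instead of a pre-computed tuple is what lets the autotuner evaluate the mapper against the actual input dim at autotune time.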
📌 Description
Replace `get_max_num_tiles` in `flashinfer/fused_moe/cute_dsl/moe_utils.py` with TRT-LLM's compact closed-form. The previous formula over-allocated by 1 tile whenever `(E - L) % T != 0`, where `E = num_tokens * top_k`, `L = num_local_experts`, `T = tile_size`.

Fix
Provably equivalent to `L + floor((E - L) / T)`, the tight worst-case bound on `sum_e ceil(K_e / T)`. Matches TRT-LLM's `tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py`.

Concrete examples at `top_k=8`, `num_tokens=16384`:
- `num_local_experts=32`, `tile_size=128`: 1056 → 1055
- `num_local_experts=32`, `tile_size=256`: 544 → 543

🔍 Related Issues
#3171
#2398
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.
🧪 Tests
- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).
Reviewer Notes
Summary by CodeRabbit
- Refactor
- Tests