
fix(cute_dsl/moe): correct off-by-one in get_max_num_tiles to match TRT-LLM #3198

Merged
nv-yunzheq merged 2 commits into flashinfer-ai:main from
leejnau:fix-cute_dsl-moe-get-max-num-tiles-off-by-one
May 4, 2026

Conversation

@leejnau
Contributor

@leejnau leejnau commented Apr 28, 2026

📌 Description

Replace get_max_num_tiles in flashinfer/fused_moe/cute_dsl/moe_utils.py with TRT-LLM's compact closed-form. The previous formula over-allocated by 1 tile whenever (E - L) % T != 0, where E = num_tokens * top_k, L = num_local_experts, T = tile_size.

Fix

return (num_expanded_tokens + (tile_size - 1) * num_local_experts) // tile_size

This is provably equivalent to L + floor((E - L) / T), the tight worst-case bound on sum_e ceil(K_e / T), and matches TRT-LLM's tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py.

Concrete examples at top_k=8, num_tokens=16384:

  • num_local_experts=32, tile_size=128: 1056 → 1055
  • num_local_experts=32, tile_size=256: 544 → 543
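
A quick standalone sanity check of the two examples (old_bound / new_bound re-implement the before/after formulas from this description; they are illustrative helpers, not the actual flashinfer functions):

```python
def old_bound(E: int, L: int, T: int) -> int:
    # previous formula: L + ceil((E - L) / T), using exact integer ceil division
    return L + -(-(E - L) // T)

def new_bound(E: int, L: int, T: int) -> int:
    # TRT-LLM's compact closed form: (E + (T - 1) * L) // T
    return (E + (T - 1) * L) // T

E = 16384 * 8  # num_tokens * top_k
for L, T in [(32, 128), (32, 256)]:
    print(f"L={L} T={T}: {old_bound(E, L, T)} -> {new_bound(E, L, T)}")
# L=32 T=128: 1056 -> 1055
# L=32 T=256: 544 -> 543
```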

🔍 Related Issues

#3171

#2398

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Refactor

    • Improved MOE tile computation to use a single tight closed-form bound, simplifying logic and ensuring consistent worst-case tiling.
  • Tests

    • Added comprehensive tests validating the new tile-bound across edge cases (including zero tokens and threshold cases), monotonic behavior, and exact tightness between tiled and permuted token counts.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 28, 2026

📝 Walkthrough

Refactors get_max_num_tiles to a single closed-form worst-case bound and updates its docstring. Adds unit tests that assert the bound's tightness, edge cases, monotonicity, and consistency between get_max_num_permuted_tokens and get_max_num_tiles.

Changes

Core Logic Update (flashinfer/fused_moe/cute_dsl/moe_utils.py):
Replaced prior conditional/greedy logic with a unified closed-form formula, (num_expanded_tokens + (tile_size - 1) * num_local_experts) // tile_size, for get_max_num_tiles. Docstring rewritten to explain the tight-bound rationale and the tile_size correspondence to moe_sort parameters.

Test Coverage (tests/moe/test_cute_dsl_fused_moe.py):
Added tests validating get_max_num_tiles against worst-case per-expert distributions, boundary cases (including zero tokens and the threshold where expanded ≤ local experts), monotonicity over num_tokens, and exact equality between get_max_num_permuted_tokens and get_max_num_tiles * tile_size.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

run-ci

Suggested reviewers

  • nv-yunzheq
  • yzh119
  • cyx-6
  • djmmoss
  • samuellees

Poem

🐰 I hopped through tiles and counts so neat,
Replaced the branches with one tidy beat,
Tests sniffed out edges, tight as a seam,
Now bounds are snug — a rabbit's dream,
🌿✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Title check (✅ Passed): The title accurately summarizes the main change: fixing an off-by-one error in get_max_num_tiles to align with TRT-LLM's implementation, which is the core objective of the PR.
  • Description check (✅ Passed): The PR description is comprehensive, including a clear explanation of the fix, the mathematical rationale, concrete examples, related issues, and completed checklist items for pre-commit checks and tests.
  • Docstring Coverage (✅ Passed): Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.
  • Linked Issues check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): Check skipped because no linked issues were found for this pull request.


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request updates the MoE utility functions and tuner to support 2-CTA variants with a tile size of 256, ensuring layout consistency between GEMM1 and GEMM2. The get_max_num_tiles function has been refactored to use a compact closed-form expression that provides a tight upper bound on the number of tiles, mirroring TRT-LLM's implementation. Additionally, new unit tests have been introduced to verify tactic enumeration invariants and the accuracy of the tile count calculations. I have no feedback to provide.

leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 28, 2026
…mula divergence

Updates the addendum's "+1 tile bucketing-formula divergence" subsection
to reference upstream PR flashinfer-ai#3198, which fixes get_max_num_tiles by
replacing fi's L + ceil((E - L) / T) with TRT-LLM's compact closed-form
(E + (T - 1) * L) // T. PR is awaiting review and is based on flashinfer-ai#3067.

Also adds a parenthetical clarifying that the perf impact of the +1
over-allocation was effectively zero at runtime (since the kernel
work is bounded by num_non_exiting_tiles, not buffer capacity), to
prevent future readers from over-attributing the EP=8 gap to this
formula bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 28, 2026
… bucket-cap fixes

Updates the post-close addendum and the flashinfer-ai#9 sub-section to reflect the
2026-04-28 root-cause work. The audit's headline +7-9% EP perf gaps at
N=16384 (EP=8 +8.6% kernel, EP=16 +13.7% kernel) are driven by two
paired CuteDSL MoE autotune bugs:

1. CuteDslMoEWrapper prealloc-bias (correctness): wrapper prealloc'd
   buffers sized for self.tile_size only, biasing autotune profiling
   against the other tile_size. Without the fix, fi is locked to
   tile_size=128 in 14/14 cache entries. Side branch
   cute-dsl-moe-wrapper-prealloc-bias-fix at commit a4137e6.
2. Autotuner bucket-cap mismatch (headline-perf): tuner.py:281 caps
   profile buckets at 8192 while runtime can be N=16384. fi profiles
   at half the per-expert workload, autotune-noise on tight margins
   (0.58% at EP=8 bucket=8192) destabilizes the cached choice. 1-line
   fix: bump cap to 16384.

Validation at --num-iters 100 --warmup 10 (3 runs each):

  no fixes:           EP=8 +5.5%/9.5pp  EP=16 +1.8%/9.7pp
  prealloc-fix only:  EP=8 +4.4%/11pp   EP=16 +2.4%/10.5pp
  both fixes:         EP=8 -0.8%/1.1pp  EP=16 -5.6%/10pp

Prealloc-fix alone does not close the headline gap. The bucket-cap-fix
is what closes it. Both fixes are required.

The EP=16 10pp spread under both-fixes is trt's autotune coin-flip
on its 0.08% tile=128 vs tile=256 profile-time margin -- fi's chosen
tactic is stable across all 3 runs; trt's is the noise source.

Several earlier 2026-04-28 revisions to this addendum had wrong
framings (EP=16 fully closed by prealloc-fix alone at --num-iters 1
was a lucky 2-run sample; necessary-but-not-sufficient at EP=8 was
partly true at --num-iters 1 but the residual was the bucket-cap
mismatch, not an unfixable deeper issue). The final state captured
here supersedes all earlier revisions in this same audit doc.

Neither fix is upstream as of 2026-04-28 -- held pending flashinfer-ai#3067
(fix-cutedsl-moe-gemm2-tactic-enum-tile256) and flashinfer-ai#3198
(get_max_num_tiles off-by-one) merging first. Thin-adapter refactor
remains the recommended medium-term direction; both short-term fixes
patch symptoms in code the refactor would delete.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 29, 2026
…, mark superseded sub-sections

Lee asked for a thorough audit-of-the-audit pass: look for inconsistencies,
discrepancies, redundancies, and opportunities for compaction/simplification.
After reading the full 3952-line doc, the following systematic edits were
applied:

Top-of-doc:

- Added a "Final state (2026-04-29) — read this first" section at the very
  top (immediately after the title, before the chronological correction
  log). Captures: static audit summary, both upstream PRs (flashinfer-ai#3067, flashinfer-ai#3198),
  the two paired wrapper bugs and their side branches, the comprehensive
  30-cell validation matrix, the three-regime classification (large-N
  closure / medium-N wrapper-overhead-residual / small-N noise), the EP=1
  + version-skew explanation, and the closure of the empirical thread.
  This section is now authoritative; the rest of the doc is the
  chronological investigation record leading to it.

- Updated the title's "last updated" tag to 2026-04-29.

Stale references fixed:

- The flashinfer-ai#3067 fix branch was referred to via the stale commit hashes
  d291d17e and e2330e73/f0cf8cd0 in three places; updated to the current
  57de5f8 + 38293a0 (with notes that the original commit-hash references
  are preserved for traceability).
- The "Tunable runner" row in the Summary table marked tile_size=256 as
  "gated off"; updated to "fixed via PR flashinfer-ai#3067 (pending review)".
- Item 3 of the corrections list: "Recommended fix (drafted, not yet
  applied)" → "Fix landed and shipped as upstream PR flashinfer-ai#3067".

Duplicate content removed:

- Two "What still stands (unaffected by the autotune-context bug)"
  blocks were nearly identical; the first was removed and replaced
  with a one-line pointer to the consolidated version below.

Executive verdict updated:

- Rewritten to reflect current state: now references both PRs, the
  paired-fix story, the comprehensive 30-cell sweep matrix, and the
  +13.7% kernel-level closure. Previously it described "Leading
  hypothesis (NOT YET EMPIRICALLY VERIFIED)" framings that have since
  been confirmed and shipped.

Superseded sub-sections marked:

- "EP=8 vs EP=16: the fix has different impact at each" — was an
  early "necessary but not sufficient" framing under prealloc-only
  state; replaced with a SUPERSEDED note pointing to the paired-fix
  story below (the bucket-cap-fix discovery is what closed both EPs).
- "EP=16 confirmation (2026-04-28)" header clarified to indicate
  it's a single-iter prealloc-only snapshot; the H_A/H_B
  hypothesis-discrimination logic and trt 0.08% margin observation
  are still load-bearing and preserved.
- "Final cross-N validation (2026-04-28)" renamed to "Initial 6-cell
  cross-N spot-check" with a pointer to the comprehensive 30-cell
  sweep that supersedes it.
- Status sub-section: removed outdated EP=8 partial-closure / EP=16
  full-closure bullets (those were prealloc-only state); replaced
  with concise paired-fix branch summary and pointer to comprehensive
  sweep for per-cell results.
- Implications for the audit's earlier claims: updated the
  "dominant factor is the prealloc bias" bullet to the paired-fix
  framing (prealloc + bucket-cap, both required).

Follow-ups updated:

- Open follow-ups remaining: appended a "Update 2026-04-29 (final):
  flashinfer-ai#9 also closed" note documenting that the EP=8/EP=16 perf gap was
  root-caused to two paired wrapper bugs, with the empirical
  investigation thread now closed.
- Item flashinfer-ai#9 in the chronological "Follow-ups queued for later sessions"
  list: added a RESOLVED 2026-04-29 header with pointer to the
  paired-fix story, marking the 9a / 9b / 9c sub-mechanism candidates
  as pre-resolution candidate hypotheses (preserved as record of the
  search space, not blocking work).

What was preserved as historical record (not removed):

- Full "2026-04-24/25/28 correction" chronological log
- Per-component static audit detail
- Issue flashinfer-ai#3067 narrative including the original "Leading hypothesis"
  framing and the post-fix verification table
- The 22-day candidate-root-cause investigation that ruled out 5
  mechanisms before landing on the paired wrapper bugs
- Part 2 baseline numbers (with retraction annotations preserved
  as-is)
- Probes, MoE kernel coverage, reproduction runbook

Net effect: doc is more navigable, internally consistent, and the
authoritative summary is at the top. Body content preserves the
investigative reasoning. Diff: +241 / -217 lines, net +24 lines
(intentional — the new Final state section adds explicit
authoritative content).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 30, 2026
…g→functional-API migration (NOT committed)

Captures the analysis surfaced 2026-04-30 around an alternative
to the medium-term thin-adapter refactor: migrate SGLang from
`CuteDslMoEWrapper.run(...)` to `cute_dsl_fused_moe_nvfp4(...)`
directly, then deprecate the wrapper. Lee maintains both fi and
SGLang sides, so the cross-project coordination cost normally
faced by such a migration is essentially zero — but the question
of whether to pursue is not decided.

Heavily caveated as a future consideration that **may not be
pursued**. Priority is explicitly lower than the existing PR queue
(PR flashinfer-ai#3171, flashinfer-ai#3198, bucket-cap-fix, prealloc-fix, convergence) and
Phase 1 in follow-up flashinfer-ai#11.

Captured for two reasons:

1. The 5-minute pass on SGLang's `flashinfer_cutedsl.py:ensure_cutedsl_wrapper()`
   surfaces specific gaps in the functional API that would need to
   be closed before any migration. Documenting them here means
   future-Lee (or future-someone) doesn't have to re-derive them.

2. The thin-adapter refactor and the deprecate-the-wrapper migration
   are alternatives, not sequential. If migration is pursued, the
   thin-adapter is redundant. If it's not, the thin-adapter is the
   recommended cleanup. The relationship needed to be on the record.

Captured fi-side prerequisites (if pursued):

- Add `moe_sort_buffers`, `gemm1_out`, `gemm1_out_scale` parameters
  to `cute_dsl_fused_moe_nvfp4`. Forward to `_moe_core_impl` (which
  already accepts them). ~15 lines, additive (existing wrapper
  keeps working).
- Verify `CuteDslFusedMoENvfp4Runner` per-call construction doesn't
  bust the autotune cache or add overhead.
- CUDA-graph capture/replay validation of `cute_dsl_fused_moe_nvfp4`
  under SGLang-style preallocation.

Captured benefits / risks / decision criteria. Status: NOT decided;
revisit only after the existing PR queue settles.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nv-yunzheq nv-yunzheq added the "v0.6.11 release blocker label for 0.6.11" label Apr 30, 2026
@leejnau leejnau marked this pull request as ready for review April 30, 2026 19:01
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@flashinfer/fused_moe/cute_dsl/tuner.py`:
- Around line 186-188: Replace the Unicode multiplication sign "×" with a plain
ASCII "x" in the comment lines that read "# tile_size=128: 2 GEMM1 tactics × 4
GEMM2 tactics = 8" and "# tile_size=256: 2 GEMM1 tactics × 4 GEMM2 tactics = 8"
(the comment block that also notes "Total: 16 tactics"); edit those exact
comment strings so they use "x" (e.g., "2 GEMM1 tactics x 4 GEMM2 tactics") to
satisfy Ruff RUF003 while keeping the meaning unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c899f78f-8552-4de8-9897-5d3c33d45cf0

📥 Commits

Reviewing files that changed from the base of the PR and between 4e64219 and ee7ae1f40503288825eba9f00fd735eccf9ae7af.

📒 Files selected for processing (3)
  • flashinfer/fused_moe/cute_dsl/moe_utils.py
  • flashinfer/fused_moe/cute_dsl/tuner.py
  • tests/moe/test_cute_dsl_fused_moe.py

Comment on lines 186 to +188
# tile_size=128: 2 GEMM1 tactics × 4 GEMM2 tactics = 8
# Total: 8 tactics
# tile_size=256: 2 GEMM1 tactics × 4 GEMM2 tactics = 8
# Total: 16 tactics

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Replace the Unicode multiplication sign in this comment.

Ruff flags × here as ambiguous (RUF003). Swapping it to plain x keeps the file lint-clean.

Suggested edit
-# tile_size=128: 2 GEMM1 tactics × 4 GEMM2 tactics = 8
-# tile_size=256: 2 GEMM1 tactics × 4 GEMM2 tactics = 8
+# tile_size=128: 2 GEMM1 tactics x 4 GEMM2 tactics = 8
+# tile_size=256: 2 GEMM1 tactics x 4 GEMM2 tactics = 8
🧰 Tools
🪛 Ruff (0.15.12)

[warning] 186-186: Comment contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF003)


[warning] 187-187: Comment contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF003)


@leejnau leejnau force-pushed the fix-cute_dsl-moe-get-max-num-tiles-off-by-one branch from ee7ae1f to 7492ff1 on May 1, 2026 00:37
leejnau added a commit to leejnau/flashinfer that referenced this pull request May 1, 2026
…r-ai#3198 rebased onto main

PR flashinfer-ai#3171 (`fix(cute_dsl/moe): correct tile_size=256 gemm2 tactic
enumeration`) merged into `flashinfer-ai/flashinfer:main`
2026-04-30 as squash-merge commit `070fabf0`.

PR flashinfer-ai#3198 (`fix(cute_dsl/moe): correct off-by-one in get_max_num_tiles
to match TRT-LLM`) was originally stacked on PR flashinfer-ai#3171 (3 commits
on top of `ede2225a`). After flashinfer-ai#3171 merged as a squash-merge, PR
flashinfer-ai#3198 had a test-file conflict against main because git couldn't
reconcile the unsquashed pre-merge commits (`e2330e73` +
`f0cf8cd0`) with the squashed `070fabf0` on main. Rebased PR flashinfer-ai#3198
onto upstream main directly via
`git rebase --onto FETCH_HEAD f0cf8cd0` — drops the now-redundant
PR flashinfer-ai#3171 commits and replays just the off-by-one fix. New HEAD:
`7492ff11`. Single-commit clean branch off main, force-pushed to
fork. PR is now mergeable on GitHub (status went from CONFLICTING
to MERGEABLE / BLOCKED-on-checks).

Audit-doc updates:

- Final state (lines 10-11): PR flashinfer-ai#3171 marked MERGED; PR flashinfer-ai#3198
  reframed as rebased onto main, single clean commit.
- Per-component table (line 451): PR flashinfer-ai#3171 status flipped from
  "pending review" to "MERGED 2026-04-30 as `070fabf0`".
- Issue flashinfer-ai#3067 narrative (line 962): status updated to merged.
- Off-by-one PR section (line 1651): describes the rebase.
- Wrapper-fix gating note (line 1306): added 2026-04-30 update
  that the flashinfer-ai#3171 gating condition is removed.

Memory entries updated separately (closed-audit, prealloc-bias,
PR flashinfer-ai#3198 dedicated, MEMORY.md hook for flashinfer-ai#3198).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…RT-LLM

The previous formula in flashinfer/fused_moe/cute_dsl/moe_utils.py
computed max_num_tiles as L + ceil((E - L) / T), where L =
num_local_experts, T = tile_size, and E = num_tokens * top_k. This
over-allocated by 1 tile whenever (E - L) was not divisible by T.

The moe_sort routing kernel
(include/flashinfer/trtllm/fused_moe/RoutingKernel.cuh) writes exactly
sum_e ceil(K_e / T) entries to tile_idx_to_expert_idx and
tile_idx_to_mn_limit, where K_e is the per-local-expert token count
and sum_e K_e = E. The tight worst-case upper bound on this sum is
(L - 1) + ceil((E - L + 1) / T), achieved when one expert receives
(E - L + 1) tokens and the others each receive 1. By the integer
identity ceil((X+1)/T) = floor(X/T) + 1 (for non-negative X), this
simplifies to L + floor((E - L) / T) = (E + (T - 1) * L) // T, which
is the compact form used in TRT-LLM's
tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py. fi's old
formula exceeded this by 1 whenever (E - L) % T != 0.

The over-allocation never caused correctness issues — the routing
kernel populates a runtime num_non_exiting_tiles count, and downstream
gemm1/gemm2 kernels strictly bound their reads by it (see
blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py and
blockscaled_contiguous_grouped_gemm_finalize_fusion.py). But it
wasted ~1 tile of buffer space per call and produced shape divergence
vs TRT-LLM at the same configuration.

Concrete examples (DeepSeek-V3, top_k=8, num_tokens=16384):

  num_local_experts=32 tile_size=128: 1056 -> 1055
  num_local_experts=32 tile_size=256:  544 ->  543
  num_local_experts=16 tile_size=128: 1040 -> 1039
  num_local_experts=16 tile_size=256:  528 ->  527
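
As a sanity check of the tightness claim above, the closed form can be verified by brute force on small shapes. This is an illustrative sketch only; it re-derives the bound locally rather than calling the flashinfer helper:

```python
# Verify that (E + (T - 1) * L) // T equals the maximum of sum_e ceil(K_e / T)
# over every way of splitting E expanded tokens across L local experts,
# for E > L (the E <= L case is handled separately in the real code).
from math import ceil

def closed_form(E: int, L: int, T: int) -> int:
    return (E + (T - 1) * L) // T

def brute_force_max_tiles(E: int, L: int, T: int) -> int:
    best = 0
    def rec(remaining: int, experts_left: int, acc: int) -> None:
        nonlocal best
        if experts_left == 1:
            best = max(best, acc + ceil(remaining / T))
            return
        for k in range(remaining + 1):
            rec(remaining - k, experts_left - 1, acc + ceil(k / T))
    rec(E, L, 0)
    return best

for E in range(2, 13):
    for L in range(1, min(E, 5)):      # keep E > L
        for T in (2, 3, 4):
            assert brute_force_max_tiles(E, L, T) == closed_form(E, L, T), (E, L, T)
print("closed form matches the brute-force worst case on all small shapes")
```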

## Tests

Adds two no-GPU test classes in tests/moe/test_cute_dsl_fused_moe.py
covering DeepSeek-V3 production shapes at every supported EP
partition (EP=1, 8, 16, 32) × every tile_size (128, 256), generic
MoE shapes representing other model families (Mixtral / DBRX /
mid-large / high-top_k), and edge cases (boundary values, divisibility).

TestGetMaxNumTiles (5 methods):
  test_matches_trtllm_compact_formula (24 parametrized cases): pin
    fi to TRT-LLM's compact closed-form across the full DeepSeek-V3
    EP × tile_size matrix, generic MoE shapes, and edge cases.
  test_worst_case_construction_is_tight (24 parametrized cases):
    build the worst-case routing distribution (one expert gets
    E - L + 1 tokens, the others each get 1) and verify fi's bound
    equals the actual sum_e ceil(K_e / T).
  test_zero_tokens, test_below_or_at_local_experts_threshold:
    early-return-path edge cases.
  test_monotonic_in_num_tokens (11 parametrized cases): sweep
    monotonicity check across the same EP × tile_size matrix and
    generic-model shapes; catches sign errors and off-by-one in
    future refactors.

TestGetMaxNumPermutedTokens (1 method, 24 parametrized cases):
  test_consistent_with_get_max_num_tiles: result equals
    get_max_num_tiles * tile_size exactly, across the same shape
    coverage.

Generic-model shapes parametrized:
  - num_experts=8, top_k=2 (Mixtral-8x7B family) at EP=1
  - num_experts=16, top_k=4 (DBRX family) at EP=1, EP=4
  - num_experts=64, top_k=6 at EP=1, EP=8
  - num_experts=32 (post-EP), top_k=16 at EP=4 (high-top_k regime)

Edge cases parametrized:
  - num_expanded == num_local_experts (early-return boundary)
  - num_expanded just past num_local_experts
  - (E - L) exactly divisible by tile_size
  - (E - L) one short of divisibility
  - num_tokens=1, top_k=1 (smallest meaningful inputs)
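
A minimal pytest-style sketch of the worst-case construction described above (it re-implements the formula locally; the helper name, signature, and parametrization are illustrative, not the actual test code):

```python
import math
import pytest

def max_num_tiles(num_expanded_tokens: int, num_local_experts: int, tile_size: int) -> int:
    # Piecewise bound: for E <= L the tight bound is simply E;
    # beyond that, TRT-LLM's compact closed form applies.
    if num_expanded_tokens <= num_local_experts:
        return num_expanded_tokens
    return (num_expanded_tokens + (tile_size - 1) * num_local_experts) // tile_size

@pytest.mark.parametrize("num_tokens,top_k,num_local_experts,tile_size",
                         [(16384, 8, 32, 128), (16384, 8, 32, 256),
                          (16384, 8, 16, 128), (16384, 8, 16, 256)])
def test_worst_case_distribution_is_tight(num_tokens, top_k, num_local_experts, tile_size):
    E = num_tokens * top_k
    # Worst case: one expert receives E - L + 1 tokens, the other L - 1 get 1 each.
    counts = [E - num_local_experts + 1] + [1] * (num_local_experts - 1)
    actual_tiles = sum(math.ceil(k / tile_size) for k in counts)
    assert actual_tiles == max_num_tiles(E, num_local_experts, tile_size)
```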

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@leejnau leejnau force-pushed the fix-cute_dsl-moe-get-max-num-tiles-off-by-one branch from 7492ff1 to 557ad5a on May 1, 2026 16:18
Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (3)
tests/moe/test_cute_dsl_fused_moe.py (2)

473-523: ⚡ Quick win

Make the parametrization tables immutable.

These shape matrices are class-level lists, so Ruff will flag them with RUF012 as mutable class attributes. Since they're constants used only for parametrization, tuples avoid the lint noise and accidental mutation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/moe/test_cute_dsl_fused_moe.py` around lines 473 - 523, The three
class-level mutable lists _DEEPSEEK_V3_SHAPES, _GENERIC_MOE_SHAPES, and
_EDGE_CASE_SHAPES should be made immutable to avoid RUF012; replace each list
literal with a tuple literal containing the pytest.param entries (i.e., change
[...] to (...)) so the parametrization constants cannot be mutated while
preserving their contents and ids.

471-471: ⚡ Quick win

Replace the Unicode multiplication sign in test prose.

Ruff flags × here (RUF002/RUF003). Using plain x keeps the file lint-clean.

Also applies to: 541-541, 627-627

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/moe/test_cute_dsl_fused_moe.py` at line 471, Replace the Unicode
multiplication sign "×" with a plain ASCII "x" in the test prose comments to
satisfy Ruff (RUF002/RUF003); locate the occurrences such as the comment line "#
num_tokens=16384 (prefill) — full EP × tile_size matrix." (and the similar
occurrences noted later) and update them to "# num_tokens=16384 (prefill) — full
EP x tile_size matrix." so only the character changes and no other text or
formatting is modified.
flashinfer/fused_moe/cute_dsl/moe_utils.py (1)

48-59: ⚡ Quick win

Document the piecewise bound explicitly.

The docstring reads as if (E + (T - 1) * L) // T is the tight bound for all inputs, but the implementation still needs the num_expanded_tokens <= num_local_experts branch. For E <= L, the compact form overestimates (for example E=16, L=32, T=128 gives 31 instead of 16), so it would help to call out that the closed form only applies once E > L.

📝 Suggested doc tweak
-    Mirrors TRT-LLM's ``GroupedGemmInputsHelper.get_max_num_tiles()`` in
-    ``tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py``. The compact
-    closed-form expression is the tight worst-case bound on
+    Mirrors TRT-LLM's ``GroupedGemmInputsHelper.get_max_num_tiles()`` in
+    ``tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py``. For ``E <= L``,
+    the tight bound is simply ``E``. Once ``E > L``, the compact
+    closed-form expression is the tight worst-case bound on

Also applies to: 75-76

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/fused_moe/cute_dsl/moe_utils.py` around lines 48 - 59, Update the
docstring for the max-tiles calculation (the function mirroring TRT-LLM's
get_max_num_tiles) to explicitly state the piecewise nature: when
num_expanded_tokens (E) <= num_local_experts (L) the tight bound is E (i.e., no
extra padding), and only when E > L does the compact closed-form (E + (T - 1) *
L) // T apply; mention the existing branch that checks num_expanded_tokens <=
num_local_experts and add the same clarification for the duplicate text near
lines 75-76 so readers know why the implementation keeps that branch.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f974c850-dc14-4cad-a91d-ca022431bd8e

📥 Commits

Reviewing files that changed from the base of the PR and between 7492ff11d4ec855a235d8bfcef092c1a6ac2a8da and 557ad5a.

📒 Files selected for processing (2)
  • flashinfer/fused_moe/cute_dsl/moe_utils.py
  • tests/moe/test_cute_dsl_fused_moe.py

@nv-yunzheq
Collaborator

/bot run

@flashinfer-bot
Collaborator

GitLab MR !621 has been created, and the CI pipeline #50037953 is currently running. I'll report back once the pipeline job completes.

Collaborator

@nv-yunzheq nv-yunzheq left a comment


Approved as the unit test passed on GB300

@nv-yunzheq nv-yunzheq merged commit 393e83e into flashinfer-ai:main May 4, 2026
61 of 68 checks passed
@leejnau leejnau deleted the fix-cute_dsl-moe-get-max-num-tiles-off-by-one branch May 4, 2026 19:32
leejnau added a commit to leejnau/flashinfer that referenced this pull request May 4, 2026
…erged 2026-05-04

PR flashinfer-ai#3198 merged 2026-05-04 18:58 UTC as squash-merge commit `393e83ea`.
fi's `get_max_num_tiles` formula now matches TRT-LLM's compact closed-
form, eliminating the +1 tile over-allocation when (E - L) % T != 0.

Updates the line-11 final-state entry and the line-4369 PR queue
annotation in follow-up flashinfer-ai#12.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
aleozlx pushed a commit that referenced this pull request May 6, 2026
…me input (#3216)


## 📌 Description

The autotuner's `DynamicTensorSpec` in
`flashinfer/fused_moe/cute_dsl/tuner.py` declared `gen_tuning_buckets`
as the pre-computed tuple `get_hybrid_num_tokens_buckets(8192)` and
`map_to_tuning_buckets` as `lambda x: map_to_hybrid_bucket(x,8192)`. The
hardcoded 8192 cap silently clamped any runtime workload larger than
that to the 8192-bucket's cached tactic — at DeepSeek-V3 prefill
(N=16384) fi profiled at half the per-expert workload and used a tactic
optimized for the wrong shape.

This PR replaces the pre-computed tuple with the bare callable form
(`get_hybrid_num_tokens_buckets`) and switches the mapper to the
uncapped variant `map_to_hybrid_bucket_uncapped` (added alongside the
hybrid-bucket scheme for exactly this case). The autotuner now invokes
them with the actual input dim at autotune time, matching TRT-LLM's
pattern at `cute_dsl_custom_ops.py:2390-2391` and flashinfer's own
pattern at `gemm/gemm_base.py:_FP8_GEMM_SM100_TUNING_CONFIG`.
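
A self-contained toy illustration of the capping behavior (these helpers only mimic power-of-2 bucketing; they are not the real `get_hybrid_num_tokens_buckets` / `map_to_hybrid_bucket_uncapped` implementations):

```python
def next_pow2(n: int) -> int:
    b = 1
    while b < n:
        b *= 2
    return b

def map_capped(n: int, cap: int = 8192) -> int:
    # old behaviour: anything above the cap collapses onto the cap bucket
    return min(next_pow2(n), cap)

def map_uncapped(n: int) -> int:
    # new behaviour: the bucket tracks the actual input size
    return next_pow2(n)

print(map_capped(16384))    # 8192  -> profiled at half the real workload
print(map_uncapped(16384))  # 16384 -> profiled at the real shape
```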

## 🔍 Related Issues

#3171
#3198
#3115

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes



## Summary by CodeRabbit

* **Bug Fixes**
  * MoE autotuner now uses uncapped dynamic hybrid bucket mapping instead of a fixed-bounded set, improving adaptation to varying input token sizes.

* **Tests**
  * Added offline tests validating autotuner bucket configuration: dynamic bucket generation, responsiveness to input size, monotonic mapping behavior, large-input scaling, and alignment with expected power-of-2 bucket values.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Labels

op: moe, run-ci, v0.6.11 release blocker label for 0.6.11
