Update wrapper to remove device arg #9
Merged
MasterJH5574 merged 1 commit into main on Oct 30, 2023
Conversation
leejnau added a commit to leejnau/flashinfer that referenced this pull request on Apr 28, 2026
…w follow-up flashinfer-ai#9

Bench now supports --ep N for single-GPU EP simulation (commit da2f0e3).
Full sweep at EP=1 / 8 / 16 across all 15 token counts produced 45
(size, EP) data points — all pass parity with max_abs ≤ 0.0625. EP=1
results reproduce the audit's headline post-fix verification (+4.1% at
N=16384, vs +3.5% in the 2026-04-25 table — within run-to-run noise).
EP=8 and EP=16 add new coverage of the deployment-realistic configs
that DeepSeek-V3 actually runs under.

Audit changes:

- New "EP=8 / EP=16 single-GPU sweep (2026-04-28)" section immediately
  after the ground-truth nsys verification. Documents the bench
  plumbing changes (Mapping construction with tp_size=ep,
  moe_tp_size=1, moe_ep_size=ep, plus the dual-binding monkey-patch on
  can_access_peer needed because plugin.py's imported binding is
  unaffected by patching _ipc_utils alone — see the sketch after this
  message). Includes the full 45-cell (size, EP) Δ% table.
- Three observations from the data:
  (1) Small-batch Δ% explodes at EP>1 due to a fixed-overhead-fraction
      effect — not actionable.
  (2) Large-batch (N=16384) Δ% stays modest across EP values (+4.1,
      +7.3, +7.9% at EP=1, 8, 16).
  (3) The per-kernel gap widens in flashinfer's disfavor at smaller
      per-rank expert count: the gemm1+gemm2 sum at N=16384 goes -2.4%
      (fi faster, EP=1) → +8.6% (EP=8) → +13.7% (EP=16). Both kernels'
      compiled PTX is byte-identical between sides (proven 2026-04-24),
      so this is tactic selection or wrapper overhead, not
      kernel-binary divergence.
- Follow-up flashinfer-ai#7 marked RESOLVED.
- New follow-up flashinfer-ai#9 added: "Investigate why flashinfer's
  per-kernel times grow faster than TRT-LLM's at smaller per-rank
  expert count (EP>1)." Three plausible mechanisms with concrete probe
  steps. Not blocking — EP=1 port-parity remains the load-bearing
  finding.

Top-of-file correction title and open-follow-ups summary updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
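A minimal sketch of the bench plumbing described above — the Mapping construction and the dual-binding monkey-patch. The module paths (`tensorrt_llm.mapping`, `tensorrt_llm._ipc_utils`, `tensorrt_llm.plugin.plugin`) and the exact Mapping keyword set are assumptions based on TRT-LLM's usual layout, not confirmed by the commit:

```python
from tensorrt_llm.mapping import Mapping  # assumed import path


def build_ep_mapping(world_size: int, ep: int) -> Mapping:
    # Mapping construction per the commit message: tp_size=ep,
    # moe_tp_size=1, moe_ep_size=ep (single-GPU EP simulation).
    return Mapping(world_size=world_size, tp_size=ep,
                   moe_tp_size=1, moe_ep_size=ep)


def patch_can_access_peer(value: bool = True) -> None:
    """Dual-binding patch for single-GPU EP simulation.

    plugin.py does `from ... import can_access_peer` at import time, so
    patching _ipc_utils alone leaves plugin.py's module-level name
    pointing at the original function. Both bindings must be rebound.
    """
    import tensorrt_llm._ipc_utils as ipc_utils   # assumed path
    import tensorrt_llm.plugin.plugin as plugin   # assumed path

    fake = lambda *args, **kwargs: value
    ipc_utils.can_access_peer = fake  # covers future attribute lookups
    plugin.can_access_peer = fake     # covers the already-imported binding
```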
leejnau added a commit to leejnau/flashinfer that referenced this pull request on Apr 28, 2026
…indings

Captured both autotuners' selected tactics at N=16384 across EP=1, 8,
16 via TLLM_LOG_LEVEL=DEBUG (TRT-LLM side) plus the bench's own
_audit_selected_tactics helper (flashinfer side; see the sketch after
this message). Confirmed the 9a-class divergence at EP=16, falsified
the simple-fix version.

Key findings:

- EP=1: fi and trt both pick tile_size=256 / 2-CTA at the largest shape
  bucket. gemm1 tactics agree exactly; gemm2 tactics differ slightly
  but both are 2-CTA. Consistent with the observed -2.4% (fi faster)
  kernel-level Δ%.
- EP=16: clear tactic divergence. Flashinfer picks tile_size=128 for
  ALL 14 cache entries (no tile_size=256 at any shape). TRT-LLM picks
  tile_size=256 ((256, 256), (2, 1), False) at the largest shape
  bucket. Different MMA tile / cluster / throughput — exactly explains
  the +13.7% kernel-level gap.
- EP=8: not pure tactic divergence. fi and trt pick essentially
  identical tactics (tile_size=256, gemm1=(256,256) 2-CTA,
  gemm2=(256,256) 2-CTA) at the largest shape. Yet kernel-level Δ% is
  +8.6% — it must come from another source (shape-bucket rounding,
  per-call overhead, or measurement noise).

The EP=16 finding does NOT have a one-line fix like flashinfer-ai#3067
did. Both sides have equivalent tactic enumerations (post-fix); both
autotuners profile them; they just rank the profile-time outcomes
differently. Closing this needs deeper engineering — verify
profile-methodology equivalence (9a-i), then the shape-bucket rounding
diff (9a-ii), then deeper investigation if those hold.

The flashinfer-ai#9 entry is expanded with three sub-mechanisms (9a-i,
9a-ii, 9a-iii) that decompose the original 9a hypothesis, plus 9b
(per-call overhead at smaller per-rank work) and 9c (load-balancing).
The recommendation has been refined: not blocking — real, but
production-relevance vs effort makes it a "future work" item.

Closes the EP investigation for this audit cycle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
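The commit names `_audit_selected_tactics` but does not show its internals, so here is a hypothetical sketch of the pattern: walk the autotune cache and print one line per entry so the two sides' tile-size choices can be diffed. The `{shape_key: tactic}` cache layout is an assumption:

```python
def audit_selected_tactics(cache: dict, side: str) -> None:
    """Print one line per cached autotune decision, e.g.
        [fi] (num_tokens=16384, ...): tile_size=256
    Diffing fi vs trt output makes divergences like the EP=16
    all-tile=128 pattern immediately visible."""
    for shape_key, tactic in sorted(cache.items(), key=repr):
        # Fall back to the raw tactic repr if it has no tile_size field.
        tile = getattr(tactic, "tile_size", tactic)
        print(f"[{side}] {shape_key}: tile_size={tile}")
```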
leejnau added a commit to leejnau/flashinfer that referenced this pull request on Apr 28, 2026
…red-and-skipped
After 2026-04-24 / 25 / 28 we have five independent fi-vs-trt
agreement proofs:
1. source byte-identical with rc5.post2 (deep audit)
2. compiled PTX byte-identical at tile_size=256 (md5 401ebca6...)
3. per-call timing match within 0.1% at apples-to-apples tactic
4. 45 (size, EP) parity cells all pass within 0.5 FP4 step
5. nsys ground-truth verification with kernel mangled-name
structure agreement
The probability of fi+trt being wrong-but-agreeing across all five
is essentially zero, so a PyTorch FP4 third-reference check no
longer provides meaningful incremental confidence.
Additionally, `compute_reference_moe_fp4` has known limitations:
its PyTorch-eager FP4 simulation is stricter than the kernel's
actual FP4 representation, which made it ambiguous to interpret
during the original flashinfer-ai#3067 framing. Disagreement between bench
output and the reference would not unambiguously indicate a kernel
bug.
Cost is also non-trivial: Python-eager per-token / per-expert
loops would require running at a small problem-size subset to
keep wall-clock bearable (see the sketch after this message).
Cost-to-incremental-confidence ratio is bad enough that this
follow-up is consciously skipped, not deferred. Future evidence
of a fi-vs-trt agreement that's actually wrong would re-elevate
it; otherwise no value.
Effective remaining open follow-ups: flashinfer-ai#3 (production-convention
scaling), flashinfer-ai#5 (cutlass-dsl 4.4.2 sanity rerun), and
flashinfer-ai#9 (EP=16 tactic-divergence root cause).
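A hypothetical illustration (not `compute_reference_moe_fp4` itself, whose internals the commit does not show) of why a Python-eager per-token / per-expert MoE reference is wall-clock expensive: the routing loop runs in Python with one small matmul per (token, expert) pair, plus a host sync per index read. The activation is simplified to plain SiLU; shapes and names are assumptions:

```python
import torch


def eager_moe_reference(x, w1_per_expert, w2_per_expert,
                        topk_ids, topk_weights):
    """x: [T, hidden]; w1: [E, hidden, inter]; w2: [E, inter, hidden].
    Every iteration launches tiny GPU work and syncs on int() —
    exactly the cost profile the commit calls 'non-trivial'."""
    out = torch.zeros_like(x)
    for t in range(x.shape[0]):              # per token
        for k in range(topk_ids.shape[1]):   # per selected expert
            e = int(topk_ids[t, k])          # host sync per pair
            h = torch.nn.functional.silu(x[t] @ w1_per_expert[e])
            out[t] += topk_weights[t, k] * (h @ w2_per_expert[e])
    return out
```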
leejnau added a commit to leejnau/flashinfer that referenced this pull request on Apr 28, 2026
…d-and-skipped

flashinfer-ai#3 is a real coverage gap (at gs=1.0 the scale-conversion
code paths run with degenerate values, so a divergent FP8_MAX or
scale-convention mismatch between the two sides would silently produce
identical output here and divergent output at non-trivial scales).
Skipped because:

- Both sides have their own internal scale-plumbing tests with
  non-trivial scales. TRT-LLM's
  tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py and
  flashinfer's tests/moe/test_cute_dsl_fused_moe.py both exercise
  non-trivial-gs configurations against PyTorch references. The bench
  wouldn't add coverage their CI doesn't already have.
- A failure here wouldn't tell us what we want to know. The most likely
  cause of fi-vs-trt parity divergence at production scales would be a
  scale-convention disagreement BETWEEN the two implementations — not
  fixable in flashinfer (TRT-LLM is upstream), and dramatically out of
  scope.
- Risk of getting nerd-sniped on bench-side mistakes. Scale plumbing
  has many surfaces (alpha, weight_scale_2, input_scale, is_sf_* flags,
  fp4_quantize signatures); any tiny bench-side mismatch produces a
  parity failure that looks like a port bug. That's the same failure
  pattern as flashinfer-ai#4 / flashinfer-ai#6 / flashinfer-ai#8 from
  earlier in the audit.

Scope-difference note: the audit's load-bearing question is "is the
kernel port faithful?" — closed via byte-identical PTX + matching
timing + 45 parity cells. flashinfer-ai#3 is about "do flashinfer and
TRT-LLM agree on NVFP4 scaling conventions?" — a separate question,
real but tangential to port-faithfulness.

One scenario that would re-elevate this: a planned production ship of
CuteDslMoEWrapper to a caller using non-trivial scales, where the team
wants one independent cross-check (against CuteDslFusedMoE under the
same scales) before merging. In that scenario flashinfer-ai#3 is
exactly the right pre-ship sanity check. For closing out the
investigation audit, it's tangential.

Effective remaining open follow-ups: flashinfer-ai#5 (cutlass-dsl 4.4.2
sanity rerun) and flashinfer-ai#9 (EP=16 tactic-divergence root cause).
leejnau added a commit to leejnau/flashinfer that referenced this pull request on Apr 28, 2026
flashinfer-ai#9

Earlier drafts of the EP=8/16 sweep section and the follow-up
flashinfer-ai#9 entry characterized the "main-pass total minus
mapped-kernel sum" residual as "CPU/orchestration overhead" or "Python
wrapper overhead per gemm call." That framing is structurally wrong
under CUDA graphs.

The bench captures the full forward pass into a CUDA graph during
warmup, then replays via cudaGraphLaunch. During replay no Python runs
between kernels — Python overhead is paid once at graph capture time
and amortized over thousands of replay iterations. The CUDA-event
main-pass measurements are GPU stream-time, not CPU time (see the
sketch after this message). The actual residual is GPU-side: small
unmapped kernels (e.g., moe_sort_py_fills, output_bf16_zero,
FillFunctor variants), inter-kernel GPU-idle gaps, and
stream-coordination effects.

Corrections:

- "Three observations" finding 1 (small-batch Δ% explosion): reframed
  from "fixed CPU overhead" to "fixed-cost work in the captured graph",
  with an explicit note that this is GPU stream-time under CUDA graphs.
- "Three observations" finding 4 (between-kernel time): retitled from
  "CPU/orchestration overhead" to "GPU-side between-kernel time", with
  an explicit note that this comprises small unmapped kernels and
  GPU-side gaps, not CPU/Python overhead.
- Follow-up flashinfer-ai#9 entry: 9b ("Per-call wrapper overhead in
  flashinfer scales with call rate") marked DROPPED 2026-04-28 with
  reasoning. Plausible remaining mechanisms for the EP=8 +8.6% gap at
  agreed tactics narrowed to GPU-side inter-kernel gap differences or
  measurement noise.

Surfaced by user inspection of CUDA-events semantics (CUDA events
measure stream-time, not CPU time, especially under graph replay). The
earlier loose framing was a real-but-non-load-bearing error — the
load-bearing audit findings (port faithfulness, parity, post-fix
verification) are unaffected because they don't depend on attributing
the residual to CPU vs GPU.
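A minimal sketch of the measurement structure this correction describes — capture once, replay many times, bracket with CUDA events. The bench's actual harness differs; `forward` is a stand-in:

```python
import torch


def time_graph_replay(forward, iters: int = 1000) -> float:
    """Return mean GPU stream-time (ms) per CUDA-graph replay.

    Python runs inside `forward` exactly once, at capture time; each
    replay is a single cudaGraphLaunch with no Python between kernels.
    CUDA events record stream-time, so any residual between the
    full-pass total and the mapped-kernel sum is GPU-side (small
    unmapped kernels, inter-kernel gaps), not CPU overhead.
    """
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        forward()                      # warmup on a side stream before capture
    torch.cuda.current_stream().wait_stream(side)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        forward()                      # captured, not executed eagerly

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        graph.replay()                 # GPU work only
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters
```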
leejnau added a commit to leejnau/flashinfer that referenced this pull request on Apr 28, 2026
…flashinfer-ai#9 remains

Three close-out edits to wrap the CuteDSL MoE FP4 port audit:

- Mark flashinfer-ai#5 (cutlass-dsl 4.4.2 sanity rerun)
  considered-and-skipped, matching the closure pattern used for
  flashinfer-ai#2 and flashinfer-ai#3. Three reasons: (a) port-parity
  claims are unaffected by DSL-compiler version since both sides use
  the same in-container compiler, (b) flashinfer and TRT-LLM each have
  CI testing 4.4.2 already, (c) install hassle plus unsupported-config
  risk produces the same ambiguous-failure pattern that cost time on
  flashinfer-ai#4/flashinfer-ai#6/flashinfer-ai#8. Auto-resolves
  whenever NGC bumps the image. Install recipe preserved for future
  absolute-latency probes.
- Add a "Version-skew caveat (2026-04-28)" subsection to the
  top-of-file correction. The bench compares
  flashinfer-with-post-rc5-forward-ports (bb2f88329, 6b8ae6fa8,
  fae498579) vs TRT-LLM-rc5.post2-without-them, so the +3.5% / +4.1%
  headlines at EP=1 N=16384 may partly reflect the version asymmetry.
  Load-bearing claims (port faithfulness via byte-identical source +
  PTX, 45/45 parity, the flashinfer-ai#3067 fix) are unaffected because
  they do not depend on absolute deltas. Naturally re-baselines when
  NGC publishes a 1.3.x.x image absorbing the post-rc5 commits.
- Update the "Open follow-ups remaining" summary: flashinfer-ai#5 added
  to the considered-and-skipped list alongside
  flashinfer-ai#2/flashinfer-ai#3, leaving flashinfer-ai#9 (EP=16
  tactic-divergence root cause) as the only effective open follow-up.

Audit declared closed 2026-04-28.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request on Apr 28, 2026
…realloc bias

Post-close finding (2026-04-28). The +8.6% kernel-level gap at EP=8
N=16384 — previously speculated to be "shape-bucket rounding, per-call
overhead, or measurement noise" — is dominantly a real
autotune-correctness bug in CuteDslMoEWrapper.

Root cause: the wrapper pre-allocates _gemm1_output,
_gemm1_output_scale, and
_moe_sort_buffers["out_permuted_idx_to_expanded_idx"] sized for
self.tile_size at construction. The use_prealloc gate at
fused_moe.py:495 honors the prealloc only when tactic.tile_size ==
self.tile_size. During autotune profiling, the matching tactic gets
pre-allocated buffers "for free" while the other tactic falls through
to gemm1_out=None and pays per-call dynamic allocation. The autotuner
sees the latter as artificially slower → biased tactic selection. (A
hedged sketch of the bug and fix follows this message.)

Empirical evidence (N=16384 EP=8, single-iter):

- Pre-fix (default tile_size=128 wrapper): fi picks tile=128 in all 14
  cache entries; +15.3% Δ% with the bias-flipped wrapper sized for
  tile=256.
- Post-fix (VALID_TILE_SIZES + buffers sized for max/min of the range):
  fi at the largest bucket picks tile_size=256 / 2-CTA matching trt;
  Δ% drops to +4.1%.

Architectural deviation: TRT-LLM's
Sm100BlockScaledContiguousGather*Runner allocates output buffers per
call inside forward() — no prealloc, no caching, no autotune bias.
flashinfer added the prealloc as a CUDA-graph optimization but keyed it
to a single tile_size. The fix preserves the optimization while
generalizing it across the canonical tile_size enumeration
(VALID_TILE_SIZES = (128, 256), monotonic-safe for any future
additions).

Status: wrapper-side fix prepared but NOT yet upstream — held pending
merge of flashinfer-ai#3067 (PR branch
fix-cutedsl-moe-gemm2-tactic-enum-tile256). Lands as a separate
follow-up PR.

Additional findings captured:

- A +1 tile bucketing-formula divergence between fi's and trt's
  get_max_num_tiles (small ~0.18% effect, separate from the prealloc
  bias, lives in _moe_core_impl, shared by both wrapper and functional
  APIs).
- Bench-side --dump-kernel-shapes diagnostic added for runtime
  kernel-call shape comparison; one subtlety: must NOT call .item()
  inside CUDA-graph capture (cudaErrorStreamCaptureInvalidated).
- Architectural observation: wrapper vs functional API.
  CuteDslMoEWrapper has accumulated multiple wrapper-specific bugs
  across the audit (autotune-context, prealloc bias, now-inert
  tile_size arg). cute_dsl_fused_moe_nvfp4 (functional) has not.
  Recommended medium-term cleanup: refactor the wrapper as a thin
  stateful adapter delegating to the functional API. Out of scope for
  this audit cycle.

The EP=16 +13.7% sub-question remains separately open. The prealloc
bias affects EP=16 too but plausibly does not fully account for the
gap; pending validation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
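A hedged sketch of the bias and the generalized prealloc. `VALID_TILE_SIZES`, the gate condition, and the field names quoted above come from the commit; the buffer shapes and the sizing formula are assumptions for illustration:

```python
import torch

VALID_TILE_SIZES = (128, 256)  # canonical tile_size enumeration (per commit)


class PreallocSketch:
    def __init__(self, max_num_tokens: int, inter_size: int,
                 tile_size: int = 128):
        self.tile_size = tile_size
        # Pre-fix (biased): buffers were sized for self.tile_size only,
        # and the use_prealloc gate required
        #     tactic.tile_size == self.tile_size
        # so during autotune profiling the matching tactic got these
        # buffers "for free" while the other fell through to
        # gemm1_out=None and paid per-call allocation.
        #
        # Post-fix: size once for the worst case over the enumeration.
        # Note neither extreme alone dominates — larger tiles pad more
        # rows, smaller tiles yield more tiles — hence the commit's
        # "sized for max/min of range".
        worst_rows = max(-(-max_num_tokens // t) * t
                         for t in VALID_TILE_SIZES)
        self._gemm1_output = torch.empty(
            worst_rows, inter_size, dtype=torch.bfloat16, device="cuda")

    def use_prealloc(self, tactic_tile_size: int) -> bool:
        # Pre-fix gate was `tactic_tile_size == self.tile_size` (the
        # bias); post-fix, any canonical tile_size is served from
        # prealloc, so no tactic is privileged at profile time.
        return tactic_tile_size in VALID_TILE_SIZES
```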
leejnau added a commit to leejnau/flashinfer that referenced this pull request on Apr 28, 2026
…ape comparison

Adds a `--dump-kernel-shapes` flag to bench_cute_dsl_port_parity.py
that monkey-patches the gemm1 and gemm2 kernel-call sites on both the
flashinfer side (blockscaled_contiguous_*_nvfp4 functions in
flashinfer.fused_moe.cute_dsl.fused_moe) and the TRT-LLM side
(Sm100BlockScaledContiguous*Runner.forward methods). Prints unique
shape signatures with a monotonic call counter so autotune-profiling
shapes AND inference shapes are both captured in a single run.

Used for the audit follow-up flashinfer-ai#9 EP=8 investigation (commit
dde2723 addendum) — surfaced the CuteDslMoEWrapper prealloc-bias bug by
revealing that fi was systematically picking tile_size=128 even after
the flashinfer-ai#3067 fix.

Implementation is _install_kernel_shape_dumpers gated on the flag.
Subtlety captured in the implementation comment: must NOT call .item()
on tensor arguments inside the dumper. .item() is a CUDA host sync,
illegal inside torch.cuda.graph() capture, and produces
cudaErrorStreamCaptureInvalidated on the next kernel launch. An earlier
draft of the dumper had .item() in _fmt_scalar via "hasattr(t, 'item')
and t.numel() == 1: return int(t.item())"; removed in favor of always
emitting tensor shape only. (A sketch of the pattern follows this
message.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
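A sketch of the dumper pattern with the `.item()` pitfall called out. The wrapping mechanics and any names other than those quoted in the commit are illustrative:

```python
import functools
import itertools

_calls = itertools.count()  # monotonic call counter across both sides


def _fmt_arg(a):
    # Shape-only formatting, deliberately: calling .item() here would be
    # a CUDA host sync, which is illegal inside torch.cuda.graph()
    # capture and invalidates it (cudaErrorStreamCaptureInvalidated on
    # the next kernel launch).
    if hasattr(a, "shape"):
        return f"Tensor{tuple(a.shape)}"
    return repr(a)


def wrap_with_shape_dump(fn, tag):
    """Monkey-patch wrapper: print a shape signature with the call
    counter, then delegate. Installed on both sides' gemm1/gemm2 call
    sites when --dump-kernel-shapes is passed, so profiling and
    inference shapes are captured in one run."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        sig = ", ".join(_fmt_arg(a) for a in args)
        print(f"[{next(_calls):05d}] {tag}({sig})")
        return fn(*args, **kwargs)
    return wrapper
```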
leejnau added a commit to leejnau/flashinfer that referenced this pull request on Apr 28, 2026
…shinfer-ai#3067 PR)" This reverts commit a4137e67ca52ed4b6d8b59f2c2d5a3da1b53a497. Strategic call (2026-04-28): pull the wrapper prealloc-bias fix off the audit branch. Two reasons: 1. **Don't stack ahead of main.** The audit branch already carries the flashinfer-ai#3067 preview at 074c185. Stacking another substantial wrapper change (this commit) on top, before flashinfer-ai#3067 has merged, creates a real PR-dependency chain. Reviewers of either PR would need context on both, and rebase cascades are likely. Keep the audit branch lean and the prealloc fix in its own slot until flashinfer-ai#3067 actually lands. 2. **The thin-adapter refactor is the better medium-term fix.** The audit's architectural recommendation (see addendum section *flashinfer-ai#9 EP=8 +8.6% gap — root cause identified 2026-04-28*, subsection *Architectural observation: wrapper vs functional API*) is to refactor CuteDslMoEWrapper as a thin stateful adapter holding only stream/event references and delegating ALL buffer allocation to the functional API. That refactor would delete _allocate_buffers entirely, along with self.tile_size, self.max_num_tokens, and the use_prealloc check — the bug class disappears structurally because there is no prealloc. The careful work in this commit patches a symptom in code we may delete; if the refactor pans out, this fix evaporates. The fix is preserved on branch `cute-dsl-moe-wrapper-prealloc-bias-fix` for resurrection if the refactor turns out impractical. Patch file also at /tmp/claude/cute_dsl_prealloc_fix.patch (Mac). Audit doc and memories updated to reflect this strategic decision. The 100-iter validation finding (+8.2% post-fix gap, see audit addendum) still stands — it confirms the prealloc bias is real but not sufficient on its own to close flashinfer-ai#9 EP=8. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request on Apr 28, 2026
… bucket-cap fixes

Updates the post-close addendum and the flashinfer-ai#9 sub-section to
reflect the 2026-04-28 root-cause work. The audit's headline +7-9% EP
perf gaps at N=16384 (EP=8 +8.6% kernel, EP=16 +13.7% kernel) are
driven by two paired CuteDSL MoE autotune bugs:

1. CuteDslMoEWrapper prealloc bias (correctness): the wrapper
   prealloc'd buffers sized for self.tile_size only, biasing autotune
   profiling against the other tile_size. Without the fix, fi is locked
   to tile_size=128 in 14/14 cache entries. Side branch
   cute-dsl-moe-wrapper-prealloc-bias-fix at commit a4137e6.
2. Autotuner bucket-cap mismatch (headline perf): tuner.py:281 caps
   profile buckets at 8192 while runtime can be N=16384. fi profiles at
   half the per-expert workload, and autotune noise on tight margins
   (0.58% at EP=8 bucket=8192) destabilizes the cached choice. 1-line
   fix: bump the cap to 16384. (An illustrative sketch follows this
   message.)

Validation at --num-iters 100 --warmup 10 (3 runs each):

  no fixes:           EP=8 +5.5%/9.5pp   EP=16 +1.8%/9.7pp
  prealloc-fix only:  EP=8 +4.4%/11pp    EP=16 +2.4%/10.5pp
  both fixes:         EP=8 -0.8%/1.1pp   EP=16 -5.6%/10pp

The prealloc fix alone does not close the headline gap; the bucket-cap
fix is what closes it. Both fixes are required. The EP=16 10pp spread
under both-fixes is trt's autotune coin-flip on its 0.08% tile=128 vs
tile=256 profile-time margin — fi's chosen tactic is stable across all
3 runs; trt's is the noise source.

Several earlier 2026-04-28 revisions to this addendum had wrong
framings ("EP=16 fully closed by prealloc-fix alone at --num-iters 1"
was a lucky 2-run sample; "necessary-but-not-sufficient at EP=8" was
partly true at --num-iters 1, but the residual was the bucket-cap
mismatch, not an unfixable deeper issue). The final state captured here
supersedes all earlier revisions in this same audit doc.

Neither fix is upstream as of 2026-04-28 — held pending
flashinfer-ai#3067 (fix-cutedsl-moe-gemm2-tactic-enum-tile256) and
flashinfer-ai#3198 (get_max_num_tiles off-by-one) merging first. The
thin-adapter refactor remains the recommended medium-term direction;
both short-term fixes patch symptoms in code the refactor would delete.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
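An illustrative reconstruction of the bucket-cap mismatch. The real logic lives at tuner.py:281; the power-of-two rounding used here is an assumption about the bucketing scheme, only the cap values come from the commit:

```python
PROFILE_BUCKET_CAP = 8192  # pre-fix value; the 1-line fix bumps it to 16384


def profile_bucket(num_tokens: int) -> int:
    """Round a runtime token count to the bucket the autotuner
    profiles at (sketch: round up to a power of two, then cap)."""
    bucket = 1
    while bucket < num_tokens:
        bucket *= 2
    return min(bucket, PROFILE_BUCKET_CAP)


# Under the pre-fix cap, runtime N=16384 is profiled at bucket=8192 —
# half the per-expert workload — so a tactic ranking decided on a
# 0.58% profile-time margin need not hold at the shape that actually
# runs, and the cached choice destabilizes run to run.
assert profile_bucket(16384) == 8192
```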
leejnau added a commit to leejnau/flashinfer that referenced this pull request on Apr 28, 2026
Adds a "Final cross-N validation" sub-section under flashinfer-ai#9 capturing the 3-point baseline-vs-both-fixes sweep at N ∈ {4096, 8192, 16384} × EP ∈ {8, 16} × 3 runs at --num-iters 100. All 6 (EP, N) cells with both fixes applied are at fi-parity or fi-faster; no regression vs baseline. The N=16384 closure is reproduced (EP=8: +8.3% → -0.5%; EP=16: +5.2% → -2.8%). Notable observation: EP=16 N=4096 fi-advantage shrinks from -3.5% to -0.6% with the fixes. fi still wins, just by less. The fixes trade a small accidental advantage at small N (where tile=128 was locked-in correct due to the prealloc bias) for a substantial closure at N=16384. This is the closing empirical validation of the paired-fix story. The audit's perf-investigation thread is now empirically settled; remaining audit-cycle work is the cleanup follow-ups (PR opening, test additions, refactor scoping) tracked separately. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request on Apr 28, 2026
…uns at ±3% noise floor)

Promotes the +13.7% kernel-level closure from "closed by inference" to
"closed by direct measurement". Adds a sub-section under
flashinfer-ai#9 capturing the 3-run per-kernel measurement at EP=16
N=16384 --num-iters 100 with both fixes applied:

  Run 1 (trt picked tile=128, slower kernel):
    gemm1 Δ% -17.0%, gemm2 Δ% -18.0%, kernel-sum Δ% -17.4%,
    full-pass -7.8%
  Run 2 (trt picked tile=256, faster kernel):
    gemm1 Δ% +3.4%, gemm2 Δ% +2.5%, kernel-sum Δ% +3.0%,
    full-pass +0.4%
  Run 3 (trt picked tile=256, faster kernel):
    gemm1 Δ% -1.4%, gemm2 Δ% -2.4%, kernel-sum Δ% -1.8%,
    full-pass +0.4%

In matched-tactic runs (Runs 2-3, both at tile=256), kernel-level Δ%
lands within the ±3% autotune-noise floor — fully closing the audit's
+13.7% reading. In the mismatched run (Run 1), fi is 17% faster at
kernels because fi-with-fixes consistently uses tile=256 while trt's
coin-flip can land on tile=128. The prior +13.7% audit measurement
(fi-128 vs trt-256) is no longer reproducible because fi never picks
tile=128 at runtime under both fixes.

This run also directly verifies the trt-coin-flip noise model that was
previously inferred from the 0.08% profile margin: trt's per-run gemm1
time splits cleanly between tile=128 (~0.1234 ms) and tile=256
(~0.1004-0.1029 ms), tracking trt's tactic choice. fi is stable across
all 3 runs; trt is the noise source.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request on Apr 29, 2026
…dual as version-skew

Adds the closing fresh-container reproduction sub-section under
flashinfer-ai#9 covering EP ∈ {1, 8, 16} × N ∈ {4096, 8192, 16384} × 3
runs at --num-iters 100 with both fixes applied. This is the
comprehensive empirical validation of the audit's perf-investigation
closure:

  | (EP, N)       | Mean Δ% | Spread | Verdict                             |
  |---------------|---------|--------|-------------------------------------|
  | EP=1 N=16384  | +3.9%   | 0.3pp  | fi slower (version-skew)            |
  | EP=8 N=4096   | -7.0%   | 3.7pp  | fi faster                           |
  | EP=8 N=8192   | +1.4%   | 2.3pp  | parity                              |
  | EP=8 N=16384  | -4.4%   | 8.7pp  | fi faster (trt coin-flip variance)  |
  | EP=16 N=4096  | -5.4%   | 1.0pp  | fi faster                           |
  | EP=16 N=8192  | +3.0%   | 2.4pp  | parity                              |
  | EP=16 N=16384 | -9.7%   | 1.8pp  | fi much faster (all 3 trt-tile=128) |

Key findings:

(1) The EP>1 perf gap is closed at all measured shapes — mean Δ% is
    fi-at-parity-or-faster across the EP × N grid.
(2) The EP=1 +3.9% residual is real (tight 0.3pp spread) but explained
    by the audit's documented version-skew caveat (NGC trt
    1.3.0rc5.post2 lacks 3 forward-ported commits). Auto-resolves on
    the next NGC bump. NOT closed by the paired fixes — the mechanism
    is separate (kernel-level fi is -2.4% faster at EP=1; the +3.9%
    full-pass gap lives in non-kernel GPU-side work, not autotune).
(3) Per-kernel detail at EP=16 N=16384 (this session): all 3 runs
    caught trt-tile=128 by coin-flip; gemm1+gemm2 sum Δ% = -13 to -15%
    (fi much faster). The audit's original +13.7% kernel-level gap is
    fully closed — the *opposite* mismatch is now possible because fi
    consistently picks tile=256.
(4) Cross-session variance documented: EP=8 N=16384 came in -4.4% here
    vs -0.8% in an earlier same-day session, both with both fixes. The
    variance is from trt's coin-flip ratio across sessions (different
    SLURM allocation / GPU instance). The sign is consistent
    (fi-at-parity-or-faster); magnitude varies. For audit-closure
    purposes the sign plus the tight bound is sufficient.

This commit closes the empirical perf-investigation thread of the
audit. Remaining audit-cycle work is upstream shipping (PRs not yet
opened, tests not yet added) tracked separately — not part of this
empirical thread.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request on Apr 29, 2026
Adds the EP=8 per-kernel breakdown (3 runs at EP=8 N=16384 --num-iters
100 with both fixes) showing the same trt-coin-flip pattern as EP=16:

  Run 1 (trt tile=256): kernel-sum Δ% = -0.4%,  full-pass +1.2%
  Run 2 (trt tile=256): kernel-sum Δ% = -0.4%,  full-pass -1.0%
  Run 3 (trt tile=128): kernel-sum Δ% = -12.1%, full-pass -1.0%

trt's gemm1 splits cleanly between ~0.206 ms (tile=256) and 0.238 ms
(tile=128), a ~15% jump tracking trt's per-tactic perf — mirroring the
EP=16 ~20% jump previously documented. This confirms trt's coin-flip is
the noise source at EP=8 too, not fi. Matched-tactic case (Runs 1+2):
kernel-sum Δ% within the ±0.4% noise floor. Mismatched case (Run 3): fi
12% faster at kernels.

This upgrades the EP=8 closure from "inferred from EP=16 evidence" to
"direct measurement", symmetric to the EP=16 closure committed earlier
today.

Also documents an interesting wrapper-overhead observation: in the
mismatched run fi is 12% faster at kernels but only -1.0% at full-pass.
fi has ~44 µs more non-kernel time than trt at EP=8 (CuteDslMoEWrapper
Python-layer overhead vs trt's C++ thop). Same mechanism as the EP=1
+3.9% residual; not a port-correctness issue, but it is the
architectural difference the recommended thin-adapter refactor
addresses.

This commit is the final piece of empirical evidence for the audit's
perf-investigation closure. Both flashinfer-ai#9 sub-questions (EP=8
+8.6% and EP=16 +13.7%) are now closed by direct per-kernel
measurement, not inference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request on Apr 29, 2026
…, mark superseded sub-sections

Lee asked for a thorough audit-of-the-audit pass: look for
inconsistencies, discrepancies, redundancies, and opportunities for
compaction/simplification. After reading the full 3952-line doc, the
following systematic edits were applied.

Top-of-doc:

- Added a "Final state (2026-04-29) — read this first" section at the
  very top (immediately after the title, before the chronological
  correction log). Captures: the static audit summary, both upstream
  PRs (flashinfer-ai#3067, flashinfer-ai#3198), the two paired wrapper
  bugs and their side branches, the comprehensive 30-cell validation
  matrix, the three-regime classification (large-N closure / medium-N
  wrapper-overhead residual / small-N noise), the EP=1 + version-skew
  explanation, and the closure of the empirical thread. This section is
  now authoritative; the rest of the doc is the chronological
  investigation record leading to it.
- Updated the title's "last updated" tag to 2026-04-29.

Stale references fixed:

- The flashinfer-ai#3067 fix branch was referred to via the stale
  commit hashes d291d17e and e2330e73/f0cf8cd0 in three places; updated
  to the current 57de5f8 + 38293a0 (with notes that the original
  commit-hash references are preserved for traceability).
- The "Tunable runner" row in the Summary table marked tile_size=256 as
  "gated off"; updated to "fixed via PR flashinfer-ai#3067 (pending
  review)".
- Item 3 of the corrections list: "Recommended fix (drafted, not yet
  applied)" → "Fix landed and shipped as upstream PR
  flashinfer-ai#3067".

Duplicate content removed:

- Two "What still stands (unaffected by the autotune-context bug)"
  blocks were nearly identical; the first was removed and replaced with
  a one-line pointer to the consolidated version below.

Executive verdict updated:

- Rewritten to reflect the current state: it now references both PRs,
  the paired-fix story, the comprehensive 30-cell sweep matrix, and the
  +13.7% kernel-level closure. Previously it carried "Leading
  hypothesis (NOT YET EMPIRICALLY VERIFIED)" framings that have since
  been confirmed and shipped.

Superseded sub-sections marked:

- "EP=8 vs EP=16: the fix has different impact at each" — was an early
  "necessary but not sufficient" framing under the prealloc-only state;
  replaced with a SUPERSEDED note pointing to the paired-fix story
  below (the bucket-cap-fix discovery is what closed both EPs).
- "EP=16 confirmation (2026-04-28)" header clarified to indicate it's a
  single-iter prealloc-only snapshot; the H_A/H_B
  hypothesis-discrimination logic and the trt 0.08% margin observation
  are still load-bearing and preserved.
- "Final cross-N validation (2026-04-28)" renamed to "Initial 6-cell
  cross-N spot-check" with a pointer to the comprehensive 30-cell sweep
  that supersedes it.
- Status sub-section: removed outdated EP=8 partial-closure / EP=16
  full-closure bullets (those were prealloc-only state); replaced with
  a concise paired-fix branch summary and a pointer to the
  comprehensive sweep for per-cell results.
- Implications for the audit's earlier claims: updated the "dominant
  factor is the prealloc bias" bullet to the paired-fix framing
  (prealloc + bucket-cap, both required).

Follow-ups updated:

- Open follow-ups remaining: appended an "Update 2026-04-29 (final):
  flashinfer-ai#9 also closed" note documenting that the EP=8/EP=16
  perf gap was root-caused to two paired wrapper bugs, with the
  empirical investigation thread now closed.
- Item flashinfer-ai#9 in the chronological "Follow-ups queued for
  later sessions" list: added a RESOLVED 2026-04-29 header with a
  pointer to the paired-fix story, marking the 9a / 9b / 9c
  sub-mechanism candidates as pre-resolution candidate hypotheses
  (preserved as a record of the search space, not blocking work).

What was preserved as historical record (not removed):

- The full "2026-04-24/25/28 correction" chronological log
- Per-component static audit detail
- The issue flashinfer-ai#3067 narrative, including the original
  "Leading hypothesis" framing and the post-fix verification table
- The 22-day candidate-root-cause investigation that ruled out 5
  mechanisms before landing on the paired wrapper bugs
- Part 2 baseline numbers (with retraction annotations preserved as-is)
- Probes, MoE kernel coverage, reproduction runbook

Net effect: the doc is more navigable and internally consistent, and
the authoritative summary is at the top. The body preserves the
investigative reasoning. Diff: +241 / -217 lines, net +24 lines
(intentional — the new Final state section adds explicit authoritative
content).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>