
Rename target tvm_binding to flashinfer_tvm #8

Merged
yzh119 merged 1 commit into main from junrushao-patch-1
Oct 20, 2023

Conversation

@junrushao
Collaborator

@junrushao junrushao commented Oct 20, 2023

No description provided.

junrushao added a commit to junrushao/mlc-llm that referenced this pull request Oct 20, 2023
This commit integrates FlashInfer as an optional dependency of libmlc_llm.
FlashInfer exposes a TVM-native target, flashinfer_tvm, which MLC LLM can link
against directly.

Depends on flashinfer-ai/flashinfer#8. Also, the git submodule will need to be
switched to an https-based URL after FlashInfer becomes public.
junrushao added a commit to junrushao/mlc-llm that referenced this pull request Oct 20, 2023
This commit integrates FlashInfer as an optional dependency of libmlc_llm.
FlashInfer exposes a TVM-native target, flashinfer_tvm, which MLC LLM can link
against directly.

Depends on flashinfer-ai/flashinfer#8. Also, the git submodule will need to be
switched to an https-based URL after FlashInfer becomes public.
Collaborator

@yzh119 yzh119 left a comment

LGTM, thanks!

@yzh119 yzh119 merged commit 2711216 into main Oct 20, 2023
@yzh119 yzh119 deleted the junrushao-patch-1 branch October 20, 2023 14:55
junrushao added a commit to junrushao/mlc-llm that referenced this pull request Oct 20, 2023
This commit integrates FlashInfer as an optional dependency of libmlc_llm.
FlashInfer exposes a TVM-native target, flashinfer_tvm, which MLC LLM can link
against directly.

Depends on flashinfer-ai/flashinfer#8. Also, the git submodule will need to be
switched to an https-based URL after FlashInfer becomes public.
MasterJH5574 pushed a commit to MasterJH5574/mlc-llm that referenced this pull request Oct 31, 2023
This commit integrates FlashInfer as an optional dependency of libmlc_llm.
FlashInfer exposes a TVM-native target, flashinfer_tvm, which MLC LLM can link
against directly.

Depends on flashinfer-ai/flashinfer#8. Also, the git submodule will need to be
switched to an https-based URL after FlashInfer becomes public.
MasterJH5574 pushed a commit to MasterJH5574/mlc-llm that referenced this pull request Oct 31, 2023
This commit integrates FlashInfer as an optional dependency of libmlc_llm.
FlashInfer exposes a TVM-native target, flashinfer_tvm, which MLC LLM can link
against directly.

Depends on flashinfer-ai/flashinfer#8. Also, the git submodule will need to be
switched to an https-based URL after FlashInfer becomes public.
MasterJH5574 pushed a commit to MasterJH5574/mlc-llm that referenced this pull request Oct 31, 2023
This commit integrates FlashInfer as an optional dependency of libmlc_llm.
FlashInfer exposes a TVM-native target, flashinfer_tvm, which MLC LLM can link
against directly.

Depends on flashinfer-ai/flashinfer#8. Also, the git submodule will need to be
switched to an https-based URL after FlashInfer becomes public.
diptorupd referenced this pull request in ROCm/flashinfer Sep 29, 2025
In this PR I remove the `libtorch` dependency and delete `test_page.cpp`.
`test_page.cpp` was the only unit test that used libtorch, and the page module
is already covered by a pytest, which we will use for validation.

Removing the libtorch dependency will speed up Docker builds and drop an extra
dependency.


```
Test project /root/flashinfer/libflashinfer/tests/hip/build
    Start 1: MathTest
1/8 Test #1: MathTest ............................   Passed    0.31 sec
    Start 2: PosEncTest
2/8 Test #2: PosEncTest ..........................   Passed    0.31 sec
    Start 3: CascadeTest
3/8 Test #3: CascadeTest .........................   Passed  1369.12 sec
    Start 4: SingleDecodeTest
4/8 Test #4: SingleDecodeTest ....................   Passed  7726.35 sec
    Start 5: BatchDecodeTest
5/8 Test #5: BatchDecodeTest .....................   Passed  811.61 sec
    Start 6: test_mfma_fp32_16x16x16fp16
6/8 Test #6: test_mfma_fp32_16x16x16fp16 .........   Passed    0.30 sec
    Start 7: test_transpose_4x4_half_registers
7/8 Test #7: test_transpose_4x4_half_registers ...   Passed    0.28 sec
    Start 8: test_rowsum
8/8 Test #8: test_rowsum .........................   Passed    0.27 sec

100% tests passed, 0 tests failed out of 8
```
wangbo981016 pushed a commit to meituan-longcat/flashinfer that referenced this pull request Feb 5, 2026
Co-authored-by: wuguanyu02 <wuguanyu02@meituan.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 24, 2026
… test artifact + separate perf gap

The decisive experiment on 2026-04-24: force flashinfer's autotuner
to ONLY consider tile_size=256 tactics (tuner.py patched to
`for tile_size in [256]`), run the port-parity bench at N=16384.
Result: `parity: max_abs=0.0625 -> PASS`. max_abs=0.0625 is literally
one FP4 step (1/16) — the finest difference FP4 can represent.
Head-to-head comparison (no pure-PyTorch FP4 reference in the loop,
just torch.allclose(fi_out, trt_out, atol=0.1, rtol=0.01)) confirms
flashinfer's tile_size=256 / 2CTA kernel produces output that matches
TRT-LLM's tile_size=256 output within FP4 noise.
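
As a minimal sketch, the head-to-head check amounts to comparing the two
wrappers' outputs directly with the tolerances quoted above (the helper name
and the fi_out / trt_out variables are illustrative, not the bench's actual
code):

```python
import torch

def check_parity(fi_out: torch.Tensor, trt_out: torch.Tensor) -> None:
    # Head-to-head check as described above: no pure-PyTorch FP4 reference,
    # just flashinfer output vs TRT-LLM output on identical inputs.
    max_abs = (fi_out.float() - trt_out.float()).abs().max().item()
    ok = torch.allclose(fi_out.float(), trt_out.float(), atol=0.1, rtol=0.01)
    print(f"parity: max_abs={max_abs:.4f} -> {'PASS' if ok else 'FAIL'}")
```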

Therefore: the tile_size=256 "correctness bug" hunted over 22 days is
not real. test_all_tactics_accuracy's 8/16 failures at tile_size=256
(78.40% / 94.78% within tolerance against compute_reference_moe_fp4)
are a reference-simulator strictness artifact, not a kernel bug. The
stable failure rate across shapes and all 8 failing tactic variants is
the signature of a deterministic numerical offset from the reference,
not corruption — exactly what we would expect if the reference models
the FP4 pipeline differently than the actual kernel does at 2CTA.

This vindicates the port author's 2026-04-14 theory ("flashinfer's
test may be stricter than what TRT-LLM exercises"), which the audit
initially claimed to have disproven via Reproduction B. That
"disproof" was retracted on 2026-04-24 — Reproduction B used
compute_reference_moe_bf16, while test_all_tactics_accuracy uses
compute_reference_moe_fp4; the two references compute different
targets.

SEPARATELY, at tile_size=256 flashinfer is still +46% slower on
gemm1_swiglu than TRT-LLM at the same tactic (2.644 ms vs 1.806 ms
at N=16384). Enabling tile_size=256 does NOT resolve the large-batch
regression — it just changes the tactic being run. The regression's
cause is something other than tactic gating — working hypothesis:
compile-time parameter, launch-grid configuration, or JIT template
instantiation differences between the two wrappers. Tracked as new
follow-up flashinfer-ai#8 (added in a separate commit).

Concrete recommendations for flashinfer (now that correctness is
established):
1. Un-gate tile_size=256 in tuner.py
2. Fix test_all_tactics_accuracy's reference or threshold at 2CTA
3. Investigate the +46% gemm1_swiglu gap at tile_size=256 (follow-up flashinfer-ai#8)

Executive verdict rewritten to reflect this pivot. A prominent
Resolution block added at the top of the "Known divergences" section
(issue flashinfer-ai#3067) — the detailed investigation below it is preserved as
historical record but reframed as closed.
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 24, 2026
…lu gap at tile_size=256

Now that correctness at tile_size=256 is established (previous commit
reclassifies flashinfer-ai#3067 as test artifact), the real open work item is the
perf gap that the audit had mis-attributed to the correctness bug.

2026-04-24 experiment, forced tile_size=256 on flashinfer via
tuner.py patch (for tile_size in [256]):
  fi gemm1_swiglu at N=16384, tile=256: 2.644 ms
  trt gemm1_swiglu at N=16384, tile=256: 1.806 ms
  +46.4 percent at the SAME tactic

At tile=128, flashinfer's gemm1 was 2.711 ms — so enabling tile=256
barely helps flashinfer, while TRT-LLM at the same tile=256 tactic
gets much more throughput. The large-batch +27 percent top-line
regression is NOT resolved by un-gating tile_size=256.

Both sides compile the identical CuteDSL kernel source (deep audit
established semantic identity of kernel bodies). So the SASS or
runtime differs despite identical Python source. Working hypotheses:

 - 8a. Compile-time parameter / constexpr drift between wrappers'
   invocations causes different SASS
 - 8b. Launch-grid / cluster config mis-set on flashinfer (2CTA
   effectively running as 1CTA)
 - 8c. Input buffer alignment / stride differences the kernel
   optimizer exploits
 - 8d. Stream / cooperative-group sync context difference

Suggested first probe: nsys trace at N=16384 for both sides at
tile_size=256, compare launch config + SASS identifier + span
alignment. Should quickly narrow 8a vs 8b vs 8d.

Note on prior follow-up flashinfer-ai#1: this supersedes its framing. "MbarrierArray
shim / tile_gating causes the regression" is now invalidated by the
forced-tile=256 experiment showing un-gating does not recover the
regression. The rest of flashinfer-ai#3067-era candidates (flashinfer-ai#3 fence_proxy, flashinfer-ai#5
orchestration, flashinfer-ai#6 top-level wrappers) were framed around a
correctness bug that doesn't exist and are now subordinate.
Follow-up flashinfer-ai#8 replaces them as the primary perf work item.
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 25, 2026
…ssification — bench autotune-context bug

Discovered 2026-04-24: benchmarks/bench_cute_dsl_port_parity.py imports
AutoTuner and autotune from tensorrt_llm._torch.autotuner only. The
flashinfer side's CuteDslMoEWrapper.run() consults
flashinfer.autotuner.AutoTuner — a different singleton — which never
enters tuning mode in the bench. Result: flashinfer's tuner.choose_one
returned tactic=-1 on cache miss → forward_impl falls through to
DEFAULT_MOE_TACTIC = (128, ((128,128),(1,1),False), ((128,128),(1,1),False))
at tuner.py:454-455. Every "+46% perf gap" / "+27% large-batch
regression" measurement was therefore comparing flashinfer-untuned-
default vs TRT-LLM-tuned-best.
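
A minimal sketch of the failure mode, using the module paths named above (the
warmup call is a stand-in, and the `.get()` singleton accessor and the exact
type of `profiling_cache` are assumptions):

```python
from tensorrt_llm._torch.autotuner import autotune as trt_autotune
from flashinfer.autotuner import AutoTuner as FiAutoTuner

def run_parity_warmup():
    ...  # stand-in for the bench's per-size warmup over both wrappers

# The original bench entered only the TRT-LLM tuning context:
with trt_autotune():
    run_parity_warmup()

# flashinfer's separate singleton never tuned, so its cache stays empty and
# every timed run falls through to DEFAULT_MOE_TACTIC (tactic=-1 on cache miss).
print(len(FiAutoTuner.get().profiling_cache))  # 0 under the buggy bench
```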

Two findings flipped:

1. Follow-up flashinfer-ai#8 RETRACTED. With the bench fixed, gemm1_swiglu PTX is
   byte-identical between sides (md5 401ebca6, zero diff lines after
   symbol normalization) and per-call time matches within 0.1%
   (1.8305 vs 1.8316 ms at N=16384). The kernel port is bit-identical
   to TRT-LLM's at the compiled-code level.

2. Issue flashinfer-ai#3067 reclassification "test simulator artifact" RETRACTED.
   The "head-to-head parity passes at tile_size=256" experiment that
   underpinned the reclassification was made under the same buggy
   bench. With the bench fixed, parity FAILS at tile_size=256
   (max_abs=13.69). The original tuner.py "produces incorrect
   results" comment was correct.

Leading hypothesis for the parity failure (NOT YET VERIFIED):
flashinfer ALL_MOE_TACTICS at tile_size=256 enumerates only 1-CTA
gemm2 tactics (cluster=(1, X)). TRT-LLM picks 2-CTA gemm2 tactics
(cluster=(2, X)). 2-CTA gemm1 output feeding 1-CTA gemm2 plausibly
produces wrong output. Two probes pending in next session.
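
One possible first probe along these lines, assuming the tactic tuples follow
the layout quoted above for DEFAULT_MOE_TACTIC, i.e. (tile_size, gemm1_cfg,
gemm2_cfg) with each cfg being (mma_tile, cluster, flag):

```python
def gemm2_clusters_at_tile(all_moe_tactics, tile_size=256):
    # Collect the distinct gemm2 cluster shapes enumerated at a given tile size.
    # The hypothesis predicts only (1, X) clusters on the flashinfer side,
    # while TRT-LLM's selection includes (2, X).
    return sorted({tactic[2][1] for tactic in all_moe_tactics
                   if tactic[0] == tile_size})

# e.g. run gemm2_clusters_at_tile(ALL_MOE_TACTICS) on each side, then diff.
```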

What stands unaffected: Part-1 static audit (HIGH/MEDIUM/LOW commit
verdicts), kernel source byte-identical-with-rc5 verification, and
the five candidate mechanisms ruled out for flashinfer-ai#3067 (each remains
falsified — divergence must lie elsewhere).

v3 baseline numbers preserved as-measured for reproducibility but
inline interpretation-correction notes mark the wrong "regression"
framing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 25, 2026
…lashinfer-ai#8 class of bug

The previous warmup block in `run_one_size` imported `AutoTuner` and
`autotune` only from `tensorrt_llm._torch.autotuner` and entered just
that one context. flashinfer's `CuteDslMoEWrapper.run()` consults
`flashinfer.autotuner.AutoTuner` — a separate singleton — which never
entered tuning mode, causing flashinfer's `tuner.choose_one()` to
return tactic=-1 on cache miss and `forward_impl` to fall through
to `DEFAULT_MOE_TACTIC` (1-CTA / 128x128 MMA) for every measurement.
Discovered 2026-04-24; produced two weeks of misleading perf data
(retracted in commit a0c0b6c).

Changes:

- `import_flashinfer()` now returns `(AutoTuner, autotune)` from
  `flashinfer.autotuner` alongside the kernel symbols, symmetric to
  `import_trtllm()`. Updated all 4 call sites to unpack 7 values.

- `_assert_distinct_autotuners(fi_AutoTuner, trt_AutoTuner)`:
  startup check that hard-exits if the two AutoTuner singletons are
  the same object. Catches a future framework refactor that unifies
  them — the dual-context warmup below would silently become
  redundant or, worse, regress to single-context if reverted later.
  Idempotent across the per-size loop.

- `_audit_selected_tactics(fi_AT, trt_AT, num_tokens=...)`: post-
  warmup audit that hard-exits if either autotuner's profiling_cache
  is empty (= the corresponding `autotune()` context did not
  engage). Prints the selected tactic on each side. Soft-warns when
  flashinfer picks `DEFAULT_MOE_TACTIC` (legitimate at small N where
  the validity filter rejects every entry, but worth surfacing).

- The warmup block now clears both caches and stacks both
  `autotune()` contexts in a single `with` statement.

- A "Autotuner-context safety net" comment block documents the
  parallel-singleton design and the two failure modes the helpers
  catch (single-context warmup; future singleton unification).

The audit doc already covers the empirical verification — at N=16384
with this fix in place, gemm1_swiglu PTX is byte-identical between
sides (md5 match after symbol normalization, zero diff lines) and
per-call time matches within 0.1%.
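
Putting the pieces above together, a sketch of the corrected per-size warmup
(helper names follow the commit message; singleton access via `.get()` and the
exact cache type are assumptions):

```python
from tensorrt_llm._torch.autotuner import AutoTuner as TrtAutoTuner, autotune as trt_autotune
from flashinfer.autotuner import AutoTuner as FiAutoTuner, autotune as fi_autotune

def _assert_distinct_autotuners(fi_cls, trt_cls):
    # Hard-exit if a future framework refactor unifies the two singletons;
    # the dual-context warmup below would then be silently redundant.
    if fi_cls.get() is trt_cls.get():
        raise SystemExit("AutoTuner singletons unexpectedly unified")

def warmup_one_size(run_fi, run_trt):
    _assert_distinct_autotuners(FiAutoTuner, TrtAutoTuner)
    FiAutoTuner.get().profiling_cache.clear()
    TrtAutoTuner.get().profiling_cache.clear()
    # Stack BOTH tuning contexts so each wrapper's own singleton tunes;
    # entering only one reproduces the tactic=-1 fallthrough.
    with fi_autotune(), trt_autotune():
        run_fi()
        run_trt()
    # Post-warmup audit: an empty cache means that side's context never engaged.
    for name, tuner_cls in (("flashinfer", FiAutoTuner), ("trtllm", TrtAutoTuner)):
        if not tuner_cls.get().profiling_cache:
            raise SystemExit(f"{name} autotune() context did not engage")
```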

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 28, 2026
…ground-truth verification

A single nsys trace at N=8192 with `--nsys-capture-range` (commit
40bb77e) bracketing only the timed measurement passes resolved
both remaining measurement-related follow-ups.

flashinfer-ai#4 (`bench_gpu_time_with_cupti(use_cuda_graph=True)` 2× inflation):
direct wall-clock comparison at N=16384 / 30 iters shows identical
wall-clocks with and without `--use-cupti` (1m12.5s vs 1m9.5s; the
3 s delta is autotune-compile + Python-startup variance, well below
the ~240 ms of actual GPU measurement work in 70+ s of total
wall-clock). The historical 2× signature was always a `cupti-python`
span-attribution artifact, never real GPU work — and it does not
reproduce under current methodology. A smaller asymmetric bias
(~13% under-report on `trt_ms` vs ~5% on `fi_ms`) persists, which
is the rationale for keeping `--use-cupti` opt-in (default off).

flashinfer-ai#6 (in-bench vs standalone 19% gap on trt `gemm2_finalize`): nsys
ground truth at N=8192 = 0.737 ms; current in-bench reports
0.7465 ms (1.3% delta — within noise); standalone reports 0.685 ms
(7.1% below ground truth, harness-to-harness rounding tolerance).
The original 19% gap was specific to the older `--use-cupti` config
against the standalone — under current methodology there is no
systematic bias.

Audit changes:

- New "Ground-truth nsys verification (2026-04-28)" section
  immediately after the post-fix verification, documenting the run
  command, per-kernel ground truth, the resolution of both
  follow-ups with quantitative tables, and a note that the trace
  also serves as a third independent kernel-port faithfulness
  check (kernel mangled-name structure matches modulo encoded
  module path).

- Follow-up flashinfer-ai#1 marked RESOLVED (the original `MbarrierArray`
  framing was wrong; actual cause was the gemm2-enumeration gap
  fixed at d291d17e/f0cf8cd0 on the standalone PR branch).

- Follow-ups flashinfer-ai#4 and flashinfer-ai#6 entries replaced with closure notes.

- Top-of-file correction section title updated to "2026-04-24/25/28"
  and short summary expanded to mention the verification round.

The original "open mysteries" list (flashinfer-ai#1, flashinfer-ai#4, flashinfer-ai#6, flashinfer-ai#8) is now fully
closed. Items remaining in *Follow-ups queued* (flashinfer-ai#2, flashinfer-ai#3, flashinfer-ai#5, flashinfer-ai#7) are
all scope-expansions, not investigations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 28, 2026
…d-and-skipped

flashinfer-ai#3 is a real coverage gap (at gs=1.0 the scale-conversion code paths
run with degenerate values, so a divergent FP8_MAX or scale-convention
mismatch between the two sides would silently produce identical
output here and divergent output at non-trivial scales). Skipped
because:

- Both sides have their own internal scale-plumbing tests with
  non-trivial scales. TRT-LLM's
  tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py and
  flashinfer's tests/moe/test_cute_dsl_fused_moe.py both exercise
  non-trivial-gs configurations against PyTorch references. The
  bench wouldn't add coverage their CI doesn't already have.

- A failure here wouldn't tell us what we want to know. The most
  likely cause of fi-vs-trt parity divergence at production scales
  would be a scale-convention disagreement BETWEEN the two
  implementations — not fixable in flashinfer (TRT-LLM is upstream),
  and dramatically out of scope.

- Risk of getting nerd-sniped on bench-side mistakes. Scale plumbing
  has many surfaces (alpha, weight_scale_2, input_scale, is_sf_*
  flags, fp4_quantize signatures); any tiny bench-side mismatch
  produces a parity failure that looks like a port bug. That's the
  same failure pattern as flashinfer-ai#4 / flashinfer-ai#6 / flashinfer-ai#8 from earlier in the audit.

Scope-difference note: the audit's load-bearing question is "is the
kernel port faithful?" — closed via byte-identical PTX + matching
timing + 45 parity cells. flashinfer-ai#3 is about "do flashinfer and TRT-LLM
agree on NVFP4 scaling conventions?" — a separate question, real
but tangential to port-faithfulness.

One scenario that would re-elevate this: a planned production ship
of CuteDslMoEWrapper to a caller using non-trivial scales, where
the team wants one independent cross-check (against CuteDslFusedMoE
under the same scales) before merging. In that scenario flashinfer-ai#3 is
exactly the right pre-ship sanity. For closing out the investigation
audit, it's tangential.

Effective remaining open follow-ups: flashinfer-ai#5 (cutlass-dsl 4.4.2 sanity
rerun) and flashinfer-ai#9 (EP=16 tactic-divergence root cause).
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 28, 2026
…flashinfer-ai#9 remains

Three close-out edits to wrap the CuteDSL MoE FP4 port audit:

- Mark flashinfer-ai#5 (cutlass-dsl 4.4.2 sanity rerun) considered-and-skipped,
  matching the closure pattern used for flashinfer-ai#2 and flashinfer-ai#3. Three reasons:
  (a) port-parity claims are unaffected by DSL-compiler version since
  both sides use the same in-container compiler, (b) flashinfer and
  TRT-LLM each have CI testing 4.4.2 already, (c) install hassle plus
  unsupported-config risk produces the same ambiguous-failure pattern
  that cost time on flashinfer-ai#4/flashinfer-ai#6/flashinfer-ai#8. Auto-resolves whenever NGC bumps the
  image. Install recipe preserved for future absolute-latency probes.

- Add a "Version-skew caveat (2026-04-28)" subsection to the
  top-of-file correction. The bench compares flashinfer-with-post-rc5-
  forward-ports (bb2f88329, 6b8ae6fa8, fae498579) vs TRT-LLM-rc5.post2-
  without-them, so the +3.5% / +4.1% headlines at EP=1 N=16384 may
  partly reflect the version asymmetry. Load-bearing claims (port
  faithfulness via byte-identical source + PTX, 45/45 parity, flashinfer-ai#3067
  fix) are unaffected because they do not depend on absolute deltas.
  Naturally re-baselines when NGC publishes a 1.3.x.x image absorbing
  the post-rc5 commits.

- Update the "Open follow-ups remaining" summary: flashinfer-ai#5 added to the
  considered-and-skipped list alongside flashinfer-ai#2/flashinfer-ai#3, leaving flashinfer-ai#9 (EP=16
  tactic-divergence root cause) as the only effective open follow-up.
  Audit declared closed 2026-04-28.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
