Conversation
junrushao added a commit to junrushao/mlc-llm that referenced this pull request on Oct 20, 2023
This commit integrates FlashInfer as an optional dependency of libmlc_llm. FlashInfer exposes a TVM-native target, flashinfer_tvm, which MLC LLM can link against directly. Depends on flashinfer-ai/flashinfer#8. The git submodule will also need to be switched to an https-based URL once FlashInfer becomes public.
junrushao added a commit to junrushao/mlc-llm that referenced this pull request on Oct 20, 2023
This commit integrates FlashInfer as an optional dependency of libmlc_llm. FlashInfer exposes a TVM-native target, flashinfer_tvm, which MLC LLM can link against directly. Depends on flashinfer-ai/flashinfer#8. The git submodule will also need to be switched to an https-based URL once FlashInfer becomes public.
junrushao added a commit to junrushao/mlc-llm that referenced this pull request on Oct 20, 2023
This commit integrates FlashInfer as an optional dependency of libmlc_llm. FlashInfer exposes a TVM-native target, flashinfer_tvm, which MLC LLM can link against directly. Depends on flashinfer-ai/flashinfer#8. The git submodule will also need to be switched to an https-based URL once FlashInfer becomes public.
MasterJH5574 pushed a commit to MasterJH5574/mlc-llm that referenced this pull request on Oct 31, 2023
This commit integrates FlashInfer as an optional dependency of libmlc_llm. FlashInfer exposes a TVM-native target, flashinfer_tvm, which MLC LLM can link against directly. Depends on flashinfer-ai/flashinfer#8. The git submodule will also need to be switched to an https-based URL once FlashInfer becomes public.
MasterJH5574 pushed a commit to MasterJH5574/mlc-llm that referenced this pull request on Oct 31, 2023
This commit integrates FlashInfer as an optional dependency of libmlc_llm. FlashInfer exposes a TVM-native target, flashinfer_tvm, which MLC LLM can link against directly. Depends on flashinfer-ai/flashinfer#8. The git submodule will also need to be switched to an https-based URL once FlashInfer becomes public.
MasterJH5574 pushed a commit to MasterJH5574/mlc-llm that referenced this pull request on Oct 31, 2023
This commit integrates FlashInfer as an optional dependency of libmlc_llm. FlashInfer exposes a TVM-native target, flashinfer_tvm, which MLC LLM can link against directly. Depends on flashinfer-ai/flashinfer#8. The git submodule will also need to be switched to an https-based URL once FlashInfer becomes public.
diptorupd referenced this pull request in ROCm/flashinfer on Sep 29, 2025
In this PR I remove the `libtorch` dependency and delete `test_page.cpp`, the only
unit test that uses libtorch. We also have a pytest that covers the page
functionality, and we will use that for validation.
Removing the libtorch dependency speeds up Docker builds and drops an
extra dependency.
```
Test project /root/flashinfer/libflashinfer/tests/hip/build
Start 1: MathTest
1/8 Test #1: MathTest ............................ Passed 0.31 sec
Start 2: PosEncTest
2/8 Test #2: PosEncTest .......................... Passed 0.31 sec
Start 3: CascadeTest
3/8 Test #3: CascadeTest ......................... Passed 1369.12 sec
Start 4: SingleDecodeTest
4/8 Test #4: SingleDecodeTest .................... Passed 7726.35 sec
Start 5: BatchDecodeTest
5/8 Test #5: BatchDecodeTest ..................... Passed 811.61 sec
Start 6: test_mfma_fp32_16x16x16fp16
6/8 Test #6: test_mfma_fp32_16x16x16fp16 ......... Passed 0.30 sec
Start 7: test_transpose_4x4_half_registers
7/8 Test #7: test_transpose_4x4_half_registers ... Passed 0.28 sec
Start 8: test_rowsum
8/8 Test #8: test_rowsum ......................... Passed 0.27 sec
100% tests passed, 0 tests failed out of 8
```
wangbo981016 pushed a commit to meituan-longcat/flashinfer that referenced this pull request on Feb 5, 2026
Co-authored-by: wuguanyu02 <wuguanyu02@meituan.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request on Apr 24, 2026
… test artifact + separate perf gap
The decisive experiment on 2026-04-24: force flashinfer's autotuner
to ONLY consider tile_size=256 tactics (tuner.py patched to
`for tile_size in [256]`), run the port-parity bench at N=16384.
Result: `parity: max_abs=0.0625 -> PASS`. max_abs=0.0625 is literally
one FP4 step (1/16) — the finest difference FP4 can represent.
Head-to-head comparison (no pure-PyTorch FP4 reference in the loop,
just torch.allclose(fi_out, trt_out, atol=0.1, rtol=0.01)) confirms
flashinfer's tile_size=256 / 2CTA kernel produces output that matches
TRT-LLM's tile_size=256 output within FP4 noise.
Therefore: the tile_size=256 "correctness bug" hunted over 22 days is
not real. test_all_tactics_accuracy's 8/16 failures at tile_size=256
(78.40% / 94.78% within tolerance against compute_reference_moe_fp4)
are a reference-simulator strictness artifact, not a kernel bug. The
stable failure rate across shapes and all 8 failing tactic variants is
the signature of a deterministic numerical offset from the reference,
not corruption — exactly what we would expect if the reference models
the FP4 pipeline differently than the actual kernel does at 2CTA.
This vindicates the port author's 2026-04-14 theory ("flashinfer's
test may be stricter than what TRT-LLM exercises"), which the audit
initially claimed to have disproven via Reproduction B. That
"disproof" was retracted on 2026-04-24 — Reproduction B used
compute_reference_moe_bf16, while test_all_tactics_accuracy uses
compute_reference_moe_fp4; the two references compute different
targets.
SEPARATELY, at tile_size=256 flashinfer is still +46% slower on
gemm1_swiglu than TRT-LLM at the same tactic (2.644 ms vs 1.806 ms
at N=16384). Enabling tile_size=256 does NOT resolve the large-batch
regression — it just changes the tactic being run. The regression's
cause is something other than tactic gating — working hypothesis:
compile-time parameter, launch-grid configuration, or JIT template
instantiation differences between the two wrappers. Tracked as new
follow-up flashinfer-ai#8 (added in a separate commit).
Concrete recommendations for flashinfer (now that correctness is
established):
1. Un-gate tile_size=256 in tuner.py
2. Fix test_all_tactics_accuracy's reference or threshold at 2CTA
3. Investigate the +46% gemm1_swiglu gap at tile_size=256 (follow-up flashinfer-ai#8)
Executive verdict rewritten to reflect this pivot. A prominent
Resolution block added at the top of the "Known divergences" section
(issue flashinfer-ai#3067) — the detailed investigation below it is preserved as
historical record but reframed as closed.
leejnau added a commit to leejnau/flashinfer that referenced this pull request on Apr 24, 2026
…lu gap at tile_size=256
Now that correctness at tile_size=256 is established (previous commit
reclassifies flashinfer-ai#3067 as test artifact), the real open work item is
the perf gap that the audit had mis-attributed to the correctness bug.
2026-04-24 experiment, forced tile_size=256 on flashinfer via tuner.py patch
(`for tile_size in [256]`):
- fi gemm1_swiglu at N=16384, tile=256: 2.644 ms
- trt gemm1_swiglu at N=16384, tile=256: 1.806 ms
- +46.4 percent at the SAME tactic
At tile=128, flashinfer's gemm1 was 2.711 ms — so enabling tile=256 barely helps
flashinfer, while TRT-LLM at the same tile=256 tactic gets much more throughput.
The large-batch +27 percent top-line regression is NOT resolved by un-gating
tile_size=256.
Both sides compile the identical CuteDSL kernel source (the deep audit
established semantic identity of the kernel bodies), so the SASS or the runtime
differs despite identical Python source. Working hypotheses:
- 8a. Compile-time parameter / constexpr drift between the wrappers' invocations causes different SASS
- 8b. Launch-grid / cluster config mis-set on flashinfer (2CTA effectively running as 1CTA)
- 8c. Input buffer alignment / stride differences the kernel optimizer exploits
- 8d. Stream / cooperative-group sync context difference
Suggested first probe: an nsys trace at N=16384 for both sides at tile_size=256,
comparing launch config + SASS identifier + span alignment. Should quickly
narrow 8a vs 8b vs 8d.
Note on prior follow-up flashinfer-ai#1: this supersedes its framing.
"MbarrierArray shim / tile_gating causes the regression" is now invalidated by
the forced-tile=256 experiment showing that un-gating does not recover the
regression. The rest of the flashinfer-ai#3067-era candidates (flashinfer-ai#3
fence_proxy, flashinfer-ai#5 orchestration, flashinfer-ai#6 top-level wrappers)
were framed around a correctness bug that doesn't exist and are now subordinate.
Follow-up flashinfer-ai#8 replaces them as the primary perf work item.
leejnau added a commit to leejnau/flashinfer that referenced this pull request on Apr 25, 2026
…ssification — bench autotune-context bug
Discovered 2026-04-24: benchmarks/bench_cute_dsl_port_parity.py imports
AutoTuner and autotune from tensorrt_llm._torch.autotuner only. The flashinfer
side's CuteDslMoEWrapper.run() consults flashinfer.autotuner.AutoTuner — a
different singleton — which never enters tuning mode in the bench. Result:
flashinfer's tuner.choose_one returned tactic=-1 on cache miss → forward_impl
falls through to DEFAULT_MOE_TACTIC = (128, ((128,128),(1,1),False),
((128,128),(1,1),False)) at tuner.py:454-455. Every "+46% perf gap" /
"+27% large-batch regression" measurement was therefore comparing
flashinfer-untuned-default vs TRT-LLM-tuned-best.
Two findings flipped:
1. Follow-up flashinfer-ai#8 RETRACTED. With the bench fixed, gemm1_swiglu PTX
is byte-identical between sides (md5 401ebca6, zero diff lines after symbol
normalization) and per-call time matches within 0.1% (1.8305 vs 1.8316 ms at
N=16384). The kernel port is bit-identical to TRT-LLM's at the compiled-code
level.
2. Issue flashinfer-ai#3067 reclassification as "test simulator artifact"
RETRACTED. The "head-to-head parity passes at tile_size=256" experiment that
underpinned the reclassification was run under the same buggy bench. With the
bench fixed, parity FAILS at tile_size=256 (max_abs=13.69). The original
tuner.py "produces incorrect results" comment was correct.
Leading hypothesis for the parity failure (NOT YET VERIFIED): flashinfer's
ALL_MOE_TACTICS at tile_size=256 enumerates only 1-CTA gemm2 tactics
(cluster=(1, X)), while TRT-LLM picks 2-CTA gemm2 tactics (cluster=(2, X));
2-CTA gemm1 output feeding 1-CTA gemm2 plausibly produces wrong output. Two
probes are pending in the next session.
What stands unaffected: the Part-1 static audit (HIGH/MEDIUM/LOW commit
verdicts), the kernel-source byte-identical-with-rc5 verification, and the five
candidate mechanisms ruled out for flashinfer-ai#3067 (each remains falsified —
the divergence must lie elsewhere). v3 baseline numbers are preserved
as-measured for reproducibility, with inline interpretation-correction notes
marking the wrong "regression" framing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
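To make the failure mode concrete, a minimal sketch of the pre-fix warmup shape described above. The module paths follow the commit message; the warmup body is an illustrative placeholder, not the bench's actual code.
```python
# Sketch of the buggy (single-context) warmup, assuming the module paths named above.
from tensorrt_llm._torch.autotuner import AutoTuner, autotune  # the only tuner the bench imported
import flashinfer.autotuner                                     # parallel singleton, never engaged

# Entering only the TRT-LLM context leaves flashinfer.autotuner.AutoTuner cold:
# CuteDslMoEWrapper.run() then sees a cache miss, choose_one() returns tactic=-1,
# and forward_impl falls back to DEFAULT_MOE_TACTIC for every measured call.
with autotune():
    ...  # warmup iterations for both wrappers ran here, but only the TRT-LLM side was tuned
```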
leejnau added a commit to leejnau/flashinfer that referenced this pull request on Apr 25, 2026
…lashinfer-ai#8 class of bug
The previous warmup block in `run_one_size` imported `AutoTuner` and `autotune`
only from `tensorrt_llm._torch.autotuner` and entered just that one context.
flashinfer's `CuteDslMoEWrapper.run()` consults `flashinfer.autotuner.AutoTuner`
— a separate singleton — which never entered tuning mode, causing flashinfer's
`tuner.choose_one()` to return tactic=-1 on cache miss and `forward_impl` to
fall through to `DEFAULT_MOE_TACTIC` (1-CTA / 128x128 MMA) for every
measurement. Discovered 2026-04-24; it produced two weeks of misleading perf
data (retracted in commit a0c0b6c).
Changes:
- `import_flashinfer()` now returns `(AutoTuner, autotune)` from
`flashinfer.autotuner` alongside the kernel symbols, symmetric to
`import_trtllm()`. All 4 call sites updated to unpack 7 values.
- `_assert_distinct_autotuners(fi_AutoTuner, trt_AutoTuner)`: a startup check
that hard-exits if the two AutoTuner singletons are the same object. It catches
a future framework refactor that unifies them — the dual-context warmup below
would silently become redundant or, worse, regress to single-context if
reverted later. Idempotent across the per-size loop.
- `_audit_selected_tactics(fi_AT, trt_AT, num_tokens=...)`: a post-warmup audit
that hard-exits if either autotuner's profiling_cache is empty (i.e. the
corresponding `autotune()` context did not engage). It prints the selected
tactic on each side and soft-warns when flashinfer picks `DEFAULT_MOE_TACTIC`
(legitimate at small N where the validity filter rejects every entry, but worth
surfacing).
- The warmup block now clears both caches and stacks both `autotune()` contexts
in a single `with` statement.
- An "Autotuner-context safety net" comment block documents the
parallel-singleton design and the two failure modes the helpers catch
(single-context warmup; future singleton unification).
The audit doc already covers the empirical verification — at N=16384 with this
fix in place, gemm1_swiglu PTX is byte-identical between sides (md5 match after
symbol normalization, zero diff lines) and per-call time matches within 0.1%.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
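And a minimal sketch of the fixed warmup shape. The helper names (`_assert_distinct_autotuners`, the stacked `autotune()` contexts, `profiling_cache`) follow the commit description above; singleton access via `.get()` and the warmup body are assumptions rather than the bench's actual code.
```python
# Post-fix warmup sketch: both autotuner singletons enter tuning mode.
from tensorrt_llm._torch.autotuner import AutoTuner as TrtAutoTuner, autotune as trt_autotune
from flashinfer.autotuner import AutoTuner as FiAutoTuner, autotune as fi_autotune

def _assert_distinct_autotuners(fi_cls, trt_cls) -> None:
    # Hard-exit if a future refactor unifies the two singletons, which would
    # make the dual-context warmup below silently redundant.
    if fi_cls is trt_cls:
        raise SystemExit("flashinfer and TRT-LLM share one AutoTuner; update the bench")

_assert_distinct_autotuners(FiAutoTuner, TrtAutoTuner)

# Stack both autotune() contexts so each side's profiling cache is populated.
with fi_autotune(), trt_autotune():
    ...  # warmup iterations for both wrappers

# Post-warmup audit in the spirit of _audit_selected_tactics: fail loudly if
# either context never engaged. AutoTuner.get() is an assumed singleton accessor.
for name, tuner in (("flashinfer", FiAutoTuner.get()), ("trt", TrtAutoTuner.get())):
    if not tuner.profiling_cache:
        raise SystemExit(f"{name} autotune() context did not engage")
```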
leejnau added a commit to leejnau/flashinfer that referenced this pull request on Apr 28, 2026
…ground-truth verification
A single nsys trace at N=8192 with `--nsys-capture-range` (commit 40bb77e)
bracketing only the timed measurement passes resolved both remaining
measurement-related follow-ups.
flashinfer-ai#4 (`bench_gpu_time_with_cupti(use_cuda_graph=True)` 2× inflation):
direct wall-clock comparison at N=16384 / 30 iters shows identical wall-clocks
with and without `--use-cupti` (1m12.5s vs 1m9.5s; the 3 s delta is
autotune-compile + Python-startup variance, well below the ~240 ms of actual
GPU measurement work in 70+ s of total wall-clock). The historical 2× signature
was always a `cupti-python` span-attribution artifact, never real GPU work —
and it does not reproduce under the current methodology. A smaller asymmetric
bias (~13% under-report on `trt_ms` vs ~5% on `fi_ms`) persists, which is the
rationale for keeping `--use-cupti` opt-in (default off).
flashinfer-ai#6 (in-bench vs standalone 19% gap on trt `gemm2_finalize`): nsys
ground truth at N=8192 is 0.737 ms; the current in-bench reports 0.7465 ms
(1.3% delta — within noise); the standalone reports 0.685 ms (7.1% below ground
truth, harness-to-harness rounding tolerance). The original 19% gap was
specific to the older `--use-cupti` config against the standalone — under the
current methodology there is no systematic bias.
Audit changes:
- New "Ground-truth nsys verification (2026-04-28)" section immediately after
the post-fix verification, documenting the run command, per-kernel ground
truth, the resolution of both follow-ups with quantitative tables, and a note
that the trace also serves as a third independent kernel-port faithfulness
check (kernel mangled-name structure matches modulo the encoded module path).
- Follow-up flashinfer-ai#1 marked RESOLVED (the original `MbarrierArray`
framing was wrong; the actual cause was the gemm2-enumeration gap fixed at
d291d17e/f0cf8cd0 on the standalone PR branch).
- Follow-ups flashinfer-ai#4 and flashinfer-ai#6 entries replaced with closure
notes.
- Top-of-file correction section title updated to "2026-04-24/25/28" and the
short summary expanded to mention the verification round.
The original "open mysteries" list (flashinfer-ai#1, flashinfer-ai#4,
flashinfer-ai#6, flashinfer-ai#8) is now fully closed. Items remaining in
*Follow-ups queued* (flashinfer-ai#2, flashinfer-ai#3, flashinfer-ai#5,
flashinfer-ai#7) are all scope expansions, not investigations.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request on Apr 28, 2026
…d-and-skipped
flashinfer-ai#3 is a real coverage gap (at gs=1.0 the scale-conversion code
paths run with degenerate values, so a divergent FP8_MAX or scale-convention
mismatch between the two sides would silently produce identical output here
and divergent output at non-trivial scales). Skipped because:
- Both sides have their own internal scale-plumbing tests with non-trivial
scales. TRT-LLM's tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py and
flashinfer's tests/moe/test_cute_dsl_fused_moe.py both exercise non-trivial-gs
configurations against PyTorch references. The bench wouldn't add coverage
their CI doesn't already have.
- A failure here wouldn't tell us what we want to know. The most likely cause
of fi-vs-trt parity divergence at production scales would be a scale-convention
disagreement BETWEEN the two implementations — not fixable in flashinfer
(TRT-LLM is upstream), and dramatically out of scope.
- Risk of getting nerd-sniped on bench-side mistakes. Scale plumbing has many
surfaces (alpha, weight_scale_2, input_scale, is_sf_* flags, fp4_quantize
signatures); any tiny bench-side mismatch produces a parity failure that looks
like a port bug. That's the same failure pattern as flashinfer-ai#4 /
flashinfer-ai#6 / flashinfer-ai#8 from earlier in the audit.
Scope-difference note: the audit's load-bearing question is "is the kernel port
faithful?" — closed via byte-identical PTX + matching timing + 45 parity cells.
flashinfer-ai#3 is about "do flashinfer and TRT-LLM agree on NVFP4 scaling
conventions?" — a separate question, real but tangential to port faithfulness.
One scenario that would re-elevate this: a planned production ship of
CuteDslMoEWrapper to a caller using non-trivial scales, where the team wants
one independent cross-check (against CuteDslFusedMoE under the same scales)
before merging. In that scenario flashinfer-ai#3 is exactly the right pre-ship
sanity check. For closing out the investigation audit, it's tangential.
Effective remaining open follow-ups: flashinfer-ai#5 (cutlass-dsl 4.4.2 sanity
rerun) and flashinfer-ai#9 (EP=16 tactic-divergence root cause).
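A toy illustration of the masking effect described at the top of the commit message above. This is not the real kernels or scale plumbing; it uses two made-up scale conventions that coincide exactly at gs=1.0 and generally diverge at non-trivial scales.
```python
import torch

def quantize(x: torch.Tensor, step: float = 1 / 16) -> torch.Tensor:
    # Crude stand-in for FP4 rounding: snap values to a fixed grid.
    return torch.round(x / step) * step

def side_a(x: torch.Tensor, gs: float) -> torch.Tensor:
    return quantize(x * gs)   # made-up convention A: scale, then quantize

def side_b(x: torch.Tensor, gs: float) -> torch.Tensor:
    return quantize(x) * gs   # made-up convention B: quantize, then scale

x = torch.randn(8)
print(torch.equal(side_a(x, 1.0), side_b(x, 1.0)))     # True: gs=1.0 hides the mismatch
print(torch.allclose(side_a(x, 0.5), side_b(x, 0.5)))  # generally False at non-trivial scales
```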
leejnau added a commit to leejnau/flashinfer that referenced this pull request on Apr 28, 2026
…flashinfer-ai#9 remains
Three close-out edits to wrap the CuteDSL MoE FP4 port audit:
- Mark flashinfer-ai#5 (cutlass-dsl 4.4.2 sanity rerun) considered-and-skipped,
matching the closure pattern used for flashinfer-ai#2 and flashinfer-ai#3.
Three reasons: (a) port-parity claims are unaffected by the DSL-compiler
version since both sides use the same in-container compiler, (b) flashinfer and
TRT-LLM each already have CI testing 4.4.2, (c) the install hassle plus
unsupported-config risk produces the same ambiguous-failure pattern that cost
time on flashinfer-ai#4/flashinfer-ai#6/flashinfer-ai#8. Auto-resolves whenever
NGC bumps the image. Install recipe preserved for future absolute-latency
probes.
- Add a "Version-skew caveat (2026-04-28)" subsection to the top-of-file
correction. The bench compares flashinfer-with-post-rc5-forward-ports
(bb2f88329, 6b8ae6fa8, fae498579) vs TRT-LLM-rc5.post2-without-them, so the
+3.5% / +4.1% headlines at EP=1 N=16384 may partly reflect the version
asymmetry. Load-bearing claims (port faithfulness via byte-identical source +
PTX, 45/45 parity, the flashinfer-ai#3067 fix) are unaffected because they do
not depend on absolute deltas. Naturally re-baselines when NGC publishes a
1.3.x.x image absorbing the post-rc5 commits.
- Update the "Open follow-ups remaining" summary: flashinfer-ai#5 added to the
considered-and-skipped list alongside flashinfer-ai#2/flashinfer-ai#3, leaving
flashinfer-ai#9 (EP=16 tactic-divergence root cause) as the only effective open
follow-up.
Audit declared closed 2026-04-28.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>