
Rename target tvm_binding to flashinfer_tvm #8

Merged
yzh119 merged 1 commit into main from junrushao-patch-1
Oct 20, 2023

Conversation

@junrushao
Collaborator

@junrushao junrushao commented Oct 20, 2023

No description provided.

junrushao added a commit to junrushao/mlc-llm that referenced this pull request Oct 20, 2023
This commit integrates FlashInfer as an optional dependency of libmlc_llm.
FlashInfer exposes a TVM-native target, flashinfer_tvm, which MLC LLM can link
against directly.

Depends on flashinfer-ai/flashinfer#8. Also, the git submodule will need to be
switched to an https-based URL after FlashInfer becomes public.
junrushao added a commit to junrushao/mlc-llm that referenced this pull request Oct 20, 2023
This commit integrates FlashInfer as an optional dependency of libmlc_llm.
FlashInfer exposes a TVM-native target, flashinfer_tvm, which MLC LLM can link
against directly.

Depends on flashinfer-ai/flashinfer#8. Also, the git submodule will need to be
switched to an https-based URL after FlashInfer becomes public.
Collaborator

@yzh119 yzh119 left a comment

LGTM, thanks!

@yzh119 yzh119 merged commit 2711216 into main Oct 20, 2023
@yzh119 yzh119 deleted the junrushao-patch-1 branch October 20, 2023 14:55
junrushao added a commit to junrushao/mlc-llm that referenced this pull request Oct 20, 2023
This commit integrates FlashInfer as an optional dependency of libmlc_llm.
FlashInfer exposes a TVM-native target, flashinfer_tvm, which MLC LLM can link
against directly.

Depends on flashinfer-ai/flashinfer#8. Also, the git submodule will need to be
switched to an https-based URL after FlashInfer becomes public.
MasterJH5574 pushed a commit to MasterJH5574/mlc-llm that referenced this pull request Oct 31, 2023
This commit integrates FlashInfer as an optional dependency of libmlc_llm.
FlashInfer exposes a TVM-native target, flashinfer_tvm, which MLC LLM can link
against directly.

Depends on flashinfer-ai/flashinfer#8. Also, the git submodule will need to be
switched to an https-based URL after FlashInfer becomes public.
MasterJH5574 pushed a commit to MasterJH5574/mlc-llm that referenced this pull request Oct 31, 2023
This commit integrates FlashInfer as an optional dependency of libmlc_llm.
FlashInfer exposes a TVM-native target, flashinfer_tvm, which MLC LLM can link
against directly.

Depends on flashinfer-ai/flashinfer#8. Also, the git submodule will need to be
switched to an https-based URL after FlashInfer becomes public.
MasterJH5574 pushed a commit to MasterJH5574/mlc-llm that referenced this pull request Oct 31, 2023
This commit integrates FlashInfer as an optional dependency of libmlc_llm.
FlashInfer exposes a TVM-native target, flashinfer_tvm, which MLC LLM can link
against directly.

Depends on flashinfer-ai/flashinfer#8. Also, the git submodule will need to be
switched to an https-based URL after FlashInfer becomes public.
diptorupd referenced this pull request in ROCm/flashinfer Sep 29, 2025
In this PR I remove the `libtorch` dependency and delete `test_page.cpp`.
`test_page.cpp` was the only unit test that used libtorch, and the page module
is already covered by a pytest, which we will use for validation.

Removing the libtorch dependency will speed up Docker builds and drop an extra
dependency.


```
Test project /root/flashinfer/libflashinfer/tests/hip/build
    Start 1: MathTest
1/8 Test #1: MathTest ............................   Passed    0.31 sec
    Start 2: PosEncTest
2/8 Test #2: PosEncTest ..........................   Passed    0.31 sec
    Start 3: CascadeTest
3/8 Test #3: CascadeTest .........................   Passed  1369.12 sec
    Start 4: SingleDecodeTest
4/8 Test #4: SingleDecodeTest ....................   Passed  7726.35 sec
    Start 5: BatchDecodeTest
5/8 Test #5: BatchDecodeTest .....................   Passed  811.61 sec
    Start 6: test_mfma_fp32_16x16x16fp16
6/8 Test #6: test_mfma_fp32_16x16x16fp16 .........   Passed    0.30 sec
    Start 7: test_transpose_4x4_half_registers
7/8 Test #7: test_transpose_4x4_half_registers ...   Passed    0.28 sec
    Start 8: test_rowsum
8/8 Test #8: test_rowsum .........................   Passed    0.27 sec

100% tests passed, 0 tests failed out of 8
```
wangbo981016 pushed a commit to meituan-longcat/flashinfer that referenced this pull request Feb 5, 2026
Co-authored-by: wuguanyu02 <wuguanyu02@meituan.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 24, 2026
… test artifact + separate perf gap

The decisive experiment on 2026-04-24: force flashinfer's autotuner
to ONLY consider tile_size=256 tactics (tuner.py patched to
`for tile_size in [256]`), run the port-parity bench at N=16384.
Result: `parity: max_abs=0.0625 -> PASS`. max_abs=0.0625 is literally
one FP4 step (1/16) — the finest difference FP4 can represent.
Head-to-head comparison (no pure-PyTorch FP4 reference in the loop,
just torch.allclose(fi_out, trt_out, atol=0.1, rtol=0.01)) confirms
flashinfer's tile_size=256 / 2CTA kernel produces output that matches
TRT-LLM's tile_size=256 output within FP4 noise.
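
As a minimal sketch, the head-to-head check amounts to comparing the two
wrappers' outputs directly with the tolerances quoted above (the helper name
and the fi_out / trt_out variables are illustrative, not the bench's actual
code):

```python
import torch

def check_parity(fi_out: torch.Tensor, trt_out: torch.Tensor) -> None:
    # Head-to-head check as described above: no pure-PyTorch FP4 reference,
    # just flashinfer output vs TRT-LLM output on identical inputs.
    max_abs = (fi_out.float() - trt_out.float()).abs().max().item()
    ok = torch.allclose(fi_out.float(), trt_out.float(), atol=0.1, rtol=0.01)
    print(f"parity: max_abs={max_abs:.4f} -> {'PASS' if ok else 'FAIL'}")
```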

Therefore: the tile_size=256 "correctness bug" hunted over 22 days is
not real. test_all_tactics_accuracy's 8/16 failures at tile_size=256
(78.40% / 94.78% within tolerance against compute_reference_moe_fp4)
are a reference-simulator strictness artifact, not a kernel bug. The
stable failure rate across shapes and all 8 failing tactic variants is
the signature of a deterministic numerical offset from the reference,
not corruption — exactly what we would expect if the reference models
the FP4 pipeline differently than the actual kernel does at 2CTA.

This vindicates the port author's 2026-04-14 theory ("flashinfer's
test may be stricter than what TRT-LLM exercises"), which the audit
initially claimed to have disproven via Reproduction B. That
"disproof" was retracted on 2026-04-24 — Reproduction B used
compute_reference_moe_bf16, while test_all_tactics_accuracy uses
compute_reference_moe_fp4; the two references compute different
targets.

SEPARATELY, at tile_size=256 flashinfer is still +46% slower on
gemm1_swiglu than TRT-LLM at the same tactic (2.644 ms vs 1.806 ms
at N=16384). Enabling tile_size=256 does NOT resolve the large-batch
regression — it just changes the tactic being run. The regression's
cause is something other than tactic gating — working hypothesis:
compile-time parameter, launch-grid configuration, or JIT template
instantiation differences between the two wrappers. Tracked as new
follow-up flashinfer-ai#8 (added in a separate commit).

Concrete recommendations for flashinfer (now that correctness is
established):
1. Un-gate tile_size=256 in tuner.py
2. Fix test_all_tactics_accuracy's reference or threshold at 2CTA
3. Investigate the +46% gemm1_swiglu gap at tile_size=256 (follow-up flashinfer-ai#8)

Executive verdict rewritten to reflect this pivot. A prominent
Resolution block added at the top of the "Known divergences" section
(issue flashinfer-ai#3067) — the detailed investigation below it is preserved as
historical record but reframed as closed.
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 24, 2026
…lu gap at tile_size=256

Now that correctness at tile_size=256 is established (previous commit
reclassifies flashinfer-ai#3067 as test artifact), the real open work item is the
perf gap that the audit had mis-attributed to the correctness bug.

2026-04-24 experiment, forced tile_size=256 on flashinfer via
tuner.py patch (for tile_size in [256]):
  fi gemm1_swiglu at N=16384, tile=256: 2.644 ms
  trt gemm1_swiglu at N=16384, tile=256: 1.806 ms
  +46.4 percent at the SAME tactic

At tile=128, flashinfer's gemm1 was 2.711 ms — so enabling tile=256
barely helps flashinfer, while TRT-LLM at the same tile=256 tactic
gets much more throughput. The large-batch +27 percent top-line
regression is NOT resolved by un-gating tile_size=256.

Both sides compile the identical CuteDSL kernel source (deep audit
established semantic identity of kernel bodies). So the SASS or
runtime differs despite identical Python source. Working hypotheses:

 - 8a. Compile-time parameter / constexpr drift between wrappers'
   invocations causes different SASS
 - 8b. Launch-grid / cluster config mis-set on flashinfer (2CTA
   effectively running as 1CTA)
 - 8c. Input buffer alignment / stride differences the kernel
   optimizer exploits
 - 8d. Stream / cooperative-group sync context difference

Suggested first probe: nsys trace at N=16384 for both sides at
tile_size=256, compare launch config + SASS identifier + span
alignment. Should quickly narrow 8a vs 8b vs 8d.

Note on prior follow-up flashinfer-ai#1: this supersedes its framing. "MbarrierArray
shim / tile_gating causes the regression" is now invalidated by the
forced-tile=256 experiment showing un-gating does not recover the
regression. The rest of flashinfer-ai#3067-era candidates (flashinfer-ai#3 fence_proxy, flashinfer-ai#5
orchestration, flashinfer-ai#6 top-level wrappers) were framed around a
correctness bug that doesn't exist and are now subordinate.
Follow-up flashinfer-ai#8 replaces them as the primary perf work item.
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 25, 2026
…ssification — bench autotune-context bug

Discovered 2026-04-24: benchmarks/bench_cute_dsl_port_parity.py imports
AutoTuner and autotune from tensorrt_llm._torch.autotuner only. The
flashinfer side's CuteDslMoEWrapper.run() consults
flashinfer.autotuner.AutoTuner — a different singleton — which never
enters tuning mode in the bench. Result: flashinfer's tuner.choose_one
returned tactic=-1 on cache miss → forward_impl falls through to
DEFAULT_MOE_TACTIC = (128, ((128,128),(1,1),False), ((128,128),(1,1),False))
at tuner.py:454-455. Every "+46% perf gap" / "+27% large-batch
regression" measurement was therefore comparing flashinfer-untuned-
default vs TRT-LLM-tuned-best.
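
A minimal sketch of the failure mode, using the module paths named above (the
warmup call is a stand-in, and the `.get()` singleton accessor and the exact
type of `profiling_cache` are assumptions):

```python
from tensorrt_llm._torch.autotuner import autotune as trt_autotune
from flashinfer.autotuner import AutoTuner as FiAutoTuner

def run_parity_warmup():
    ...  # stand-in for the bench's per-size warmup over both wrappers

# The original bench entered only the TRT-LLM tuning context:
with trt_autotune():
    run_parity_warmup()

# flashinfer's separate singleton never tuned, so its cache stays empty and
# every timed run falls through to DEFAULT_MOE_TACTIC (tactic=-1 on cache miss).
print(len(FiAutoTuner.get().profiling_cache))  # 0 under the buggy bench
```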

Two findings flipped:

1. Follow-up flashinfer-ai#8 RETRACTED. With the bench fixed, gemm1_swiglu PTX is
   byte-identical between sides (md5 401ebca6, zero diff lines after
   symbol normalization) and per-call time matches within 0.1%
   (1.8305 vs 1.8316 ms at N=16384). The kernel port is bit-identical
   to TRT-LLM's at the compiled-code level.

2. Issue flashinfer-ai#3067 reclassification "test simulator artifact" RETRACTED.
   The "head-to-head parity passes at tile_size=256" experiment that
   underpinned the reclassification was made under the same buggy
   bench. With the bench fixed, parity FAILS at tile_size=256
   (max_abs=13.69). The original tuner.py "produces incorrect
   results" comment was correct.

Leading hypothesis for the parity failure (NOT YET VERIFIED):
flashinfer ALL_MOE_TACTICS at tile_size=256 enumerates only 1-CTA
gemm2 tactics (cluster=(1, X)). TRT-LLM picks 2-CTA gemm2 tactics
(cluster=(2, X)). 2-CTA gemm1 output feeding 1-CTA gemm2 plausibly
produces wrong output. Two probes pending in next session.
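
One possible first probe along these lines, assuming the tactic tuples follow
the layout quoted above for DEFAULT_MOE_TACTIC, i.e. (tile_size, gemm1_cfg,
gemm2_cfg) with each cfg being (mma_tile, cluster, flag):

```python
def gemm2_clusters_at_tile(all_moe_tactics, tile_size=256):
    # Collect the distinct gemm2 cluster shapes enumerated at a given tile size.
    # The hypothesis predicts only (1, X) clusters on the flashinfer side,
    # while TRT-LLM's selection includes (2, X).
    return sorted({tactic[2][1] for tactic in all_moe_tactics
                   if tactic[0] == tile_size})

# e.g. run gemm2_clusters_at_tile(ALL_MOE_TACTICS) on each side, then diff.
```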

What stands unaffected: Part-1 static audit (HIGH/MEDIUM/LOW commit
verdicts), kernel source byte-identical-with-rc5 verification, and
the five candidate mechanisms ruled out for flashinfer-ai#3067 (each remains
falsified — divergence must lie elsewhere).

v3 baseline numbers preserved as-measured for reproducibility but
inline interpretation-correction notes mark the wrong "regression"
framing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 25, 2026
…lashinfer-ai#8 class of bug

The previous warmup block in `run_one_size` imported `AutoTuner` and
`autotune` only from `tensorrt_llm._torch.autotuner` and entered just
that one context. flashinfer's `CuteDslMoEWrapper.run()` consults
`flashinfer.autotuner.AutoTuner` — a separate singleton — which never
entered tuning mode, causing flashinfer's `tuner.choose_one()` to
return tactic=-1 on cache miss and `forward_impl` to fall through
to `DEFAULT_MOE_TACTIC` (1-CTA / 128x128 MMA) for every measurement.
Discovered 2026-04-24; produced two weeks of misleading perf data
(retracted in commit a0c0b6c).

Changes:

- `import_flashinfer()` now returns `(AutoTuner, autotune)` from
  `flashinfer.autotuner` alongside the kernel symbols, symmetric to
  `import_trtllm()`. Updated all 4 call sites to unpack 7 values.

- `_assert_distinct_autotuners(fi_AutoTuner, trt_AutoTuner)`:
  startup check that hard-exits if the two AutoTuner singletons are
  the same object. Catches a future framework refactor that unifies
  them — the dual-context warmup below would silently become
  redundant or, worse, regress to single-context if reverted later.
  Idempotent across the per-size loop.

- `_audit_selected_tactics(fi_AT, trt_AT, num_tokens=...)`: post-
  warmup audit that hard-exits if either autotuner's profiling_cache
  is empty (= the corresponding `autotune()` context did not
  engage). Prints the selected tactic on each side. Soft-warns when
  flashinfer picks `DEFAULT_MOE_TACTIC` (legitimate at small N where
  the validity filter rejects every entry, but worth surfacing).

- The warmup block now clears both caches and stacks both
  `autotune()` contexts in a single `with` statement.

- A "Autotuner-context safety net" comment block documents the
  parallel-singleton design and the two failure modes the helpers
  catch (single-context warmup; future singleton unification).

The audit doc already covers the empirical verification — at N=16384
with this fix in place, gemm1_swiglu PTX is byte-identical between
sides (md5 match after symbol normalization, zero diff lines) and
per-call time matches within 0.1%.
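
Putting the pieces above together, a sketch of the corrected per-size warmup
(helper names follow the commit message; singleton access via `.get()` and the
exact cache type are assumptions):

```python
from tensorrt_llm._torch.autotuner import AutoTuner as TrtAutoTuner, autotune as trt_autotune
from flashinfer.autotuner import AutoTuner as FiAutoTuner, autotune as fi_autotune

def _assert_distinct_autotuners(fi_cls, trt_cls):
    # Hard-exit if a future framework refactor unifies the two singletons;
    # the dual-context warmup below would then be silently redundant.
    if fi_cls.get() is trt_cls.get():
        raise SystemExit("AutoTuner singletons unexpectedly unified")

def warmup_one_size(run_fi, run_trt):
    _assert_distinct_autotuners(FiAutoTuner, TrtAutoTuner)
    FiAutoTuner.get().profiling_cache.clear()
    TrtAutoTuner.get().profiling_cache.clear()
    # Stack BOTH tuning contexts so each wrapper's own singleton tunes;
    # entering only one reproduces the tactic=-1 fallthrough.
    with fi_autotune(), trt_autotune():
        run_fi()
        run_trt()
    # Post-warmup audit: an empty cache means that side's context never engaged.
    for name, tuner_cls in (("flashinfer", FiAutoTuner), ("trtllm", TrtAutoTuner)):
        if not tuner_cls.get().profiling_cache:
            raise SystemExit(f"{name} autotune() context did not engage")
```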

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 28, 2026
…ground-truth verification

A single nsys trace at N=8192 with `--nsys-capture-range` (commit
40bb77e) bracketing only the timed measurement passes resolved
both remaining measurement-related follow-ups.

flashinfer-ai#4 (`bench_gpu_time_with_cupti(use_cuda_graph=True)` 2× inflation):
direct wall-clock comparison at N=16384 / 30 iters shows identical
wall-clocks with and without `--use-cupti` (1m12.5s vs 1m9.5s; the
3 s delta is autotune-compile + Python-startup variance, well below
the ~240 ms of actual GPU measurement work in 70+ s of total
wall-clock). The historical 2× signature was always a `cupti-python`
span-attribution artifact, never real GPU work — and it does not
reproduce under current methodology. A smaller asymmetric bias
(~13% under-report on `trt_ms` vs ~5% on `fi_ms`) persists, which
is the rationale for keeping `--use-cupti` opt-in (default off).

flashinfer-ai#6 (in-bench vs standalone 19% gap on trt `gemm2_finalize`): nsys
ground truth at N=8192 = 0.737 ms; current in-bench reports
0.7465 ms (1.3% delta — within noise); standalone reports 0.685 ms
(7.1% below ground truth, harness-to-harness rounding tolerance).
The original 19% gap was specific to the older `--use-cupti` config
against the standalone — under current methodology there is no
systematic bias.

Audit changes:

- New "Ground-truth nsys verification (2026-04-28)" section
  immediately after the post-fix verification, documenting the run
  command, per-kernel ground truth, the resolution of both
  follow-ups with quantitative tables, and a note that the trace
  also serves as a third independent kernel-port faithfulness
  check (kernel mangled-name structure matches modulo encoded
  module path).

- Follow-up flashinfer-ai#1 marked RESOLVED (the original `MbarrierArray`
  framing was wrong; actual cause was the gemm2-enumeration gap
  fixed at d291d17e/f0cf8cd0 on the standalone PR branch).

- Follow-ups flashinfer-ai#4 and flashinfer-ai#6 entries replaced with closure notes.

- Top-of-file correction section title updated to "2026-04-24/25/28"
  and short summary expanded to mention the verification round.

The original "open mysteries" list (flashinfer-ai#1, flashinfer-ai#4, flashinfer-ai#6, flashinfer-ai#8) is now fully
closed. Items remaining in *Follow-ups queued* (flashinfer-ai#2, flashinfer-ai#3, flashinfer-ai#5, flashinfer-ai#7) are
all scope-expansions, not investigations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 28, 2026
…d-and-skipped

flashinfer-ai#3 is a real coverage gap (at gs=1.0 the scale-conversion code paths
run with degenerate values, so a divergent FP8_MAX or scale-convention
mismatch between the two sides would silently produce identical
output here and divergent output at non-trivial scales). Skipped
because:

- Both sides have their own internal scale-plumbing tests with
  non-trivial scales. TRT-LLM's
  tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py and
  flashinfer's tests/moe/test_cute_dsl_fused_moe.py both exercise
  non-trivial-gs configurations against PyTorch references. The
  bench wouldn't add coverage their CI doesn't already have.

- A failure here wouldn't tell us what we want to know. The most
  likely cause of fi-vs-trt parity divergence at production scales
  would be a scale-convention disagreement BETWEEN the two
  implementations — not fixable in flashinfer (TRT-LLM is upstream),
  and dramatically out of scope.

- Risk of getting nerd-sniped on bench-side mistakes. Scale plumbing
  has many surfaces (alpha, weight_scale_2, input_scale, is_sf_*
  flags, fp4_quantize signatures); any tiny bench-side mismatch
  produces a parity failure that looks like a port bug. That's the
  same failure pattern as flashinfer-ai#4 / flashinfer-ai#6 / flashinfer-ai#8 from earlier in the audit.

Scope-difference note: the audit's load-bearing question is "is the
kernel port faithful?" — closed via byte-identical PTX + matching
timing + 45 parity cells. flashinfer-ai#3 is about "do flashinfer and TRT-LLM
agree on NVFP4 scaling conventions?" — a separate question, real
but tangential to port-faithfulness.

One scenario that would re-elevate this: a planned production ship
of CuteDslMoEWrapper to a caller using non-trivial scales, where
the team wants one independent cross-check (against CuteDslFusedMoE
under the same scales) before merging. In that scenario flashinfer-ai#3 is
exactly the right pre-ship sanity. For closing out the investigation
audit, it's tangential.

Effective remaining open follow-ups: flashinfer-ai#5 (cutlass-dsl 4.4.2 sanity
rerun) and flashinfer-ai#9 (EP=16 tactic-divergence root cause).
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 28, 2026
…flashinfer-ai#9 remains

Three close-out edits to wrap the CuteDSL MoE FP4 port audit:

- Mark flashinfer-ai#5 (cutlass-dsl 4.4.2 sanity rerun) considered-and-skipped,
  matching the closure pattern used for flashinfer-ai#2 and flashinfer-ai#3. Three reasons:
  (a) port-parity claims are unaffected by DSL-compiler version since
  both sides use the same in-container compiler, (b) flashinfer and
  TRT-LLM each have CI testing 4.4.2 already, (c) install hassle plus
  unsupported-config risk produces the same ambiguous-failure pattern
  that cost time on flashinfer-ai#4/flashinfer-ai#6/flashinfer-ai#8. Auto-resolves whenever NGC bumps the
  image. Install recipe preserved for future absolute-latency probes.

- Add a "Version-skew caveat (2026-04-28)" subsection to the
  top-of-file correction. The bench compares flashinfer-with-post-rc5-
  forward-ports (bb2f88329, 6b8ae6fa8, fae498579) vs TRT-LLM-rc5.post2-
  without-them, so the +3.5% / +4.1% headlines at EP=1 N=16384 may
  partly reflect the version asymmetry. Load-bearing claims (port
  faithfulness via byte-identical source + PTX, 45/45 parity, flashinfer-ai#3067
  fix) are unaffected because they do not depend on absolute deltas.
  Naturally re-baselines when NGC publishes a 1.3.x.x image absorbing
  the post-rc5 commits.

- Update the "Open follow-ups remaining" summary: flashinfer-ai#5 added to the
  considered-and-skipped list alongside flashinfer-ai#2/flashinfer-ai#3, leaving flashinfer-ai#9 (EP=16
  tactic-divergence root cause) as the only effective open follow-up.
  Audit declared closed 2026-04-28.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
