
[Wrapper] Add the CSR indptr of append lengths to interface#4

Merged: yzh119 merged 1 commit into main from decode-wrapper-csr on Oct 9, 2023

@MasterJH5574 commented on Oct 3, 2023

This PR also unifies the decode wrapper with the (potential) prefill wrapper.
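
As a minimal sketch of what a CSR-style `indptr` over append lengths looks like (the function name and sample lengths are illustrative assumptions, not the wrapper's actual interface): each request contributes some number of newly appended tokens, and the `indptr` array holds the running offsets, so request `i` owns rows `indptr[i]:indptr[i + 1]`.

```python
from itertools import accumulate


def lengths_to_indptr(append_lengths):
    """Turn per-request append lengths into CSR-style offsets.

    Hypothetical helper for illustration only; it is not part of the
    flashinfer API.
    """
    return [0, *accumulate(append_lengths)]


# Prefill-like batch: three requests appending 3, 1 and 4 tokens.
prefill_indptr = lengths_to_indptr([3, 1, 4])

# Decode-like batch: every request appends exactly one token.
decode_indptr = lengths_to_indptr([1, 1, 1])

print(prefill_indptr)  # [0, 3, 4, 8]
print(decode_indptr)   # [0, 1, 2, 3]
```

Under this representation, decode is the special case where every append length is 1, which is presumably what lets one interface serve both wrappers.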

@yzh119 merged commit a0f03eb into main on Oct 9, 2023
@yzh119 deleted the decode-wrapper-csr branch on October 12, 2023 06:10
@yongwww mentioned this pull request on Jul 28, 2025 (5 tasks)
diptorupd referenced this pull request in ROCm/flashinfer Sep 29, 2025
This PR fixes some of the unit test failures that occur in Single
Decode. It also disables clang-format for headers.
Running clang-format on the headers causes compilation issues: the
compiler is unable to find `HIP WARP SYNC INTRINSICS`, causing failures.
Disabling clang-format fixes these issues.

```
    Start 1: MathTest
1/6 Test #1: MathTest .........................   Passed    3.31 sec
    Start 2: PosEncTest
2/6 Test #2: PosEncTest .......................   Passed    3.36 sec
    Start 3: CascadeTest
3/6 Test #3: CascadeTest ......................   Passed    3.35 sec
    Start 4: PageTest
4/6 Test #4: PageTest .........................   Passed  114.08 sec
    Start 5: SingleDecodeTest
5/6 Test #5: SingleDecodeTest .................   Passed   35.22 sec
    Start 6: BatchDecodeTest
6/6 Test #6: BatchDecodeTest ..................   Passed  559.75 sec

100% tests passed, 0 tests failed out of 6

Total Test time (real) = 719.07 sec
```
diptorupd referenced this pull request in ROCm/flashinfer Sep 29, 2025
The C++ test suite was using `hipified` headers. In this PR, we port the unit tests over to use `gpu_iface`. This is necessary because the next step is to move the build infrastructure to `gpu_iface`.

This PR has been tested locally:
```
Test project /root/flashinfer/libflashinfer/tests/hip/build
    Start 1: MathTest
1/6 Test #1: MathTest .........................   Passed    3.40 sec
    Start 2: PosEncTest
2/6 Test #2: PosEncTest .......................   Passed    3.40 sec
    Start 3: CascadeTest
3/6 Test #3: CascadeTest ......................   Passed  985.27 sec
    Start 4: PageTest
4/6 Test #4: PageTest .........................   Passed  112.40 sec
    Start 5: SingleDecodeTest
5/6 Test #5: SingleDecodeTest .................   Passed   35.46 sec
    Start 6: BatchDecodeTest
6/6 Test #6: BatchDecodeTest ..................   Passed  556.81 sec

100% tests passed, 0 tests failed out of 6
```

To replicate the tests:
```
cd flashinfer/libflashinfer/tests/hip
mkdir build && cd build/
cmake -DCMAKE_PREFIX_PATH=/root/libtorch -DCMAKE_CXX_COMPILER:PATH=/opt/rocm/bin/amdclang++ -DFLASHINFER_INCLUDE_DIRS=/root/flashinfer/libflashinfer/include/ ..
make
ctest
```
diptorupd referenced this pull request in ROCm/flashinfer Sep 29, 2025
In this PR I remove the `libtorch` dependency and `test_page.cpp`.
`test_page.cpp` is the only unit test that uses libtorch. However, we
also have a pytest for testing page, and we will use that for
validation.

Removing the libtorch dependency will help us speed up Docker builds and
remove additional dependencies.


```
Test project /root/flashinfer/libflashinfer/tests/hip/build
    Start 1: MathTest
1/8 Test #1: MathTest ............................   Passed    0.31 sec
    Start 2: PosEncTest
2/8 Test #2: PosEncTest ..........................   Passed    0.31 sec
    Start 3: CascadeTest
3/8 Test #3: CascadeTest .........................   Passed  1369.12 sec
    Start 4: SingleDecodeTest
4/8 Test #4: SingleDecodeTest ....................   Passed  7726.35 sec
    Start 5: BatchDecodeTest
5/8 Test #5: BatchDecodeTest .....................   Passed  811.61 sec
    Start 6: test_mfma_fp32_16x16x16fp16
6/8 Test #6: test_mfma_fp32_16x16x16fp16 .........   Passed    0.30 sec
    Start 7: test_transpose_4x4_half_registers
7/8 Test #7: test_transpose_4x4_half_registers ...   Passed    0.28 sec
    Start 8: test_rowsum
8/8 Test #8: test_rowsum .........................   Passed    0.27 sec

100% tests passed, 0 tests failed out of 8
```
bobboli pushed a commit to bobboli/flashinfer that referenced this pull request Mar 4, 2026
feat: add SM120 fmha_v2 flash attention kernel support
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 22, 2026
…stream measurement discrepancy

Two related runbook updates after the v6 run revealed a measurement
issue:

1. cupti-python install step added as step 4 in the Part 2 reproduction
   runbook. The NGC 1.3.0rc5.post2 image does NOT ship cupti-python by
   default - earlier audit claims that it did were wrong (corrected in
   the earlier cupti-attribution commit). Running the bench in a fresh
   container silently falls back to CUDA-event timing without this
   install.

2. Added a known-caveat note about the v6 run showing main-pass trt_ms
   doubling relative to the per-kernel sum when CUPTI is enabled.
   flashinfer's side reconciles within 1%, TRT-LLM's by ~2x - a sharp
   asymmetry that almost certainly indicates a bench measurement
   artifact (likely bench_gpu_time_with_cupti interaction with TRT-LLM
   CuteDslFusedMoE's aux_stream_dict pattern, or nested graph capture).
   Runbook instructs the reader to use --no-cupti as a workaround for
   the main timing pass if the output looks anomalous; per-kernel view
   is unaffected and remains trustworthy.

Adds this as follow-up flashinfer-ai#4 in the queued-investigations list so a future
session can diagnose whether it's a CUPTI-span vs torch.profiler
measurement divergence, a graphs-within-graphs artifact, or a genuine
cross-stream sync overhead that only CUPTI sees.

Also expanded the "5. Run the full parity sweep" expected-output
section to include the reconciliation check and the per-kernel CSV
dump flag, so a fresh-container reproducer has a clear "did it work"
signal beyond just "parity PASS".
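
The reconciliation check mentioned above can be sketched as follows. This is a minimal illustration under assumptions: the function names, the 5% tolerance, and the sample timings are all hypothetical and are not taken from the bench script.

```python
def reconciliation_ratio(main_pass_ms, per_kernel_ms):
    """Ratio of the main-pass time to the sum of per-kernel times."""
    total = sum(per_kernel_ms.values())
    if total <= 0:
        raise ValueError("per-kernel total must be positive")
    return main_pass_ms / total


def reconciles(main_pass_ms, per_kernel_ms, rel_tol=0.05):
    """True when the main pass agrees with the per-kernel sum within rel_tol.

    rel_tol is an assumed threshold chosen for this sketch.
    """
    return abs(reconciliation_ratio(main_pass_ms, per_kernel_ms) - 1.0) <= rel_tol


# Made-up timings (milliseconds), for illustration only.
kernels = {"gemm1": 0.6, "gemm2": 0.4}
print(reconciles(1.005, kernels))  # True: main pass matches the kernel sum
print(reconciles(2.0, kernels))    # False: a ~2x signature would be flagged
```

A ratio near 2 on only one side of the comparison is the kind of asymmetry the commit describes as a likely measurement artifact.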

No code changes in this commit. Documentation only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 22, 2026
…fix wide-table alignment

Four cleanup items bundled, all downstream of the v7 run that confirmed
the CUPTI + CUDA graphs + TRT-LLM aux-stream measurement issue.

1. Drop the bucket-substring view. The parallel-run verification across
   v5/v6/v7 showed phase rollup totals match bucket rollup totals at
   every size, so the bucket view has served its purpose. Removed:
   - KERNEL_BUCKETS constant and BUCKET_ORDER derivation.
   - classify_kernels() function.
   - Bucket fields on PerKernelTimings (moe_sort / gemm1_swiglu /
     output_zero / gemm2_finalize / misc); the replacement total
     property now sums logical_ops + unmapped.
   - Per-size "per-kernel BUCKETED view" print.
   - End-of-run "Per-kernel BUCKETED summary" table.
   - Bucket-totals CSV from write_kernel_csv (kept logical-op CSV as
     authoritative and raw CSV as ground truth).
   Net: -150 lines of parallel-path code.

2. Flip CUPTI default to opt-in. Previously --no-cupti, default ON.
   v7 confirmed that bench_gpu_time_with_cupti(use_cuda_graph=True)
   produces a ~2x inflated trt_ms for TRT-LLM's CuteDslFusedMoE
   aux_stream_dict pattern (not a flashinfer issue — shows up only on
   TRT-LLM side). Now --use-cupti, default OFF. Per-kernel
   torch.profiler pass continues to use CUPTI under the hood
   independent of this flag (it's a separate code path via
   ProfilerActivity.CUDA).

3. Fix wide-table header-vs-data alignment. The end-of-run phase
   rollup table had bucket / phase labels longer than the data values,
   so the columns visually mismatched even though right-edges lined up.
   Replaced with a proper 2-row header:
   - Row 1: phase name centered across its 3 sub-columns.
   - Row 2: fi_ms / trt_ms / Δ% sub-labels, 9-char fields.
   Data uses the same 9-char sub-columns. Header and data now line up
   exactly. Factored the rendering into a helper
   _print_phase_rollup_table() since the logic is non-trivial.

4. Audit report updates:
   - Runbook step 4: CUPTI install is now optional/opt-in rather than
     recommended. Explains why with empirical v6-vs-v7 evidence.
   - Follow-up flashinfer-ai#4: upgraded from "hypothetical" to "confirmed, not
     hypothetical" with the v7 confirmation as the smoking gun.
   - Removed all references to --no-cupti flag (now --use-cupti).
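
The two-row header layout described in item 3 can be sketched as below. The commit's own helper is `_print_phase_rollup_table()`, whose code is not shown here; `render_phase_rollup`, the phase names, the sample values, and the Δ% formula are assumptions made for illustration.

```python
SUB_LABELS = ("fi_ms", "trt_ms", "Δ%")
FIELD = 9  # every sub-column is a 9-character field


def render_phase_rollup(phases, rows):
    """Render a two-row header plus data rows with matching column widths.

    phases: list of phase names.
    rows: list of rows; each row holds one (fi_ms, trt_ms) pair per phase.
    """
    group_width = FIELD * len(SUB_LABELS)
    # Row 1: each phase name centered across its three sub-columns.
    header1 = "".join(name.center(group_width) for name in phases)
    # Row 2: the sub-labels, right-aligned in the same fields as the data.
    header2 = "".join(label.rjust(FIELD) for _ in phases for label in SUB_LABELS)
    lines = [header1, header2]
    for row in rows:
        cells = []
        for fi_ms, trt_ms in row:
            # Assumed definition: relative difference of fi vs trt, in percent.
            delta_pct = (fi_ms - trt_ms) / trt_ms * 100.0
            cells.append(f"{fi_ms:{FIELD}.3f}")
            cells.append(f"{trt_ms:{FIELD}.3f}")
            cells.append(f"{delta_pct:+{FIELD}.1f}")
        lines.append("".join(cells))
    return lines


table = render_phase_rollup(["gemm1", "gemm2"], [[(1.25, 1.0), (0.5, 0.5)]])
print("\n".join(table))
```

Because the header sub-labels and the data cells share the same field width, header and data stay aligned regardless of how long the phase name is, provided each name fits within its 27-character group.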

Non-goals for this commit (kept as separate future work):
- Diagnosing and fixing the CUPTI+graph+multi-stream root cause
  (follow-up flashinfer-ai#4 queued).
- Investigating the tile_size=256 / MbarrierArray suspect that drives
  the 4096+ large-batch regression (original follow-up flashinfer-ai#1 queued).
- Adding a third-reference PyTorch eager comparison (follow-up flashinfer-ai#2).

The per-kernel accuracy story remains intact: logical-op mapping has
reconciled cleanly across v5/v6/v7 with zero unmapped kernels after the
routing-kernel substring fix (136c8b1). Phase rollup tables are the
canonical at-a-glance presentation; per-logical-op CSV and raw per-
kernel CSV preserve the full detail for offline analysis.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 22, 2026
…ss section

Comprehensive pass on cute_dsl_moe_port_audit.md to make it accurate and
less redundant as the session's durable notes. Changes:

Compressions / redundancy removal:

- Executive verdict: reduced from 70 lines to 30. Same structure
  (static audit / runtime parity / runtime perf / flashinfer-ai#3067 open
  divergence) but tighter; specific commit-by-commit representative
  results moved out of the exec summary (they're in the *Post-port
  TRT-LLM commits — status* section below).

- v1 full baseline table deleted (16 rows). The v1 vs v2 regime-level
  comparison table in *Meta-finding* covers what was informative;
  the full v1 numbers were redundant historical state.

- Performance verdict section: tightened from ~50 lines to ~25. Same
  three-regime conclusion (small-batch slightly slower, mid on-par,
  large-batch +12–27% regression). Dropped the explanatory repetition
  of CUDA-graph mechanics (now in Meta-finding only).

- Meta-finding section: rewritten to lead with the CUDA-graphs takeaway
  and drop the obsolete claim that CUPTI was "a smaller additional
  improvement" (v6 showed CUPTI+graphs is actually broken for TRT-LLM's
  multi-stream pattern).

- flashinfer-ai#3067 investigative-steps section: the 65-line moe_sort diff code
  sketch condensed to a 10-line spec with parameter-name gotchas
  preserved. The full sketch was teaching HOW to do the experiment;
  the condensed version is enough to reproduce it.

- flashinfer-ai#3067 investigative-steps preamble: "Reordered 2026-04-22 after the
  deep audit promoted..." dropped. Numbering reflects current priority.

- Candidate root causes preamble: "Updated 2026-04-22..." dropped for
  the same reason.

- Generated-date line: 4-iteration changelog collapsed to a single
  date with pointer to per-section notes.

Correctness fixes:

- v2 baseline provenance rewritten to remove the "CUPTI fallback"
  framing. CUDA-event timing is now the deliberate default (not a
  fallback); the report reflects that, citing v5/v7/v8 rerun
  stability as evidence.

- Follow-up flashinfer-ai#4 tightened to definitive language ("empirically
  confirmed") rather than the hypothetical phrasing from when the
  CUPTI issue was first spotted.

Additions:

- New section *Version-drift robustness (newer NGC containers)*. Lists
  what stays safe automatically when NGC is upgraded (-Bsymbolic is
  defensive regardless; reconciliation check catches kernel-name
  drift; API drift produces loud Python errors) and what requires a
  manual edit (NGC image tag in runbook, KERNEL_MAP substrings if
  reconciliation fires, baseline table rerun). Includes concrete
  substring-update patterns for the cases most likely to drift
  (routing kernels, fp4_quantize, memsets).
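
The substring-based mapping and the way a reconciliation check surfaces kernel-name drift can be sketched as follows. The map contents and kernel names here are invented for illustration; the real `KERNEL_MAP` entries are not shown in this commit message.

```python
# Hypothetical substring map: logical op -> substrings that identify it.
TOY_KERNEL_MAP = {
    "routing": ("routing",),
    "quantize": ("fp4_quantize",),
    "memset": ("memset",),
}


def classify(kernel_names, kernel_map):
    """Assign each kernel name to the first logical op whose substring matches.

    Returns (mapped, unmapped); a non-empty unmapped list is the signal that
    a kernel was renamed and the map needs a manual substring update.
    """
    mapped = {}
    unmapped = []
    for name in kernel_names:
        for op, substrings in kernel_map.items():
            if any(sub in name for sub in substrings):
                mapped.setdefault(op, []).append(name)
                break
        else:
            unmapped.append(name)
    return mapped, unmapped


# Made-up kernel names; the last one simulates a rename after an upgrade.
names = ["toy_routing_kernel", "toy_fp4_quantize_kernel", "toy_renamed_kernel"]
mapped, unmapped = classify(names, TOY_KERNEL_MAP)
print(unmapped)  # ['toy_renamed_kernel']
```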

Net: 1608 → 1465 lines (−143, ~9% reduction). No factual content
removed; duplication consolidated and over-verbose code sketches
compressed while preserving their load-bearing details (the kernel-
substring gotchas, the parameter-name gotchas, the line references
into the code).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 23, 2026
Aligns the main-pass bench_gpu_time calls with bench_moe_deepseek.py
(fix_moe_benchmark branch) so absolute fi_ms numbers are comparable
for cross-verification. Verified 2026-04-23: 10/11 sizes agree within
plus/minus 8 percent vs bench_moe_deepseek.py CuteDSL column, 8/11
within plus/minus 5 percent, no systematic bias.

Changes:
- import fused_topk_deepseek from flashinfer.fused_moe; have both
  fi_run and trt_run call it inside the timed region with the same
  inputs, so the two sides continue to receive byte-identical
  topk_indices / topk_values per iteration (parity-isolation
  property preserved).
- pass input tensors to bench_gpu_time via input_kwargs with
  cold_l2_cache=True so rotation engages. Previously the lambdas
  closed over tensors, leaving bench_gpu_time unable to see them;
  cold_l2_cache was silently disabled via a runtime warning that
  was easy to miss. L2 was effectively warm across iterations.
- parity check retains shared_routing outputs (pre-computed via
  DeepSeekV3MoeRoutingMethod.apply) so any parity failure remains
  attributable to the MoE path, not to routing-function drift
  between fused_topk_deepseek and TRT-LLM's routing method.
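
The closure-versus-`input_kwargs` point above can be illustrated with a toy harness. This is not flashinfer's `bench_gpu_time`; `toy_bench`, its parameters, and its rotation strategy are assumptions that only mimic the idea that a harness can rotate buffers it is handed but cannot see buffers captured by a closure.

```python
import copy


def toy_bench(fn, input_kwargs=None, num_copies=3, iters=6):
    """Toy stand-in for a timing harness (hypothetical, for illustration).

    Returns the number of distinct input buffers the harness rotated through;
    0 means it had nothing visible to rotate.
    """
    if not input_kwargs:
        # Inputs are hidden inside fn's closure: the same buffer is reused
        # on every iteration, so caches stay warm.
        for _ in range(iters):
            fn()
        return 0
    # Inputs are visible: make independent copies and cycle through them.
    rotated = [copy.deepcopy(input_kwargs) for _ in range(num_copies)]
    touched = set()
    for i in range(iters):
        kwargs = rotated[i % num_copies]
        touched.update(id(value) for value in kwargs.values())
        fn(**kwargs)
    return len(touched)


x = [1.0, 2.0, 3.0]
closure_rotations = toy_bench(lambda: sum(x))                # closes over x
kwargs_rotations = toy_bench(lambda x: sum(x), {"x": x})     # x is visible
print(closure_rotations, kwargs_rotations)  # 0 3
```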

CUPTI default intentionally NOT flipped to on: audit follow-up flashinfer-ai#4
(2x trt_ms inflation with CUPTI + use_cuda_graph=True under the
aux_stream_dict pattern) remains unresolved. --use-cupti stays
opt-in; use it when cross-verifying vs bench_moe_deepseek.py
whose default is CUPTI on.
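
An opt-in flag of this kind is typically declared with `store_true`, so the default is off. The sketch below assumes an `argparse`-based CLI; only the `--use-cupti` flag name comes from the commit message, and the parser description and help text are illustrative.

```python
import argparse


def build_parser():
    """Hypothetical parser showing an opt-in --use-cupti flag."""
    parser = argparse.ArgumentParser(description="toy benchmark CLI")
    parser.add_argument(
        "--use-cupti",
        action="store_true",  # absent means False, so CUPTI is opt-in
        help="time the main pass with CUPTI instead of CUDA events",
    )
    return parser


default_args = build_parser().parse_args([])
opt_in_args = build_parser().parse_args(["--use-cupti"])
print(default_args.use_cupti, opt_in_args.use_cupti)  # False True
```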

Per-kernel profile pass (profile_per_kernel) unchanged: its lambdas
still exclude fused_topk_deepseek so the KERNEL_MAP classification
and per-logical-op breakdown remain kernel-isolated and match the
TRT-LLM per-kernel reference scripts in tests/scripts/cute_dsl_kernels/.
Audit per-phase regression conclusions (gemm1 plus 16-45 percent,
gemm2 plus 10-17 percent) are therefore unaffected by this change.
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 23, 2026
…n results

Surfaces the v3 runtime parity dataset (committed methodology, tip
bb2afbf) and the completed cross-verification against
bench_moe_deepseek.py throughout the Part 2 results section, rather
than appending. Key edits:

- Authoritative baseline flag now points at v3. v3 table (15 sizes,
  2026-04-23, cold-L2 rotation + routing-in-graph, CUPTI off)
  replaces v2 as the reference. v1 and v2 demoted to historical
  notes inside the Meta-finding section.
- Performance verdict updated to cite v3 numbers (small-batch 2-9
  percent slower, mid 0-2 percent, large +11 to +28 percent at 4096
  through 16384).
- Per-logical-op breakdown updated with v3 phase-rollup numbers;
  Category C now cites the empirical tile-gating signature observed
  at the N=2048 to N=4096 boundary (TRT-LLM gemm1 grows +3 percent
  while flashinfer grows +60 percent, consistent with TRT switching
  tactic to tile_size=256 / 2CTA while flashinfer stays on 128).
- Meta-finding section now captures two methodology transitions
  (v1 to v2 and v2 to v3) with a three-column comparison table,
  explaining the mechanics of each shift.
- Deliberate choices section extended with two new items: cold-L2
  rotation via input_kwargs, and routing inside the timed graph
  (including how parity-isolation is preserved).
- Cross-verification section reframed from "optional, how to run"
  to "performed, here are the results". External bench 3
  (bench_moe_deepseek.py) subsection now contains the 15-size
  agreement table: plus/minus 8.3 percent at every size, 13/15
  within plus/minus 7.5 percent. External benches 1 and 2 (TRT-LLM
  standalone kernels) remain open follow-ups.
- Follow-up flashinfer-ai#4 updated with the non-reproduction observation under
  v3 methodology: the 2x trt_ms inflation bug did not reproduce at
  1..1024 with CUPTI on, possibly because v3's input_kwargs +
  rotated-buffer pattern dodges the CUPTI span-computation edge
  case that v2's closure-captured lambda hit. Root cause still not
  understood; CUPTI stays opt-in pending explanation.
- Reproduction runbook step 5 expected output updated to v3 range;
  step 4 CUPTI rationale updated to cite both the v2 reproduction
  and the v3 non-reproduction.
- Stale v2 references in version-drift-robustness and external
  bench 1/2 expected-correspondence updated to v3 values
  (gemm1_swiglu_trt at 8192 now ~1.20 ms, gemm2_finalize_trt at
  8192 now ~0.85 ms).

Scope: 10 hunks spanning lines 1078-1648; net +286/-141.
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 24, 2026
…ency check

Ran moe_sort_self_consistency_test.py on 2026-04-24: flashinfer's
moe_sort output satisfies both necessary correctness invariants at
tile_size=256, in the same self-consistent sense as at tile_size=128.

Invariant 1 (round-trip): permuted_idx_to_expanded_idx[
    expanded_idx_to_permuted_idx[t, k]] == t * top_k + k

Invariant 2 (tile-expert consistency): the tile at position
    expanded_idx_to_permuted_idx[t, k] // tile_size has
    tile_idx_to_expert_idx equal to the token's selected expert
    (minus local_expert_offset).

All 2304/2304 valid-entry checks pass across the two test shapes
(128x256x512x256x2 and 256x1024x2048x256x8, matching
TestAllValidTactics parametrize) at both tile_size=128 and
tile_size=256. Flashinfer's moe_sort produces tables that satisfy
both necessary correctness properties at tile_size=256 — candidate
flashinfer-ai#2 is effectively ruled out.
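
The two invariants can be sketched in pure Python against a toy reference sort. Everything here is an assumption made for illustration: `build_toy_tables` is not flashinfer's `moe_sort`, and the `-1` sentinel for padding or invalid entries is a convention chosen for this sketch.

```python
def build_toy_tables(selected, num_experts, tile_size):
    """Toy expert-major sort with each expert padded to whole tiles."""
    top_k = len(selected[0])
    expanded_to_permuted = [[-1] * top_k for _ in selected]
    permuted_to_expanded = []
    tile_to_expert = []
    for expert in range(num_experts):
        start = len(permuted_to_expanded)
        for t, row in enumerate(selected):
            for k, chosen in enumerate(row):
                if chosen == expert:
                    expanded_to_permuted[t][k] = len(permuted_to_expanded)
                    permuted_to_expanded.append(t * top_k + k)
        count = len(permuted_to_expanded) - start
        pad = -count % tile_size  # rows needed to reach a tile boundary
        permuted_to_expanded.extend([-1] * pad)
        tile_to_expert.extend([expert] * ((count + pad) // tile_size))
    return expanded_to_permuted, permuted_to_expanded, tile_to_expert


def count_invariant_failures(selected, e2p, p2e, t2e, tile_size,
                             local_expert_offset=0):
    """Check round-trip and tile-expert consistency for every valid (t, k)."""
    top_k = len(selected[0])
    checked = failed = 0
    for t, row in enumerate(selected):
        for k, chosen in enumerate(row):
            p = e2p[t][k]
            if p < 0:
                continue  # skip invalid entries (assumed sentinel)
            checked += 1
            round_trip_ok = p2e[p] == t * top_k + k
            tile_ok = t2e[p // tile_size] == chosen - local_expert_offset
            if not (round_trip_ok and tile_ok):
                failed += 1
    return checked, failed


selected = [[0, 2], [1, 2], [0, 1]]  # 3 tokens, top_k = 2, 3 experts
e2p, p2e, t2e = build_toy_tables(selected, num_experts=3, tile_size=4)
print(count_invariant_failures(selected, e2p, p2e, t2e, tile_size=4))  # (6, 0)
```

As the commit notes, these are necessary conditions only: tables can satisfy both and still differ from another implementation's ordering.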

The tile_size=256 correctness bug (flashinfer-ai#3067) must therefore live
downstream of moe_sort: moe_permute, GEMM1 gather logic, or GEMM2
finalize. Since the deep audit already established kernel bodies are
semantically identical to TRT-LLM's, the remaining surface is
narrow. Updated executive summary and candidate flashinfer-ai#2 entry to reflect
this. Current working suspicion shifts to candidate flashinfer-ai#4
(convert_sf_to_mma_layout / weight-layout conversion) based on the
"most-elements-right, ~22%-wrong" signature (78.40% within-tolerance
stable across shapes and tactics) which is consistent with a
tile_size-dependent SF-layout mismatch rather than a structural bug.
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 24, 2026
Code-reading review 2026-04-24: `convert_sf_to_mma_layout` is a pure
`.view(...).permute(...)` strided view — it does not move data; the
underlying GPU bytes ARE the input SF bytes. The kernel reads via
`data_ptr()` + stride metadata, getting the same bytes TRT-LLM's
kernel reads. TRT-LLM's `swizzle_sf(unswizzle_sf(sf, ...))` is a
round-trip empirically verified byte-identical to the input SF.
Both paths hand the CuteDSL kernel the same bytes.

Also: the 6D layout (32, 4, m//128, 4, k//4, num_groups) uses M=128
as fundamental sub-tile REGARDLESS of tile_size. The 2CTA variant
at tile_size=256 reads 2 adjacent m_tiles across two CTAs; the SF
byte layout doesn't change. The mechanism originally proposed for
this candidate (tile_size-dependent SF layout mismatch) was based
on a misreading of the layout.

Kept the abandoned sf_layout_diff_test.py attempt as a record —
its .contiguous()-on-strided-view comparison produced a false
88.72 percent divergence report that was a test-harness artifact,
not a real finding. The corrected interpretation supersedes that
test's nominal verdict.
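
The measurement artifact described above can be reproduced in miniature without any tensor library. This sketch models a strided view as (shape, strides) over a flat byte buffer; `materialize` plays the role that `.contiguous()` plays on a permuted view. The shapes and values are toy assumptions, not the real 6D scale-factor layout.

```python
def materialize(storage, shape, strides):
    """Copy a 2-D strided view of `storage` into logical (row-major) order."""
    rows, cols = shape
    row_stride, col_stride = strides
    return bytes(
        storage[i * row_stride + j * col_stride]
        for i in range(rows)
        for j in range(cols)
    )


storage = bytes(range(6))  # underlying bytes, logically a 2x3 row-major array

# A permuted (transposed) view: same storage, different shape and strides.
permuted = materialize(storage, shape=(3, 2), strides=(1, 3))

# Comparing the materialized copy against the raw storage reports a
# "divergence" even though the view never changed a single stored byte.
mismatches = sum(a != b for a, b in zip(permuted, storage))
print(list(permuted), mismatches)  # [0, 3, 1, 4, 2, 5] 4
```

A kernel that reads through the base pointer plus stride metadata sees `storage` itself, so the comparison that matters is on the underlying bytes, not on a reordered copy.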

Working suspicion now moves to moe_permute (JIT-compiled sibling
of moe_sort in moe_utils.py) — consumes moe_sort's now-verified
output, explicitly tile_size-parameterized, and has not been
isolated by any prior probe.

Candidates ruled out so far:
 - kernel bodies (deep audit)
 - flashinfer-ai#1 MbarrierArray shim (2026-04-23 revert experiment)
 - flashinfer-ai#2 moe_sort / routing tables (2026-04-24 self-consistency)
 - flashinfer-ai#4 SF layout conversion (2026-04-24 code reading)

Candidates still open: flashinfer-ai#3 fence_proxy shim (low prior), flashinfer-ai#5
orchestration / buffer sizing, flashinfer-ai#6 top-level wrappers. moe_permute
now promoted to primary suspect (wasn't cleanly separated in the
original flashinfer-ai#2 entry; test script in progress).
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 24, 2026
Ran moe_permute_invariant_test.py on 2026-04-24: for every valid
(t, k) pair with a local expert, verified
permuted_output[expanded_idx_to_permuted_idx[t, k]] element-wise
equals input[t] after moe_permute executes. bf16 input, no SF —
focuses purely on the gather/copy path. All 4608/4608 active-pair
checks pass across both test shapes
((128, hidden=256, top_k=2) and (256, hidden=1024, top_k=8))
at both tile_size=128 and tile_size=256.

Combined with moe_sort's verified self-consistency, this means the
entire routing-table + permute layer is behaving correctly at
tile_size=256. moe_permute is NOT the root cause.
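
The gather/copy invariant tested above can be sketched as follows. `toy_permute` is a hypothetical stand-in for the real `moe_permute`, and the `-1` sentinel for pairs without a local expert is an assumption of this sketch.

```python
def toy_permute(inputs, expanded_to_permuted, num_permuted_rows):
    """Copy each token's row to every permuted slot it was routed to."""
    hidden = len(inputs[0])
    out = [[0.0] * hidden for _ in range(num_permuted_rows)]
    for t, row in enumerate(expanded_to_permuted):
        for p in row:
            if p >= 0:
                out[p] = list(inputs[t])
    return out


def count_copy_mismatches(inputs, permuted, expanded_to_permuted):
    """Verify permuted[expanded_to_permuted[t][k]] == inputs[t] elementwise."""
    checked = mismatched = 0
    for t, row in enumerate(expanded_to_permuted):
        for p in row:
            if p < 0:
                continue  # no local expert for this (t, k) pair
            checked += 1
            if permuted[p] != inputs[t]:
                mismatched += 1
    return checked, mismatched


# Toy data: 3 tokens, hidden = 2, top_k = 2, 12 permuted rows incl. padding.
inputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
expanded_to_permuted = [[0, 8], [4, 9], [1, 5]]
permuted = toy_permute(inputs, expanded_to_permuted, num_permuted_rows=12)
print(count_copy_mismatches(inputs, permuted, expanded_to_permuted))  # (6, 0)
```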

Ruled-out set so far (cumulative, tile_size=256 correctness bug
flashinfer-ai#3067):
 - kernel bodies (deep audit; blackwell/*_fusion.py semantically
   identical to TRT-LLM copies modulo whitespace)
 - flashinfer-ai#1 MbarrierArray shim (2026-04-23 revert experiment)
 - flashinfer-ai#2 moe_sort routing tables (2026-04-24 self-consistency
   invariants, 2304/2304 checks passed)
 - flashinfer-ai#4 convert_sf_to_mma_layout (2026-04-24 code reading; pure
   strided view, byte-identical to input)
 - moe_permute (this test)

Remaining surface is narrow and largely at the Python orchestration
/ kernel invocation level: CuteDslMoEWrapper buffer sizing, tactic
parameter plumbing in the top-level blockscaled_contiguous_*
wrappers, max_num_permuted_tokens derivation in tuner dispatch.
Runtime black-box probing has reached diminishing returns — further
investigation requires source-level diffs of the Python wrapper /
orchestration code or CSRC C++ sources.

Executive-summary paragraph and candidate flashinfer-ai#2 entry updated to
reflect moe_permute ruled out. Test script kept at
/Users/lnau/flashinfer/moe_permute_invariant_test.py (not in repo).
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 24, 2026
The audit body references four investigation scripts by filename
("preserved for reference", "ran ...", "kept for record"). Previously
those scripts existed only as untracked files in Lee's local
checkout, so the audit's references pointed at local-only files and
the investigation was not reproducible from the committed branch.

Moves the four scripts into benchmarks/investigation/ and commits
them, alongside a README that indexes each script's hypothesis,
outcome, and audit cross-reference:

  - moe_sort_diff_test.py (Candidate flashinfer-ai#2 Experiment 1, inconclusive)
  - moe_sort_self_consistency_test.py (Candidate flashinfer-ai#2 Experiment 3,
    PASS 2304/2304 invariants)
  - moe_permute_invariant_test.py (Candidate flashinfer-ai#2 Experiment 4,
    PASS 4608/4608 copies)
  - sf_layout_diff_test.py (Candidate flashinfer-ai#4 attempted test, kept as
    cautionary record of a .contiguous()-on-strided-view
    measurement artifact)

Also updates the five filename references in the audit to use the
new path (e.g. `benchmarks/investigation/moe_sort_diff_test.py`)
so the audit is now self-contained with respect to the investigation
artifacts.

The original one-shot patches used during the investigation
(moe_sort_substitution.patch, tile_256_investigation.patch,
tile_256_enable_only.patch, tile_256_only.patch) are NOT preserved
— they were small enough to regenerate from the audit's prose
descriptions if ever needed, and their whole content was "change
one line in tuner.py" or "monkey-patch one function". The
investigation scripts are the durable methodology artifact.
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 28, 2026
…ground-truth verification

A single nsys trace at N=8192 with `--nsys-capture-range` (commit
40bb77e) bracketing only the timed measurement passes resolved
both remaining measurement-related follow-ups.

flashinfer-ai#4 (`bench_gpu_time_with_cupti(use_cuda_graph=True)` 2× inflation):
direct wall-clock comparison at N=16384 / 30 iters shows identical
wall-clocks with and without `--use-cupti` (1m12.5s vs 1m9.5s; the
3 s delta is autotune-compile + Python-startup variance, well below
the ~240 ms of actual GPU measurement work in 70+ s of total
wall-clock). The historical 2× signature was always a `cupti-python`
span-attribution artifact, never real GPU work — and it does not
reproduce under current methodology. A smaller asymmetric bias
(~13% under-report on `trt_ms` vs ~5% on `fi_ms`) persists, which
is the rationale for keeping `--use-cupti` opt-in (default off).

flashinfer-ai#6 (in-bench vs standalone 19% gap on trt `gemm2_finalize`): nsys
ground truth at N=8192 = 0.737 ms; current in-bench reports
0.7465 ms (1.3% delta — within noise); standalone reports 0.685 ms
(7.1% below ground truth, harness-to-harness rounding tolerance).
The original 19% gap was specific to the older `--use-cupti` config
against the standalone — under current methodology there is no
systematic bias.
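
The percentages quoted for the `gemm2_finalize` comparison follow from a plain relative-difference calculation against the nsys ground truth. The helper name below is illustrative; the three timings are the ones stated in the commit message.

```python
def pct_delta(measured_ms, reference_ms):
    """Signed relative difference of measured vs reference, in percent."""
    return (measured_ms - reference_ms) / reference_ms * 100.0


ground_truth_ms = 0.737   # nsys ground truth at N=8192
in_bench_ms = 0.7465      # current in-bench report
standalone_ms = 0.685     # standalone report

in_bench_delta = pct_delta(in_bench_ms, ground_truth_ms)
standalone_delta = pct_delta(standalone_ms, ground_truth_ms)
print(f"{in_bench_delta:+.1f}% {standalone_delta:+.1f}%")  # +1.3% -7.1%
```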

Audit changes:

- New "Ground-truth nsys verification (2026-04-28)" section
  immediately after the post-fix verification, documenting the run
  command, per-kernel ground truth, the resolution of both
  follow-ups with quantitative tables, and a note that the trace
  also serves as a third independent kernel-port faithfulness
  check (kernel mangled-name structure matches modulo encoded
  module path).

- Follow-up flashinfer-ai#1 marked RESOLVED (the original `MbarrierArray`
  framing was wrong; actual cause was the gemm2-enumeration gap
  fixed at d291d17e/f0cf8cd0 on the standalone PR branch).

- Follow-ups flashinfer-ai#4 and flashinfer-ai#6 entries replaced with closure notes.

- Top-of-file correction section title updated to "2026-04-24/25/28"
  and short summary expanded to mention the verification round.

The original "open mysteries" list (flashinfer-ai#1, flashinfer-ai#4, flashinfer-ai#6, flashinfer-ai#8) is now fully
closed. Items remaining in *Follow-ups queued* (flashinfer-ai#2, flashinfer-ai#3, flashinfer-ai#5, flashinfer-ai#7) are
all scope-expansions, not investigations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 28, 2026
…d-and-skipped

flashinfer-ai#3 is a real coverage gap (at gs=1.0 the scale-conversion code paths
run with degenerate values, so a divergent FP8_MAX or scale-convention
mismatch between the two sides would silently produce identical
output here and divergent output at non-trivial scales). Skipped
because:

- Both sides have their own internal scale-plumbing tests with
  non-trivial scales. TRT-LLM's
  tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py and
  flashinfer's tests/moe/test_cute_dsl_fused_moe.py both exercise
  non-trivial-gs configurations against PyTorch references. The
  bench wouldn't add coverage their CI doesn't already have.

- A failure here wouldn't tell us what we want to know. The most
  likely cause of fi-vs-trt parity divergence at production scales
  would be a scale-convention disagreement BETWEEN the two
  implementations — not fixable in flashinfer (TRT-LLM is upstream),
  and dramatically out of scope.

- Risk of getting nerd-sniped on bench-side mistakes. Scale plumbing
  has many surfaces (alpha, weight_scale_2, input_scale, is_sf_*
  flags, fp4_quantize signatures); any tiny bench-side mismatch
  produces a parity failure that looks like a port bug. That's the
  same failure pattern as flashinfer-ai#4 / flashinfer-ai#6 / flashinfer-ai#8 from earlier in the audit.

Scope-difference note: the audit's load-bearing question is "is the
kernel port faithful?" — closed via byte-identical PTX + matching
timing + 45 parity cells. flashinfer-ai#3 is about "do flashinfer and TRT-LLM
agree on NVFP4 scaling conventions?" — a separate question, real
but tangential to port-faithfulness.

One scenario that would re-elevate this: a planned production ship
of CuteDslMoEWrapper to a caller using non-trivial scales, where
the team wants one independent cross-check (against CuteDslFusedMoE
under the same scales) before merging. In that scenario flashinfer-ai#3 is
exactly the right pre-ship sanity. For closing out the investigation
audit, it's tangential.

Effective remaining open follow-ups: flashinfer-ai#5 (cutlass-dsl 4.4.2 sanity
rerun) and flashinfer-ai#9 (EP=16 tactic-divergence root cause).
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 28, 2026
…flashinfer-ai#9 remains

Three close-out edits to wrap the CuteDSL MoE FP4 port audit:

- Mark flashinfer-ai#5 (cutlass-dsl 4.4.2 sanity rerun) considered-and-skipped,
  matching the closure pattern used for flashinfer-ai#2 and flashinfer-ai#3. Three reasons:
  (a) port-parity claims are unaffected by DSL-compiler version since
  both sides use the same in-container compiler, (b) flashinfer and
  TRT-LLM each have CI testing 4.4.2 already, (c) install hassle plus
  unsupported-config risk produces the same ambiguous-failure pattern
  that cost time on flashinfer-ai#4/flashinfer-ai#6/flashinfer-ai#8. Auto-resolves whenever NGC bumps the
  image. Install recipe preserved for future absolute-latency probes.

- Add a "Version-skew caveat (2026-04-28)" subsection to the
  top-of-file correction. The bench compares flashinfer-with-post-rc5-
  forward-ports (bb2f88329, 6b8ae6fa8, fae498579) vs TRT-LLM-rc5.post2-
  without-them, so the +3.5% / +4.1% headlines at EP=1 N=16384 may
  partly reflect the version asymmetry. Load-bearing claims (port
  faithfulness via byte-identical source + PTX, 45/45 parity, flashinfer-ai#3067
  fix) are unaffected because they do not depend on absolute deltas.
  Naturally re-baselines when NGC publishes a 1.3.x.x image absorbing
  the post-rc5 commits.

- Update the "Open follow-ups remaining" summary: flashinfer-ai#5 added to the
  considered-and-skipped list alongside flashinfer-ai#2/flashinfer-ai#3, leaving flashinfer-ai#9 (EP=16
  tactic-divergence root cause) as the only effective open follow-up.
  Audit declared closed 2026-04-28.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>