
Add tmp buffer and rotary mode to BatchDecode wrapper #2

Merged
yzh119 merged 1 commit into main from batch-decode-tmp-rotary
Sep 13, 2023
Conversation

@MasterJH5574
Collaborator

No description provided.


@yzh119 left a comment

LGTM

@yzh119 merged commit 3d1f5b3 into main Sep 13, 2023
@MasterJH5574 deleted the batch-decode-tmp-rotary branch September 18, 2023 13:43
diptorupd referenced this pull request in ROCm/flashinfer Sep 29, 2025
This PR fixes some of the unit test failures that occur in Single
Decode. It also disables clang formatting of headers: formatting the
headers causes compilation issues in which the compiler is unable to
find `HIP WARP SYNC INTRINSICS`, leading to failures. Disabling
clang format for headers fixes these issues.

```
    Start 1: MathTest
1/6 Test #1: MathTest .........................   Passed    3.31 sec
    Start 2: PosEncTest
2/6 Test #2: PosEncTest .......................   Passed    3.36 sec
    Start 3: CascadeTest
3/6 Test #3: CascadeTest ......................   Passed    3.35 sec
    Start 4: PageTest
4/6 Test #4: PageTest .........................   Passed  114.08 sec
    Start 5: SingleDecodeTest
5/6 Test #5: SingleDecodeTest .................   Passed   35.22 sec
    Start 6: BatchDecodeTest
6/6 Test #6: BatchDecodeTest ..................   Passed  559.75 sec

100% tests passed, 0 tests failed out of 6

Total Test time (real) = 719.07 sec
```
diptorupd referenced this pull request in ROCm/flashinfer Sep 29, 2025
The C++ test suite was using `hipified` headers. In this PR, we port the unit tests over to use `gpu_iface`. This is necessary because the next step is to move the build infrastructure to `gpu_iface`.

This PR has been tested locally:
```
Test project /root/flashinfer/libflashinfer/tests/hip/build
    Start 1: MathTest
1/6 Test #1: MathTest .........................   Passed    3.40 sec
    Start 2: PosEncTest
2/6 Test #2: PosEncTest .......................   Passed    3.40 sec
    Start 3: CascadeTest
3/6 Test #3: CascadeTest ......................   Passed  985.27 sec
    Start 4: PageTest
4/6 Test #4: PageTest .........................   Passed  112.40 sec
    Start 5: SingleDecodeTest
5/6 Test #5: SingleDecodeTest .................   Passed   35.46 sec
    Start 6: BatchDecodeTest
6/6 Test #6: BatchDecodeTest ..................   Passed  556.81 sec

100% tests passed, 0 tests failed out of 6
```

To replicate the tests:
```
cd flashinfer/libflashinfer/tests/hip
mkdir build && cd build/
cmake -DCMAKE_PREFIX_PATH=/root/libtorch -DCMAKE_CXX_COMPILER:PATH=/opt/rocm/bin/amdclang++ -DFLASHINFER_INCLUDE_DIRS=/root/flashinfer/libflashinfer/include/ ..
make
ctest
```
diptorupd referenced this pull request in ROCm/flashinfer Sep 29, 2025
In this PR I remove the `libtorch` dependency and delete
`test_page.cpp`, the only unit test that uses libtorch. We also have
a pytest for testing page, and we will use that for validation.

Removing the libtorch dependency will help us speed up Docker builds
and drop an extra dependency.


```
Test project /root/flashinfer/libflashinfer/tests/hip/build
    Start 1: MathTest
1/8 Test #1: MathTest ............................   Passed    0.31 sec
    Start 2: PosEncTest
2/8 Test #2: PosEncTest ..........................   Passed    0.31 sec
    Start 3: CascadeTest
3/8 Test #3: CascadeTest .........................   Passed  1369.12 sec
    Start 4: SingleDecodeTest
4/8 Test #4: SingleDecodeTest ....................   Passed  7726.35 sec
    Start 5: BatchDecodeTest
5/8 Test #5: BatchDecodeTest .....................   Passed  811.61 sec
    Start 6: test_mfma_fp32_16x16x16fp16
6/8 Test #6: test_mfma_fp32_16x16x16fp16 .........   Passed    0.30 sec
    Start 7: test_transpose_4x4_half_registers
7/8 Test #7: test_transpose_4x4_half_registers ...   Passed    0.28 sec
    Start 8: test_rowsum
8/8 Test #8: test_rowsum .........................   Passed    0.27 sec

100% tests passed, 0 tests failed out of 8
```
bobboli pushed a commit to bobboli/flashinfer that referenced this pull request Feb 17, 2026
yzh119 pushed a commit that referenced this pull request Feb 25, 2026
<!-- .github/pull_request_template.md -->

## 📌 Description

To fix the following bug:
When the CuteDSL MoE kernels were ported from TensorRT-LLM to
FlashInfer, the `mPtrPermutedIdxToExpandedIdx` field was accidentally
dropped from the routing kernel's `DataBase` struct in `RoutingKernel.h`.
TRT-LLM's routing kernel produces three index-mapping outputs:

1. `mPtrExpandedIdxToPermutedIdx[expandedIdx] = permutedIdx` — the forward
mapping
2. `mPtrPermutedIdxToExpandedIdx[permutedIdx] = expandedIdx` — reverse to
the expanded index (`token_idx * topk + k`)
3. `mPtrPermutedIdxToTokenIdx[permutedIdx] = tokenIdx` — reverse to the
token index only

FlashInfer's port kept only #1 and #3, dropping #2. The binding in
`moe_utils_binding.cu` then had to wire the Python buffer
`permuted_idx_to_expanded_idx` to the only available reverse-mapping field,
`mPtrPermutedIdxToTokenIdx`, which writes a plain `tokenIdx` instead of an
`expandedIdx`.

**The impact:** the CuteDSL kernels (GEMM1 gather, `moe_output_memset`,
GEMM2 finalize) all expect expanded indices and derive the token index via
`expanded_idx // topk`. When they received a plain `tokenIdx` instead, they
computed `tokenIdx // topk`, yielding the wrong A row for the gather, the
wrong zero-init for the memset, and the wrong scatter position plus the
wrong routing scale for the finalize.
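The dropped mapping is easiest to see on a toy example. The sketch below follows the field semantics described above but uses made-up routing values and plain Python lists; nothing here comes from the actual kernel:

```python
# Toy illustration of the index maps (illustrative values, not kernel output).
num_tokens, top_k = 4, 2
num_expanded = num_tokens * top_k  # one expanded slot per (token, k) pair

# Pretend the routing sort permuted the expanded slots like this:
expanded_to_permuted = [3, 0, 5, 1, 7, 2, 6, 4]  # mapping #1 (forward)

# Mapping #2 (the field the port dropped): permuted -> expanded index.
permuted_to_expanded = [0] * num_expanded
for e, p in enumerate(expanded_to_permuted):
    permuted_to_expanded[p] = e

# Mapping #3: permuted -> token index only (what the binding fell back to).
permuted_to_token = [e // top_k for e in permuted_to_expanded]

# Downstream kernels recover the token via expanded_idx // top_k.
correct = [permuted_to_expanded[p] // top_k for p in range(num_expanded)]
# With the buggy wiring they receive tokenIdx and divide again:
buggy = [permuted_to_token[p] // top_k for p in range(num_expanded)]
assert correct != buggy  # the two wirings disagree, so rows get misaddressed
```

Even on this tiny example the two wirings diverge, which is exactly how the gather, memset, and finalize ended up addressing the wrong rows.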

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [ ] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [ ] I have installed the hooks with `pre-commit install`.
- [ ] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Refactor**
* Refined MOE (Mixture of Experts) routing infrastructure by extending
index mapping capabilities across multiple kernel implementations to
improve internal data flow consistency.

* **Tests**
* Strengthened accuracy validation thresholds from 0.925 to 0.97 with
adjusted error tolerance parameters, ensuring more rigorous testing of
MOE operations under FP4 quantization conditions.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
danisereb pushed a commit to danisereb/flashinfer that referenced this pull request Mar 25, 2026
…play-dim0-check

Relax routing_replay_out dim0 validation for CUDA graph compatibility
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 22, 2026
…pect

Walks every non-whitespace hunk across all 4 direct_port kernel files
(utils.py, custom_pipeline.py, blockscaled_contiguous_grouped_gemm_
finalize_fusion.py, blockscaled_contiguous_gather_grouped_gemm_swiglu_
fusion.py) between TRT-LLM v1.3.0rc5.post2 (the user's NGC container
version) and flashinfer's port. 159 hunks classified total. No
functional kernel-logic divergence found; all substantive differences
fall into a small number of categorized buckets (cosmetic line-wrapping,
pdl env-var adaptation, cutlass version shims, lint suppressions).

Primary finding — a highly plausible alternative suspect for flashinfer-ai#3067,
narrower than the prior Candidate Root Causes list:

The MbarrierArray shim in custom_pipeline.py (3 occurrences). Flashinfer
replaces PipelineAsync._make_sync_object(..., TCGen05Mma) with direct
MbarrierArray(...) construction, unconditionally. Comment claims
_make_sync_object "does not handle TCGen05Mma in cutlass >= 4.4.0", but
TRT-LLM (pinned to cutlass 4.3.4) keeps using _make_sync_object and our
test_nvfp4_gather_grouped_gemm_swiglu_blackwell[tile_size=256] runs
prove it works at that cutlass version. If the two paths produce
mbarriers with even subtly different transaction-count / arrive-count
semantics, the 2CTA synchronization protocol (more elaborate than 1CTA)
would be affected exactly the way the flashinfer-ai#3067 reproduction shows: pass
at tile_size=128 (1CTA), fail at tile_size=256 (2CTA) at a consistent
fraction of output rows (94.78% within tolerance).

Secondary finding:

The cute.arch.fence_proxy enum -> string conversion (~20 occurrences
across both GEMM kernels). Benign iff cutlass accepts both forms
interchangeably; worth a spot-check.

What the deep audit ruled out:

- Kernel bodies in both GEMM1 (gather+SwiGLU) and GEMM2 (finalize)
  are semantically identical between TRT-LLM and flashinfer. All
  2CTA-specific code paths (use_2cta_instrs, sync_transform_warp_id,
  SharedStorage1cta/2cta, cta_group, overlapping_accum, etc.) match.
- The cutlass-version shims for monkey-patches and nvvm.fmin are
  correctly version-gated and degenerate to TRT-LLM's exact behavior
  on cutlass 4.3.4 (the user's runtime).
- PDL env-var adaptation uses default-True matching TRT-LLM's default.
- moe_utils.py (the former #1 suspect) drops to #2 in the refined
  list; the MbarrierArray shim is now #1.

Artifacts added:

- benchmarks/cute_dsl_moe_port_deep_audit_log.md (344 lines): full
  per-hunk classification log with bucket vocabulary, suspicion
  analysis for MbarrierArray, and reproduction instructions.

Artifacts updated:

- benchmarks/cute_dsl_moe_port_audit.md: header-date line notes the
  deep audit; Executive verdict updated with the new primary suspect;
  Candidate Root Causes list reordered (MbarrierArray shim -> #1,
  moe_utils.py -> #2, fence_proxy enum->string -> #3) with per-item
  "how to check" instructions; "What the deep audit ruled out"
  subsection added; Suggested next investigative steps reordered with
  a concrete MbarrierArray bypass experiment as the first step.

No runtime work in this commit; purely static source review. The
MbarrierArray hypothesis is now actionable: a ~3-line source edit in
custom_pipeline.py + a pytest rerun should confirm or rule it out.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 22, 2026
…fix wide-table alignment

Four cleanup items bundled, all downstream of the v7 run that confirmed
the CUPTI + CUDA graphs + TRT-LLM aux-stream measurement issue.

1. Drop the bucket-substring view. The parallel-run verification across
   v5/v6/v7 showed phase rollup totals match bucket rollup totals at
   every size, so the bucket view has served its purpose. Removed:
   - KERNEL_BUCKETS constant and BUCKET_ORDER derivation.
   - classify_kernels() function.
   - Bucket fields on PerKernelTimings (moe_sort / gemm1_swiglu /
     output_zero / gemm2_finalize / misc) — replaced total property
     now sums logical_ops + unmapped.
   - Per-size "per-kernel BUCKETED view" print.
   - End-of-run "Per-kernel BUCKETED summary" table.
   - Bucket-totals CSV from write_kernel_csv (kept logical-op CSV as
     authoritative and raw CSV as ground truth).
   Net: -150 lines of parallel-path code.

2. Flip CUPTI default to opt-in. Previously --no-cupti, default ON.
   v7 confirmed that bench_gpu_time_with_cupti(use_cuda_graph=True)
   produces a ~2x inflated trt_ms for TRT-LLM's CuteDslFusedMoE
   aux_stream_dict pattern (not a flashinfer issue — shows up only on
   TRT-LLM side). Now --use-cupti, default OFF. Per-kernel
   torch.profiler pass continues to use CUPTI under the hood
   independent of this flag (it's a separate code path via
   ProfilerActivity.CUDA).

3. Fix wide-table header-vs-data alignment. The end-of-run phase
   rollup table had bucket / phase labels longer than the data values,
   so the columns visually mismatched even though right-edges lined up.
   Replaced with a proper 2-row header:
   - Row 1: phase name centered across its 3 sub-columns.
   - Row 2: fi_ms / trt_ms / Δ% sub-labels, 9-char fields.
   Data uses the same 9-char sub-columns. Header and data now line up
   exactly. Factored the rendering into a helper
   _print_phase_rollup_table() since the logic is non-trivial.

4. Audit report updates:
   - Runbook step 4: CUPTI install is now optional/opt-in rather than
     recommended. Explains why with empirical v6-vs-v7 evidence.
   - Follow-up #4: upgraded from "hypothetical" to "confirmed, not
     hypothetical" with the v7 confirmation as the smoking gun.
   - Removed all references to --no-cupti flag (now --use-cupti).

Non-goals for this commit (kept as separate future work):
- Diagnosing and fixing the CUPTI+graph+multi-stream root cause
  (follow-up #4 queued).
- Investigating the tile_size=256 / MbarrierArray suspect that drives
  the 4096+ large-batch regression (original follow-up #1 queued).
- Adding a third-reference PyTorch eager comparison (follow-up #2).

The per-kernel accuracy story remains intact: logical-op mapping has
reconciled cleanly across v5/v6/v7 with zero unmapped kernels after the
routing-kernel substring fix (136c8b1). Phase rollup tables are the
canonical at-a-glance presentation; per-logical-op CSV and raw per-
kernel CSV preserve the full detail for offline analysis.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
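The two-row header layout described in item 3 can be sketched as follows. The function and column names below are illustrative placeholders under the commit's description (9-char sub-columns, phase name centered across its three sub-columns), not flashinfer's actual implementation:

```python
# Hypothetical sketch of a 2-row phase-rollup header with aligned data rows.
def print_phase_rollup_table(phases, rows):
    sub = ["fi_ms", "trt_ms", "d%"]  # three sub-columns per phase
    w = 9                            # fixed 9-char fields

    # Row 1: each phase name centered across its 3 sub-columns (27 chars).
    header1 = "".join(name.center(w * len(sub)) for name in phases)
    # Row 2: the sub-labels repeated under every phase, right-justified.
    header2 = "".join(s.rjust(w) for s in sub) * len(phases)
    print(header1)
    print(header2)

    # Data rows use the same 9-char sub-columns, so columns line up exactly.
    for row in rows:
        print("".join(f"{v:>{w}.3f}" for v in row))

print_phase_rollup_table(
    ["moe_sort", "gemm1_swiglu"],
    [(0.101, 0.099, 2.000, 0.512, 0.498, 2.800)],
)
```

The key design point is that header and data share the same fixed field width, so right edges and column starts coincide instead of merely the right edges lining up.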
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 23, 2026
Empirical falsification run on 2026-04-23 per the audit's candidate
#1 investigation recipe:

- Reverted the 3 MbarrierArray() calls in
  flashinfer/fused_moe/cute_dsl/blackwell/custom_pipeline.py back to
  PipelineAsync._make_sync_object(...) matching TRT-LLM's code
  exactly.
- Re-enabled tile_size=256 in tuner.py:169.
- Cleared the JIT cache and ran
  TestAllValidTactics::test_all_tactics_accuracy at both parametrized
  shapes ((128,256,512,256,2) and (256,1024,2048,256,8)).

Result: identical 8/16 failure pattern at tile_size=256, 78.40
percent within-tolerance rate stable across both problem shapes and
all 8 failing tactic variants. Every failure confined to
cluster_shape=(2,1) / 2CTA tactic variants.

Conclusion: _make_sync_object and MbarrierArray paths are
behaviorally equivalent at cutlass 4.3.4 for this workload.
Neither is the 2CTA correctness bug.

Useful signal for the next candidate: the stability of the 78.40
percent within-tolerance rate across both shapes and all failing
tactic variants indicates systematic tile_size-dependent corruption
(a specific subset of tokens mis-routed / mis-addressed), not random
numerical error. That profile matches candidate #2 (moe_sort /
permute helpers in moe_utils.py) much better than candidate #1 —
moe_sort computes per-tile padding and group indices as a function
of tile_size and could produce correct-at-128-but-wrong-at-256
permutation tables that the (now-confirmed-clean) GEMM kernels then
consume.

Executive-summary primary suspect updated accordingly; candidate #1
section annotated with the falsification evidence and the extracted
signal pointing at candidate #2.
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 23, 2026
…sults

Two experiments on 2026-04-23 probed whether flashinfer's
JIT-compiled moe_sort (audit candidate #2) is the tile_size=256
correctness bug (flashinfer-ai#3067):

Experiment 1 — element-wise diff of flashinfer vs TRT-LLM moe_sort
outputs on identical inputs at tile_tokens_dim=128 and =256, at two
problem shapes matching TestAllValidTactics' parametrize. Result:
total_num_padded_tokens and num_non_exiting_tiles always agreed;
tile_idx_to_expert_idx and tile_idx_to_mn_limit agreed at the active-
tile prefix (tail diffs confined to inactive tiles that both sides
left uninitialized via torch::empty); expanded_idx_to_permuted_idx
and permuted_idx_to_expanded_idx diverged heavily (up to 83.20
percent) at both tile sizes equally — which is in the regime
expected from "different valid orderings within expert groups"
rather than a tile_size-specific bug. Inconclusive.

Experiment 2 — substitute TRT-LLM's moe_sort into flashinfer's
downstream via an FLASHINFER_MOE_SORT_USE_TRTLLM env-var toggle in
moe_utils.py, re-enable tile_size=256, re-run
TestAllValidTactics::test_all_tactics_accuracy. Result: CUDA
cudaErrorIllegalAddress on the FIRST tactic executed, which is a
tile_size=128 tactic — the known-good configuration for flashinfer's
own moe_sort. Substitution crashes at both tile sizes, not just 256.

What we learned: flashinfer's moe_sort and flashinfer's downstream
kernels are coupled by convention, not just interface. TRT-LLM's
moe_sort output is in a different (also-correct-for-its-own-
downstream) format that flashinfer's GEMMs don't accept. The
naive substitution probe is exhausted as a diagnostic.

What we did NOT rule out: candidate #2 itself. The tile_size=256
bug could still live in flashinfer's moe_sort (tile-size-dependent
indexing bug that produces downstream-consistent tables at tile=128
but inconsistent ones at tile=256) or in flashinfer's downstream's
tile_size=256 handling. Remaining probes: self-consistency check of
flashinfer's moe_sort output (round-trip and tile-expert invariants),
and source-level diff of the CSRC routing kernels vs TRT-LLM's
internal copies.

Executive-summary paragraph also updated to reflect candidate #2's
current status (runtime substitution exhausted, neither confirmed
nor ruled out).
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 24, 2026
…ency check

Ran moe_sort_self_consistency_test.py on 2026-04-24: flashinfer's
moe_sort output satisfies both necessary correctness invariants at
tile_size=256, in the same self-consistent sense as at tile_size=128.

Invariant 1 (round-trip): permuted_idx_to_expanded_idx[
    expanded_idx_to_permuted_idx[t, k]] == t * top_k + k

Invariant 2 (tile-expert consistency): the tile at position
    expanded_idx_to_permuted_idx[t, k] // tile_size has
    tile_idx_to_expert_idx equal to the token's selected expert
    (minus local_expert_offset).

All 2304/2304 valid-entry checks pass across the two test shapes
(128x256x512x256x2 and 256x1024x2048x256x8, matching
TestAllValidTactics parametrize) at both tile_size=128 and
tile_size=256. Flashinfer's moe_sort produces tables that satisfy
both necessary correctness properties at tile_size=256 — candidate
#2 is effectively ruled out.

The tile_size=256 correctness bug (flashinfer-ai#3067) must therefore live
downstream of moe_sort: moe_permute, GEMM1 gather logic, or GEMM2
finalize. Since the deep audit already established kernel bodies are
semantically identical to TRT-LLM's, the remaining surface is
narrow. Updated executive summary and candidate #2 entry to reflect
this. Current working suspicion shifts to candidate #4
(convert_sf_to_mma_layout / weight-layout conversion) based on the
"most-elements-right, ~22%-wrong" signature (78.40% within-tolerance
stable across shapes and tactics) which is consistent with a
tile_size-dependent SF-layout mismatch rather than a structural bug.
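The two invariants above translate directly into assertions. The sketch below runs them on toy routing tables (the tables and sizes are made up for illustration; only the invariant formulas come from this commit):

```python
# Toy check of the two moe_sort self-consistency invariants.
top_k, tile_size = 2, 4

# Illustrative routing tables for 2 tokens x top_k=2 (a valid assignment,
# not real kernel output): (token, k) -> permuted slot.
expanded_idx_to_permuted_idx = {
    (0, 0): 0, (0, 1): 4, (1, 0): 1, (1, 1): 5,
}
permuted_idx_to_expanded_idx = {0: 0, 4: 1, 1: 2, 5: 3}
token_selected_expert = {(0, 0): 0, (0, 1): 1, (1, 0): 0, (1, 1): 1}
tile_idx_to_expert_idx = {0: 0, 1: 1}  # tile 0 holds expert 0, tile 1 expert 1

for (t, k), p in expanded_idx_to_permuted_idx.items():
    # Invariant 1 (round-trip): back to the expanded index t * top_k + k.
    assert permuted_idx_to_expanded_idx[p] == t * top_k + k
    # Invariant 2 (tile-expert consistency): the tile containing p is
    # assigned the token's selected expert.
    assert tile_idx_to_expert_idx[p // tile_size] == token_selected_expert[(t, k)]
```

These are necessary (not sufficient) correctness conditions, which is why the commit phrases the conclusion as "effectively ruled out" rather than proven correct.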
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 24, 2026
Code-reading review 2026-04-24: `convert_sf_to_mma_layout` is a pure
`.view(...).permute(...)` strided view — it does not move data; the
underlying GPU bytes ARE the input SF bytes. The kernel reads via
`data_ptr()` + stride metadata, getting the same bytes TRT-LLM's
kernel reads. TRT-LLM's `swizzle_sf(unswizzle_sf(sf, ...))` is a
round-trip empirically verified byte-identical to the input SF.
Both paths hand the CuteDSL kernel the same bytes.
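The "strided view does not move data" argument can be illustrated with a NumPy analogue of `.view(...).permute(...)`. NumPy stands in for torch here; this is an analogy for the mechanism, not the actual code path:

```python
import numpy as np

# Reshape + transpose only rewrite stride metadata; the bytes are shared.
sf = np.arange(32, dtype=np.uint8)
viewed = sf.reshape(4, 8).transpose(1, 0)  # analogous to .view(...).permute(...)

# Same underlying buffer: no copy was made.
assert np.shares_memory(sf, viewed)

# Mutating the source is visible through the view, proving byte identity.
sf[0] = 255
assert viewed[0, 0] == 255
```

A consumer that reads via the base pointer plus stride metadata therefore sees exactly the original bytes, which is the crux of the code-reading verdict above.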

Also: the 6D layout (32, 4, m//128, 4, k//4, num_groups) uses M=128
as fundamental sub-tile REGARDLESS of tile_size. The 2CTA variant
at tile_size=256 reads 2 adjacent m_tiles across two CTAs; the SF
byte layout doesn't change. The mechanism originally proposed for
this candidate (tile_size-dependent SF layout mismatch) was based
on a misreading of the layout.

Kept the abandoned sf_layout_diff_test.py attempt as a record —
its .contiguous()-on-strided-view comparison produced a false
88.72 percent divergence report that was a test-harness artifact,
not a real finding. The corrected interpretation supersedes that
test's nominal verdict.

Working suspicion now moves to moe_permute (JIT-compiled sibling
of moe_sort in moe_utils.py) — consumes moe_sort's now-verified
output, explicitly tile_size-parameterized, and has not been
isolated by any prior probe.

Candidates ruled out so far:
 - kernel bodies (deep audit)
 - #1 MbarrierArray shim (2026-04-23 revert experiment)
 - #2 moe_sort / routing tables (2026-04-24 self-consistency)
 - #4 SF layout conversion (2026-04-24 code reading)

Candidates still open: #3 fence_proxy shim (low prior), #5
orchestration / buffer sizing, #6 top-level wrappers. moe_permute
is now promoted to primary suspect (it wasn't cleanly separated in the
original #2 entry; test script in progress).
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 24, 2026
Ran moe_permute_invariant_test.py on 2026-04-24: for every valid
(t, k) pair with a local expert, verified
permuted_output[expanded_idx_to_permuted_idx[t, k]] element-wise
equals input[t] after moe_permute executes. bf16 input, no SF —
focuses purely on the gather/copy path. All 4608/4608 active-pair
checks pass across both test shapes
((128, hidden=256, top_k=2) and (256, hidden=1024, top_k=8))
at both tile_size=128 and tile_size=256.

Combined with moe_sort's verified self-consistency, this means the
entire routing-table + permute layer is behaving correctly at
tile_size=256. moe_permute is NOT the root cause.
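The verified invariant can be sketched with a toy reference `moe_permute` in NumPy. Shapes and the routing table are illustrative; only the invariant itself, permuted_output[expanded_idx_to_permuted_idx[t, k]] == input[t], comes from this commit:

```python
import numpy as np

# Toy sizes (illustrative, far smaller than the test shapes in the commit).
num_tokens, top_k, hidden = 3, 2, 4
x = np.arange(num_tokens * hidden, dtype=np.float32).reshape(num_tokens, hidden)

# Illustrative routing table: (token, k) -> permuted slot.
expanded_idx_to_permuted_idx = np.array([[0, 3], [1, 4], [2, 5]])

# A reference moe_permute: copy each token's row to its permuted slots.
permuted = np.empty((num_tokens * top_k, hidden), dtype=np.float32)
for t in range(num_tokens):
    for k in range(top_k):
        permuted[expanded_idx_to_permuted_idx[t, k]] = x[t]

# The invariant: every active (t, k) pair round-trips element-wise.
for t in range(num_tokens):
    for k in range(top_k):
        assert (permuted[expanded_idx_to_permuted_idx[t, k]] == x[t]).all()
```

The real test ran this check against the JIT-compiled kernel's output rather than a Python reference, but the pass criterion is the same element-wise equality.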

Ruled-out set so far (cumulative, tile_size=256 correctness bug
flashinfer-ai#3067):
 - kernel bodies (deep audit; blackwell/*_fusion.py semantically
   identical to TRT-LLM copies modulo whitespace)
 - #1 MbarrierArray shim (2026-04-23 revert experiment)
 - #2 moe_sort routing tables (2026-04-24 self-consistency
   invariants, 2304/2304 checks passed)
 - #4 convert_sf_to_mma_layout (2026-04-24 code reading; pure
   strided view, byte-identical to input)
 - moe_permute (this test)

Remaining surface is narrow and largely at the Python orchestration
/ kernel invocation level: CuteDslMoEWrapper buffer sizing, tactic
parameter plumbing in the top-level blockscaled_contiguous_*
wrappers, max_num_permuted_tokens derivation in tuner dispatch.
Runtime black-box probing has reached diminishing returns — further
investigation requires source-level diffs of the Python wrapper /
orchestration code or CSRC C++ sources.

Executive-summary paragraph and candidate #2 entry updated to
reflect moe_permute ruled out. Test script kept at
/Users/lnau/flashinfer/moe_permute_invariant_test.py (not in repo).
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 24, 2026
The audit body references four investigation scripts by filename
("preserved for reference", "ran ...", "kept for record"). Previously
those scripts existed only as untracked files in Lee's local
checkout, so the audit's references pointed at local-only files and
the investigation was not reproducible from the committed branch.

Moves the four scripts into benchmarks/investigation/ and commits
them, alongside a README that indexes each script's hypothesis,
outcome, and audit cross-reference:

  - moe_sort_diff_test.py (Candidate #2, Experiment 1; inconclusive)
  - moe_sort_self_consistency_test.py (Candidate #2, Experiment 3;
    PASS, 2304/2304 invariants)
  - moe_permute_invariant_test.py (Candidate #2, Experiment 4;
    PASS, 4608/4608 copies)
  - sf_layout_diff_test.py (Candidate #4, attempted test; kept as
    cautionary record of a .contiguous()-on-strided-view
    measurement artifact)

Also updates the five filename references in the audit to use the
new path (e.g. `benchmarks/investigation/moe_sort_diff_test.py`)
so the audit is now self-contained with respect to the investigation
artifacts.

The original one-shot patches used during the investigation
(moe_sort_substitution.patch, tile_256_investigation.patch,
tile_256_enable_only.patch, tile_256_only.patch) are NOT preserved
— they were small enough to regenerate from the audit's prose
descriptions if ever needed, and their whole content was "change
one line in tuner.py" or "monkey-patch one function". The
investigation scripts are the durable methodology artifact.
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 28, 2026
…ground-truth verification

A single nsys trace at N=8192 with `--nsys-capture-range` (commit
40bb77e) bracketing only the timed measurement passes resolved
both remaining measurement-related follow-ups.

#4 (`bench_gpu_time_with_cupti(use_cuda_graph=True)` 2× inflation):
direct wall-clock comparison at N=16384 / 30 iters shows identical
wall-clocks with and without `--use-cupti` (1m12.5s vs 1m9.5s; the
3 s delta is autotune-compile + Python-startup variance, well below
the ~240 ms of actual GPU measurement work in 70+ s of total
wall-clock). The historical 2× signature was always a `cupti-python`
span-attribution artifact, never real GPU work — and it does not
reproduce under current methodology. A smaller asymmetric bias
(~13% under-report on `trt_ms` vs ~5% on `fi_ms`) persists, which
is the rationale for keeping `--use-cupti` opt-in (default off).

#6 (in-bench vs standalone 19% gap on trt `gemm2_finalize`): nsys
ground truth at N=8192 = 0.737 ms; current in-bench reports
0.7465 ms (1.3% delta — within noise); standalone reports 0.685 ms
(7.1% below ground truth, harness-to-harness rounding tolerance).
The original 19% gap was specific to the older `--use-cupti` config
against the standalone — under current methodology there is no
systematic bias.

Audit changes:

- New "Ground-truth nsys verification (2026-04-28)" section
  immediately after the post-fix verification, documenting the run
  command, per-kernel ground truth, the resolution of both
  follow-ups with quantitative tables, and a note that the trace
  also serves as a third independent kernel-port faithfulness
  check (kernel mangled-name structure matches modulo encoded
  module path).

- Follow-up #1 marked RESOLVED (the original `MbarrierArray`
  framing was wrong; actual cause was the gemm2-enumeration gap
  fixed at d291d17e/f0cf8cd0 on the standalone PR branch).

- Follow-ups #4 and #6 entries replaced with closure notes.

- Top-of-file correction section title updated to "2026-04-24/25/28"
  and short summary expanded to mention the verification round.

The original "open mysteries" list (#1, #4, #6, #8) is now fully
closed. Items remaining in *Follow-ups queued* (#2, #3, #5, #7) are
all scope-expansions, not investigations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 28, 2026
…red-and-skipped

After 2026-04-24 / 25 / 28 we have five independent fi-vs-trt
agreement proofs:
  1. source byte-identical with rc5.post2 (deep audit)
  2. compiled PTX byte-identical at tile_size=256 (md5 401ebca6...)
  3. per-call timing match within 0.1% at apples-to-apples tactic
  4. 45 (size, EP) parity cells all pass within 0.5 FP4 step
  5. nsys ground-truth verification with kernel mangled-name
     structure agreement

The probability of fi+trt being wrong-but-agreeing across all five
is essentially zero, so a PyTorch FP4 third-reference check no
longer provides meaningful incremental confidence.

Additionally, `compute_reference_moe_fp4` has known limitations:
its PyTorch-eager FP4 simulation is stricter than the kernel's
actual FP4 representation, which made it ambiguous to interpret
during the original flashinfer-ai#3067 framing. Disagreement between bench
output and the reference would not unambiguously indicate a kernel
bug.

Cost is also non-trivial: Python-eager per-token / per-expert
loops would require running at a small problem-size subset to
keep wall-clock bearable.

Cost-to-incremental-confidence ratio is bad enough that this
follow-up is consciously skipped, not deferred. Future evidence
of a fi-vs-trt agreement that's actually wrong would re-elevate
it; otherwise no value.

Effective remaining open follow-ups: #3 (production-convention
scaling), #5 (cutlass-dsl 4.4.2 sanity rerun), #9 (EP=16 tactic-
divergence root cause).
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 28, 2026
…#9 remains

Three close-out edits to wrap the CuteDSL MoE FP4 port audit:

- Mark #5 (cutlass-dsl 4.4.2 sanity rerun) considered-and-skipped,
  matching the closure pattern used for #2 and #3. Three reasons:
  (a) port-parity claims are unaffected by DSL-compiler version since
  both sides use the same in-container compiler, (b) flashinfer and
  TRT-LLM each have CI testing 4.4.2 already, (c) install hassle plus
  unsupported-config risk produces the same ambiguous-failure pattern
  that cost time on #4/#6/#8. Auto-resolves whenever NGC bumps the
  image. Install recipe preserved for future absolute-latency probes.

- Add a "Version-skew caveat (2026-04-28)" subsection to the
  top-of-file correction. The bench compares flashinfer-with-post-rc5-
  forward-ports (bb2f88329, 6b8ae6fa8, fae498579) vs TRT-LLM-rc5.post2-
  without-them, so the +3.5% / +4.1% headlines at EP=1 N=16384 may
  partly reflect the version asymmetry. Load-bearing claims (port
  faithfulness via byte-identical source + PTX, 45/45 parity, flashinfer-ai#3067
  fix) are unaffected because they do not depend on absolute deltas.
  Naturally re-baselines when NGC publishes a 1.3.x.x image absorbing
  the post-rc5 commits.

- Update the "Open follow-ups remaining" summary: #5 added to the
  considered-and-skipped list alongside #2/#3, leaving #9 (EP=16
  tactic-divergence root cause) as the only effective open follow-up.
  Audit declared closed 2026-04-28.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>