
Update wrapper, decrease pages ndim by 1 #6

Merged
MasterJH5574 merged 1 commit into main from pages-ndim-decrease
Oct 12, 2023

Conversation

@MasterJH5574
Collaborator

No description provided.

@MasterJH5574 MasterJH5574 merged commit 7c00474 into main Oct 12, 2023
@yzh119 yzh119 deleted the pages-ndim-decrease branch October 13, 2023 06:32
diptorupd referenced this pull request in ROCm/flashinfer Sep 29, 2025
This PR fixes some of the unit test failures that occur in Single
Decode. It also disables clang formatting of headers.
Clang-formatting the headers causes compilation issues: the compiler is
unable to find `HIP WARP SYNC INTRINSICS`, which causes failures. Disabling
clang format fixes these issues.

```
    Start 1: MathTest
1/6 Test #1: MathTest .........................   Passed    3.31 sec
    Start 2: PosEncTest
2/6 Test #2: PosEncTest .......................   Passed    3.36 sec
    Start 3: CascadeTest
3/6 Test #3: CascadeTest ......................   Passed    3.35 sec
    Start 4: PageTest
4/6 Test #4: PageTest .........................   Passed  114.08 sec
    Start 5: SingleDecodeTest
5/6 Test #5: SingleDecodeTest .................   Passed   35.22 sec
    Start 6: BatchDecodeTest
6/6 Test #6: BatchDecodeTest ..................   Passed  559.75 sec

100% tests passed, 0 tests failed out of 6

Total Test time (real) = 719.07 sec
```
diptorupd referenced this pull request in ROCm/flashinfer Sep 29, 2025
In this PR, we add infra for enabling decode via flashinfer gpu_iface.
This PR does not change existing infrastructure and we can still build
decode using AOT and JIT.

Tested locally 
```
    Start 5: SingleDecodeTest
5/6 Test #5: SingleDecodeTest .................   Passed   35.12 sec
    Start 6: BatchDecodeTest
6/6 Test #6: BatchDecodeTest ..................   Passed  541.87 sec
```

We will have a follow-up PR for enabling AOT decode using flashinfer
gpu_iface.
diptorupd referenced this pull request in ROCm/flashinfer Sep 29, 2025
The C++ test suite was using `hipified` headers. In this PR, we port the unit tests over to use `gpu_iface`. This is necessary because the next step is to move the build infrastructure to use `gpu_iface`.

This PR has been tested locally 
```
Test project /root/flashinfer/libflashinfer/tests/hip/build
    Start 1: MathTest
1/6 Test #1: MathTest .........................   Passed    3.40 sec
    Start 2: PosEncTest
2/6 Test #2: PosEncTest .......................   Passed    3.40 sec
    Start 3: CascadeTest
3/6 Test #3: CascadeTest ......................   Passed  985.27 sec
    Start 4: PageTest
4/6 Test #4: PageTest .........................   Passed  112.40 sec
    Start 5: SingleDecodeTest
5/6 Test #5: SingleDecodeTest .................   Passed   35.46 sec
    Start 6: BatchDecodeTest
6/6 Test #6: BatchDecodeTest ..................   Passed  556.81 sec

100% tests passed, 0 tests failed out of 6
```

To replicate the tests:
```
cd flashinfer/libflashinfer/tests/hip
mkdir build && cd build/
cmake -DCMAKE_PREFIX_PATH=/root/libtorch -DCMAKE_CXX_COMPILER:PATH=/opt/rocm/bin/amdclang++ -DFLASHINFER_INCLUDE_DIRS=/root/flashinfer/libflashinfer/include/ ..
make
ctest
```
diptorupd referenced this pull request in ROCm/flashinfer Sep 29, 2025
In this PR I remove the `libtorch` dependency and `test_page.cpp`.
`test_page.cpp` is the only unit test that uses libtorch. However,
we also have a pytest for testing page, and we will use that for
validation.

Removing the libtorch dependency will help us speed up Docker builds
and remove additional dependencies.


```
Test project /root/flashinfer/libflashinfer/tests/hip/build
    Start 1: MathTest
1/8 Test #1: MathTest ............................   Passed    0.31 sec
    Start 2: PosEncTest
2/8 Test #2: PosEncTest ..........................   Passed    0.31 sec
    Start 3: CascadeTest
3/8 Test #3: CascadeTest .........................   Passed  1369.12 sec
    Start 4: SingleDecodeTest
4/8 Test #4: SingleDecodeTest ....................   Passed  7726.35 sec
    Start 5: BatchDecodeTest
5/8 Test #5: BatchDecodeTest .....................   Passed  811.61 sec
    Start 6: test_mfma_fp32_16x16x16fp16
6/8 Test #6: test_mfma_fp32_16x16x16fp16 .........   Passed    0.30 sec
    Start 7: test_transpose_4x4_half_registers
7/8 Test #7: test_transpose_4x4_half_registers ...   Passed    0.28 sec
    Start 8: test_rowsum
8/8 Test #8: test_rowsum .........................   Passed    0.27 sec

100% tests passed, 0 tests failed out of 8
```
wangbo981016 pushed a commit to meituan-longcat/flashinfer that referenced this pull request Feb 5, 2026
 (flashinfer-ai#6)

Co-authored-by: yangxurui <yangxurui@meituan.com>
Co-authored-by: lifengcun <lifengcun@meituan.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 23, 2026
…lashinfer-ai#6

Completes the external cross-verification pass against TRT-LLM's
tests/scripts/cute_dsl_kernels/ standalone per-kernel benches.

Bench 1 (gemm1+swiglu): PASS. Standalone 1283.95 us vs our v3
gemm1_swiglu_trt @ N=8192 = 1.202 ms → +6.8 percent, within plus/minus
10 percent pass band. Confirmed via raw per-kernel CSV that our
measurement corresponds to exactly one kernel launch per forward.
Independently validates our gemm1_swiglu_trt absolute number.

Bench 2 (gemm2+finalize): 19 percent gap, unexplained. Standalone
684.78 us (684 us without cold-L2) vs our v3 gemm2_finalize_trt @
N=8192 = 0.848 ms → −19.2 percent. Outside pass band. Investigation
ruled out two hypotheses with evidence:
- KERNEL_MAP over-match: ruled out (raw CSV shows exactly one kernel
  launch matching gemm2_finalize_trt at 0.847876 ms).
- L2-state difference: ruled out (standalone without --use_cold_l2
  gave 683.93 us vs 684.78 us with it, 0.1 percent noise-level
  difference; standalone is not L2-state-sensitive for this kernel).
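
The pass-band check applied to both benches can be sketched as below. This is only an illustration of the arithmetic, not the audit's actual script: the helper names are hypothetical, and the sign convention (standalone relative to the in-bench v3 number) is an assumption inferred from the quoted +6.8 and −19.2 percent figures.

```python
def relative_delta_percent(reference_ms: float, measured_ms: float) -> float:
    """Signed difference of measured_ms relative to reference_ms, in percent."""
    return (measured_ms - reference_ms) / reference_ms * 100.0


def within_pass_band(delta_percent: float, band_percent: float = 10.0) -> bool:
    """True when the delta falls inside the plus/minus band."""
    return abs(delta_percent) <= band_percent


# (kernel, in-bench v3 time in ms, standalone time in ms), as quoted above.
benches = [
    ("gemm1_swiglu_trt", 1.202, 1.28395),
    ("gemm2_finalize_trt", 0.848, 0.68478),
]

results = {}
for name, in_bench_ms, standalone_ms in benches:
    delta = relative_delta_percent(in_bench_ms, standalone_ms)
    passed = within_pass_band(delta)
    results[name] = (delta, passed)
    verdict = "PASS" if passed else "outside pass band"
    print(f"{name}: {delta:+.1f} percent -> {verdict}")
```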

Remaining hypotheses untested, not individually resolvable from
existing data: aux-stream execution context (B, direction-consistent
given gemm1/main stream bench 1 cross-verifies cleanly while
gemm2/aux stream bench 2 does not), CUDA-graph vs bare launch (D),
torch.profiler vs direct-CUPTI integration (E). Added as follow-up
flashinfer-ai#6 with nsys-trace next step.

Critically, this gap does NOT invalidate any audit conclusion. The
port-parity direction (fi vs trt) is invariant under this measurement
gap because the v3 bench measures fi and trt through the same
torch.profiler pipeline; systematic inflation affects both sides
equally (or at least correlatedly), leaving fi-vs-trt Δ percent
intact. The gap only affects absolute per-kernel numbers quoted for
gemm2_finalize_trt: readers should treat our ~0.85 ms as overstating
the ~0.68 ms standalone ground truth (the standalone figure is ~19
percent lower), pending follow-up flashinfer-ai#6.
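
The invariance argument can be checked with a small worked example. All numbers below are hypothetical and chosen only for illustration. The sketch assumes the measurement overhead acts as a multiplicative factor: the fi-vs-trt delta is preserved exactly only when both sides are inflated by the same factor, and it shifts when the factors differ.

```python
def fi_vs_trt_delta_percent(fi_ms: float, trt_ms: float) -> float:
    """Relative difference of fi against trt, in percent."""
    return (fi_ms - trt_ms) / trt_ms * 100.0


# Hypothetical "true" kernel times, not taken from the audit.
true_fi_ms, true_trt_ms = 1.05, 1.00

# Same multiplicative inflation on both sides (hypothetical factor).
common_inflation = 1.24
true_delta = fi_vs_trt_delta_percent(true_fi_ms, true_trt_ms)
inflated_delta = fi_vs_trt_delta_percent(
    true_fi_ms * common_inflation, true_trt_ms * common_inflation
)

# Different inflation per side: the delta is no longer preserved.
skewed_delta = fi_vs_trt_delta_percent(true_fi_ms * 1.24, true_trt_ms * 1.20)

print(f"true delta:     {true_delta:+.2f} percent")
print(f"equal inflation: {inflated_delta:+.2f} percent")
print(f"skewed inflation: {skewed_delta:+.2f} percent")
```

The skewed case is why the "or at least correlatedly" qualifier matters: the conclusion holds to the extent that the bias on both sides is similar.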
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 24, 2026
Code-reading review 2026-04-24: `convert_sf_to_mma_layout` is a pure
`.view(...).permute(...)` strided view — it does not move data; the
underlying GPU bytes ARE the input SF bytes. The kernel reads via
`data_ptr()` + stride metadata, getting the same bytes TRT-LLM's
kernel reads. TRT-LLM's `swizzle_sf(unswizzle_sf(sf, ...))` is a
round-trip empirically verified byte-identical to the input SF.
Both paths hand the CuteDSL kernel the same bytes.
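
Why a view-plus-permute cannot move data can be illustrated with a pure-Python model of stride metadata. This is a toy sketch, not PyTorch and not `convert_sf_to_mma_layout` itself: the 3D shape, the permutation order, and the helper names are all hypothetical stand-ins.

```python
from itertools import product


def contiguous_strides(shape):
    """Row-major strides (in elements) for a flat buffer viewed as `shape`."""
    strides, step = [], 1
    for extent in reversed(shape):
        strides.append(step)
        step *= extent
    return tuple(reversed(strides))


def permute(shape, strides, dims):
    """Reorder shape/stride metadata only; the buffer is never touched."""
    return tuple(shape[d] for d in dims), tuple(strides[d] for d in dims)


def offset(index, strides):
    """Flat buffer position addressed by a logical index."""
    return sum(i * s for i, s in zip(index, strides))


buffer = list(range(24))             # stand-in for the raw SF bytes
shape = (2, 3, 4)                    # toy "view" of the flat buffer
strides = contiguous_strides(shape)
p_shape, p_strides = permute(shape, strides, (2, 0, 1))

# Every logical element of the permuted view resolves to a slot of the
# original buffer; nothing was copied or reordered in memory.
visited = sorted(
    buffer[offset(idx, p_strides)]
    for idx in product(*(range(n) for n in p_shape))
)
print(p_shape, p_strides, visited == buffer)
```

A consumer that reads through the base pointer plus these strides therefore sees exactly the input bytes, which is the point made above.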

Also: the 6D layout (32, 4, m//128, 4, k//4, num_groups) uses M=128
as fundamental sub-tile REGARDLESS of tile_size. The 2CTA variant
at tile_size=256 reads 2 adjacent m_tiles across two CTAs; the SF
byte layout doesn't change. The mechanism originally proposed for
this candidate (tile_size-dependent SF layout mismatch) was based
on a misreading of the layout.

Kept the abandoned sf_layout_diff_test.py attempt as a record —
its .contiguous()-on-strided-view comparison produced a false
88.72 percent divergence report that was a test-harness artifact,
not a real finding. The corrected interpretation supersedes that
test's nominal verdict.
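
The general failure mode can be reproduced in miniature. The sketch below is an assumption about how such an artifact arises, not a reconstruction of sf_layout_diff_test.py, and it does not attempt to reproduce the 88.72 percent figure. It materializes a transposed view in its logical order (as a `.contiguous()` call would) and compares it position by position against the raw buffer.

```python
rows, cols = 4, 6
raw = list(range(rows * cols))  # the bytes both kernels actually read


def materialize_transposed(buf, n_rows, n_cols):
    """Copy a transposed view out in logical order, like .contiguous() would."""
    return [buf[r * n_cols + c] for c in range(n_cols) for r in range(n_rows)]


copied = materialize_transposed(raw, rows, cols)

# Positional comparison reports heavy "divergence"...
mismatches = sum(a != b for a, b in zip(raw, copied))
divergence_percent = 100.0 * mismatches / len(raw)

# ...even though both hold exactly the same values, just in a different order.
same_contents = sorted(copied) == sorted(raw)

print(f"{divergence_percent:.2f} percent positional mismatch, "
      f"same contents: {same_contents}")
```

The mismatch here comes entirely from the copy reordering elements, which is consistent with treating the original report as a harness artifact.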

Working suspicion now moves to moe_permute (JIT-compiled sibling
of moe_sort in moe_utils.py) — consumes moe_sort's now-verified
output, explicitly tile_size-parameterized, and has not been
isolated by any prior probe.

Candidates ruled out so far:
 - kernel bodies (deep audit)
 - flashinfer-ai#1 MbarrierArray shim (2026-04-23 revert experiment)
 - flashinfer-ai#2 moe_sort / routing tables (2026-04-24 self-consistency)
 - flashinfer-ai#4 SF layout conversion (2026-04-24 code reading)

Candidates still open: flashinfer-ai#3 fence_proxy shim (low prior), flashinfer-ai#5
orchestration / buffer sizing, flashinfer-ai#6 top-level wrappers. moe_permute
now promoted to primary suspect (wasn't cleanly separated in the
original flashinfer-ai#2 entry; test script in progress).
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 24, 2026
…lu gap at tile_size=256

Now that correctness at tile_size=256 is established (previous commit
reclassifies flashinfer-ai#3067 as test artifact), the real open work item is the
perf gap that the audit had mis-attributed to the correctness bug.

2026-04-24 experiment, forced tile_size=256 on flashinfer via
tuner.py patch (for tile_size in [256]):
  fi gemm1_swiglu at N=16384, tile=256: 2.644 ms
  trt gemm1_swiglu at N=16384, tile=256: 1.806 ms
  +46.4 percent at the SAME tactic

At tile=128, flashinfer's gemm1 was 2.711 ms — so enabling tile=256
barely helps flashinfer, while TRT-LLM at the same tile=256 tactic
gets much more throughput. The large-batch +27 percent top-line
regression is NOT resolved by un-gating tile_size=256.
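
The two comparisons in this paragraph reduce to the arithmetic below. The helper name is hypothetical, and the sketch assumes the tile=128 figure was taken at the same N=16384 as the tile=256 runs.

```python
def percent_change(baseline_ms: float, value_ms: float) -> float:
    """Signed change of value_ms relative to baseline_ms, in percent."""
    return (value_ms - baseline_ms) / baseline_ms * 100.0


fi_tile128_ms = 2.711   # flashinfer gemm1 at tile=128 (N assumed to be 16384)
fi_tile256_ms = 2.644   # flashinfer gemm1_swiglu, N=16384, tile=256
trt_tile256_ms = 1.806  # TRT-LLM gemm1_swiglu, N=16384, tile=256

# Gap between the two sides at the same tactic.
fi_vs_trt_gap = percent_change(trt_tile256_ms, fi_tile256_ms)

# What un-gating tile=256 buys flashinfer on its own.
fi_tile_gain = percent_change(fi_tile128_ms, fi_tile256_ms)

print(f"fi vs trt at tile=256: {fi_vs_trt_gap:+.1f} percent")
print(f"fi tile=128 -> tile=256: {fi_tile_gain:+.1f} percent")
```

The second number is the basis for "barely helps": the tile change moves flashinfer by only a few percent, while the same-tactic gap to TRT-LLM stays large.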

Both sides compile the identical CuteDSL kernel source (deep audit
established semantic identity of kernel bodies). So the SASS or
runtime differs despite identical Python source. Working hypotheses:

 - 8a. Compile-time parameter / constexpr drift between wrappers'
   invocations causes different SASS
 - 8b. Launch-grid / cluster config mis-set on flashinfer (2CTA
   effectively running as 1CTA)
 - 8c. Input buffer alignment / stride differences the kernel
   optimizer exploits
 - 8d. Stream / cooperative-group sync context difference

Suggested first probe: nsys trace at N=16384 for both sides at
tile_size=256, compare launch config + SASS identifier + span
alignment. Should quickly narrow 8a vs 8b vs 8d.

Note on prior follow-up flashinfer-ai#1: this supersedes its framing. "MbarrierArray
shim / tile_gating causes the regression" is now invalidated by the
forced-tile=256 experiment showing un-gating does not recover the
regression. The rest of flashinfer-ai#3067-era candidates (flashinfer-ai#3 fence_proxy, flashinfer-ai#5
orchestration, flashinfer-ai#6 top-level wrappers) were framed around a
correctness bug that doesn't exist and are now subordinate.
Follow-up flashinfer-ai#8 replaces them as the primary perf work item.
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 28, 2026
…ground-truth verification

A single nsys trace at N=8192 with `--nsys-capture-range` (commit
40bb77e) bracketing only the timed measurement passes resolved
both remaining measurement-related follow-ups.

flashinfer-ai#4 (`bench_gpu_time_with_cupti(use_cuda_graph=True)` 2× inflation):
direct wall-clock comparison at N=16384 / 30 iters shows identical
wall-clocks with and without `--use-cupti` (1m12.5s vs 1m9.5s; the
3 s delta is autotune-compile + Python-startup variance, well below
the ~240 ms of actual GPU measurement work in 70+ s of total
wall-clock). The historical 2× signature was always a `cupti-python`
span-attribution artifact, never real GPU work — and it does not
reproduce under current methodology. A smaller asymmetric bias
(~13% under-report on `trt_ms` vs ~5% on `fi_ms`) persists, which
is the rationale for keeping `--use-cupti` opt-in (default off).

flashinfer-ai#6 (in-bench vs standalone 19% gap on trt `gemm2_finalize`): nsys
ground truth at N=8192 = 0.737 ms; current in-bench reports
0.7465 ms (1.3% delta — within noise); standalone reports 0.685 ms
(7.1% below ground truth, harness-to-harness rounding tolerance).
The original 19% gap was specific to the older `--use-cupti` config
against the standalone — under current methodology there is no
systematic bias.

Audit changes:

- New "Ground-truth nsys verification (2026-04-28)" section
  immediately after the post-fix verification, documenting the run
  command, per-kernel ground truth, the resolution of both
  follow-ups with quantitative tables, and a note that the trace
  also serves as a third independent kernel-port faithfulness
  check (kernel mangled-name structure matches modulo encoded
  module path).

- Follow-up flashinfer-ai#1 marked RESOLVED (the original `MbarrierArray`
  framing was wrong; actual cause was the gemm2-enumeration gap
  fixed at d291d17e/f0cf8cd0 on the standalone PR branch).

- Follow-ups flashinfer-ai#4 and flashinfer-ai#6 entries replaced with closure notes.

- Top-of-file correction section title updated to "2026-04-24/25/28"
  and short summary expanded to mention the verification round.

The original "open mysteries" list (flashinfer-ai#1, flashinfer-ai#4, flashinfer-ai#6, flashinfer-ai#8) is now fully
closed. Items remaining in *Follow-ups queued* (flashinfer-ai#2, flashinfer-ai#3, flashinfer-ai#5, flashinfer-ai#7) are
all scope-expansions, not investigations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 28, 2026
…d-and-skipped

flashinfer-ai#3 is a real coverage gap (at gs=1.0 the scale-conversion code paths
run with degenerate values, so a divergent FP8_MAX or scale-convention
mismatch between the two sides would silently produce identical
output here and divergent output at non-trivial scales). Skipped
because:

- Both sides have their own internal scale-plumbing tests with
  non-trivial scales. TRT-LLM's
  tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py and
  flashinfer's tests/moe/test_cute_dsl_fused_moe.py both exercise
  non-trivial-gs configurations against PyTorch references. The
  bench wouldn't add coverage their CI doesn't already have.

- A failure here wouldn't tell us what we want to know. The most
  likely cause of fi-vs-trt parity divergence at production scales
  would be a scale-convention disagreement BETWEEN the two
  implementations — not fixable in flashinfer (TRT-LLM is upstream),
  and dramatically out of scope.

- Risk of getting nerd-sniped on bench-side mistakes. Scale plumbing
  has many surfaces (alpha, weight_scale_2, input_scale, is_sf_*
  flags, fp4_quantize signatures); any tiny bench-side mismatch
  produces a parity failure that looks like a port bug. That's the
  same failure pattern as flashinfer-ai#4 / flashinfer-ai#6 / flashinfer-ai#8 from earlier in the audit.

Scope-difference note: the audit's load-bearing question is "is the
kernel port faithful?" — closed via byte-identical PTX + matching
timing + 45 parity cells. flashinfer-ai#3 is about "do flashinfer and TRT-LLM
agree on NVFP4 scaling conventions?" — a separate question, real
but tangential to port-faithfulness.

One scenario that would re-elevate this: a planned production ship
of CuteDslMoEWrapper to a caller using non-trivial scales, where
the team wants one independent cross-check (against CuteDslFusedMoE
under the same scales) before merging. In that scenario flashinfer-ai#3 is
exactly the right pre-ship sanity. For closing out the investigation
audit, it's tangential.

Effective remaining open follow-ups: flashinfer-ai#5 (cutlass-dsl 4.4.2 sanity
rerun) and flashinfer-ai#9 (EP=16 tactic-divergence root cause).
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 28, 2026
…flashinfer-ai#9 remains

Three close-out edits to wrap the CuteDSL MoE FP4 port audit:

- Mark flashinfer-ai#5 (cutlass-dsl 4.4.2 sanity rerun) considered-and-skipped,
  matching the closure pattern used for flashinfer-ai#2 and flashinfer-ai#3. Three reasons:
  (a) port-parity claims are unaffected by DSL-compiler version since
  both sides use the same in-container compiler, (b) flashinfer and
  TRT-LLM each have CI testing 4.4.2 already, (c) install hassle plus
  unsupported-config risk produces the same ambiguous-failure pattern
  that cost time on flashinfer-ai#4/flashinfer-ai#6/flashinfer-ai#8. Auto-resolves whenever NGC bumps the
  image. Install recipe preserved for future absolute-latency probes.

- Add a "Version-skew caveat (2026-04-28)" subsection to the
  top-of-file correction. The bench compares flashinfer-with-post-rc5-
  forward-ports (bb2f88329, 6b8ae6fa8, fae498579) vs TRT-LLM-rc5.post2-
  without-them, so the +3.5% / +4.1% headlines at EP=1 N=16384 may
  partly reflect the version asymmetry. Load-bearing claims (port
  faithfulness via byte-identical source + PTX, 45/45 parity, flashinfer-ai#3067
  fix) are unaffected because they do not depend on absolute deltas.
  Naturally re-baselines when NGC publishes a 1.3.x.x image absorbing
  the post-rc5 commits.

- Update the "Open follow-ups remaining" summary: flashinfer-ai#5 added to the
  considered-and-skipped list alongside flashinfer-ai#2/flashinfer-ai#3, leaving flashinfer-ai#9 (EP=16
  tactic-divergence root cause) as the only effective open follow-up.
  Audit declared closed 2026-04-28.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>