Conversation
diptorupd referenced this pull request in ROCm/flashinfer on Sep 29, 2025
In this PR I remove the `libtorch` dependency and delete `test_page.cpp`,
the only unit test that uses libtorch. We also have a pytest covering the
page functionality, and we will rely on that for validation.
Removing the libtorch dependency will speed up Docker builds and trim our
dependency footprint.
```
Test project /root/flashinfer/libflashinfer/tests/hip/build
Start 1: MathTest
1/8 Test #1: MathTest ............................ Passed 0.31 sec
Start 2: PosEncTest
2/8 Test #2: PosEncTest .......................... Passed 0.31 sec
Start 3: CascadeTest
3/8 Test #3: CascadeTest ......................... Passed 1369.12 sec
Start 4: SingleDecodeTest
4/8 Test #4: SingleDecodeTest .................... Passed 7726.35 sec
Start 5: BatchDecodeTest
5/8 Test #5: BatchDecodeTest ..................... Passed 811.61 sec
Start 6: test_mfma_fp32_16x16x16fp16
6/8 Test #6: test_mfma_fp32_16x16x16fp16 ......... Passed 0.30 sec
Start 7: test_transpose_4x4_half_registers
7/8 Test #7: test_transpose_4x4_half_registers ... Passed 0.28 sec
Start 8: test_rowsum
8/8 Test #8: test_rowsum ......................... Passed 0.27 sec
100% tests passed, 0 tests failed out of 8
```
leejnau added a commit to leejnau/flashinfer that referenced this pull request on Apr 23, 2026
…mulated, single-GPU)

The current v3 baseline covers only EP=1 (256 experts on one rank, no collectives), the cleanest port-parity setup but not what DeepSeek-V3 actually deploys. bench_moe_deepseek.py supports single-GPU EP simulation by slicing weight tensors to the local expert subset; we should mirror that pattern.

Captured:
- the ~20-line bench change scope (plumb --ep through, pass num_local_experts/local_expert_offset to both wrappers, slice weights),
- the data plan (rerun 15 sizes at EP=8 and EP=16, extend the v3 table, cross-verify against bench_moe_deepseek.py --ep 8/16), and
- the scientific motivation (L=32 and L=16 change the persistent-kernel outer-loop shape, may expose different tactic selection, and could sharpen follow-up flashinfer-ai#1's tile_size=256 investigation).

Not blocking the current audit's EP=1 conclusion.
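For illustration, a minimal sketch of the weight-slicing idea described in the commit message. The function and argument names (`slice_experts_for_rank`, `w13_weight`, `w2_weight`) are hypothetical and not the actual bench_moe_deepseek.py signatures; only the num_local_experts / local_expert_offset quantities come from the commit text.

```python
import torch

def slice_experts_for_rank(w13_weight: torch.Tensor,
                           w2_weight: torch.Tensor,
                           num_experts: int,
                           ep_size: int,
                           ep_rank: int):
    """Keep only the expert slice owned by a simulated EP rank.

    Weights are assumed to be laid out as [num_experts, ...]; single-GPU EP
    simulation keeps the contiguous sub-range owned by `ep_rank` and passes
    num_local_experts / local_expert_offset on to the wrappers.
    """
    num_local_experts = num_experts // ep_size
    local_expert_offset = ep_rank * num_local_experts
    sl = slice(local_expert_offset, local_expert_offset + num_local_experts)
    return w13_weight[sl], w2_weight[sl], num_local_experts, local_expert_offset

# Example: 256 experts at --ep 8 -> 32 local experts per simulated rank,
# matching the EP=8 configuration the commit plans to benchmark.
```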
leejnau added a commit to leejnau/flashinfer that referenced this pull request on Apr 28, 2026
…ground-truth verification

A single nsys trace at N=8192 with `--nsys-capture-range` (commit 40bb77e) bracketing only the timed measurement passes resolved both remaining measurement-related follow-ups.

flashinfer-ai#4 (`bench_gpu_time_with_cupti(use_cuda_graph=True)` 2× inflation): a direct wall-clock comparison at N=16384 / 30 iters shows identical wall-clocks with and without `--use-cupti` (1m12.5s vs 1m9.5s; the 3 s delta is autotune-compile plus Python-startup variance, well below the ~240 ms of actual GPU measurement work in 70+ s of total wall-clock). The historical 2× signature was always a `cupti-python` span-attribution artifact, never real GPU work, and it does not reproduce under current methodology. A smaller asymmetric bias (~13% under-report on `trt_ms` vs ~5% on `fi_ms`) persists, which is the rationale for keeping `--use-cupti` opt-in (default off).

flashinfer-ai#6 (in-bench vs standalone 19% gap on trt `gemm2_finalize`): nsys ground truth at N=8192 is 0.737 ms; the current in-bench reports 0.7465 ms (1.3% delta, within noise); the standalone reports 0.685 ms (7.1% below ground truth, harness-to-harness rounding tolerance). The original 19% gap was specific to the older `--use-cupti` config against the standalone; under current methodology there is no systematic bias.

Audit changes:
- New "Ground-truth nsys verification (2026-04-28)" section immediately after the post-fix verification, documenting the run command, per-kernel ground truth, the resolution of both follow-ups with quantitative tables, and a note that the trace also serves as a third independent kernel-port faithfulness check (kernel mangled-name structure matches modulo the encoded module path).
- Follow-up flashinfer-ai#1 marked RESOLVED (the original `MbarrierArray` framing was wrong; the actual cause was the gemm2-enumeration gap fixed at d291d17e/f0cf8cd0 on the standalone PR branch).
- Follow-ups flashinfer-ai#4 and flashinfer-ai#6 entries replaced with closure notes.
- Top-of-file correction section title updated to "2026-04-24/25/28" and the short summary expanded to mention the verification round.

The original "open mysteries" list (flashinfer-ai#1, flashinfer-ai#4, flashinfer-ai#6, flashinfer-ai#8) is now fully closed. The items remaining in *Follow-ups queued* (flashinfer-ai#2, flashinfer-ai#3, flashinfer-ai#5, flashinfer-ai#7) are all scope expansions, not investigations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
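As context for the capture-range bracketing mentioned above, here is a minimal sketch of the standard cudaProfilerApi pattern in a PyTorch-based harness. The `timed_measurement` helper and the launch line in the comments are illustrative assumptions, not the bench's actual `--nsys-capture-range` implementation.

```python
import torch

def timed_measurement(run_once, iters: int, capture: bool = False) -> None:
    """Run the timed passes, optionally inside a cudaProfilerApi capture window."""
    # Warm-up / autotune-compile passes stay OUTSIDE the capture window,
    # so the trace brackets only the measured GPU work.
    run_once()
    torch.cuda.synchronize()
    if capture:
        torch.cuda.profiler.start()   # cudaProfilerStart -> nsys begins capturing
    for _ in range(iters):
        run_once()
    torch.cuda.synchronize()
    if capture:
        torch.cuda.profiler.stop()    # cudaProfilerStop -> nsys stops capturing

# Illustrative launch (the nsys flags are standard; the bench-side flag is the
# commit's --nsys-capture-range):
#   nsys profile --capture-range=cudaProfilerApi --capture-range-end=stop \
#       python bench.py --nsys-capture-range
```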
leejnau added a commit to leejnau/flashinfer that referenced this pull request on Apr 28, 2026
…w follow-up flashinfer-ai#9

The bench now supports --ep N for single-GPU EP simulation (commit da2f0e3). A full sweep at EP=1 / 8 / 16 across all 15 token counts produced 45 (size, EP) data points, all of which pass parity with max_abs ≤ 0.0625. EP=1 results reproduce the audit's headline post-fix verification (+4.1% at N=16384, vs +3.5% in the 2026-04-25 table, within run-to-run noise). EP=8 and EP=16 add new coverage of the deployment-realistic configs that DeepSeek-V3 actually runs under.

Audit changes:
- New "EP=8 / EP=16 single-GPU sweep (2026-04-28)" section immediately after the ground-truth nsys verification. Documents the bench plumbing changes (Mapping construction with tp_size=ep, moe_tp_size=1, moe_ep_size=ep, plus the dual-binding monkey-patch on can_access_peer needed because plugin.py's imported binding is unaffected by patching _ipc_utils alone). Includes the full 45-cell (size, EP) Δ% table.
- Three observations from the data:
  (1) Small-batch Δ% explodes at EP>1 due to the fixed-overhead-fraction effect; not actionable.
  (2) Large-batch (N=16384) Δ% stays modest across EP values (+4.1%, +7.3%, +7.9% at EP=1, 8, 16).
  (3) The per-kernel gap widens in flashinfer's disfavor at smaller per-rank expert count: the gemm1+gemm2 sum at N=16384 goes from -2.4% (fi faster, EP=1) → +8.6% (EP=8) → +13.7% (EP=16). Both kernels' compiled PTX is byte-identical between sides (proven 2026-04-24), so this is tactic selection or wrapper overhead, not kernel-binary divergence.
- Follow-up flashinfer-ai#7 marked RESOLVED.
- New follow-up flashinfer-ai#9 added: "Investigate why flashinfer's per-kernel times grow faster than TRT-LLM's at smaller per-rank expert count (EP>1)." Three plausible mechanisms with concrete probe steps. Not blocking; EP=1 port parity remains the load-bearing finding.

Top-of-file correction title and open-follow-ups summary updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
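A hedged sketch of the dual-binding monkey-patch described in the commit message. The module paths and the Mapping keyword usage are reconstructed from the commit text and may not match the actual flashinfer / TRT-LLM source layout.

```python
# Hypothetical module paths -- the real import sites may differ. The point is
# that plugin.py holds its own `can_access_peer` binding (via from-import), so
# patching _ipc_utils alone does not affect it; both targets must be patched.
from unittest import mock

def patch_can_access_peer(value: bool = True):
    """Return patchers for BOTH import sites of can_access_peer."""
    def fake(*args, **kwargs):
        return value
    return [
        mock.patch("flashinfer.fused_moe._ipc_utils.can_access_peer", fake),
        mock.patch("flashinfer.fused_moe.plugin.can_access_peer", fake),
    ]

# Illustrative usage, with the Mapping kwargs named in the commit:
#   mapping = Mapping(tp_size=ep, moe_tp_size=1, moe_ep_size=ep, ...)
#   p_ipc, p_plugin = patch_can_access_peer()
#   with p_ipc, p_plugin:
#       build_wrappers(mapping, ...)
```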