
[Fix] Free allocated CUDA memory in prefill #7

Merged
yzh119 merged 1 commit into main from cudafree on Oct 15, 2023
Conversation

@MasterJH5574
Collaborator

No description provided.

@yzh119 yzh119 merged commit 1dfac85 into main Oct 15, 2023
@MasterJH5574 MasterJH5574 deleted the cudafree branch October 17, 2023 21:53
diptorupd referenced this pull request in ROCm/flashinfer Sep 29, 2025
In this PR I remove the `libtorch` dependency and delete
`test_page.cpp`, the only unit test that uses libtorch. We also have
a pytest that covers page functionality, and we will use that for
validation.

Removing the libtorch dependency will help speed up Docker builds and
trim additional dependencies.


```
Test project /root/flashinfer/libflashinfer/tests/hip/build
    Start 1: MathTest
1/8 Test #1: MathTest ............................   Passed    0.31 sec
    Start 2: PosEncTest
2/8 Test #2: PosEncTest ..........................   Passed    0.31 sec
    Start 3: CascadeTest
3/8 Test #3: CascadeTest .........................   Passed  1369.12 sec
    Start 4: SingleDecodeTest
4/8 Test #4: SingleDecodeTest ....................   Passed  7726.35 sec
    Start 5: BatchDecodeTest
5/8 Test #5: BatchDecodeTest .....................   Passed  811.61 sec
    Start 6: test_mfma_fp32_16x16x16fp16
6/8 Test #6: test_mfma_fp32_16x16x16fp16 .........   Passed    0.30 sec
    Start 7: test_transpose_4x4_half_registers
7/8 Test #7: test_transpose_4x4_half_registers ...   Passed    0.28 sec
    Start 8: test_rowsum
8/8 Test #8: test_rowsum .........................   Passed    0.27 sec

100% tests passed, 0 tests failed out of 8
```
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 23, 2026
…mulated, single-GPU)

Current v3 baseline covers only EP=1 (256 experts on one rank, no
collectives) — the cleanest port-parity setup but not what
DeepSeek-V3 actually deploys. bench_moe_deepseek.py supports
single-GPU EP simulation by slicing weight tensors to the local
expert subset; we should mirror that pattern.
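The slicing pattern described above can be sketched as follows. This is a hypothetical illustration, not the actual `bench_moe_deepseek.py` code; the function and parameter names (`slice_local_experts`, `ep_size`, `rank`) are invented for the example, while `num_local_experts` and `local_expert_offset` mirror the wrapper parameters mentioned in this commit.

```python
# Single-GPU EP simulation sketch: carve the full expert weight tensor
# down to the subset a given rank would own under expert parallelism.
import numpy as np

def slice_local_experts(weights: np.ndarray, ep_size: int, rank: int):
    """Return (local_weights, num_local_experts, local_expert_offset).

    weights has shape (num_experts, ...); experts are assumed to be
    distributed contiguously and evenly across EP ranks.
    """
    num_experts = weights.shape[0]
    assert num_experts % ep_size == 0, "experts must divide evenly across ranks"
    num_local_experts = num_experts // ep_size
    local_expert_offset = rank * num_local_experts
    local = weights[local_expert_offset:local_expert_offset + num_local_experts]
    return local, num_local_experts, local_expert_offset

# 256 experts at EP=8: each simulated rank holds 32 experts.
w = np.zeros((256, 16, 16))
local, n_local, off = slice_local_experts(w, ep_size=8, rank=3)
```

No collectives are needed for this simulation: each (size, EP) run simply benchmarks one rank's local shard in isolation.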

Captured the ~20-line bench change scope (plumb --ep through,
num_local_experts/local_expert_offset to both wrappers, slice
weights), the data plan (rerun 15 sizes at EP=8 and EP=16, extend
v3 table, cross-verify against bench_moe_deepseek.py --ep 8/16),
and the scientific motivation (L=32 and L=16 change persistent-
kernel outer-loop shape, may expose different tactic selection,
could sharpen follow-up flashinfer-ai#1's tile_size=256 investigation).

Not blocking the current audit's EP=1 conclusion.
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 28, 2026
…ground-truth verification

A single nsys trace at N=8192 with `--nsys-capture-range` (commit
40bb77e) bracketing only the timed measurement passes resolved
both remaining measurement-related follow-ups.
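The bracketing idea behind `--nsys-capture-range` can be sketched as a small context manager. This is an illustrative pattern only: with `nsys profile --capture-range=cudaProfilerApi`, tracing starts and stops at the CUDA profiler API calls (e.g. `torch.cuda.profiler.start()` / `torch.cuda.profiler.stop()`); here the start/stop hooks are injectable so the control flow can be shown without a GPU.

```python
# Capture-range bracketing: only the timed measurement passes fall
# inside the profiler range; warmup and autotune stay outside it.
from contextlib import contextmanager

@contextmanager
def capture_range(start=None, stop=None):
    """Call start() on entry and stop() on exit, e.g. the CUDA
    profiler start/stop functions when running under nsys with
    --capture-range=cudaProfilerApi."""
    if start:
        start()
    try:
        yield
    finally:
        if stop:
            stop()

events = []
for _ in range(3):
    pass  # warmup / autotune iterations: not captured
with capture_range(lambda: events.append("start"),
                   lambda: events.append("stop")):
    events.append("timed")  # only this region appears in the trace
```

Keeping compile and warmup work outside the range is what makes the nsys per-kernel numbers directly comparable to the in-bench timings.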

flashinfer-ai#4 (`bench_gpu_time_with_cupti(use_cuda_graph=True)` 2× inflation):
direct wall-clock comparison at N=16384 / 30 iters shows identical
wall-clocks with and without `--use-cupti` (1m12.5s vs 1m9.5s; the
3 s delta is autotune-compile + Python-startup variance, well below
the ~240 ms of actual GPU measurement work in 70+ s of total
wall-clock). The historical 2× signature was always a `cupti-python`
span-attribution artifact, never real GPU work — and it does not
reproduce under current methodology. A smaller asymmetric bias
(~13% under-report on `trt_ms` vs ~5% on `fi_ms`) persists, which
is the rationale for keeping `--use-cupti` opt-in (default off).

flashinfer-ai#6 (in-bench vs standalone 19% gap on trt `gemm2_finalize`): nsys
ground truth at N=8192 = 0.737 ms; current in-bench reports
0.7465 ms (1.3% delta — within noise); standalone reports 0.685 ms
(7.1% below ground truth, harness-to-harness rounding tolerance).
The original 19% gap was specific to the older `--use-cupti` config
against the standalone — under current methodology there is no
systematic bias.

Audit changes:

- New "Ground-truth nsys verification (2026-04-28)" section
  immediately after the post-fix verification, documenting the run
  command, per-kernel ground truth, the resolution of both
  follow-ups with quantitative tables, and a note that the trace
  also serves as a third independent kernel-port faithfulness
  check (kernel mangled-name structure matches modulo encoded
  module path).

- Follow-up flashinfer-ai#1 marked RESOLVED (the original `MbarrierArray`
  framing was wrong; actual cause was the gemm2-enumeration gap
  fixed at d291d17e/f0cf8cd0 on the standalone PR branch).

- Follow-ups flashinfer-ai#4 and flashinfer-ai#6 entries replaced with closure notes.

- Top-of-file correction section title updated to "2026-04-24/25/28"
  and short summary expanded to mention the verification round.

The original "open mysteries" list (flashinfer-ai#1, flashinfer-ai#4, flashinfer-ai#6, flashinfer-ai#8) is now fully
closed. Items remaining in *Follow-ups queued* (flashinfer-ai#2, flashinfer-ai#3, flashinfer-ai#5, flashinfer-ai#7) are
all scope-expansions, not investigations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 28, 2026
…w follow-up flashinfer-ai#9

Bench now supports --ep N for single-GPU EP simulation (commit
da2f0e3). Full sweep at EP=1 / 8 / 16 across all 15 token counts
produced 45 (size, EP) data points — all pass parity with max_abs
≤ 0.0625.

EP=1 results reproduce the audit's headline post-fix verification
(+4.1% at N=16384, vs +3.5% in the 2026-04-25 table — within run-
to-run noise). EP=8 and EP=16 add new coverage of the deployment-
realistic configs that DeepSeek-V3 actually runs under.
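The two quantities quoted throughout (parity `max_abs` and the timing Δ%) can be written out explicitly. This is a hedged sketch of the likely definitions, consistent with "+4.1% at N=16384" reading as flashinfer slower than the reference; the function names are invented for illustration.

```python
# Illustrative definitions of the parity and timing metrics.
import numpy as np

def max_abs_diff(a: np.ndarray, b: np.ndarray) -> float:
    """Parity check: max absolute elementwise difference between
    flashinfer and reference outputs."""
    return float(np.max(np.abs(a - b)))

def delta_pct(fi_ms: float, ref_ms: float) -> float:
    """Timing delta in percent; positive means flashinfer is slower
    than the reference (TRT-LLM) measurement."""
    return (fi_ms - ref_ms) / ref_ms * 100.0
```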

Audit changes:

- New "EP=8 / EP=16 single-GPU sweep (2026-04-28)" section
  immediately after the ground-truth nsys verification. Documents
  the bench plumbing changes (Mapping construction with
  tp_size=ep, moe_tp_size=1, moe_ep_size=ep + the dual-binding
  monkey-patch on can_access_peer needed because plugin.py's
  imported binding is unaffected by patching _ipc_utils alone).
  Includes the full 45-cell (size, EP) Δ% table.

- Three observations from the data:
  (1) Small-batch Δ% explodes at EP>1 due to fixed-overhead-
      fraction effect — not actionable.
  (2) Large-batch (N=16384) Δ% stays modest across EP values
      (+4.1, +7.3, +7.9% at EP=1, 8, 16).
  (3) Per-kernel gap widens in flashinfer's disfavor at smaller
      per-rank expert count: gemm1+gemm2 sum at N=16384 goes
      -2.4% (fi faster, EP=1) → +8.6% (EP=8) → +13.7% (EP=16).
      Both kernels' compiled PTX is byte-identical between sides
      (proven 2026-04-24), so this is tactic-selection or
      wrapper-overhead, not kernel-binary divergence.

- Follow-up flashinfer-ai#7 marked RESOLVED.

- New follow-up flashinfer-ai#9 added: "Investigate why flashinfer's per-kernel
  times grow faster than TRT-LLM's at smaller per-rank expert
  count (EP>1)." Three plausible mechanisms with concrete probe
  steps. Not blocking — EP=1 port-parity remains the load-bearing
  finding.
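The dual-binding monkey-patch mentioned in the first audit bullet reflects a general Python pitfall: `from m import f` copies the binding into the importer's namespace, so patching the source module alone does not change what the importer calls. The sketch below uses stand-in module names (`ipc_utils`, `plugin`) for the real `_ipc_utils.can_access_peer` and `plugin.py` bindings.

```python
# Why both bindings must be patched: `from _ipc_utils import
# can_access_peer` inside plugin.py makes plugin hold its own
# reference, independent of later reassignments in _ipc_utils.
import types

ipc_utils = types.ModuleType("_ipc_utils")
ipc_utils.can_access_peer = lambda a, b: False  # stand-in for the real P2P check

plugin = types.ModuleType("plugin")
# Equivalent of `from _ipc_utils import can_access_peer` at plugin import time:
plugin.can_access_peer = ipc_utils.can_access_peer

# Patching only the source module leaves plugin's copy untouched...
ipc_utils.can_access_peer = lambda a, b: True
assert plugin.can_access_peer(0, 1) is False

# ...so the simulation must patch the imported binding as well.
plugin.can_access_peer = ipc_utils.can_access_peer
assert plugin.can_access_peer(0, 1) is True
```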

Top-of-file correction title and open-follow-ups summary updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2 participants