
[Fix] Free allocated CUDA memory in prefill #7

Merged
yzh119 merged 1 commit into main from cudafree on Oct 15, 2023
Conversation

@MasterJH5574
Collaborator

No description provided.

@yzh119 yzh119 merged commit 1dfac85 into main Oct 15, 2023
@MasterJH5574 MasterJH5574 deleted the cudafree branch October 17, 2023 21:53
diptorupd referenced this pull request in ROCm/flashinfer Sep 29, 2025
In this PR I remove the `libtorch` dependency and delete
`test_page.cpp`, the only unit test that uses libtorch. We also have
a pytest that covers page functionality, and we will use that for
validation.

Removing the libtorch dependency will help speed up Docker builds and
trim additional dependencies.


```
Test project /root/flashinfer/libflashinfer/tests/hip/build
    Start 1: MathTest
1/8 Test #1: MathTest ............................   Passed    0.31 sec
    Start 2: PosEncTest
2/8 Test #2: PosEncTest ..........................   Passed    0.31 sec
    Start 3: CascadeTest
3/8 Test #3: CascadeTest .........................   Passed  1369.12 sec
    Start 4: SingleDecodeTest
4/8 Test #4: SingleDecodeTest ....................   Passed  7726.35 sec
    Start 5: BatchDecodeTest
5/8 Test #5: BatchDecodeTest .....................   Passed  811.61 sec
    Start 6: test_mfma_fp32_16x16x16fp16
6/8 Test #6: test_mfma_fp32_16x16x16fp16 .........   Passed    0.30 sec
    Start 7: test_transpose_4x4_half_registers
7/8 Test #7: test_transpose_4x4_half_registers ...   Passed    0.28 sec
    Start 8: test_rowsum
8/8 Test #8: test_rowsum .........................   Passed    0.27 sec

100% tests passed, 0 tests failed out of 8
```
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 23, 2026
…mulated, single-GPU)

Current v3 baseline covers only EP=1 (256 experts on one rank, no
collectives) — the cleanest port-parity setup but not what
DeepSeek-V3 actually deploys. bench_moe_deepseek.py supports
single-GPU EP simulation by slicing weight tensors to the local
expert subset; we should mirror that pattern.
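The slicing pattern described above can be sketched as follows. This is a hypothetical illustration, not the actual `bench_moe_deepseek.py` code; the function and parameter names (`slice_local_experts`, `ep_size`, `rank`) are invented for the example, while `num_local_experts` and `local_expert_offset` mirror the wrapper parameters mentioned in this commit.

```python
# Single-GPU EP simulation sketch: carve the full expert weight tensor
# down to the subset a given rank would own under expert parallelism.
import numpy as np

def slice_local_experts(weights: np.ndarray, ep_size: int, rank: int):
    """Return (local_weights, num_local_experts, local_expert_offset).

    weights has shape (num_experts, ...); experts are assumed to be
    distributed contiguously and evenly across EP ranks.
    """
    num_experts = weights.shape[0]
    assert num_experts % ep_size == 0, "experts must divide evenly across ranks"
    num_local_experts = num_experts // ep_size
    local_expert_offset = rank * num_local_experts
    local = weights[local_expert_offset:local_expert_offset + num_local_experts]
    return local, num_local_experts, local_expert_offset

# 256 experts at EP=8: each simulated rank holds 32 experts.
w = np.zeros((256, 16, 16))
local, n_local, off = slice_local_experts(w, ep_size=8, rank=3)
```

No collectives are needed for this simulation: each (size, EP) run simply benchmarks one rank's local shard in isolation.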

Captured the ~20-line bench change scope (plumb --ep through,
num_local_experts/local_expert_offset to both wrappers, slice
weights), the data plan (rerun 15 sizes at EP=8 and EP=16, extend
v3 table, cross-verify against bench_moe_deepseek.py --ep 8/16),
and the scientific motivation (L=32 and L=16 change persistent-
kernel outer-loop shape, may expose different tactic selection,
could sharpen follow-up flashinfer-ai#1's tile_size=256 investigation).

Not blocking the current audit's EP=1 conclusion.
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 28, 2026
…ground-truth verification

A single nsys trace at N=8192 with `--nsys-capture-range` (commit
40bb77e) bracketing only the timed measurement passes resolved
both remaining measurement-related follow-ups.
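The bracketing idea behind `--nsys-capture-range` can be sketched as a small context manager. This is an illustrative pattern only: with `nsys profile --capture-range=cudaProfilerApi`, tracing starts and stops at the CUDA profiler API calls (e.g. `torch.cuda.profiler.start()` / `torch.cuda.profiler.stop()`); here the start/stop hooks are injectable so the control flow can be shown without a GPU.

```python
# Capture-range bracketing: only the timed measurement passes fall
# inside the profiler range; warmup and autotune stay outside it.
from contextlib import contextmanager

@contextmanager
def capture_range(start=None, stop=None):
    """Call start() on entry and stop() on exit, e.g. the CUDA
    profiler start/stop functions when running under nsys with
    --capture-range=cudaProfilerApi."""
    if start:
        start()
    try:
        yield
    finally:
        if stop:
            stop()

events = []
for _ in range(3):
    pass  # warmup / autotune iterations: not captured
with capture_range(lambda: events.append("start"),
                   lambda: events.append("stop")):
    events.append("timed")  # only this region appears in the trace
```

Keeping compile and warmup work outside the range is what makes the nsys per-kernel numbers directly comparable to the in-bench timings.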

flashinfer-ai#4 (`bench_gpu_time_with_cupti(use_cuda_graph=True)` 2× inflation):
direct wall-clock comparison at N=16384 / 30 iters shows identical
wall-clocks with and without `--use-cupti` (1m12.5s vs 1m9.5s; the
3 s delta is autotune-compile + Python-startup variance, well below
the ~240 ms of actual GPU measurement work in 70+ s of total
wall-clock). The historical 2× signature was always a `cupti-python`
span-attribution artifact, never real GPU work — and it does not
reproduce under current methodology. A smaller asymmetric bias
(~13% under-report on `trt_ms` vs ~5% on `fi_ms`) persists, which
is the rationale for keeping `--use-cupti` opt-in (default off).

flashinfer-ai#6 (in-bench vs standalone 19% gap on trt `gemm2_finalize`): nsys
ground truth at N=8192 = 0.737 ms; current in-bench reports
0.7465 ms (1.3% delta — within noise); standalone reports 0.685 ms
(7.1% below ground truth, harness-to-harness rounding tolerance).
The original 19% gap was specific to the older `--use-cupti` config
against the standalone — under current methodology there is no
systematic bias.

Audit changes:

- New "Ground-truth nsys verification (2026-04-28)" section
  immediately after the post-fix verification, documenting the run
  command, per-kernel ground truth, the resolution of both
  follow-ups with quantitative tables, and a note that the trace
  also serves as a third independent kernel-port faithfulness
  check (kernel mangled-name structure matches modulo encoded
  module path).

- Follow-up flashinfer-ai#1 marked RESOLVED (the original `MbarrierArray`
  framing was wrong; actual cause was the gemm2-enumeration gap
  fixed at d291d17e/f0cf8cd0 on the standalone PR branch).

- Follow-ups flashinfer-ai#4 and flashinfer-ai#6 entries replaced with closure notes.

- Top-of-file correction section title updated to "2026-04-24/25/28"
  and short summary expanded to mention the verification round.

The original "open mysteries" list (flashinfer-ai#1, flashinfer-ai#4, flashinfer-ai#6, flashinfer-ai#8) is now fully
closed. Items remaining in *Follow-ups queued* (flashinfer-ai#2, flashinfer-ai#3, flashinfer-ai#5, flashinfer-ai#7) are
all scope-expansions, not investigations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 28, 2026
…w follow-up flashinfer-ai#9

Bench now supports --ep N for single-GPU EP simulation (commit
da2f0e3). Full sweep at EP=1 / 8 / 16 across all 15 token counts
produced 45 (size, EP) data points — all pass parity with max_abs
≤ 0.0625.

EP=1 results reproduce the audit's headline post-fix verification
(+4.1% at N=16384, vs +3.5% in the 2026-04-25 table — within run-
to-run noise). EP=8 and EP=16 add new coverage of the deployment-
realistic configs that DeepSeek-V3 actually runs under.
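The two quantities quoted throughout (parity `max_abs` and the timing Δ%) can be written out explicitly. This is a hedged sketch of the likely definitions, consistent with "+4.1% at N=16384" reading as flashinfer slower than the reference; the function names are invented for illustration.

```python
# Illustrative definitions of the parity and timing metrics.
import numpy as np

def max_abs_diff(a: np.ndarray, b: np.ndarray) -> float:
    """Parity check: max absolute elementwise difference between
    flashinfer and reference outputs."""
    return float(np.max(np.abs(a - b)))

def delta_pct(fi_ms: float, ref_ms: float) -> float:
    """Timing delta in percent; positive means flashinfer is slower
    than the reference (TRT-LLM) measurement."""
    return (fi_ms - ref_ms) / ref_ms * 100.0
```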

Audit changes:

- New "EP=8 / EP=16 single-GPU sweep (2026-04-28)" section
  immediately after the ground-truth nsys verification. Documents
  the bench plumbing changes (Mapping construction with
  tp_size=ep, moe_tp_size=1, moe_ep_size=ep + the dual-binding
  monkey-patch on can_access_peer needed because plugin.py's
  imported binding is unaffected by patching _ipc_utils alone).
  Includes the full 45-cell (size, EP) Δ% table.

- Three observations from the data:
  (1) Small-batch Δ% explodes at EP>1 due to fixed-overhead-
      fraction effect — not actionable.
  (2) Large-batch (N=16384) Δ% stays modest across EP values
      (+4.1, +7.3, +7.9% at EP=1, 8, 16).
  (3) Per-kernel gap widens in flashinfer's disfavor at smaller
      per-rank expert count: gemm1+gemm2 sum at N=16384 goes
      -2.4% (fi faster, EP=1) → +8.6% (EP=8) → +13.7% (EP=16).
      Both kernels' compiled PTX is byte-identical between sides
      (proven 2026-04-24), so this is tactic-selection or
      wrapper-overhead, not kernel-binary divergence.

- Follow-up flashinfer-ai#7 marked RESOLVED.

- New follow-up flashinfer-ai#9 added: "Investigate why flashinfer's per-kernel
  times grow faster than TRT-LLM's at smaller per-rank expert
  count (EP>1)." Three plausible mechanisms with concrete probe
  steps. Not blocking — EP=1 port-parity remains the load-bearing
  finding.
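The dual-binding monkey-patch mentioned in the first audit bullet reflects a general Python pitfall: `from m import f` copies the binding into the importer's namespace, so patching the source module alone does not change what the importer calls. The sketch below uses stand-in module names (`ipc_utils`, `plugin`) for the real `_ipc_utils.can_access_peer` and `plugin.py` bindings.

```python
# Why both bindings must be patched: `from _ipc_utils import
# can_access_peer` inside plugin.py makes plugin hold its own
# reference, independent of later reassignments in _ipc_utils.
import types

ipc_utils = types.ModuleType("_ipc_utils")
ipc_utils.can_access_peer = lambda a, b: False  # stand-in for the real P2P check

plugin = types.ModuleType("plugin")
# Equivalent of `from _ipc_utils import can_access_peer` at plugin import time:
plugin.can_access_peer = ipc_utils.can_access_peer

# Patching only the source module leaves plugin's copy untouched...
ipc_utils.can_access_peer = lambda a, b: True
assert plugin.can_access_peer(0, 1) is False

# ...so the simulation must patch the imported binding as well.
plugin.can_access_peer = ipc_utils.can_access_peer
assert plugin.can_access_peer(0, 1) is True
```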

Top-of-file correction title and open-follow-ups summary updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2 participants