Conversation
This PR splits the TVM wrapper so that prefill and decode each have their own wrapper function. This eliminates the runtime overhead of dispatching between the prefill kernel and the decode kernel on every call.
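The shape of the change can be pictured with a minimal sketch. All names below (`attention_wrapper`, `prefill_wrapper`, `decode_wrapper`, `_run_prefill`, `_run_decode`) are hypothetical stand-ins, not the actual flashinfer/TVM API:

```python
# Hedged sketch only: function and kernel names are illustrative,
# not the real flashinfer/TVM wrapper API.

def _run_prefill(batch):
    # stand-in for launching the prefill kernel
    return f"prefill({batch})"

def _run_decode(batch):
    # stand-in for launching the decode kernel
    return f"decode({batch})"

# Before: a single wrapper branches on every call.
def attention_wrapper(batch, is_decode):
    if is_decode:  # per-call runtime dispatch
        return _run_decode(batch)
    return _run_prefill(batch)

# After: each phase has its own wrapper, so the caller binds the
# right function once and no per-call dispatch remains.
def prefill_wrapper(batch):
    return _run_prefill(batch)

def decode_wrapper(batch):
    return _run_decode(batch)
```

The caller picks `prefill_wrapper` or `decode_wrapper` when setting up the phase, moving the branch out of the hot path.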
wangbo981016
pushed a commit
to meituan-longcat/flashinfer
that referenced
this pull request
Feb 5, 2026
(flashinfer-ai#12) Co-authored-by: wangbo134 <wangbo134@meituan.com>
leejnau
added a commit
to leejnau/flashinfer
that referenced
this pull request
Apr 30, 2026
…g→functional-API migration (NOT committed)

Captures the analysis surfaced 2026-04-30 around an alternative to the medium-term thin-adapter refactor: migrate SGLang from `CuteDslMoEWrapper.run(...)` to `cute_dsl_fused_moe_nvfp4(...)` directly, then deprecate the wrapper. Lee maintains both fi and SGLang sides, so the cross-project coordination cost normally faced by such a migration is essentially zero — but the question of whether to pursue it is not decided. Heavily caveated as a future consideration that **may not be pursued**. Priority is explicitly lower than the existing PR queue (PR flashinfer-ai#3171, flashinfer-ai#3198, bucket-cap-fix, prealloc-fix, convergence) and Phase 1 in follow-up flashinfer-ai#11.

Captured for two reasons:

1. The 5-minute pass on SGLang's `flashinfer_cutedsl.py:ensure_cutedsl_wrapper()` surfaces specific gaps in the functional API that would need to be closed before any migration. Documenting them here means future-Lee (or future-someone) doesn't have to re-derive them.
2. The thin-adapter refactor and the deprecate-the-wrapper migration are alternatives, not sequential. If migration is pursued, the thin-adapter is redundant. If it's not, the thin-adapter is the recommended cleanup. The relationship needed to be on the record.

Captured fi-side prerequisites (if pursued):

- Add `moe_sort_buffers`, `gemm1_out`, `gemm1_out_scale` parameters to `cute_dsl_fused_moe_nvfp4`. Forward to `_moe_core_impl` (which already accepts them). ~15 lines, additive (existing wrapper keeps working).
- Verify `CuteDslFusedMoENvfp4Runner` per-call construction doesn't bust the autotune cache or add overhead.
- CUDA-graph capture/replay validation of `cute_dsl_fused_moe_nvfp4` under SGLang-style preallocation.

Captured benefits / risks / decision criteria. Status: NOT decided; revisit only after the existing PR queue settles.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
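The first prerequisite, forwarding optional preallocated buffers through the functional entry point, is a standard additive-keyword pattern. A hedged sketch with heavily simplified signatures (the real `cute_dsl_fused_moe_nvfp4` and `_moe_core_impl` take many more parameters):

```python
# Hedged sketch: parameter names come from the commit note above, but
# the signatures here are simplified stand-ins, not the real flashinfer API.

def _moe_core_impl(hidden_states, moe_sort_buffers=None,
                   gemm1_out=None, gemm1_out_scale=None):
    # The core already accepts optional preallocated buffers; it only
    # allocates when the caller did not supply one.
    if gemm1_out is None:
        gemm1_out = [0.0] * len(hidden_states)
    return gemm1_out

def cute_dsl_fused_moe_nvfp4(hidden_states, *,
                             moe_sort_buffers=None,
                             gemm1_out=None,
                             gemm1_out_scale=None):
    # Additive change: new keyword-only parameters default to None, so
    # existing callers (including the wrapper) keep working unchanged.
    return _moe_core_impl(hidden_states,
                          moe_sort_buffers=moe_sort_buffers,
                          gemm1_out=gemm1_out,
                          gemm1_out_scale=gemm1_out_scale)
```

Because the new parameters are keyword-only with `None` defaults, the change is backward compatible: a caller that preallocates (SGLang-style) passes its buffers through, and everyone else gets the old allocate-internally behavior.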
leejnau
added a commit
to leejnau/flashinfer
that referenced
this pull request
May 4, 2026
…erged 2026-05-04 PR flashinfer-ai#3198 merged 2026-05-04 18:58 UTC as squash-merge commit `393e83ea`. fi's `get_max_num_tiles` formula now matches TRT-LLM's compact closed-form, eliminating the +1 tile over-allocation when (E - L) % T != 0. Updates the line-11 final-state entry and the line-4369 PR queue annotation in follow-up flashinfer-ai#12. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau
added a commit
to leejnau/flashinfer
that referenced
this pull request
May 5, 2026
Updates the line-4370 PR queue annotation in follow-up flashinfer-ai#12 to reflect that convergence-patch is now upstream as PR flashinfer-ai#3226 (initially Draft, promoted to full PR same day, awaiting review). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau
added a commit
to leejnau/flashinfer
that referenced
this pull request
May 6, 2026
…05-06 PR flashinfer-ai#3226 merged 2026-05-06 12:54 UTC as squash-merge commit `979644f1`. Drops 4 redundant Python-side `.fill_(-1)` / `.zero_()` calls in `moe_sort` and aligns else-branch allocations with trt-llm's `torch::empty(...)` thop pattern. Updates the line-4369 PR queue annotation in follow-up flashinfer-ai#12. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau
added a commit
to leejnau/flashinfer
that referenced
this pull request
May 6, 2026
PR flashinfer-ai#3216 merged 2026-05-06 22:09 UTC as squash-commit `e6ac7cc2`. Replaces the pre-computed-tuple bucket cap with a bare-callable form that adapts to runtime input dim. Pairs with the prealloc-fix (now rebased onto post-flashinfer-ai#3216 main as HEAD `c7a81fdb`, ready to PR). Updates the line-15 final-state entry, line-51 perf-investigation-closed paragraph, and line-4369 PR queue annotation in follow-up flashinfer-ai#12. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>