[Wrapper] Split prefill/decode wrapper#12

Merged
yzh119 merged 1 commit into main from split-wrapper on Nov 8, 2023

Conversation

@MasterJH5574
Collaborator

This PR splits the TVM wrapper so that prefill and decode each have their own wrapper function.

This avoids the runtime overhead of dispatching between the prefill and decode kernels on every call.
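The dispatch overhead being avoided can be sketched in a few lines. This is a hypothetical illustration, not flashinfer's actual API: `attention_wrapper`, `prefill_wrapper`, `decode_wrapper`, and the kernel stubs are all invented names standing in for the real TVM wrapper functions.

```python
# Hypothetical stand-ins for the real kernels (illustrative only).
def prefill_kernel(qo_len, x):
    return ("prefill", qo_len, x)

def decode_kernel(qo_len, x):
    return ("decode", qo_len, x)

# Before the split: one wrapper branches on query length per call.
def attention_wrapper(qo_len, x):
    if qo_len > 1:                      # dispatch branch paid on every call
        return prefill_kernel(qo_len, x)
    return decode_kernel(qo_len, x)

# After the split: each phase has its own wrapper, so a caller that
# already knows which phase it is in skips the branch entirely.
def prefill_wrapper(qo_len, x):
    return prefill_kernel(qo_len, x)

def decode_wrapper(x):
    return decode_kernel(1, x)
```

The split works because the serving loop always knows statically whether it is in the prefill or decode phase, so the per-call branch carries no information.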

@yzh119 yzh119 merged commit 4430b04 into main Nov 8, 2023
@MasterJH5574 MasterJH5574 deleted the split-wrapper branch November 8, 2023 19:01
wangbo981016 pushed a commit to meituan-longcat/flashinfer that referenced this pull request Feb 5, 2026
 (flashinfer-ai#12)

Co-authored-by: wangbo134 <wangbo134@meituan.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request Apr 30, 2026
…g→functional-API migration (NOT committed)

Captures the analysis surfaced 2026-04-30 around an alternative
to the medium-term thin-adapter refactor: migrate SGLang from
`CuteDslMoEWrapper.run(...)` to `cute_dsl_fused_moe_nvfp4(...)`
directly, then deprecate the wrapper. Lee maintains both the fi and
SGLang sides, so the cross-project coordination cost such a migration
would normally face is essentially zero, but whether to pursue it
remains undecided.

Heavily caveated as a future consideration that **may not be
pursued**. Priority is explicitly lower than the existing PR queue
(PR flashinfer-ai#3171, flashinfer-ai#3198, bucket-cap-fix, prealloc-fix, convergence) and
Phase 1 in follow-up flashinfer-ai#11.

Captured for two reasons:

1. The 5-minute pass on SGLang's `flashinfer_cutedsl.py:ensure_cutedsl_wrapper()`
   surfaces specific gaps in the functional API that would need to
   be closed before any migration. Documenting them here means
   future-Lee (or future-someone) doesn't have to re-derive them.

2. The thin-adapter refactor and the deprecate-the-wrapper migration
   are alternatives, not sequential. If migration is pursued, the
   thin-adapter is redundant. If it's not, the thin-adapter is the
   recommended cleanup. The relationship needed to be on the record.

Captured fi-side prerequisites (if pursued):

- Add `moe_sort_buffers`, `gemm1_out`, `gemm1_out_scale` parameters
  to `cute_dsl_fused_moe_nvfp4`. Forward to `_moe_core_impl` (which
  already accepts them). ~15 lines, additive (existing wrapper
  keeps working).
- Verify `CuteDslFusedMoENvfp4Runner` per-call construction doesn't
  bust the autotune cache or add overhead.
- CUDA-graph capture/replay validation of `cute_dsl_fused_moe_nvfp4`
  under SGLang-style preallocation.
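The first prerequisite, an additive keyword-forwarding change, might look roughly like the following. This is a hedged sketch: the real signatures of `cute_dsl_fused_moe_nvfp4` and `_moe_core_impl` differ, and the body shown here is a placeholder.

```python
# Placeholder core; the real _moe_core_impl already accepts these
# optional preallocated buffers per the commit message above.
def _moe_core_impl(hidden, *, moe_sort_buffers=None, gemm1_out=None,
                   gemm1_out_scale=None):
    return {"hidden": hidden, "sort": moe_sort_buffers,
            "g1": gemm1_out, "g1s": gemm1_out_scale}

def cute_dsl_fused_moe_nvfp4(hidden, *, moe_sort_buffers=None,
                             gemm1_out=None, gemm1_out_scale=None):
    # Additive change: defaults of None keep the existing wrapper path
    # working unchanged; the new kwargs are forwarded straight through.
    return _moe_core_impl(hidden,
                          moe_sort_buffers=moe_sort_buffers,
                          gemm1_out=gemm1_out,
                          gemm1_out_scale=gemm1_out_scale)
```

Because every new parameter defaults to `None` and is merely forwarded, callers that never pass the buffers see no behavior change, which is what makes the change additive.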

Captured benefits / risks / decision criteria. Status: NOT decided;
revisit only after the existing PR queue settles.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request May 4, 2026
…erged 2026-05-04

PR flashinfer-ai#3198 merged 2026-05-04 18:58 UTC as squash-merge commit `393e83ea`.
fi's `get_max_num_tiles` formula now matches TRT-LLM's compact
closed-form, eliminating the +1 tile over-allocation when (E - L) % T != 0.
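The compact closed form referenced here is presumably an integer ceiling division. A generic sketch, under the assumption that E, L, and T play the roles suggested by the `(E - L) % T` expression above (the real `get_max_num_tiles` signature and semantics may differ):

```python
def max_num_tiles(E, L, T):
    # Integer ceil((E - L) / T) without floats: a nonzero remainder
    # costs exactly one partial tile, never an extra full tile on top.
    return (E - L + T - 1) // T
```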

Updates the line-11 final-state entry and the line-4369 PR queue
annotation in follow-up flashinfer-ai#12.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request May 5, 2026
Updates the line-4370 PR queue annotation in follow-up flashinfer-ai#12 to reflect
that convergence-patch is now upstream as PR flashinfer-ai#3226 (initially Draft,
promoted to full PR same day, awaiting review).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request May 6, 2026
…05-06

PR flashinfer-ai#3226 merged 2026-05-06 12:54 UTC as squash-merge commit
`979644f1`. Drops 4 redundant Python-side `.fill_(-1)` / `.zero_()`
calls in `moe_sort` and aligns else-branch allocations with trt-llm's
`torch::empty(...)` thop pattern.

Updates the line-4369 PR queue annotation in follow-up flashinfer-ai#12.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau added a commit to leejnau/flashinfer that referenced this pull request May 6, 2026
PR flashinfer-ai#3216 merged 2026-05-06 22:09 UTC as squash-commit `e6ac7cc2`.
Replaces the pre-computed-tuple bucket cap with a bare-callable form
that adapts to runtime input dim. Pairs with the prealloc-fix
(now rebased onto post-flashinfer-ai#3216 main as HEAD `c7a81fdb`, ready to PR).
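The tuple-versus-callable distinction described above can be sketched briefly. Names here are illustrative, not flashinfer's actual API:

```python
# Pre-computed tuple: caps are fixed at construction time and cannot
# react to the runtime input dimension.
BUCKET_CAPS = (256, 512, 1024)

# Bare callable: evaluated per call, so the cap can adapt to the
# runtime input dim (hypothetical clamping policy shown).
def bucket_cap(dim):
    return min(1024, max(256, dim))
```

A callable trades a small per-call cost for the ability to size buckets from information only available at run time, which is the adaptation the commit describes.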

Updates the line-15 final-state entry, line-51 perf-investigation-closed
paragraph, and line-4369 PR queue annotation in follow-up flashinfer-ai#12.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
