Conversation
This PR splits the TVM wrapper so that prefill and decode each have their own wrapper function. This eliminates the runtime overhead of dispatching between the prefill kernel and the decode kernel on every call.
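The shape of the change can be pictured with a minimal sketch. All names below (`attention_wrapper`, `prefill_wrapper`, `decode_wrapper`, `_run_prefill`, `_run_decode`) are hypothetical stand-ins, not the actual flashinfer/TVM API:

```python
# Hedged sketch only: function and kernel names are illustrative,
# not the real flashinfer/TVM wrapper API.

def _run_prefill(batch):
    # stand-in for launching the prefill kernel
    return f"prefill({batch})"

def _run_decode(batch):
    # stand-in for launching the decode kernel
    return f"decode({batch})"

# Before: a single wrapper branches on every call.
def attention_wrapper(batch, is_decode):
    if is_decode:  # per-call runtime dispatch
        return _run_decode(batch)
    return _run_prefill(batch)

# After: each phase has its own wrapper, so the caller binds the
# right function once and no per-call dispatch remains.
def prefill_wrapper(batch):
    return _run_prefill(batch)

def decode_wrapper(batch):
    return _run_decode(batch)
```

The caller picks `prefill_wrapper` or `decode_wrapper` when setting up the phase, moving the branch out of the hot path.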
wangbo981016
pushed a commit
to meituan-longcat/flashinfer
that referenced
this pull request
Feb 5, 2026
(flashinfer-ai#12) Co-authored-by: wangbo134 <wangbo134@meituan.com>
leejnau
added a commit
to leejnau/flashinfer
that referenced
this pull request
Apr 30, 2026
…g→functional-API migration (NOT committed)

Captures the analysis surfaced 2026-04-30 around an alternative to the medium-term thin-adapter refactor: migrate SGLang from `CuteDslMoEWrapper.run(...)` to `cute_dsl_fused_moe_nvfp4(...)` directly, then deprecate the wrapper. Lee maintains both fi and SGLang sides, so the cross-project coordination cost normally faced by such a migration is essentially zero — but the question of whether to pursue it is not decided. Heavily caveated as a future consideration that **may not be pursued**. Priority is explicitly lower than the existing PR queue (PR flashinfer-ai#3171, flashinfer-ai#3198, bucket-cap-fix, prealloc-fix, convergence) and Phase 1 in follow-up flashinfer-ai#11.

Captured for two reasons:

1. The 5-minute pass on SGLang's `flashinfer_cutedsl.py:ensure_cutedsl_wrapper()` surfaces specific gaps in the functional API that would need to be closed before any migration. Documenting them here means future-Lee (or future-someone) doesn't have to re-derive them.
2. The thin-adapter refactor and the deprecate-the-wrapper migration are alternatives, not sequential. If migration is pursued, the thin-adapter is redundant. If it's not, the thin-adapter is the recommended cleanup. The relationship needed to be on the record.

Captured fi-side prerequisites (if pursued):

- Add `moe_sort_buffers`, `gemm1_out`, `gemm1_out_scale` parameters to `cute_dsl_fused_moe_nvfp4`. Forward to `_moe_core_impl` (which already accepts them). ~15 lines, additive (existing wrapper keeps working).
- Verify `CuteDslFusedMoENvfp4Runner` per-call construction doesn't bust the autotune cache or add overhead.
- CUDA-graph capture/replay validation of `cute_dsl_fused_moe_nvfp4` under SGLang-style preallocation.

Captured benefits / risks / decision criteria. Status: NOT decided; revisit only after the existing PR queue settles.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
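The first prerequisite, forwarding optional preallocated buffers through the functional entry point, is a standard additive-keyword pattern. A hedged sketch with heavily simplified signatures (the real `cute_dsl_fused_moe_nvfp4` and `_moe_core_impl` take many more parameters):

```python
# Hedged sketch: parameter names come from the commit note above, but
# the signatures here are simplified stand-ins, not the real flashinfer API.

def _moe_core_impl(hidden_states, moe_sort_buffers=None,
                   gemm1_out=None, gemm1_out_scale=None):
    # The core already accepts optional preallocated buffers; it only
    # allocates when the caller did not supply one.
    if gemm1_out is None:
        gemm1_out = [0.0] * len(hidden_states)
    return gemm1_out

def cute_dsl_fused_moe_nvfp4(hidden_states, *,
                             moe_sort_buffers=None,
                             gemm1_out=None,
                             gemm1_out_scale=None):
    # Additive change: new keyword-only parameters default to None, so
    # existing callers (including the wrapper) keep working unchanged.
    return _moe_core_impl(hidden_states,
                          moe_sort_buffers=moe_sort_buffers,
                          gemm1_out=gemm1_out,
                          gemm1_out_scale=gemm1_out_scale)
```

Because the new parameters are keyword-only with `None` defaults, the change is backward compatible: a caller that preallocates (SGLang-style) passes its buffers through, and everyone else gets the old allocate-internally behavior.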
leejnau
added a commit
to leejnau/flashinfer
that referenced
this pull request
May 4, 2026
…erged 2026-05-04 PR flashinfer-ai#3198 merged 2026-05-04 18:58 UTC as squash-merge commit `393e83ea`. fi's `get_max_num_tiles` formula now matches TRT-LLM's compact closed-form, eliminating the +1 tile over-allocation when (E - L) % T != 0. Updates the line-11 final-state entry and the line-4369 PR queue annotation in follow-up flashinfer-ai#12. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau
added a commit
to leejnau/flashinfer
that referenced
this pull request
May 5, 2026
Updates the line-4370 PR queue annotation in follow-up flashinfer-ai#12 to reflect that convergence-patch is now upstream as PR flashinfer-ai#3226 (initially Draft, promoted to full PR same day, awaiting review). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau
added a commit
to leejnau/flashinfer
that referenced
this pull request
May 6, 2026
…05-06 PR flashinfer-ai#3226 merged 2026-05-06 12:54 UTC as squash-merge commit `979644f1`. Drops 4 redundant Python-side `.fill_(-1)` / `.zero_()` calls in `moe_sort` and aligns else-branch allocations with trt-llm's `torch::empty(...)` thop pattern. Updates the line-4369 PR queue annotation in follow-up flashinfer-ai#12. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leejnau
added a commit
to leejnau/flashinfer
that referenced
this pull request
May 6, 2026
PR flashinfer-ai#3216 merged 2026-05-06 22:09 UTC as squash-commit `e6ac7cc2`. Replaces the pre-computed-tuple bucket cap with a bare-callable form that adapts to runtime input dim. Pairs with the prealloc-fix (now rebased onto post-flashinfer-ai#3216 main as HEAD `c7a81fdb`, ready to PR). Updates the line-15 final-state entry, line-51 perf-investigation-closed paragraph, and line-4369 PR queue annotation in follow-up flashinfer-ai#12. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>