update trtllm cutlass moe #2020
Conversation
…to feature/cutlass_moe_3xfp4
Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Threads `swizzled_input_sf`, `unpadded_hidden_size`, `router_scales`, `permuted_row_to_unpermuted_row`, `swap_ab`, and finalize-fusion flags through MOE/CUTLASS flows; adds SM90 scatter epilogue visitor; extends tile/cluster enums and SM100/SM120 candidate generation; renames many kernel namespaces to `cutlass_kernels_oss`; adds explicit template instantiations and launcher/signature updates.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant App
    participant Runner as CutlassMoeFCRunner
    participant Heuristic
    participant Profiler
    participant Dispatcher
    Note over App,Runner: runMoe(..., swizzled_input_sf, unpadded_hidden_size, router_scales, permuted_row_to_unpermuted_row, swap_ab)
    App->>Runner: runMoe(...)
    Runner->>Heuristic: getTactics(gemm_id, sm, supports_finalize_fusion)
    Heuristic-->>Runner: candidate CutlassGemmConfig (may include FINALIZE, swap_ab, dynamic cluster shapes)
    Runner->>Profiler: profile/select (uses unpadded_hidden_size, stage-specific tactic counts)
    Profiler-->>Runner: selected gemm_config
    Runner->>Dispatcher: dispatch(gemm_config, router_scales, permuted_row_to_unpermuted_row, swizzled_input_sf, swap_ab)
    Dispatcher-->>Runner: launches kernel (TMA warp specialized / finalize fused / scatter epilogue)
    Runner-->>App: results
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (2 warnings, 1 inconclusive)
Summary of Changes

Hello @nv-yunzheq, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request primarily focuses on enhancing the TensorRT-LLM (TRTLLM) CUTLASS Mixture-of-Experts (MoE) implementation, particularly for Hopper and Blackwell architectures. The main objective is to introduce a new …
[FAILED] Pipeline #37989907: 12/17 passed
There are still some remaining cu126 compilation issues (e.g. the undefined `__nv_fp4_e2m1` identifier), likely because we didn't add a CUDA-version guard on its usage. We will retire cu126 at some point, but not now (considering cu126 is still one of the three supported CUDA versions of PyTorch).
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1)
99-102: FP4 guards insufficient for CUDA 12.6 compatibility

The guard only checks `ENABLE_FP4`, but `__nv_fp4_e2m1` requires CUDA 12.8+. The CI failure on cu126 (reported in PR objectives) confirms this: the identifier is undefined because CUDA 12.6 doesn't provide it. Same issue exists at lines 249-253, 742-746, and 755-759.

Apply guards that also check the CUDA version:

```diff
-#if defined(ENABLE_FP4)
+#if defined(ENABLE_FP4) && CUDA_VERSION >= 12080
   cutlass::platform::is_same<WeightType, __nv_fp4_e2m1>::value ||
 #endif
```

Repeat for all FP4 type references at lines 249-253, 742-746, and 755-759.
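To make the combined guard concrete, here is a minimal sketch of one way to express it; the `is_fp4_weight_v` helper and its placement are illustrative assumptions, not code from this PR.

```cpp
#include <cuda.h>        // defines CUDA_VERSION (e.g. 12080 for CUDA 12.8)
#include <type_traits>

#if defined(ENABLE_FP4) && (CUDA_VERSION >= 12080)
#include <cuda_fp4.h>    // provides __nv_fp4_e2m1 starting with CUDA 12.8

// FP4 is available: the trait reports whether the weight type is the FP4 e2m1 type.
template <typename WeightType>
inline constexpr bool is_fp4_weight_v = std::is_same_v<WeightType, __nv_fp4_e2m1>;
#else
// On older toolkits (e.g. cu126) the FP4 type does not exist, so the trait is simply false.
template <typename WeightType>
inline constexpr bool is_fp4_weight_v = false;
#endif
```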
♻️ Duplicate comments (1)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1)
672-676: Fix zero-argument call to supportsTmaWarpSpecialized

This duplicates a past review concern: `isTmaWarpSpecialized` calls `supportsTmaWarpSpecialized()` without arguments on line 675, but the signature at lines 679-688 now requires an `int sm` parameter. The same issue occurs at line 920 in `calcMaxWorkspaceSize`.

Apply this diff to forward the member's `sm_`:

```diff
- return supportsTmaWarpSpecialized() && config_is_tma_warp_specialized;
+ return supportsTmaWarpSpecialized(sm_) && config_is_tma_warp_specialized;
```

Also fix line 920:

```diff
- if (!supportsTmaWarpSpecialized()) {
+ if (!supportsTmaWarpSpecialized(sm_)) {
```

Alternatively, add a const wrapper in the class:

```cpp
bool supportsTmaWarpSpecialized() const { return supportsTmaWarpSpecialized(sm_); }
```

Based on learnings
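For illustration, a minimal sketch of the wrapper alternative with a stand-in class; the class name and the `sm == 90 || sm >= 100` check are assumptions, not the real runner.

```cpp
class RunnerSketch {
 public:
  explicit RunnerSketch(int sm) : sm_(sm) {}

  // Static query that now takes the SM version explicitly, as in the updated signature.
  static bool supportsTmaWarpSpecialized(int sm) { return sm == 90 || sm >= 100; }

  // Convenience const wrapper so existing zero-argument call sites keep compiling.
  bool supportsTmaWarpSpecialized() const { return supportsTmaWarpSpecialized(sm_); }

 private:
  int sm_;
};
```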
🧹 Nitpick comments (1)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1)
953-956: Consider extending FINALIZE fusion workspace calculation beyond SM90

FINALIZE fusion workspace size is currently only calculated for SM90 (line 954). If other architectures (e.g., SM100+) support finalize fusion, they should also be included in this calculation to avoid underestimating workspace requirements.
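A hypothetical sketch of the broader condition; the helper names and the SM check are illustrative assumptions, not code from the PR.

```cpp
#include <algorithm>
#include <cstddef>

// Include the finalize-fusion workspace for any architecture that can run the fused
// epilogue, instead of hard-coding SM90.
size_t maxWorkspaceWithFinalize(int sm, size_t base_workspace_bytes,
                                size_t finalize_fusion_workspace_bytes) {
  auto const supportsFusedFinalize = [](int sm_version) {
    return sm_version == 90 || sm_version >= 100;  // assumption, not verified against the PR
  };
  size_t ws = base_workspace_bytes;
  if (supportsFusedFinalize(sm)) {
    ws = std::max(ws, finalize_fusion_workspace_bytes);
  }
  return ws;
}
```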
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (14 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (6)
- csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (3): `tensorrt_llm` (63-112), `std` (81-95), `calcMaxWorkspaceSizeTmaWarpSpecialized` (490-502)
- csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/include/common.h (1): `tensorrt_llm` (19-34)
- csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws_mixed_dtype.h (1): `tensorrt_llm` (60-274)
- csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/cutlass_type_conversion.h (9): `tensorrt_llm` (33-150), `kernels` (34-149), `cutlass` (114-116), `cutlass` (120-122), `cutlass` (127-129), `cutlass` (132-134), `cutlass` (140-142), `cutlass_kernels` (35-148), `__nv_fp8_e5m2` (91-93)
- csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (2): `get_candidate_configs` (638-689), `get_candidate_configs` (638-640)
- csrc/nv_internal/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (1): `EpilogueScheduleType` (197-433)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Deploy Docs
🔇 Additional comments (4)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (4)
530-544: LGTM: Clean signature updates for finalize fusion support

The addition of the `supports_finalize_fusion` parameter to both the const member and static `getConfigs` methods properly threads this capability flag through the config selection pipeline.
624-629: Verify SM103 FP4 config selection strategy

The code explicitly adds SM100 configs when running on SM103 with FP4. Ensure this cross-architecture config reuse is validated and doesn't cause performance regressions or compatibility issues.
631-666: Well-structured finalize fusion and swap_ab config expansion

The logic correctly (a simplified sketch of this expansion pattern follows the list below):
- Duplicates configs and marks them with FINALIZE fusion type when supported (lines 631-640)
- Removes FINALIZE configs that lack epilogue SMEM (lines 642-650)
- Adds swap_ab variants for all configs (lines 653-659) with a defensive check
- Filters to swap_ab=true only for w4_groupwise mode (lines 661-666)
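A simplified, self-contained sketch of that expansion pattern; the `Config` struct here is a stand-in for `CutlassGemmConfig`, and its field names are illustrative, not the real members.

```cpp
#include <algorithm>
#include <vector>

struct Config {
  bool finalize_fusion = false;
  bool swap_ab = false;
  bool has_epilogue_smem = true;
};

std::vector<Config> expandConfigs(std::vector<Config> configs, bool supports_finalize_fusion,
                                  bool w4_groupwise) {
  if (supports_finalize_fusion) {
    // 1. Duplicate every config and mark the copies as FINALIZE-fused.
    std::vector<Config> fused = configs;
    for (auto& c : fused) c.finalize_fusion = true;
    configs.insert(configs.end(), fused.begin(), fused.end());
    // 2. Drop FINALIZE variants that have no epilogue SMEM to fuse into.
    configs.erase(std::remove_if(configs.begin(), configs.end(),
                                 [](Config const& c) {
                                   return c.finalize_fusion && !c.has_epilogue_smem;
                                 }),
                  configs.end());
  }
  // 3. Add swap_ab variants of every remaining config.
  std::vector<Config> swapped = configs;
  for (auto& c : swapped) c.swap_ab = true;
  configs.insert(configs.end(), swapped.begin(), swapped.end());
  // 4. For w4 groupwise mode, keep only the swap_ab=true variants.
  if (w4_groupwise) {
    configs.erase(std::remove_if(configs.begin(), configs.end(),
                                 [](Config const& c) { return !c.swap_ab; }),
                  configs.end());
  }
  return configs;
}
```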
978-1007: Activation type dispatch looks correct

The switch statement appropriately handles the supported activation types (Relu, Gelu, Silu, Identity, Swiglu, Geglu) and throws for invalid types. Note that `Relu2` from the `ActivationType` enum is not handled, which appears intentional per the AI summary noting "Relu2 path removed (no longer supported)".
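For reference, a condensed sketch of the dispatch shape described above; the enum values mirror those named in the comment, but this standalone helper is illustrative rather than the dispatcher in the PR.

```cpp
#include <stdexcept>

// Illustrative copy of the ActivationType values named in the comment above.
enum class ActivationType { Relu, Gelu, Silu, Identity, Swiglu, Geglu, Relu2 };

template <typename LaunchFn>
void dispatchActivation(ActivationType type, LaunchFn&& launch) {
  switch (type) {
    case ActivationType::Relu:
    case ActivationType::Gelu:
    case ActivationType::Silu:
    case ActivationType::Identity:
    case ActivationType::Swiglu:
    case ActivationType::Geglu:
      launch(type);
      break;
    default:
      // Relu2 (and anything unexpected) is rejected instead of silently falling through.
      throw std::invalid_argument("Unsupported activation type for MoE GEMM dispatch");
  }
}
```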
/bot run
nvmbreughe left a comment
LGTM.
Perhaps just add the additional tests for DSR1 and autotuner we discussed.
```cpp
      cute::make_shape(gemm_n, gemm_k, 1));
}
if (layout_info.stride_c) {
  // TODO Enable 1xN bias matrix as C
```
Does this mean we don't support batch size = 1?
No, it's just that the bias tensor cannot be 1xN.
[FAILED] Pipeline #38037173: 14/17 passed
Per discussion offline, this PR should be ready to merge, but there are some problem shapes not covered in the backend (and the CI). We will follow up by adding more unit tests with different problem shapes in future PRs.
## 📌 Description

Patch sm103 for 3xfp4 moe generation

## 🔍 Related Issues

Following up of #2020 #1925

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

```
$ ls csrc/nv_internal/tensorrt_llm/cutlass_instantiations/103/gemm_grouped
100 103 80
$ pytest tests/moe/test_trtllm_cutlass_fused_moe.py
22 passed, 3 skipped, 1 warning in 771.89s (0:12:51)
```

## Summary by CodeRabbit

* **New Features**
  * Added support for Blackwell (SM103) GPU architecture in MOE (Mixture of Experts) operations with specialized CUTLASS-optimized modules.
```diff
-    torch.cuda.get_device_capability()[0] not in [10, 11],
-    reason="MXFP8xMXFP4 is only supported on SM100 and SM110",
+    torch.cuda.get_device_capability()[0] not in [10, 11, 12],
+    reason="MXFP8xMXFP4 is only supported on SM100, SM110 and SM120",
```
## 📌 Description

## 🔍 Related Issues

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

## Summary by CodeRabbit

* **New Features**
  * SM90 scatter-based epilogue and broader SM100/SM120 MOE/GEMM coverage; new public enum for GEMM stages and explicit runner instantiations.
* **Improvements**
  * New runtime controls and parameters exposed: dynamic CGA, swap-AB, swizzled-input SF, unpadded hidden-size, and per-GEMM-stage tactic counts; expanded tile/cluster shape options, finalize-epilogue fusion and fusion/swap-aware dispatch; increased runtime debug logging and profiling.
* **Bug Fixes**
  * License/namespace/header cleanups, suppressed compiler warnings, tightened assertions.
* **Tests**
  * MXFP8×MXFP4 test now permits SM120 devices.

Co-authored-by: Yong Wu <yowu@nvidia.com>
Co-authored-by: Alex Yang <aleyang@nvidia.com>