update trtllm cutlass moe by nv-yunzheq · Pull Request #2020 · flashinfer-ai/flashinfer

nv-yunzheq · 2025-10-31T21:07:05Z

📌 Description

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

I have installed pre-commit by running pip install pre-commit (or used your preferred method).
I have installed the hooks with pre-commit install.
I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

Tests have been added or updated as needed.
All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

New Features
- SM90 scatter-based epilogue and broader SM100/SM120 MOE/GEMM coverage; new public enum for GEMM stages and explicit runner instantiations.
Improvements
- New runtime controls and parameters exposed: dynamic CGA, swap-AB, swizzled-input SF, unpadded hidden-size, and per-GEMM-stage tactic counts; expanded tile/cluster shape options, finalize-epilogue fusion and fusion/swap-aware dispatch; increased runtime debug logging and profiling.
Bug Fixes
- License/namespace/header cleanups, suppressed compiler warnings, tightened assertions.
Tests
- MXFP8×MXFP4 test now permits SM120 devices.


coderabbitai · 2025-10-31T21:07:17Z

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Threads swizzled_input_sf, unpadded_hidden_size, router_scales, permuted_row_to_unpermuted_row, swap_ab and finalize-fusion flags through MOE/CUTLASS flows; adds SM90 scatter epilogue visitor; extends tile/cluster enums and SM100/SM120 candidate generation; renames many kernel namespaces to cutlass_kernels_oss; adds explicit template instantiations and launcher/signature updates.

Changes

Cohort / File(s)	Summary
Fused MOE instantiations `csrc/fused_moe/cutlass_backend/cutlass_fused_moe_instantiation.cu`	Added explicit template instantiation for CutlassMoeFCRunner with FP8/uint4/BF16/FP8 combo and INSTANTIATE_FINALIZE_MOE_ROUTING(...) instantiations (half, float, conditional BF16).
Fused MOE kernels `csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh`	Threaded swizzled_input_sf, padded_cols/unpadded_cols, router_scales, permuted_row_to_unpermuted_row; updated writeSF, strides, finalize kernels, expandInputRows, fusion path selection, profiler/workspace flows.
Bindings & Runner `csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_sm100_binding.cu`, `flashinfer/fused_moe/core.py`	FusedMoeRunner aggregates gemm1/gemm2 tactics, exposes tactic counts/getters, wires unpadded_hidden_size and swizzled_input_sf through profiling/init, and flashinfer tuner threads gemm_idx_for_tuning for stage-specific tactic selection.
MOE public interfaces `csrc/nv_internal/.../include/moe_kernels.h`, `.../include/moe_gemm_kernels.h`	Added MoeGemmId; expanded getTactics/getConfigs/signatures to accept gemm id/sm/fusion flags; runMoe/gemm2/computeStrides dispatch signatures extended with swizzled_input_sf, unpadded_hidden_size, router_scales, permutation pointers; added use_fused_finalize_ and profiler workspace arrays.
TMA warp inputs & workspace `.../moe_gemm/moe_tma_warp_specialized_input.cu`, `.../moe_gemm/moe_tma_warp_specialized_traits.h`	Workspace buffers increased (17→20); A/B renamed to Act/Weight pointers/strides; setFinalizeFusionParams signature changed; SM120/FP4/FP8 specialization checks reworked.
SM90 epilogue visitor `csrc/nv_internal/tensorrt_llm/cutlass_extensions/include/.../sm90_visitor_scatter.hpp`	New SM90 scatter pointer-array epilogue visitor, reduction helpers, ScaledAccPerRow/PerColBias types, pointer-array scatter store fusion callbacks and fused-bias/scale/reduction variants.
Gemm config & heuristics `.../cutlass_extensions/gemm_configs.h`, `.../cutlass_kernels/cutlass_heuristic.{h,cpp}`	Added shape_tuple_to_enum/enum_to_shape_tuple, new TileShape/ClusterShape enums, EpilogueFusionType, dynamic/fallback cluster shapes, swap_ab field; added SM100/SM120 candidate generation and DYNAMIC_CGA-aware filtering.
Cutlass OSS namespace moves many files under `csrc/nv_internal/.../fpA_intB_gemm/`, `.../moe_gemm/`, `flashinfer/jit/gemm/cutlass/*`	Public namespace renamed to `tensorrt_llm::kernels::cutlass_kernels_oss`; updated opening/closing comments, re-exports, generated code, and call sites.
Dispatch & launchers `moe_gemm/moe_gemm_template_dispatch.h`, `moe_gemm_tma_ws_launcher.h`, `moe_gemm_tma_ws_mixed_input_launcher.`	Introduced dispatchMoeGemmFinalDispatchTmaWarpSpecialized and getDispatchFunctionForSM100; dispatch functions now accept CutlassGemmConfig and dynamic/fallback cluster shapes; template parameter lists extended (EpilogueSchedule, DYNAMIC_CGA, SwapAB).
Gather utils & cuda utils `.../gather_tensor.hpp`, `csrc/nv_internal/include/tensorrt_llm/common/cudaUtils.h`	Moved IndexedGather/CustomStride into `cutlass::util` namespace; qualified cute:: types; added `template<bool VALUE> using ConstBool = ConstExprWrapper<bool, VALUE>;`.
MOE launchers & signatures `.../moe_gemm/launchers/*`	Multiple launcher signatures extended (biases, bias_is_broadcast, C output, padded/unpadded cols, gemm dims, num_experts, workspace_size, occupancy), and namespace rename to OSS variants.
Template instantiations & licenses many `moe_gemm_kernels_*.cu`	Standardized many headers to Apache‑2.0, simplified includes to `"moe_gemm_template_dispatch.h"`, added many explicit MoeGemmRunner template instantiations (various FP/BF/uint combos) and adjusted namespace boundaries.
Codegen & tests `flashinfer/jit/gemm/cutlass/generate_kernels.py`, `tests/moe/test_trtllm_cutlass_fused_moe.py`	Generator adds dynamic_cga and swap_ab flags and emits OSS namespace content including SM103/SM120 variants; test skip list expanded to include SM120 for MXFP8/MXFP4 test.

Sequence Diagram(s)

sequenceDiagram
  participant App
  participant Runner as CutlassMoeFCRunner
  participant Heuristic
  participant Profiler
  participant Dispatcher
  Note over App,Runner: runMoe(..., swizzled_input_sf, unpadded_hidden_size, router_scales, permuted_row_to_unpermuted_row, swap_ab)
  App->>Runner: runMoe(...)
  Runner->>Heuristic: getTactics(gemm_id, sm, supports_finalize_fusion)
  Heuristic-->>Runner: candidate CutlassGemmConfig (may include FINALIZE, swap_ab, dynamic cluster shapes)
  Runner->>Profiler: profile/select (uses unpadded_hidden_size, stage-specific tactic counts)
  Profiler-->>Runner: selected gemm_config
  Runner->>Dispatcher: dispatch(gemm_config, router_scales, permuted_row_to_unpermuted_row, swizzled_input_sf, swap_ab)
  Dispatcher-->>Runner: launches kernel (TMA warp specialized / finalize fused / scatter epilogue)
  Runner-->>App: results
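
The stage-specific tactic selection in the diagram can be sketched roughly as follows. This is a minimal Python illustration, not the runner's actual code: `Tactic`, `aggregate_tactics`, `tactics_for_stage`, `select_tactic`, and the stage ids `GEMM_1`/`GEMM_2` are hypothetical names, and it assumes the aggregated list simply stores GEMM1 tactics followed by GEMM2 tactics.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical stage ids; the PR adds a MoeGemmId enum whose values are not shown here.
GEMM_1 = 1
GEMM_2 = 2


@dataclass(frozen=True)
class Tactic:
    """Stand-in for a candidate GEMM config (illustrative fields only)."""
    name: str
    swap_ab: bool = False


def aggregate_tactics(
    gemm1: Sequence[Tactic], gemm2: Sequence[Tactic]
) -> Tuple[List[Tactic], Tuple[int, int]]:
    """Concatenate per-stage tactics and remember how many belong to each stage."""
    return list(gemm1) + list(gemm2), (len(gemm1), len(gemm2))


def tactics_for_stage(
    all_tactics: Sequence[Tactic], counts: Tuple[int, int], gemm_idx: int
) -> List[Tactic]:
    """Return only the slice of the aggregated list that belongs to one stage."""
    n1, n2 = counts
    if gemm_idx == GEMM_1:
        return list(all_tactics[:n1])
    if gemm_idx == GEMM_2:
        return list(all_tactics[n1:n1 + n2])
    raise ValueError(f"unknown gemm_idx: {gemm_idx}")


def select_tactic(
    candidates: Sequence[Tactic], measure: Callable[[Tactic], float]
) -> Tactic:
    """Pick the candidate with the lowest measured cost (e.g. profiled latency)."""
    if not candidates:
        raise ValueError("no candidate tactics for this stage")
    return min(candidates, key=measure)
```

With this layout, a tuner would profile each stage against its own slice of candidates, so a tactic that is only valid for GEMM2 is never timed against GEMM1.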

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Areas to focus during review:

Namespace rename consistency and re-exports across headers/implementations.
Correct propagation and argument ordering of new parameters (swizzled_input_sf, unpadded_hidden_size, router_scales, permuted_row_to_unpermuted_row, swap_ab).
FINALIZE epilogue filtering, workspace sizing, SMEM/no-SMEM compatibility.
SM100/SM120 candidate generation and DYNAMIC_CGA effects on filtering/expansion.
SM90 scatter epilogue correctness (reduction ops, pointer-array scatter, FusionCallbacks).
Workspace buffer index/size changes and renamed Act/Weight pointer usages.
New explicit instantiations and JIT symbol finalization macros.

Possibly related PRs

Feature: Add support for L40 FusedMoE in cutlass path #1973 — updates Cutlass heuristic and SM candidate selection (SM89→SM>=120 path); likely overlaps with SM100/SM120 candidate generation changes.
Fix: Verify scales are not None for Cutlass FP8 FusedMoE #1961 — edits the same fused_moe SM100 binding file; likely overlaps on tactic counts, profiling, or parameter wiring.
Bugfix: Change get() -> GetDLTensorPtr() in cutlass FusedMoE validations #1995 — touches the SM100 fused_moe binding and validation code paths; likely related to runner/profile changes.

Suggested reviewers

joker-eph
djmmoss
yongwww
cyx-6
wenscarl
IwakuraRein
kahyunnam

Poem

🐰 I hopped through headers, swizzled scales at dawn,
OSS names stitched, and tile shapes newly drawn.
Buffers grew three, scatter paths hum along,
Flags threaded true — kernels sing their song.
A rabbit cheers: compile fast, land strong.

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.
Description check	⚠️ Warning	The PR description contains only the repository's template with checkboxes, missing the actual implementation details about what changes are made and why.	Add a detailed description of the changes made, including the purpose of the MOE/CUTLASS kernel updates, key modifications to namespaces, API signatures, and any migration steps. Explain why the finalize fusion and dynamic cluster shape features were added.
Title check	❓ Inconclusive	The title 'update trtllm cutlass moe' is too vague and generic. It lacks specific details about what aspect of the CUTLASS MOE was updated.	Use a more descriptive title that highlights the primary change, such as 'Add dynamic cluster shape support and finalize fusion for TensorRT-LLM MOE kernels' or similar.


gemini-code-assist · 2025-10-31T21:10:01Z

Summary of Changes

Hello @nv-yunzheq, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily focuses on enhancing the TensorRT-LLM (TRTLLM) CUTLASS Mixture-of-Experts (MoE) implementation, particularly for Hopper and Blackwell architectures. The main objective is to introduce a new FINALIZE epilogue fusion type for TMA warp-specialized grouped GEMM, which allows for more efficient post-processing operations. Additionally, it adds support for dynamic cluster shapes on SM100, expands mixed-precision capabilities with FP8xFP4, and streamlines the codebase by removing the deprecated min-latency mode and performing general refactoring. These changes aim to improve performance and flexibility in MoE computations.

Highlights

New Epilogue Fusion Type: A FINALIZE epilogue fusion type has been introduced for TMA warp-specialized grouped GEMM operations, enabling fused post-processing steps directly within the kernel for improved efficiency.
Dynamic Cluster Shape Support: The CutlassGemmConfig now supports dynamic cluster shapes for SM100 (Blackwell) architectures, allowing for more flexible kernel configurations at runtime based on workload characteristics.
FP8xFP4 Mixed Precision Support: Added support for FP8 activation with FP4 weights (WFP4AFP8) in TMA warp-specialized GEMM, including specific handling for SM103, expanding the range of supported mixed-precision computations.
Min Latency Mode Removal: The 'Min Latency Mode' for TMA warp-specialized grouped GEMM has been removed, simplifying the codebase and focusing on more generalized optimizations.
Code Refactoring and Cleanup: Various code refactorings were performed, including renaming variables (e.g., ptr_a to ptr_act, stride_a to stride_act), updating copyright years, and removing unused code to enhance maintainability and clarity.


flashinfer-bot · 2025-11-06T10:19:34Z

[FAILED] Pipeline #37989907: 12/17 passed

yzh119 · 2025-11-06T10:24:55Z

There are still some remaining cu126 compilation issues such as:

[2025-11-06T07:40:48.794Z] FAILED: [code=2] fused_moe_90/moe_gemm_kernels_fp8_fp8.cuda.o 
[2025-11-06T07:40:48.794Z] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output fused_moe_90/moe_gemm_kernels_fp8_fp8.cuda.o.d -DPy_LIMITED_API=0x03090000 -D_GLIBCXX_USE_CXX11_ABI=1 -I/workspace/csrc/nv_internal -I/workspace/csrc/nv_internal/include -I/workspace/csrc/nv_internal/tensorrt_llm/cutlass_extensions/include -I/workspace/csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/include -I/workspace/csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels -isystem /opt/conda/envs/py312/include/python3.12 -isystem /usr/local/cuda/include -isystem /usr/local/cuda/include/cccl -isystem /tmp/build-env-vsqya_iz/lib/python3.12/site-packages/tvm_ffi/include -isystem /tmp/build-env-vsqya_iz/lib/python3.12/site-packages/tvm_ffi/include -isystem /workspace/include -isystem /workspace/csrc -isystem /workspace/3rdparty/cutlass/include -isystem /workspace/3rdparty/cutlass/tools/util/include -isystem /workspace/3rdparty/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -std=c++17 --threads=1 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -O3 -gencode=arch=compute_90a,code=sm_90a -DFLASHINFER_ENABLE_FP8_E8M0 -DFLASHINFER_ENABLE_FP4_E2M1 -DCOMPILE_HOPPER_TMA_GEMMS -DCOMPILE_HOPPER_TMA_GROUPED_GEMMS -DENABLE_BF16 -DENABLE_FP8 -DUSING_OSS_CUTLASS_MOE_GEMM -c /workspace/csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_kernels_fp8_fp8.cu -o fused_moe_90/moe_gemm_kernels_fp8_fp8.cuda.o 
[2025-11-06T07:40:48.794Z] /workspace/csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h(99): error: identifier "__nv_fp4_e2m1" is undefined
[2025-11-06T07:40:48.794Z]                     cutlass::platform::is_same<WeightType, __nv_fp4_e2m1>::value ||
[2025-11-06T07:40:48.794Z]                                                            ^
[2025-11-06T07:40:48.795Z] 
[2025-11-06T07:40:48.795Z] /workspace/csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h(740): error: identifier "__nv_fp4_e2m1" is undefined
[2025-11-06T07:40:48.795Z]       if constexpr (!std::is_same_v<WeightType, __nv_fp4_e2m1>) {
[2025-11-06T07:40:48.795Z]                                                 ^
[2025-11-06T07:40:48.795Z] 
[2025-11-06T07:40:48.795Z] /workspace/csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h(748): error: identifier "__nv_fp4_e2m1" is undefined
[2025-11-06T07:40:48.795Z]       if constexpr (!std::is_same_v<WeightType, __nv_fp4_e2m1>) {
[2025-11-06T07:40:48.795Z]                                                 ^
[2025-11-06T07:40:48.795Z] 
[2025-11-06T07:40:48.795Z] 3 errors detected in the compilation of "/workspace/csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_kernels_fp8_fp8.cu".

This is likely because the uses of __nv_fp4_e2m1 are not properly guarded (cu126 does not support FP4).

We will retire cu126 at some point, but not yet, since cu126 is still one of the three CUDA versions supported by PyTorch.

coderabbitai

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1)
99-102: FP4 guards insufficient for CUDA 12.6 compatibility

The guard only checks ENABLE_FP4, but __nv_fp4_e2m1 requires CUDA 12.8+. The CI failure on cu126 (reported in PR objectives) confirms this: the identifier is undefined because CUDA 12.6 doesn't provide it. Same issue exists at lines 249-253, 742-746, and 755-759.

Apply guards that also check CUDA version:
-#if defined(ENABLE_FP4)
+#if defined(ENABLE_FP4) && CUDA_VERSION >= 12080
                   cutlass::platform::is_same<WeightType, __nv_fp4_e2m1>::value ||
 #endif
Repeat for all FP4 type references at lines 249-253, 742-746, and 755-759.
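
To make the `12080` threshold in the suggested guard concrete: the `CUDA_VERSION` macro encodes a toolkit release as `major * 1000 + minor * 10`, so CUDA 12.8 is `12080` and CUDA 12.6 is `12060`. The short Python sketch below only reproduces that arithmetic; the helper names are hypothetical, and the 12.8 minimum is taken from the review comment above rather than verified independently.

```python
# Minimum CUDA_VERSION value used in the suggested guard (from the review comment).
FP4_MIN_CUDA_VERSION = 12080


def cuda_version_macro(major: int, minor: int) -> int:
    """Mirror how the CUDA_VERSION macro encodes a toolkit release."""
    return major * 1000 + minor * 10


def fp4_type_available(enable_fp4: bool, major: int, minor: int) -> bool:
    """Evaluate `defined(ENABLE_FP4) && CUDA_VERSION >= 12080` for a given toolkit."""
    return enable_fp4 and cuda_version_macro(major, minor) >= FP4_MIN_CUDA_VERSION


if __name__ == "__main__":
    for major, minor in [(12, 6), (12, 8), (13, 0)]:
        print(f"CUDA {major}.{minor}:", fp4_type_available(True, major, minor))
```

Under this encoding a cu126 build fails the second half of the guard even when `ENABLE_FP4` is defined, which matches the compile errors reported in the CI log.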

♻️ Duplicate comments (1)

csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1)
672-676: Fix zero-argument call to supportsTmaWarpSpecialized

This duplicates a past review concern: isTmaWarpSpecialized calls supportsTmaWarpSpecialized() without arguments on line 675, but the signature at lines 679-688 now requires an int sm parameter. The same issue occurs at line 920 in calcMaxWorkspaceSize.

Apply this diff to forward the member's sm_:
-  return supportsTmaWarpSpecialized() && config_is_tma_warp_specialized;
+  return supportsTmaWarpSpecialized(sm_) && config_is_tma_warp_specialized;
Also fix line 920:
-  if (!supportsTmaWarpSpecialized()) {
+  if (!supportsTmaWarpSpecialized(sm_)) {
Alternatively, add a const wrapper in the class:
bool supportsTmaWarpSpecialized() const {
  return supportsTmaWarpSpecialized(sm_);
}
Based on learnings

🧹 Nitpick comments (1)

csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (1)

953-956: Consider extending FINALIZE fusion workspace calculation beyond SM90

FINALIZE fusion workspace size is currently only calculated for SM90 (line 954). If other architectures (e.g., SM100+) support finalize fusion, they should also be included in this calculation to avoid underestimating workspace requirements.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e15a96c and 33aec35.

📒 Files selected for processing (1)

csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (14 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (6)

- csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws.h (3): tensorrt_llm (63-112), std (81-95), calcMaxWorkspaceSizeTmaWarpSpecialized (490-502)
- csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/include/common.h (1): tensorrt_llm (19-34)
- csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws_mixed_dtype.h (1): tensorrt_llm (60-274)
- csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/cutlass_type_conversion.h (9): tensorrt_llm (33-150), kernels (34-149), cutlass (114-116), cutlass (120-122), cutlass (127-129), cutlass (132-134), cutlass (140-142), cutlass_kernels (35-148), __nv_fp8_e5m2 (91-93)
- csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (2): get_candidate_configs (638-689), get_candidate_configs (638-640)
- csrc/nv_internal/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm_configs.h (1): EpilogueScheduleType (197-433)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Deploy Docs

🔇 Additional comments (4)

csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch.h (4)

530-544: LGTM: Clean signature updates for finalize fusion support

The addition of the supports_finalize_fusion parameter to both the const member and static getConfigs methods properly threads this capability flag through the config selection pipeline.

624-629: Verify SM103 FP4 config selection strategy

The code explicitly adds SM100 configs when running on SM103 with FP4. Ensure this cross-architecture config reuse is validated and doesn't cause performance regressions or compatibility issues.

631-666: Well-structured finalize fusion and swap_ab config expansion

The logic correctly:

Duplicates configs and marks them with FINALIZE fusion type when supported (lines 631-640)

Removes FINALIZE configs that lack epilogue SMEM (lines 642-650)

Adds swap_ab variants for all configs (lines 653-659) with a defensive check

Filters to swap_ab=true only for w4_groupwise mode (lines 661-666)
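
The four steps listed above can be sketched in Python as follows. This is an illustration of the described order of operations only, not the C++ in `moe_gemm_template_dispatch.h`: `GemmConfig`, its fields, and `expand_configs` are hypothetical stand-ins for `CutlassGemmConfig` and the `getConfigs` logic.

```python
from dataclasses import dataclass, replace
from typing import List, Sequence


@dataclass(frozen=True)
class GemmConfig:
    """Hypothetical stand-in for CutlassGemmConfig (illustrative fields only)."""
    tile: str
    epilogue_fusion: str = "NONE"
    has_epilogue_smem: bool = True
    swap_ab: bool = False


def expand_configs(
    base: Sequence[GemmConfig],
    supports_finalize_fusion: bool,
    w4_groupwise: bool,
) -> List[GemmConfig]:
    configs = list(base)

    # Step 1: duplicate every config with the FINALIZE fusion type when supported.
    if supports_finalize_fusion:
        configs += [replace(c, epilogue_fusion="FINALIZE") for c in base]

    # Step 2: drop FINALIZE configs that have no epilogue shared memory.
    configs = [
        c for c in configs
        if not (c.epilogue_fusion == "FINALIZE" and not c.has_epilogue_smem)
    ]

    # Step 3: add a swap_ab=True variant of each config; the `not c.swap_ab`
    # check plays the role of the defensive check mentioned in the review.
    configs += [replace(c, swap_ab=True) for c in configs if not c.swap_ab]

    # Step 4: in w4_groupwise mode keep only the swap_ab=True variants.
    if w4_groupwise:
        configs = [c for c in configs if c.swap_ab]

    return configs
```

For example, two base configs of which one lacks epilogue SMEM yield six candidates with finalize fusion enabled (three after filtering, doubled by the swap_ab variants), and three candidates in w4_groupwise mode.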

978-1007: Activation type dispatch looks correct

The switch statement appropriately handles the supported activation types (Relu, Gelu, Silu, Identity, Swiglu, Geglu) and throws for invalid types. Note that Relu2 from the ActivationType enum is not handled, which appears intentional per the AI summary noting "Relu2 path removed (no longer supported)".

nv-yunzheq · 2025-11-06T21:25:08Z

/bot run

flashinfer-bot · 2025-11-06T21:25:54Z

GitLab MR !104 has been updated with latest changes, and the CI pipeline #38037173 is currently running. I'll report back once the pipeline job completes.

nvmbreughe

LGTM.
Perhaps just add the additional tests for DSR1 and autotuner we discussed.

nvmbreughe · 2025-11-06T21:54:11Z

+                                         cute::make_shape(gemm_n, gemm_k, 1));
+  }
  if (layout_info.stride_c) {
+    // TODO Enable 1xN bias matrix as C


Does this mean we don't support batch size = 1?

No, it only means that the bias tensor cannot be 1xN.

flashinfer-bot · 2025-11-07T02:54:01Z

[FAILED] Pipeline #38037173: 14/17 passed

yzh119 · 2025-11-07T17:35:03Z

Per offline discussion, this PR should be ready to merge. Some problem shapes are still not covered by the backend (or by the CI); we will follow up in future PRs by adding more unit tests with different problem shapes.

cc @pavanimajety @nv-yunzheq @nvmbreughe

## 📌 Description

Patch sm103 for 3xfp4 moe generation

## 🔍 Related Issues

Following up of #2020 #1925

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

```
$ ls csrc/nv_internal/tensorrt_llm/cutlass_instantiations/103/gemm_grouped
100 103 80
$ pytest tests/moe/test_trtllm_cutlass_fused_moe.py
22 passed, 3 skipped, 1 warning in 771.89s (0:12:51)
```

## Summary by CodeRabbit

* **New Features**
  * Added support for Blackwell (SM103) GPU architecture in MOE (Mixture of Experts) operations with specialized CUTLASS-optimized modules.

yongwww · 2025-11-20T22:21:13Z

-    torch.cuda.get_device_capability()[0] not in [10, 11],
-    reason="MXFP8xMXFP4 is only supported on SM100 and SM110",
+    torch.cuda.get_device_capability()[0] not in [10, 11, 12],
+    reason="MXFP8xMXFP4 is only supported on SM100, SM110 and SM120",


SM121 as well
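
As a side note on the condition in this diff: `torch.cuda.get_device_capability()` returns a `(major, minor)` tuple, and the skip check looks only at the major version. The sketch below reproduces that check as a pure function so it runs without a GPU; the helper names are hypothetical, and it assumes an SM121 device reports capability `(12, 1)`.

```python
from typing import Tuple

# Major compute-capability versions accepted by the updated skip condition.
SUPPORTED_MAJORS = (10, 11, 12)


def mxfp8_mxfp4_supported(capability: Tuple[int, int]) -> bool:
    """Same test as `get_device_capability()[0] in [10, 11, 12]`, on a given tuple."""
    major, _minor = capability
    return major in SUPPORTED_MAJORS


def sm_name(capability: Tuple[int, int]) -> str:
    """Format a (major, minor) capability as an SM label, e.g. (12, 1) -> 'SM121'."""
    major, minor = capability
    return f"SM{major}{minor}"


if __name__ == "__main__":
    for cap in [(9, 0), (10, 0), (12, 0), (12, 1)]:
        print(sm_name(cap), mxfp8_mxfp4_supported(cap))
```

Under that assumption the updated list already lets SM121 through, so the remark would mainly affect the wording of the `reason` string.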

Co-authored-by: Yong Wu <yowu@nvidia.com>
Co-authored-by: Alex Yang <aleyang@nvidia.com>


yongwww and others added 21 commits August 27, 2025 09:09

Fix aot failures

f109a2b

>launcher.inl

3cfba8e

>generate_kernels.py

9047135

>generate_kernels.py

d9d7723

>launcher.inl

3166245

>moe_gemm_kernels.h

4dce85b

cutlass_fused_moe_kernels.cuh is troublesome...

d47c865

fix compilation errors in cutlass_fused_moe_kernels.cuh

307fe30

>gather_tensor.hpp

76a9220

fix compilation errors

96c0ed4

fix compilation error for sm120

af4036d

Add #if defined(ENABLE_FP4) guards

a49d1fd

fix: use FLASHINFER_ENABLE_FP8_E8M0 guard for __nv_fp8_e8m0

13d8664

fix build

4f94bf0

fix aot errors

eddb10b

Merge remote-tracking branch 'origin/main' into feature/cutlass_moe_u…

ddb1345

…pgrade

fix stale sm100 configs

2563556

Merge branch 'main' of https://github.com/flashinfer-ai/flashinfer in…

bfe2852

…to feature/cutlass_moe_3xfp4

debug..

da54367

merge

a81fbd1

remove debug stdout

0bbab20

nv-yunzheq requested review from cyx-6, djmmoss, joker-eph, kahyunnam, wenscarl, yongwww and yzh119 as code owners October 31, 2025 21:07

fix compilation error

33aec35

coderabbitai Bot reviewed Nov 6, 2025


nvmbreughe reviewed Nov 6, 2025


yzh119 merged commit 20435b4 into flashinfer-ai:main Nov 7, 2025
4 checks passed

yzh119 mentioned this pull request Nov 7, 2025

Fix moe fp8 failure for sm121 #2061

Merged

5 tasks

yongwww mentioned this pull request Nov 7, 2025

chore: upgrade cutlass moe kernel launcher to match trtllm #1925

Closed

5 tasks

yzh119 mentioned this pull request Nov 10, 2025

an illegal instruction was encountered when run moe fp4 on spark #2065

Closed

aleozlx mentioned this pull request Nov 12, 2025

Patch sm103 for 3xfp4 moe generation #2082

Merged

5 tasks

nv-yunzheq deleted the PR1925 branch November 13, 2025 18:15

weireweire mentioned this pull request Nov 17, 2025

[Tencent][FlashInfer][GroupGemm] Integrate H20 W4A8 Grouped Gemm Kernel into FlashInfer. #1987

Closed

yongwww reviewed Nov 20, 2025


trevor-m mentioned this pull request Dec 4, 2025

[Feature] Integrate new flashinfer optimizations for DeepSeekV3 sgl-project/sglang#14453

Open

aleozlx mentioned this pull request Dec 6, 2025

Fix/moe_sm110 (to be tested) #2183

Closed

5 tasks

coderabbitai Bot mentioned this pull request Jan 7, 2026

[Perf][Feature] Add SM103-specific schedulers for NVFP4 CUTLASS kernels #2303

Merged

This was referenced Jan 22, 2026

feat: cuteDSL fp4 moe for better DSR1 performance. #2398

Merged

feat: Support Fused MoE non gated Relu2 NVFP4 & FP8 and support Nemotron #2304

Merged

coderabbitai Bot mentioned this pull request Feb 4, 2026

feat: Add MXFP8 GEMM mm_mxfp8 (cutlass) #2464

Merged

5 tasks

samuellees mentioned this pull request Apr 4, 2026

ci(tests): waive flaky/hardware-incompatible tests to cleanup CI #2922

Closed

alexbi29 mentioned this pull request Apr 6, 2026

SM120 FP8×FP4 MoE decode 15% slower after #2020 tile filter #2992

Open

coderabbitai Bot mentioned this pull request Apr 16, 2026

perf: optimize MXFP4xBF16 & INT4xFP8 CUTLASS MoE backend for SM90 #3084

Merged
