feat: Add support for TRTLLM MXFP8 non-gated MoE with ReLU2 #2707
aleozlx merged 9 commits into flashinfer-ai:main from
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the FlashInfer library by integrating support for TRTLLM MXFP8 non-gated Mixture-of-Experts (MoE) layers, specifically tailored for models employing the ReLU2 activation function, such as Nemotron. The changes update core C++ kernels and their Python bindings to correctly manage weight dimensions and activation types, ensuring proper functionality and paving the way for advanced model optimizations and broader compatibility within the vLLM ecosystem.
Activity
Note: Reviews paused. This branch is under active development, so CodeRabbit has automatically paused this review to avoid overwhelming you with comments due to an influx of new commits.
📝 Walkthrough

Threads ActivationType through FP8/MoE launchers and public APIs, enforces gating-aware activation checks (DeepSeek FP8 limited to Swiglu), adds dynamic DeepSeekV3 top_k limits based on expert count, and expands routed FP8/MXFP8 tests and parity checks.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Py as Python API (flashinfer.fused_moe)
    participant Bind as C Bindings
    participant Launcher as Fp8BlockScaleLauncher / FusedMoeLauncher
    participant Kernel as CUDA Kernel Launcher
    participant GPU as Device

    Py->>Bind: trtllm_fp8_*_moe(..., activation_type)
    Bind->>Launcher: validateAndCastActivationType(act_type)
    Launcher->>Launcher: getValidConfigs(..., activation_type)
    Launcher->>Kernel: init(..., activation_type) / launch(configs, weights, inputs)
    Kernel->>GPU: run kernels
    GPU-->>Kernel: results
    Kernel-->>Bind: return outputs
    Bind-->>Py: outputs
```
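To make the diagram concrete, here is a minimal, self-contained C++ sketch of the binding-side step: the integer `activation_type` arriving from Python is validated and cast once, and the typed enum is then handed to config selection and launch. Only `validateAndCastActivationType`, `ActivationType`, and the activation names come from this PR; the numeric codes, the enum ordering, and the helper's exact signature are assumptions for illustration.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Illustrative enum; the real definition and ordering live in the TRT-LLM headers.
enum class ActivationType : int64_t { Swiglu = 0, Geglu = 1, SwigluBias = 2, Relu2 = 3, Gelu = 4 };

// Sketch of the binding-side validation: reject unknown codes early so every later
// layer (getValidConfigs, init, launch) sees the same strongly typed activation.
inline ActivationType validateAndCastActivationTypeSketch(int64_t act_type) {
  switch (act_type) {
    case 0: return ActivationType::Swiglu;
    case 1: return ActivationType::Geglu;
    case 2: return ActivationType::SwigluBias;
    case 3: return ActivationType::Relu2;
    case 4: return ActivationType::Gelu;
    default:
      throw std::invalid_argument("Unsupported activation_type code: " + std::to_string(act_type));
  }
}
```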
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed, ❌ 1 failed (1 warning)
Code Review
This pull request adds support for non-gated MoE with ReLU2 for TRTLLM MXFP8, which is a great feature enhancement. The changes are logical, and new tests provide good coverage for the added functionality. I've found a potential issue where the updated code might not correctly handle quantization modes with different dtypes for activations and weights (like DeepSeekFp8), as it assumes they are the same. I've provided detailed comments and suggestions to make the implementation more robust for all supported quantization modes.
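The point about mismatched dtypes can be illustrated with a small, purely hypothetical helper: instead of assuming activations and weights share one dtype, derive both from the quantization mode. None of the names below (`QuantScheme`, `Dtype`, `dtypesFor`) are FlashInfer APIs; they only sketch the shape of the fix the review is asking for.

```cpp
// Hypothetical sketch: a quantization scheme maps to a *pair* of dtypes, so modes
// whose activation and weight dtypes differ (e.g. FP4 weights with FP8 activations)
// are not silently funneled through a single shared dtype.
enum class QuantScheme { MxFp8, WMxfp4AFp8 };
enum class Dtype { E4m3, MxE4m3, MxE2m1 };

struct ActWeightDtypes {
  Dtype act;
  Dtype weights;
};

inline ActWeightDtypes dtypesFor(QuantScheme scheme) {
  switch (scheme) {
    case QuantScheme::MxFp8:
      return {Dtype::MxE4m3, Dtype::MxE4m3};  // same dtype on both sides
    case QuantScheme::WMxfp4AFp8:
      return {Dtype::E4m3, Dtype::MxE2m1};    // activation and weight dtypes differ
  }
  return {Dtype::E4m3, Dtype::E4m3};
}
```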
Force-pushed d9d8927 to 279d358
Force-pushed d102560 to 329ec04
Force-pushed 62ef99d to 0cd2233
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@csrc/trtllm_fused_moe_kernel_launcher.cu`:
- Around line 1138-1142: The DeepSeek FP8 branch currently constructs
tensorrt_llm::kernels::trtllmgen_moe::MoE::Runner using a constructor that has
no ActivationType (and MoERunnerArgs likewise lacks ActivationType), so gated
activations like Geglu/SwigluBias incorrectly share the Swiglu path; fix by
either (A) restricting the DeepSeekFp8 conditional to only the activation(s) the
current constructor supports (e.g., Swiglu) or (B) add ActivationType to
MoERunnerArgs and use the activation-aware MoE::Runner constructor (and
propagate ActivationType through the call sites that construct MoE::Runner), and
apply the same change to the other analogous DeepSeekFp8 checks in the file.
In `@tests/moe/test_trtllm_gen_routed_fused_moe.py`:
- Line 396: The routed parity test sets use_shuffled_weight=True while
gemm1_weights and gemm2_weights are not passed through the shuffle helpers,
causing the routed kernel to misinterpret raw FP8 weight layout; change
use_shuffled_weight to False in this test (or alternatively apply the same
weight/scale shuffling used in the MXFP8 test) so the routed kernel and the
reference use the same weight layout for gemm1_weights/gemm2_weights.
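A minimal sketch of option (A) from the first comment above, i.e. failing fast instead of letting gated variants silently share the Swiglu path. `TVM_FFI_ICHECK_EQ` and `ActivationType::Swiglu` follow the pattern used elsewhere in this file; the surrounding variable names (`use_deepseek_fp8`, `activation_type`) are assumptions.

```cpp
// Option (A) sketch: keep the existing activation-unaware DeepSeekFp8 runner, but
// reject anything other than Swiglu until ActivationType is threaded through
// MoERunnerArgs and the runner constructor. Variable names are illustrative.
if (use_deepseek_fp8) {
  TVM_FFI_ICHECK_EQ(activation_type, ActivationType::Swiglu)
      << "DeepSeek FP8 MoE currently supports only ActivationType::Swiglu; "
      << "Geglu/SwigluBias and non-gated activations are not wired through this runner yet.";
  // ... construct the existing MoE::Runner (no ActivationType argument) as before ...
}
```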
📒 Files selected for processing (5)
- csrc/trtllm_fused_moe_kernel_launcher.cu
- flashinfer/fused_moe/core.py
- tests/moe/test_trtllm_gen_fused_moe.py
- tests/moe/test_trtllm_gen_routed_fused_moe.py
- tests/moe/utils.py
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
csrc/trtllm_fused_moe_kernel_launcher.cu (1)
559-575: ⚠️ Potential issue | 🟠 Major

Don't make BF16 tactic discovery activation-aware before BF16 execution is.

`getValidConfigs()` now builds BF16 runners with the caller's `act_type`, but `Bf16MoeLauncher::init()` on Line 468 still hard-codes `ActivationType::Swiglu`, and `trtllm_bf16_moe()` still has no activation parameter. That means tactic lookup can return/cache configs for `Relu2`/`Gelu`/etc. that the BF16 runtime will never execute. Either thread `ActivationType` through the BF16 runtime or reject non-Swiglu here.

Suggested guard until the BF16 runtime is activation-aware:

```diff
 static Array<Array<int64_t>> getValidConfigs(int64_t top_k, int64_t hidden_size,
                                              int64_t intermediate_size, int64_t num_local_experts,
                                              int64_t num_tokens, int64_t act_type,
                                              bool use_shuffled_weight, int64_t weight_layout) {
   Array<Array<int64_t>> valid_configs;
+  auto activation_type = validateAndCastActivationType(act_type);
+  TVM_FFI_ICHECK_EQ(activation_type, ActivationType::Swiglu)
+      << "BF16 valid-config query only supports ActivationType::Swiglu.";
   std::vector<int32_t> supported_tile_nums(mSupportedTileNums.begin(), mSupportedTileNums.end());
   std::set<int32_t> selected_tile_nums =
       computeSelectedTileN(supported_tile_nums, num_tokens, top_k, num_local_experts);
   for (int32_t tile_N : selected_tile_nums) {
     auto moe_runner = std::make_unique<tensorrt_llm::kernels::trtllmgen_moe::MoE::Runner>(
         btg::Dtype::Bfloat16,  // dtype_act
         btg::Dtype::Bfloat16,  // dtype_weights
         false,                 // useDeepSeekFp8
-        tile_N, static_cast<ActivationType>(act_type), use_shuffled_weight,
+        tile_N, activation_type, use_shuffled_weight,
         static_cast<batchedGemm::gemm::MatrixLayout>(weight_layout));
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@csrc/trtllm_fused_moe_kernel_launcher.cu` around lines 559 - 575, getValidConfigs() constructs BF16 MoE runners using the caller's act_type which can lead to tactic entries for activations the BF16 runtime cannot run (Bf16MoeLauncher::init() currently hard-codes ActivationType::Swiglu and trtllm_bf16_moe() has no activation parameter); fix by making getValidConfigs() use the runtime-supported activation or reject mismatched activations: either always pass ActivationType::Swiglu when creating the tensorrt_llm::kernels::trtllmgen_moe::MoE::Runner for BF16 paths, or add a guard that checks act_type == ActivationType::Swiglu and skip/return empty configs for other activations, and document/update Bf16MoeLauncher::init()/trtllm_bf16_moe() to thread activation through later when BF16 runtime becomes activation-aware.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@csrc/trtllm_fused_moe_kernel_launcher.cu`:
- Around line 2215-2221: The per-tensor-FP8 branch returns configs for arbitrary
activations even though Fp8PerTensorLauncher still requires gated MOE paths (see
Fp8PerTensorLauncher::check_moe expecting output1_scales_gate_scalar and
prepare_moe allocating 2 * intermediate_size GEMM1); guard the branch that calls
Fp8PerTensorLauncher::getValidConfigs (the branch using act_type /
activation_type) by checking isGatedActivation(activation_type) so it only
returns gated-activation configs until the launcher is updated to honor
intermediate_size_factor and nongated activations.
---
Outside diff comments:
In `@csrc/trtllm_fused_moe_kernel_launcher.cu`:
- Around line 559-575: getValidConfigs() constructs BF16 MoE runners using the
caller's act_type which can lead to tactic entries for activations the BF16
runtime cannot run (Bf16MoeLauncher::init() currently hard-codes
ActivationType::Swiglu and trtllm_bf16_moe() has no activation parameter); fix
by making getValidConfigs() use the runtime-supported activation or reject
mismatched activations: either always pass ActivationType::Swiglu when creating
the tensorrt_llm::kernels::trtllmgen_moe::MoE::Runner for BF16 paths, or add a
guard that checks act_type == ActivationType::Swiglu and skip/return empty
configs for other activations, and document/update
Bf16MoeLauncher::init()/trtllm_bf16_moe() to thread activation through later
when BF16 runtime becomes activation-aware.
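For the inline comment above, a sketch of the proposed guard in the per-tensor-FP8 branch of the config query. `isGatedActivation` and `validateAndCastActivationType` are named by the review; the check macro usage and the surrounding branch condition are assumptions modeled on the BF16 snippet earlier in this thread.

```cpp
// Sketch: only report per-tensor-FP8 tactic configs for gated activations, since
// Fp8PerTensorLauncher still assumes a gated GEMM1 layout (2 * intermediate_size)
// and output1_scales_gate_scalar. Branch condition and macro usage are illustrative.
auto activation_type = validateAndCastActivationType(act_type);
if (is_per_tensor_fp8_branch) {
  TVM_FFI_ICHECK(isGatedActivation(activation_type))
      << "Per-tensor FP8 MoE config query currently supports gated activations only.";
  // ... fall through to Fp8PerTensorLauncher::getValidConfigs(...) as before ...
}
```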
📒 Files selected for processing (1)
csrc/trtllm_fused_moe_kernel_launcher.cu
/bot run

[SUCCESS] Pipeline #45677790: 10/20 passed
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@csrc/trtllm_fused_moe_kernel_launcher.cu`:
- Around line 2216-2227: Add the same activation validation used in
trtllm_get_valid_moe_configs to the runtime entry
trtllm_fp8_per_tensor_scale_moe(): call validateAndCastActivationType on the
incoming activation_type, then check isGatedActivation(...) and if false raise
the same NotImplementedError message so nongated per-tensor FP8 paths are
rejected before Fp8PerTensorLauncher::check_moe() or prepare_moe() run; this
prevents the code in Fp8PerTensorLauncher that assumes gated outputs
(output1_scales_gate_scalar and 2 * intermediate_size GEMM1 buffers) from being
executed for unsupported activations.
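The same validation can be shared between the config query and the runtime entry so the two paths cannot drift apart. The helper below is hypothetical (its name and placement are mine), but it composes only calls the review already references.

```cpp
// Hypothetical shared helper: called at the top of both trtllm_get_valid_moe_configs()
// (per-tensor FP8 branch) and trtllm_fp8_per_tensor_scale_moe(), before check_moe() or
// prepare_moe() can touch gated-only buffers such as output1_scales_gate_scalar.
inline ActivationType requireGatedActivation(int64_t act_type, const char* where) {
  auto activation_type = validateAndCastActivationType(act_type);
  TVM_FFI_ICHECK(isGatedActivation(activation_type))
      << where << ": per-tensor FP8 MoE currently supports gated activations only "
      << "(Swiglu/Geglu/SwigluBias).";
  return activation_type;
}
```

Factoring the check out keeps the error message identical at both entry points, which is what the comment asks for.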
📒 Files selected for processing (1)
csrc/trtllm_fused_moe_kernel_launcher.cu
I fixed a few things; we can trigger CI again.

/bot run

[FAILED] Pipeline #45752288: 8/20 passed

Tests seem good. CI also passed, please merge. cc @yzh119
Fixes #2731.

## What's broken?

When using the CUTLASS fused MoE backend with **non-gated activations** (e.g., Relu2, Gelu, Silu) and MXFP8 quantization, the fc1 weight shape validation unconditionally rejects the input, even when the shape is correct.

## Who is affected?

Anyone using the **CUTLASS fused MoE** path with:

- **Quantization**: `WMxfp8AMxfp8`, `WMxfp4AFp8`, or `WMxfp4AMxfp8`
- **Activation**: any non-gated type (Relu2, Gelu, Silu, etc.)

Not affected: gated activations (Swiglu, Geglu, SwigluBias), or other quant modes (NVFP4 already handles this correctly).

## Where is the bug?

`csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu`, inside `getQuantParams()`: the fc1 weight block N-dimension check hardcodes `* 2` at three MXFP8 branches (~L898, ~L1004, ~L1063).

## Why does it happen?

PR #2581 introduced MXFP8 support when only gated activations (Swiglu) existed, so `inter_size * 2` was correct. Later, non-gated activation support was added to the trtllm-gen backend (PR #2707), but the CUTLASS backend's validation was never updated. The NVFP4 path in the same file (line ~1131) already handles this correctly with an `if (isGatedActivation(...))` guard.

## How did we fix it?

For each of the 3 MXFP8 quant branches:

1. Extract `int const fc1_n_mult = isGatedActivation(base_activation_type) ? 2 : 1;`
2. Replace the hardcoded `* 2` with `* fc1_n_mult`
3. Update error messages: gated shows `"inter_size * 2"`, non-gated shows `"inter_size"`

**Before:**

```cpp
fc1_weight_block.size(1) == alignToSfDim(inter_size, ...) * 2
```

**After:**

```cpp
int const fc1_n_mult = isGatedActivation(base_activation_type) ? 2 : 1;
fc1_weight_block.size(1) == alignToSfDim(inter_size, ...) * fc1_n_mult
```

## How do we know it works?

- `pre-commit run` passes (clang-format, lint, etc.)
- Gated activations (default Swiglu): `fc1_n_mult = 2`, identical to old behavior, no regression
- Non-gated activations: `fc1_n_mult = 1`, the shape check now accepts the correct `inter_size` dimension
- Full GPU test suite requires CI (`@flashinfer-bot run`)

## Related

- Builds on the approach identified in #2753 (stale ~27 days, CI unresolved).
- Addresses the Gemini review feedback from #2753 by extracting the multiplier to a local variable before the validation checks.

cc @aleozlx @nv-yunzheq

## Summary by CodeRabbit

- **Bug Fixes**
  - Fixed weight block size validation for Mixture of Experts (MoE) to correctly handle both gated and non-gated activation types, ensuring proper support across different activation configurations.

Signed-off-by: Yiyang Liu <37043548+ianliuy@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
📌 Description
This PR adds support for TRTLLM MXFP8 non-gated MoE with ReLU2 (for Nemotron models).
A PR for TRTLLM MXFP8 gated MoE is open in vLLM:
vllm-project/vllm#35986
After this PR is merged and a new flashinfer version is released -
support for non-gated MoE will be added in vLLM.
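For readers less familiar with the terminology: ReLU2 here is the squared-ReLU activation used by Nemotron-style experts, and a non-gated expert needs only one GEMM1 projection of width `intermediate_size`, whereas a gated (SwiGLU-style) expert needs two, hence the `2 * intermediate_size` layouts discussed in the reviews. The sketch below is plain reference math, not the fused MXFP8 kernel.

```cpp
#include <algorithm>

// ReLU2 ("squared ReLU"): the non-gated activation this PR enables for MXFP8 MoE.
inline float relu2(float x) {
  float r = std::max(x, 0.0f);
  return r * r;
}

// Illustrative per-expert shapes (unquantized reference, not the kernel):
//   non-gated: h = relu2(x @ W1),              W1: [hidden_size, intermediate_size]
//   gated:     h = silu(x @ W1a) * (x @ W1b),  GEMM1 width = 2 * intermediate_size
//   output:    y = h @ W2,                     W2: [intermediate_size, hidden_size]
```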
New tests were added and all tests passed.
🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- I have installed the hooks with `pre-commit install`.
- I have run `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests
- All tests are passing (`unittest`, etc.).

Reviewer Notes
Summary by CodeRabbit

- New Features
- Enhancements
- Tests