feat: Add support for TRTLLM MXFP8 non-gated MoE with ReLU2 #2707

Merged
aleozlx merged 9 commits into flashinfer-ai:main from danisereb:moe_mxfp8_non_gated_relu on Mar 19, 2026

Conversation

@danisereb
Contributor

@danisereb danisereb commented Mar 6, 2026

📌 Description

This PR adds support for TRTLLM MXFP8 non-gated MoE with ReLU2 (for Nemotron models).

A PR for TRTLLM MXFP8 gated MoE is open in vLLM:
vllm-project/vllm#35986

Once this PR is merged and a new flashinfer version is released, support for non-gated MoE will be added in vLLM.

New tests were added and all tests passed:

pytest tests/moe/test_trtllm_gen_fused_moe.py -k "mxfp8 and relu"
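
For context, "non-gated MoE with ReLU2" means each expert's first GEMM produces only intermediate_size channels and the activation is squared ReLU, whereas a gated (SwiGLU-style) expert produces 2 * intermediate_size channels split into gate and up halves. A minimal PyTorch sketch of the reference math (illustrative only; names, the gate/up ordering, and sizes are not the kernel's actual layout):

```python
import torch
import torch.nn.functional as F

hidden_size, intermediate_size = 64, 128
x = torch.randn(4, hidden_size)
w2 = torch.randn(hidden_size, intermediate_size)  # GEMM2 weight, same shape in both variants

# Gated (SwiGLU-style) expert: GEMM1 weight has 2 * intermediate_size output rows.
w1_gated = torch.randn(2 * intermediate_size, hidden_size)
gate, up = (x @ w1_gated.t()).chunk(2, dim=-1)  # which half is gate vs. up varies by implementation
y_gated = (F.silu(gate) * up) @ w2.t()

# Non-gated ReLU2 expert (Nemotron-style): GEMM1 weight has intermediate_size rows,
# and the activation is relu(x) squared.
w1_relu2 = torch.randn(intermediate_size, hidden_size)
y_relu2 = (F.relu(x @ w1_relu2.t()) ** 2) @ w2.t()

print(y_gated.shape, y_relu2.shape)  # both (4, hidden_size)
```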

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • New Features

    • Routed FP8 block-scale MoE execution path
    • New MXFP8 block-scale FP8 quantization mode
  • Enhancements

    • Activation type now flows through MoE entry points, enabling gated vs. non-gated behavior (and affecting which configs are considered valid)
    • Dynamic top-k limits for DeepSeek routing based on expert count
    • Stricter weight/activation shape validation and clearer errors
  • Tests

    • New parity, routed, autotune, and smoke tests covering FP8/MXFP8 and activation types

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the FlashInfer library by integrating support for TRTLLM MXFP8 non-gated Mixture-of-Experts (MoE) layers, specifically tailored for models employing the ReLU2 activation function, such as Nemotron. The changes involve updating core C++ kernels and their Python bindings to correctly manage weight dimensions and activation types, ensuring proper functionality and paving the way for advanced model optimizations and broader compatibility within the vLLM ecosystem.

Highlights

  • TRTLLM MXFP8 Non-Gated MoE with ReLU2 Support: Implemented support for MXFP8 quantization in non-gated Mixture-of-Experts (MoE) operations, specifically utilizing the ReLU2 activation function.
  • Nemotron Model Compatibility: This feature is designed to support Nemotron models, enabling more efficient processing with the new MoE configuration.
  • Kernel and Python API Updates: Modified the C++ kernel launcher (csrc/trtllm_fused_moe_kernel_launcher.cu) and Python API (flashinfer/fused_moe/core.py) to correctly handle the new activation types and weight dimensions.
  • Expanded Test Coverage: Added new parameterized tests to validate the functionality of MXFP8 block-scale MoE with ReLU2, including specific configurations for Nemotron-H models.


Changelog
  • csrc/trtllm_fused_moe_kernel_launcher.cu
    • Updated GEMM1 weight dimension checks to differentiate between gated (e.g., Swiglu/Geglu) and non-gated (e.g., ReLU2) activations, allowing for intermediate_size_factor of 1 or 2.
    • Modified init and getValidConfigs methods in Fp8BlockScaleLauncher to accept an activation_type parameter.
    • Adjusted the moe_runner initialization to pass the new activation_type.
  • flashinfer/fused_moe/core.py
    • Extended trtllm_fp8_block_scale_moe_op and _fake_trtllm_fp8_block_scale_moe functions to include an activation_type parameter.
    • Passed the activation_type to the underlying C++ kernel calls.
    • Updated docstrings for trtllm_fp8_block_scale_moe and trtllm_fp8_block_scale_routed_moe to describe the new activation_type parameter and its possible values (Swiglu, Geglu, Relu2, Identity).
    • Clarified gemm1_weights shape description in docstrings to account for gated and non-gated activations.
  • tests/moe/test_trtllm_gen_fused_moe.py
    • Imported product from itertools for parameterized testing.
    • Modified quantize_weights method in FP8BlockScaleMoe to correctly determine intermediate_size_factor based on gemm1_weights and gemm2_weights shapes, supporting both gated and non-gated cases.
    • Updated prepare_static_weights_for_kernel to conditionally reorder rows for gated activations only, based on the activation_type.
    • Added new parameterized tests (test_mxfp8_block_scale_moe_relu2_non_gated and test_mxfp8_block_scale_moe_relu2_nemotron_h_config) to cover MXFP8 block-scale MoE with ReLU2 activation across various configurations.
    • Modified call_moe to pass the activation_type to the MoE implementation.
  • tests/moe/utils.py
    • Added QuantMode.FP8_BLOCK_SCALE_MXFP8 to NON_GATED_ACTIVATION_SUPPORTED_QUANT_MODES.
Activity
  • The author, danisereb, has implemented support for TRTLLM MXFP8 non-gated MoE with ReLU2, specifically for Nemotron models.
  • New tests have been added to validate this functionality, and all tests are reported to have passed.
  • This PR is noted as a prerequisite for adding non-gated MoE support in vLLM, with a related PR already open in that project.
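
The shape handling described above (inferring intermediate_size_factor from the gemm1/gemm2 weight shapes and checking it against the activation type) can be sketched in standalone Python. The helper name, error message, and the set of gated activations are illustrative, not the library's actual API:

```python
import torch

GATED_ACTIVATIONS = {"Swiglu", "Geglu", "SwigluBias"}  # assumed gated set, for illustration

def check_gemm1_shape(gemm1_weights, gemm2_weights, activation_type):
    """Infer intermediate_size_factor from the weight shapes and verify it matches
    the activation type: 2 for gated activations, 1 for non-gated (e.g. Relu2)."""
    num_experts, gemm1_n, hidden_size = gemm1_weights.shape
    _, hidden_size2, intermediate_size = gemm2_weights.shape
    assert hidden_size == hidden_size2, "hidden_size mismatch between GEMM1 and GEMM2 weights"

    expected_factor = 2 if activation_type in GATED_ACTIVATIONS else 1
    if gemm1_n != expected_factor * intermediate_size:
        raise ValueError(
            f"gemm1_weights dim 1 is {gemm1_n}, expected {expected_factor} * "
            f"intermediate_size = {expected_factor * intermediate_size} for {activation_type}"
        )
    return expected_factor

e, h, i = 8, 256, 512
check_gemm1_shape(torch.empty(e, i, h), torch.empty(e, h, i), "Relu2")       # non-gated: factor 1
check_gemm1_shape(torch.empty(e, 2 * i, h), torch.empty(e, h, i), "Swiglu")  # gated: factor 2
```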


@coderabbitai
Contributor

coderabbitai Bot commented Mar 6, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough


Threads ActivationType through FP8/MoE launchers and public APIs, enforces gating-aware activation checks (DeepSeek FP8 limited to Swiglu), adds dynamic DeepSeekV3 top_k limits based on expert count, and expands routed FP8/MXFP8 tests and parity checks.
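
As a rough mental model of the validateAndCastActivationType step and the gating-aware checks mentioned here, a small Python analogue follows. The member names match the docstring (Swiglu, Geglu, Relu2, Identity), but the integer values and the exact gated set are assumptions for illustration, not the C++ launcher's definitions:

```python
from enum import IntEnum

class ActivationType(IntEnum):
    # Integer values are illustrative; the real enum is defined on the C++ side.
    Swiglu = 0
    Geglu = 1
    Relu2 = 2
    Identity = 3

GATED = {ActivationType.Swiglu, ActivationType.Geglu}

def validate_and_cast_activation_type(act_type: int) -> ActivationType:
    """Reject unsupported integers before casting, mirroring the idea of
    validating activation_type at the API boundary."""
    try:
        return ActivationType(act_type)
    except ValueError:
        raise ValueError(f"Unsupported activation_type: {act_type}") from None

def is_gated_activation(act: ActivationType) -> bool:
    return act in GATED

act = validate_and_cast_activation_type(2)
assert act is ActivationType.Relu2 and not is_gated_activation(act)
```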

Changes

  • C++ Kernel Launcher Updates (csrc/trtllm_fused_moe_kernel_launcher.cu): Add validateAndCastActivationType, propagate ActivationType through init/getValidConfigs/entrypoints, tighten weight-shape validation (intermediate_size_factor / Mn checks), and add dynamic DeepSeekV3 top_k limits; update signatures to accept activation_type/act_type.
  • Python MoE Core Bindings (flashinfer/fused_moe/core.py): Add an activation_type param (default Swiglu) to FP8/FP4/MXInt4 MoE entry points and fake ops; forward activation_type into trtllm_* launcher calls and update exports/docstrings.
  • FP8 Routed MoE Tests & Exports (tests/moe/test_trtllm_gen_fused_moe.py, tests/moe/test_trtllm_gen_routed_fused_moe.py): Add the trtllm_fp8_block_scale_routed_moe export; add routed FP8/MXFP8 parity and smoke tests; adjust intermediate_size/intermediate_size_factor handling, add gating-aware weight shuffling, and expand autotune scenarios.
  • FP8 Launcher/Config Callers & Bindings (assorted bindings: trtllm_fp8_block_scale_moe, trtllm_fp8_per_tensor_scale_moe, trtllm_fp4_block_scale_moe, trtllm_get_valid_moe_configs, ...): Update signatures to accept act_type/activation_type, enforce DeepSeek FP8 gating to Swiglu, and propagate activation_type into launcher/runner construction and config validation.
  • FP8 Config Getters & Runners, C++ (Fp8BlockScaleLauncher, Fp8PerTensorLauncher, MoE::Runner-related code in csrc/...): Extend getValidConfigs/init to accept activation_type; branch FP8 config logic on act_type and quantization_type (DeepSeek FP8 vs MXFP8); maintain FP4/MXInt4 paths.
  • Test Utilities (tests/moe/utils.py): Add FP8_BLOCK_SCALE_MXFP8 to QuantMode and include it in NON_GATED_ACTIVATION_SUPPORTED_QUANT_MODES.

Sequence Diagram(s)

sequenceDiagram
    participant Py as Python API (flashinfer.fused_moe)
    participant Bind as C Bindings
    participant Launcher as Fp8BlockScaleLauncher / FusedMoeLauncher
    participant Kernel as CUDA Kernel Launcher
    participant GPU as Device

    Py->>Bind: trtllm_fp8_*_moe(..., activation_type)
    Bind->>Launcher: validateAndCastActivationType(act_type)
    Launcher->>Launcher: getValidConfigs(..., activation_type)
    Launcher->>Kernel: init(..., activation_type) / launch(configs, weights, inputs)
    Kernel->>GPU: run kernels
    GPU-->>Kernel: results
    Kernel-->>Bind: return outputs
    Bind-->>Py: outputs

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Suggested labels

op: moe-routing

Suggested reviewers

  • yzh119
  • cyx-6
  • IwakuraRein
  • bkryu
  • jimmyzho
  • nv-yunzheq
  • aleozlx

Poem

🐰
I threaded activation through each gate,
Swiglu leads the FP8 parade.
Experts shuffle, top-k hops high,
Tests nibble bytes and watch it fly.
Carrots for code — a joyous sigh!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 59.26%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check: ✅ Passed. The title clearly and concisely describes the main change: adding support for TRTLLM MXFP8 non-gated MoE with ReLU2, which matches the core objective demonstrated throughout the changeset.
  • Description check: ✅ Passed. The description includes what the PR does, references related issues, completes the pre-commit checklist, notes new tests added, and provides a test command that passed.



Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request adds support for non-gated MoE with ReLU2 for TRTLLM MXFP8, which is a great feature enhancement. The changes are logical, and new tests provide good coverage for the added functionality. I've found a potential issue where the updated code might not correctly handle quantization modes with different dtypes for activations and weights (like DeepSeekFp8), as it assumes they are the same. I've provided detailed comments and suggestions to make the implementation more robust for all supported quantization modes.

4 outdated comment threads on csrc/trtllm_fused_moe_kernel_launcher.cu
@danisereb danisereb force-pushed the moe_mxfp8_non_gated_relu branch 2 times, most recently from d9d8927 to 279d358, on March 8, 2026 at 06:08
@danisereb danisereb force-pushed the moe_mxfp8_non_gated_relu branch from d102560 to 329ec04 on March 8, 2026 at 16:20
@danisereb danisereb force-pushed the moe_mxfp8_non_gated_relu branch from 62ef99d to 0cd2233 on March 8, 2026 at 19:12
@danisereb danisereb marked this pull request as ready for review on March 8, 2026 at 19:19
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@csrc/trtllm_fused_moe_kernel_launcher.cu`:
- Around line 1138-1142: The DeepSeek FP8 branch currently constructs
tensorrt_llm::kernels::trtllmgen_moe::MoE::Runner using a constructor that has
no ActivationType (and MoERunnerArgs likewise lacks ActivationType), so gated
activations like Geglu/SwigluBias incorrectly share the Swiglu path; fix by
either (A) restricting the DeepSeekFp8 conditional to only the activation(s) the
current constructor supports (e.g., Swiglu) or (B) add ActivationType to
MoERunnerArgs and use the activation-aware MoE::Runner constructor (and
propagate ActivationType through the call sites that construct MoE::Runner), and
apply the same change to the other analogous DeepSeekFp8 checks in the file.

In `@tests/moe/test_trtllm_gen_routed_fused_moe.py`:
- Line 396: The routed parity test sets use_shuffled_weight=True while
gemm1_weights and gemm2_weights are not passed through the shuffle helpers,
causing the routed kernel to misinterpret raw FP8 weight layout; change
use_shuffled_weight to False in this test (or alternatively apply the same
weight/scale shuffling used in the MXFP8 test) so the routed kernel and the
reference use the same weight layout for gemm1_weights/gemm2_weights.

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between 65d6e4a and 0cd2233.

📒 Files selected for processing (5)
  • csrc/trtllm_fused_moe_kernel_launcher.cu
  • flashinfer/fused_moe/core.py
  • tests/moe/test_trtllm_gen_fused_moe.py
  • tests/moe/test_trtllm_gen_routed_fused_moe.py
  • tests/moe/utils.py

Comment thread csrc/trtllm_fused_moe_kernel_launcher.cu
Comment thread tests/moe/test_trtllm_gen_routed_fused_moe.py Outdated
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
csrc/trtllm_fused_moe_kernel_launcher.cu (1)

559-575: ⚠️ Potential issue | 🟠 Major

Don't make BF16 tactic discovery activation-aware before BF16 execution is.

getValidConfigs() now builds BF16 runners with the caller's act_type, but Bf16MoeLauncher::init() on Line 468 still hard-codes ActivationType::Swiglu, and trtllm_bf16_moe() still has no activation parameter. That means tactic lookup can return/cache configs for Relu2/Gelu/etc. that the BF16 runtime will never execute. Either thread ActivationType through the BF16 runtime or reject non-Swiglu here.

Suggested guard until the BF16 runtime is activation-aware
 static Array<Array<int64_t>> getValidConfigs(int64_t top_k, int64_t hidden_size,
                                              int64_t intermediate_size, int64_t num_local_experts,
                                              int64_t num_tokens, int64_t act_type,
                                              bool use_shuffled_weight, int64_t weight_layout) {
   Array<Array<int64_t>> valid_configs;
+  auto activation_type = validateAndCastActivationType(act_type);
+  TVM_FFI_ICHECK_EQ(activation_type, ActivationType::Swiglu)
+      << "BF16 valid-config query only supports ActivationType::Swiglu.";

   std::vector<int32_t> supported_tile_nums(mSupportedTileNums.begin(), mSupportedTileNums.end());
   std::set<int32_t> selected_tile_nums =
       computeSelectedTileN(supported_tile_nums, num_tokens, top_k, num_local_experts);

   for (int32_t tile_N : selected_tile_nums) {
     auto moe_runner = std::make_unique<tensorrt_llm::kernels::trtllmgen_moe::MoE::Runner>(
         btg::Dtype::Bfloat16,  // dtype_act
         btg::Dtype::Bfloat16,  // dtype_weights
         false,                 // useDeepSeekFp8
-        tile_N, static_cast<ActivationType>(act_type), use_shuffled_weight,
+        tile_N, activation_type, use_shuffled_weight,
         static_cast<batchedGemm::gemm::MatrixLayout>(weight_layout));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@csrc/trtllm_fused_moe_kernel_launcher.cu` around lines 559 - 575,
getValidConfigs() constructs BF16 MoE runners using the caller's act_type which
can lead to tactic entries for activations the BF16 runtime cannot run
(Bf16MoeLauncher::init() currently hard-codes ActivationType::Swiglu and
trtllm_bf16_moe() has no activation parameter); fix by making getValidConfigs()
use the runtime-supported activation or reject mismatched activations: either
always pass ActivationType::Swiglu when creating the
tensorrt_llm::kernels::trtllmgen_moe::MoE::Runner for BF16 paths, or add a guard
that checks act_type == ActivationType::Swiglu and skip/return empty configs for
other activations, and document/update Bf16MoeLauncher::init()/trtllm_bf16_moe()
to thread activation through later when BF16 runtime becomes activation-aware.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@csrc/trtllm_fused_moe_kernel_launcher.cu`:
- Around line 2215-2221: The per-tensor-FP8 branch returns configs for arbitrary
activations even though Fp8PerTensorLauncher still requires gated MOE paths (see
Fp8PerTensorLauncher::check_moe expecting output1_scales_gate_scalar and
prepare_moe allocating 2 * intermediate_size GEMM1); guard the branch that calls
Fp8PerTensorLauncher::getValidConfigs (the branch using act_type /
activation_type) by checking isGatedActivation(activation_type) so it only
returns gated-activation configs until the launcher is updated to honor
intermediate_size_factor and nongated activations.

---

Outside diff comments:
In `@csrc/trtllm_fused_moe_kernel_launcher.cu`:
- Around line 559-575: getValidConfigs() constructs BF16 MoE runners using the
caller's act_type which can lead to tactic entries for activations the BF16
runtime cannot run (Bf16MoeLauncher::init() currently hard-codes
ActivationType::Swiglu and trtllm_bf16_moe() has no activation parameter); fix
by making getValidConfigs() use the runtime-supported activation or reject
mismatched activations: either always pass ActivationType::Swiglu when creating
the tensorrt_llm::kernels::trtllmgen_moe::MoE::Runner for BF16 paths, or add a
guard that checks act_type == ActivationType::Swiglu and skip/return empty
configs for other activations, and document/update
Bf16MoeLauncher::init()/trtllm_bf16_moe() to thread activation through later
when BF16 runtime becomes activation-aware.

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between b246ac7 and 3b05708.

📒 Files selected for processing (1)
  • csrc/trtllm_fused_moe_kernel_launcher.cu

@aleozlx aleozlx added the op: moe label Mar 9, 2026
@aleozlx aleozlx added the run-ci label Mar 9, 2026
@aleozlx
Collaborator

aleozlx commented Mar 9, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !390 has been created, and the CI pipeline #45677790 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[SUCCESS] Pipeline #45677790: 10/20 passed

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@csrc/trtllm_fused_moe_kernel_launcher.cu`:
- Around line 2216-2227: Add the same activation validation used in
trtllm_get_valid_moe_configs to the runtime entry
trtllm_fp8_per_tensor_scale_moe(): call validateAndCastActivationType on the
incoming activation_type, then check isGatedActivation(...) and if false raise
the same NotImplementedError message so nongated per-tensor FP8 paths are
rejected before Fp8PerTensorLauncher::check_moe() or prepare_moe() run; this
prevents the code in Fp8PerTensorLauncher that assumes gated outputs
(output1_scales_gate_scalar and 2 * intermediate_size GEMM1 buffers) from being
executed for unsupported activations.

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between 3b05708 and 4bfc8e0.

📒 Files selected for processing (1)
  • csrc/trtllm_fused_moe_kernel_launcher.cu

Comment thread csrc/trtllm_fused_moe_kernel_launcher.cu
@danisereb
Contributor Author

I fixed a few things; we can trigger CI again.

@aleozlx
Collaborator

aleozlx commented Mar 9, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !390 has been updated with latest changes, and the CI pipeline #45752288 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #45752288: 8/20 passed

@aleozlx aleozlx added the ready label Mar 16, 2026
@aleozlx
Collaborator

aleozlx commented Mar 16, 2026

tests seem good. CI also passed

pls merge

cc @yzh119

@aleozlx aleozlx merged commit d226a82 into flashinfer-ai:main Mar 19, 2026
79 of 104 checks passed
aleozlx pushed a commit that referenced this pull request Apr 24, 2026

Fixes #2731.

## What's broken?

When using the CUTLASS fused MoE backend with **non-gated activations**
(e.g., Relu2, Gelu, Silu) and MXFP8 quantization, the fc1 weight shape
validation unconditionally rejects the input — even when the shape is
correct.

## Who is affected?

Anyone using the **CUTLASS fused MoE** path with:
- **Quantization**: `WMxfp8AMxfp8`, `WMxfp4AFp8`, or `WMxfp4AMxfp8`
- **Activation**: any non-gated type (Relu2, Gelu, Silu, etc.)

Not affected: gated activations (Swiglu, Geglu, SwigluBias), or other
quant modes (NVFP4 already handles this correctly).

## Where is the bug?


`csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu`,
inside `getQuantParams()` — the fc1 weight block N-dimension check
hardcodes `* 2` at three MXFP8 branches (~L898, ~L1004, ~L1063).

## Why does it happen?

PR #2581 introduced MXFP8 support when only gated activations (Swiglu)
existed, so `inter_size * 2` was correct. Later, non-gated activation
support was added to the trtllm-gen backend (PR #2707), but the CUTLASS
backend's validation was never updated. The NVFP4 path in the same file
(line ~1131) already handles this correctly with an `if
(isGatedActivation(...))` guard.

## How did we fix it?

For each of the 3 MXFP8 quant branches:
1. Extract `int const fc1_n_mult =
isGatedActivation(base_activation_type) ? 2 : 1;`
2. Replace the hardcoded `* 2` with `* fc1_n_mult`
3. Update error messages: gated shows `"inter_size * 2"`, non-gated
shows `"inter_size"`

**Before:**
```cpp
fc1_weight_block.size(1) == alignToSfDim(inter_size, ...) * 2
```

**After:**
```cpp
int const fc1_n_mult = isGatedActivation(base_activation_type) ? 2 : 1;
fc1_weight_block.size(1) == alignToSfDim(inter_size, ...) * fc1_n_mult
```
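
For concreteness, a tiny sketch of what the relaxed check accepts, with made-up sizes and alignToSfDim treated as the identity:

```python
# Illustrative only: the real code aligns inter_size via alignToSfDim before comparing.
def expected_fc1_block_n(inter_size: int, gated: bool) -> int:
    fc1_n_mult = 2 if gated else 1
    return inter_size * fc1_n_mult

assert expected_fc1_block_n(2048, gated=True) == 4096   # Swiglu/Geglu: same as before, no regression
assert expected_fc1_block_n(2048, gated=False) == 2048  # Relu2/Gelu/Silu: now accepted
```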

## How do we know it works?

- `pre-commit run` passes (clang-format, lint, etc.)
- Gated activations (default Swiglu): `fc1_n_mult = 2` — identical to
old behavior, no regression
- Non-gated activations: `fc1_n_mult = 1` — shape check now accepts
correct `inter_size` dimension
- Full GPU test suite requires CI (`@flashinfer-bot run`)

## Related

- Builds on the approach identified in #2753 (stale ~27 days, CI
unresolved).
- Addresses the Gemini review feedback from #2753 by extracting the
multiplier to a local variable before the validation checks.

cc @aleozlx @nv-yunzheq



## Summary by CodeRabbit

* **Bug Fixes**
* Fixed weight block size validation for Mixture of Experts (MOE) to
correctly handle both gated and non-gated activation types, ensuring
proper support across different activation configurations.


Signed-off-by: Yiyang Liu <37043548+ianliuy@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
