[MoE Refactor] MXFP4 Cutlass Experts to MK #34542

vllm-bot merged 29 commits into `vllm-project:main`.
Conversation
Code Review
This pull request refactors the MXFP4 cutlass backend for MoE layers, improving modularity and adding support for new quantization schemes. The changes are well-structured and consistent across the modified files. The refactoring in vllm/model_executor/layers/quantization/mxfp4.py to use the modular kernel framework is a significant improvement. I've identified one high-severity performance issue related to object instantiation within the forward pass and provided a suggestion to address it.
```python
from vllm.model_executor.layers.fused_moe.flashinfer_cutlass_moe import (
    FlashInferExperts,
)
from vllm.model_executor.layers.fused_moe.prepare_finalize import (
    MoEPrepareAndFinalizeNoEP,
)
...
self.moe_quant_config = self.get_fused_moe_quant_config(layer)
assert self.moe_quant_config is not None
self.kernel = mk.FusedMoEModularKernel(
    MoEPrepareAndFinalizeNoEP(),
    FlashInferExperts(moe_config=self.moe, quant_config=self.moe_quant_config),
    shared_experts=None,
)
return self.kernel(
    hidden_states=x,
    w1=layer.w13_weight,
    w2=layer.w2_weight,
    topk_weights=topk_weights,
    topk_ids=topk_ids,
)
```
Creating the FusedMoEModularKernel on every forward pass introduces unnecessary overhead. It's better to initialize it once and cache it for subsequent calls. This can be done with lazy initialization within the apply method:
```python
if not hasattr(self, "_kernel"):
    self._kernel = None
if self._kernel is None:
    from vllm.model_executor.layers.fused_moe.flashinfer_cutlass_moe import (
        FlashInferExperts,
    )
    from vllm.model_executor.layers.fused_moe.prepare_finalize import (
        MoEPrepareAndFinalizeNoEP,
    )
    self._kernel = mk.FusedMoEModularKernel(
        MoEPrepareAndFinalizeNoEP(),
        FlashInferExperts(moe_config=self.moe, quant_config=self.moe_quant_config),
        shared_experts=None,
    )
```
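The caching pattern behind this suggestion can be shown in isolation. The class and method names below are illustrative stand-ins, not vllm's actual API; the expensive kernel object is built on the first call and reused afterwards:

```python
class MoELayer:
    """Illustrative stand-in for the quantization method object."""

    def __init__(self):
        self._kernel = None   # cached modular kernel, built lazily
        self.build_count = 0  # only for demonstrating single construction

    def _build_kernel(self):
        # stands in for the mk.FusedMoEModularKernel(...) construction
        self.build_count += 1
        return lambda x: x * 2

    def apply(self, x):
        # lazy initialization: construct once, reuse on every later call
        if self._kernel is None:
            self._kernel = self._build_kernel()
        return self._kernel(x)
```

Calling `apply` twice constructs the kernel exactly once, which is the overhead the review comment is targeting.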
/gemini review
Code Review
This pull request refactors the MXFP4 cutlass backend to use the modular kernel interface, which is a positive step towards better code organization and maintainability. The changes primarily involve moving backend-specific logic into dedicated classes and leveraging the FusedMoEModularKernel. While the overall direction is good, I've identified two critical issues that could lead to incorrect behavior due to the refactoring. One issue involves incorrect data type casting for weights on a specific backend path, and the other relates to a quantization parameter not being correctly propagated in the new generic implementation. Please address these points to ensure the correctness of the refactored code.
```python
fc1_expert_weights = w1.view(torch.long)
fc2_expert_weights = w2.view(torch.long)
if self.quant_dtype == "mxfp8":
    fake_input_scale = torch.ones(
        self.moe_config.num_experts, device=hidden_states.device
    )
    quant_scales = [
        self.w1_scale.view(torch.int32),
        fake_input_scale,
        self.w2_scale.view(torch.int32),
        fake_input_scale,
    ]
    use_mxfp8_act_scaling = True
else:
    assert hidden_states.dtype == torch.bfloat16
    quant_scales = [
        self.w1_scale,
        self.w2_scale,
    ]
    a1q_scale = None
    use_w4_group_scaling = True
```
The weight tensors w1 and w2 are unconditionally cast to torch.long. However, this cast should only occur when self.quant_dtype == 'mxfp8' (i.e., for the use_mxfp8_act_scaling=True path). For the else branch (use_w4_group_scaling=True), the weights should be passed as-is (as torch.uint8), as was done in the previous implementation. This incorrect casting can lead to runtime errors or incorrect computation in the kernel.
```python
if self.quant_dtype == "mxfp8":
    fc1_expert_weights = w1.view(torch.long)
    fc2_expert_weights = w2.view(torch.long)
    fake_input_scale = torch.ones(
        self.moe_config.num_experts, device=hidden_states.device
    )
    quant_scales = [
        self.w1_scale.view(torch.int32),
        fake_input_scale,
        self.w2_scale.view(torch.int32),
        fake_input_scale,
    ]
    use_mxfp8_act_scaling = True
else:
    fc1_expert_weights = w1
    fc2_expert_weights = w2
    assert hidden_states.dtype == torch.bfloat16
    quant_scales = [
        self.w1_scale,
        self.w2_scale,
    ]
    a1q_scale = None
    use_w4_group_scaling = True
```

```python
return self.kernel(
    hidden_states=x,
    w1=layer.w13_weight,
    w2=layer.w2_weight,
    topk_weights=topk_weights,
    topk_ids=topk_ids,
)
```
This refactoring seems to have introduced a potential issue. The previous implementation for the SM100_FI_MXFP4_MXFP8_CUTLASS backend called mxfp8_quantize(x, True, 32), with is_sf_swizzled_layout=True. The new modular kernel path, via MoEPrepareAndFinalizeNoEP and moe_kernel_quantize_input, effectively calls mxfp8_quantize with is_sf_swizzled_layout=False. This discrepancy might lead to incorrect behavior or performance degradation for this backend, as the activation scales will not have the swizzled layout expected by the kernel when weight scales are swizzled. Please ensure the is_sf_swizzled_layout flag is correctly propagated or handled for this backend.
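The shape of the fix being asked for can be sketched in miniature: thread the layout flag through the prepare/finalize object explicitly instead of relying on the quantizer's default. All names here are illustrative stand-ins, not vllm's actual API:

```python
def quantize_stub(x, is_sf_swizzled_layout=False):
    # Stand-in for mxfp8_quantize: returns the value together with the
    # scale-factor layout it actually used, so a mismatch with the
    # weight-scale layout becomes observable.
    layout = "swizzled" if is_sf_swizzled_layout else "linear"
    return x, layout

class PrepareFinalizeSketch:
    def __init__(self, act_scale_swizzled):
        # The flag is fixed at construction and propagated explicitly,
        # rather than silently falling back to the quantizer's default.
        self.act_scale_swizzled = act_scale_swizzled

    def prepare(self, x):
        return quantize_stub(x, is_sf_swizzled_layout=self.act_scale_swizzled)
```

A prepare path constructed with `act_scale_swizzled=True` then matches a kernel whose weight scales are swizzled, while the bare default call diverges silently.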
/gemini review
Code Review
This pull request refactors the MXFP4 CUTLASS backend for MoE layers to use the modular kernel framework, which improves code organization and maintainability. It also introduces a comprehensive testing infrastructure for GPQA evaluation, making the tests more robust and easier to configure. The changes are well-structured and the refactoring correctly encapsulates the kernel-specific logic within the FlashInferExperts class. The new testing setup is a great addition for ensuring correctness and performance on different hardware. Overall, this is a solid improvement to the codebase.
Hi @zyongye, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
```yaml
- label: GPQA Eval (GPT-OSS) (H200)
  timeout_in_minutes: 120
  device: h200
```
switch to h100 due to resource constraints (we have many more h100s in the ci)
ditto, I think H200 is only for 8xH200
```diff
@@ -635,6 +660,9 @@ def mxfp4_w4a16_moe_quant_config(
     w2_scale: Union[torch.Tensor, "PrecisionConfig"],
     w1_bias: torch.Tensor | None = None,
     w2_bias: torch.Tensor | None = None,
+    gemm1_alpha: torch.Tensor | None = None,
```
These are just hardcoded values right? (AFAICT they are: https://github.com/zyongye/vllm/blob/d7d68c3127bc27d97b20ceb901068e709c430bd5/vllm/model_executor/layers/quantization/mxfp4.py#L419-L430)
In that case, I think we should avoid passing these via the quant config and instead just keep these parameters in the kernel itself. This will make things clearer and reduce the surface of the API contract.
I moved them inside FlashInferExperts, into the init phase.
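The resolution above can be sketched generically: the hardcoded values become attributes materialized in the experts class's `__init__`, and the quant config keeps only genuinely per-checkpoint data. Constant values and all names here are illustrative placeholders, not the real ones from mxfp4.py:

```python
class FlashInferExpertsSketch:
    # Illustrative hardcoded activation parameters (placeholder values,
    # not the actual constants in vllm).
    _GEMM1_ALPHA = 1.0
    _GEMM1_LIMIT = 7.0

    def __init__(self, num_experts):
        # Materialized once at init; the quant config never carries them,
        # keeping the API contract surface small.
        self.gemm1_alpha = [self._GEMM1_ALPHA] * num_experts
        self.gemm1_limit = [self._GEMM1_LIMIT] * num_experts

def mxfp4_w4a16_moe_quant_config_sketch(w1_scale, w2_scale):
    # The config carries only checkpoint-dependent tensors/scales.
    return {"w1_scale": w1_scale, "w2_scale": w2_scale}
```

The design trade-off is the one the reviewer names: constants that are invariant for the kernel belong to the kernel, not to the cross-cutting config object.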
```diff
@@ -746,6 +746,30 @@ def _interleave_mxfp4_cutlass_sm90(w):
     layer.w2_weight_scale = torch.nn.Parameter(
         w2_scales_interleaved, requires_grad=False
     )
+
+    assert not self.moe.use_ep, (
```
I don't think this is needed. I think this kernel does support EP.
NOTE: the NoEP naming here is misleading. It should be NoDPEP.
The kernel interface actually dispatches to multiple kernels. It errors out when I run EP.
```diff
@@ -193,13 +193,10 @@ def _mxfp4_quantize(
 def _mxfp8_e4m3_quantize(
     A: torch.Tensor,
     A_scale: torch.Tensor | None,
-    per_act_token_quant: bool,
-    block_shape: list[int] | None = None,
+    is_sf_swizzled_layout: bool,
```
I think we should preserve the existing args to just avoid future footguns
I changed it back. Earlier I thought we should align this with the nxfp4 quantization function signature.
this looks good. minor nits other than the stuff about the gemm_alpha
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Purpose
Refactor the MXFP4 cutlass backend as part of the ongoing MoE refactor.
Also add testing infrastructure.
Test Plan
Run GPQA benchmarks with medium reasoning effort:
- gpt-oss-120b TP=2 on GB200 with the tested kernel enabled
- gpt-oss-120b TEP=2 on GB200 with the tested kernel enabled
- gpt-oss-20b TP=2 on H200 with the tested kernel enabled
Test command: follow the recipe.
Test Result
- GB200: GPQA with medium reasoning effort on 120b: 0.727. Matches the recipe.
- H200: GPQA with medium reasoning effort on 20b: 0.6641. Matches the recipe.