[Refactor] Move FusedMoE hidden_size roundup to quant_method#34285
vllm-bot merged 8 commits into vllm-project:main
Conversation
/gemini review
Code Review
This pull request refactors the logic for rounding up the hidden_size in FusedMoE layers, moving the responsibility from the generic FusedMoE layer to the specific quantization methods. This is a good architectural improvement. My main feedback is about code duplication and a potential bug in QuarkOCP_MX_MoEMethod where the roundup logic is applied unconditionally for gpt_oss models, even for non-MXFP4 quantization types.
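The concern above can be illustrated with a minimal sketch. All names here are hypothetical and do not match vLLM's actual internals; the point is only the pattern: the roundup should be gated on the quantization type rather than applied unconditionally for gpt_oss models.

```python
# Hypothetical sketch of gating the roundup on the quant type.
# Function names, dtype strings, and the alignment value are
# illustrative assumptions, not vLLM's actual code.
def round_up(x: int, multiple: int) -> int:
    """Round x up to the nearest multiple."""
    return ((x + multiple - 1) // multiple) * multiple

def maybe_round_hidden_size(
    hidden_size: int,
    quant_dtype: str,
    is_gpt_oss: bool,
    alignment: int = 256,  # platform-dependent; the thread leaves 128 vs 256 open
) -> int:
    # Only MXFP4 weights need the padded hidden size; other quant
    # types for gpt_oss models keep the original dimension.
    if is_gpt_oss and quant_dtype == "mxfp4":
        return round_up(hidden_size, alignment)
    return hidden_size
```

With this guard, a gpt_oss model quantized with a non-MXFP4 scheme is left untouched instead of being padded unconditionally.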
Code Review
This pull request refactors the logic for rounding up the hidden size in FusedMoE layers by moving it from the generic layer.py to the specific quant_method implementations. This is a good architectural improvement, as it places quantization-specific logic where it belongs. The changes in fused_moe_method_base.py and layer.py are correct. However, this refactoring has introduced code duplication in mxfp4.py and quark_moe.py for handling gpt_oss models. I've added comments with suggestions to address this.
/gemini review
Code Review
This pull request refactors the hidden_size roundup logic for FusedMoE layers by moving it into the quant_method. This is a good architectural improvement as it localizes quantization-specific logic. The changes are well-structured. I've found one issue where a function is called with incorrect arguments, which I've detailed in a specific comment.
This pull request has merge conflicts that must be resolved before it can be merged.
FYI #32307 might be relevant; I'm not sure what the pad size for gpt-oss on MI300 should be, i.e. 128 or 256. This needs further investigation; I haven't had time to run proper perf unfortunately.
8b8fcbd to 6e4c34c
/gemini review
Code Review
This pull request refactors the hidden_size and intermediate_size rounding logic in the FusedMoE layer by moving it into the quant_method. This is a significant improvement in maintainability as it centralizes quantization-specific alignment requirements (especially for MXFP4 backends) within the quantization methods themselves, rather than having brittle model-type checks in the core layer logic. The changes ensure that both the moe_config and the actual weight tensors are created with consistent, correctly padded dimensions across different hardware platforms and quantization schemes.
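As a rough illustration of the refactor's shape (class and method names below are hypothetical, not vLLM's actual interface): the base quant method exposes a hook that the layer calls before sizing its config and weights, so each quantization method owns its own alignment requirement instead of the layer carrying model-type checks.

```python
# Hypothetical sketch of moving the roundup hook into the quant method;
# all identifiers are illustrative, not vLLM's real API.
class FusedMoEMethodBase:
    def maybe_roundup_hidden_size(self, hidden_size: int) -> int:
        # Default: no padding required.
        return hidden_size

class Mxfp4MoEMethod(FusedMoEMethodBase):
    # Backend-specific alignment; the thread notes 128 vs 256 on MI300
    # is still under investigation, so 128 here is just an assumption.
    ALIGNMENT = 128

    def maybe_roundup_hidden_size(self, hidden_size: int) -> int:
        a = self.ALIGNMENT
        return ((hidden_size + a - 1) // a) * a

class FusedMoELayer:
    def __init__(self, hidden_size: int, quant_method: FusedMoEMethodBase):
        # The layer defers the alignment decision to the quant method,
        # so moe_config and weight tensors see one consistent padded size.
        self.hidden_size = quant_method.maybe_roundup_hidden_size(hidden_size)
```

The design benefit is that adding a new quantization backend with its own alignment needs only an override of the hook, with no changes to the core layer.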
This is a nice simplification. I wonder if we can go even further by making the layer just unaware of the hidden size / intermediate size? WDYT?
I think it should be doable; see #34285 (comment), unless there are other use cases of
Hi @BowenBao, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
daa5f22 to 7a71c20
Hi @BowenBao, the pre-commit checks have failed. Please run:
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
Signed-off-by: Bowen Bao <bowenbao@amd.com>
On latest commit, MiniMax TP4 GSM8K results:
gpt-oss-120b TP4 GPQA: 65.72%

If there are no more concerns, @tjtanaa could you approve the PR?
Kernels (B200) is a known test failure as of 3/25 nightly: https://buildkite.com/vllm/ci/builds/58103/steps/canvas?sid=019d26cd-7ecd-40ff-b3d4-b07dadd7578b&tab=output
@gshtras done
Update ROCm padding logic to improve performance on MI300X. Thanks for the suggestion and evaluation from @Rohan138. Moved to separate PR: [ROCm] gpt-oss fusion/padding fixes #38043