[UX] Add --moe-backend arg for explicit kernel selection #33807
vllm-bot merged 8 commits into vllm-project:main
Signed-off-by: mgoin <mgoin64@gmail.com>
Code Review
This pull request introduces a --moe-backend argument, allowing users to explicitly select a kernel for Mixture-of-Experts (MoE) models. The changes are well-implemented, propagating the new configuration from the command line down to the kernel selection logic in the MoE oracles for different quantization types (FP8, NvFP4, and unquantized).
My review includes a suggestion to improve the user experience by providing more specific error messages when a user-selected MoE backend is not available for the current configuration. This will help users debug their setups more effectively.
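The reviewer's suggestion about more specific error messages could look something like the following sketch. All names here (`MoeBackend`, `select_backend`, the `available` mapping) are hypothetical illustrations, not vLLM's actual API: the point is to report the requested backend and the concrete reason it is unavailable rather than failing generically.

```python
from enum import Enum


class MoeBackend(Enum):
    """Hypothetical enum of selectable MoE kernels (names from the PR description)."""
    TRITON = "triton"
    MARLIN = "marlin"
    FLASHINFER_TRTLLM = "flashinfer_trtllm"


def select_backend(requested: str, available: dict[str, str]) -> MoeBackend:
    """Return the requested backend, or raise with a specific reason.

    `available` maps a backend name to "" if usable, or to a reason string
    explaining why it cannot be used with the current configuration.
    """
    try:
        backend = MoeBackend(requested)
    except ValueError:
        valid = ", ".join(b.value for b in MoeBackend)
        raise ValueError(
            f"Unknown MoE backend {requested!r}; valid options: {valid}"
        )
    reason = available.get(backend.value, "not implemented for this platform")
    if reason:
        # Name the backend and the reason so users can debug their setup.
        raise ValueError(f"MoE backend {requested!r} is unavailable: {reason}")
    return backend
```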
Do you think we should use We could have it be more programmatic if we map directly to the Oracle and the Backends in the Oracle. Pros and cons of course.
I think it's valuable to have an overall
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: mgoin <mgoin64@gmail.com>
### What this PR does / why we need it?

Breaking changes:
- vllm-project/vllm#34102 `disable_full` param replaced with the valid_modes/invalid_modes API
- vllm-project/vllm#35503 Must now return a float `compilation_time`
- vllm-project/vllm#35564 New `sequence_lengths` param added
- vllm-project/vllm#33807 A check was added (`if runner_backend != "auto"`)
- vllm-project/vllm#34861 `BaseDeviceCommunicator` now accesses PyTorch's internal `pg_map` to check process group state
- vllm-project/vllm#35274

**Important change:**
- vllm-project/vllm#28672 `matcher_utils` directly accesses `torch.ops._C.*` during the import phase. In the Ascend environment, some unregistered ops trigger `AttributeError`, causing e2e initialization failure.
  https://github.com/vllm-project/vllm-ascend/actions/runs/22607260487/job/65502047131#step:10:2323
  https://github.com/vllm-project/vllm/blob/main/vllm/compilation/passes/fusion/matcher_utils.py#L29
  This PR adds temporary compatibility placeholders (rms_norm, fused_add_rms_norm, rotate_embedding, static/dynamic fp8 quant, silu_and_mul) to `vllm_ascend/patch/platform/patch_fusion_matcher_compat_ops.py` to ensure no crashes during the import phase. Upstream repairs will be considered later.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
- vLLM version: v0.16.0
- vLLM main: vllm-project/vllm@15d76f7

---------
Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Co-authored-by: Meihan-chen <jcccx.cmh@gmail.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
Co-authored-by: gcanlin <canlinguosdu@gmail.com>
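The compatibility-placeholder idea described above can be sketched generically. The namespace and helper below (`OpNamespace`, `install_placeholders`) are illustrative stand-ins, not vLLM's or vllm-ascend's real objects: before any module touches a possibly-unregistered op at import time, install a harmless placeholder so the attribute lookup itself doesn't raise `AttributeError`.

```python
# Sketch of the compatibility-placeholder pattern: if an op namespace
# (something like torch.ops._C) is missing some registered ops, install
# stand-ins so modules that merely reference them at import time don't
# crash with AttributeError. All names here are illustrative.

class OpNamespace:
    """Stands in for an op registry such as torch.ops._C."""


def install_placeholders(ns: object, op_names: list[str]) -> list[str]:
    """Add a placeholder for each missing op; return the names added."""
    added = []
    for name in op_names:
        if not hasattr(ns, name):
            def _placeholder(*args, _name=name, **kwargs):
                # Importing modules may reference this op; actually
                # calling it is still an error on this platform.
                raise NotImplementedError(f"placeholder op {_name} was called")
            setattr(ns, name, _placeholder)
            added.append(name)
    return added
```

With this pattern, import-time references like `ns.silu_and_mul` resolve successfully, while an accidental call still fails loudly.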
…ect#33807)
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Purpose
Adds a `--moe-backend` argument for explicit MoE kernel selection, allowing users to override the automatic backend selection logic (e.g., `--moe-backend triton`, `--moe-backend marlin`, `--moe-backend flashinfer_trtllm`).

- Supports all three oracle paths currently implemented: unquantized, FP8, and NVFP4.
- If a MoE backend is specified by the user and isn't valid for the given quantization format, it will error.
- Currently this doesn't cover CPU, XPU, etc., where only one backend is available per platform.
- Updated many of the e2e evaluation tests that used environment variables to select the MoE backend to use the new argument instead.
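As a rough sketch of how such a flag can be wired up, the argparse snippet below shows the general shape of the feature. This is illustrative code, not vLLM's actual CLI plumbing; only the backend names (`triton`, `marlin`, `flashinfer_trtllm`) and the flag name come from the PR description, and the `auto` default is an assumption.

```python
import argparse

# Illustrative sketch of a --moe-backend flag with an explicit choice list.
# This is not vLLM's real CLI code; names other than the flag and backend
# names are assumptions.
MOE_BACKENDS = ["auto", "triton", "marlin", "flashinfer_trtllm"]

parser = argparse.ArgumentParser()
parser.add_argument(
    "--moe-backend",
    choices=MOE_BACKENDS,
    default="auto",
    help="Explicitly select the MoE kernel backend instead of letting "
         "the oracle pick one automatically.",
)

args = parser.parse_args(["--moe-backend", "triton"])
print(args.moe_backend)  # -> triton
```

In a server invocation this would presumably look like `vllm serve <model> --moe-backend triton`, with an unrecognized value rejected up front by the choice list.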
Test Plan
Tested manually on a few models. Then we will trigger moe refactor CI to see if the arguments work there.
Test Result
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.