
[UX] Add --moe-backend arg for explicit kernel selection#33807

Merged
vllm-bot merged 8 commits intovllm-project:mainfrom
neuralmagic:add-moe-backend-arg
Feb 26, 2026
Conversation

@mgoin (Member) commented Feb 4, 2026

Purpose

Adds a --moe-backend argument for explicit MoE kernel selection, allowing users to override the automatic backend selection logic (e.g., --moe-backend triton, --moe-backend marlin, --moe-backend flashinfer_trtllm).

Supports all three oracle paths currently implemented: unquantized, FP8, and NVFP4.
If the user specifies a MoEBackend that isn't valid for the given quantization format, it will error. This does not yet cover CPU, XPU, etc., where only one backend is available per platform.

Updated many of the e2e evaluation tests that used environment variables to select the MoE backend to use the new argument instead.

Test Plan

Tested manually on a few models. Then we will trigger the MoE refactor CI to see if the arguments work there.

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: mgoin <mgoin64@gmail.com>
@gemini-code-assist gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces a --moe-backend argument, allowing users to explicitly select a kernel for Mixture-of-Experts (MoE) models. The changes are well-implemented, propagating the new configuration from the command line down to the kernel selection logic in the MoE oracles for different quantization types (FP8, NvFP4, and unquantized).

My review includes a suggestion to improve the user experience by providing more specific error messages when a user-selected MoE backend is not available for the current configuration. This will help users debug their setups more effectively.

Comment thread vllm/model_executor/layers/fused_moe/oracle/unquantized.py Outdated
@robertgshaw2-redhat (Collaborator) commented

Do you think we should use moe-backend, or fp8-moe-backend + nvfp4-moe-backend + mxfp8-moe-backend?

We could make it more programmatic if we map directly to the Oracle and the Backends in the Oracle.

Pros and cons of course.

@mgoin (Member, Author) commented Feb 8, 2026

I think it's valuable to have an overall moe-backend for sure, since we have so many backends that are effectively shared across many precisions. We already have a2a-backend, for example.
Since behaviorally I would like moe-backend=flashinfer-trtllm to fail if one of the MoE layers cannot use that backend (i.e., non-uniform quantization), I definitely see the value of per-precision specification, but I'd like that to be an optional level of specificity. We could either do that as separate args like you proposed, or allow moe-backend to also take a dict config like moe-backend={'fp8':'cutlass', 'nvfp4':'marlin'}. Either way, I see that as follow-up work if we agree on a general selection first.

Comment thread vllm/config/parallel.py Outdated
@mergify mergify bot (Contributor) commented Feb 11, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mgoin.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 11, 2026
Signed-off-by: mgoin <mgoin64@gmail.com>
@mergify mergify bot removed the needs-rebase label Feb 11, 2026
@mergify mergify bot (Contributor) commented Feb 18, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mgoin.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 18, 2026
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
@mergify mergify bot removed the needs-rebase label Feb 20, 2026
Signed-off-by: mgoin <mgoin64@gmail.com>
@mergify mergify bot added the nvidia label Feb 20, 2026
@mgoin mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 20, 2026
Comment thread vllm/model_executor/layers/fused_moe/config.py Outdated
Comment thread vllm/config/parallel.py Outdated
Signed-off-by: mgoin <mgoin64@gmail.com>
@vllm-bot vllm-bot merged commit de527e1 into vllm-project:main Feb 26, 2026
62 of 65 checks passed
@github-project-automation github-project-automation bot moved this to Done in NVIDIA Feb 26, 2026
@mgoin mgoin deleted the add-moe-backend-arg branch February 26, 2026 01:44
wangxiyuan pushed a commit to vllm-project/vllm-ascend that referenced this pull request Mar 6, 2026
### What this PR does / why we need it?
break:
- vllm-project/vllm#34102 
Disable_full param replaced with valid_modes/invalid_modes API
- vllm-project/vllm#35503
Now must return float compilation_time
- vllm-project/vllm#35564
New sequence_lengths param added
- vllm-project/vllm#33807
An explicit check was added (if runner_backend != "auto")
- vllm-project/vllm#34861
`BaseDeviceCommunicator` now accesses PyTorch's internal `pg_map` to
check process group state
- vllm-project/vllm#35274

**Important change:**
- vllm-project/vllm#28672

`matcher_utils` directly accesses `torch.ops._C.*` during the import
phase. In the Ascend environment, some unregistered ops trigger
`AttributeError`, causing e2e initialization failure.

https://github.com/vllm-project/vllm-ascend/actions/runs/22607260487/job/65502047131#step:10:2323

https://github.com/vllm-project/vllm/blob/main/vllm/compilation/passes/fusion/matcher_utils.py#L29

This PR adds temporary compatibility placeholders (rms_norm,
fused_add_rms_norm, rotate_embedding, static/dynamic fp8 quant,
silu_and_mul) to
`vllm_ascend/patch/platform/patch_fusion_matcher_compat_ops.py` to
ensure no crashes during the import phase. Upstream repairs will be
considered later.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.16.0
- vLLM main:
vllm-project/vllm@15d76f7

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Co-authored-by: Meihan-chen <jcccx.cmh@gmail.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
Co-authored-by: gcanlin <canlinguosdu@gmail.com>
LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Mar 7, 2026
(same commit message as above)
Copilot AI pushed a commit to machov/vllm that referenced this pull request Mar 10, 2026
…ect#33807)

Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>

Labels

nvidia ready ONLY add when PR is ready to merge/full CI is needed

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants