[UX] Add --moe-backend arg for explicit kernel selection #33807
vllm-bot merged 8 commits into vllm-project:main
Signed-off-by: mgoin <mgoin64@gmail.com>
Code Review
This pull request introduces a --moe-backend argument, allowing users to explicitly select a kernel for Mixture-of-Experts (MoE) models. The changes are well-implemented, propagating the new configuration from the command line down to the kernel selection logic in the MoE oracles for different quantization types (FP8, NvFP4, and unquantized).
My review includes a suggestion to improve the user experience by providing more specific error messages when a user-selected MoE backend is not available for the current configuration. This will help users debug their setups more effectively.
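The reviewer's suggestion about more specific error messages could look something like the following sketch. All names here (`MoeBackend`, `select_backend`, the `available` mapping) are hypothetical illustrations, not vLLM's actual API: the point is to report the requested backend and the concrete reason it is unavailable rather than failing generically.

```python
from enum import Enum


class MoeBackend(Enum):
    """Hypothetical enum of selectable MoE kernels (names from the PR description)."""
    TRITON = "triton"
    MARLIN = "marlin"
    FLASHINFER_TRTLLM = "flashinfer_trtllm"


def select_backend(requested: str, available: dict[str, str]) -> MoeBackend:
    """Return the requested backend, or raise with a specific reason.

    `available` maps a backend name to "" if usable, or to a reason string
    explaining why it cannot be used with the current configuration.
    """
    try:
        backend = MoeBackend(requested)
    except ValueError:
        valid = ", ".join(b.value for b in MoeBackend)
        raise ValueError(
            f"Unknown MoE backend {requested!r}; valid options: {valid}"
        )
    reason = available.get(backend.value, "not implemented for this platform")
    if reason:
        # Name the backend and the reason so users can debug their setup.
        raise ValueError(f"MoE backend {requested!r} is unavailable: {reason}")
    return backend
```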
Do you think we should use We could have it be more programmatic if we map directly to the Oracle and the Backends in the Oracle. Pros and cons of course.
I think it's valuable to have an overall
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: mgoin <mgoin64@gmail.com>
### What this PR does / why we need it?

Breaking changes:
- vllm-project/vllm#34102 `disable_full` param replaced with the valid_modes/invalid_modes API
- vllm-project/vllm#35503 Must now return a float `compilation_time`
- vllm-project/vllm#35564 New `sequence_lengths` param added
- vllm-project/vllm#33807 A check was added (`if runner_backend != "auto"`)
- vllm-project/vllm#34861 `BaseDeviceCommunicator` now accesses PyTorch's internal `pg_map` to check process group state
- vllm-project/vllm#35274

**Important change:**
- vllm-project/vllm#28672 `matcher_utils` directly accesses `torch.ops._C.*` during the import phase. In the Ascend environment, some unregistered ops trigger `AttributeError`, causing e2e initialization failure.
  https://github.com/vllm-project/vllm-ascend/actions/runs/22607260487/job/65502047131#step:10:2323
  https://github.com/vllm-project/vllm/blob/main/vllm/compilation/passes/fusion/matcher_utils.py#L29
  This PR adds temporary compatibility placeholders (rms_norm, fused_add_rms_norm, rotate_embedding, static/dynamic fp8 quant, silu_and_mul) to `vllm_ascend/patch/platform/patch_fusion_matcher_compat_ops.py` to ensure no crashes during the import phase. Upstream repairs will be considered later.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
- vLLM version: v0.16.0
- vLLM main: vllm-project/vllm@15d76f7

---------
Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Co-authored-by: Meihan-chen <jcccx.cmh@gmail.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
Co-authored-by: gcanlin <canlinguosdu@gmail.com>
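The compatibility-placeholder idea described above can be sketched generically. The namespace and helper below (`OpNamespace`, `install_placeholders`) are illustrative stand-ins, not vLLM's or vllm-ascend's real objects: before any module touches a possibly-unregistered op at import time, install a harmless placeholder so the attribute lookup itself doesn't raise `AttributeError`.

```python
# Sketch of the compatibility-placeholder pattern: if an op namespace
# (something like torch.ops._C) is missing some registered ops, install
# stand-ins so modules that merely reference them at import time don't
# crash with AttributeError. All names here are illustrative.

class OpNamespace:
    """Stands in for an op registry such as torch.ops._C."""


def install_placeholders(ns: object, op_names: list[str]) -> list[str]:
    """Add a placeholder for each missing op; return the names added."""
    added = []
    for name in op_names:
        if not hasattr(ns, name):
            def _placeholder(*args, _name=name, **kwargs):
                # Importing modules may reference this op; actually
                # calling it is still an error on this platform.
                raise NotImplementedError(f"placeholder op {_name} was called")
            setattr(ns, name, _placeholder)
            added.append(name)
    return added
```

With this pattern, import-time references like `ns.silu_and_mul` resolve successfully, while an accidental call still fails loudly.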
…ect#33807)
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Purpose
Adds a `--moe-backend` argument for explicit MoE kernel selection, allowing users to override the automatic backend selection logic (e.g., `--moe-backend triton`, `--moe-backend marlin`, `--moe-backend flashinfer_trtllm`).

- Supports all three oracle paths currently implemented: unquantized, FP8, and NVFP4.
- If a MoE backend is specified by the user and isn't valid for the given quantization format, it will error.
- Currently this doesn't cover CPU, XPU, etc., where only one backend is available per platform.
- Updated many of the e2e evaluation tests that used environment variables to select the MoE backend to use the new argument instead.
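As a rough sketch of how such a flag can be wired up, the argparse snippet below shows the general shape of the feature. This is illustrative code, not vLLM's actual CLI plumbing; only the backend names (`triton`, `marlin`, `flashinfer_trtllm`) and the flag name come from the PR description, and the `auto` default is an assumption.

```python
import argparse

# Illustrative sketch of a --moe-backend flag with an explicit choice list.
# This is not vLLM's real CLI code; names other than the flag and backend
# names are assumptions.
MOE_BACKENDS = ["auto", "triton", "marlin", "flashinfer_trtllm"]

parser = argparse.ArgumentParser()
parser.add_argument(
    "--moe-backend",
    choices=MOE_BACKENDS,
    default="auto",
    help="Explicitly select the MoE kernel backend instead of letting "
         "the oracle pick one automatically.",
)

args = parser.parse_args(["--moe-backend", "triton"])
print(args.moe_backend)  # -> triton
```

In a server invocation this would presumably look like `vllm serve <model> --moe-backend triton`, with an unrecognized value rejected up front by the choice list.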
Test Plan
Tested manually on a few models. Then we will trigger moe refactor CI to see if the arguments work there.
Test Result
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.