Background
We aim to optimize the code structure of MoE modules in SGLang to enhance extensibility. Currently, there are three main MoE modules: FusedMoE, EPMoE, and DeepEPMoE (along with recent additions such as FlashInferEPMoE and FlashInferFusedMoE at the time of writing). Their implementations suffer from several issues:
- Inconsistent logic flow. Computation logic varies across modules. For instance, FusedMoE computes select_experts within its forward function, while DeepEPMoE handles it externally. Similarly, some forward functions manage routed_scaling_factor internally, but others do not.
- Poor extensibility. We plan to support multiple all-to-all communication backends under EP (e.g., DeepEP and PPLX) and multiple grouped-GEMM backends (e.g., Triton, DeepGEMM, Triton Kernels, and FlashInfer MoE). The current design requires a dedicated forward function for each backend combination, leading to redundancy.
- Lengthy and duplicated code. Common variable combinations are repeated across functions. For example, over 10 MoE quantization methods each handle about 15 nearly identical inputs in their apply functions. DeepEP dispatch outputs (8 in total) are duplicated in multiple model files.
Design
To streamline the code structure, we will deprecate all MoE modules except FusedMoE and gradually merge existing functionalities into it. Below is an overview of the target code structure:
[input_hidden_states]
        |
        v
TopK.forward -> `select_experts` / `triton_kernels.routing` / bypass
        |
        v
[TopKOutput]
        |
        v
FusedMoE.forward -> Dispatcher.dispatch -> DeepEP / PPLX / bypass
        |                    |
        |                    v
        |            [DispatchOutput]
        |                    |
        |                    v
        |            quant_method.apply -> MoeRunner.forward
        |                    |                      |
        |                    |                      v
        |                    | pre-permute + grouped_gemm + post-permute
        |                    |                      |
        |                    |----------------------
        |                    v
        |            [CombineInput]
        |                    |
        |                    v
        |            Dispatcher.combine -> DeepEP / PPLX / bypass
        |                    |
        |---------------------
        v
[final_hidden_states]
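As a rough illustration of this flow, the Python sketch below strings the stages together. The class names follow the diagram, but the method signatures and the bypass dispatcher shown here are assumptions for illustration, not the finalized API.

```python
import torch


class BypassDispatcher:
    """Assumed no-a2a path: dispatch and combine are identity operations."""

    def dispatch(self, hidden_states, topk_output):
        # Stands in for [DispatchOutput] in the diagram.
        return hidden_states, topk_output

    def combine(self, combine_input):
        # Stands in for the final combine step.
        return combine_input


class FusedMoE(torch.nn.Module):
    """Skeleton of the target forward flow (signatures are illustrative)."""

    def __init__(self, quant_method, dispatcher):
        super().__init__()
        self.quant_method = quant_method  # drives MoeRunner.forward internally
        self.dispatcher = dispatcher      # DeepEP / PPLX / bypass

    def forward(self, hidden_states, topk_output):
        # TopK is computed outside the module (TopK.forward), so FusedMoE
        # only consumes its output.
        dispatch_output = self.dispatcher.dispatch(hidden_states, topk_output)
        # pre-permute + grouped_gemm + post-permute happen inside the runner.
        combine_input = self.quant_method.apply(dispatch_output)
        return self.dispatcher.combine(combine_input)
```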
In addition to existing arguments like --quantization, we will introduce --moe-a2a-backend and --moe-runner-backend to allow users to select the optimal dispatching and grouped-GEMM backends for their use cases.
If a developer wants to support a new backend, they only need to implement the Dispatcher or grouped-GEMM logic and define the input/output formats. A PermuteMethodPool will automatically select appropriate pre-permute and post-permute functions for layout conversions (if required). Developers can also register new permute functions for unsupported layouts. The TopK forward method will be automatically determined based on the backend arguments.
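A minimal sketch of what such a registry could look like is given below. The registration decorator, the lookup function, and the layout names are all hypothetical; the intent is only to show how a developer might plug in a permute function for an unsupported layout pair.

```python
from typing import Callable, Dict, Tuple

# Hypothetical pool keyed by (source_layout, target_layout); the real
# PermuteMethodPool interface may differ.
_PERMUTE_POOL: Dict[Tuple[str, str], Callable] = {}


def register_permute(src_layout: str, dst_layout: str):
    """Register a layout-conversion function for a (src, dst) pair."""

    def decorator(fn: Callable) -> Callable:
        _PERMUTE_POOL[(src_layout, dst_layout)] = fn
        return fn

    return decorator


def get_permute(src_layout: str, dst_layout: str) -> Callable:
    """Look up the pre-/post-permute function, failing loudly if missing."""
    if (src_layout, dst_layout) not in _PERMUTE_POOL:
        raise NotImplementedError(
            f"no permute registered for {src_layout} -> {dst_layout}"
        )
    return _PERMUTE_POOL[(src_layout, dst_layout)]


# Example registration for an assumed layout pair (names are placeholders).
@register_permute("deepep_dispatch", "deepgemm_contiguous")
def _deepep_to_deepgemm(dispatch_output):
    # The actual layout conversion (token permute/scatter) would go here.
    return dispatch_output
```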
Tasks
The refactoring process is divided into three stages around MoeRunner.forward: preparation, implementation, and adoption.
Stage 1: Preparation
This stage focuses on unifying computation structures across all MoE modules and their forward functions, while wrapping dependent variables for better organization. The main tasks are:
- Move select_experts computations outside MoE modules. ([1/N] MoE Refactor: refactor select_experts #7966)
- Move routed_scaling_factor multiplications inside MoE modules.
- Unify the Triton kernels of FusedMoE and EPMoE. ([4/N] MoE Refactor: Unified Triton Kernel for FusedMoE and EPMoE #8515)
- Wrap TopK inputs (use_grouped_topk, renormalize) and TopK output. ([1/N] MoE Refactor: refactor select_experts #7966)
- Introduce --moe-a2a-backend. ([5/N] MoE Refactor: Update MoE parallelism arguments #8658)
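For the wrapping items above, a hedged sketch of what the containers might look like is shown below. Only use_grouped_topk and renormalize come from the task list; the remaining fields and the plain softmax top-k are assumptions, and grouped top-k is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class TopKConfig:
    """Bundles the TopK arguments that are currently passed individually."""
    top_k: int
    use_grouped_topk: bool = False
    renormalize: bool = True
    num_expert_group: Optional[int] = None  # assumed field
    topk_group: Optional[int] = None        # assumed field


@dataclass
class TopKOutput:
    """Single object handed from TopK.forward to FusedMoE.forward."""
    topk_weights: torch.Tensor
    topk_ids: torch.Tensor


def select_experts(router_logits: torch.Tensor, config: TopKConfig) -> TopKOutput:
    # Plain softmax top-k; the grouped variant controlled by use_grouped_topk
    # is omitted here.
    probs = torch.softmax(router_logits, dim=-1)
    topk_weights, topk_ids = torch.topk(probs, config.top_k, dim=-1)
    if config.renormalize:
        topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
    return TopKOutput(topk_weights=topk_weights, topk_ids=topk_ids)
```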
Stage 2: Implementation
In this stage, we will implement the MoeRunner framework. The main tasks are:
- Clean up MoE-related configs (activation, no_combine). ([6/N] MoE Refactor: Cleanup MoE-related configs #8849)
- Implement the new framework and wrap the remaining quantization inputs (e.g., input_scale). ([7/N] MoE Refactor: the implementation of new framework #9269)
- Introduce --moe-runner-backend.
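To make the intended shape of the framework concrete, here is a hedged sketch. The registry, the backend name, and the forward signature are assumptions chosen to mirror the pre-permute + grouped_gemm + post-permute structure from the design diagram.

```python
from typing import Callable, Dict

# Hypothetical mapping from --moe-runner-backend values to grouped-GEMM
# implementations; the names are placeholders.
_RUNNER_BACKENDS: Dict[str, Callable] = {}


def register_runner_backend(name: str):
    def decorator(fn: Callable) -> Callable:
        _RUNNER_BACKENDS[name] = fn
        return fn

    return decorator


class MoeRunner:
    """Owns pre-permute + grouped_gemm + post-permute for one backend."""

    def __init__(self, backend: str, pre_permute=None, post_permute=None):
        self.grouped_gemm = _RUNNER_BACKENDS[backend]
        # In the real design these would be resolved by the PermuteMethodPool.
        self.pre_permute = pre_permute or (lambda x: x)
        self.post_permute = post_permute or (lambda x: x)

    def forward(self, dispatch_output, weights, quant_args):
        x = self.pre_permute(dispatch_output)
        x = self.grouped_gemm(x, weights, quant_args)
        return self.post_permute(x)  # becomes the CombineInput


@register_runner_backend("triton")
def _triton_grouped_gemm(x, weights, quant_args):
    # Placeholder for the Triton grouped-GEMM kernel invocation.
    raise NotImplementedError
```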
Stage 3: Adoption
The third stage gradually adopts the new framework and replaces existing implementations with the unified structure. This incremental approach allows new grouped-GEMM backends to be merged during refactoring, as long as they are functional and non-invasive.
For MoE backends implemented in quantization files, we need to check the apply method (or apply_with_router_logits / apply_without_routing_weights) and distribute the implementation to the corresponding MoE backend files. Here is the tentative plan for reorganizing the current implementation:
- awq.py -> marlin.py (Refactor Marlin MoeRunner #14554); triton.py ([7/N] MoE Refactor: the implementation of new framework #9269)
- fp8.py -> intel_amx.py; aiter.py; cutlass.py (Refactor Cutlass MoE runner integration #12023); triton.py ([7/N] MoE Refactor: the implementation of new framework #9269); flashinfer_trtllm.py (MoE Refactor: Refactor fp8.py -> flashinfer_trllm.py #15151)
- gptq.py -> marlin.py (Refactor Marlin MoeRunner #14554)
- modelopt_quant.py -> triton.py ([7/N] MoE Refactor: the implementation of new framework #9269); flashinfer_trtllm.py (MoE Refactor: Refactor modelopt_quant.py -> flashinfer_trllm.py #16685); flashinfer_cutlass.py; cutlass.py (Refactor Cutlass MoE runner integration #12023); flashinfer_cutedsl.py
- moe_wna16.py -> triton.py ([7/N] MoE Refactor: the implementation of new framework #9269)
- mxfp4.py -> flashinfer_trtllm.py; triton_kernels.py (Refactor Triton-kernel MoE runner integration #11795); triton.py ([7/N] MoE Refactor: the implementation of new framework #9269); aiter.py
- unquant.py -> triton_kernels.py (Refactor Triton-kernel MoE runner integration #11795); aiter.py; triton.py ([7/N] MoE Refactor: the implementation of new framework #9269); intel_amx.py; torch_native.py; npu.py; cutlass.py (Refactor Cutlass MoE runner integration #12023)
- w8a8_fp8.py -> triton.py ([7/N] MoE Refactor: the implementation of new framework #9269)
- w8a8_int8.py -> intel_amx.py; triton.py ([7/N] MoE Refactor: the implementation of new framework #9269); npu.py

Some MoE backends are implemented as separate NN modules. Their implementations should be split into the corresponding MoE backend and quantization files:
- FlashInferFusedMoE.forward -> flashinfer_trtllm.py + fp8.py
- FlashInferFP4MoE.forward -> flashinfer_trtllm.py + modelopt_quant.py
- EPMoE.forward_deepgemm -> deep_gemm.py + fp8.py ([8/N] MoE Refactor: deprecate EPMoE #11211)
- DeepEPMoE.forward_* -> deep_gemm.py + fp8.py ([10/N] MoE Refactor: reorganize deepgemm runner in DeepEPMoE #12054); aiter.py + fp8.py; flashinfer_cutedsl.py + modelopt_quant.py; npu.py + fp8.py
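To illustrate the intended split, the sketch below shows how a quantization method's apply could delegate to a shared runner after the reorganization. The class name, attribute names, and runner interface are assumptions, not the actual fp8.py code.

```python
class Fp8MoEMethod:
    """Hypothetical post-refactor shape of a quant method's apply."""

    def __init__(self, runner):
        self.runner = runner  # e.g., a Triton- or DeepGEMM-backed MoeRunner

    def apply(self, layer, dispatch_output):
        # Quant-specific preparation (scales, weight formats) stays in the
        # quantization file ...
        quant_args = {
            "w13_scale": getattr(layer, "w13_weight_scale", None),
            "w2_scale": getattr(layer, "w2_weight_scale", None),
        }
        # ... while the grouped GEMM itself lives in the MoE backend file
        # (e.g., triton.py) behind the shared runner interface.
        return self.runner.forward(
            dispatch_output,
            weights=(layer.w13_weight, layer.w2_weight),
            quant_args=quant_args,
        )
```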