
[6/N] MoE Refactor: Cleanup MoE-related configs #8849

Merged

ch-wan merged 84 commits into main from cheng/refactor/ep-framework on Aug 15, 2025

Conversation

@ch-wan (Collaborator) commented Aug 6, 2025

Motivation

  • Adding --moe-runner-backend and deprecating --enable-triton-kernel-moe, --enable-flashinfer-cutlass-moe, and --enable-flashinfer-trtllm-moe (a sketch of the deprecation mapping follows this list).
  • Adding TopKOutputChecker and DispatchOutputChecker to satisfy pylint's type checks.
  • Adding utility functions so MoE-related logic no longer reads global_server_args directly.
  • Adding MoeRunnerConfig to encapsulate MoE runner settings.
  • Some minor cleanup.
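
For context, a minimal sketch of how the deprecated boolean flags might be mapped onto the new unified option inside ServerArgs.__post_init__. This is not the actual sglang code; the backend string values ("triton_kernel", "flashinfer_cutlass", "flashinfer_trtllm") and the field defaults are illustrative assumptions.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class ServerArgs:
    # Hypothetical fields for illustration only.
    moe_runner_backend: str = "auto"
    # Deprecated flags, kept only for backward compatibility.
    enable_triton_kernel_moe: bool = False
    enable_flashinfer_cutlass_moe: bool = False
    enable_flashinfer_trtllm_moe: bool = False

    def __post_init__(self):
        # Map each deprecated boolean onto the unified backend choice and warn.
        deprecated = {
            "enable_triton_kernel_moe": "triton_kernel",
            "enable_flashinfer_cutlass_moe": "flashinfer_cutlass",
            "enable_flashinfer_trtllm_moe": "flashinfer_trtllm",
        }
        for flag, backend in deprecated.items():
            if getattr(self, flag):
                logger.warning(
                    "--%s is deprecated; use --moe-runner-backend %s instead.",
                    flag.replace("_", "-"), backend,
                )
                self.moe_runner_backend = backend
```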

Modifications

Accuracy Test

Benchmark & Profiling

Checklist

@gemini-code-assist (Contributor)

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@ch-wan ch-wan marked this pull request as draft August 6, 2025 06:12
@ch-wan ch-wan changed the title from "[5/N] MoE Refactor: Cleanup MoE-related configs" to "[6/N] MoE Refactor: Cleanup MoE-related configs" on Aug 6, 2025
@ch-wan ch-wan marked this pull request as ready for review August 6, 2025 06:36
@gemini-code-assist (Bot) left a comment


Code Review

This pull request is a large-scale refactoring of MoE-related configurations. It introduces MoeRunnerConfig and TopKConfig to encapsulate parameters, adds a new --moe-runner-backend argument to unify several flags, and centralizes configuration logic to avoid direct use of global state. These changes significantly improve code clarity, maintainability, and organization. The implementation appears correct and consistent with the stated goals. I've found one potential issue with in-place modification of a configuration object, which could lead to unexpected behavior.

Comment thread on python/sglang/srt/layers/moe/topk.py (outdated)
@gemini-code-assist (Contributor)

Summary of Changes

This pull request significantly refactors the Mixture-of-Experts (MoE) related configurations and their usage throughout the codebase. My primary goal was to centralize MoE runner settings into a new MoeRunnerConfig dataclass and introduce a unified --moe-runner-backend argument, deprecating several older, specific flags. This change streamlines how MoE backends are selected and configured. Additionally, I've cleaned up direct accesses to global server arguments by introducing dedicated utility functions, improving modularity and maintainability. The changes also include the introduction of TopKConfig for clearer expert selection parameters and new checker classes for output format validation.
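
The getter-based access pattern described here could look roughly like the sketch below. Only the function and class names (MoeA2ABackend, MoeRunnerBackend, initialize_moe_config, get_moe_a2a_backend, get_moe_runner_backend) are taken from this PR; the enum members, backend values, and the initialize_moe_config signature are assumptions.

```python
from enum import Enum
from typing import Optional


class MoeA2ABackend(Enum):
    STANDARD = "standard"
    DEEPEP = "deepep"

    def is_standard(self) -> bool:
        return self == MoeA2ABackend.STANDARD


class MoeRunnerBackend(Enum):
    AUTO = "auto"
    TRITON_KERNEL = "triton_kernel"
    FLASHINFER_CUTLASS = "flashinfer_cutlass"
    FLASHINFER_TRTLLM = "flashinfer_trtllm"


# Module-level state, populated once instead of reading global_server_args_dict.
_MOE_A2A_BACKEND: Optional[MoeA2ABackend] = None
_MOE_RUNNER_BACKEND: Optional[MoeRunnerBackend] = None


def initialize_moe_config(server_args) -> None:
    """Populate the module-level MoE config once, e.g. from the scheduler."""
    global _MOE_A2A_BACKEND, _MOE_RUNNER_BACKEND
    _MOE_A2A_BACKEND = MoeA2ABackend(server_args.moe_a2a_backend)
    _MOE_RUNNER_BACKEND = MoeRunnerBackend(server_args.moe_runner_backend)


def get_moe_a2a_backend() -> MoeA2ABackend:
    assert _MOE_A2A_BACKEND is not None, "initialize_moe_config() was not called"
    return _MOE_A2A_BACKEND


def get_moe_runner_backend() -> MoeRunnerBackend:
    assert _MOE_RUNNER_BACKEND is not None, "initialize_moe_config() was not called"
    return _MOE_RUNNER_BACKEND
```

Call sites can then write, for example, `if get_moe_a2a_backend().is_standard(): ...` instead of looking up global_server_args_dict.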

Highlights

  • Consolidated MoE Backend Selection: I've introduced a new --moe-runner-backend command-line argument that centralizes the selection of MoE backend implementations. This replaces the older, more specific flags like --enable-triton-kernel-moe, --enable-flashinfer-cutlass-moe, and --enable-flashinfer-trtllm-moe, simplifying configuration.
  • Centralized MoE Runner Configuration: I've created a MoeRunnerConfig dataclass to encapsulate all MoE runner-related configurations, such as activation type, in-place operations, and routed scaling factors. This improves code organization and makes it easier to pass MoE settings consistently across different functions and layers. A minimal sketch of MoeRunnerConfig and TopKConfig follows this list.
  • Improved MoE Configuration Access: I've refactored MoE-related logic to reduce direct dependencies on the global_server_args_dict. Instead, new utility functions like get_moe_a2a_backend, get_moe_runner_backend, and get_deepep_mode are now used, promoting cleaner code and better separation of concerns.
  • Structured Top-K Configuration: I've introduced a TopKConfig dataclass to specifically manage parameters related to top-k routing in MoE layers. This makes the configuration of expert selection more explicit and maintainable.
  • Pylint Compliance and Type Checking: To address pylint warnings and improve code quality, I've added TopKOutputChecker and DispatchOutputChecker classes. These provide static methods for type-safe validation of output formats, enhancing code robustness.
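
As a rough reference, the following is an illustrative-only sketch of the two dataclasses, assembled from the field names mentioned in this summary; the real definitions in moe_runner.py and topk.py may differ in fields, defaults, and types.

```python
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class MoeRunnerConfig:
    # Fields taken from names mentioned in this PR; defaults are assumptions.
    activation: str = "silu"
    inplace: bool = True
    no_combine: bool = False
    apply_router_weight_on_input: bool = False
    routed_scaling_factor: Optional[float] = None
    alpha: Optional[float] = None   # replaces activation_alpha
    limit: Optional[float] = None   # replaces swiglu_limit


@dataclass
class TopKConfig:
    # Expert-selection parameters; exact field set is an assumption.
    top_k: int = 1
    renormalize: bool = True
    use_grouped_topk: bool = False
    num_expert_group: Optional[int] = None
    topk_group: Optional[int] = None
    num_fused_shared_experts: int = 0
    correction_bias: Optional[torch.Tensor] = None
```
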
Changelog
  • benchmark/kernels/fused_moe_triton/tuning_fused_moe_triton.py
    • Imported override_config, MoeRunnerConfig, and TopKConfig.
    • Updated calls to select_experts to utilize the new TopKConfig object.
    • Modified the fused_moe function call to pass a moe_runner_config object instead of an inplace boolean.
  • docs/advanced_features/server_arguments.md
    • Replaced deprecated command-line arguments (--enable-flashinfer-cutlass-moe, --enable-flashinfer-trtllm-moe, --enable-triton-kernel-moe) with a single, unified --moe-runner-backend argument.
    • Updated the description for the --ep-dispatch-algorithm argument.
  • python/sglang/bench_one_batch.py
    • Removed imports for DeepEPMode and MoeA2ABackend.
    • Eliminated parameters related to enable_two_batch_overlap, enable_deepep_moe, and deepep_mode from ForwardBatch initialization.
  • python/sglang/srt/eplb/expert_distribution.py
    • Removed the import of global_server_args_dict.
  • python/sglang/srt/layers/communicator.py
    • Changed import torch.distributed to import torch.
    • Imported get_moe_a2a_backend.
    • Updated the _compute_mlp_mode function to use get_moe_a2a_backend().is_standard() for checking scatter mode.
  • python/sglang/srt/layers/moe/__init__.py
    • Added a new __init__.py file to the moe directory, exposing new MoE-related classes and utility functions for centralized access.
  • python/sglang/srt/layers/moe/ep_moe/layer.py
    • Imported new MoE utility functions: get_deepep_mode, get_moe_a2a_backend, get_moe_runner_backend, and should_use_flashinfer_trtllm_moe.
    • Removed direct imports of DeepEPMode, Fp8MoEMethod, and get_tile_tokens_dim.
    • Refactored __init__ parameters, replacing activation_alpha with alpha, swiglu_limit with limit, and removing tp_size and deepep_mode.
    • Updated internal logic to consistently use self.moe_runner_config.activation and self.moe_runner_config.routed_scaling_factor.
    • Modified deepep_mode assignment to retrieve its value via get_deepep_mode().
    • Updated moe_impl to leverage DispatchOutputChecker for robust format validation.
    • Adjusted get_moe_impl_class to dynamically determine the MoE implementation based on get_moe_a2a_backend() and get_moe_runner_backend().
  • python/sglang/srt/layers/moe/fused_moe_native.py
    • Imported MoeRunnerConfig and StandardTopKOutput.
    • Updated fused_moe_forward_native and moe_forward_native function signatures to accept a moe_runner_config object, centralizing MoE runner parameters.
  • python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py
    • Imported MoeRunnerConfig and StandardTopKOutput.
    • Updated function signatures to consistently use moe_runner_config and StandardTopKOutput.
    • Replaced activation_alpha with alpha and swiglu_limit with limit for activation parameters.
    • Ensured all relevant logic now accesses MoE runner properties (e.g., inplace, no_combine, activation, apply_router_weight_on_input, routed_scaling_factor, alpha, limit) directly from the moe_runner_config object.
  • python/sglang/srt/layers/moe/fused_moe_triton/layer.py
    • Removed unused imports such as datetime, glob, os, and sys.
    • Imported MoeRunnerConfig, get_moe_runner_backend, and TopKOutputChecker.
    • Initialized self.moe_runner_config directly from constructor arguments, centralizing MoE runner configuration.
    • Eliminated redundant direct attributes like activation_alpha, swiglu_limit, routed_scaling_factor, activation, apply_router_weight_on_input, inplace, and no_combine, as these are now encapsulated within moe_runner_config.
    • Updated checks for use_triton_kernels, use_flashinfer_mxfp4_moe, and enable_flashinfer_cutlass_moe to utilize get_moe_runner_backend().
    • Refactored the forward method to use TopKOutputChecker and moe_runner_config for improved clarity and consistency.
    • Streamlined the __init__ and forward methods of FlashInferFusedMoE and FlashInferFP4MoE by leveraging the new moe_runner_config and TopKOutputChecker.
  • python/sglang/srt/layers/moe/fused_moe_triton/triton_kernels_moe.py
    • Imported MoeRunnerConfig.
    • Updated triton_kernel_moe_forward and triton_kernel_moe_with_bias_forward to accept a moe_runner_config object.
    • Applied TopKOutputChecker for asserting the format of topk_output.
    • Ensured activation, alpha, and limit properties are now accessed via moe_runner_config.
  • python/sglang/srt/layers/moe/moe_runner.py
    • Added a new file defining the MoeRunnerConfig dataclass, which centralizes MoE runner parameters for better organization.
  • python/sglang/srt/layers/moe/token_dispatcher/__init__.py
    • Imported DispatchOutputChecker and StandardDispatchOutput.
    • Added AscendDeepEPLLOutput to the __all__ export list.
  • python/sglang/srt/layers/moe/token_dispatcher/base_dispatcher.py
    • Removed the MoeA2ABackend enum, as it has been relocated to utils.py.
    • Introduced the DispatchOutputChecker dataclass, providing static methods for type-safe validation of DispatchOutput formats.
    • Converted DispatchOutputFormat enum values to uppercase for consistency.
  • python/sglang/srt/layers/moe/token_dispatcher/deepep.py
    • Imported DeepEPMode, get_deepep_config, and is_tbo_enabled from sglang.srt.layers.moe.
    • Updated the format property of DeepEPNormalOutput, DeepEPLLOutput, and AscendDeepEPLLOutput to use uppercase enum values.
    • Modified get_deepep_buffer to use default values for num_max_dispatch_tokens_per_rank and num_experts, and to check is_tbo_enabled().
    • Updated DeepEPConfig to retrieve its configuration using get_deepep_config().
  • python/sglang/srt/layers/moe/token_dispatcher/standard.py
    • Updated the format property of StandardDispatchOutput to use an uppercase enum value.
  • python/sglang/srt/layers/moe/topk.py
    • Introduced the TopKConfig dataclass to encapsulate top-k routing parameters.
    • Added the TopKOutputChecker dataclass with static methods for type-safe validation of TopKOutput formats. A generic sketch of this checker pattern appears after the changelog.
    • Defined BypassedTopKOutput as a new NamedTuple for specific top-k output scenarios.
    • Extended the TopKOutputFormat enum to include BYPASSED.
    • Refactored the TopK class to internally use the new TopKConfig.
    • Updated the select_experts function to accept a TopKConfig object.
    • Modified forward_cuda and forward_npu to leverage TopKConfig and TopKOutputChecker for parameter handling and validation.
  • python/sglang/srt/layers/moe/utils.py
    • Removed the should_use_flashinfer_trtllm_moe function, as its logic is now integrated with get_moe_runner_backend.
    • Redefined the MoeA2ABackend enum for clarity.
    • Introduced a new MoeRunnerBackend enum to categorize different MoE runner types.
    • Implemented global variables and an initialize_moe_config function to centralize MoE configuration management.
    • Added getter functions (get_moe_a2a_backend, get_moe_runner_backend, get_deepep_mode, get_deepep_config, is_tbo_enabled, get_tbo_token_distribution_threshold) for accessing MoE configurations.
    • Updated the should_use_flashinfer_trtllm_moe function to rely on get_moe_runner_backend().
  • python/sglang/srt/layers/quantization/awq.py
    • Imported MoeRunnerConfig.
    • Updated the apply method to accept moe_runner_config and use its activation property.
  • python/sglang/srt/layers/quantization/base_config.py
    • Imported MoeRunnerConfig.
    • Updated the apply method signature to accept moe_runner_config.
  • python/sglang/srt/layers/quantization/blockwise_int8.py
    • Imported MoeRunnerConfig.
    • Updated the apply method signature to accept moe_runner_config and pass it to fused_experts.
  • python/sglang/srt/layers/quantization/compressed_tensors/compressed_tensors_moe.py
    • Imported MoeRunnerConfig.
    • Updated the apply method signature to accept moe_runner_config and pass it to fused_experts.
    • Ensured moe_runner_config.activation is used for activation assertions.
  • python/sglang/srt/layers/quantization/fp4.py
    • Imported MoeRunnerConfig.
    • Refactored MxFp4MoEMethod to directly inherit from FusedMoEMethodBase.
    • Updated the apply method signature to accept moe_runner_config and use its activation property.
  • python/sglang/srt/layers/quantization/fp8.py
    • Imported MoeRunnerConfig.
    • Updated the apply method signature to accept moe_runner_config and utilize its properties.
    • Modified apply_with_router_logits to use moe_runner_config and TopKOutputChecker for parameter handling.
  • python/sglang/srt/layers/quantization/fp8_utils.py
    • Added a comment to clarify that MoE backends should be defined via --moe-runner-backend.
  • python/sglang/srt/layers/quantization/gptq.py
    • Imported MoeRunnerConfig.
    • Updated the apply method signature to accept moe_runner_config and use its activation property.
  • python/sglang/srt/layers/quantization/marlin_utils.py
    • Imported FusedMoE.
    • Updated check_moe_marlin_supports_layer to accept FusedMoE and access MoE runner properties via moe_runner_config.
  • python/sglang/srt/layers/quantization/modelopt_quant.py
    • Imported should_use_flashinfer_trtllm_moe from sglang.srt.layers.moe.
    • Imported FusedMoE and MoeRunnerConfig.
    • Updated the apply method signature to accept moe_runner_config and use its properties.
    • Modified the enable_flashinfer_cutlass_moe property to retrieve its value from get_moe_runner_backend().
  • python/sglang/srt/layers/quantization/moe_wna16.py
    • Imported MoeRunnerConfig.
    • Updated the apply method signature to accept moe_runner_config and use its properties.
    • Modified moe_wna16_weight_loader to correctly use layer.moe_tp_size.
  • python/sglang/srt/layers/quantization/mxfp4.py
    • Imported get_moe_runner_backend and MoeRunnerConfig.
    • Updated the __init__ method to use get_moe_runner_backend() for initializing use_triton_kernels and use_flashinfer.
    • Modified the apply method signature to accept moe_runner_config and utilize its properties.
  • python/sglang/srt/layers/quantization/unquant.py
    • Imported MoeRunnerConfig.
    • Updated the method signatures of apply, forward_cuda, forward_cpu, and forward_npu to accept moe_runner_config and use its properties.
  • python/sglang/srt/layers/quantization/w4afp8.py
    • Updated the TopKOutput import to StandardTopKOutput.
    • Modified the apply method signature to accept StandardTopKOutput.
  • python/sglang/srt/layers/quantization/w8a8_fp8.py
    • Imported MoeRunnerConfig.
    • Updated the TopKOutput import to StandardTopKOutput.
    • Modified the apply method signature to accept StandardTopKOutput and moe_runner_config.
  • python/sglang/srt/layers/quantization/w8a8_int8.py
    • Imported MoeRunnerConfig.
    • Updated the apply method signature to accept moe_runner_config and use its properties.
  • python/sglang/srt/managers/schedule_batch.py
    • Imported is_tbo_enabled.
    • Removed several global server arguments from global_server_args_dict that are now managed by the new MoE configuration system.
    • Updated get_model_worker_batch to use is_tbo_enabled() for two-batch overlap checks.
  • python/sglang/srt/managers/scheduler.py
    • Imported new MoE utility functions: get_deepep_mode, get_moe_a2a_backend, initialize_moe_config, and is_tbo_enabled.
    • Added an init_moe_config method to initialize global MoE configurations.
    • Removed enable_two_batch_overlap, enable_deepep_moe, and deepep_mode parameters from prepare_mlp_sync_batch and prepare_mlp_sync_batch_raw.
    • Updated prepare_mlp_sync_batch_raw to retrieve MoE configurations using the new getter functions.
  • python/sglang/srt/model_executor/model_runner.py
    • Removed imports of DeepEPMode and MoeA2ABackend.
    • Eliminated moe_a2a_backend and deepep_mode from self.model_config.extra_args.
  • python/sglang/srt/models/dbrx.py
    • Imported MoeRunnerConfig and TopK.
    • Initialized self.topk and self.moe_runner_config.
    • Updated the fused_moe call to use topk_output and moe_runner_config.
    • Changed the return type of the forward method to Tuple[torch.Tensor, torch.Tensor].
  • python/sglang/srt/models/deepseek.py
    • Imported MoeRunnerConfig.
    • Updated the fused_moe call to use moe_runner_config.
  • python/sglang/srt/models/deepseek_v2.py
    • Removed imports of get_local_attention_dp_size and should_use_flashinfer_trtllm_moe.
    • Imported get_deepep_mode, get_moe_a2a_backend, and FusedMoE.
    • Eliminated deepep_mode, enable_flashinfer_cutlass_moe, renormalize, use_grouped_topk, num_expert_group, topk_group, and correction_bias from experts initialization.
    • Updated deepep_mode and _enable_deepep_moe assignments to use the new getter functions.
    • Simplified forward_normal_dual_stream and forward_normal by removing conditional topk_output assignments.
    • Modified make_expert_params_mapping to use FusedMoE.
  • python/sglang/srt/models/ernie4.py
    • Imported FusedMoE.
    • Updated make_expert_params_mapping to use FusedMoE.
  • python/sglang/srt/models/glm4_moe.py
    • Imported get_deepep_mode, get_moe_a2a_backend, and FusedMoE.
    • Removed the should_use_flashinfer_trtllm_moe import.
    • Removed the model_forward_maybe_tbo import.
    • Simplified the initialization of self.topk.
    • Eliminated deepep_mode, enable_flashinfer_cutlass_moe, renormalize, use_grouped_topk, num_expert_group, num_fused_shared_experts, topk_group, and correction_bias from experts initialization.
    • Updated _enable_deepep_moe assignment to use the getter function.
    • Simplified forward_normal_dual_stream and forward_normal by removing conditional topk_output assignments.
    • Modified make_expert_params_mapping to use FusedMoE.
  • python/sglang/srt/models/glm4v_moe.py
    • Removed imports related to parallel_state, tensor_model_parallel_all_reduce, get_attention_tp_rank, get_attention_tp_size, and get_local_attention_dp_size.
    • Imported FusedMoE.
    • Removed the initialization of self.dp_size.
    • Updated make_expert_params_mapping to use FusedMoE.
  • python/sglang/srt/models/gpt_oss.py
    • Removed imports of get_local_attention_dp_size and DeepEPMode.
    • Imported get_moe_a2a_backend and FusedMoE.
    • Simplified the initialization of self.topk.
    • Removed enable_flashinfer_cutlass_moe from extra_kwargs.
    • Updated activation_alpha and swiglu_limit to alpha and limit respectively in experts initialization.
    • Removed deepep_mode from experts initialization.
    • Updated the forward method to use get_moe_a2a_backend().
    • Removed the initialization of self.local_dp_size.
    • Modified make_expert_params_mapping_fused to use FusedMoE.
  • python/sglang/srt/models/granitemoe.py
    • Removed tp_size from FusedMoE initialization.
  • python/sglang/srt/models/grok.py
    • Removed tp_size from FusedMoE initialization.
  • python/sglang/srt/models/interns1.py
    • Imported FusedMoE.
    • Updated make_expert_params_mapping to use FusedMoE.
  • python/sglang/srt/models/internvl.py
    • Imported FusedMoE.
    • Updated make_expert_params_mapping to use FusedMoE.
  • python/sglang/srt/models/llama4.py
    • Removed imports of get_local_attention_dp_size.
    • Removed the initialization of self.local_dp_size.
  • python/sglang/srt/models/minicpm3.py
    • Removed the import of global_server_args_dict.
  • python/sglang/srt/models/mixtral.py
    • Removed the import of global_server_args_dict.
    • Removed tp_size from FusedMoE initialization.
  • python/sglang/srt/models/olmoe.py
    • Removed tp_size from FusedMoE initialization.
  • python/sglang/srt/models/qwen2_moe.py
    • Removed imports of dataclass, Enum, and auto.
    • Eliminated imports related to ExpertDistributionRecorder, LogitsProcessorOutput, EPMoE, and get_local_attention_dp_size.
    • Removed enable_flashinfer_cutlass_moe from FusedMoE initialization.
    • Removed the initialization of self.local_dp_size.
  • python/sglang/srt/models/qwen3_moe.py
    • Removed numerous imports related to distributed training, activation functions, attention, linear layers, and batch management.
    • Imported get_moe_a2a_backend and FusedMoE.
    • Eliminated deepep_mode and enable_flashinfer_cutlass_moe from experts initialization.
    • Updated the forward method to use get_moe_a2a_backend().
    • Removed the initialization of self.local_dp_size.
    • Modified make_expert_params_mapping to use FusedMoE.
  • python/sglang/srt/models/step3_vl.py
    • Imported get_moe_a2a_backend.
    • Updated the forward method to use get_moe_a2a_backend().
  • python/sglang/srt/models/xverse_moe.py
    • Imported MoeRunnerConfig and TopK.
    • Initialized self.moe_runner_config and self.topk.
    • Updated the fused_moe call to use topk_output and moe_runner_config.
  • python/sglang/srt/server_args.py
    • Replaced individual enable_flashinfer_cutlass_moe and enable_flashinfer_trtllm_moe flags with a single moe_runner_backend argument, offering a wider range of choices.
    • Removed enable_triton_kernel_moe and enable_flashinfer_mxfp4_moe from the primary arguments.
    • Added enable_flashinfer_cutlass_moe, enable_flashinfer_trtllm_moe, and enable_triton_kernel_moe to the list of deprecated arguments.
    • Implemented logic in __post_init__ to issue warnings for deprecated flags and automatically map them to the new moe_runner_backend.
    • Updated validation logic for flashinfer_cutlass_moe and flashinfer_trtllm_moe to rely on moe_runner_backend.
    • Modified add_cli_args to reflect the changes in arguments.
    • Updated model_specific_adjustments to use moe_runner_backend.
  • python/sglang/srt/two_batch_overlap.py
    • Imported get_tbo_token_distribution_threshold and is_tbo_enabled.
    • Updated _is_two_chunk_split_enabled to use get_tbo_token_distribution_threshold().
    • Modified capture_one_batch_size and MaybeTboDeepEPDispatcher to use is_tbo_enabled().
    • Added dp_padding_mode to the arguments of filter_batch.
  • python/sglang/test/test_block_fp8.py
    • Imported TopKConfig.
    • Updated calls to select_experts to use TopKConfig.
  • python/sglang/test/test_block_fp8_ep.py
    • Imported TopKConfig.
    • Updated calls to select_experts to use TopKConfig.
  • python/sglang/test/test_cutlass_w4a8_moe.py
    • Imported TopKConfig.
    • Updated calls to select_experts to use TopKConfig.
  • python/sglang/test/test_fp4_moe.py
    • Imported TopKConfig.
    • Updated calls to select_experts to use TopKConfig.
  • scripts/ci/ci_install_dependency.sh
    • Added a command to clear the torch compilation cache, improving CI build reliability.
  • test/srt/quant/test_block_int8.py
    • Imported TopKConfig.
    • Updated calls to select_experts to use TopKConfig.
  • test/srt/quant/test_int8_kernel.py
    • Imported TopKConfig.
    • Updated calls to select_experts to use TopKConfig.
  • test/srt/test_fused_moe.py
    • Imported TopKConfig.
    • Updated calls to select_experts to use TopKConfig.
  • test/srt/test_triton_moe_channel_fp8_kernel.py
    • Imported TopKConfig.
    • Updated calls to select_experts to use TopKConfig.
  • test/srt/test_triton_moe_wna16.py
    • Imported TopKConfig.
    • Updated calls to select_experts to use TopKConfig.
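
To make the checker idea above concrete, here is a generic, self-contained sketch of the pattern: an output-format enum, a NamedTuple output type, and a checker exposing static predicates. Only TopKOutputChecker, TopKOutputFormat, BYPASSED, and StandardTopKOutput are named in this PR; the method names format_is_standard / format_is_bypassed and the output fields shown are assumptions.

```python
from enum import Enum, auto
from typing import NamedTuple

import torch


class TopKOutputFormat(Enum):
    STANDARD = auto()
    BYPASSED = auto()


class StandardTopKOutput(NamedTuple):
    topk_weights: torch.Tensor
    topk_ids: torch.Tensor

    @property
    def format(self) -> TopKOutputFormat:
        return TopKOutputFormat.STANDARD


class TopKOutputChecker:
    """Static predicates let call sites assert the expected output format."""

    @staticmethod
    def format_is_standard(topk_output) -> bool:
        return topk_output.format == TopKOutputFormat.STANDARD

    @staticmethod
    def format_is_bypassed(topk_output) -> bool:
        return topk_output.format == TopKOutputFormat.BYPASSED


# Typical call-site usage instead of isinstance checks:
# assert TopKOutputChecker.format_is_standard(topk_output)
```
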
Activity
  • The bot indicated that it had reached its daily quota limit.
  • The author, ch-wan, requested a gemini review.
  • The author, ch-wan, requested a gemini summary.
  • A bot review comment was made on python/sglang/srt/layers/moe/topk.py, advising against in-place modification of self.topk_config and suggesting logging warnings instead. A minimal illustration of that pattern follows.
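
The snippet below is a generic, hypothetical illustration of that reviewer suggestion (take a frozen per-call copy via dataclasses.replace and log a warning instead of mutating the shared config). It is not the code from this PR, and the TopKConfig fields shown are placeholders.

```python
import dataclasses
import logging

logger = logging.getLogger(__name__)


@dataclasses.dataclass(frozen=True)
class TopKConfig:
    top_k: int
    renormalize: bool = True


def forward_with_override(topk_config: TopKConfig, forced_top_k: int) -> TopKConfig:
    # Warn about the override rather than silently changing shared state.
    if topk_config.top_k != forced_top_k:
        logger.warning(
            "Overriding top_k=%d with %d for this forward pass only",
            topk_config.top_k, forced_top_k,
        )
    # A frozen copy leaves the shared config object untouched.
    return dataclasses.replace(topk_config, top_k=forced_top_k)
```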

@ch-wan ch-wan force-pushed the cheng/refactor/ep-framework branch from 003ccbf to cb251b1 on August 12, 2025 at 23:55
@ch-wan ch-wan merged commit 2958951 into main Aug 15, 2025
17 of 62 checks passed
@ch-wan ch-wan deleted the cheng/refactor/ep-framework branch August 15, 2025 04:14
@ch-wan ch-wan restored the cheng/refactor/ep-framework branch August 15, 2025 04:14
@ch-wan ch-wan deleted the cheng/refactor/ep-framework branch August 15, 2025 04:15
narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 17, 2025
elfiegg added a commit to elfiegg/sglang that referenced this pull request Aug 18, 2025
MahmoudAshraf97 pushed a commit to MahmoudAshraf97/sglang that referenced this pull request Sep 8, 2025
