Motivation
Background
Deep EPLB
With Expert Parallelism, experts are assigned to different GPUs at startup time. The load on each expert varies by workload and changes dynamically, which can leave GPU loads imbalanced. DeepSeek adopted a redundant-experts strategy, and EPLB (https://github.com/deepseek-ai/EPLB) is the load-balancing algorithm used to achieve this goal.
Expert Distribution
With PR #4435 we did some initial analysis of expert distribution. We ran the DeepSeek-R1 model with EPMoE enabled on an 8 * H20 server, with both EP and TP set to 8. For a sample request of "What is the capital of France?", the expert usage heat map for the first MoE layer is as below. The summary statistics show that leveraging EPLB could better balance the workload across GPUs.
With the above data, we ran the distribution through eplb.py. Taking the raw per-expert counts of the first layer as an example:
Overloaded experts like expert 109 (count = 671) and expert 139 (count = 713) get several replicas on different GPUs, which makes sense.
With EPLB, experts are duplicated when needed and packed into groups according to the estimated workload of each expert. To estimate the benefit of using EPLB, one can first compute the average workload of each physical expert and then sum the expert workloads assigned to each GPU. We compared the GPU workload distribution of a naive baseline (the default contiguous expert-to-GPU mapping) against the EPLB strategy:

GPU workload without EPLB: [5645, 4342, 4264, 4586, 3702, 2563, 2799, 1923]
Mean: 3728, Std: 1227.908

Expected GPU workload with EPLB (assuming experts with the same index split the workload evenly, unrounded):
[3725.33, 3728.92, 3724.42, 3730.83, 3729.66, 3731.5, 3724.42, 3728.92]
Mean: 3728, Std: 2.867
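To make the estimate above concrete, here is a minimal sketch (a hypothetical helper, not part of the codebase) that computes expected per-GPU load from a physical-to-logical expert mapping, assuming replicas of the same logical expert split its load evenly:

```python
from collections import Counter

def estimate_gpu_load(phy2log, counts, num_gpus):
    """phy2log: list mapping physical slot -> logical expert id, laid out
    contiguously across GPUs; counts: per-logical-expert request count."""
    replicas = Counter(phy2log)                  # replica count per logical expert
    slots_per_gpu = len(phy2log) // num_gpus
    loads = [0.0] * num_gpus
    for slot, log_id in enumerate(phy2log):
        # each replica carries an even share of its logical expert's load
        loads[slot // slots_per_gpu] += counts[log_id] / replicas[log_id]
    return loads
```

For example, with 4 physical slots on 2 GPUs and expert 0 replicated twice, `estimate_gpu_load([0, 1, 0, 2], {0: 100, 1: 40, 2: 60}, 2)` returns `[90.0, 110.0]`.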
Proposed Changes
EPLB offers a potentially better way of assigning experts to each GPU, all within just ~160 lines of Python code. To realize the load-balancing benefits, we need to integrate the algorithm into the inference framework and build the supporting features it requires. The feature involves changes across multiple components.
EPLB Worker
Weight Loader
python/sglang/srt/layers/moe/ep_moe/layer.py, the EPMoE implementation class. The proposed changes will not conflict with the DeepEP implementation, which focuses more on communication.
The current implementation manages weights by expert_id, meaning loading an expert brings all of its layers into GPU memory. EPLB, however, operates at the finer granularity of layers, and may mix layers from different experts and place them together on the same GPU.
a. create_weights(). In the specific quantization-method implementation, we will need to allocate additional memory to store the weights of replica expert layers:
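A hedged sketch of the allocation change (names and shapes are illustrative assumptions, not the actual SGLang API): size the per-GPU weight buffers by physical slots, i.e. local experts plus reserved replica slots, rather than by logical expert count.

```python
import torch

def create_weights(num_local_experts, num_replica_slots,
                   intermediate_size, hidden_size, dtype=torch.bfloat16):
    # total physical slots this rank owns, including EPLB replica slots
    num_slots = num_local_experts + num_replica_slots
    # one slab per physical slot: fused gate/up projection and down projection
    w13 = torch.empty(num_slots, 2 * intermediate_size, hidden_size, dtype=dtype)
    w2 = torch.empty(num_slots, hidden_size, intermediate_size, dtype=dtype)
    return w13, w2
```

The extra `num_replica_slots` buffers stay idle until EPLB assigns replicas to this rank, trading a fixed memory reservation for rebalancing without reallocation.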
b. weight_loader(). The current implementation loads weights into an array indexed by expert_id; experts are contiguous, described by start_expert_id and end_expert_id. With EPLB, the experts and layers handled by each GPU (TP worker) are thoroughly shuffled, so the function needs to be enhanced with a two-level lookup table.
e.g. expert_id_mappings is a two-level lookup table that stores, for expert "E" layer "L", an array of (GPU "G1", offset "K1"), (GPU "G2", offset "K2") entries. For instance, expert_id_mappings[1, 58] = [(3, 23), (5, 1)] means layer 58 of expert 1 lives at two locations: GPU 3 at offset 23, and GPU 5 at offset 1. This mapping is maintained globally and updated dynamically by the weight_loader function to reflect the latest status (i.e. registration).
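An illustrative structure for this two-level lookup (hypothetical names, not the actual SGLang data structure): a dict keyed by (expert_id, layer_id) holding the list of (gpu_rank, slot_offset) replica locations.

```python
# (expert_id, layer_id) -> list of (gpu_rank, slot_offset) replica locations
expert_id_mappings = {}

def register_replica(expert_id, layer_id, gpu_rank, offset):
    # called by the weight loader once a replica's weights are in place
    expert_id_mappings.setdefault((expert_id, layer_id), []).append((gpu_rank, offset))

def lookup_replicas(expert_id, layer_id):
    # all physical locations currently serving this logical expert layer
    return expert_id_mappings.get((expert_id, layer_id), [])
```

With the example above, registering (GPU 3, offset 23) and (GPU 5, offset 1) for expert 1, layer 58 makes `lookup_replicas(1, 58)` return `[(3, 23), (5, 1)]`.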
The weight loader will need "layer_id" passed in, and will load the [expert_id, layer_id] weights into the correct GPU memory slot. Instead of being called only once at startup, the function will be called frequently while EPLB rebalancing is taking place.
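A hypothetical weight_loader sketch (the signature is an assumption for illustration): copy one layer of one expert into the physical slot owned by this rank, according to the current mapping.

```python
def weight_loader(param, loaded_weight, expert_id, layer_id, mapping, local_rank):
    # mapping: (expert_id, layer_id) -> list of (gpu_rank, slot_offset)
    for gpu_rank, offset in mapping.get((expert_id, layer_id), []):
        if gpu_rank == local_rank:
            # write into the slot buffer this rank owns; other ranks'
            # replicas are loaded by their own workers
            param[offset] = loaded_weight
```

Because the mapping can change while serving, this function must stay cheap and idempotent so a rebalance can re-invoke it for moved slots only.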
Future features: zero-overhead expert movement. Loading from disk is slow, so a potential solution is to load the weights in the background and only activate/register them with MOE Gating once the weights are completely loaded.
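The background-loading idea can be sketched as follows (hypothetical helper callbacks; the real implementation would coordinate with CUDA streams and the gating table):

```python
import threading

def move_expert_async(load_fn, register_fn, expert_id, layer_id, dst_offset):
    """Load weights off the critical path, then publish to the gating
    table only once the copy has fully completed."""
    def _worker():
        weight = load_fn(expert_id, layer_id)                  # slow: disk read + H2D copy
        register_fn(expert_id, layer_id, dst_offset, weight)   # activate when complete
    t = threading.Thread(target=_worker, daemon=True)
    t.start()
    return t
```

Until register_fn runs, the gating mapping still points at the old replicas, so inference never routes to a half-loaded slot.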
MOE Gating
python/sglang/srt/layers/moe/ep_moe/layer.py
The first three layers of the DeepSeek V3/R1 model are dense MLP layers; MoE layers start from the 4th layer. MOE Gating selects experts via select_experts().
select_experts() gives the target experts for the current MoE layer, and its topk_ids output contains only logical expert IDs. Before EPLB, each target expert was stored on exactly one designated GPU. With EPLB, the layers of each expert are shuffled and replicated, so MOE Gating must determine the exact node among multiple replicas, following certain preferences or algorithms. The pseudo code below demonstrates the idea:
current_expert_id_mapping changes dynamically while EPLB rebalancing is in progress, but once the weight movement completes it converges to the state derived from the EPLB output. Given the current mapping, choose_best_replica() in MOE Gating selects the best replica for each top-k expert, then communicates via all-reduce or all-to-all with the GPU node holding the target weights. The algorithm must weigh multiple factors to make good decisions, including distance (local > cross-GPU > cross-node) and batching, so that a single communication carries most of the tensors.
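A minimal sketch of this replica selection (hypothetical names; rank distance stands in for real topology distance, which would distinguish NVLink, PCIe, and network hops):

```python
def choose_best_replica(replicas, local_rank):
    # prefer a replica on the local rank (no communication needed)
    for gpu_rank, offset in replicas:
        if gpu_rank == local_rank:
            return (gpu_rank, offset)
    # otherwise fall back to the "nearest" rank as a crude distance proxy
    return min(replicas, key=lambda r: abs(r[0] - local_rank))

def route_topk(topk_ids, layer_id, mapping, local_rank):
    # map each logical top-k expert to a concrete (gpu_rank, offset) target
    return [choose_best_replica(mapping[(e, layer_id)], local_rank)
            for e in topk_ids]
```

For example, with `mapping = {(1, 58): [(3, 23), (5, 1)], (2, 58): [(0, 0)]}`, rank 5 routes expert 1 to its local replica (5, 1) and expert 2 to the only replica (0, 0).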
EPLB Manager
python/sglang/srt/managers/eplb_manager.py
EPLB needs to be run periodically to evaluate expert distributions to make replication decisions. It could be triggered by a certain threshold or simply a timer. The EPLB manager is designed to:
Evaluate the differences between the current running profile and the EPLB results, and decide from the deltas whether to start replica-adjustment actions.
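One possible trigger condition (a sketch under assumed thresholds, not the actual manager logic): rebalance when the relative dispersion of per-GPU load exceeds a threshold, so that a timer tick only acts when the imbalance is worth the weight-movement cost.

```python
import statistics

def should_rebalance(gpu_loads, rel_std_threshold=0.2):
    # trigger when relative load dispersion across GPUs is too high
    mean = statistics.mean(gpu_loads)
    if mean == 0:
        return False
    return statistics.pstdev(gpu_loads) / mean > rel_std_threshold
```

On the measurements above, the unbalanced distribution (std ~1228 on mean 3728) trips the threshold, while the post-EPLB distribution (std ~2.9) does not.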
Note: The community has introduced an enhanced feature for recording expert distribution without overhead for EPLB, available in PR #4957 (Expert distribution recording without overhead for EPLB). The manager will utilize the latest available methods to collect distribution data and process it through the EPLB algorithm.