Motivation
The current data parallel attention routing and load balancing have the following problems:
Confusing naming: In PD mode, the round_robin balance method is not round robin. It should be called "follow_bootstrap_room" (see code), since it uses bootstrap_room to assign target ranks.
Limited load balancing methods: The available methods lead to load imbalance in many cases.
In PD mode, only the "follow_bootstrap_room" method is supported.
In non-PD mode, the DataParallelController (DPC) only supports round robin or balancing by the number of requests, but does not support balancing by the number of tokens.
Limited compatibility with external routing: We would like to use the Rust router (model gateway) to assign DP ranks directly, but this is not supported.
Tasks
Fix confusing names
Introduce a new load balancing method, "follow_bootstrap_room". Move the current code to this new method instead of incorrectly reusing "round_robin".
To stay backward compatible, we can change the default value of "--load-balance-method" to "auto" and resolve it in ServerArgs.__post_init__ with the following rules:
If it is non-PD, set it to "round_robin".
If it is PD, set P workers to "follow_bootstrap_room" and set D workers to "round_robin".
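The "auto" resolution rules above could be sketched roughly as follows. This is an illustration only, not the actual ServerArgs code: the class name ServerArgsSketch is hypothetical, and the field disaggregation_mode with the values "null", "prefill", and "decode" is an assumed encoding of non-PD, P worker, and D worker.

```python
from dataclasses import dataclass


@dataclass
class ServerArgsSketch:
    """Hypothetical stand-in for ServerArgs, reduced to two fields."""

    # Assumed encoding: "null" = non-PD, "prefill" = P worker, "decode" = D worker.
    disaggregation_mode: str = "null"
    # Proposed new default; resolved to a concrete method in __post_init__.
    load_balance_method: str = "auto"

    def __post_init__(self):
        if self.load_balance_method != "auto":
            # An explicit user choice is kept unchanged.
            return
        if self.disaggregation_mode == "prefill":
            # P workers keep the old behavior under its new, accurate name.
            self.load_balance_method = "follow_bootstrap_room"
        else:
            # Non-PD workers and PD decode workers default to round robin.
            self.load_balance_method = "round_robin"
```

With this shape, existing launch commands that never pass "--load-balance-method" would keep their current behavior, while an explicit value is never overridden.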
Remove the balance method "decode_round_robin" and all its related code. It is no longer useful. This essentially reverts the PR "add decode round robin policy" (#15164).
Remove the argument "--prefill-round-robin-balance" and its related code. It is very confusing. If we want this check, the router should perform it.
Support methods other than "follow_bootstrap_room" in PD mode. We want to support the following two modes:
a) The router directly adds prefill and decode DP ranks to GenerateReqInput. In this case, DPC just follows the assignment.
b) The router does not assign DP ranks and DPC assigns prefill ranks. In this case, the decode worker needs an additional communication step to learn the prefill DP rank. See (Use bootstrap server to sync prefill dp rank #14726).
In this way, we can support flexible balance methods in the prefill worker. The old behavior falls into category (a): since the router knows bootstrap_room, it can directly set the prefill DP rank in GenerateReqInput to bootstrap_room % dp_size.
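A minimal sketch of how modes (a) and (b) could fit together is shown below. The function names, the assigned_rank parameter, and the pick_rank callback are all hypothetical and chosen for illustration; the real field names in GenerateReqInput may differ.

```python
from typing import Callable, Optional


def router_legacy_assignment(bootstrap_room: int, dp_size: int) -> int:
    """Old behavior expressed as mode (a): the router derives the prefill
    DP rank from bootstrap_room and attaches it to the request."""
    return bootstrap_room % dp_size


def dpc_select_rank(
    assigned_rank: Optional[int],
    dp_size: int,
    pick_rank: Callable[[], int],
) -> int:
    """Hypothetical DPC-side selection.

    Mode (a): the router supplied a rank, so the DPC simply follows it.
    Mode (b): no rank was supplied, so the DPC uses its own balance method,
    represented here by the pick_rank callback.
    """
    if assigned_rank is not None:
        if not 0 <= assigned_rank < dp_size:
            raise ValueError(f"dp rank {assigned_rank} out of range [0, {dp_size})")
        return assigned_rank
    return pick_rank()
```

In mode (b) the chosen rank is known only on the prefill side, which is why the decode worker needs the extra synchronization step mentioned above.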
DP controller refactors
Unify the naming. Rename shortest_queue -> total_requests. Rename minimum_tokens -> total_tokens. These are the two basic methods that balance by the total number of requests or total number of tokens.
Implement both total_requests and total_tokens correctly in DPC with the piggyback load reporting. Simplify the implementation. For example, this function (data_parallel_controller.py, line 98 in 88f3de2) is overcomplicated. We do not need ahead-of-time planning. We can simply do the following: a) When we get a piggyback load report from a worker, we forcibly overwrite the recorded load of that worker. b) For each new request, we dispatch it to the worker with the minimal load and increment the load of that worker by one.
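The simplified scheme in (a) and (b) could look roughly like the sketch below. The class LoadTracker and its method names are hypothetical. The cost parameter is an added assumption: the text above increments by one, and a per-request token count could be passed instead when balancing by total_tokens.

```python
class LoadTracker:
    """Hypothetical per-rank load bookkeeping for the DPC."""

    def __init__(self, dp_size: int):
        # One load counter per DP rank (requests or tokens, depending on method).
        self.loads = [0] * dp_size

    def on_load_report(self, rank: int, load: int) -> None:
        # (a) A piggybacked report is authoritative: overwrite the local estimate.
        self.loads[rank] = load

    def dispatch(self, cost: int = 1) -> int:
        # (b) Pick the least-loaded rank (lowest index wins ties), then bump
        # its local estimate so back-to-back requests spread out.
        rank = min(range(len(self.loads)), key=self.loads.__getitem__)
        self.loads[rank] += cost
        return rank
```

Between reports the local counters only grow, so they can drift from the true load; each report resets the estimate for that worker, which is what makes ahead-of-time planning unnecessary in this sketch.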
Correctly set the default balance method with the guideline below:
For prefill, we typically balance by total_tokens or use cache-aware routing.
For decode, we look at the limiting resources. If KV cache is enough and latency requirement is tight, balance by total_requests (this is the normal case for online serving). If KV cache is very limited, balance by total_tokens. In the future, we might explore cache aware as well.
External Scheduling
Support externally assigned DP ranks from the router.
Support dp-rank aware load balancing with other instance-level load balance strategies, e.g., cache aware.
Code reference for the function discussed under "DP controller refactors": sglang/python/sglang/srt/managers/data_parallel_controller.py, line 98 in 88f3de2
Related Issue: #13052