[Feature] Load Balance Refactor for DP-Attention

### Motivation

Current data parallel attention routing and load balance have the following problems:

- Confusing naming: In PD mode, the `round_robin` balance method is not round robin. It should be called "follow_bootstrap_room" (see [code](https://github.com/sgl-project/sglang/blob/88f3de25148d620f10eec3f9c8fdde7b4e695750/python/sglang/srt/managers/data_parallel_controller.py#L530)), since it uses bootstrap_room to assign target ranks.
- Limited load balancing methods, which leads to load imbalance in many cases:
    - In PD mode, only the "follow_bootstrap_room" method is supported.
    - In non-PD mode, the DataParallelController (DPC) only supports round robin or balancing by the number of requests; it does not support balancing by the number of tokens.
- Limited compatibility with external routing: we would like to use the Rust router (model gateway) to directly assign DP ranks, but this is not supported.

### Tasks

1. Fix confusing names
    - Introduce a new load balancing method, "follow_bootstrap_room". Move the current [code](https://github.com/sgl-project/sglang/blob/b5d9fc873bf238ff1172b8050dfa1fbb665f0f4d/python/sglang/srt/managers/data_parallel_controller.py#L517-L531) to this new method instead of incorrectly reusing "round_robin".
    - To stay backward compatible, we can change the default value of "--load-balance-method" to "auto" and resolve it in `ServerArgs.__post_init__` with the following rules:
        - If it is non-PD, set it to "round_robin".
        - If it is PD, set P workers to "follow_bootstrap_room" and set D workers to "round_robin".
    - https://github.com/sgl-project/sglang/pull/16110
    - Remove the balance method "decode_round_robin" and all its related code. It is no longer useful. This basically reverts https://github.com/sgl-project/sglang/pull/15164.
    - Remove the argument `"--prefill-round-robin-balance"` and its related code. It is very confusing. If we want to do this check, we should let the router do it.
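
The "auto" rules above could be resolved roughly as follows. This is a minimal sketch, not the actual `ServerArgs` code: the helper name `resolve_load_balance_method` is hypothetical, and the mode strings `"null"`, `"prefill"`, and `"decode"` are assumed values for the disaggregation mode.

```python
def resolve_load_balance_method(method: str, disaggregation_mode: str) -> str:
    """Resolve "auto" to a concrete method (hypothetical helper).

    Assumes disaggregation_mode is "null" for non-PD, or "prefill" /
    "decode" for the P and D workers in PD mode.
    """
    if method != "auto":
        # An explicit user choice is kept as-is.
        return method
    if disaggregation_mode == "prefill":
        # P workers keep the old behavior under its correct name.
        return "follow_bootstrap_room"
    # Non-PD servers and D workers both default to round robin.
    return "round_robin"
```

With this shape, existing launch commands that never set "--load-balance-method" keep their current behavior, while an explicit value always wins.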

2. Support methods other than "follow_bootstrap_room" in PD mode. We want to support the following two modes:
   a) The router directly adds prefill and decode DP ranks to `GenerateReqInput`. In this case, DPC just follows the assignment.
   b) The router does not assign DP ranks, and DPC assigns prefill ranks. In this case, the decode worker needs an additional communication step to learn the prefill DP rank. See https://github.com/sgl-project/sglang/pull/14726.
   In this way, we can support flexible balance methods in the prefill worker. The old behavior falls into category (a): since the router knows bootstrap_room, it directly sets the prefill DP rank in `GenerateReqInput` to `bootstrap_room % dp_size`.
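
The two modes can be sketched as below. This is an illustration under stated assumptions: `Req` is a stand-in for `GenerateReqInput`, the field name `routed_dp_rank` is hypothetical, and `dpc_policy` represents whatever balance method DPC is configured with.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Req:
    """Stand-in for GenerateReqInput; field names are assumptions."""
    bootstrap_room: int
    routed_dp_rank: Optional[int] = None  # set by the router in mode (a)


def router_follow_bootstrap_room(req: Req, dp_size: int) -> None:
    # Old behavior expressed as mode (a): the router knows bootstrap_room
    # and writes the prefill DP rank into the request itself.
    req.routed_dp_rank = req.bootstrap_room % dp_size


def pick_prefill_dp_rank(
    req: Req, dp_size: int, dpc_policy: Callable[[Req], int]
) -> int:
    if req.routed_dp_rank is not None:
        # Mode (a): DPC just follows the router's assignment.
        if not 0 <= req.routed_dp_rank < dp_size:
            raise ValueError(f"dp rank {req.routed_dp_rank} out of range")
        return req.routed_dp_rank
    # Mode (b): DPC chooses; the decode worker must then be told the
    # chosen prefill rank through a separate communication step.
    return dpc_policy(req)
```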

3. DP controller refactors
   - Unify the naming. Rename `shortest_queue` -> `total_requests`. Rename `minimum_tokens` -> `total_tokens`. These are the two basic methods, which balance by the total number of requests or the total number of tokens.
   - Implement both `total_requests` and `total_tokens` correctly in DPC with the piggyback load reporting, and simplify the implementation. For example, this function https://github.com/sgl-project/sglang/blob/88f3de25148d620f10eec3f9c8fdde7b4e695750/python/sglang/srt/managers/data_parallel_controller.py#L98 is overcomplicated. We do not need ahead-of-time planning. We can simply do the following: a) when we get a piggyback load report from a worker, we forcibly overwrite the recorded load of that worker; b) for each new request, we dispatch it to the worker with the minimal load and increment the load of that worker by one.
   - Correctly set the default balance method following the guideline below:
       - For prefill, we typically balance by total_tokens or use cache-aware routing.
       - For decode, we look at the limiting resource. If KV cache is sufficient and the latency requirement is tight, balance by total_requests (this is the normal case for online serving). If KV cache is very limited, balance by total_tokens. In the future, we might explore cache-aware routing as well.
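
The simplified dispatch logic described in steps a) and b) could look roughly like this. It is a sketch only: the class `LoadTracker` and its method names are hypothetical, and the `increment` parameter is an assumption added so that a `total_tokens` policy could pass a token count instead of one.

```python
from typing import List


class LoadTracker:
    """Tracks one load number per DP worker (hypothetical helper)."""

    def __init__(self, dp_size: int) -> None:
        self.loads: List[int] = [0] * dp_size

    def on_load_report(self, rank: int, load: int) -> None:
        # Step a): a piggyback report is authoritative, so overwrite
        # the local estimate rather than merging with it.
        self.loads[rank] = load

    def dispatch(self, increment: int = 1) -> int:
        # Step b): pick the least-loaded worker (lowest rank wins ties)
        # and bump its estimate until the next report corrects it.
        rank = min(range(len(self.loads)), key=self.loads.__getitem__)
        self.loads[rank] += increment
        return rank
```

Because each report overwrites the estimate, any drift from the local increments is bounded by the reporting interval, which is why no ahead-of-time planning is needed.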

4. External Scheduling
- [x] Support externally assigned DP ranks from the router.
- [ ] Support DP-rank-aware load balancing together with other instance-level load balance strategies, e.g., cache aware.

Related Issue: https://github.com/sgl-project/sglang/issues/13052



