[Feature] Support all DP load balance methods for PD-Disaggregation mode #13052

@changhuaixin

Motivation

In the current PD-disaggregation implementation, the decode server uses bootstrap_room to determine which DP rank in the prefill server to communicate with, as described in #10174. As a result, the prefill server is limited to the round-robin load balance method. The shortest_queue and minimum_tokens methods are not supported on the decode server either.
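To illustrate the limitation, here is a minimal sketch. It assumes the decode side derives the prefill DP rank from bootstrap_room with a modulo rule; the function names and the exact rule are hypothetical and not taken from the SGLang code. The point is that a rank guessed from bootstrap_room alone only matches when the prefill dispatcher follows the same rule, which a queue-based policy does not.

```python
from typing import Sequence


def infer_prefill_dp_rank(bootstrap_room: int, prefill_dp_size: int) -> int:
    """Decode-side guess (assumed rule): derive the rank from bootstrap_room alone."""
    return bootstrap_room % prefill_dp_size


def shortest_queue_rank(queue_lens: Sequence[int]) -> int:
    """Prefill-side pick under a shortest-queue policy (ties go to the lowest rank)."""
    return min(range(len(queue_lens)), key=lambda rank: queue_lens[rank])


# Hypothetical state: four prefill DP ranks with these pending-queue lengths.
queue_lens = [3, 0, 1, 2]
bootstrap_room = 4

guessed = infer_prefill_dp_rank(bootstrap_room, len(queue_lens))
actual = shortest_queue_rank(queue_lens)

# The decode server would contact rank `guessed`, but the request lives on `actual`.
mismatch = guessed != actual
print(f"guessed={guessed} actual={actual} mismatch={mismatch}")
```

In this made-up state the decode side would pick rank 0 while the shortest-queue policy places the request on rank 1, so the decode server needs another way to learn the real rank.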

We are trying to let the decode server know the correct DP rank in the prefill server first. Possible solutions are also discussed in #10174. After that, we will try to support the shortest_queue and minimum_tokens load balance methods. Potential PRs are as follows:

Design: get prefill dp rank via bootstrap server

I have implemented a POC version and tested its impact on performance.

In the POC version, I added a check_bootstraped step inside the for loop of pop_preallocated. Decode requests whose bootstrap info cannot yet be fetched via GET from the bootstrap server are held back: they do not allocate caches or handshake with the prefill server in decode_req.kv_receiver.init() until the info is available. As a result, TTFT increases for those delayed decode requests.
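The gating described above can be sketched roughly as follows. This is an illustration, not the POC code: DecodeRequest, pop_ready, and the lookup callable are hypothetical names, and the GET to the bootstrap server is replaced by a plain dictionary lookup so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class DecodeRequest:
    bootstrap_room: int
    prefill_dp_rank: Optional[int] = None  # unknown until the bootstrap server reports it


def pop_ready(
    queue: List[DecodeRequest],
    lookup: Callable[[int], Optional[int]],
) -> Tuple[List[DecodeRequest], List[DecodeRequest]]:
    """Split the queue into requests that may proceed and requests that must wait.

    `lookup` stands in for the GET to the bootstrap server and returns None
    while the prefill side has not registered the room yet.
    """
    ready, waiting = [], []
    for req in queue:
        if req.prefill_dp_rank is None:
            req.prefill_dp_rank = lookup(req.bootstrap_room)
        if req.prefill_dp_rank is None:
            waiting.append(req)  # no cache allocation or handshake yet
        else:
            ready.append(req)    # safe to allocate caches and handshake
    return ready, waiting


# Hypothetical registry: room 101 is already known, room 102 is not.
registry = {101: 2}
queue = [DecodeRequest(101), DecodeRequest(102)]
ready, waiting = pop_ready(queue, registry.get)

# Later the prefill server registers room 102, and the next pass releases it.
registry[102] = 0
ready_later, waiting_later = pop_ready(waiting, registry.get)
```

A request that waits one or more passes before its rank becomes known starts its handshake later, which is where the extra TTFT comes from.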

I tested with Qwen3-235B at ISL/OSL 1500:50 and saw mean TTFT increase by about 150 ms (median by about 200 ms) in the runs below. I also checked that the added delay is the cause of the TTFT increase, and no throughput impact was observed.

With my POC

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    5.0
Max request concurrency:                 400
Successful requests:                     400
Benchmark duration (s):                  84.29
Total input tokens:                      600000
Total input text tokens:                 600000
Total input vision tokens:               0
Total generated tokens:                  20000
Total generated tokens (retokenized):    19988
Request throughput (req/s):              4.75
Input token throughput (tok/s):          7118.30
Output token throughput (tok/s):         237.28
Total token throughput (tok/s):          7355.58
Concurrency:                             12.81
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   2698.45
Median E2E Latency (ms):                 2655.88
---------------Time to First Token----------------
Mean TTFT (ms):                          1747.57
Median TTFT (ms):                        1676.68
P99 TTFT (ms):                           3026.98
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          19.41
Median TPOT (ms):                        19.69
P99 TPOT (ms):                           27.31
---------------Inter-Token Latency----------------
Mean ITL (ms):                           19.40
Median ITL (ms):                         24.25
P95 ITL (ms):                            29.78
P99 ITL (ms):                            48.40
Max ITL (ms):                            146.91
==================================================

Without my POC

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    5.0
Max request concurrency:                 400
Successful requests:                     400
Benchmark duration (s):                  84.04
Total input tokens:                      600000
Total input text tokens:                 600000
Total input vision tokens:               0
Total generated tokens:                  20000
Total generated tokens (retokenized):    19988
Request throughput (req/s):              4.76
Input token throughput (tok/s):          7139.54
Output token throughput (tok/s):         237.98
Total token throughput (tok/s):          7377.53
Concurrency:                             12.39
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   2603.27
Median E2E Latency (ms):                 2510.23
---------------Time to First Token----------------
Mean TTFT (ms):                          1595.98
Median TTFT (ms):                        1479.86
P99 TTFT (ms):                           3063.69
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          20.56
Median TPOT (ms):                        21.28
P99 TPOT (ms):                           29.45
---------------Inter-Token Latency----------------
Mean ITL (ms):                           20.56
Median ITL (ms):                         25.01
P95 ITL (ms):                            31.10
P99 ITL (ms):                            49.55
Max ITL (ms):                            97.39
==================================================
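For a quick comparison of the two runs above, the TTFT and throughput differences can be computed directly from the reported numbers. The dictionary layout and variable names are only for illustration; the values are copied from the benchmark output.

```python
# Values copied from the two benchmark reports above (ms for TTFT, req/s for throughput).
with_poc = {"mean_ttft": 1747.57, "median_ttft": 1676.68, "p99_ttft": 3026.98, "req_per_s": 4.75}
without_poc = {"mean_ttft": 1595.98, "median_ttft": 1479.86, "p99_ttft": 3063.69, "req_per_s": 4.76}

# Positive delta means the POC run is higher than the baseline run.
delta = {key: round(with_poc[key] - without_poc[key], 2) for key in with_poc}

for key, value in delta.items():
    print(f"{key}: {value:+.2f}")
```

On these single runs, mean TTFT is about 152 ms higher and median TTFT about 197 ms higher with the POC, P99 TTFT is about 37 ms lower, and request throughput differs by 0.01 req/s.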

Client command

python -m sglang.bench_serving --backend sglang --model /models/Qwen3-235B-A22B-Instruct-2507-FP8 --pd-separated --host localhost --port 8000 --dataset-name random --dataset-path /mnt/ShareGPT_V3_unfiltered_cleaned_split.json --random-input-len 1500 --random-output-len 50 --random-range-ratio 1 --request-rate 5 --num-prompts 400 --max-concurrency 400
