
DP: support piggyback server load report #11469

Merged
hnyls2002 merged 2 commits into sgl-project:main from openanolis:sufeng-buaa/dp_balance
Dec 25, 2025

Conversation

@changhuaixin (Contributor) commented Oct 11, 2025

Motivation

This PR implements a piggyback load-reporting mechanism, as described in issue 11186.
I have also tested this piggyback load report with the shortest_queue scheduler and used a new method to count request queue length.

Modifications

piggyback load reporting

Load information is generated inside process_batch_result and sent back all the way to the tokenizer manager via stream_output. The original watch-load thread is now removed.
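
A minimal, self-contained sketch of this piggybacking is below. The GetLoadReqOutput fields match the ones visible elsewhere in this PR; the simplified BatchTokenIDOutput and the attach_load_report helper are illustrative only, not the actual sglang code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GetLoadReqOutput:
    rid: Optional[str] = None
    dp_rank: Optional[int] = None
    num_reqs: int = 0
    num_waiting_reqs: int = 0
    num_tokens: int = 0


@dataclass
class BatchTokenIDOutput:
    # simplified stand-in for the real io_struct; only the piggybacked field is shown
    rids: List[str]
    load: Optional[GetLoadReqOutput] = None


def attach_load_report(out: BatchTokenIDOutput, dp_rank: int, num_reqs: int,
                       num_waiting_reqs: int, num_tokens: int) -> None:
    """Would be called from process_batch_result before stream_output sends `out`."""
    out.load = GetLoadReqOutput(
        dp_rank=dp_rank,
        num_reqs=num_reqs,
        num_waiting_reqs=num_waiting_reqs,
        num_tokens=num_tokens,
    )


out = BatchTokenIDOutput(rids=["req-0", "req-1"])
attach_load_report(out, dp_rank=2, num_reqs=25, num_waiting_reqs=0, num_tokens=17075)
print(out.load)  # the load report rides along with the batch result
```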

Shortest queue method implementation

Use num_reqs to update the DP budget

A WatchLoadUpdateReq is received from each DP rank, and the DPBudget is updated for all ranks together. A timestamp is used to identify whether the reports come from the same batch. I have also added an environment variable, SGLANG_DATA_PARALLEL_BUDGET_INTERVAL, to control the DPBudget update interval and make the shortest_queue method more stable.
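
A rough sketch of how such a budget update could look, assuming the per-rank reports carry num_reqs as in this PR; the dispatch helper and the exact timestamp/interval handling are assumptions here, not the real implementation:

```python
import os
import time
from typing import List, Tuple


class DPBudget:
    def __init__(self, dp_size: int):
        self.num_reqs: List[int] = [0] * dp_size
        self.last_update: float = 0.0
        # throttle for refreshing the budget (seconds); 0 means refresh on every batch
        self.interval: float = float(
            os.environ.get("SGLANG_DATA_PARALLEL_BUDGET_INTERVAL", "0")
        )

    def update_budget(self, loads: List[Tuple[int, int]], timestamp: float) -> None:
        # `loads` holds (dp_rank, num_reqs) pairs sharing the same timestamp,
        # i.e. reports that came from the same batch.
        if timestamp - self.last_update < self.interval:
            return
        for dp_rank, num_reqs in loads:
            self.num_reqs[dp_rank] = num_reqs
        self.last_update = timestamp

    def dispatch(self) -> int:
        # shortest_queue: pick the rank with the fewest outstanding requests and
        # account for the newly dispatched request immediately.
        target = min(range(len(self.num_reqs)), key=self.num_reqs.__getitem__)
        self.num_reqs[target] += 1
        return target


budget = DPBudget(dp_size=4)
budget.update_budget([(0, 25), (1, 25), (2, 28), (3, 22)], timestamp=time.time())
print(budget.dispatch())  # -> 3, the least-loaded rank
```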

request queue length accounting (Deprecated)

This method is deprecated because it relies on per-request accounting to obtain queue lengths for the DP ranks, which is too heavy for the sglang data_parallel_controller.

Instead of returning the queue length of the running batch, queue length is counted in the DP controller:

  • Queue length increments on dispatch
  • Queue length decrements when a request is checked as finished in process_batch_result

This method was used because there is an interval (up to two ITLs) between request dispatch and running-batch accounting, which can cause DP load imbalance; a minimal sketch of the counting follows below.
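
For reference, a small sketch of this (now-deprecated) dispatch-time counting; the DpQueueCounter class and its method names are hypothetical, and only the increment-on-dispatch / decrement-on-finish behaviour comes from the description above.

```python
class DpQueueCounter:
    """Per-DP-rank queue length tracked in the controller, not in the scheduler."""

    def __init__(self, dp_size: int):
        self.queue_len = [0] * dp_size

    def on_dispatch(self, dp_rank: int) -> None:
        # counted immediately at dispatch time, not after the next running-batch report
        self.queue_len[dp_rank] += 1

    def on_finish(self, dp_rank: int, num_finished: int = 1) -> None:
        # decremented when finished requests are observed in process_batch_result
        self.queue_len[dp_rank] -= num_finished

    def shortest(self) -> int:
        return min(range(len(self.queue_len)), key=self.queue_len.__getitem__)


counter = DpQueueCounter(dp_size=4)
counter.on_dispatch(2)
counter.on_finish(2)
print(counter.shortest())  # rank 0: all ranks are equally idle again
```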

I observed this situation when requests from a previous round finish and a new round of requests arrives, using the shortest_queue scheduler with max concurrency 100. A decode server with DP size 4 is used, so the queue length for each DP rank is expected to be at most 25; however, queue lengths of 28 or higher were seen.

image

The imbalance arises during rounds 3–4: a total of 28 requests is dispatched to DP rank 2, and these requests are not reported until round 5.

After using the new method, 100 requests are balanced across all DP ranks.

[2025-10-23 18:15:01 DP2 TP2 EP2] Decode batch. time.time()=1761214501.6170907 #running-req: 25, #token: 17075, token usage: 0.32, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 520.75, #queue-req: 0,
[2025-10-23 18:15:01 DP1 TP1 EP1] Decode batch. time.time()=1761214501.6171055 #running-req: 25, #token: 17278, token usage: 0.33, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 520.53, #queue-req: 0,
[2025-10-23 18:15:01 DP0 TP0 EP0] Decode batch. time.time()=1761214501.6171064 #running-req: 25, #token: 17180, token usage: 0.33, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 520.55, #queue-req: 0,
[2025-10-23 18:15:01 DP3 TP3 EP3] Decode batch. time.time()=1761214501.6171079 #running-req: 25, #token: 17154, token usage: 0.33, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 520.59, #queue-req: 0,

Accuracy Tests

Benchmarking and Profiling

In my case, using shortest_queue for the decode server improves throughput by 5.8%, and mean TPOT and ITL both decrease.

With this patch:

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    4.0
Max request concurrency:                 100
Successful requests:                     400
Benchmark duration (s):                  315.92
Total input tokens:                      200000
Total input text tokens:                 200000
Total input vision tokens:               0
Total generated tokens:                  600000
Total generated tokens (retokenized):    599435
Request throughput (req/s):              1.27
Input token throughput (tok/s):          633.08
Output token throughput (tok/s):         1899.23
Peak output token throughput (tok/s):    2151.00
Peak concurrent requests:                112
Total token throughput (tok/s):          2532.31
Concurrency:                             93.32
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   73704.41
Median E2E Latency (ms):                 74142.19
---------------Time to First Token----------------
Mean TTFT (ms):                          596.81
Median TTFT (ms):                        496.91
P99 TTFT (ms):                           1638.21
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          48.77
Median TPOT (ms):                        49.13
P99 TPOT (ms):                           49.42
---------------Inter-Token Latency----------------
Mean ITL (ms):                           48.77
Median ITL (ms):                         48.52
P95 ITL (ms):                            54.05
P99 ITL (ms):                            86.82
Max ITL (ms):                            188.67
==================================================

Without this patch:

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    4.0
Max request concurrency:                 100
Successful requests:                     400
Benchmark duration (s):                  334.35
Total input tokens:                      200000
Total input text tokens:                 200000
Total input vision tokens:               0
Total generated tokens:                  600000
Total generated tokens (retokenized):    598918
Request throughput (req/s):              1.20
Input token throughput (tok/s):          598.17
Output token throughput (tok/s):         1794.51
Peak output token throughput (tok/s):    2145.00
Peak concurrent requests:                119
Total token throughput (tok/s):          2392.69
Concurrency:                             89.87
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   75117.38
Median E2E Latency (ms):                 74928.22
---------------Time to First Token----------------
Mean TTFT (ms):                          1222.55
Median TTFT (ms):                        485.67
P99 TTFT (ms):                           14702.74
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          49.30
Median TPOT (ms):                        49.51
P99 TPOT (ms):                           59.46
---------------Inter-Token Latency----------------
Mean ITL (ms):                           49.30
Median ITL (ms):                         48.84
P95 ITL (ms):                            51.82
P99 ITL (ms):                            88.51
Max ITL (ms):                            17349.65
==================================================
prefill server:
python3 -m sglang.launch_server --model-path /models/Qwen3-235B-A22B-Instruct-2507-FP8 --port 30001 --base-gpu-id 0 --disaggregation-mode prefill --disable-radix-cache --disaggregation-bootstrap-port 8991 --host=172.26.228.70 --mem-fraction-static 0.75 --tp-size 4 --ep-size 4 --enable-dp-attention --dp-size 4 --moe-a2a-backend deepep --cuda-graph-max-bs 128 --chunked-prefill-size 160000 --load-balance-method round_robin

decode server:
python3 -m sglang.launch_server --model-path /models/Qwen3-235B-A22B-Instruct-2507-FP8 --port 31001 --base-gpu-id 4 --disaggregation-mode decode --disable-radix-cache --host=172.26.228.70 --mem-fraction-static 0.75 --tp-size 4 --ep-size 4 --enable-dp-attention --dp-size 4 --moe-a2a-backend deepep --attention-backend flashinfer --cuda-graph-max-bs 128 --load-balance-method shortest_queue --prefill-round-robin-balance --decode-log-interval 1

benchmark:
python -m sglang.bench_serving --backend sglang --model /models/Qwen3-235B-A22B-Instruct-2507-FP8 --pd-separated --host localhost --port 8000 --dataset-name random --dataset-path /mnt/ShareGPT_V3_unfiltered_cleaned_split.json --random-input-len 500 --random-output-len 1500 --random-range-ratio 1 --request-rate 4 --num-prompts 240 --max-concurrency 60

Checklist

Summary by CodeRabbit

  • New Features
    • Streaming outputs now include real-time load metrics, enabling more transparent monitoring during generation.
  • Refactor
    • Reworked data-parallel scheduling to balance requests more evenly across workers, reducing latency spikes and improving overall throughput and responsiveness.

@gemini-code-assist (Contributor)

Summary of Changes

Hello @changhuaixin, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant enhancement to the data parallel system by implementing a 'piggyback' mechanism for server load reports. By embedding real-time load metrics directly into batch output objects, the system gains a more granular and immediate understanding of the workload distribution across its data parallel workers. This improved load awareness enables the DataParallelController to make more intelligent and dynamic decisions when balancing requests, ultimately leading to better performance and resource allocation.

Highlights

  • Load Information Propagation: Batch output objects (BatchTokenIDOutput and BatchStrOutput) now include a GetLoadReqOutput field, allowing server load data to be 'piggybacked' and transmitted along with processing results.
  • Enhanced Data Parallel Budgeting: The DPBudget class in DataParallelController has been updated to track the load of all data parallel ranks. It is initialized with the total data parallel size and maintains a comprehensive view of each rank's load.
  • Improved Load Balancing Logic: The update_budget method within DPBudget now leverages the detailed load information from all ranks to more accurately identify underloaded workers and distribute new requests efficiently, aiming for better resource utilization.
  • Optimized Load Reporting: Load updates are now sent to the scheduler once per batch of processed requests, utilizing the piggybacked load information. This provides a more timely and efficient reporting mechanism compared to solely relying on periodic polling.

@gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces a mechanism for piggybacking server load reports to enable more responsive load balancing in a data-parallel setup. The changes are logical and well-structured. I've identified a potential issue where load updates might be missed in a specific configuration, which could impact load balancing. I have also provided a suggestion to improve code conciseness.

Outdated comment threads:
  • python/sglang/srt/managers/tokenizer_manager.py
  • python/sglang/srt/managers/data_parallel_controller.py
  • python/sglang/srt/managers/tokenizer_manager.py
  • python/sglang/srt/managers/io_struct.py
  • python/sglang/srt/managers/io_struct.py
@sgl-project sgl-project deleted a comment from coderabbitai Bot Oct 20, 2025
@hnyls2002 (Collaborator):

    raise ValueError(f"Invalid object: {obj}")
ValueError: Invalid object: WatchLoadUpdateReq(rid=None, loads=[GetLoadReqOutput(rid=None, dp_rank=None, num_reqs=1, num_waiting_reqs=0, num_tokens=7)])

@sufeng-buaa sufeng-buaa force-pushed the sufeng-buaa/dp_balance branch 2 times, most recently from 5447c83 to 41423c7 Compare October 26, 2025 11:41
@changhuaixin changhuaixin requested a review from zhyncs as a code owner November 10, 2025 03:04
@sufeng-buaa sufeng-buaa force-pushed the sufeng-buaa/dp_balance branch from be2c365 to 3468399 Compare November 10, 2025 07:05
@a4zhangfei (Contributor):

Is it compatible with multiple instances of the tokenizer, i.e., when tokenizer-worker-num > 1?

@changhuaixin (Contributor, Author):

Is it compatible with multiple instances of the tokenizer, i.e., when tokenizer-worker-num > 1?

It is not implemented for the multiple-tokenizer case. With this patch, you can run with tokenizer-worker-num > 1, but no load is reported back. I can support this case; can you describe your use case a bit?

@changhuaixin (Contributor, Author):

    raise ValueError(f"Invalid object: {obj}")
ValueError: Invalid object: WatchLoadUpdateReq(rid=None, loads=[GetLoadReqOutput(rid=None, dp_rank=None, num_reqs=1, num_waiting_reqs=0, num_tokens=7)])

fixed

@a4zhangfei (Contributor):

Is it compatible with multiple instances of the tokenizer, i.e., when tokenizer-worker-num > 1?

It is not implemented for the multiple-tokenizer case. With this patch, you can run with tokenizer-worker-num > 1, but no load is reported back. I can support this case; can you describe your use case a bit?

When deploying the DeepSeek model, we use multiple tokenizer instances to avoid tokenizer bottlenecks.

@changhuaixin (Contributor, Author):

Is it compatible with multiple instances of the tokenizer, i.e., when tokenizer-worker-num > 1?

It is not implemented for the multiple-tokenizer case. With this patch, you can run with tokenizer-worker-num > 1, but no load is reported back. I can support this case; can you describe your use case a bit?

When deploying the DeepSeek model, we use multiple tokenizer instances to avoid tokenizer bottlenecks.

Hi, I have added this support in issue #13052.

@stmatengss (Collaborator):

I will review and test this PR tomorrow @hnyls2002

@sufeng-buaa sufeng-buaa force-pushed the sufeng-buaa/dp_balance branch from e61bfc6 to d253071 Compare December 4, 2025 08:36
@github-actions (Bot) added the documentation label Dec 4, 2025
Signed-off-by: Chang Huaixin (OpenAnolis) <changhuaixin@linux.alibaba.com>
@sufeng-buaa sufeng-buaa force-pushed the sufeng-buaa/dp_balance branch from 07549f1 to 239dd2c Compare December 16, 2025 02:15
@hnyls2002 hnyls2002 merged commit 0c39730 into sgl-project:main Dec 25, 2025
243 of 259 checks passed
Leoyzen pushed a commit to Leoyzen/sglang that referenced this pull request Dec 25, 2025
Signed-off-by: Chang Huaixin (OpenAnolis) <changhuaixin@linux.alibaba.com>
whybeyoung added a commit that referenced this pull request Dec 25, 2025
YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026
Signed-off-by: Chang Huaixin (OpenAnolis) <changhuaixin@linux.alibaba.com>

Labels

documentation, run-ci
