
DP: support piggyback server load report #11469

Merged
hnyls2002 merged 2 commits into sgl-project:main from openanolis:sufeng-buaa/dp_balance
Dec 25, 2025

Conversation

@changhuaixin (Contributor) commented Oct 11, 2025

Motivation

This PR implements a piggyback load-reporting mechanism, as described in issue 11186.
I have also tested this piggyback load report with the shortest_queue scheduler and used a new method to count request queue length.

Modifications

piggyback load reporting

Load information is generated inside process_batch_result and sent back all the way to the tokenizer manager via stream_output. The original watch-load thread is now removed.
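
A minimal, self-contained sketch of this piggybacking is below. The GetLoadReqOutput fields match the ones visible elsewhere in this PR; the simplified BatchTokenIDOutput and the attach_load_report helper are illustrative only, not the actual sglang code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GetLoadReqOutput:
    rid: Optional[str] = None
    dp_rank: Optional[int] = None
    num_reqs: int = 0
    num_waiting_reqs: int = 0
    num_tokens: int = 0


@dataclass
class BatchTokenIDOutput:
    # simplified stand-in for the real io_struct; only the piggybacked field is shown
    rids: List[str]
    load: Optional[GetLoadReqOutput] = None


def attach_load_report(out: BatchTokenIDOutput, dp_rank: int, num_reqs: int,
                       num_waiting_reqs: int, num_tokens: int) -> None:
    """Would be called from process_batch_result before stream_output sends `out`."""
    out.load = GetLoadReqOutput(
        dp_rank=dp_rank,
        num_reqs=num_reqs,
        num_waiting_reqs=num_waiting_reqs,
        num_tokens=num_tokens,
    )


out = BatchTokenIDOutput(rids=["req-0", "req-1"])
attach_load_report(out, dp_rank=2, num_reqs=25, num_waiting_reqs=0, num_tokens=17075)
print(out.load)  # the load report rides along with the batch result
```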

Shortest queue method implementation

Use num_reqs to update the DP budget

A WatchLoadUpdateReq is received from each DP rank, and the DPBudget is updated for all ranks together. A timestamp is used to identify whether the reports come from the same batch. I have also added an environment variable, SGLANG_DATA_PARALLEL_BUDGET_INTERVAL, to control the DPBudget update interval and make the shortest_queue method more stable.
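
A rough sketch of how such a budget update could look, assuming the per-rank reports carry num_reqs as in this PR; the dispatch helper and the exact timestamp/interval handling are assumptions here, not the real implementation:

```python
import os
import time
from typing import List, Tuple


class DPBudget:
    def __init__(self, dp_size: int):
        self.num_reqs: List[int] = [0] * dp_size
        self.last_update: float = 0.0
        # throttle for refreshing the budget (seconds); 0 means refresh on every batch
        self.interval: float = float(
            os.environ.get("SGLANG_DATA_PARALLEL_BUDGET_INTERVAL", "0")
        )

    def update_budget(self, loads: List[Tuple[int, int]], timestamp: float) -> None:
        # `loads` holds (dp_rank, num_reqs) pairs sharing the same timestamp,
        # i.e. reports that came from the same batch.
        if timestamp - self.last_update < self.interval:
            return
        for dp_rank, num_reqs in loads:
            self.num_reqs[dp_rank] = num_reqs
        self.last_update = timestamp

    def dispatch(self) -> int:
        # shortest_queue: pick the rank with the fewest outstanding requests and
        # account for the newly dispatched request immediately.
        target = min(range(len(self.num_reqs)), key=self.num_reqs.__getitem__)
        self.num_reqs[target] += 1
        return target


budget = DPBudget(dp_size=4)
budget.update_budget([(0, 25), (1, 25), (2, 28), (3, 22)], timestamp=time.time())
print(budget.dispatch())  # -> 3, the least-loaded rank
```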

request queue length accounting (Deprecated)

This method is deprecated because it relies on per-request accounting to obtain queue lengths for the DP ranks, which is too heavy for the sglang data_parallel_controller.

Instead of returning the queue length of the running batch, queue length is counted in the DP controller:

  • Queue length increments on dispatch
  • Queue length decrements when a request is checked as finished in process_batch_result

This method was used because there is an interval (up to two ITLs) between request dispatch and running-batch accounting, which can cause DP load imbalance; a minimal sketch of the counting follows below.
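
For reference, a small sketch of this (now-deprecated) dispatch-time counting; the DpQueueCounter class and its method names are hypothetical, and only the increment-on-dispatch / decrement-on-finish behaviour comes from the description above.

```python
class DpQueueCounter:
    """Per-DP-rank queue length tracked in the controller, not in the scheduler."""

    def __init__(self, dp_size: int):
        self.queue_len = [0] * dp_size

    def on_dispatch(self, dp_rank: int) -> None:
        # counted immediately at dispatch time, not after the next running-batch report
        self.queue_len[dp_rank] += 1

    def on_finish(self, dp_rank: int, num_finished: int = 1) -> None:
        # decremented when finished requests are observed in process_batch_result
        self.queue_len[dp_rank] -= num_finished

    def shortest(self) -> int:
        return min(range(len(self.queue_len)), key=self.queue_len.__getitem__)


counter = DpQueueCounter(dp_size=4)
counter.on_dispatch(2)
counter.on_finish(2)
print(counter.shortest())  # rank 0: all ranks are equally idle again
```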

I observed this situation when requests from a previous round finish and a new round of requests arrives, using the shortest_queue scheduler with max concurrency 100. A decode server with DP size 4 is used, so the queue length for each DP rank is expected to be at most 25; however, queue lengths of 28 or higher were seen.

image

The imbalance arises during rounds 3–4: a total of 28 requests is dispatched to DP rank 2, and these requests are not reported until round 5.

After using the new method, 100 requests are balanced across all DP ranks.

[2025-10-23 18:15:01 DP2 TP2 EP2] Decode batch. time.time()=1761214501.6170907 #running-req: 25, #token: 17075, token usage: 0.32, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 520.75, #queue-req: 0,
[2025-10-23 18:15:01 DP1 TP1 EP1] Decode batch. time.time()=1761214501.6171055 #running-req: 25, #token: 17278, token usage: 0.33, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 520.53, #queue-req: 0,
[2025-10-23 18:15:01 DP0 TP0 EP0] Decode batch. time.time()=1761214501.6171064 #running-req: 25, #token: 17180, token usage: 0.33, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 520.55, #queue-req: 0,
[2025-10-23 18:15:01 DP3 TP3 EP3] Decode batch. time.time()=1761214501.6171079 #running-req: 25, #token: 17154, token usage: 0.33, pre-allocated usage: 0.00, #prealloc-req: 0, #transfer-req: 0, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 520.59, #queue-req: 0,

Accuracy Tests

Benchmarking and Profiling

In my case, using shortest_queue for the decode server improves throughput by 5.8%, and mean TPOT and ITL both decrease.

With this patch:

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    4.0
Max request concurrency:                 100
Successful requests:                     400
Benchmark duration (s):                  315.92
Total input tokens:                      200000
Total input text tokens:                 200000
Total input vision tokens:               0
Total generated tokens:                  600000
Total generated tokens (retokenized):    599435
Request throughput (req/s):              1.27
Input token throughput (tok/s):          633.08
Output token throughput (tok/s):         1899.23
Peak output token throughput (tok/s):    2151.00
Peak concurrent requests:                112
Total token throughput (tok/s):          2532.31
Concurrency:                             93.32
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   73704.41
Median E2E Latency (ms):                 74142.19
---------------Time to First Token----------------
Mean TTFT (ms):                          596.81
Median TTFT (ms):                        496.91
P99 TTFT (ms):                           1638.21
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          48.77
Median TPOT (ms):                        49.13
P99 TPOT (ms):                           49.42
---------------Inter-Token Latency----------------
Mean ITL (ms):                           48.77
Median ITL (ms):                         48.52
P95 ITL (ms):                            54.05
P99 ITL (ms):                            86.82
Max ITL (ms):                            188.67
==================================================

Without this patch:

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    4.0
Max request concurrency:                 100
Successful requests:                     400
Benchmark duration (s):                  334.35
Total input tokens:                      200000
Total input text tokens:                 200000
Total input vision tokens:               0
Total generated tokens:                  600000
Total generated tokens (retokenized):    598918
Request throughput (req/s):              1.20
Input token throughput (tok/s):          598.17
Output token throughput (tok/s):         1794.51
Peak output token throughput (tok/s):    2145.00
Peak concurrent requests:                119
Total token throughput (tok/s):          2392.69
Concurrency:                             89.87
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   75117.38
Median E2E Latency (ms):                 74928.22
---------------Time to First Token----------------
Mean TTFT (ms):                          1222.55
Median TTFT (ms):                        485.67
P99 TTFT (ms):                           14702.74
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          49.30
Median TPOT (ms):                        49.51
P99 TPOT (ms):                           59.46
---------------Inter-Token Latency----------------
Mean ITL (ms):                           49.30
Median ITL (ms):                         48.84
P95 ITL (ms):                            51.82
P99 ITL (ms):                            88.51
Max ITL (ms):                            17349.65
==================================================
prefill server:
python3 -m sglang.launch_server --model-path /models/Qwen3-235B-A22B-Instruct-2507-FP8 --port 30001 --base-gpu-id 0 --disaggregation-mode prefill --disable-radix-cache --disaggregation-bootstrap-port 8991 --host=172.26.228.70 --mem-fraction-static 0.75 --tp-size 4 --ep-size 4 --enable-dp-attention --dp-size 4 --moe-a2a-backend deepep --cuda-graph-max-bs 128 --chunked-prefill-size 160000 --load-balance-method round_robin

decode server:
python3 -m sglang.launch_server --model-path /models/Qwen3-235B-A22B-Instruct-2507-FP8 --port 31001 --base-gpu-id 4 --disaggregation-mode decode --disable-radix-cache --host=172.26.228.70 --mem-fraction-static 0.75 --tp-size 4 --ep-size 4 --enable-dp-attention --dp-size 4 --moe-a2a-backend deepep --attention-backend flashinfer --cuda-graph-max-bs 128 --load-balance-method shortest_queue --prefill-round-robin-balance --decode-log-interval 1

benchmark:
python -m sglang.bench_serving --backend sglang --model /models/Qwen3-235B-A22B-Instruct-2507-FP8 --pd-separated --host localhost --port 8000 --dataset-name random --dataset-path /mnt/ShareGPT_V3_unfiltered_cleaned_split.json --random-input-len 500 --random-output-len 1500 --random-range-ratio 1 --request-rate 4 --num-prompts 240 --max-concurrency 60

Checklist

Summary by CodeRabbit

  • New Features
    • Streaming outputs now include real-time load metrics, enabling more transparent monitoring during generation.
  • Refactor
    • Reworked data-parallel scheduling to balance requests more evenly across workers, reducing latency spikes and improving overall throughput and responsiveness.

@gemini-code-assist (Contributor)

Summary of Changes

Hello @changhuaixin, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant enhancement to the data parallel system by implementing a 'piggyback' mechanism for server load reports. By embedding real-time load metrics directly into batch output objects, the system gains a more granular and immediate understanding of the workload distribution across its data parallel workers. This improved load awareness enables the DataParallelController to make more intelligent and dynamic decisions when balancing requests, ultimately leading to better performance and resource allocation.

Highlights

  • Load Information Propagation: Batch output objects (BatchTokenIDOutput and BatchStrOutput) now include a GetLoadReqOutput field, allowing server load data to be 'piggybacked' and transmitted along with processing results.
  • Enhanced Data Parallel Budgeting: The DPBudget class in DataParallelController has been updated to track the load of all data parallel ranks. It is initialized with the total data parallel size and maintains a comprehensive view of each rank's load.
  • Improved Load Balancing Logic: The update_budget method within DPBudget now leverages the detailed load information from all ranks to more accurately identify underloaded workers and distribute new requests efficiently, aiming for better resource utilization.
  • Optimized Load Reporting: Load updates are now sent to the scheduler once per batch of processed requests, utilizing the piggybacked load information. This provides a more timely and efficient reporting mechanism compared to solely relying on periodic polling.

@gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces a mechanism for piggybacking server load reports to enable more responsive load balancing in a data-parallel setup. The changes are logical and well-structured. I've identified a potential issue where load updates might be missed in a specific configuration, which could impact load balancing. I have also provided a suggestion to improve code conciseness.

Outdated comment threads:
  • python/sglang/srt/managers/tokenizer_manager.py
  • python/sglang/srt/managers/data_parallel_controller.py
  • python/sglang/srt/managers/tokenizer_manager.py
  • python/sglang/srt/managers/io_struct.py
  • python/sglang/srt/managers/io_struct.py
@sgl-project sgl-project deleted a comment from coderabbitai Bot Oct 20, 2025
@hnyls2002 (Collaborator):

    raise ValueError(f"Invalid object: {obj}")
ValueError: Invalid object: WatchLoadUpdateReq(rid=None, loads=[GetLoadReqOutput(rid=None, dp_rank=None, num_reqs=1, num_waiting_reqs=0, num_tokens=7)])

@sufeng-buaa sufeng-buaa force-pushed the sufeng-buaa/dp_balance branch 2 times, most recently from 5447c83 to 41423c7 Compare October 26, 2025 11:41
@changhuaixin changhuaixin requested a review from zhyncs as a code owner November 10, 2025 03:04
@sufeng-buaa sufeng-buaa force-pushed the sufeng-buaa/dp_balance branch from be2c365 to 3468399 Compare November 10, 2025 07:05
@a4zhangfei (Contributor):

Is it compatible with multiple instances of the tokenizer, i.e., when tokenizer-worker-num > 1?

@changhuaixin (Contributor, Author):

Is it compatible with multiple instances of the tokenizer, i.e., when tokenizer-worker-num > 1?

It is not implemented for the multiple-tokenizer case. With this patch, you can run with tokenizer-worker-num > 1, but no load is reported back. I can support this case; can you describe your use case a bit?

@changhuaixin (Contributor, Author):

    raise ValueError(f"Invalid object: {obj}")
ValueError: Invalid object: WatchLoadUpdateReq(rid=None, loads=[GetLoadReqOutput(rid=None, dp_rank=None, num_reqs=1, num_waiting_reqs=0, num_tokens=7)])

fixed

@a4zhangfei (Contributor):

Is it compatible with multiple instances of the tokenizer, i.e., when tokenizer-worker-num > 1?

It is not implemented for the multiple-tokenizer case. With this patch, you can run with tokenizer-worker-num > 1, but no load is reported back. I can support this case; can you describe your use case a bit?

When deploying the DeepSeek model, we use multiple tokenizer instances to avoid tokenizer bottlenecks.

@changhuaixin (Contributor, Author):

Is it compatible with multiple instances of the tokenizer, i.e., when tokenizer-worker-num > 1?

It is not implemented for the multiple-tokenizer case. With this patch, you can run with tokenizer-worker-num > 1, but no load is reported back. I can support this case; can you describe your use case a bit?

When deploying the DeepSeek model, we use multiple tokenizer instances to avoid tokenizer bottlenecks.

Hi, I have added this support in issue #13052.

@stmatengss (Collaborator):

I will review and test this PR tomorrow @hnyls2002

@sufeng-buaa sufeng-buaa force-pushed the sufeng-buaa/dp_balance branch from e61bfc6 to d253071 Compare December 4, 2025 08:36
@github-actions (Bot) added the documentation label Dec 4, 2025
Signed-off-by: Chang Huaixin (OpenAnolis) <changhuaixin@linux.alibaba.com>
@sufeng-buaa sufeng-buaa force-pushed the sufeng-buaa/dp_balance branch from 07549f1 to 239dd2c Compare December 16, 2025 02:15
@hnyls2002 hnyls2002 merged commit 0c39730 into sgl-project:main Dec 25, 2025
243 of 259 checks passed
Leoyzen pushed a commit to Leoyzen/sglang that referenced this pull request Dec 25, 2025
Signed-off-by: Chang Huaixin (OpenAnolis) <changhuaixin@linux.alibaba.com>
whybeyoung added a commit that referenced this pull request Dec 25, 2025
YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026
Signed-off-by: Chang Huaixin (OpenAnolis) <changhuaixin@linux.alibaba.com>

Labels

documentation, run-ci
