DP: support piggyback server load report#11469
Conversation
Code Review
This pull request introduces a mechanism for piggybacking server load reports to enable more responsive load balancing in a data-parallel setup. The changes are logical and well-structured. I've identified a potential issue where load updates might be missed in a specific configuration, which could impact load balancing. I have also provided a suggestion to improve code conciseness.
Is it compatible with multiple instances of the tokenizer, i.e., when …?
It is not implemented for the multiple-tokenizer case. With this patch, you can run with …
Fixed.
When deploying the DeepSeek model, we use multiple tokenizer instances to avoid tokenizer bottlenecks.
Hi, I have added this support in issue #13052.
I will review and test this PR tomorrow @hnyls2002
Motivation
This PR implements a piggyback load reporting mechanism, as proposed in issue #11186.
I have also tested this piggyback load report mechanism with the shortest_queue scheduler, using a new method to count the request queue length.
Modifications
Piggyback load reporting
Load information is generated inside process_batch_result and sent all the way back to the tokenizer manager via stream_output. The original watch-load thread is now removed.
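A minimal sketch of the piggyback idea follows. The class and field names (LoadUpdate, BatchOutput, per_rank_load) are illustrative assumptions, not the exact types in this PR:

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class LoadUpdate:
    """Hypothetical load snapshot piggybacked on a batch result."""
    dp_rank: int
    num_reqs: int      # requests currently held by this DP rank
    timestamp: float   # identifies the batch the report belongs to


@dataclass
class BatchOutput:
    """Stand-in for the batch output object streamed to the tokenizer manager."""
    output_ids: List[int] = field(default_factory=list)
    load: Optional[LoadUpdate] = None


def process_batch_result(dp_rank: int, out: BatchOutput, num_reqs: int) -> BatchOutput:
    # Embed the current load in a result that is already on its way back,
    # instead of having a dedicated watch-load thread poll each rank.
    out.load = LoadUpdate(dp_rank=dp_rank, num_reqs=num_reqs, timestamp=time.time())
    return out


def on_stream_output(out: BatchOutput, per_rank_load: Dict[int, int]) -> None:
    # Tokenizer-manager side: peel off the piggybacked report and refresh
    # the per-rank load view used for dispatch decisions.
    if out.load is not None:
        per_rank_load[out.load.dp_rank] = out.load.num_reqs
```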
Shortest-queue method implementation
Use num_reqs to update the DP budget.
A WatchLoadUpdateReq is received from each DP rank separately, and the DP budget is updated together. A timestamp is used to identify whether the reports come from the same batch. I have also added an environment variable, SGLANG_DATA_PARALLEL_BUDGET_INTERVAL, to control the DPBudget update interval and make the shortest-queue method more stable.
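A rough sketch of a timestamped, interval-throttled budget update is shown below. The environment variable name comes from this PR; the DPBudget structure, its fields, and the default interval are assumptions for illustration:

```python
import os
import time


class DPBudget:
    """Simplified stand-in: tracks per-rank load for shortest-queue dispatch."""

    def __init__(self, dp_size: int):
        self.num_reqs = [0] * dp_size
        self.last_timestamp = 0.0
        self.last_update = 0.0
        # Throttle how often load reports are folded in, keeping the
        # shortest-queue signal stable (env var from this PR; default assumed).
        self.interval = float(
            os.environ.get("SGLANG_DATA_PARALLEL_BUDGET_INTERVAL", "0.1")
        )

    def on_watch_load_update(self, dp_rank: int, num_reqs: int, timestamp: float) -> None:
        now = time.monotonic()
        if now - self.last_update < self.interval:
            return  # too soon since the last budget refresh; skip this report
        # Reports sharing a timestamp come from the same batch; only accept
        # reports at least as new as the last accepted batch.
        if timestamp >= self.last_timestamp:
            self.num_reqs[dp_rank] = num_reqs
            self.last_timestamp = timestamp
            self.last_update = now

    def pick_rank(self) -> int:
        # Shortest queue: dispatch to the least-loaded DP rank and account
        # for the request we are about to send.
        rank = min(range(len(self.num_reqs)), key=self.num_reqs.__getitem__)
        self.num_reqs[rank] += 1
        return rank
```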
Request queue length accounting (deprecated)
This method is deprecated because it uses per-request accounting to obtain the queue length of each DP rank, which is too heavy for the sglang data_parallel_controller.
Instead of returning the queue length of the running batch, a new method counts the queue length in the DP controller (sketched after this section). This method was tried because there is an interval between request dispatch and running-batch accounting, which can be up to two ITLs and can cause DP load imbalance.
I observed this situation when the previous round of requests finishes and a new round arrives, using the shortest_queue scheduler with a max concurrency of 100. On a decode server with DP size 4, the queue length of each DP rank is expected to be at most 25; however, queue lengths of 28 or higher were observed.
The imbalance arises during rounds 3–4: a total of 28 requests is dispatched to DP rank 2, and these requests are not reported until round 5.
With the new method, the 100 requests are balanced across all DP ranks.
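For illustration, the controller-side accounting amounts to something like the sketch below: increment a per-rank counter at dispatch time and decrement it when the rank reports completion, so newly routed requests are visible immediately instead of after up to two ITLs of running-batch accounting. All names here are assumed, not taken from the PR:

```python
import threading


class QueueLengthAccounting:
    """Deprecated approach (sketch): count in-flight requests per DP rank
    inside the data_parallel_controller itself."""

    def __init__(self, dp_size: int):
        self.in_flight = [0] * dp_size
        self.lock = threading.Lock()

    def on_dispatch(self, dp_rank: int) -> None:
        # Counted at dispatch time, so a routed request is visible
        # before the rank's running batch reflects it.
        with self.lock:
            self.in_flight[dp_rank] += 1

    def on_finish(self, dp_rank: int) -> None:
        with self.lock:
            self.in_flight[dp_rank] -= 1

    def shortest_queue_rank(self) -> int:
        with self.lock:
            return min(range(len(self.in_flight)), key=self.in_flight.__getitem__)
```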
Accuracy Tests
Benchmarking and Profiling
In my case, using shortest_queue for the decode server improves throughput by 5.8%, and mean TPOT and ITL both decrease.
With this patch:
Without this patch:
Checklist