[Feature] improve TBO: two chunk overlap #8144
Conversation
Summary of Changes
Hello @House-West, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a significant performance enhancement by implementing a 'two-chunk-overlap' strategy within the existing Two-Batch Overlap (TBO) system. This optimization aims to improve throughput for large language model serving by more efficiently distributing tokens across micro-batches, particularly beneficial for specific hardware setups. The changes include new configuration options, refined batch splitting algorithms, and general code improvements to support this new overlapping mechanism.
Highlights
- Performance Optimization: Introduced 'two-chunk-overlap' functionality within the Two-Batch Overlap (TBO) mechanism to improve throughput, especially for specific hardware configurations like 2x8xH800. Benchmarks show up to 12.56% throughput improvement for special cases (one request with a length of 3072 per dp) and 5.15% for general variable-length inputs (seq_len 0-3k).
- Configurability: Added a new server argument `--two-batch-token-distribution-threshold` (defaulting to 0.48) to control the token distribution between the two batches in micro-batch overlap. This threshold determines whether to perform a standard two-batch overlap or the new two-chunk overlap, with a value of 0 disabling two-chunk overlap.
- Batch Splitting Logic Enhancement: Refactored and enhanced the batch splitting logic within the TBO system. The system now decides whether to split sequences by request count or by token distribution (aiming for a half-sum split) based on the new `two_batch_token_distribution_threshold`, ensuring balanced token allocation for overlapped processing (see the sketch after this list).
- Code Refactoring for Position Computation: Centralized the `compute_position` logic into a single helper function in `forward_batch_info.py`. This change abstracts away the conditional choice between Triton and PyTorch implementations, reducing code duplication and improving maintainability.
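To make the splitting rule concrete, here is a hypothetical sketch (editorial; the function name, the count-based split heuristic, and the exact fallback condition are assumptions, not the PR's actual code, which lives in the TBO batch-splitting logic):

```python
from typing import List, Tuple


def choose_tbo_split(extend_lens: List[int], threshold: float = 0.48) -> Tuple[int, int]:
    """Decide how many tokens each micro-batch receives (hypothetical sketch).

    extend_lens: number of new tokens per request in the batch.
    threshold: minimum fraction of total tokens the smaller micro-batch
        must receive; 0 disables two-chunk overlap entirely.
    """
    total = sum(extend_lens)
    # Plain two-batch overlap: split at request granularity.
    half = len(extend_lens) // 2
    first = sum(extend_lens[:half])
    second = total - first
    if threshold > 0 and min(first, second) < threshold * total:
        # The count-based split is too unbalanced (e.g. a single long
        # request): fall back to a two-chunk split that cuts one request's
        # tokens so both micro-batches get roughly half of the total.
        first, second = total // 2, total - total // 2
    return first, second


# One 3072-token request per dp rank:
print(choose_tbo_split([3072]))       # -> (1536, 1536): two-chunk split
print(choose_tbo_split([3072], 0.0))  # -> (0, 3072): two-chunk disabled
```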
Core is not reviewed yet; I first glanced at the non-core things.
fzyzcjy
left a comment
(I will check again later)
Only that tiny nit, and then it's ready to merge.
LGTM, now we only need to wait for CI to turn green.
@zhyncs Could you review this PR? It is waiting for approval to merge.
@House-West Nice work! For the first case, I'd like to know where the performance gain comes from. Is two-chunk overlap equivalent to two-batch overlap in that case? Thanks.
@ch-wan As mentioned in #6328, when each dp rank has only one request, two-chunk-overlap and two-batch-overlap are not equivalent. In two-batch-overlap: [timeline figure]
In two-chunk-overlap: [timeline figure]
Compared to two-batch-overlap, the latencies of the group GEMM, dispatch, and combine operations of the two micro-batches are close in two-chunk-overlap, which is better for overlapping.
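For a concrete sense of why this helps, consider the single-request case above as a toy calculation (editorial sketch; the numbers and the linear-latency assumption are illustrative, not measurements from the PR):

```python
# Toy illustration (not the PR's code): assume the latency of group GEMM,
# dispatch, and combine scales roughly with the number of tokens processed.
splits = {
    "two-batch-overlap": (3072, 0),     # request granularity: one side empty
    "two-chunk-overlap": (1536, 1536),  # the request is cut into two chunks
}
for name, (a, b) in splits.items():
    # An empty micro-batch leaves nothing to overlap with; balanced
    # micro-batches can hide each other's communication phases.
    imbalance = abs(a - b) / max(a + b, 1)
    print(f"{name}: micro-batches {a}/{b} tokens, imbalance {imbalance:.0%}")
```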
ch-wan
left a comment
Can we add a small CI test to test_dp_attention.py?
I saw the test case of TBO in |
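A minimal sketch of such a CI check (editorial; the server flags below come from this PR's description and SGLang's existing TBO flag, but the test scaffolding, model path, port, and health-check loop are placeholder assumptions, not the repo's actual test utilities):

```python
import subprocess
import time
import urllib.request


def test_two_chunk_overlap_server_starts():
    # Launch the server with TBO enabled and a non-zero distribution
    # threshold; flag names are taken from this PR's description.
    proc = subprocess.Popen([
        "python", "-m", "sglang.launch_server",
        "--model-path", "<MODEL_PATH>",  # placeholder
        "--enable-two-batch-overlap",
        "--tbo-token-distribution-threshold", "0.48",
        "--port", "30000",
    ])
    try:
        for _ in range(120):  # wait up to ~10 minutes for startup
            try:
                urllib.request.urlopen("http://127.0.0.1:30000/health", timeout=2)
                return  # server is healthy, test passes
            except Exception:
                time.sleep(5)
        raise AssertionError("server did not become healthy")
    finally:
        proc.terminate()
```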
@House-West @fzyzcjy @ch-wan I'd like to ask whether you achieved performance improvements using TBO on two machines (1P1D), or whether you achieved results without separating prefill and decode (PD) onto different machines. On my H20 system, performance decreased both with and without PD separation (1P1D). I've seen some publicly available experimental results online where TBO showed improvements, but they used many machines, such as 4P9D and 4P16D. I also tried your parameters, but performance decreased significantly. Did I do something wrong, or does TBO require a large number of machines to achieve performance improvements? My rough analysis suggests that at least three machines are needed for TBO's overhead and communication costs to be offset by the computational overlap, and perhaps five machines are needed to see any improvement. Looking forward to your reply!
@programmer-lxj I previously tested on H800, and there was no performance improvement with TBO on 1P1D (one machine for prefill, one machine for decode), because intra-node communication uses NVLink, which is relatively fast. In most cases, TBO brings performance improvements when communication time is more than 30% of end-to-end time, such as on 2P9D or 4P9D. For 1P1D, you can try SBO (single batch overlap).
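To see why the ~30% communication share matters, here is a rough back-of-envelope model (editorial; the perfect-overlap assumption and the example numbers are illustrative, not measurements):

```python
def tbo_speedup_upper_bound(comp_ms: float, comm_ms: float,
                            overhead_ms: float = 0.0) -> float:
    """Best-case speedup if communication fully hides behind computation.

    Without overlap a step costs comp + comm; with perfect overlap it
    costs max(comp, comm) plus TBO's own splitting/launch overhead.
    """
    return (comp_ms + comm_ms) / (max(comp_ms, comm_ms) + overhead_ms)


# 1P1D over NVLink: communication is a small slice, so little to hide.
print(tbo_speedup_upper_bound(comp_ms=90, comm_ms=10, overhead_ms=5))  # ~1.05x
# Multi-node, communication >30% of end-to-end: real headroom appears.
print(tbo_speedup_upper_bound(comp_ms=60, comm_ms=40, overhead_ms=5))  # ~1.54x
```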
@House-West Thank you very much! I will try SBO.
@House-West I tried using SBO, but I found that this parameter can only be added to the P node; adding it to the D node results in an error. I checked the code and found that the
Motivation
Mentioned in #6328.
I ran some benchmarks on 2 x 8 x H800.
1. Special Case (one request with a length of 3072 per dp)
Experiment setup
- For the baseline and this PR, vary `tbo-token-distribution-threshold`: 0.0 disables two-chunk-overlap, and any threshold > 0 enables it.
- `two-chunk-overlap` will trigger chunked prefill; setting `SGL_CHUNKED_PREFIX_CACHE_THRESHOLD=1` can improve performance.
- The bench-serving script is repeated 5 times, and the 1st run is thrown away because it contains JIT compilation etc. (sketched below).
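The averaging protocol in the last bullet, as a small sketch (editorial; `run_bench_serving` is a hypothetical callable, not the actual bench-serving script):

```python
def average_throughput(run_bench_serving, repeats: int = 5) -> float:
    """Run the benchmark `repeats` times and average all but the first run.

    `run_bench_serving` is a placeholder assumed to return throughput in
    tokens/s; the first run is discarded because it includes JIT
    compilation and other warm-up costs.
    """
    results = [run_bench_serving() for _ in range(repeats)]
    warm_runs = results[1:]  # drop the warm-up run
    return sum(warm_runs) / len(warm_runs)
```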
Experiment result
Throughput
On average, it improves throughput by 12.56%.
2. General Case (variable-length inputs, 30-3072)
Experiment setup
Experiment result
Throughput
On average, it improves throughput by 5.15%. The magnitude of the improvement depends on the distribution of input lengths.
Modifications
Checklist