[Feature] improve TBO: two chunk overlap #8144
Conversation
Summary of Changes
Hello @House-West, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a significant performance enhancement by implementing a 'two-chunk-overlap' strategy within the existing Two-Batch Overlap (TBO) system. This optimization aims to improve throughput for large language model serving by more efficiently distributing tokens across micro-batches, particularly beneficial for specific hardware setups. The changes include new configuration options, refined batch splitting algorithms, and general code improvements to support this new overlapping mechanism.
Highlights
- Performance Optimization: Introduced 'two-chunk-overlap' functionality within the Two-Batch Overlap (TBO) mechanism to improve throughput, especially for specific hardware configurations like 2x8xH800. Benchmarks show up to 12.56% throughput improvement for special cases (one request with a length of 3072 per dp) and 5.15% for general variable-length inputs (seq_len 0-3k).
- Configurability: Added a new server argument `--two-batch-token-distribution-threshold` (defaulting to 0.48) to control the token distribution between the two batches in micro-batch overlap. This threshold determines whether to perform a standard two-batch overlap or the new two-chunk overlap, with a value of 0 disabling two-chunk overlap.
- Batch Splitting Logic Enhancement: Refactored and enhanced the batch splitting logic within the TBO system. The system now decides whether to split sequences by request count or by token distribution (aiming for a half-sum split) based on the new `two_batch_token_distribution_threshold`, ensuring balanced token allocation for overlapped processing (see the sketch after this list).
- Code Refactoring for Position Computation: Centralized the `compute_position` logic into a single helper function in `forward_batch_info.py`. This change abstracts away the conditional choice between Triton and PyTorch implementations, reducing code duplication and improving maintainability.
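To make the splitting rule concrete, here is a hypothetical sketch (editorial; the function name, the count-based split heuristic, and the exact fallback condition are assumptions, not the PR's actual code, which lives in the TBO batch-splitting logic):

```python
from typing import List, Tuple


def choose_tbo_split(extend_lens: List[int], threshold: float = 0.48) -> Tuple[int, int]:
    """Decide how many tokens each micro-batch receives (hypothetical sketch).

    extend_lens: number of new tokens per request in the batch.
    threshold: minimum fraction of total tokens the smaller micro-batch
        must receive; 0 disables two-chunk overlap entirely.
    """
    total = sum(extend_lens)
    # Plain two-batch overlap: split at request granularity.
    half = len(extend_lens) // 2
    first = sum(extend_lens[:half])
    second = total - first
    if threshold > 0 and min(first, second) < threshold * total:
        # The count-based split is too unbalanced (e.g. a single long
        # request): fall back to a two-chunk split that cuts one request's
        # tokens so both micro-batches get roughly half of the total.
        first, second = total // 2, total - total // 2
    return first, second


# One 3072-token request per dp rank:
print(choose_tbo_split([3072]))       # -> (1536, 1536): two-chunk split
print(choose_tbo_split([3072], 0.0))  # -> (0, 3072): two-chunk disabled
```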
Core is not reviewed yet; I first glanced at the non-core things.
fzyzcjy
left a comment
(I will check again later)
Only that tiny nit, and then it's ready to merge.
LGTM, now we only need to wait for CI to turn green.
@zhyncs Could you review this PR? It is waiting for approval to merge.
@House-West Nice work! For the first case, I'd like to know where the performance gain comes from. Is two-chunk overlap equivalent to two-batch overlap in that case? Thanks.
@ch-wan As mentioned in #6328, when each dp rank has only one request, two-chunk-overlap and two-batch-overlap are not equivalent. In two-batch-overlap: [timeline figure]
In two-chunk-overlap: [timeline figure]
Compared to two-batch-overlap, the latencies of the group GEMM, dispatch, and combine operations of the two micro-batches are close in two-chunk-overlap, which is better for overlapping.
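For a concrete sense of why this helps, consider the single-request case above as a toy calculation (editorial sketch; the numbers and the linear-latency assumption are illustrative, not measurements from the PR):

```python
# Toy illustration (not the PR's code): assume the latency of group GEMM,
# dispatch, and combine scales roughly with the number of tokens processed.
splits = {
    "two-batch-overlap": (3072, 0),     # request granularity: one side empty
    "two-chunk-overlap": (1536, 1536),  # the request is cut into two chunks
}
for name, (a, b) in splits.items():
    # An empty micro-batch leaves nothing to overlap with; balanced
    # micro-batches can hide each other's communication phases.
    imbalance = abs(a - b) / max(a + b, 1)
    print(f"{name}: micro-batches {a}/{b} tokens, imbalance {imbalance:.0%}")
```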
ch-wan
left a comment
Can we add a small CI test to test_dp_attention.py?
I saw the test case of TBO in |
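A minimal sketch of such a CI check (editorial; the server flags below come from this PR's description and SGLang's existing TBO flag, but the test scaffolding, model path, port, and health-check loop are placeholder assumptions, not the repo's actual test utilities):

```python
import subprocess
import time
import urllib.request


def test_two_chunk_overlap_server_starts():
    # Launch the server with TBO enabled and a non-zero distribution
    # threshold; flag names are taken from this PR's description.
    proc = subprocess.Popen([
        "python", "-m", "sglang.launch_server",
        "--model-path", "<MODEL_PATH>",  # placeholder
        "--enable-two-batch-overlap",
        "--tbo-token-distribution-threshold", "0.48",
        "--port", "30000",
    ])
    try:
        for _ in range(120):  # wait up to ~10 minutes for startup
            try:
                urllib.request.urlopen("http://127.0.0.1:30000/health", timeout=2)
                return  # server is healthy, test passes
            except Exception:
                time.sleep(5)
        raise AssertionError("server did not become healthy")
    finally:
        proc.terminate()
```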
@House-West @fzyzcjy @ch-wan I'd like to ask whether you achieved performance improvements using TBO on two machines (1P1D), or whether you achieved results without separating prefill and decode (PD) onto different machines. On my H20 system, performance decreased both with and without PD separation (1P1D). I've seen some publicly available experimental results online where TBO showed improvements, but they used many machines, such as 4P9D and 4P16D. I also tried your parameters, but performance decreased significantly. Did I do something wrong, or does TBO require a large number of machines to achieve performance improvements? My rough analysis suggests that at least three machines are needed for TBO's overhead and communication costs to be offset by the computational overlap, and perhaps five machines are needed to see any improvement. Looking forward to your reply!
@programmer-lxj I previously tested on H800, and there was no performance improvement with TBO on 1P1D (one machine for prefill, one machine for decode), because intra-node communication uses NVLink, which is relatively fast. In most cases, TBO brings performance improvements when communication time is more than 30% of end-to-end time, such as on 2P9D or 4P9D. For 1P1D, you can try SBO (single batch overlap).
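To see why the ~30% communication share matters, here is a rough back-of-envelope model (editorial; the perfect-overlap assumption and the example numbers are illustrative, not measurements):

```python
def tbo_speedup_upper_bound(comp_ms: float, comm_ms: float,
                            overhead_ms: float = 0.0) -> float:
    """Best-case speedup if communication fully hides behind computation.

    Without overlap a step costs comp + comm; with perfect overlap it
    costs max(comp, comm) plus TBO's own splitting/launch overhead.
    """
    return (comp_ms + comm_ms) / (max(comp_ms, comm_ms) + overhead_ms)


# 1P1D over NVLink: communication is a small slice, so little to hide.
print(tbo_speedup_upper_bound(comp_ms=90, comm_ms=10, overhead_ms=5))  # ~1.05x
# Multi-node, communication >30% of end-to-end: real headroom appears.
print(tbo_speedup_upper_bound(comp_ms=60, comm_ms=40, overhead_ms=5))  # ~1.54x
```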
@House-West Thank you very much! I will try SBO.
@House-West I tried using SBO, but I found that this parameter can only be added to the P node; adding it to the D node results in an error. I checked the code and found that the
Motivation
Mentioned in #6328.
I ran some benchmarks on 2 x 8 x H800.
1. Special Case (one request with a length of 3072 per dp)
Experiment setup
- For the baseline and this PR, vary `tbo-token-distribution-threshold`: 0.0 disables two-chunk-overlap, and any threshold > 0 enables it.
- `two-chunk-overlap` will trigger chunked prefill; setting `SGL_CHUNKED_PREFIX_CACHE_THRESHOLD=1` can improve performance.
- The bench-serving script is repeated 5 times, and the 1st run is thrown away because it contains JIT compilation etc. (sketched below).
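The averaging protocol in the last bullet, as a small sketch (editorial; `run_bench_serving` is a hypothetical callable, not the actual bench-serving script):

```python
def average_throughput(run_bench_serving, repeats: int = 5) -> float:
    """Run the benchmark `repeats` times and average all but the first run.

    `run_bench_serving` is a placeholder assumed to return throughput in
    tokens/s; the first run is discarded because it includes JIT
    compilation and other warm-up costs.
    """
    results = [run_bench_serving() for _ in range(repeats)]
    warm_runs = results[1:]  # drop the warm-up run
    return sum(warm_runs) / len(warm_runs)
```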
Experiment result
Throughput
On average, it improves throughput by 12.56%.
2. General Case (variable-length inputs, 30-3072)
Experiment setup
Experiment result
Throughput
On average, it improves throughput by 5.15%. The magnitude of the improvement depends on the distribution of input lengths.
Modifications
Checklist