
[Core/DBO][1/N] Add Dual-Batch Overlap mechanism to VLLM #23693

Merged
tlrmchlsmth merged 240 commits into vllm-project:main from neuralmagic:sage/dbo-full-cudagraphs on Sep 16, 2025

Conversation

Contributor

@SageMoore SageMoore commented Aug 26, 2025

Purpose

This PR adds support for Dual-Batch Overlap in vLLM. In its current state, it is only enabled when the user provides the --enable-microbatching flag, and it is only used when all DP groups are running full-decode batches. This PR supports running DBO with full cudagraphs, which is essential for minimizing CPU overhead and getting performance out of this feature.

To implement Dual-Batch Overlap (DBO), at a high level, we split the batch into two microbatches. Then, using two threads and two CUDA streams (one for communication and one for computation), we overlap the dispatch and combine all-to-all kernels of one microbatch with the compute kernels of the other microbatch.

When microbatching is enabled and supported, the GPUModelRunner will split the batch into two token_slices. These token_slices are then passed into the attention metadata builders during _prepare_inputs to generate one attention metadata object per microbatch. When actually running the model, the model runner will spawn two microbatching threads that communicate with each other through a UBatchContext. Each of these threads will then run self.model with the appropriate attention metadata.
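As a rough illustration of the split, here is a minimal sketch of carving a flat decode batch into two contiguous token slices, one per microbatch. The helper name and slicing policy are hypothetical; the real vLLM code also has to split request boundaries and attention metadata.

```python
# Hypothetical sketch: split a flat decode batch of num_tokens tokens
# into two contiguous token slices, one per microbatch.

def split_into_ubatch_slices(num_tokens: int) -> tuple[slice, slice]:
    """Split [0, num_tokens) into two contiguous token slices."""
    mid = num_tokens // 2
    return slice(0, mid), slice(mid, num_tokens)

first, second = split_into_ubatch_slices(9)
tokens = list(range(9))
# Together the two slices cover the batch exactly once.
assert tokens[first] + tokens[second] == tokens
```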

Without any additional modifications, this would just result in one microbatch running to completion before the other starts. To get overlap, we've added a "yield" call that can be inserted into the all-to-all kernels to interleave the two microbatches. The yield_and_switch_from_compute_to_comm function yields the CPU from this thread (thread A) to the other microbatching thread (thread B). Once thread A resumes execution, either because thread B yielded the CPU or finished its execution, it will swap over to the communication stream and start dispatching kernels there. yield_and_switch_from_comm_to_compute behaves similarly but in the opposite direction: it swaps from the communication stream to the compute stream.

Both GPU and CPU events synchronize all of this. That said, it is absolutely critical that only one microbatching thread runs at a time (the other must be waiting on an event), and that both microbatches execute exactly the same number of yields.
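The CPU side of this ping-pong can be sketched with plain threading primitives. This is an illustrative toy, not the actual UBatchContext: stream switching and GPU events are elided, but it shows the invariant that exactly one thread runs between yields and that both threads perform the same number of yields.

```python
import threading

class ToyUBatchContext:
    """Toy stand-in for the PR's UBatchContext (CPU hand-off only)."""

    def __init__(self) -> None:
        self.events = [threading.Event(), threading.Event()]
        self.events[0].set()  # microbatch 0 gets the CPU first

    def yield_to_other(self, my_id: int) -> None:
        # Clear my event before waking the peer so the peer's later
        # set() cannot be lost, then block until the peer yields back.
        self.events[my_id].clear()
        self.events[1 - my_id].set()
        self.events[my_id].wait()

trace: list[tuple[int, int]] = []

def run_ubatch(ctx: ToyUBatchContext, ubatch_id: int, num_yields: int) -> None:
    ctx.events[ubatch_id].wait()  # block until this thread holds the CPU
    for step in range(num_yields):
        trace.append((ubatch_id, step))  # "work" (e.g. launch kernels)
        ctx.yield_to_other(ubatch_id)    # swap streams, hand off the CPU
    ctx.events[1 - ubatch_id].set()      # done: let the peer finish

ctx = ToyUBatchContext()
threads = [threading.Thread(target=run_ubatch, args=(ctx, i, 3)) for i in (0, 1)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(trace)  # [(0, 0), (1, 0), (0, 1), (1, 1), (0, 2), (1, 2)]
```

Because the events enforce strict alternation, the interleaving is deterministic: the two "microbatches" take turns step by step, exactly one running at any moment.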

Test Plan

In general, my test plan was to run lm_eval with deepseek-ai/DeepSeek-V2-Lite. We've also run R1 numerous times in a multi-node setup and verified that lm_eval produces reasonable output.

Non-DBO Runs

Eager

Command

VLLM_ALL2ALL_BACKEND=deepep_low_latency vllm serve --model="deepseek-ai/DeepSeek-V2-Lite" --data-parallel-size 2 --enable-expert-parallel --enforce-eager

Result
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3567|±  |0.0277|
|     |       |strict-match    |     5|exact_match|↑  |0.3533|±  |0.0276|

Default

Command

VLLM_ALL2ALL_BACKEND=deepep_low_latency vllm serve --model="deepseek-ai/DeepSeek-V2-Lite" --data-parallel-size 2 --enable-expert-parallel

Result
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3700|±  |0.0279|
|     |       |strict-match    |     5|exact_match|↑  |0.3667|±  |0.0279|

DBO Runs

Eager

Command

VLLM_ALL2ALL_BACKEND=deepep_low_latency vllm serve --model="deepseek-ai/DeepSeek-V2-Lite" --data-parallel-size 2 --enable-expert-parallel --enforce-eager --enable-microbatching --microbatching-token-threshold 4

Result
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3800|±  |0.0281|
|     |       |strict-match    |     5|exact_match|↑  |0.3767|±  |0.0280|

Full cudagraphs

Command

VLLM_ALL2ALL_BACKEND=deepep_low_latency vllm serve --model="deepseek-ai/DeepSeek-V2-Lite" --data-parallel-size 2 --enable-expert-parallel --compilation_config '{"cudagraph_mode": "full_decode_only"}' --enable-microbatching --microbatching-token-threshold 4

Result
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3733|±  |0.0280|
|     |       |strict-match    |     5|exact_match|↑  |0.3700|±  |0.0279|

LucasWilkinson and others added 30 commits May 22, 2025 20:51
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Sage Moore <sage@neuralmagic.com>
Yikun pushed a commit to vllm-project/vllm-ascend that referenced this pull request Sep 20, 2025
)

### What this PR does / why we need it?
1. This pr bump vllm commit to
vllm-project/vllm@6d8246a
2. fix upstream changes vllm-project/vllm#24548
abort multi-modal kwargs, make vllm main and `v0.10.2` both adaptable
3. fix metadata_builder changes introduced by
vllm-project/vllm#23693
4. fix `structured_outputs_config` changes introduced by
vllm-project/vllm#22772
5. fix `moe_config` changes introduced by
vllm-project/vllm#22537

Co-authored-by:  MengqingCao <cmq0113@163.com>
Co-authored-by:  Yikun Jiang <yikunkero@gmail.com>


- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@c60e613

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
weijinqian0 pushed a commit to weijinqian0/vllm-ascend that referenced this pull request Sep 22, 2025
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Sep 22, 2025
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Sep 22, 2025
Contributor

lhtin commented Sep 23, 2025

@SageMoore @LucasWilkinson Could you provide some performance improvement data? I tested DeepSeek V2 Lite locally and observed a performance regression, with the per-step latency increasing from 38ms to 49ms. The process of launching vLLM and the test results are shown below.

According to the Nsys profile data, after enabling DBO, the execution time of both the batched_triton_kernel and vllm::act_and_mul_kernel kernels has increased significantly.

config.yaml:

model: /path/to/DeepSeek-V2-Lite
tensor-parallel-size: 1
data-parallel-size: 2
enable-expert-parallel: true
served-model-name: vllm_infer_1
enable-dbo: true
dbo-decode-token-threshold: 4

launch vllm:

export VLLM_ALL2ALL_BACKEND=deepep_low_latency
vllm serve --config config.yaml

launch bench:

vllm bench serve \
    --model /path/to/DeepSeek-V2-Lite/ \
    --served-model-name vllm_infer_1 \
    --random-input-len 1 \
    --random-output-len 1024 \
    --num-prompts 1000 \
    --max-concurrency 100 \
    --ignore-eos

timeline with dbo: (Nsys timeline screenshots omitted)

timeline without dbo: (Nsys timeline screenshots omitted)

@LucasWilkinson
Collaborator

@SageMoore @LucasWilkinson Could you provide some performance improvement data? I tested DeepSeek V2 Lite locally and observed a performance regression, with the per-step latency increasing from 38ms to 49ms. The process of launching vLLM and the test results are shown below.

According to the Nsys profile data, after enabling DBO, the execution time of both the batched_triton_kernel and vllm::act_and_mul_kernel kernels has increased significantly.

Yes, this is expected. DBO will increase the GEMM time when running a memory-bound workload, since the full model weights have to be loaded twice (once for each microbatch). So DBO is only really beneficial when the communication time is greater than the GEMM time; it's really only intended for multi-node EP setups where communication costs are much higher. It's not expected to provide a speedup in a single-node environment.
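This trade-off can be made concrete with a back-of-envelope, per-layer steady-state model (illustrative only; the numbers and the model itself are assumptions, not measurements). In memory-bound decode, halving the batch does not halve GEMM time because the weights are reloaded either way, while each microbatch only moves half the tokens.

```python
def dbo_speedup(gemm: float, comm: float) -> float:
    """Toy per-layer steady-state model of DBO (illustrative only).

    gemm: GEMM time for the whole batch; memory-bound decode, so each
          half-size microbatch still takes ~gemm (weights reload).
    comm: all-to-all time for the whole batch; each microbatch ~comm/2.
    """
    baseline = gemm + comm  # serialized compute then communicate
    # Steady state: the all-to-all of one microbatch hides behind the
    # GEMMs of the other, so a layer costs ~2 * max(gemm, comm / 2).
    dbo = 2 * max(gemm, comm / 2)
    return baseline / dbo

# Single node, comm is small: DBO is a slowdown.
print(dbo_speedup(gemm=1.0, comm=0.2))  # 0.6
# Multi-node EP, comm dominates: DBO helps.
print(dbo_speedup(gemm=1.0, comm=3.0))  # 1.33...
```

In this toy model the break-even point falls exactly at comm == gemm, consistent with the rule of thumb above that DBO pays off only when communication time exceeds GEMM time.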

@lhtin
Copy link
Copy Markdown
Contributor

lhtin commented Sep 24, 2025

Yes, this is expected. DBO will increase the GEMM time when running a memory-bound workload, since the full model weights have to be loaded twice (once for each microbatch). So DBO is only really beneficial when the communication time is greater than the GEMM time; it's really only intended for multi-node EP setups where communication costs are much higher. It's not expected to provide a speedup in a single-node environment.

Thank you for the explanation. The proportion of communication time I tested on the H20 is indeed very small, less than 10%.

FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
…t#23693)

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Sage Moore <sage@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Co-authored-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Cwndmiao added a commit to Cwndmiao/vllm that referenced this pull request Sep 26, 2025
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025
Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025
weijinqian0 pushed a commit to vllm-project/vllm-ascend that referenced this pull request Oct 25, 2025
### What this PR does / why we need it?
Remove codes of dbo.
Currently, vLLM has supported dbo with pr:
vllm-project/vllm#23693.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main:
vllm-project/vllm@17c540a

Signed-off-by: zzzzwwjj <1183291235@qq.com>
luolun pushed a commit to luolun/vllm-ascend that referenced this pull request Nov 19, 2025
hwhaokun pushed a commit to hwhaokun/vllm-ascend that referenced this pull request Nov 19, 2025
hwhaokun pushed a commit to hwhaokun/vllm-ascend that referenced this pull request Nov 19, 2025
NSDie pushed a commit to NSDie/vllm-ascend that referenced this pull request Nov 24, 2025
NSDie pushed a commit to NSDie/vllm-ascend that referenced this pull request Nov 24, 2025
Clorist33 pushed a commit to Clorist33/vllm-ascend that referenced this pull request Dec 9, 2025
Clorist33 pushed a commit to Clorist33/vllm-ascend that referenced this pull request Dec 10, 2025

gx16377 commented Dec 26, 2025

Hi @SageMoore @LucasWilkinson, sorry for bothering after a long time. I see the comments in slice_query_start_locs say it will break cudagraph compatibility; how did we overcome that for DBO with full cudagraphs? Are there any further changes?


Labels

documentation (Improvements or additions to documentation), ready (ONLY add when PR is ready to merge/full CI is needed), speculative-decoding, v1
