
[Performance] Use the max num_tokens per DP rank as the CUDA graph batch size#6092

Closed
xpmemeda wants to merge 1 commit into sgl-project:main from xpmemeda:change-to-max

Conversation


@xpmemeda xpmemeda commented May 7, 2025

Motivation

With the --enable-dp-attention flag, cuda_graph_runner captures CUDA graphs using the batch size of each DP rank, but selects and replays the CUDA graph based on the sum of tokens across all DP ranks, which results in selecting an excessively large batch size.

See issue #5527.

Modifications

Change the CUDA graph selection logic from the sum of tokens across all DP ranks to the max per rank.
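The selection change can be sketched as follows. This is a hedged illustration, not sglang's actual code: the captured sizes, the function name `select_graph_size`, and the token counts are all illustrative.

```python
import bisect

# Illustrative list of batch sizes for which CUDA graphs were captured.
captured_sizes = [1, 2, 4, 8, 16, 32, 64, 128]

def select_graph_size(tokens_per_rank, use_max=True):
    """Pick the smallest captured size that fits the workload.

    use_max=True reflects this PR (max tokens on any DP rank);
    use_max=False reflects the previous behavior (sum across ranks).
    """
    need = max(tokens_per_rank) if use_max else sum(tokens_per_rank)
    i = bisect.bisect_left(captured_sizes, need)
    if i == len(captured_sizes):
        return None  # no captured graph fits; fall back to eager mode
    return captured_sizes[i]

tokens = [8, 1, 1, 1, 1, 1, 1, 1]  # 8 DP ranks, uneven load
print(select_graph_size(tokens, use_max=False))  # sum = 15 -> replays the size-16 graph
print(select_graph_size(tokens, use_max=True))   # max = 8  -> replays the size-8 graph
```

With sum-based selection, an uneven load pushes replay onto a much larger captured graph than any single rank needs.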

Benchmarks

Hardware: 16 × H20. Server launch command:

python -u -m sglang.launch_server --model-path /sgl-workspace/DeepSeek-R1 --nnodes 2 --trust-remote-code --served-model-name DeepSeek-R1 --dist-init-addr 29.226.64.238:5000 --node-rank 0 --host 0.0.0.0 --port 8000 --tp 16 --disable-radix-cache --schedule-policy fcfs --chunked-prefill-size 32768 --disable-overlap-schedule --mem-fraction-static 0.79 --attention-backend flashinfer --enable-dp-attention --dp-size 16

Client benchmark command:

python -m sglang.bench_one_batch_server --model None --base-url http://127.0.0.1:8000 --batch-size 128 --input-len 1000 --output-len 1000

Before
[benchmark screenshot]

After
[benchmark screenshot]


Collaborator

ch-wan commented May 7, 2025

@xpmemeda I observe that you are reverting part of the changes in #4390. However, using the sum of tokens is necessary to keep all intermediate tensors static. As an extreme case, if we have 8 DP workers whose batch sizes are [120, 1, 1, 1, 1, 1, 1, 1], the input of GroupedGeMM has size [127, hidden_dim]. After this PR, it becomes [120*8, hidden_dim], and most of the input rows are padded with 0. The correct way to resolve this issue is to set a large batch size for the CUDA graph, or to use DP FFN and LM head. We are working on PRs to support the latter feature.
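To make the padding arithmetic in the extreme case above concrete, here is a small sketch using the batch sizes from the example (the numbers are purely illustrative):

```python
# Per-DP-rank decode batch sizes from the example above.
batch_sizes = [120, 1, 1, 1, 1, 1, 1, 1]

# Sum-based sizing: the gathered FFN input covers exactly the
# effective tokens across all ranks.
sum_rows = sum(batch_sizes)                      # 127 rows

# Max-based sizing: every rank pads to the largest rank's batch,
# so the gathered input grows to max * num_ranks rows.
max_rows = max(batch_sizes) * len(batch_sizes)   # 120 * 8 = 960 rows

padding_ratio = 1 - sum_rows / max_rows
print(sum_rows, max_rows, f"{padding_ratio:.1%} padding")  # 127 960 86.8% padding
```

Under this load pattern, max-based sizing leaves roughly 87% of the gathered rows as zero padding, which is the cost ch-wan is pointing at.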

Author

xpmemeda commented May 8, 2025

if we have 8 DP workers and their processing batch sizes are [120, 1, 1, 1, 1, 1, 1, 1], the input of GroupedGeMM is of size [127, hidden_dim].

@ch-wan As things currently stand, the FFN input size becomes [127 * 8, hidden_dim], not [127, hidden_dim].

Co-authored-by: olafxiong <olafxiong@tencent.com>
@xpmemeda
Author

Gently ping @ch-wan @zhyncs @merrymercy.

@ch-wan
Collaborator

ch-wan commented May 10, 2025

if we have 8 DP workers and their processing batch sizes are [120, 1, 1, 1, 1, 1, 1, 1], the input of GroupedGeMM is of size [127, hidden_dim].

@ch-wan As things currently stand, the FFN input size becomes [127 * 8, hidden_dim], not [127, hidden_dim].

Sorry, I don't understand your response. Could you please add more details?

@ch-wan
Collaborator

ch-wan commented May 10, 2025

The current implementation has been checked internally; it is correct.

To answer your question, bs in this line represents the global batch size so that the FFN input buffer can be reused. This design does not incur redundant computation because the sequence lengths of the padded queries are 0.

Your concern regarding the excessive tensor shape after all-gather is understandable, but it does not hold: we only copy the first few effective tokens to the communication buffer. See this.
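The buffer-reuse point above can be sketched as follows. This is a hedged illustration using NumPy, not sglang's actual code: the buffer name, the helper `copy_effective_tokens`, and the sizes are all illustrative.

```python
import numpy as np

global_bs, hidden = 128, 16
# Statically sized gather buffer: a fixed shape lets the CUDA graph
# reuse the same FFN input buffer across replays.
gather_buffer = np.zeros((global_bs, hidden))

def copy_effective_tokens(local_hidden, rank_offset):
    """Copy only this rank's real tokens into the buffer.

    Rows beyond the effective tokens stay zero-padded, so the
    padded slots carry no real data and no extra attention work.
    """
    n = local_hidden.shape[0]
    gather_buffer[rank_offset:rank_offset + n] = local_hidden

local = np.random.randn(5, hidden)  # this rank produced 5 effective tokens
copy_effective_tokens(local, rank_offset=0)
assert np.array_equal(gather_buffer[:5], local)   # effective rows copied
assert np.abs(gather_buffer[5:]).sum() == 0       # remainder stays padded
```

The design question in the thread is whether those zero-padded rows are truly free: they skip attention (sequence length 0) but still flow through the dense GEMMs.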

@xpmemeda
Author

The current implementation has been checked internally; it is correct.

To answer your question, bs in this line represents the global batch size so that the FFN input buffer can be reused. This design does not incur redundant computation because the sequence lengths of the padded queries are 0.

Your concern regarding the excessive tensor shape after all-gather is understandable, but it does not hold: we only copy the first few effective tokens to the communication buffer. See this.

@ch-wan Got it, I understand now. Thanks for the reply.

@xpmemeda xpmemeda closed this May 10, 2025
@xpmemeda
Author

The current implementation has been checked internally; it is correct.

To answer your question, bs in this line represents the global batch size so that the FFN input buffer can be reused. This design does not incur redundant computation because the sequence lengths of the padded queries are 0.

Your concern regarding the excessive tensor shape after all-gather is understandable, but it does not hold: we only copy the first few effective tokens to the communication buffer. See this.

@ch-wan This raises another question: if no extra computation is introduced, why does DP performance drop so sharply after switching to sum? That is exactly what the issue describes, and my benchmark above confirms it.

@ch-wan
Collaborator

ch-wan commented May 10, 2025

@xpmemeda Probably because the slow run did not activate the CUDA graph. Please consider increasing --cuda-graph-max-bs.

@xpmemeda
Author

xpmemeda commented May 11, 2025

@xpmemeda Probably because the slow run did not activate the CUDA graph. Please consider increasing --cuda-graph-max-bs.

@ch-wan I have confirmed that the CUDA graph is active. Profiling with Nsight shows that using sum increases the latency of most kernels, so it may be worth double-checking whether sum really introduces no extra computation.

With sum (before this PR):
[Nsight profile screenshot]

With max (after this PR):
[Nsight profile screenshot]
