[Performance] Use the max num_tokens per DP rank as the CUDA graph batch size #6092
xpmemeda wants to merge 1 commit into sgl-project:main
Conversation
@xpmemeda I see that you are reverting part of the changes in #4390. However, using the sum of tokens is necessary to keep all intermediate tensors static. As an extreme case, if we have 8 DP workers and their batch sizes are [120, 1, 1, 1, 1, 1, 1, 1], the input of GroupedGeMM has size [127, hidden_dim]. After this PR, it becomes [120*8, hidden_dim], with most of the input tensor padded with zeros. The correct way to resolve this issue is to set a larger batch size for CUDA graph, or to use DP FFN and LM head. We are preparing PRs to support the latter feature.
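For illustration, here is a minimal sketch of the padding argument above; the batch sizes and hidden_dim are made-up values, and these computations are not the actual SGLang code path:

```python
# Hypothetical illustration of the padding argument (not SGLang code).
dp_batch_sizes = [120, 1, 1, 1, 1, 1, 1, 1]  # tokens per DP rank (made up)
hidden_dim = 7168                            # made-up hidden size

# Sizing by the sum of tokens (current main): the gathered GroupedGeMM
# input holds exactly the effective tokens from all ranks.
sum_rows = sum(dp_batch_sizes)                        # 127
print(f"sum-based GroupedGeMM input: [{sum_rows}, {hidden_dim}]")

# Sizing by the max (this PR): every rank is padded to the largest rank's
# batch, so most rows of the gathered input are zero padding.
max_rows = max(dp_batch_sizes) * len(dp_batch_sizes)  # 960
print(f"max-based GroupedGeMM input: [{max_rows}, {hidden_dim}]")
```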
@ch-wan The current situation is that the FFN input size becomes [127 * 8, hidden_dim], not [127, hidden_dim].
Co-authored-by: olafxiong <olafxiong@tencent.com>
Gently ping. @ch-wan @zhyncs @merrymercy
Sorry, I don't understand your response. Could you please add more details?
The current implementation was checked internally; it is correct. To answer your question: your concern about an excessive tensor shape after all-gather is reasonable but does not hold in practice. We only copy the first several effective tokens to the communication buffer. See this.
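As a rough sketch of the "copy only the effective tokens" point (hypothetical names and shapes; the real buffer management in SGLang differs):

```python
import torch

# Hypothetical sketch: a statically shaped communication buffer whose
# shape never changes (keeping CUDA graph replay valid), into which only
# the effective tokens are copied each step.
def fill_comm_buffer(comm_buffer: torch.Tensor,
                     local_hidden: torch.Tensor) -> None:
    num_effective = local_hidden.shape[0]
    # Only the first `num_effective` rows carry real data; the remaining
    # rows are left untouched rather than recomputed.
    comm_buffer[:num_effective].copy_(local_hidden)

buf = torch.zeros(1024, 7168)     # static capacity (made-up numbers)
hidden = torch.randn(127, 7168)   # 127 effective tokens this step
fill_comm_buffer(buf, hidden)
```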
@ch-wan OK, I understand now. Thanks for the reply.
@ch-wan This raises another question: since no extra computation is introduced, why does DP performance drop so much after switching to sum? That is exactly what the issue describes, and my earlier benchmarks also confirmed it.
@xpmemeda Probably because the slow run did not activate CUDA graph. Please consider increasing


Motivation
With the --enable-dp-attention flag, cuda_graph_runner captures CUDA graphs using the batch size of each individual DP rank, but uses the sum of tokens across all DP ranks to select and replay the graph, which results in selecting an excessively large batch size.
See issue #5527.
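To make the mismatch concrete, a simplified sketch (the batch sizes and capture list are made up; the real logic lives in cuda_graph_runner and differs in detail):

```python
# Simplified sketch of the capture/replay mismatch (made-up numbers).
dp_tokens = [16, 16, 16, 16]                 # decode tokens per DP rank
captured_sizes = [1, 2, 4, 8, 16, 32, 64]    # per-rank capture batch sizes

# Replay selection on main is keyed on the *sum* across ranks:
selected = min(bs for bs in captured_sizes if bs >= sum(dp_tokens))
print(selected)   # 64 -- much larger than any single rank's 16 tokens
```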
Modifications
Change the CUDA graph selection logic from the sum of tokens to the max of tokens across DP ranks.
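Conceptually this is a one-line change in the replay-selection key, roughly as follows (paraphrased with assumed names, not the literal diff):

```python
# Paraphrase of the change (variable names assumed, not the exact diff).
global_num_tokens = [16, 16, 16, 16]        # tokens per DP rank this step

num_tokens_old = sum(global_num_tokens)     # 64: selects an oversized graph
num_tokens_new = max(global_num_tokens)     # 16: matches per-rank captures
```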
Benchmarks
H20 * 16:
python -u -m sglang.launch_server --model-path /sgl-workspace/DeepSeek-R1 --nnodes 2 --trust-remote-code --served-model-name DeepSeek-R1 --dist-init-addr 29.226.64.238:5000 --node-rank 0 --host 0.0.0.0 --port 8000 --tp 16 --disable-radix-cache --schedule-policy fcfs --chunked-prefill-size 32768 --disable-overlap-schedule --mem-fraction-static 0.79 --attention-backend flashinfer --enable-dp-attention --dp-size 16
python -m sglang.bench_one_batch_server --model None --base-url http://127.0.0.1:8000 --batch-size 128 --input-len 1000 --output-len 1000
Before: (benchmark screenshot)
After: (benchmark screenshot)