I have been testing and recording the output throughput of SGLang on 2*8 H100 GPUs, and I've observed a significant regression in output throughput for long outputs in the enable-dp-attention scenarios following this PR. Through debugging and profiling with Nsight Systems, I confirmed that the performance degradation is caused by the CUDA graph not being properly launched.
See the code:

```python
if self.enable_dp_attention:
    total_global_tokens = sum(forward_batch.global_num_tokens_cpu)
    is_bs_supported = forward_batch.can_run_dp_cuda_graph and (
        total_global_tokens in self.graphs
        if self.disable_padding
        else total_global_tokens <= self.max_bs
    )
```
With `enable-dp-attention`, `total_global_tokens` equals the sum of tokens across all DP ranks. For example, during the decode phase with DP = TP = 16 and a per-rank batch size of 32, `total_global_tokens` would be 32 * 16 = 512. However, the maximum batch size allowed for CUDA graph capture defaults to 160. As a result, the `can_run` check during the decode phase returns `False`, and CUDA graph execution is consequently skipped.
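The failing condition can be reproduced in isolation. The sketch below is not SGLang's actual class, just a standalone mirror of the quoted `is_bs_supported` logic (names and the default `max_bs = 160` are taken from the snippet and the numbers above):

```python
def can_run_dp_cuda_graph(global_num_tokens_cpu, graphs, disable_padding, max_bs=160):
    """Standalone mirror of the is_bs_supported condition from the quoted code."""
    total_global_tokens = sum(global_num_tokens_cpu)
    if disable_padding:
        return total_global_tokens in graphs
    return total_global_tokens <= max_bs

# Decode phase, DP = TP = 16, per-rank batch size 32 -> 512 global tokens.
per_rank_tokens = [32] * 16
print(can_run_dp_cuda_graph(per_rank_tokens, graphs={}, disable_padding=False))
# 512 > 160, so this prints False and CUDA graph replay is skipped.
```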
In fact, according to the design logic of the PR, I don't consider this a bug—it consistently uses the `total_global_tokens` across all ranks as the batch size to be captured by the CUDA graph. A straightforward solution would be to set a sufficiently large `cuda-graph-max-bs` when launching the server, though this might consume a significant amount of additional memory.
I believe using the `num_tokens` per DP rank as the CUDA graph batch size might be a more reasonable approach, similar to the code prior to this PR. It would only require reserving adequate space for the `gathered_buffer`.
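As a rough sketch of what I mean (hypothetical helper, not a patch against SGLang): the capture/replay decision would key on the local per-rank token count, while the `gathered_buffer` would still be sized for the global sum across ranks:

```python
def can_run_per_rank(local_num_tokens, global_num_tokens_cpu, max_bs=160):
    """Hypothetical per-rank variant of the batch-size check."""
    # Capture/replay decision uses the per-rank batch size...
    bs_ok = local_num_tokens <= max_bs
    # ...while the gathered (all-gather) buffer must still hold the global total,
    # i.e. up to max_bs tokens from each DP rank.
    buffer_ok = sum(global_num_tokens_cpu) <= max_bs * len(global_num_tokens_cpu)
    return bs_ok and buffer_ok

# Same decode scenario as above: per-rank bs 32, DP = 16.
print(can_run_per_rank(32, [32] * 16))
# 32 <= 160 and 512 <= 2560, so this prints True and the graph could run.
```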
Below are my output throughput measurements before this PR, after this PR, and with my proposed fix.
2*8 H100, input_len=output_len=1000, DP=TP=16
| Concurrency | before PR | after PR | after PR-fix |
|---|---|---|---|
| 1024 | 5115.13 | 2581.07 | 5469.53 |
| 512 | 3897.78 | 1527.95 | 4509.96 |