avoid cudaStreamSynchronize in DeepSeekV2AttentionMLA by strgrb · Pull Request #4577 · sgl-project/sglang

strgrb · 2025-03-19T09:20:08Z

Motivation

I profiled DeepSeek and observed bubbles in the timeline. After tracing them, I found the cause:

The bubbles come from a device-to-host (D2H) copy in DeepSeekV2AttentionMLA, which triggers a cudaStreamSynchronize.

Modifications

Use forward_batch.extend_prefix_lens_cpu directly instead of forward_batch.extend_prefix_lens. This avoids the D2H copy and reduces TTFT.
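
To illustrate why the host-side copy helps, here is a minimal sketch rather than the actual diff. It assumes the attention code only needs a yes/no answer about whether any request has a cached prefix; that condition is an assumption, since the diff is not reproduced here. `FakeDeviceTensor` and `FakeForwardBatch` are hypothetical stand-ins that count simulated syncs, so the example runs without a GPU; only the two field names come from this PR.

```python
from dataclasses import dataclass, field
from typing import List


class FakeDeviceTensor:
    """Hypothetical stand-in for a GPU tensor.

    Reading a value back on the host is counted as one simulated
    device-to-host copy (and therefore one stream synchronization).
    """

    sync_count = 0  # number of simulated D2H copies so far

    def __init__(self, values: List[int]):
        self._values = list(values)

    def sum(self) -> "FakeDeviceTensor":
        # The reduction stays on the "device"; nothing is synchronized yet.
        return FakeDeviceTensor([sum(self._values)])

    def item(self) -> int:
        # Pulling a scalar to the host is the point where the sync happens.
        FakeDeviceTensor.sync_count += 1
        return self._values[0]


@dataclass
class FakeForwardBatch:
    """Hypothetical batch object carrying both copies of the prefix lengths."""

    extend_prefix_lens_cpu: List[int]
    extend_prefix_lens: FakeDeviceTensor = field(init=False)

    def __post_init__(self) -> None:
        self.extend_prefix_lens = FakeDeviceTensor(self.extend_prefix_lens_cpu)


def no_prefix_via_device(batch: FakeForwardBatch) -> bool:
    # Assumed "before" pattern: the host branches on a device value.
    return batch.extend_prefix_lens.sum().item() == 0


def no_prefix_via_host(batch: FakeForwardBatch) -> bool:
    # Assumed "after" pattern: plain Python ints, no device round trip.
    return sum(batch.extend_prefix_lens_cpu) == 0
```

Both helpers return the same answer, but only the device variant increments the sync counter. In a real run that sync stalls the host until all queued GPU work finishes, which is what shows up as a bubble in the timeline.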

Checklist

Format your code according to the Code Formatting with Pre-Commit.
Add unit tests as outlined in the Running Unit Tests.
Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.

strgrb · 2025-03-19T09:22:22Z

I benchmarked it with CUDA 12.8 and DeepGEMM.

Before optimization:

{"backend": "sglang", "dataset_name": "random", "request_rate": Infinity, "max_concurrency": 1, "sharegpt_output_len": null, "random_input_len": 4000, "random_output_len": 1000, "random_range_ratio": 1.0, "duration": 288.8952702959068, "completed": 10, "total_input_tokens": 40000, "total_output_tokens": 10000, "total_output_tokens_retokenized": 9955, "request_throughput": 0.034614619996226656, "input_throughput": 138.4584799849066, "output_throughput": 34.61461999622665, "mean_e2e_latency_ms": 28885.89523830451, "median_e2e_latency_ms": 28867.54753405694, "std_e2e_latency_ms": 43.76537623950034, "p99_e2e_latency_ms": 28971.827788078226,"mean_ttft_ms": 914.2480821348727, "median_ttft_ms": 894.8089674813673, "std_ttft_ms": 42.84710320299341, "p99_ttft_ms": 999.2891409643926, "mean_tpot_ms": 27.999646802972617, "median_tpot_ms": 28.00029096845258, "std_tpot_ms": 0.004316766073899561, "p99_tpot_ms": 28.005626542935126, "mean_itl_ms": 27.999637626112442, "median_itl_ms": 27.986736968159676, "std_itl_ms": 0.5853971754709604, "p95_itl_ms": 28.546237177215517, "p99_itl_ms": 29.95921622263268, "concurrency": 0.9998742869247236, "accept_length": null}

After optimization:

{"backend": "sglang", "dataset_name": "random", "request_rate": Infinity, "max_concurrency": 1, "sharegpt_output_len": null, "random_input_len": 4000, "random_output_len": 1000, "random_range_ratio": 1.0, "duration": 288.23054097685963, "completed": 10, "total_input_tokens": 40000, "total_output_tokens": 10000, "total_output_tokens_retokenized": 9955, "request_throughput": 0.03469444967944199, "input_throughput": 138.77779871776798, "output_throughput": 34.694449679441995, "mean_e2e_latency_ms": 28819.560090778396, "median_e2e_latency_ms": 28802.58282844443, "std_e2e_latency_ms": 37.77366937690407, "p99_e2e_latency_ms": 28897.36494206125, "mean_ttft_ms": 827.0063975825906, "median_ttft_ms": 808.830501860939, "std_ttft_ms": 36.31374057753745, "p99_ttft_ms": 901.3628057297319, "mean_tpot_ms": 28.020574267463275, "median_tpot_ms": 28.022242382423276,"std_tpot_ms": 0.00486772951692425, "p99_tpot_ms": 28.02489307463156, "mean_itl_ms": 28.020567437950838, "median_itl_ms": 28.001354075968266, "std_itl_ms": 0.9080998589299655, "p95_itl_ms": 28.64757542265579, "p99_itl_ms": 30.345987735781822, "concurrency": 0.9998787773531658, "accept_length": null}

Mean TTFT drops from 914 ms to 827 ms (about 9.5%), while mean TPOT stays at about 28 ms.
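
For anyone who wants to check the comparison, the relative change can be computed from the two result records above. This is a small sketch: the numbers are copied from the `mean_ttft_ms` and `median_ttft_ms` fields of the two JSON lines, and the helper name `relative_reduction_pct` is made up for this example.

```python
def relative_reduction_pct(before_ms: float, after_ms: float) -> float:
    """Return how much smaller `after_ms` is than `before_ms`, in percent."""
    return (before_ms - after_ms) / before_ms * 100.0


# Values copied from the benchmark records in this comment.
before = {"mean_ttft_ms": 914.2480821348727, "median_ttft_ms": 894.8089674813673}
after = {"mean_ttft_ms": 827.0063975825906, "median_ttft_ms": 808.830501860939}

mean_reduction = relative_reduction_pct(before["mean_ttft_ms"], after["mean_ttft_ms"])
median_reduction = relative_reduction_pct(before["median_ttft_ms"], after["median_ttft_ms"])

print(f"mean TTFT reduction:   {mean_reduction:.2f}%")
print(f"median TTFT reduction: {median_reduction:.2f}%")
```

Each run covers only 10 requests at concurrency 1, so the percentages are indicative rather than a tight estimate.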

use forward_batch.extend_prefix_lens_cpu to avoid cudaStreamSynchroni…

07b11bd

…ze in DeepSeekV2AttentionMLA

strgrb requested review from ByronHsu, Ying1123, hnyls2002, ispobock, merrymercy and zhyncs as code owners March 19, 2025 09:20

zhyncs merged commit df7014a into sgl-project:main Mar 19, 2025

This was referenced Mar 20, 2025

[Fix] fix FlashMLA cudagraph config #4591

Closed

[FA3 Attn Backend] Remove Unnecessary Device Sync for FA3 #4745

Merged

zhyncs merged 1 commit into sgl-project:main from strgrb:dev/avoid_sync