
avoid cudaStreamSynchronize in DeepSeekV2AttentionMLA#4577

Merged
zhyncs merged 1 commit into sgl-project:main from strgrb:dev/avoid_sync
Mar 19, 2025

strgrb (Collaborator) commented Mar 19, 2025

Motivation

While profiling DeepSeek, I observed bubbles in the timeline and traced them to their cause:
[image: profiler timeline]
The bubbles come from a device-to-host (D2H) copy in DeepSeekV2AttentionMLA, which forces a cudaStreamSynchronize.

Modifications

Use forward_batch.extend_prefix_lens_cpu directly instead of forward_batch.extend_prefix_lens. This avoids the D2H copy and reduces TTFT.
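The idea can be sketched without a GPU. The snippet below is only an illustration, not the code from this PR: `ForwardBatchSketch` and `no_request_has_prefix` are hypothetical names, and only the field name `extend_prefix_lens_cpu` comes from the description above. It assumes the prefix lengths are already available as a host-side list, so a branch decision can be made without reading a device tensor.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ForwardBatchSketch:
    """Hypothetical stand-in for the real forward batch; only the host-side field is modeled."""

    # Per-request prefix lengths, kept as a plain Python list on the host.
    extend_prefix_lens_cpu: List[int]


def no_request_has_prefix(batch: ForwardBatchSketch) -> bool:
    # Host-only reduction: no GPU tensor is read, so no D2H copy
    # and no stream synchronization is needed to evaluate the branch.
    # Deciding the same thing from a device tensor (for example by
    # reducing it and using the result in an `if`) makes the host
    # wait for the GPU before it can continue launching work.
    return sum(batch.extend_prefix_lens_cpu) == 0


if __name__ == "__main__":
    print(no_request_has_prefix(ForwardBatchSketch([0, 0, 0])))
    print(no_request_has_prefix(ForwardBatchSketch([0, 128, 0])))
```

Because the decision is computed entirely on the host, the CPU can keep enqueuing kernels instead of stalling on the stream, which is where the timeline bubble came from.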

strgrb (Collaborator, Author) commented Mar 19, 2025

I benchmarked it with CUDA 12.8 and DeepGEMM:

  • Before the optimization
{"backend": "sglang", "dataset_name": "random", "request_rate": Infinity, "max_concurrency": 1, "sharegpt_output_len": null, "random_input_len": 4000, "random_output_len": 1000, "random_range_ratio": 1.0, "duration": 288.8952702959068, "completed": 10, "total_input_tokens": 40000, "total_output_tokens": 10000, "total_output_tokens_retokenized": 9955, "request_throughput": 0.034614619996226656, "input_throughput": 138.4584799849066, "output_throughput": 34.61461999622665, "mean_e2e_latency_ms": 28885.89523830451, "median_e2e_latency_ms": 28867.54753405694, "std_e2e_latency_ms": 43.76537623950034, "p99_e2e_latency_ms": 28971.827788078226,"mean_ttft_ms": 914.2480821348727, "median_ttft_ms": 894.8089674813673, "std_ttft_ms": 42.84710320299341, "p99_ttft_ms": 999.2891409643926, "mean_tpot_ms": 27.999646802972617, "median_tpot_ms": 28.00029096845258, "std_tpot_ms": 0.004316766073899561, "p99_tpot_ms": 28.005626542935126, "mean_itl_ms": 27.999637626112442, "median_itl_ms": 27.986736968159676, "std_itl_ms": 0.5853971754709604, "p95_itl_ms": 28.546237177215517, "p99_itl_ms": 29.95921622263268, "concurrency": 0.9998742869247236, "accept_length": null}
  • After the optimization
{"backend": "sglang", "dataset_name": "random", "request_rate": Infinity, "max_concurrency": 1, "sharegpt_output_len": null, "random_input_len": 4000, "random_output_len": 1000, "random_range_ratio": 1.0, "duration": 288.23054097685963, "completed": 10, "total_input_tokens": 40000, "total_output_tokens": 10000, "total_output_tokens_retokenized": 9955, "request_throughput": 0.03469444967944199, "input_throughput": 138.77779871776798, "output_throughput": 34.694449679441995, "mean_e2e_latency_ms": 28819.560090778396, "median_e2e_latency_ms": 28802.58282844443, "std_e2e_latency_ms": 37.77366937690407, "p99_e2e_latency_ms": 28897.36494206125, "mean_ttft_ms": 827.0063975825906, "median_ttft_ms": 808.830501860939, "std_ttft_ms": 36.31374057753745, "p99_ttft_ms": 901.3628057297319, "mean_tpot_ms": 28.020574267463275, "median_tpot_ms": 28.022242382423276,"std_tpot_ms": 0.00486772951692425, "p99_tpot_ms": 28.02489307463156, "mean_itl_ms": 28.020567437950838, "median_itl_ms": 28.001354075968266, "std_itl_ms": 0.9080998589299655, "p95_itl_ms": 28.64757542265579, "p99_itl_ms": 30.345987735781822, "concurrency": 0.9998787773531658, "accept_length": null}

Mean TTFT drops from 914 ms to 827 ms (about 9.5%), while mean TPOT stays essentially unchanged at about 28 ms.
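For reference, the relative changes can be recomputed from the two result records. This is a small sketch that copies the mean values verbatim from the JSON above; the helper name `relative_change_pct` is just an illustrative choice.

```python
# Mean values copied from the two benchmark records above (milliseconds).
before = {
    "mean_ttft_ms": 914.2480821348727,
    "mean_tpot_ms": 27.999646802972617,
    "mean_e2e_latency_ms": 28885.89523830451,
}
after = {
    "mean_ttft_ms": 827.0063975825906,
    "mean_tpot_ms": 28.020574267463275,
    "mean_e2e_latency_ms": 28819.560090778396,
}


def relative_change_pct(old: float, new: float) -> float:
    """Signed percentage change from old to new (negative means faster)."""
    return (new - old) / old * 100.0


changes = {key: relative_change_pct(before[key], after[key]) for key in before}

if __name__ == "__main__":
    for key, pct in changes.items():
        print(f"{key}: {before[key]:.1f} -> {after[key]:.1f} ms ({pct:+.2f}%)")
```

With a 4000-token input and 1000 output tokens per request, TTFT is a small share of end-to-end latency, so the end-to-end change is much smaller than the TTFT change.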
