
[Fix] fix FlashMLA cudagraph config #4591

Closed
FlamingoPg wants to merge 1 commit into sgl-project:main from FlamingoPg:fix_flashmla

Conversation

@FlamingoPg
Collaborator

Motivation

Fix FlashMLA cudagraph config.

Below are some benchmark results using FlashMLA for decode.

python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --trust-remote-code --tp 8 --attention-backend flashinfer --page-size 64 --enable-flashmla
python3 -m sglang.bench_one_batch_server --model None --base-url http://localhost:30000 --batch-size 1 2 4 8 16 32 --input-len 256 --output-len 256
• FlashMLA

| Batch size | Latency (s) | Output throughput (token/s) | (Input + output) throughput (token/s) |
|---|---|---|---|
| 1 | 8.38 | 30.56 | 61.13 |
| 2 | 10.36 | 49.43 | 98.86 |
| 4 | 10.17 | 100.70 | 201.39 |
| 8 | 11.70 | 175.04 | 350.08 |
| 16 | 12.70 | 322.57 | 645.14 |
| 32 | 15.25 | 537.16 | 1074.32 |

• FlashInfer

| Batch size | Latency (s) | Output throughput (token/s) | (Input + output) throughput (token/s) |
|---|---|---|---|
| 1 | 8.82 | 29.02 | 58.04 |
| 2 | 10.17 | 50.37 | 100.74 |
| 4 | 9.94 | 103.06 | 206.12 |
| 8 | 12.07 | 169.72 | 339.44 |
| 16 | 12.87 | 318.35 | 636.70 |
| 32 | 15.26 | 536.66 | 1073.32 |
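As a quick sanity check, the reported output throughput is consistent with batch_size × output_len / latency; e.g. for FlashMLA at batch size 32, 32 × 256 tokens / 15.25 s ≈ 537 token/s, and the (input + output) throughput is exactly double because input-len equals output-len (256).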

With longer sequences, the performance advantages of FlashMLA become more significant.

Modifications

Checklist

Co-authored-by: sleepcoo <sleepcoo@gmail.com>
@FlamingoPg
Collaborator Author

FlamingoPg commented Mar 19, 2025

This also avoids the CPU-GPU synchronization introduced by PR #4514. cc: @zhyncs @merrymercy
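For context, here is a minimal sketch (not the actual SGLang/FlashMLA code, and the buffer name is hypothetical) of the general pattern for avoiding a CPU-GPU sync when updating metadata used by a captured CUDA graph:

```python
import torch

# Hypothetical, simplified example: keep per-step metadata in a pre-allocated
# device buffer so that updating it before a CUDA graph replay never forces
# the host to wait on the GPU.
seq_lens_buf = torch.zeros(64, dtype=torch.int32, device="cuda")  # read by the captured graph

def update_metadata(new_seq_lens: torch.Tensor) -> None:
    # Anti-pattern: calling new_seq_lens.cpu() or .item() here would block the
    # host until the GPU catches up, adding a sync on every decode step.
    # Instead, copy device-to-device into the static buffer the graph reads from.
    seq_lens_buf[: new_seq_lens.numel()].copy_(new_seq_lens, non_blocking=True)
```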

@zhyncs
Collaborator

zhyncs commented Mar 20, 2025

ref #4577

@tianchongchong
Contributor


Hi @yinfan98, which GPU were the above test results collected on? And are there any results compared with the triton backend?
