
[Fix] fix FlashMLA cudagraph config #4591

Closed
FlamingoPg wants to merge 1 commit into sgl-project:main from FlamingoPg:fix_flashmla

Conversation

@FlamingoPg
Collaborator

Motivation

Fix FlashMLA cudagraph config.

Below are some benchmark results using FlashMLA for decode.

python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --trust-remote-code --tp 8 --attention-backend flashinfer --page-size 64 --enable-flashmla
python3 -m sglang.bench_one_batch_server --model None --base-url http://localhost:30000 --batch-size 1 2 4 8 16 32 --input-len 256 --output-len 256
• FlashMLA

| Batch size | Latency (s) | Output throughput (token/s) | (Input + output) throughput (token/s) |
|---|---|---|---|
| 1 | 8.38 | 30.56 | 61.13 |
| 2 | 10.36 | 49.43 | 98.86 |
| 4 | 10.17 | 100.70 | 201.39 |
| 8 | 11.70 | 175.04 | 350.08 |
| 16 | 12.70 | 322.57 | 645.14 |
| 32 | 15.25 | 537.16 | 1074.32 |

• FlashInfer

| Batch size | Latency (s) | Output throughput (token/s) | (Input + output) throughput (token/s) |
|---|---|---|---|
| 1 | 8.82 | 29.02 | 58.04 |
| 2 | 10.17 | 50.37 | 100.74 |
| 4 | 9.94 | 103.06 | 206.12 |
| 8 | 12.07 | 169.72 | 339.44 |
| 16 | 12.87 | 318.35 | 636.70 |
| 32 | 15.26 | 536.66 | 1073.32 |
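As a quick sanity check, the reported output throughput is consistent with batch_size × output_len / latency; e.g. for FlashMLA at batch size 32, 32 × 256 tokens / 15.25 s ≈ 537 token/s, and the (input + output) throughput is exactly double because input-len equals output-len (256).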

With longer sequences, the performance advantages of FlashMLA become more significant.

Modifications

Checklist

Co-authored-by: sleepcoo <sleepcoo@gmail.com>
@FlamingoPg
Collaborator Author

FlamingoPg commented Mar 19, 2025

This also avoids the CPU-GPU synchronization introduced by PR #4514. cc: @zhyncs @merrymercy
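For context, here is a minimal sketch (not the actual SGLang/FlashMLA code, and the buffer name is hypothetical) of the general pattern for avoiding a CPU-GPU sync when updating metadata used by a captured CUDA graph:

```python
import torch

# Hypothetical, simplified example: keep per-step metadata in a pre-allocated
# device buffer so that updating it before a CUDA graph replay never forces
# the host to wait on the GPU.
seq_lens_buf = torch.zeros(64, dtype=torch.int32, device="cuda")  # read by the captured graph

def update_metadata(new_seq_lens: torch.Tensor) -> None:
    # Anti-pattern: calling new_seq_lens.cpu() or .item() here would block the
    # host until the GPU catches up, adding a sync on every decode step.
    # Instead, copy device-to-device into the static buffer the graph reads from.
    seq_lens_buf[: new_seq_lens.numel()].copy_(new_seq_lens, non_blocking=True)
```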

@zhyncs
Collaborator

zhyncs commented Mar 20, 2025

ref #4577

@tianchongchong
Contributor


Hi @yinfan98, which GPU were the above test results collected on? And are there any results compared with the triton backend?
