
Support cuda graph in the triton attention backend #1401

Merged
merrymercy merged 4 commits into main from triton-cuda-graph
Sep 12, 2024
Conversation

merrymercy (Contributor) commented Sep 12, 2024

Llama 3 8B (1.3x faster)

# triton w/ cuda graph
# Decode.  median latency: 0.00706 s, median throughput:    141.63 token/s
python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-8B --batch-size 1 --input 128 --output 8 --attention-backend triton

# triton w/o cuda graph
# Decode.  median latency: 0.00928 s, median throughput:    107.79 token/s
python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-8B --batch-size 1 --input 128 --output 8 --attention-backend triton --disable-cuda-graph


# flashinfer w/ cuda graph
# Decode.  median latency: 0.00735 s, median throughput:    135.98 token/s
python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-8B --batch-size 1 --input 128 --output 8 --attention-backend flashinfer

# flashinfer w/o cuda graph
# Decode.  median latency: 0.00823 s, median throughput:    121.46 token/s
python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-8B --batch-size 1 --input 128 --output 8 --attention-backend flashinfer --disable-cuda-graph

DeepSeek-Coder-V2-Lite (4x faster)

# triton w/ cuda graph
# Decode.  median latency: 0.00622 s, median throughput:    160.82 token/s
python3 -m sglang.bench_latency --model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --trust-remote-code --batch-size 1 --input 128 --output 8 --enable-mla

# triton w/o cuda graph
# Decode.  median latency: 0.02453 s, median throughput:     40.77 token/s
python3 -m sglang.bench_latency --model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --trust-remote-code --batch-size 1 --input 128 --output 8 --enable-mla --disable-cuda-graph
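
For context, CUDA graph decoding follows a capture-once, replay-many pattern: allocate fixed-size buffers, record one decode step into a graph, then replay it for every token with fresh inputs copied into the same buffers. The sketch below is a generic PyTorch illustration of that pattern, not the actual SGLang/triton-backend code; `decode_step`, the buffer shapes, and `max_bs` are hypothetical stand-ins.

```python
import torch

# Toy stand-in for one decode step; the real backend would launch the triton
# attention kernels here. Shapes and names are illustrative only.
hidden, vocab, max_bs = 4096, 32000, 1
weight = torch.randn(hidden, vocab, device="cuda", dtype=torch.float16)

def decode_step(hidden_states, out_logits):
    out_logits.copy_(hidden_states @ weight)

# CUDA graphs replay fixed memory addresses, so inputs and outputs live in
# pre-allocated static buffers that are reused on every iteration.
static_hidden = torch.zeros(max_bs, hidden, device="cuda", dtype=torch.float16)
static_logits = torch.zeros(max_bs, vocab, device="cuda", dtype=torch.float16)

# Warm up eagerly once, then capture a single decode step into a graph.
decode_step(static_hidden, static_logits)
torch.cuda.synchronize()
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    decode_step(static_hidden, static_logits)

# Per-token decode: copy new activations into the static buffer and replay
# the graph, skipping the per-kernel launch overhead.
static_hidden.copy_(torch.randn_like(static_hidden))
graph.replay()
logits = static_logits.clone()
```

Replaying one pre-recorded graph per decode step removes the Python and kernel-launch overhead that dominates batch-size-1 decoding, which lines up with the larger gain above on DeepSeek-Coder-V2-Lite (which launches many small kernels per step) compared to the dense Llama 3 8B.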

merrymercy merged commit 3efa798 into main on Sep 12, 2024
merrymercy deleted the triton-cuda-graph branch on September 12, 2024 at 07:36
zhyncs (Collaborator) commented Sep 12, 2024

Significant improvement, especially in small batch latency. Accuracy is similar to before.

ref #1285 (comment)

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --enable-mla --trust-remote-code --disable-radix

lm_eval --model local-completions --tasks gsm8k --model_args model=deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct,base_url=http://127.0.0.1:30000/v1/completions,num_concurrent=128,max_retries=3,tokenized_requests=False
# run 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7695|±  |0.0116|
|     |       |strict-match    |     5|exact_match|↑  |0.7559|±  |0.0118|

# run 2
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7801|±  |0.0114|
|     |       |strict-match    |     5|exact_match|↑  |0.7688|±  |0.0116|

# run 3
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7741|±  |0.0115|
|     |       |strict-match    |     5|exact_match|↑  |0.7672|±  |0.0116|

The impact on max throughput is not significant, because with CUDA graph enabled at TP 1, `--mem-fraction-static` has to be lowered to 0.85, otherwise it results in OOM.

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --enable-mla --trust-remote-code --disable-radix --mem-fraction-static 0.85
python3 -m sglang.bench_serving --backend sglang --num-prompts 5000

zhyncs (Collaborator) commented Sep 12, 2024

python3 -m sglang.bench_latency --model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --batch-size 1 --input 128 --output 8 --attention-backend triton --trust-remote-code
python3 -m sglang.bench_latency --model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --batch-size 1 --input 128 --output 8 --attention-backend triton --trust-remote-code --disable-cuda-graph
Decode.  median latency: 0.00793 s, median throughput:    126.09 token/s
Decode.  median latency: 0.03645 s, median throughput:     27.44 token/s

zhyncs (Collaborator) commented Sep 12, 2024

python3 -m sglang.bench_latency --model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --batch-size 1 --input 128 --output 8 --attention-backend triton --trust-remote-code --enable-mla
python3 -m sglang.bench_latency --model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --batch-size 1 --input 128 --output 8 --attention-backend triton --trust-remote-code --enable-mla --disable-cuda-graph
Decode.  median latency: 0.00621 s, median throughput:    161.09 token/s
Decode.  median latency: 0.01916 s, median throughput:     52.19 token/s

fengyang95 commented Sep 13, 2024

Hi @zhyncs @merrymercy, does this support sm_89 (L40)? I see that the CUDA graph path relies on vLLM's fused_moe, but from what I can tell, it does not support sm_89?

merrymercy (Contributor, Author) replied

@fengyang95 It should support L40, but I haven't tested it. I think CUDA graph does not depend on specific ops; it just captures the existing ops.
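
A rough sketch of that point (generic PyTorch, not SGLang code): stream capture records whatever kernels actually get launched inside the capture region, whether they come from triton, cuBLAS, or anything else, so the ops themselves need no CUDA-graph-specific support as long as they already run on the target GPU. The function `some_op` below is a hypothetical placeholder.

```python
import torch

def some_op(x):
    # Whatever kernels this launches (triton JIT kernels, cuBLAS matmuls, ...)
    # are recorded as launched; the graph does not care which backend they came from.
    return torch.relu(x @ x) + 1.0

x = torch.randn(256, 256, device="cuda")
some_op(x)                 # warm-up launch (also triggers any JIT compilation)
torch.cuda.synchronize()

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    y = some_op(x)         # capture: kernels are recorded into the graph

x.copy_(torch.randn_like(x))
g.replay()                 # replay the recorded kernels on the new contents of x
print(y.sum().item())
```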
