[Feat] optimize Qwen3 on H20 by hybrid Attention Backend #6151

Closed

TianQiLin666666 wants to merge 3 commits into sgl-project:main from bytedance-iaas:feat/attn_backend

Conversation

@TianQiLin666666
Collaborator

Motivation

FA3 decode performance is significantly lower than flashinfer's on H20. Related issues:

#5630
Dao-AILab/flash-attention#1572

I profiled the Qwen3-235B-A22B performance on H20 (TP=8) with the command below.

SGLANG_TORCH_PROFILER_DIR=/data/sglang_profilers/QWEN3_tp8_32_3500_25_96G python3 -m sglang.bench_offline_throughput --model-path /data/models/Qwen3-235B-A22B --dataset-path /data/datasets/ShareGPT_V3_unfiltered_cleaned_split.json --dataset-name random --num-prompts 32 --random-input-len 3500 --random-output-len 25 --random-range-ratio 1 --tp-size 8 --trust-remote-code --disable-radix-cache --mem-fraction-static 0.9 --profile --reasoning-parser qwen3 --attention-backend {fa3, flashinfer}

Prefill

  • FA3: 554us
  • flashinfer: 722us

Decode

  • FA3: 98us
  • flashinfer: 35us

The results indicate that on H20, FA3 is slower than flashinfer for decode but faster than flashinfer for prefill.

Modifications

Use FA3 for prefill and flashinfer for decode on H20 for Qwen3 models. To preserve compatibility, a server argument --enable-flashinfer-attention-decode is introduced that determines whether to enforce flashinfer as the attention backend during the decode phase, while the prefill phase retains the original --attention-backend configuration. By default, this argument is disabled and both the prefill and decode phases use the backend specified by --attention-backend, as before.
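For illustration, here is a minimal Python sketch of the dispatch idea, assuming a wrapper class with the init_forward_metadata / forward_extend / forward_decode split that sglang-style attention backends use; the names below are simplified stand-ins, not the exact code in this PR.

# Sketch only: wrap two existing backends and route by forward mode,
# e.g. FA3 for prefill and flashinfer for decode on H20.
class HybridAttnBackend:
    def __init__(self, prefill_backend, decode_backend):
        self.prefill_backend = prefill_backend
        self.decode_backend = decode_backend

    def init_forward_metadata(self, forward_batch):
        # Let the backend that will run this batch build its own metadata
        # (page tables, plan/wrapper state, etc.).
        if forward_batch.forward_mode.is_decode():
            self.decode_backend.init_forward_metadata(forward_batch)
        else:
            self.prefill_backend.init_forward_metadata(forward_batch)

    def forward_extend(self, q, k, v, layer, forward_batch):
        # Prefill (extend) keeps the backend chosen by --attention-backend.
        return self.prefill_backend.forward_extend(q, k, v, layer, forward_batch)

    def forward_decode(self, q, k, v, layer, forward_batch):
        # Decode is routed to flashinfer when
        # --enable-flashinfer-attention-decode is set.
        return self.decode_backend.forward_decode(q, k, v, layer, forward_batch)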

Evaluation

Summary

  • sglang 0.4.6.post2
  • H20-96GB
  • CUDA 12.4.131
  • pytorch 2.6.0+cu124

Throughput (input + output tokens/s)

Model            Config                                  FA3 (default)   Flashinfer   Hybrid (FA3 prefill + flashinfer decode)
Qwen3-235B-A22B  TP=8, QPS=32, input/output=3500/1500          2875.43      3466.48    3534.57
Qwen3-235B-A22B  TP=8, QPS=32, input/output=8192/200          10438.92     10275.08   10793.59
Qwen3-30B-A3B    TP=1, QPS=32, input/output=3500/1500          3048.44      4249.24    4305.04
Qwen3-32B        TP=1, QPS=4, input/output=3500/1500            228.47       249.49     249.88

The Qwen3-235B-A22B (TP=8) records are shown below as an example.

Commands

### server
# FA3 (default)
python3 -m sglang.launch_server --model /data/models/Qwen3-235B-A22B --tp 8 --reasoning-parser qwen3 --port 8080

# Flashinfer
python3 -m sglang.launch_server --model /data/models/Qwen3-235B-A22B --tp 8 --reasoning-parser qwen3 --port 8080 --attention-backend flashinfer

# Hybrid (FA3 for prefill and Flashinfer for decode)
python3 -m sglang.launch_server --model /data/models/Qwen3-235B-A22B --tp 8 --reasoning-parser qwen3 --port 8080 --enable-flashinfer-attention-decode

### client
python3 -m sglang.bench_serving --backend sglang \
            --dataset-name random \
            --dataset-path /data/datasets/ShareGPT_V3_unfiltered_cleaned_split.json \
            --random-input-len 3500 \
            --random-output-len 1500 \
            --random-range-ratio 1 \
            --request-rate 32 \
            --max-concurrency 32 \
            --num-prompts 128 \
            --host 0.0.0.0 --port 8080

Results

  • FA3 (default)
============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    32.0      
Max request concurrency:                 32        
Successful requests:                     128       
Benchmark duration (s):                  222.58    
Total input tokens:                      448000    
Total generated tokens:                  192000    
Total generated tokens (retokenized):    191973    
Request throughput (req/s):              0.58      
Input token throughput (tok/s):          2012.80   
Output token throughput (tok/s):         862.63    
Total token throughput (tok/s):          2875.43   
Concurrency:                             31.95     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   55551.83  
Median E2E Latency (ms):                 55102.26  
---------------Time to First Token----------------
Mean TTFT (ms):                          4597.17   
Median TTFT (ms):                        4514.20   
P99 TTFT (ms):                           8477.12   
---------------Inter-Token Latency----------------
Mean ITL (ms):                           33.99     
Median ITL (ms):                         31.61     
P95 ITL (ms):                            33.84     
P99 ITL (ms):                            34.42     
Max ITL (ms):                            7467.86   
==================================================
  • Flashinfer
============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    32.0      
Max request concurrency:                 32        
Successful requests:                     128       
Benchmark duration (s):                  184.63    
Total input tokens:                      448000    
Total generated tokens:                  192000    
Total generated tokens (retokenized):    191985    
Request throughput (req/s):              0.69      
Input token throughput (tok/s):          2426.54   
Output token throughput (tok/s):         1039.95   
Total token throughput (tok/s):          3466.48   
Concurrency:                             31.94     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   46065.23  
Median E2E Latency (ms):                 46034.10  
---------------Time to First Token----------------
Mean TTFT (ms):                          4825.35   
Median TTFT (ms):                        4708.10   
P99 TTFT (ms):                           9092.60   
---------------Inter-Token Latency----------------
Mean ITL (ms):                           27.51     
Median ITL (ms):                         24.82     
P95 ITL (ms):                            26.18     
P99 ITL (ms):                            26.53     
Max ITL (ms):                            9292.85   
==================================================
  • Hybrid
============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    32.0      
Max request concurrency:                 32        
Successful requests:                     128       
Benchmark duration (s):                  181.07    
Total input tokens:                      448000    
Total generated tokens:                  192000    
Total generated tokens (retokenized):    191987    
Request throughput (req/s):              0.71      
Input token throughput (tok/s):          2474.20   
Output token throughput (tok/s):         1060.37   
Total token throughput (tok/s):          3534.57   
Concurrency:                             31.94     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   45176.10  
Median E2E Latency (ms):                 45040.04  
---------------Time to First Token----------------
Mean TTFT (ms):                          4495.42   
Median TTFT (ms):                        4383.04   
P99 TTFT (ms):                           8057.08   
---------------Inter-Token Latency----------------
Mean ITL (ms):                           27.14     
Median ITL (ms):                         24.54     
P95 ITL (ms):                            25.84     
P99 ITL (ms):                            26.23     
Max ITL (ms):                            8783.36   
==================================================


@hebiao064
Collaborator

Nice PR! And thanks for the detailed benchmark!

How about providing two optional server args:

  • prefill attention backend
  • decode attention backend

And if they are not provided, we fall back to the existing --attention-backend (as sketched below).
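For illustration, a minimal sketch of that fallback, using hypothetical flag names --prefill-attention-backend and --decode-attention-backend; the actual argument names and wiring are left to the follow-up work.

import argparse

# Sketch only: phase-specific backends default to the shared one.
parser = argparse.ArgumentParser()
parser.add_argument("--attention-backend", default="fa3")
# Hypothetical flags, for illustration.
parser.add_argument("--prefill-attention-backend", default=None)
parser.add_argument("--decode-attention-backend", default=None)
args = parser.parse_args()

# Fall back to --attention-backend when a phase-specific flag is absent.
prefill_backend = args.prefill_attention_backend or args.attention_backend
decode_backend = args.decode_attention_backend or args.attention_backend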

@TianQiLin666666
Collaborator Author

> Nice PR! And thanks for the detailed benchmark!
>
> How about providing two optional server args:
>
>   • prefill attention backend
>   • decode attention backend
>
> And if they are not provided, we fall back to the existing --attention-backend.

OK, I will do it.

@Qiaolin-Yu
Collaborator

> > Nice PR! And thanks for the detailed benchmark!
> >
> > How about providing two optional server args:
> >
> >   • prefill attention backend
> >   • decode attention backend
> >
> > And if they are not provided, we fall back to the existing --attention-backend.
>
> OK, I will do it.

Hi @TianQiLin666666 , thanks for the great work! I will finish the follow-up on this PR. I have opened a new PR #6338, and added you as a coauthor 👀.

@zhyncs zhyncs closed this May 18, 2025
@HanHan009527 HanHan009527 deleted the feat/attn_backend branch December 16, 2025 16:21