[Feat] optimize Qwen3 on H20 by hybrid Attention Backend #6151

Closed

TianQiLin666666 wants to merge 3 commits into sgl-project:main from bytedance-iaas:feat/attn_backend

Conversation

@TianQiLin666666
Collaborator

Motivation

FA3 decode performance is significantly lower than flashinfer's on H20. Related issues:

#5630
Dao-AILab/flash-attention#1572

I profiled the Qwen3-235B-A22B performance on H20 (TP=8) with the command below.

SGLANG_TORCH_PROFILER_DIR=/data/sglang_profilers/QWEN3_tp8_32_3500_25_96G python3 -m sglang.bench_offline_throughput --model-path /data/models/Qwen3-235B-A22B --dataset-path /data/datasets/ShareGPT_V3_unfiltered_cleaned_split.json --dataset-name random --num-prompts 32 --random-input-len 3500 --random-output-len 25 --random-range-ratio 1 --tp-size 8 --trust-remote-code --disable-radix-cache --mem-fraction-static 0.9 --profile --reasoning-parser qwen3 --attention-backend {fa3, flashinfer}

Prefill

  • FA3: 554us
  • flashinfer: 722us

Decode

  • FA3: 98us
  • flashinfer: 35us

The results indicate that on H20, FA3 is slower than flashinfer for decode but faster than flashinfer for prefill.

Modifications

Use FA3 for prefill and flashinfer for decode on H20 for Qwen3 models. To preserve compatibility, a server argument --enable-flashinfer-attention-decode is introduced that determines whether to enforce flashinfer as the attention backend during the decode phase, while the prefill phase retains the original --attention-backend configuration. By default, this argument is disabled and both the prefill and decode phases use the backend specified by --attention-backend, as before.
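For illustration, here is a minimal Python sketch of the dispatch idea, assuming a wrapper class with the init_forward_metadata / forward_extend / forward_decode split that sglang-style attention backends use; the names below are simplified stand-ins, not the exact code in this PR.

# Sketch only: wrap two existing backends and route by forward mode,
# e.g. FA3 for prefill and flashinfer for decode on H20.
class HybridAttnBackend:
    def __init__(self, prefill_backend, decode_backend):
        self.prefill_backend = prefill_backend
        self.decode_backend = decode_backend

    def init_forward_metadata(self, forward_batch):
        # Let the backend that will run this batch build its own metadata
        # (page tables, plan/wrapper state, etc.).
        if forward_batch.forward_mode.is_decode():
            self.decode_backend.init_forward_metadata(forward_batch)
        else:
            self.prefill_backend.init_forward_metadata(forward_batch)

    def forward_extend(self, q, k, v, layer, forward_batch):
        # Prefill (extend) keeps the backend chosen by --attention-backend.
        return self.prefill_backend.forward_extend(q, k, v, layer, forward_batch)

    def forward_decode(self, q, k, v, layer, forward_batch):
        # Decode is routed to flashinfer when
        # --enable-flashinfer-attention-decode is set.
        return self.decode_backend.forward_decode(q, k, v, layer, forward_batch)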

Evaluation

Summary

  • sglang 0.4.6.post2
  • H20-96GB
  • CUDA 12.4.131
  • pytorch 2.6.0+cu124

Throughput (input + output tokens/s)

Model            Config                                  FA3 (default)   Flashinfer   Hybrid (FA3 prefill + flashinfer decode)
Qwen3-235B-A22B  TP=8, QPS=32, input/output=3500/1500          2875.43      3466.48    3534.57
Qwen3-235B-A22B  TP=8, QPS=32, input/output=8192/200          10438.92     10275.08   10793.59
Qwen3-30B-A3B    TP=1, QPS=32, input/output=3500/1500          3048.44      4249.24    4305.04
Qwen3-32B        TP=1, QPS=4, input/output=3500/1500            228.47       249.49     249.88

The Qwen3-235B-A22B (TP=8) records are shown below as an example.

Commands

### server
# FA3 (default)
python3 -m sglang.launch_server --model /data/models/Qwen3-235B-A22B --tp 8 --reasoning-parser qwen3 --port 8080

# Flashinfer
python3 -m sglang.launch_server --model /data/models/Qwen3-235B-A22B --tp 8 --reasoning-parser qwen3 --port 8080 --attention-backend flashinfer

# Hybrid (FA3 for prefill and Flashinfer for decode)
python3 -m sglang.launch_server --model /data/models/Qwen3-235B-A22B --tp 8 --reasoning-parser qwen3 --port 8080 --enable-flashinfer-attention-decode

### client
python3 -m sglang.bench_serving --backend sglang \
            --dataset-name random \
            --dataset-path /data/datasets/ShareGPT_V3_unfiltered_cleaned_split.json \
            --random-input-len 3500 \
            --random-output-len 1500 \
            --random-range-ratio 1 \
            --request-rate 32 \
            --max-concurrency 32 \
            --num-prompts 128 \
            --host 0.0.0.0 --port 8080

Results

  • FA3 (default)
============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    32.0      
Max request concurrency:                 32        
Successful requests:                     128       
Benchmark duration (s):                  222.58    
Total input tokens:                      448000    
Total generated tokens:                  192000    
Total generated tokens (retokenized):    191973    
Request throughput (req/s):              0.58      
Input token throughput (tok/s):          2012.80   
Output token throughput (tok/s):         862.63    
Total token throughput (tok/s):          2875.43   
Concurrency:                             31.95     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   55551.83  
Median E2E Latency (ms):                 55102.26  
---------------Time to First Token----------------
Mean TTFT (ms):                          4597.17   
Median TTFT (ms):                        4514.20   
P99 TTFT (ms):                           8477.12   
---------------Inter-Token Latency----------------
Mean ITL (ms):                           33.99     
Median ITL (ms):                         31.61     
P95 ITL (ms):                            33.84     
P99 ITL (ms):                            34.42     
Max ITL (ms):                            7467.86   
==================================================
  • Flashinfer
============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    32.0      
Max request concurrency:                 32        
Successful requests:                     128       
Benchmark duration (s):                  184.63    
Total input tokens:                      448000    
Total generated tokens:                  192000    
Total generated tokens (retokenized):    191985    
Request throughput (req/s):              0.69      
Input token throughput (tok/s):          2426.54   
Output token throughput (tok/s):         1039.95   
Total token throughput (tok/s):          3466.48   
Concurrency:                             31.94     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   46065.23  
Median E2E Latency (ms):                 46034.10  
---------------Time to First Token----------------
Mean TTFT (ms):                          4825.35   
Median TTFT (ms):                        4708.10   
P99 TTFT (ms):                           9092.60   
---------------Inter-Token Latency----------------
Mean ITL (ms):                           27.51     
Median ITL (ms):                         24.82     
P95 ITL (ms):                            26.18     
P99 ITL (ms):                            26.53     
Max ITL (ms):                            9292.85   
==================================================
  • Hybrid
============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    32.0      
Max request concurrency:                 32        
Successful requests:                     128       
Benchmark duration (s):                  181.07    
Total input tokens:                      448000    
Total generated tokens:                  192000    
Total generated tokens (retokenized):    191987    
Request throughput (req/s):              0.71      
Input token throughput (tok/s):          2474.20   
Output token throughput (tok/s):         1060.37   
Total token throughput (tok/s):          3534.57   
Concurrency:                             31.94     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   45176.10  
Median E2E Latency (ms):                 45040.04  
---------------Time to First Token----------------
Mean TTFT (ms):                          4495.42   
Median TTFT (ms):                        4383.04   
P99 TTFT (ms):                           8057.08   
---------------Inter-Token Latency----------------
Mean ITL (ms):                           27.14     
Median ITL (ms):                         24.54     
P95 ITL (ms):                            25.84     
P99 ITL (ms):                            26.23     
Max ITL (ms):                            8783.36   
==================================================


@hebiao064
Collaborator

Nice PR! And thanks for the detailed benchmark!

How about providing two optional server args:

  • prefill attention backend
  • decode attention backend

And if they are not provided, we fall back to the existing --attention-backend (as sketched below).
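For illustration, a minimal sketch of that fallback, using hypothetical flag names --prefill-attention-backend and --decode-attention-backend; the actual argument names and wiring are left to the follow-up work.

import argparse

# Sketch only: phase-specific backends default to the shared one.
parser = argparse.ArgumentParser()
parser.add_argument("--attention-backend", default="fa3")
# Hypothetical flags, for illustration.
parser.add_argument("--prefill-attention-backend", default=None)
parser.add_argument("--decode-attention-backend", default=None)
args = parser.parse_args()

# Fall back to --attention-backend when a phase-specific flag is absent.
prefill_backend = args.prefill_attention_backend or args.attention_backend
decode_backend = args.decode_attention_backend or args.attention_backend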

@TianQiLin666666
Collaborator Author

> Nice PR! And thanks for the detailed benchmark!
>
> How about providing two optional server args:
>
>   • prefill attention backend
>   • decode attention backend
>
> And if they are not provided, we fall back to the existing --attention-backend.

OK, I will do it.

@Qiaolin-Yu
Collaborator

> > Nice PR! And thanks for the detailed benchmark!
> >
> > How about providing two optional server args:
> >
> >   • prefill attention backend
> >   • decode attention backend
> >
> > And if they are not provided, we fall back to the existing --attention-backend.
>
> OK, I will do it.

Hi @TianQiLin666666 , thanks for the great work! I will finish the follow-up on this PR. I have opened a new PR #6338, and added you as a coauthor 👀.

@zhyncs zhyncs closed this May 18, 2025
@HanHan009527 HanHan009527 deleted the feat/attn_backend branch December 16, 2025 16:21