Conversation
Hi @hyhieu, really nice job integrating the new attention backend. Do we have any performance benchmarks against the latest Triton (triton_kernels) and CUTLASS implementations?
Why not just use the code from the FlashAttention repo directly?
+1. If FA gets upgraded, it would be convenient to pick up the update in SGLang.
Just historical reasons. When I started working on this, the FA repository was not complete (e.g., it didn't have paged attention), so I had to implement some of the features myself. Now the FA4 repo has it all, so perhaps we should move over there. In light of this, I think #9928 is better than this PR. I propose to close this one and try to merge #9928 instead. WDYT?
Hi @hyhieu, I think this PR is also good; we can merge the efforts lol. I'm working on this. Thanks!

Motivation
Integrate Flash Attention 4 into SGLang.
Modifications
- Add sglang/srt/layers/attention/cute_ops
- Add blackwell_prefill_attention_backend.py
- Allow --prefill-attention-backend to take the value "fa-cute" (see the example launch command below)
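For reference, this is roughly how the new backend would be enabled when launching a server. A minimal sketch: the model path is a placeholder, and only the --prefill-attention-backend flag and its "fa-cute" value come from this PR.

```bash
# Minimal sketch: launch an SGLang server with the FA4 (CuTe) prefill backend.
# The model path below is a placeholder, not one used in this PR.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --prefill-attention-backend fa-cute
```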
Accuracy Tests
I compared FA4 to the baseline default kernel on GSM8K and MMLU. The results are comparable.
FA4
Baseline
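For context, an accuracy check like the one above is usually reproduced by pointing a GSM8K evaluation script at a running server. The helper module and flag below are assumptions based on SGLang's test utilities, not commands taken from this PR.

```bash
# Sketch, assuming a server is already running (e.g., launched as in the example above)
# and that the few-shot GSM8K test helper is available in this SGLang build.
python -m sglang.test.few_shot_gsm8k --num-questions 200
```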
Benchmarking and Profiling
FA4 improves TTFT by 10% to 20%:
FA4
Baseline:
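As a rough pointer, TTFT comparisons like this are typically collected with the serving benchmark, run once against each backend. The command below is an assumed sketch; it is not the exact invocation behind the numbers above.

```bash
# Sketch: measure TTFT against a running server; repeat with the baseline backend
# and with --prefill-attention-backend fa-cute to compare. Flag values are illustrative.
python -m sglang.bench_serving --backend sglang --num-prompts 100
```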