Conversation
Hi @hyhieu, really nice job integrating the new attention backend. Do we have any performance benchmarks against the latest Triton (triton_kernels) and CUTLASS implementations?
Why not just use the code from the FlashAttention repo directly?
+1. If FA gets upgraded, it would be convenient to pick up the update in SGLang.
Just historical reasons. When I started working on this, the FA repository was not complete (e.g., it didn't have paged attention), so I had to implement some of the features myself. Now the FA4 repo has it all, so perhaps we should move over there. In light of this, I think #9928 is better than this PR. I propose to close this one and try to merge #9928 instead. WDYT?
Hi @hyhieu, I think this PR is also good; we can merge the efforts lol. I'm working on this. Thanks!

Motivation
Integrate Flash Attention 4 into SGLang.
Modifications
- Add sglang/srt/layers/attention/cute_ops
- Add blackwell_prefill_attention_backend.py
- Allow --prefill-attention-backend to take the value "fa-cute" (see the example launch command below)
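For reference, this is roughly how the new backend would be enabled when launching a server. A minimal sketch: the model path is a placeholder, and only the --prefill-attention-backend flag and its "fa-cute" value come from this PR.

```bash
# Minimal sketch: launch an SGLang server with the FA4 (CuTe) prefill backend.
# The model path below is a placeholder, not one used in this PR.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --prefill-attention-backend fa-cute
```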
Accuracy Tests
I compared FA4 to the baseline default kernel on GSM8K and MMLU. The results are comparable.
FA4
Baseline
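For context, an accuracy check like the one above is usually reproduced by pointing a GSM8K evaluation script at a running server. The helper module and flag below are assumptions based on SGLang's test utilities, not commands taken from this PR.

```bash
# Sketch, assuming a server is already running (e.g., launched as in the example above)
# and that the few-shot GSM8K test helper is available in this SGLang build.
python -m sglang.test.few_shot_gsm8k --num-questions 200
```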
Benchmarking and Profiling
FA4 improves TTFT by 10% to 20%:
FA4
Baseline:
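As a rough pointer, TTFT comparisons like this are typically collected with the serving benchmark, run once against each backend. The command below is an assumed sketch; it is not the exact invocation behind the numbers above.

```bash
# Sketch: measure TTFT against a running server; repeat with the baseline backend
# and with --prefill-attention-backend fa-cute to compare. Flag values are illustrative.
python -m sglang.bench_serving --backend sglang --num-prompts 100
```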