Further Speed up FA3 Backend

We explored and discussed several ideas and are writing them down here for tracking. Community developers are welcome to try the unfinished items.

- [x] (Good first issue) Skip the `len` operation and read the length directly from the forward batch: https://github.com/sgl-project/sglang/pull/5969 @lifuhuang
- [ ] GQA head packing: https://github.com/Dao-AILab/flash-attention/blob/main/hopper/flash_attn_interface.py#L658 Change the value on that line to `True` and run the benchmark.
- [x] Split-KV (a.k.a. Flash Decoding): already enabled. It is indeed faster in small-batch, long-context scenarios. A benchmark will be attached.
- [ ] PDL (Programmatic Dependent Launch): https://github.com/Dao-AILab/flash-attention/commit/000090d02f0398e9087a8823fc1f5242becfac99
- [x] (Won't do) Prepare Scheduler Metadata: https://github.com/Dao-AILab/flash-attention/commit/fa60e7cc97300b4b26721983df580a7da7a8ebea (Per Tri Dao's note, it saves only 2 µs. We can keep an eye on it, but we do not recommend adopting it.)
- [ ] For Llama models, we observed that speculative decoding with Top K > 1 is slightly slower than with the FlashInfer backend. This needs comprehensive profiling and optimization. @MrAta
- [x] Replace the pad operation with a copy: https://github.com/sgl-project/sglang/pull/5945
- [x] Remove is_fa3_supported from fa3 kernel: https://github.com/sgl-project/sglang/pull/6112
- [x] Remove pad operation for all decode cases: https://github.com/sgl-project/sglang/pull/6077 
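
For anyone picking up the Split-KV item, here is a minimal pure-Python sketch of the idea behind it. It is not the FA3 kernel: it only illustrates, for a single query, how attention over independent KV chunks can be merged exactly using each chunk's log-sum-exp. The names `attend` and `split_kv_attend` are hypothetical and exist only for this illustration.

```python
import math
import random


def attend(q, keys, values):
    """Softmax attention for one query over one KV chunk.

    Returns (output, lse), where lse is the log-sum-exp of the scaled
    scores. The lse is what allows chunks to be merged later.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total
           for d in range(dim)]
    return out, m + math.log(total)


def split_kv_attend(q, keys, values, num_splits):
    """Attend over KV chunks independently, then merge the partial results."""
    chunk = math.ceil(len(keys) / num_splits)
    partials = [attend(q, keys[i:i + chunk], values[i:i + chunk])
                for i in range(0, len(keys), chunk)]
    # Each chunk's share of the full softmax denominator is exp(lse_i).
    # Subtract the largest lse first so the exponentials stay bounded.
    max_lse = max(lse for _, lse in partials)
    coeffs = [math.exp(lse - max_lse) for _, lse in partials]
    norm = sum(coeffs)
    dim = len(partials[0][0])
    return [sum(c * out[d] for c, (out, _) in zip(coeffs, partials)) / norm
            for d in range(dim)]


# Small check on assumed toy data: the split result matches full attention.
rng = random.Random(0)
q = [rng.uniform(-1, 1) for _ in range(4)]
keys = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(10)]
values = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(10)]

full, _ = attend(q, keys, values)
split = split_kv_attend(q, keys, values, num_splits=3)
assert all(abs(a - b) < 1e-9 for a, b in zip(full, split))
```

On a GPU each chunk would be processed in parallel, which is why this helps most when the batch is too small to occupy the device and the context is long.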

Further Speed up FA3 Backend #5810

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Further Speed up FA3 Backend #5810

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions