(Won't do) Prepare Scheduler Metadata: Dao-AILab/flash-attention@fa60e7c (Per Tri Dao's note, this only saves 2 µs. We can keep an eye on it, but we do not recommend adopting it.)
For Llama models, we observed that speculative decoding with Top K > 1 is slightly slower than with the FlashInfer backend. This needs comprehensive profiling and optimization. @MrAta
We explored and discussed some ideas and want to write them down for tracking. Community developers are also welcome to try out the unfinished ones.
Skip the `len` operation and get the batch size directly from the forward batch: FA3 speed up: skip len operation and get batch size directly from forward batch #5969 @lifuhuang
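
To illustrate the idea behind the item above, here is a minimal sketch, not the code from #5969. The `ForwardBatch` class, its `batch_size` and `seq_lens` fields, and both helper functions are hypothetical stand-ins chosen for this example. The point is only that a size stored once when the batch is built can be read directly instead of being recomputed with `len` on every forward step.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ForwardBatch:
    """Hypothetical stand-in for a forward batch; not the project's actual class."""

    batch_size: int       # stored once when the batch is constructed
    seq_lens: List[int]   # per-request sequence lengths


def batch_size_via_len(batch: ForwardBatch) -> int:
    # Derives the size from the per-request data on every call.
    return len(batch.seq_lens)


def batch_size_direct(batch: ForwardBatch) -> int:
    # Reads the value the batch already carries, so no len() call is needed.
    return batch.batch_size


batch = ForwardBatch(batch_size=3, seq_lens=[17, 42, 8])
```

Both helpers return the same value for a consistently built batch. Any saving would come from avoiding the repeated call on the hot path, and its size would have to be measured in the real backend.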