(1/4) Announcing FlashInfer, a kernel library that provides state-of-the-art kernel implementations for LLM Inference/Serving.
FlashInfer's unique features include:
- Comprehensive Attention Kernels: covering prefill/decode/append attention for various KV-Cache formats (Page
Zihao Ye
- We’re thrilled that FlashInfer won a Best Paper Award at MLSys 2025! 🎉 This wouldn’t have been possible without the community: huge thanks to @lmsysorg’s sglang for deep co-design (which is critical for inference kernel evolution) and stress-testing over the years, and to
  Quoted post: 🎉 Congratulations to the FlashInfer team – their technical paper, "FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving," just won best paper at #MLSys2025. 🏆 🙌 We are excited to share that we are now backing FlashInfer – a supporter and
- We are excited to announce FlashInfer v0.2! Core contributions of this release include:
  - Block/Vector Sparse (Paged) Attention on FlashAttention-3
  - JIT compilation for customized attention variants
  - Fused Multi-head Latent Attention (MLA) decoding kernel
  - Lots of bugfix and
- Check out the intra-kernel profiler in flashinfer to visualize the timeline of each SM/warpgroup in the lifecycle of a CUDA persistent kernel: github.com/flashinfer-ai/… You can clearly understand how tensor/CUDA core overlapping, variable-length load balancing, and fusion work.
- (1/3) Memory Bandwidth Efficient Shared Prefix Batch Decoding, brought to you by FlashInfer: blog: flashinfer.ai/2024/02/02/cas… Try out our APIs: docs.flashinfer.ai/api/python/cas…
- 🚀 Excited to announce day-0 support from @NVIDIAAIDev for @OpenAI's gpt-oss model in flashinfer v0.2.10! github.com/flashinfer-ai/… ✅ Speed-of-light Blackwell mxfp4/mxfp8 MoE kernels + attention-sink from trtllm-gen ✅ FA2/FA3 template-based attention-sink support for earlier
  Quoted post: Our open models are here. Both of them. openai.com/open-models
- Check out our #ASPLOS23 presentation tomorrow at 11 in session 4C at Grand D! We’ll be discussing our paper “SparseTIR: Composable Abstractions for Sparse Compilation in Deep Learning”, a compiler for sparse DL workloads built on @ApacheTVM (1/2)
- LLMs are not all about tensor cores. Categorical sampling under filters (top-p/top-k/min-p) is a critical operator in LLMs as vocabulary size grows. FlashInfer uses a sorting-free rejection sampling algorithm for efficient sampling. Check out this great blog post written by @0xsling0
  Quoted post: 🚀 Meet flashinfer.sampling, our sorting-free GPU kernels for lightning-fast #LLM sampling. Our implementation achieves over 50% reduction in sampling time. Blog post: flashinfer.ai/2025/03/10/sam…
- MLC-LLM now supports deploying Llama-2-70B-chat locally (needs an Apple Silicon Mac w/ 50GB VRAM to run).🦙💬🔥 The decoding speed can achieve ~10.0 tokens/s on an M2 Ultra! Try it out at: mlc.ai/mlc-llm/docs/g… and join our discord server: discord.gg/9Xpy2HGBuD
  Quoted post: (1/2) 🦙 Buckle up and ready for a wild llama ride with 70B Llama-2 on a single MacBook 💻 🤯 Now 70B Llama-2 can be run smoothly on an 64G M2 max with 4bit quantization. 👉 Here is a step-by-step guide: mlc.ai/mlc-llm/docs/g… 🚀 How about the performance? It's
- It's our first step toward Compiler for Sparsity in Deep Learning, and there are many intriguing research opportunities in this field. Check out our (experimental) repository at github.com/uwsampl/sparse…, more examples and tutorials are coming. typo: Ruihang Lai instead of Lao :)
  Quoted post: Excited about our upcoming ASPLOS'23 paper on optimizing sparse tensor code using composable abstractions on top of @ApacheTVM by @ye_combinator @tqchenml @junrushao and Ruihang Lao. Great collaboration between @uwcse @CSDatCMU and @OctoML! Pre-print: arxiv.org/abs/2207.04606
- I enjoy using Copilot until I realize it's trying to predict my experiment results when I type <tab>. 😅
- Retrospective on preparing SparseTIR artifact (for ASPLOS 2023):
- Replying to @ye_combinator: (4/4) More interesting results and analysis can be found in our blog post. FlashInfer has been adopted by MLC-LLM from @ApacheTVM, Punica from @abcdabcd987, and the SGLang project from @lmsysorg (we are looking forward to seeing more!!). The AMD and Mac GPU implementations
- Replying to @ye_combinator: (2/4) We analyzed attention kernels' performance bottlenecks for different use cases. For example, grouped query attention (GQA) has higher operational intensity than the original multi-head attention, and GQA's performance will be bounded by low CUDA core performance, and we propose to use
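
The operational-intensity point in (2/4) can be checked with back-of-the-envelope arithmetic. The sketch below is not FlashInfer code: the function name `decode_attention_intensity`, the head counts, and the cost model are all assumptions made for illustration. The model assumes an fp16 KV cache, that each KV head is read from memory once per decode step and shared by every query head in its group, and that query/output traffic is negligible.

```python
def decode_attention_intensity(num_qo_heads, num_kv_heads,
                               head_dim=128, kv_len=4096, bytes_per_elem=2):
    """Approximate FLOPs per byte of KV-cache traffic for one decode step.

    Hypothetical cost model (an assumption, not a measurement):
      * each query head computes q.K^T and p.V over kv_len cached tokens,
        costing 2 * kv_len * head_dim FLOPs per pass (multiply + add);
      * K and V are each loaded once per KV head and reused by the whole
        group of query heads that share that KV head.
    """
    if num_qo_heads % num_kv_heads != 0:
        raise ValueError("num_qo_heads must be a multiple of num_kv_heads")
    flops = num_qo_heads * 2 * (2 * kv_len * head_dim)
    kv_bytes = num_kv_heads * 2 * kv_len * head_dim * bytes_per_elem
    return flops / kv_bytes


# Multi-head attention: one KV head per query head.
print(decode_attention_intensity(num_qo_heads=32, num_kv_heads=32))  # 1.0
# Grouped query attention: 4 query heads share each KV head.
print(decode_attention_intensity(num_qo_heads=32, num_kv_heads=8))   # 4.0
```

Under this simplified model the intensity equals the group size (query heads per KV head) and does not depend on `kv_len` or `head_dim`, so GQA decode performs several times more arithmetic per byte loaded than MHA decode. That is consistent with the post's observation that compute throughput matters more for GQA.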















