Zihao Ye (@ye_combinator) / X

Zihao Ye

259 posts

@ye_combinator

Look forward, don't look back

Seattle

Joined October 2017

Zihao Ye
@ye_combinator
Feb 5, 2024
(1/4) Announcing FlashInfer, a kernel library that provides state-of-the-art kernel implementations for LLM Inference/Serving. FlashInfer's unique features include: - Comprehensive Attention Kernels: covering prefill/decode/append attention for various KV-Cache formats (Page
58K
Zihao Ye
@ye_combinator
May 13, 2025
We’re thrilled that FlashInfer won a Best Paper Award at MLSys 2025! 🎉 This wouldn’t have been possible without the community: huge thanks to @lmsysorg’s sglang for deep co-design (which is critical for inference kernel evolution) and stress-testing over the years, and to
NVIDIA AI Developer
@NVIDIAAIDev
May 13, 2025
🎉 Congratulations to the FlashInfer team – their technical paper, "FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving," just won best paper at #MLSys2025. 🏆 🙌 We are excited to share that we are now backing FlashInfer – a supporter and
39K
Zihao Ye
@ye_combinator
Dec 19, 2024
We are excited to announce FlashInfer v0.2! Core contributions of this release include: - Block/Vector Sparse (Paged) Attention on FlashAttention-3 - JIT compilation for customized attention variants - Fused Multi-head Latent Attention (MLA) decoding kernel - Lots of bugfix and
34K
Zihao Ye
@ye_combinator
Mar 6, 2025
Check out the intra-kernel profiler in FlashInfer to visualize the timeline of each SM/warpgroup in the lifecycle of a CUDA persistent kernel: github.com/flashinfer-ai/… You can clearly see how tensor/CUDA core overlapping, variable-length load balancing, and fusion work.
8.8K
Zihao Ye
@ye_combinator
Feb 5, 2024
(1/3) Memory Bandwidth Efficient Shared Prefix Batch Decoding, brought to you by FlashInfer: blog: flashinfer.ai/2024/02/02/cas… Try out our APIs: docs.flashinfer.ai/api/python/cas…
28K
Zihao Ye
@ye_combinator
Aug 5, 2025
🚀 Excited to announce day-0 support from @NVIDIAAIDev for @OpenAI's gpt-oss model in flashinfer v0.2.10! github.com/flashinfer-ai/… ✅ Speed-of-light Blackwell mxfp4/mxfp8 MoE kernels + attention-sink from trtllm-gen ✅ FA2/FA3 template-based attention-sink support for earlier
OpenAI
@OpenAI
Aug 5, 2025
Our open models are here. Both of them. openai.com/open-models
GPT-OSS Support: Add Blackwell MoE mxfp4 implementation from TRTLLM and Attention Sink by joker-eph...
From github.com
6.2K
Zihao Ye
@ye_combinator
Mar 27, 2023
Check out our #ASPLOS23 presentation tomorrow at 11 in session 4C at Grand D! We’ll be discussing our paper “SparseTIR: Composable Abstractions for Sparse Compilation in Deep Learning” - a compiler for sparse DL workloads built on @ApacheTVM (1/2)
9.7K
Zihao Ye
@ye_combinator
Mar 11, 2025
LLMs are not all about tensor cores. Categorical sampling under filters (top-p/top-k/min-p) is a critical operator in LLMs as vocabulary size grows. FlashInfer uses a sorting-free rejection sampling algorithm for efficient sampling. Check out this great blog post written by @0xsling0
Shanli Xing
@shanli_xing
Mar 11, 2025
🚀Meet flashinfer.sampling—our sorting-free GPU kernels for lightning-fast #LLM sampling. Our implementation achieves over 50% reduction in sampling time. Blog post: flashinfer.ai/2025/03/10/sam…
4.8K
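The posts above mention sorting-free rejection sampling for filtered sampling. The following is a minimal sketch of that idea for top-k only, written in plain Python. It is an illustration under stated assumptions, not FlashInfer's GPU kernel: the function name `top_k_rejection_sample` is hypothetical, a single pivot is used, and tokens tied at the k-th probability are all treated as acceptable.

```python
import random


def top_k_rejection_sample(probs, k, rng):
    """Sample a token from the top-k of `probs` without sorting (k >= 1).

    Hypothetical helper for illustration; not part of any library API.
    """
    pivot = 0.0
    while True:
        # Candidates are tokens whose probability is strictly above the pivot.
        weights = [p if p > pivot else 0.0 for p in probs]
        # Draw one candidate in proportion to its probability.
        token = rng.choices(range(len(probs)), weights=weights)[0]
        # Accept if fewer than k tokens are strictly more probable,
        # which means the drawn token lies inside the top-k set.
        higher = sum(1 for p in probs if p > probs[token])
        if higher < k:
            return token
        # Otherwise reject, raise the pivot, and draw again from fewer tokens.
        pivot = probs[token]


rng = random.Random(0)
probs = [0.5, 0.3, 0.15, 0.05]
samples = [top_k_rejection_sample(probs, 2, rng) for _ in range(200)]
```

Each rejection strictly shrinks the candidate set, so the loop terminates, and an accepted draw is distributed in proportion to the original probabilities over the top-k tokens. Every round needs only element-wise comparisons and reductions over the vocabulary, which is why no sort is required.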
Zihao Ye
@ye_combinator
Jul 20, 2023
MLC-LLM now supports deploying Llama-2-70B-chat locally (needs an Apple Silicon Mac w/ 50GB VRAM to run).🦙💬🔥 The decoding speed can achieve ~10.0 tokens/s on an M2 Ultra! Try it out at: mlc.ai/mlc-llm/docs/g… and join our discord server: discord.gg/9Xpy2HGBuD
Junru Shao
@junrushao
Jul 20, 2023
(1/2) 🦙 Buckle up and get ready for a wild llama ride with 70B Llama-2 on a single MacBook 💻 🤯 Now 70B Llama-2 can be run smoothly on a 64G M2 Max with 4bit quantization. 👉 Here is a step-by-step guide: mlc.ai/mlc-llm/docs/g… 🚀 How about the performance? It's
6.5K
Zihao Ye
@ye_combinator
Jan 28, 2023
It's our first step toward Compiler for Sparsity in Deep Learning, and there are many intriguing research opportunities in this field. Check out our (experimental) repository at github.com/uwsampl/sparse…, more examples and tutorials are coming. typo: Ruihang Lai instead of Lao :)
Luis Ceze
@luisceze
Jan 27, 2023
Excited about our upcoming ASPLOS'23 paper on optimizing sparse tensor code using composable abstractions on top of @ApacheTVM by @ye_combinator @tqchenml @junrushao and Ruihang Lao. Great collaboration between @uwcse @CSDatCMU and @OctoML! Pre-print: arxiv.org/abs/2207.04606
GitHub - uwsampl/SparseTIR: SparseTIR: Sparse Tensor Compiler for Deep Learning
From github.com
5.7K
Zihao Ye
@ye_combinator
Jul 1, 2022
I enjoyed using Copilot until I realized it was trying to predict my experiment results when I typed <tab>.😅
Zihao Ye
@ye_combinator
Feb 18, 2023
Retrospective on preparing SparseTIR artifact (for ASPLOS 2023):
Retrospective on SparseTIR artifact
From gist.github.com
2.3K
Zihao Ye
@ye_combinator
Feb 5, 2024
Replying to @ye_combinator
(4/4) More interesting results and analysis can be found in our blog post. FlashInfer has been adopted by MLC-LLM from @ApacheTVM, Punica from @abcdabcd987, and the SGLang project from @lmsysorg (we are looking forward to seeing more!!). The AMD and Mac GPU implementations
2K
Zihao Ye
@ye_combinator
Feb 5, 2024
Replying to @ye_combinator
(2/4) We analyzed attention kernels' performance bottlenecks for different use cases. For example, grouped query attention (GQA) has higher operational intensity than the original multi-head attention, so GQA's performance is bounded by the low performance of CUDA cores, and we propose to use
2.4K
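The operational-intensity claim in (2/4) can be checked with a back-of-the-envelope estimate. The sketch below is a rough roofline-style calculation, not a measurement. It assumes single-token decode, a 2-byte (fp16) KV-cache, and counts only KV-cache reads, ignoring query/output traffic and softmax. The function name and the group size of 8 are illustrative assumptions.

```python
def decode_attention_intensity(group_size, bytes_per_elem=2):
    """Estimate FLOPs per byte of KV-cache traffic for one KV head in decode.

    `group_size` is the number of query heads sharing one KV head
    (1 for multi-head attention, greater than 1 for grouped query attention).
    Hypothetical helper for illustration only.
    """
    # Per cached token and per head-dimension element, every query head in
    # the group does one multiply-add for QK^T and one for PV: 4 FLOPs.
    flops = 4 * group_size
    # One K element and one V element are read once and shared by the group.
    bytes_moved = 2 * bytes_per_elem
    return flops / bytes_moved


mha = decode_attention_intensity(group_size=1)  # multi-head attention
gqa = decode_attention_intensity(group_size=8)  # GQA, assumed 8 query heads per KV head
print(f"MHA: {mha} FLOPs/byte, GQA: {gqa} FLOPs/byte")
```

Under these assumptions the intensity scales linearly with the group size, because the same K and V bytes are reused by every query head in the group while the arithmetic grows with the number of heads.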