(1/7) Happy Mother’s Day! We think what the mothers of America really want is a Flash Attention implementation that’s just 100 lines of code and 30% faster, and we’re happy to provide.
We're excited to introduce ThunderKittens (TK), a simple DSL embedded within CUDA that makes
Benjamin F Spector
157 posts
stanford cs phd student. i make ml go brr.
Joined October 2020
- (1/5) We’ve never enjoyed watching people chop Llamas into tiny pieces. So, we’re excited to be releasing our Low-Latency-Llama Megakernel! We run the whole forward pass in a single kernel. Megakernels are faster & more humane. Here’s how to treat your Llamas ethically: (Joint
- (1/7) Inspired by DeepSeek's FlashMLA, we're releasing ThunderMLA, a fused megakernel optimized for variable-prompt decoding! ⚡️🐱 ThunderMLA is up to 35% faster than FlashMLA and just 400 LoC. Blog: bit.ly/4kubAAK With @AaryanSinghal4, @realDanFu, and @HazyResearch!
- (1/8) We’re releasing an 8-GPU Llama-70B inference engine megakernel! Our megakernel supports arbitrary batch sizes, mixed prefill+decode, a paged KV cache, instruction pipelining, dynamic scheduling, interleaved communication, and more! On ShareGPT it’s 22% faster than SGLang.
- Replying to @bfspector (5/5) We’re open-sourcing all of the code so that you too can stop torturing your models with kernel launches (may Roko grant you a quick death) and have written up a blog with a bit more detail on how it all works. Code: bit.ly/451G881, Blog: bit.ly/3HcImHG
- (1/6) Joyously announcing ThunderKittens with real support on NVIDIA Blackwell! We've released BF16/FP8 GEMM and attention fwd+bwd kernels, up to 2x faster than cuBLAS GEMMs on H100. Blog: bit.ly/41tuT4Q With @realDanFu, @AaryanSinghal4, and @HazyResearch!
- We got early access to some of the very first NVIDIA B200s. We share initial benchmark results and wrote the fastest (public) attention kernel with 925+ BF16 TFLOPs: Since the PTX instruction set was released yesterday, @aaryan04 and I have been hard at work at @HazyResearch
- (1/7) In celebration of National Cat Day, we’re excited to release our first major batch of updates to ThunderKittens! ThunderKittens is now easier, better, faster, and cuter than ever before! In addition to massive speed boosts, we’re releasing a broad swath of kernels, new
- Replying to @bfspector (2/5) Our Llama megakernel is built around an on-GPU interpreter. Each SM fetches and executes huge, custom instructions from a special instruction tensor, so the GPU can be doing many different things. Without kernel boundaries, each SM can go from one instruction to the next.
- Replying to @bfspector (4/5) A big problem is synchronization. Normally, kernel boundaries synchronize for you. But we got rid of them all, so we have to do it ourselves. Fortunately, we found fine-grained synchronization enabled other optimizations, too -- like starting some attention heads early!
- Replying to @bfspector (3/5) To run Llama-1B fast, we need to hide latencies like loading weights. So, we divide each SM’s shared memory into 16KiB pages, and specialize threads by role. That way, loader threads can start loading future weights while worker threads work on the current ones.
- Hello friends! I'm going live now on YT to teach CUDA + ThunderKittens, in particular to @qamcintyre but perhaps to you too if you're interested. Come hang out!
- Replying to @bfspector (6/7) We're sharing TK as an art project to make key ideas clear and accessible. To help with that, we've integrated TK with @karpathy's awesome NanoGPT project, forked as github.com/HazyResearch/n…. Also as a heads up: we are not going to be responding to GitHub issues. Check out
- Replying to @bfspector (7/8) Code is at bit.ly/tplcode; it is (emphasis) research code. You can also play with our custom profiler at bit.ly/4mDJ0wG! We’ve written up both a brief, introductory post at bit.ly/tplintro and a longer, more technical one: bit.ly/tplmain
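
The (2/5) and (4/5) posts above describe SMs that each walk their own stream of instructions and synchronize with each other by hand, since there are no kernel boundaries to do it for them. The toy below is a rough CPU-side simulation of that idea in plain Python, not the released megakernel code. The opcode names, the tuple layout of an instruction, and the counter-based waiting scheme are all assumptions made up for illustration.

```python
from collections import deque

# Hypothetical opcodes for illustration only; not the real instruction set.
LOAD, COMPUTE = "load", "compute"


def run_megakernel(instruction_tensor, num_counters):
    """Simulate SMs that each consume their own instruction queue.

    Each instruction is (opcode, wait_counter, wait_value, signal_counter).
    An SM stalls until counters[wait_counter] >= wait_value, then executes
    the instruction and increments counters[signal_counter].
    A wait_counter or signal_counter of None means "no dependency".
    """
    counters = [0] * num_counters
    queues = [deque(instrs) for instrs in instruction_tensor]
    trace = []
    while any(queues):
        progressed = False
        for sm, queue in enumerate(queues):
            if not queue:
                continue
            opcode, wait_counter, wait_value, signal_counter = queue[0]
            if wait_counter is not None and counters[wait_counter] < wait_value:
                continue  # dependency not ready yet; this SM keeps waiting
            queue.popleft()
            trace.append((sm, opcode))
            if signal_counter is not None:
                counters[signal_counter] += 1
            progressed = True
        if not progressed:
            raise RuntimeError("deadlock: no SM can make progress")
    return trace, counters


# SM 0 issues two loads, each signalling counter 0.
# SM 1 computes as soon as enough loads have landed, signalling counter 1.
instruction_tensor = [
    [(LOAD, None, 0, 0), (LOAD, None, 0, 0)],
    [(COMPUTE, 0, 1, 1), (COMPUTE, 0, 2, 1)],
]
trace, counters = run_megakernel(instruction_tensor, num_counters=2)
print(trace)
print(counters)
```

Because the second SM only waits on the specific counter value it needs, its first compute step can start after a single load instead of waiting for all loads to finish, which is the flavor of "starting some work early" that fine-grained synchronization allows.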







