Benjamin F Spector (@bfspector) / X

Benjamin F Spector

157 posts

@bfspector

stanford cs phd student. i make ml go brr.

Joined October 2020

Benjamin F Spector
@bfspector
May 12, 2024
(1/7) Happy mother’s day! We think what the mothers of America really want is a Flash Attention implementation that’s just 100 lines of code and 30% faster, and we’re happy to provide. We're excited to introduce ThunderKittens (TK), a simple DSL embedded within CUDA that makes
302K
Benjamin F Spector
@bfspector
May 27, 2025
(1/5) We’ve never enjoyed watching people chop Llamas into tiny pieces. So, we’re excited to be releasing our Low-Latency-Llama Megakernel! We run the whole forward pass in a single kernel. Megakernels are faster & more humane. Here’s how to treat your Llamas ethically: (Joint
384K
Benjamin F Spector
@bfspector
Mar 5, 2025
(1/7) Inspired by DeepSeek's FlashMLA, we're releasing ThunderMLA—a fused megakernel optimized for variable-prompt decoding! ⚡️🐱ThunderMLA is up to 35% faster than FlashMLA and just 400 LoC. Blog: bit.ly/4kubAAK With @AaryanSinghal4, @realDanFu, and @HazyResearch!
60K
Benjamin F Spector
@bfspector
Sep 28, 2025
(1/8) We’re releasing an 8-GPU Llama-70B inference engine megakernel! Our megakernel supports arbitrary batch sizes, mixed prefill+decode, a paged KV cache, instruction pipelining, dynamic scheduling, interleaved communication, and more! On ShareGPT it’s 22% faster than SGLang.
85K
Benjamin F Spector
@bfspector
May 27, 2025
Replying to @bfspector
(5/5) We’re open-sourcing all of the code so that you too can stop torturing your models with kernel launches (may Roko grant you a quick death) and have written up a blog with a bit more detail on how it all works. Code: bit.ly/451G881, Blog: bit.ly/3HcImHG
11K
Benjamin F Spector
@bfspector
Mar 15, 2025
(1/6) Joyously announcing ThunderKittens with real support on NVIDIA Blackwell! We've released BF16/FP8 GEMM and attention fwd+bwd kernels, up to 2x faster than cuBLAS GEMMs on H100. Blog: bit.ly/41tuT4Q With @realDanFu, @AaryanSinghal4, and @HazyResearch!
hazyresearch.stanford.edu
ThunderKittens Now on Blackwells!
21K
Benjamin F Spector
@bfspector
Jan 25, 2025
We got early access to some of the very first Nvidia B200s. We're sharing initial benchmark results, and we wrote the fastest (public) attention kernel with 925+ BF16 TFLOPs: Since the PTX instruction set released yesterday, @aaryan04 and I have been hard at work at @HazyResearch
18K
Benjamin F Spector
@bfspector
Oct 29, 2024
(1/7) In celebration of National Cat Day, we’re excited to release our first major batch of updates to ThunderKittens! ThunderKittens is now easier, better, faster, and cuter than ever before! In addition to massive speed boosts, we’re releasing a broad swath of kernels, new
36K
Benjamin F Spector
@bfspector
May 27, 2025
Replying to @bfspector
(2/5) Our Llama megakernel is built around an on-GPU interpreter. Each SM fetches and executes huge, custom instructions from a special instruction tensor, so the GPU can be doing many different things at once. Without kernel boundaries, each SM can go straight from one instruction to the next.
11K
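To make the interpreter idea in (2/5) concrete, here is a minimal CPU-side sketch in Python, not the actual megakernel. It assumes a hypothetical instruction format of `(opcode, operand)` pairs and made-up opcodes (`rms_norm`, `matvec`, `attention`); the real instruction encoding and op set are not described in the post.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical instruction format: (opcode, operand). Not the real encoding.
Instruction = Tuple[str, int]


def op_rms_norm(layer: int) -> str:
    return f"rms_norm(layer={layer})"


def op_matvec(layer: int) -> str:
    return f"matvec(layer={layer})"


def op_attention(layer: int) -> str:
    return f"attention(layer={layer})"


# Dispatch table mapping each (made-up) opcode to its handler.
HANDLERS: Dict[str, Callable[[int], str]] = {
    "rms_norm": op_rms_norm,
    "matvec": op_matvec,
    "attention": op_attention,
}


def run_sm(sm_id: int, instruction_tensor: List[List[Instruction]]) -> List[str]:
    """Interpreter loop for one simulated SM: fetch, dispatch, repeat."""
    executed = []
    for opcode, operand in instruction_tensor[sm_id]:
        executed.append(HANDLERS[opcode](operand))
    return executed


# One row of instructions per simulated SM; different SMs do different work.
INSTRUCTION_TENSOR: List[List[Instruction]] = [
    [("rms_norm", 0), ("matvec", 0)],   # SM 0
    [("attention", 0), ("matvec", 1)],  # SM 1
]
```

Calling `run_sm(0, INSTRUCTION_TENSOR)` walks SM 0's row from start to finish with no launch in between, which is the property the post highlights.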
Benjamin F Spector
@bfspector
May 27, 2025
Replying to @bfspector
(4/5) A big problem is synchronization. Normally, kernel boundaries synchronize for you. But we got rid of them all, so we have to do it ourselves. Fortunately, we found fine-grained synchronization enabled other optimizations, too -- like starting some attention heads early!
11K
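One way to picture the fine-grained synchronization in (4/5) is a per-head counter that a consumer waits on, instead of a global barrier. The sketch below is a Python threading analogy under that assumption; the `Counter` class, `run_heads`, and the "two producers per head" setup are hypothetical and are not taken from the released code.

```python
import threading
from typing import List


class Counter:
    """A counter that waiters can block on until it reaches a target."""

    def __init__(self) -> None:
        self._value = 0
        self._cond = threading.Condition()

    def increment(self) -> None:
        with self._cond:
            self._value += 1
            self._cond.notify_all()

    def wait_until(self, target: int) -> None:
        with self._cond:
            self._cond.wait_for(lambda: self._value >= target)


def run_heads(num_heads: int = 4, producers_per_head: int = 2) -> List[int]:
    """Each head starts as soon as its own inputs are ready (no global barrier)."""
    counters = [Counter() for _ in range(num_heads)]
    started: List[int] = []
    started_lock = threading.Lock()

    def consumer(head: int) -> None:
        # Wait only on this head's counter, not on every other head.
        counters[head].wait_until(producers_per_head)
        with started_lock:
            started.append(head)

    def producer(head: int) -> None:
        counters[head].increment()

    threads = [threading.Thread(target=consumer, args=(h,)) for h in range(num_heads)]
    for h in range(num_heads):
        for _ in range(producers_per_head):
            threads.append(threading.Thread(target=producer, args=(h,)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(started)
```

Because each consumer depends only on its own counter, a head whose inputs arrive first can begin first, which is the kind of early start the post mentions.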
Benjamin F Spector
@bfspector
May 27, 2025
Replying to @bfspector
(3/5) To run Llama-1B fast, we need to hide latencies like loading weights. So, we divide each SM’s shared memory into 16KiB pages, and specialize threads by role. That way, loader threads can start loading future weights while worker threads work on the current ones.
9.9K
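The page-and-role scheme in (3/5) resembles a producer/consumer pipeline over a fixed pool of buffers. Below is a rough Python analogy, assuming a hypothetical pool of 4 pages and one loader plus one worker thread; only the 16KiB page size comes from the post, and the "weights" are dummy bytes.

```python
import queue
import threading
from typing import List, Optional, Tuple

PAGE_BYTES = 16 * 1024  # page size stated in the post
NUM_PAGES = 4           # hypothetical pool size, chosen for illustration


def run_pipeline(num_chunks: int, num_pages: int = NUM_PAGES) -> List[Tuple[int, int]]:
    """Loader fills free pages ahead of the worker; worker frees pages when done."""
    free_pages: "queue.Queue[int]" = queue.Queue()
    for page in range(num_pages):
        free_pages.put(page)
    ready: "queue.Queue[Optional[Tuple[int, int]]]" = queue.Queue()
    pages: List[bytes] = [b""] * num_pages
    results: List[Tuple[int, int]] = []

    def loader() -> None:
        for chunk in range(num_chunks):
            page = free_pages.get()  # blocks only when every page is in use
            pages[page] = bytes([chunk % 256]) * PAGE_BYTES  # dummy "weights"
            ready.put((chunk, page))
        ready.put(None)  # signal that there is nothing left to load

    def worker() -> None:
        while True:
            item = ready.get()
            if item is None:
                break
            chunk, page = item
            results.append((chunk, sum(pages[page][:4])))  # stand-in for compute
            free_pages.put(page)  # hand the page back to the loader

    threads = [threading.Thread(target=loader), threading.Thread(target=worker)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With more than one page in the pool, the loader can run several chunks ahead of the worker, so load time overlaps with compute instead of adding to it.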
Benjamin F Spector
@bfspector
Oct 27, 2024
Hello friends! I'm going live now on YT to teach CUDA + ThunderKittens, in particular to @qamcintyre but perhaps to you too if you're interested. Come hang out!
4K
Benjamin F Spector
@bfspector
May 12, 2024
Replying to @bfspector
(6/7) We're sharing TK as an art project to make key ideas clear and accessible. To help with that, we've integrated TK with @karpathy's awesome NanoGPT project, forked as github.com/HazyResearch/n…. Also as a heads up: we are not going to be responding to GitHub issues. Check out
GitHub - HazyResearch/train-tk: train with kittens!
From github.com
5.7K
Benjamin F Spector
@bfspector
Sep 28, 2025
Replying to @bfspector
(7/8) Code is at bit.ly/tplcode; it is (emphasis) research code. You can also play with our custom profiler at bit.ly/4mDJ0wG! We’ve written up both a brief, introductory post at bit.ly/tplintro and a longer, more technical one: bit.ly/tplmain
4.3K