Tri Dao
@tri_dao
Asst. Prof @PrincetonCS, Chief Scientist @togethercompute. Machine learning & systems.
Stanford, CA
Joined May 2012
Posts
  • Pinned
    FlashAttention is widely used to accelerate Transformers, already making attention 4-8x faster, but has yet to take advantage of modern GPUs. We’re releasing FlashAttention-3: 1.5-2x faster on FP16, up to 740 TFLOPS on H100 (75% util), and FP8 gets close to 1.2 PFLOPS! 1/
  • user avatar
    Announcing FlashAttention-2! We released FlashAttention a year ago, making attn 2-4x faster, and it is now widely used in most LLM libraries. Recently I’ve been working on the next version: 2x faster than v1, 5-9x vs standard attn, reaching 225 TFLOPs/s training speed on A100. 1/
  • user avatar
    Very excited to announce that I've finished my PhD @Stanford and will be joining @Princeton CS department as an Assistant Professor in Fall 2024. Looking forward to working with students and colleagues @PrincetonCS on ML & systems!
  • user avatar
    Announcing FlashAttention, a fast and memory-efficient attention algorithm with no approximation! 📣 w/ @realDanFu By reducing GPU memory reads/writes, FlashAttention runs 2-4x faster & requires 5-20x less memory than PyTorch standard attention, & scales to seq. length 64K. 1/
  • user avatar
    Transformers power most advances in LLMs, but their core attention layer can’t scale to long context. With @_albertgu, we’re releasing Mamba, an SSM architecture that matches/beats Transformers in language modeling, yet with linear scaling and 5x higher inference throughput. 1/
    Quadratic attention has been indispensable for information-dense modalities such as language... until now. Announcing Mamba: a new SSM arch. that has linear-time scaling, ultra long context, and most importantly--outperforms Transformers everywhere we've tried. With @tri_dao 1/
  • user avatar
    State space models and RNNs compress history into a constant size state, while attn has KV cache scaling linearly in seqlen. We can instead start from RNNs and let the state size grow logarithmically with seqlen. Feels like a sweet spot. Also beautiful connection to classical
    We know Attention and its linear-time variants, such as linear attention and State Space Models. But what lies in between? Introducing Log-Linear Attention with: - Log-linear time training - Log-time inference (in both time and memory) - Hardware-efficient Triton kernels
  • user avatar
    One way to tell that the AI-written kernel is wrong without even reading the code is that it's way too fast: ~1800 TFLOPS of FP32 on H100, 30x the theoretical max! If your verifier (correctness check) is even slightly wrong the model will reward-hack its way to crazy numbers
  • user avatar
    They’ve finally done it. They got rid of tokenizers!
    Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data
  • user avatar
    With @_albertgu, we’ve built a rich theoretical framework of state-space duality, showing that many linear attn variants and SSMs are equivalent! The resulting model, Mamba-2 is better & faster than Mamba-1, and still matching strong Transformer arch on language modeling. 1/
  • user avatar
    Announcing Flash-Decoding, to make long-context LLM inference up to 8x faster! Great collab with @d_haziza, @fvsmassa and Grigory Sizov. Main idea: load the KV cache in parallel as fast as possible, then separately rescale to combine the results. 1/7
  • user avatar
    We're releasing an optimized implementation of GPT2/GPT3 with FlashAttention🚀! This trains 3-5x faster than the Huggingface version, reaching up to 189 TFLOPs/sec per A100, 60.6% (model) FLOPs util of the theoretical maximum. 1/6 github.com/HazyResearch/f…
  • user avatar
    ML algorithm design + systems optimization is the way!
    🚀 Introducing NSA: A Hardware-Aligned and Natively Trainable Sparse Attention mechanism for ultra-fast long-context training & inference! Core components of NSA: • Dynamic hierarchical sparse strategy • Coarse-grained token compression • Fine-grained token selection 💡 With
  • user avatar
    Love that DeepSeek is building on FlashAttention-3 code, this is why OSS can move so fast ❤️ FA3 recently enabled MLA as well, thanks to my student @tedzadouri. If you want MLA prefill & decode with full features (arbitrary page size, sliding window, rotary...), check out FA3!
    🚀 Day 1 of #OpenSourceWeek: FlashMLA Honored to share FlashMLA - our efficient MLA decoding kernel for Hopper GPUs, optimized for variable-length sequences and now in production. ✅ BF16 support ✅ Paged KV cache (block size 64) ⚡ 3000 GB/s memory-bound & 580 TFLOPS
  • user avatar
    Crazy that we now have an open source model with 13B params that’s competitive w/ o1. And Mamba layers help bring much higher inference throughput.
    🚀 Introducing Hunyuan-A13B, our latest open-source LLM. As an MoE model, it leverages 80B total parameters with just 13B active, delivering powerful performance that scores on par with o1 and DeepSeek across multiple mainstream benchmarks. Hunyuan-A13B features a hybrid
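
The Flash-Decoding post above gives the main idea as: load the KV cache in parallel, then separately rescale to combine the results. A minimal CPU sketch of that combine step follows, for a single query and a single head in pure Python. It is not the released kernel. The function names `attend_chunk` and `split_kv_attention`, the `chunk_size` parameter, and the standard 1/sqrt(d) score scaling are assumptions made for illustration. Each chunk returns its own softmax-weighted output plus a log-sum-exp, and the final answer reweights the chunk outputs by those log-sum-exp values.

```python
import math


def attend_chunk(q, keys, values):
    """Softmax attention of one query over one chunk of the KV cache.

    Returns the chunk-local output vector and the log-sum-exp of the
    chunk's scores, which is what the combine step needs for rescaling.
    """
    scale = 1.0 / math.sqrt(len(q))  # assumed standard attention scaling
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [
        sum(w * v[d] for w, v in zip(weights, values)) / total
        for d in range(dim)
    ]
    return out, m + math.log(total)


def split_kv_attention(q, keys, values, chunk_size):
    """Hypothetical split-KV attention: chunks are independent, then combined."""
    # Step 1: each chunk could be processed in parallel; here it is a loop.
    partials = [
        attend_chunk(q, keys[i:i + chunk_size], values[i:i + chunk_size])
        for i in range(0, len(keys), chunk_size)
    ]
    # Step 2: rescale each chunk output by exp(lse_chunk - lse_max) and
    # normalize, which reproduces softmax over the full sequence.
    lse_max = max(lse for _, lse in partials)
    coeffs = [math.exp(lse - lse_max) for _, lse in partials]
    norm = sum(coeffs)
    dim = len(partials[0][0])
    return [
        sum(c * out[d] for c, (out, _) in zip(coeffs, partials)) / norm
        for d in range(dim)
    ]
```

Calling `split_kv_attention(q, keys, values, chunk_size=len(keys))` reduces to ordinary attention over one chunk, so it can serve as a reference when checking smaller chunk sizes. The rescaling is exact up to floating-point rounding, so the split introduces no approximation.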