Transformers are arguably the most impactful deep learning architecture of the last five years.
In the next few threads, we’ll cover multi-head attention, GPT and BERT, and the Vision Transformer, and write each of them out in code. This thread → understanding multi-head attention.
1/n
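As a rough illustration of what multi-head attention computes, here is a minimal NumPy sketch of the standard scaled dot-product formulation. It is not the code from this thread: the function name `multi_head_attention`, the weight names `w_q`, `w_k`, `w_v`, `w_o`, and the toy shapes are all assumptions chosen for the example, and biases, masking, and dropout are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    # x: (batch, seq_len, d_model); all weights: (d_model, d_model).
    # Names and shapes are illustrative assumptions, not the thread's code.
    batch, seq_len, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must divide evenly across heads"
    head_dim = d_model // num_heads

    def split_heads(t):
        # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, head_dim)
        return t.reshape(batch, seq_len, num_heads, head_dim).transpose(0, 2, 1, 3)

    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # Each head attends independently: scores are (batch, heads, seq, seq).
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(head_dim)
    weights = softmax(scores, axis=-1)
    heads = weights @ v  # (batch, heads, seq, head_dim)

    # Concatenate the heads back into d_model and mix them with w_o.
    merged = heads.transpose(0, 2, 1, 3).reshape(batch, seq_len, d_model)
    return merged @ w_o, weights

rng = np.random.default_rng(0)
batch, seq_len, d_model, num_heads = 2, 5, 16, 4
x = rng.normal(size=(batch, seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape, weights.shape)
```

The output keeps the input shape, and each row of every head's attention matrix sums to 1, which is a quick sanity check when writing this by hand.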
Misha Laskin
co-founder, ceo @reflection_ai
- Today I’m launching @reflection_ai with my friend and co-founder @real_ioannis. Our team pioneered major advances in RL and LLMs, including AlphaGo and Gemini. At Reflection, we're building superintelligent autonomous systems. Starting with autonomous coding.
- Engineers spend 70% of their time understanding code, not writing it. That’s why we built Asimov at @reflection_ai. The best-in-class code research agent, built for teams and organizations.
- In our new work - Algorithm Distillation - we show that transformers can improve themselves autonomously through trial and error without ever updating their weights. No prompting, no finetuning. A single transformer collects its own data and maximizes rewards on new tasks. 1/N
- Patch extraction is a fundamental operation in deep learning, especially for computer vision. By the end of this thread, you’ll know how to implement an efficient vectorized patch extractor (no for loops) in a few lines of code and learn about memory allocation in numpy. 1/n
- GPT has been a core part of the unsupervised learning revolution that’s been happening in NLP. In part 2 of the transformer series, we’ll build GPT from the ground up. This thread → masked causal self-attention, the transformer block, tokenization & position encoding. 1/N
- Einops are pretty magical. For example, with einops you can implement max pooling in 2 lines of code. Patches → set size of patch, decompose HW dims in rearrange as (num_patches * size), specify output dim. Pooling → pick out maximum over each patch. That is all.
- Starting a blog about the engineering + scientific ideas behind training large models (e.g. transformers). First post covers data parallelism, a simple and common technique for parallelizing computation across multiple devices. mishalaskin.com/posts/data_par… 1/N
- New paper led by @astooke w/ @kimin_le2 & @pabbeel - Decoupling Representation Learning from RL. First time RL trained on unsupervised features matches (or beats) end-to-end RL! Paper: arxiv.org/abs/2009.08319 Code: github.com/astooke/rlpyt/… Site: mishalaskin.github.io/atc/ [1/N]
- How much memory do you need to train deep neural networks? You may find the answer to be counterintuitive. For example, suppose we're training a 4 megabyte MLP with batch_size = hidden_dim. How much memory do we need? 4MB? No - we need 8MB! Here's why... 1/N
- Excited to share that I've joined @DeepMind, and grateful for the opportunity to work at the frontier of RL research. Thank you @pabbeel and all of my collaborators for an incredible two years at Berkeley.
- We are bringing the open model frontier back to the US to build a thriving AI ecosystem globally. Thankful for the support of our investors including NVIDIA, Disruptive, DST, 1789, B Capital, Lightspeed, GIC, Eric Yuan, Eric Schmidt, Citi, Sequoia, CRV, and others. Today we're sharing the next phase of Reflection. We're building frontier open intelligence accessible to all. We've assembled an extraordinary AI team, built a frontier LLM training stack, and raised $2 billion. Why Open Intelligence Matters Technological and scientific
- Building on parts 1 & 2, which explained multi-head attention and GPT, in part 3 of the Transformer Series we'll cover masked language models like BERT. This thread → masked language models, the difference between causal and bi-directional masked attention, finetuning, and code. 1/N
- Ever gotten tired of seeing the same architecture in deep RL ever since DeepMind's Atari-DQN, and wanted to see more papers that explore helpful changes? Check out our latest work FLARE, which replaces frame-stacking. 📝 bit.ly/3s4J1il 💻 bit.ly/3bpHM7D 1/N
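
Two of the posts above describe vectorized patch extraction and max pooling as "pick out the maximum over each patch". A minimal sketch of that idea follows. It uses plain NumPy `reshape`/`transpose` rather than einops, and it is not the code from those threads; the function names `extract_patches` and `max_pool`, the channels-last layout, and the non-overlapping patches are all assumptions made for the example.

```python
import numpy as np

def extract_patches(images, size):
    # images: (batch, height, width, channels); channels-last is an assumption.
    # Splits H and W into (num_patches, size) without any Python loops.
    b, h, w, c = images.shape
    assert h % size == 0 and w % size == 0, "patch size must divide H and W"
    nh, nw = h // size, w // size
    x = images.reshape(b, nh, size, nw, size, c)
    # Reorder to (batch, patch_row, patch_col, size, size, channels).
    return x.transpose(0, 1, 3, 2, 4, 5)

def max_pool(images, size):
    # Max pooling = maximum over the two within-patch axes.
    return extract_patches(images, size).max(axis=(3, 4))

# Toy input: two 4x4 single-channel images holding the values 0..31.
images = np.arange(2 * 4 * 4 * 1, dtype=np.float32).reshape(2, 4, 4, 1)

patches = extract_patches(images, 2)
pooled = max_pool(images, 2)

print(patches.shape)        # (2, 2, 2, 2, 2, 1)
print(pooled[0, :, :, 0])   # [[ 5.  7.] [13. 15.]]
print(np.shares_memory(images, patches))  # True: no copy was made
```

For a contiguous input, the `reshape` and `transpose` here only change how the existing buffer is indexed, so the patches are a view onto the original array; the copy happens only when `max` produces the pooled result.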










