Misha Laskin (@MishaLaskin) / X

885 posts

@MishaLaskin

co-founder, ceo @reflection_ai

NYC

Joined August 2013

Misha Laskin
@MishaLaskin
Jan 7, 2022
Transformers are arguably the most impactful deep learning architecture from the last 5 yrs. In the next few threads, we’ll cover multi-head attention, GPT and BERT, Vision Transformer, and write these out in code. This thread → understanding multi-head attention. 1/n
Misha Laskin
@MishaLaskin
Mar 7, 2025
Today I’m launching @reflection_ai with my friend and co-founder @real_ioannis. Our team pioneered major advances in RL and LLMs, including AlphaGo and Gemini. At Reflection, we're building superintelligent autonomous systems. Starting with autonomous coding.
Misha Laskin
@MishaLaskin
Jul 16, 2025
Engineers spend 70% of their time understanding code, not writing it. That’s why we built Asimov at @reflection_ai. The best-in-class code research agent, built for teams and organizations.
Misha Laskin
@MishaLaskin
Oct 26, 2022
In our new work - Algorithm Distillation - we show that transformers can improve themselves autonomously through trial and error without ever updating their weights. No prompting, no finetuning. A single transformer collects its own data and maximizes rewards on new tasks. 1/N
Misha Laskin
@MishaLaskin
Jan 4, 2022
Patch extraction is a fundamental operation in deep learning, especially for computer vision. By the end of this thread, you’ll know how to implement an efficient vectorized patch extractor (no for loops) in a few lines of code and learn about memory allocation in numpy. 1/n
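A minimal sketch of loop-free patch extraction, assuming a 2D image and NumPy 1.20 or later. It uses `numpy.lib.stride_tricks.sliding_window_view`, which may not be the approach the thread itself takes; the function name `extract_patches` and its parameters are hypothetical, chosen for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def extract_patches(image, patch, stride):
    """Return a (rows, cols, patch, patch) view of square patches.

    Hypothetical helper: no Python loops and no copy of the pixel data.
    """
    # Every patch x patch window, as a read-only strided view of `image`.
    windows = sliding_window_view(image, (patch, patch))
    # Keep every `stride`-th window along each axis (still a view).
    return windows[::stride, ::stride]


image = np.arange(16).reshape(4, 4)
patches = extract_patches(image, patch=2, stride=2)
```

Because the result is a view, `np.shares_memory(patches, image)` is true: no new pixel buffer is allocated until the patches are copied or modified downstream.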
Misha Laskin
@MishaLaskin
Jan 13, 2022
GPT has been a core part of the unsupervised learning revolution that’s been happening in NLP. In part 2 of the transformer series, we’ll build GPT from the ground up. This thread → masked causal self-attention, the transformer block, tokenization & position encoding. 1/N
Misha Laskin
@MishaLaskin
Jan 19, 2022
Einops are pretty magical. For example, with einops you can implement max pooling in 2 lines of code. Patches → set size of patch, decompose HW dims in rearrange as (num_patches * size), specify output dim. Pooling → pick out maximum over each patch. That is all.
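The two steps described above can be sketched as follows. To stay dependency-free, this version spells out the decomposition with plain NumPy reshapes and shows the corresponding einops pattern in a comment; the function name `max_pool2d` and the `(batch, channels, height, width)` layout are assumptions for illustration.

```python
import numpy as np


def max_pool2d(x, size):
    """Max pooling over non-overlapping size x size patches.

    x has shape (batch, channels, height, width); height and width
    are assumed to be divisible by `size`.
    """
    b, c, h, w = x.shape
    assert h % size == 0 and w % size == 0
    # Patches: split H and W into (num_patches, size). With einops this is
    # rearrange(x, "b c (h p1) (w p2) -> b c h w (p1 p2)", p1=size, p2=size)
    patches = x.reshape(b, c, h // size, size, w // size, size)
    patches = patches.transpose(0, 1, 2, 4, 3, 5)
    patches = patches.reshape(b, c, h // size, w // size, size * size)
    # Pooling: pick out the maximum over each flattened patch.
    return patches.max(axis=-1)


x = np.arange(16, dtype=np.float32).reshape(1, 1, 4, 4)
pooled = max_pool2d(x, size=2)
```

For the 4 x 4 input above, `pooled` has shape `(1, 1, 2, 2)` and holds the largest value of each 2 x 2 block.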
Misha Laskin
@MishaLaskin
Feb 23, 2023
Starting a blog about the engineering + scientific ideas behind training large models (e.g. transformers). First post covers data parallelism, a simple and common technique for parallelizing computation across multiple devices. mishalaskin.com/posts/data_par… 1/N
Misha Laskin
@MishaLaskin
Sep 18, 2020
New paper led by @astooke w/ @kimin_le2 & @pabbeel - Decoupling Representation Learning from RL. First time RL trained on unsupervised features matches (or beats) end-to-end RL! Paper: arxiv.org/abs/2009.08319 Code: github.com/astooke/rlpyt/… Site: mishalaskin.github.io/atc/ [1/N]
Misha Laskin
@MishaLaskin
Jul 11, 2022
How much memory do you need to train deep neural networks? You may find the answer counterintuitive. For example, suppose we're training a 4 megabyte MLP with batch_size = hidden_dim. How much memory do we need? 4MB? No - we need 8MB! Here's why... 1/N
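One accounting that reproduces the 8MB figure is sketched below. It is an assumption, not the thread's stated derivation: it supposes a single square float32 weight matrix and counts the input activations that must be kept for the backward pass. The value `hidden_dim = 1024` is chosen only so that the weights come to 4MB.

```python
BYTES_PER_FLOAT32 = 4

# Assumed sizes: a single hidden_dim x hidden_dim layer in float32.
hidden_dim = 1024
batch_size = hidden_dim  # the condition stated in the post

# Weights: hidden_dim x hidden_dim parameters.
weight_bytes = hidden_dim * hidden_dim * BYTES_PER_FLOAT32
# Activations: a batch_size x hidden_dim input kept for the backward pass.
activation_bytes = batch_size * hidden_dim * BYTES_PER_FLOAT32

weight_mb = weight_bytes / 2**20
total_mb = (weight_bytes + activation_bytes) / 2**20
```

Under these assumptions the stored activations take as much space as the weights themselves, which doubles the footprint.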
Misha Laskin
@MishaLaskin
Feb 10, 2022
Excited to share that I've joined @DeepMind, and grateful for the opportunity to work at the frontier of RL research. Thank you @pabbeel and all of my collaborators for an incredible two years at Berkeley.
Misha Laskin
@MishaLaskin
Oct 9, 2025
We are bringing the open model frontier back to the US to build a thriving AI ecosystem globally. Thankful for the support of our investors including NVIDIA, Disruptive, DST, 1789, B Capital, Lightspeed, GIC, Eric Yuan, Eric Schmidt, Citi, Sequoia, CRV, and others.
Reflection
@reflection_ai
Oct 9, 2025
Today we're sharing the next phase of Reflection. We're building frontier open intelligence accessible to all. We've assembled an extraordinary AI team, built a frontier LLM training stack, and raised $2 billion. Why Open Intelligence Matters Technological and scientific
Misha Laskin
@MishaLaskin
Jan 18, 2022
Building on parts 1 & 2 which explained multi-head attention and GPT, in part 3 of the Transformer Series we'll cover masked language models like BERT. This thread → masked language models, diff between causal and bi-directional masked attention, finetuning, and code. 1/N
Misha Laskin
@MishaLaskin
Jan 11, 2021
Ever gotten tired of seeing the same architecture in deep RL ever since DeepMind's Atari-DQN, and wanted to see more papers that explore helpful changes? Check out our latest work FLARE, which replaces frame-stacking. 📝 bit.ly/3s4J1il 💻 bit.ly/3bpHM7D 1/N