Misha Laskin
@MishaLaskin
co-founder, ceo @reflection_ai
NYC
Joined August 2013
Posts
  • Transformers are arguably the most impactful deep learning architecture from the last 5 yrs. In the next few threads, we’ll cover multi-head attention, GPT and BERT, Vision Transformer, and write these out in code. This thread → understanding multi-head attention. 1/n
  • Today I’m launching @reflection_ai with my friend and co-founder @real_ioannis. Our team pioneered major advances in RL and LLMs, including AlphaGo and Gemini. At Reflection, we're building superintelligent autonomous systems. Starting with autonomous coding.
  • Engineers spend 70% of their time understanding code, not writing it. That’s why we built Asimov at @reflection_ai. The best-in-class code research agent, built for teams and organizations.
  • In our new work - Algorithm Distillation - we show that transformers can improve themselves autonomously through trial and error without ever updating their weights. No prompting, no finetuning. A single transformer collects its own data and maximizes rewards on new tasks. 1/N
  • Patch extraction is a fundamental operation in deep learning, especially for computer vision. By the end of this thread, you’ll know how to implement an efficient vectorized patch extractor (no for loops) in a few lines of code and learn about memory allocation in numpy. 1/n
  • GPT has been a core part of the unsupervised learning revolution that’s been happening in NLP. In part 2 of the transformer series, we’ll build GPT from the ground up. This thread → masked causal self-attention, the transformer block, tokenization & position encoding. 1/N
  • Einops are pretty magical. For example, with einops you can implement max pooling in 2 lines of code. Patches → set size of patch, decompose HW dims in rearrange as (num_patches * size), specify output dim. Pooling → pick out maximum over each patch. That is all.
  • Starting a blog about the engineering + scientific ideas behind training large models (e.g. transformers). First post covers data parallelism, a simple and common technique for parallelizing computation across multiple devices. mishalaskin.com/posts/data_par… 1/N
  • New paper led by @astooke w/ @kimin_le2 & @pabbeel - Decoupling Representation Learning from RL. First time RL trained on unsupervised features matches (or beats) end-to-end RL! Paper: arxiv.org/abs/2009.08319 Code: github.com/astooke/rlpyt/… Site: mishalaskin.github.io/atc/ [1/N]
  • How much memory do you need to train deep neural networks? You may find the answer to be counterintuitive. For example, suppose we're training a 4 megabyte MLP with batch_size = hidden_dim. How much memory do we need? 4MB? No - we need 8MB! Here's why... 1/N
  • Excited to share that I've joined @DeepMind, and grateful for the opportunity to work at the frontier of RL research. Thank you @pabbeel and all of my collaborators for an incredible two years at Berkeley.
  • We are bringing the open model frontier back to the US to build a thriving AI ecosystem globally. Thankful for the support of our investors including NVIDIA, Disruptive, DST, 1789, B Capital, Lightspeed, GIC, Eric Yuan, Eric Schmidt, Citi, Sequoia, CRV, and others.
    Today we're sharing the next phase of Reflection. We're building frontier open intelligence accessible to all. We've assembled an extraordinary AI team, built a frontier LLM training stack, and raised $2 billion. Why Open Intelligence Matters Technological and scientific
  • Building on parts 1 & 2 which explained multi-head attention and GPT, in part 3 of the Transformer Series we'll cover masked language models like BERT. This thread → masked language models, diff between causal and bi-directional masked attention, finetuning, and code. 1/N
  • Ever gotten tired of seeing the same architecture in deep RL ever since DeepMind's Atari-DQN, and wanted to see more papers that explore helpful changes? Check out our latest work FLARE, which replaces frame-stacking. 📝 bit.ly/3s4J1il 💻 bit.ly/3bpHM7D 1/N
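
The patch extraction and max pooling posts above describe the same two-step recipe: decompose the height and width dims into (num_patches * size), then take the maximum over each patch. Here is a minimal sketch of that idea. It is not the code from either thread: it uses plain NumPy reshapes instead of einops, it only handles non-overlapping patches, and the names `extract_patches` and `max_pool` are hypothetical. With einops, the same pooling can likely be written as `reduce(x, '(h p1) (w p2) -> h w', 'max', p1=size, p2=size)`.

```python
import numpy as np


def extract_patches(x, size):
    """Split an (H, W) array into non-overlapping (size x size) patches.

    Returns an array of shape (H // size, W // size, size * size).
    Hypothetical helper for illustration; assumes H and W divide by size.
    """
    h, w = x.shape
    assert h % size == 0 and w % size == 0, "H and W must be divisible by size"
    # Decompose H as (num_patches_h * size) and W as (num_patches_w * size).
    x = x.reshape(h // size, size, w // size, size)
    # Move the two patch-index axes to the front, then flatten each patch.
    return x.transpose(0, 2, 1, 3).reshape(h // size, w // size, size * size)


def max_pool(x, size):
    """Max pooling: pick out the maximum over each patch."""
    return extract_patches(x, size).max(axis=-1)


image = np.arange(16).reshape(4, 4)
patches = extract_patches(image, 2)
pooled = max_pool(image, 2)
print(pooled)
```

No Python for loops are involved: the reshape and transpose only reinterpret the array's axes, and the final flattening reshape copies the data once.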