Sang Michael Xie
403 posts
Sang Michael Xie
@sangmichaelxie
self-improving @OpenAI. Prev: PhD @StanfordAILab, @GoogleAI Brain/DeepMind, @Meta GenAI
San Francisco, CA
cs.stanford.edu/~eix
Joined May 2019
781 Following · 4,410 Followers
  • Pinned
    Sang Michael Xie
    @sangmichaelxie
    Feb 4
    Excited to release PrefixRL, where we achieved what I thought to be a contradiction - learning from off-policy data with purely on-policy updates. This avoids all the instabilities of off-policy RL. I think this will let us reuse previous RL and sampling FLOPs much more
    24K
  • Sang Michael Xie
    @sangmichaelxie
    May 18, 2023
    Should LMs train on more books, news, or web data? Introducing DoReMi🎶, which optimizes the data mixture with a small 280M model. Our data mixture makes 8B Pile models train 2.6x faster, get +6.5% few-shot acc, and get lower pplx on *all* domains! 🧵⬇️ arxiv.org/abs/2305.10429
    241K
  • Sang Michael Xie
    @sangmichaelxie
    Nov 16, 2021
    Why can GPT3 magically learn tasks? It just reads a few examples, without any parameter updates or explicitly being trained to learn. We prove that this in-context learning can emerge from modeling long-range coherence in the pretraining data! arxiv.org/abs/2111.02080 (1/n)
  • Sang Michael Xie
    @sangmichaelxie
    Mar 20, 2023
    I've gotten some requests about the "building language models" project from last year's Stanford Large Language Models class, so we're releasing it: github.com/sangmichaelxie… The task is to finetune LMs to give them new capabilities/properties, similarly to Toolformer and Alpaca.
    59K
  • Sang Michael Xie
    @sangmichaelxie
    Feb 8, 2023
    Data selection for LMs (GPT-3, PaLM) is done with heuristics that select data by training a classifier for high-quality text. Can we do better? Turns out we can boost downstream GLUE acc by 2+% by adapting the classic importance resampling algorithm. arxiv.org/abs/2302.03169 🧵
    82K
  • Sang Michael Xie
    @sangmichaelxie
    Sep 14, 2023
    Releasing an open-source PyTorch implementation of DoReMi! github.com/sangmichaelxie… The pretraining data mixture is a secret sauce of LLM training. Optimizing your data mixture for robust learning with DoReMi can reduce training time by 2-3x. Train smarter, not longer!
    68K
  • Sang Michael Xie
    @sangmichaelxie
    Aug 2, 2022
    How can large language models (LMs) do tasks even when given random labels? While traditional supervised learning would fail, viewing in-context learning (ICL) as Bayesian inference explains how this can work! Blog post with @sewon__min:
    How does in-context learning work? A framework for understanding the differences from traditional...
    From ai.stanford.edu
  • Sang Michael Xie
    @sangmichaelxie
    Dec 11, 2023
    I’m presenting 2 papers at #NeurIPS2023 on data-centric ML for large language models:
    DSIR (targeted data selection): Wed Dec 13 @ 5pm
    DoReMi (pretraining data mixtures): Thu Dec 14 @ 10:45am
    Excited to chat about large language models, data, pretraining/adaptation, and more!
    36K
  • Sang Michael Xie
    @sangmichaelxie
    Nov 8, 2023
    DSIR is a fast trillion-token-scale data selection tool for LLMs that’s been used on the Pile/RefinedWeb/C4/CCNet/RedPajama/etc.
    ⚡️Select 100M documents from the full Pile in just 4.5 hours with 1 CPU node
    Now on PyPI: pip install data-selection
    Just 4 lines of code:
    36K
  • Sang Michael Xie
    @sangmichaelxie
    Dec 18, 2020
    🍔🍟"In-N-Out: Pre-Training and Self-Training using Auxiliary Information for Out-of-Distribution Robustness" arxiv.org/abs/2012.04550 Real-world tasks (crop yield prediction from satellites) are often label-scarce. Only some countries have labels - how do we generalize globally?
  • Sang Michael Xie
    @sangmichaelxie
    Jul 21, 2021
    Fine-tuning destroys some pre-trained info. Freezing parameters *preserves* it and *simplifies* the learning problem -> better ID and OOD accuracy. Excited to present Composed Fine-Tuning as a long talk at #ICML2021! Paper: arxiv.org/abs/2006.16205 Talk: bit.ly/36Sf5wA
  • Sang Michael Xie
    @sangmichaelxie
    Jan 9, 2024
    The 2nd ME-FoMo workshop on understanding foundation models will be at ICLR 2024 in Vienna! Topics include pretraining (data, archs), adaptation (instruct tuning, alignment), and emergence.
    Paper ddl: Feb 3
    Website: sites.google.com/view/me-fomo20…
    OpenReview: openreview.net/group?id=ICLR.…
    50K
  • Sang Michael Xie
    @sangmichaelxie
    Aug 11, 2020
    (1/n) Persistent neck pain and how to fix it: A thread for computer people 💻👇 Neck pain can stick around forever if you use the computer every day, especially in suboptimal WFH setups. Here's my story and a highly effective exercise that worked for me:
  • Sang Michael Xie
    @sangmichaelxie
    Jun 30, 2020
    Simplifying Models with Unlabeled Output Data arxiv.org/abs/2006.16205… Joint w/ @tengyuma @percyliang Can “unlabeled” outputs help in semi-supervised learning? In problems with rich output spaces like code, images, or molecules, unlabeled outputs help with modeling valid outputs.
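
The Feb 8, 2023 post above mentions adapting the classic importance resampling algorithm for data selection. As a rough illustration of that general idea only, here is a minimal sketch. Everything in it is an assumption made for the example: the smoothed unigram models, the Gumbel top-k trick for sampling without replacement, and the function names are hypothetical. It is not the API of the data-selection package and not necessarily how the paper estimates weights.

```python
import math
import random
from collections import Counter


def unigram_log_probs(docs, vocab):
    """Add-one-smoothed unigram log-probabilities over a fixed vocabulary."""
    counts = Counter(word for doc in docs for word in doc.split())
    total = sum(counts[w] for w in vocab) + len(vocab)
    return {w: math.log((counts[w] + 1) / total) for w in vocab}


def log_importance_weights(raw_docs, target_docs):
    """Return log(p_target(doc) / p_raw(doc)) for every raw document."""
    vocab = {w for doc in raw_docs + target_docs for w in doc.split()}
    log_p = unigram_log_probs(target_docs, vocab)  # target distribution
    log_q = unigram_log_probs(raw_docs, vocab)     # raw (proposal) distribution
    return [sum(log_p[w] - log_q[w] for w in doc.split()) for doc in raw_docs]


def resample(raw_docs, target_docs, k, seed=0):
    """Draw k raw documents without replacement, favoring high importance weights."""
    rng = random.Random(seed)
    log_w = log_importance_weights(raw_docs, target_docs)
    # Gumbel top-k: perturb each log-weight with Gumbel noise, keep the k largest.
    # The clamp keeps the uniform draw strictly positive so both logs are defined.
    keys = [lw - math.log(-math.log(max(rng.random(), 1e-12))) for lw in log_w]
    top = sorted(range(len(raw_docs)), key=lambda i: keys[i], reverse=True)[:k]
    return [raw_docs[i] for i in top]


if __name__ == "__main__":
    raw = ["the cat sat", "stock prices fell", "the dog ran", "market shares rose"]
    target = ["the cat ran", "the dog sat"]
    print(log_importance_weights(raw, target))
    print(resample(raw, target, k=2))
```

Raw documents that look more like the target text receive larger log-weights, so they are more likely, though not guaranteed, to be selected.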
