Sang Michael Xie (@sangmichaelxie) / X

Sang Michael Xie

@sangmichaelxie

self-improving @OpenAI. Prev: PhD @StanfordAILab, @GoogleAI Brain/DeepMind, @Meta GenAI

San Francisco, CA

Joined May 2019

Pinned
Sang Michael Xie
@sangmichaelxie
Feb 4
Excited to release PrefixRL, where we achieved what I thought to be a contradiction - learning from off-policy data with purely on-policy updates. This avoids all the instabilities of off-policy RL. I think this will let us reuse previous RL and sampling FLOPs much more
Sang Michael Xie
@sangmichaelxie
May 18, 2023
Should LMs train on more books, news, or web data? Introducing DoReMi🎶, which optimizes the data mixture with a small 280M model. Our data mixture makes 8B Pile models train 2.6x faster, get +6.5% few-shot acc, and get lower pplx on *all* domains! 🧵⬇️ arxiv.org/abs/2305.10429
Sang Michael Xie
@sangmichaelxie
Nov 16, 2021
Why can GPT3 magically learn tasks? It just reads a few examples, without any parameter updates or explicitly being trained to learn. We prove that this in-context learning can emerge from modeling long-range coherence in the pretraining data! arxiv.org/abs/2111.02080 (1/n)
Sang Michael Xie
@sangmichaelxie
Mar 20, 2023
I've gotten some requests about the "building language models" project from last year's Stanford Large Language Models class, so we're releasing it: github.com/sangmichaelxie… The task is to finetune LMs to give them new capabilities/properties, similarly to Toolformer and Alpaca.
Sang Michael Xie
@sangmichaelxie
Feb 8, 2023
Data selection for LMs (GPT-3, PaLM) is done with heuristics that select data by training a classifier for high-quality text. Can we do better? Turns out we can boost downstream GLUE acc by 2+% by adapting the classic importance resampling algorithm. arxiv.org/abs/2302.03169 🧵
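For readers unfamiliar with importance resampling, here is a minimal sketch of the classic idea the post refers to. It is not the DSIR implementation: the smoothed unigram models, the function names, and the toy documents are all assumptions made for illustration. Each raw document gets a log importance weight, log p_target(x) - log p_raw(x), and k documents are drawn without replacement in proportion to those weights using the Gumbel top-k trick.

```python
import math
import random
from collections import Counter


def unigram_log_probs(docs, vocab, smoothing=1.0):
    """Fit a smoothed unigram model over a fixed vocabulary (illustrative only)."""
    counts = Counter(tok for doc in docs for tok in doc.split())
    total = sum(counts[w] for w in vocab) + smoothing * len(vocab)
    return {w: math.log((counts[w] + smoothing) / total) for w in vocab}


def log_importance_weights(raw_docs, target_docs):
    """Return log p_target(doc) - log p_raw(doc) for every raw document."""
    vocab = {tok for doc in raw_docs + target_docs for tok in doc.split()}
    log_p_target = unigram_log_probs(target_docs, vocab)
    log_p_raw = unigram_log_probs(raw_docs, vocab)
    return [
        sum(log_p_target[t] - log_p_raw[t] for t in doc.split())
        for doc in raw_docs
    ]


def importance_resample(raw_docs, target_docs, k, seed=0):
    """Sample k raw documents without replacement, proportional to their weights."""
    rng = random.Random(seed)
    weights = log_importance_weights(raw_docs, target_docs)
    keyed = []
    for doc, log_w in zip(raw_docs, weights):
        # -log(E) with E ~ Exp(1) is a standard Gumbel sample.
        e = max(rng.expovariate(1.0), 1e-300)
        keyed.append((log_w - math.log(e), doc))
    # Gumbel top-k: the k largest perturbed log-weights form the sample.
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in keyed[:k]]


if __name__ == "__main__":
    # Hypothetical toy data: the target distribution is short pet sentences.
    raw = ["the cat sat", "stock prices fell", "the dog ran", "market shares rose"]
    target = ["the cat ran", "the dog sat"]
    print(importance_resample(raw, target, k=2))
```

Documents that look like the target text get higher weights and are more likely to be selected, but the selection stays random, so a low-weight document can still be picked.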
Sang Michael Xie
@sangmichaelxie
Sep 14, 2023
Releasing an open-source PyTorch implementation of DoReMi! github.com/sangmichaelxie… The pretraining data mixture is a secret sauce of LLM training. Optimizing your data mixture for robust learning with DoReMi can reduce training time by 2-3x. Train smarter, not longer!
Sang Michael Xie
@sangmichaelxie
Aug 2, 2022
How can large language models (LMs) do tasks even when given random labels? While traditional supervised learning would fail, viewing in-context learning (ICL) as Bayesian inference explains how this can work! Blog post with @sewon__min:
How does in-context learning work? A framework for understanding the differences from traditional...
From ai.stanford.edu
Sang Michael Xie
@sangmichaelxie
Dec 11, 2023
I’m presenting 2 papers at #NeurIPS2023 on data-centric ML for large language models:
DSIR (targeted data selection): Wed Dec 13 @ 5pm
DoReMi (pretraining data mixtures): Thu Dec 14 @ 10:45am
Excited to chat about large language models, data, pretraining/adaptation, and more!
Sang Michael Xie
@sangmichaelxie
Nov 8, 2023
DSIR is a fast trillion-token-scale data selection tool for LLMs that’s been used on the Pile/RefinedWeb/C4/CCNet/RedPajama/etc.
⚡️Select 100M documents from the full Pile in just 4.5 hours with 1 CPU node
Now on PyPI: pip install data-selection
Just 4 lines of code:
Sang Michael Xie
@sangmichaelxie
Dec 18, 2020
🍔🍟"In-N-Out: Pre-Training and Self-Training using Auxiliary Information for Out-of-Distribution Robustness" arxiv.org/abs/2012.04550 Real-world tasks (crop yield prediction from satellites) are often label-scarce. Only some countries have labels - how do we generalize globally?
Sang Michael Xie
@sangmichaelxie
Jul 21, 2021
Fine-tuning destroys some pre-trained info. Freezing parameters *preserves* it and *simplifies* the learning problem -> better ID and OOD accuracy. Excited to present Composed Fine-Tuning as a long talk at #ICML2021!
Paper: arxiv.org/abs/2006.16205
Talk: bit.ly/36Sf5wA
Sang Michael Xie
@sangmichaelxie
Jan 9, 2024
The 2nd ME-FoMo workshop on understanding foundation models will be at ICLR 2024 in Vienna! Topics include pretraining (data, archs), adaptation (instruct tuning, alignment), and emergence.
Paper deadline: Feb 3
Website: sites.google.com/view/me-fomo20…
OpenReview: openreview.net/group?id=ICLR.…
Sang Michael Xie
@sangmichaelxie
Aug 11, 2020
(1/n) Persistent neck pain and how to fix it: A thread for computer people 💻👇 Neck pain can stick around forever if you use the computer every day, especially in suboptimal WFH setups. Here's my story and a highly effective exercise that worked for me:
Sang Michael Xie
@sangmichaelxie
Jun 30, 2020
Simplifying Models with Unlabeled Output Data arxiv.org/abs/2006.16205… Joint w/ @tengyuma @percyliang Can “unlabeled” outputs help in semi-supervised learning? In problems with rich output spaces like code, images, or molecules, unlabeled outputs help with modeling valid outputs.