Adam, a 9-year-old optimizer, is the go-to for training LLMs (e.g., GPT-3, OPT, LLaMA).
Introducing Sophia, a new optimizer that is 2x faster than Adam on LLMs. Just a few more lines of code could cut your costs from $2M to $1M (if scaling laws hold).
arxiv.org/abs/2305.14342 🧵⬇️
Tengyu Ma
Assistant prof. @ Stanford; Chief AI Scientist @ MongoDB; Former Co-founder/CEO of Voyage AI
Working on ML, DL, RL, LLMs, and their theory.
- 📢 Introducing Voyage AI @Voyage_AI_! Founded by a talented team of leading AI researchers and me 🚀🚀. We build state-of-the-art embedding models (e.g., better than OpenAI 😜). We also offer custom models that deliver 🎯+10-20% accuracy gain in your LLM products. 🧵
- Pretraining is ≈SoTA for domain adaptation: just do contrastive learning on *all* unlabeled data + finetune on source labels. Features are NOT domain-invariant, but disentangle class & domain info to enable transfer. Theory & exps: arxiv.org/abs/2204.00570 arxiv.org/abs/2204.02683
- RL + CoT works great for DeepSeek-R1 & o1, but: 1️⃣ Linear-in-log scaling in train & test-time compute 2️⃣ Likely bounded by difficulty of training problems Meet STP—a self-play algorithm that conjectures & proves indefinitely, scaling better! 🧠⚡🧵🧵 arxiv.org/abs/2502.00212
- Very honored to be named a 2021 Sloan Fellow. Thanks to all my group members and collaborators for their wonderful work! Thanks for appreciating our work on ML. Check it out on my Twitter homepage or website! #SloanFellow
- If you are interested in training deep models without batchnorm, or why batchnorm can help training, please check out our paper! arXiv link: arxiv.org/abs/1901.09321. Thanks to @ajmooch for the tweet and re-implementation! Quoted post: "Fixup (formerly ZeroInit) by H. Zhang, Y.N. Dauphin, and @tengyuma (ICLR2019): openreview.net/forum?id=H1gsz… They manage to train deep nets (10k layers!) w/o BatchNorm, by careful init scaling & initializing the 2nd residual conv to 0. My @PyTorch impl. here: github.com/ajbrock/Boiler…"
- Why does contrastive learning magically produce linearly separable features? We leverage spectral graph theory to analyze it under realistic settings. (In contrast, many prior works require that positive pairs are independent conditioned on the label.) arxiv.org/abs/2106.04156
- An introductory and short survey on nonconvex optimization for machine learning problems arxiv.org/abs/2103.13462. A chapter of Beyond the Worst-Case Analysis of Algorithms edited by @algo_class.
- A new paper on improving the generalization of deep models (w.r.t clean or robust accuracy) by theory-inspired explicit regularizers.
- Thinking of applying self-supervised learning (SSL) on your uncurated, imbalanced datasets? Good news: we found SSL is more robust to long tails than supervised representations. We also present theoretical and empirical analyses and an improved algorithm. arxiv.org/abs/2110.05025
- WSD learning rate is taking off: lower loss, no pre-set compute budget, & easier continual training. Yet its loss curve is puzzling: high in the stable phase but dropping sharply in the decay phase. Our paper explains it with a 'River Valley' structure of the loss! arxiv.org/abs/2410.05192 🧵🧵
- Our recent paper arxiv.org/abs/1810.05369 that tries to explain why over-parameterized models can even help generalization: bigger models always have larger max-margins, and a weak regularizer + logistic loss can give the max-margin!












