Tengyu Ma (@tengyuma) / X

Assistant prof. @ Stanford; Chief AI Scientist @ MongoDB; Former Co-founder/CEO of Voyage AI. Working on ML, DL, RL, LLMs, and their theory.

Palo Alto, CA

ai.stanford.edu/~tengyuma

Joined June 2011

Tengyu Ma
@tengyuma
May 24, 2023
Adam, a 9-yr old optimizer, is the go-to for training LLMs (eg, GPT-3, OPT, LLAMA). Introducing Sophia, a new optimizer that is 2x faster than Adam on LLMs. Just a few more lines of code could cut your costs from $2M to $1M (if scaling laws hold). arxiv.org/abs/2305.14342 🧵⬇️
Tengyu Ma
@tengyuma
Oct 30, 2023
📢 Introducing Voyage AI @Voyage_AI_! Founded by a talented team of leading AI researchers and me 🚀🚀. We build state-of-the-art embedding models (e.g., better than OpenAI 😜). We also offer custom models that deliver 🎯+10-20% accuracy gain in your LLM products. 🧵
Tengyu Ma
@tengyuma
Apr 12, 2022
Pretraining is ≈SoTA for domain adaptation: just do contrastive learning on *all* unlabeled data + finetune on source labels. Features are NOT domain-invariant, but disentangle class & domain info to enable transfer. Theory & exps: arxiv.org/abs/2204.00570 arxiv.org/abs/2204.02683
Tengyu Ma
@tengyuma
Feb 4, 2025
RL + CoT works great for DeepSeek-R1 & o1, but: 1️⃣ Linear-in-log scaling in train & test-time compute 2️⃣ Likely bounded by difficulty of training problems Meet STP—a self-play algorithm that conjectures & proves indefinitely, scaling better! 🧠⚡🧵🧵 arxiv.org/abs/2502.00212
Tengyu Ma
@tengyuma
Feb 16, 2021
Very honored to be named a 2021 Sloan Fellow. Thanks to all my group members and collaborators for their wonderful work! Thanks for appreciating our work on ML. Check it out on my Twitter homepage or website! #SloanFellow
Tengyu Ma
@tengyuma
May 25, 2023
Releasing the code of Sophia 😀, a new optimizer (⬇️). code: github.com/Liuhong99/Soph…
Tengyu Ma
@tengyuma
May 24, 2023
Adam, a 9-yr old optimizer, is the go-to for training LLMs (eg, GPT-3, OPT, LLAMA). Introducing Sophia, a new optimizer that is 2x faster than Adam on LLMs. Just a few more lines of code could cut your costs from $2M to $1M (if scaling laws hold). arxiv.org/abs/2305.14342 🧵⬇️
Tengyu Ma
@tengyuma
Feb 3, 2019
If you are interested in training deep models without batchnorm, or why batchnorm can help training, please check out our paper! Arxiv link: arxiv.org/abs/1901.09321. Thanks to @ajmooch for the tweet and re-implementation!
Andy Brock
@ajmooch
Jan 29, 2019
Fixup (formerly ZeroInit) by H. Zhang, Y.N. Dauphin, and @tengyuma (ICLR2019): openreview.net/forum?id=H1gsz… They manage to train deep nets (10k layers!) w/o BatchNorm, by careful init scaling & initializing the 2nd residual conv to 0. My @PyTorch impl. here: github.com/ajbrock/Boiler…
arxiv.org
Fixup Initialization: Residual Learning Without Normalization
Normalization layers are a staple in state-of-the-art deep neural network architectures. They are widely believed to stabilize training, enable higher learning rate, accelerate convergence and...
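The two ingredients named in the post above (careful init scaling, and starting the second residual conv at 0) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names `fixup_scale` and `init_residual_branch` are hypothetical, the rescaling exponent L^(-1/(2m-2)) is taken from the Fixup paper rather than from the post, plain square weight matrices stand in for conv layers, and the scalar bias and multiplier parameters that the paper also adds are left out.

```python
import random


def fixup_scale(num_blocks: int, layers_per_branch: int = 2) -> float:
    """Shrink factor for residual-branch weights: L ** (-1 / (2m - 2)).

    L is the number of residual blocks and m the number of weight layers
    per residual branch (exponent assumed from the Fixup paper).
    """
    if layers_per_branch < 2:
        raise ValueError("need at least 2 layers per residual branch")
    return num_blocks ** (-1.0 / (2 * layers_per_branch - 2))


def init_residual_branch(width, num_blocks, layers_per_branch=2, seed=0):
    """Return weight matrices for one residual branch (hypothetical helper).

    All layers except the last get a He-style Gaussian init shrunk by
    fixup_scale; the last layer starts at exactly zero, so every block
    initially behaves like the identity.
    """
    rng = random.Random(seed)
    std = (2.0 / width) ** 0.5 * fixup_scale(num_blocks, layers_per_branch)
    layers = []
    for _ in range(layers_per_branch - 1):
        layers.append(
            [[rng.gauss(0.0, std) for _ in range(width)] for _ in range(width)]
        )
    # Last layer of the branch is zero-initialized.
    layers.append([[0.0] * width for _ in range(width)])
    return layers


branch = init_residual_branch(width=4, num_blocks=16)
```

With two layers per branch, the first layer's standard deviation shrinks as the inverse square root of the number of blocks, so deeper networks start with proportionally smaller residual updates.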
Tengyu Ma
@tengyuma
Jun 10, 2021
Why does contrastive learning magically produce linearly separable features? We leverage spectral graph theory to analyze it under realistic settings. (In contrast, many prior works require that positive pairs are independent conditioned on the label.) arxiv.org/abs/2106.04156
Tengyu Ma
@tengyuma
Mar 29, 2021
An introductory and short survey on nonconvex optimization for machine learning problems arxiv.org/abs/2103.13462. A chapter of Beyond the Worst-Case Analysis of Algorithms edited by @algo_class.
Tengyu Ma
@tengyuma
Oct 21, 2019
A new paper on improving the generalization of deep models (w.r.t clean or robust accuracy) by theory-inspired explicit regularizers.
arxiv.org
Improved Sample Complexities for Deep Networks and Robust...
For linear classifiers, the relationship between (normalized) output margin and generalization is captured in a clear and simple bound -- a large output margin implies good generalization....
Tengyu Ma
@tengyuma
Oct 13, 2021
Thinking of applying self-supervised learning (SSL) on your uncurated, imbalanced datasets? Good news: we found SSL is more robust to long tails than supervised representations. We also present theoretical and empirical analyses and an improved algorithm. arxiv.org/abs/2110.05025
Tengyu Ma
@tengyuma
Feb 24, 2025
We joined @MongoDB! @VoyageAI’s best-in-class embedding models and rerankers will be part of MongoDB’s best-in-class database, powering mission-critical AI applications with high-quality semantic retrieval capability. A huge thank you to everyone with us on this journey, and to
Tengyu Ma
@tengyuma
Oct 31, 2024
WSD learning rate is taking off: lower loss, no pre-set compute budget, & easier continual training. Yet its loss curve is puzzling: high in the stable phase but dropping sharply in the decay phase. Our paper explains it with a 'River Valley' structure of the loss! arxiv.org/abs/2410.05192 🧵🧵
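For readers unfamiliar with WSD (warmup-stable-decay), the schedule's shape can be sketched as below. This is an illustrative sketch only: the function name `wsd_lr`, its parameters, and the linear warmup and linear decay are assumptions, not details taken from the post or the paper.

```python
def wsd_lr(step, peak_lr, warmup_steps, decay_start, decay_steps, min_lr=0.0):
    """Learning rate at a given step under a warmup-stable-decay schedule.

    Assumes warmup_steps <= decay_start. Linear warmup and linear decay
    are illustrative choices; other decay shapes are possible.
    """
    if step < warmup_steps:
        # Warmup: ramp linearly up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable: hold the peak learning rate for as long as desired.
        return peak_lr
    # Decay: anneal linearly from peak_lr to min_lr over decay_steps.
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress


schedule = [wsd_lr(s, 1.0, 10, 100, 20) for s in (0, 50, 110, 200)]
```

Because the stable phase does not depend on the total number of steps, `decay_start` can be chosen late, or a checkpoint from the stable phase can be resumed and decayed separately, which is one way to read the "no pre-set compute budget" point.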
Tengyu Ma
@tengyuma
Oct 21, 2018
Our recent paper arxiv.org/abs/1810.05369 that tries to explain why over-parameterized models can even help generalization: bigger models always have larger max-margins, and a weak regularizer + logistic loss can give the max-margin!