Tengyu Ma
@tengyuma
Assistant prof. @ Stanford; Chief AI Scientist @ MongoDB; Former Co-founder/CEO of Voyage AI. Working on ML, DL, RL, LLMs, and their theory.
Palo Alto, CA
Joined June 2011
Posts
  • Adam, a 9-yr old optimizer, is the go-to for training LLMs (eg, GPT-3, OPT, LLAMA). Introducing Sophia, a new optimizer that is 2x faster than Adam on LLMs. Just a few more lines of code could cut your costs from $2M to $1M (if scaling laws hold). arxiv.org/abs/2305.14342 🧵⬇️
  • 📢 Introducing Voyage AI @Voyage_AI_! Founded by a talented team of leading AI researchers and me 🚀🚀. We build state-of-the-art embedding models (e.g., better than OpenAI 😜). We also offer custom models that deliver 🎯+10-20% accuracy gain in your LLM products. 🧵
  • Pretraining is ≈SoTA for domain adaptation: just do contrastive learning on *all* unlabeled data + finetune on source labels. Features are NOT domain-invariant, but disentangle class & domain info to enable transfer. Theory & exps: arxiv.org/abs/2204.00570 arxiv.org/abs/2204.02683
  • RL + CoT works great for DeepSeek-R1 & o1, but: 1️⃣ Linear-in-log scaling in train & test-time compute 2️⃣ Likely bounded by difficulty of training problems. Meet STP: a self-play algorithm that conjectures & proves indefinitely, scaling better! 🧠⚡🧵🧵 arxiv.org/abs/2502.00212
  • Very honored to be named a 2021 Sloan Fellow. Thanks to all my group members and collaborators for their wonderful work! Thanks for appreciating our work on ML. Check it out on my Twitter homepage or website! #SloanFellow
  • Releasing the code of Sophia 😀, a new optimizer (⬇️). code: github.com/Liuhong99/Soph…
    Adam, a 9-yr old optimizer, is the go-to for training LLMs (eg, GPT-3, OPT, LLAMA). Introducing Sophia, a new optimizer that is 2x faster than Adam on LLMs. Just a few more lines of code could cut your costs from $2M to $1M (if scaling laws hold). arxiv.org/abs/2305.14342 🧵⬇️
  • If you are interested in training deep models without batchnorm, or in why batchnorm can help training, please check out our paper! arXiv link: arxiv.org/abs/1901.09321. Thanks to @ajmooch for the tweet and re-implementation!
  • Why does contrastive learning magically produce linearly separable features? We leverage spectral graph theory to analyze it under realistic settings. (In contrast, many prior works require that positive pairs are independent conditioned on the label.) arxiv.org/abs/2106.04156
  • A short introductory survey on nonconvex optimization for machine learning problems: arxiv.org/abs/2103.13462. A chapter of Beyond the Worst-Case Analysis of Algorithms, edited by @algo_class.
  • A new paper on improving the generalization of deep models (w.r.t. clean or robust accuracy) by theory-inspired explicit regularizers.
  • Thinking of applying self-supervised learning (SSL) on your uncurated, imbalanced datasets? Good news: we found SSL is more robust to long tails than supervised representations. We also present theoretical and empirical analyses and an improved algorithm. arxiv.org/abs/2110.05025
  • We joined @MongoDB! @VoyageAI’s best-in-class embedding models and rerankers will be part of MongoDB’s best-in-class database, powering mission-critical AI applications with high-quality semantic retrieval capability. A huge thank you to everyone with us on this journey, and to
  • The WSD (warmup-stable-decay) learning rate schedule is taking off: lower loss, no pre-set compute budget, & easier continual training. Yet its loss curve is puzzling: high in the stable phase but dropping sharply in the decay phase. Our paper explains it with a 'River Valley' structure of the loss! arxiv.org/abs/2410.05192 🧵🧵
  • Our recent paper arxiv.org/abs/1810.05369 tries to explain why over-parameterized models can even help generalization: bigger models always have larger max-margins, and a weak regularizer + logistic loss can give the max-margin!