Very excited to finally release our paper for OpenThoughts!
After DataComp and DCLM, this is the third large open dataset my group has been building in collaboration with the DataComp community. This time, the focus is on post-training, specifically reasoning data.
Ludwig Schmidt
- Super excited to finally release DataComp! There is still a lot we don't understand about Internet-scale datasets. DataComp makes research on datasets more accessible and leads to better training sets. The results so far are very encouraging and there is much more to explore!
  Quoted post: Introducing DataComp, a new benchmark for multimodal datasets! We release 12.8B image-text pairs, 300+ experiments and a 1.4B subset that outcompetes compute-matched CLIP runs from OpenAI & LAION 📜 arxiv.org/abs/2304.14108 🖥️ github.com/mlfoundations/… 🌐 datacomp.ai
- I'm a big fan of the approach to research funding @andykonwinski and the Laude team are taking! Working with them on terminal-bench has been fantastic (thanks @alexgshaw!) and I'm excited that they're going to support more open, impact-oriented research.
  Quoted post: Today, I’m launching a deeply personal project. I’m betting $100M that we can help computer scientists create more upside impact for humanity. Built for and by researchers, including @JeffDean & @jpineau1 on the board, @LaudeInstitute catalyzes research with real-world impact.
- Very excited about our new agent benchmark! I think it's a nice way of evaluating how well agents can do complex tasks in terminal (command line) environments.
  Quoted post: Many agents (Claude Code, Codex CLI) interact with the terminal to do valuable tasks, but do they currently work well enough to deploy en masse? We’re excited to introduce Terminal-Bench: An evaluation environment and benchmark for AI agents on real-world terminal tasks. Tl;dr
- Very excited about this! DCLM already led to a great training set for language models, and there is (much) more to understand + more room for improvement here.
  Quoted post: I am really excited to introduce DataComp for Language Models (DCLM), our new testbed for controlled dataset experiments aimed at improving language models. 1/x
- Very excited about this!
  Quoted post: Announcing the Open Thoughts project. We are building the best reasoning datasets out in the open. Building off our work with Stratos, today we are releasing OpenThoughts-114k and OpenThinker-7B.
- If you are working on empirical phenomena in deep learning, consider submitting to our ICML workshop "Identifying and Understanding Deep Learning Phenomena" (deep-phenomena.org). The deadline is May 5, but relevant work that was already published elsewhere is still welcome!
- I learned a lot about the nuances of language model scaling laws from this project. Also the checkpoints are available now:
  Quoted post: 🧵1/8 We resolve the discrepancy between the compute optimal scaling laws of Kaplan et al. (exponent 0.88, Figure 14, left) and Hoffmann et al. (“Chinchilla”, exponent 0.5). Paper: arxiv.org/abs/2406.19146 Data + Code: github.com/formll/resolvi…
- Replying to @lschmidt3: Similar to previous DataComp projects, we systematically experiment with every step of the data generation pipeline to build a state-of-the-art training set. Overall we conducted more than 1,000 individual experiments.
- Replying to @lschmidt3: More details on openthoughts.ai/blog/ot3, Ryan’s thread below, and the paper itself arxiv.org/abs/2506.04178
  Quoted post: Announcing OpenThinker3-7B, the new SOTA open-data 7B reasoning model: improving over DeepSeek-R1-Distill-Qwen-7B by 33% on average over code, science, and math evals. We also release our dataset, OpenThoughts3-1.2M, which is the best open reasoning dataset across all data
- Very nice community progress on open-data reasoning models since the R1 release!
  Quoted post: Announcing OpenThinker-32B: the best open-data reasoning model distilled from DeepSeek-R1. Our results show that large, carefully curated datasets with verified R1 annotations produce SoTA reasoning models. Our 32B model outperforms all 32B models including
- Replying to @lschmidt3: Together with the paper, we also release our new dataset OpenThoughts3-1.2M and the corresponding model OpenThinker3-7B, which is currently the best open-data 7B reasoning model.
- Replying to @beenwrekt and @HazyResearch: Congrats! Do you know "Benjamen Recht"? He won the 2017 test of time award nips.cc/Conferences/20… Could be related?