Announcing OpenThinker3-7B, the new SOTA open-data 7B reasoning model: improving over DeepSeek-R1-Distill-Qwen-7B by 33% on average over code, science, and math evals.
We also release our dataset, OpenThoughts3-1.2M, which is the best open reasoning dataset across all data
Ryan Marten
- Announcing the Open Thoughts project. We are building the best reasoning datasets out in the open. Building off our work with Stratos, today we are releasing OpenThoughts-114k and OpenThinker-7B.
- Replying to @ryanmart3n: Paper: arxiv.org/abs/2506.04178 Model: huggingface.co/open-thoughts/… Dataset: huggingface.co/datasets/open-… Code: github.com/open-thoughts/… Blog: openthoughts.ai/blog/ot3 (10/N)
- Replying to @ryanmart3n: Highlight 1. Sampling multiple answers for the same question from a teacher model is a surprisingly effective way to increase the dataset size. Would it be better to have 30k questions, each answered once, or 10k questions, each answered 3 times independently? Surprisingly, with
- Replying to @ryanmart3n: Our model surpasses similar-scale models from industry labs, such as Nvidia, Hugging Face, and GPT-4.1, among others. We achieve SOTA on held-out evals, demonstrating strong generalization. OpenThoughts3-1.2M is built through 1,000 ablation experiments. (2/N)
- Replying to @ryanmart3n: Thank you to the whole OpenThoughts team for yet another great effort! @etash_guha, @ryanmart3n, @sedrickkeh2, @NeginRaoof_, @GeorgeSmyrnis1, @hbXNov, @marnezhurina, @MercatJean, @trungthvu, @ZayneSprague, @suvarna_ashima, @FeuerBenjamin, @cliangyu_, @codezakh, @esfrankel,
- Speaking about OpenThoughts3 (hot off the presses!!) and how you can use our reasoning data recipe lessons to train your own specialized reasoning models. 12:15pm @aiDotEngineer (talk will also be recorded)
- Replying to @ryanmart3n: Highlight 2. Models with better performance are not necessarily better teachers. QwQ-32B is a stronger teacher than DeepSeek-R1, although it scores lower on target reasoning benchmarks. (4/N)
- Replying to @ryanmart3n: Highlight 5. Question filtering did work. Filtering questions by LLM-labeled difficulty or LLM response length yields better results than filters typical of pre-training data curation that use embeddings or fastText. (7/N)
- Replying to @ryanmart3n: Our dataset also works for post-training Llama! OpenThoughts3 is versatile and works across multiple base models. We train Llama-3.1-8B-Instruct on 100k samples of OpenThoughts3-1.2M, and we see similar or even larger gains on downstream evals. (9/N)
- Replying to @ryanmart3n: OpenThoughts3 consists of 850K math, 250K code, and 100K science questions with reasoning traces from QwQ-32B. All completely open! (8/N)
- Replying to @ryanmart3n: Highlight 3. Answer filtering didn't work. We experimented with numerous verification and answer filtering methods, and none gave significant performance improvements. (5/N)
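
To make Highlights 1 and 5 concrete, here is a minimal Python sketch of the two steps: filter questions by teacher response length, then sample several independent answers per kept question. It is not the OpenThoughts pipeline. The function names, the toy teacher, and the choice to keep the questions with the longest responses are all assumptions made for illustration; the thread only says that filtering by response length works, not in which direction.

```python
import random
from typing import Callable, Dict, List, Tuple


def filter_by_response_length(
    questions: List[str], response_lengths: Dict[str, int], keep: int
) -> List[str]:
    """Keep the `keep` questions with the longest teacher responses.

    Assumption: a longer response is treated as a proxy for a harder question.
    """
    ranked = sorted(questions, key=lambda q: response_lengths[q], reverse=True)
    return ranked[:keep]


def sample_answers(
    questions: List[str], teacher: Callable[[str], str], answers_per_question: int
) -> List[Tuple[str, str]]:
    """Query the teacher several times per question, one training row per answer."""
    return [
        (q, teacher(q))
        for q in questions
        for _ in range(answers_per_question)
    ]


if __name__ == "__main__":
    # Hypothetical data: response lengths measured in tokens.
    questions = ["q1", "q2", "q3", "q4"]
    lengths = {"q1": 120, "q2": 900, "q3": 450, "q4": 30}

    # Toy stand-in for a teacher model such as QwQ-32B; a real teacher
    # would return a sampled reasoning trace.
    rng = random.Random(0)
    toy_teacher = lambda q: f"{q}-trace-{rng.randint(0, 9999)}"

    kept = filter_by_response_length(questions, lengths, keep=2)
    rows = sample_answers(kept, toy_teacher, answers_per_question=3)
    print(len(kept), "questions ->", len(rows), "training rows")
```

With 10k kept questions and `answers_per_question=3`, this yields 30k rows, which is the trade-off Highlight 1 compares against 30k questions answered once.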















