Today, we’re announcing the next chapter of Terminal-Bench with two releases:
1. Harbor, a new package for running sandboxed agent rollouts at scale
2. Terminal-Bench 2.0, a harder version of Terminal-Bench with increased verification
Many agents (Claude Code, Codex CLI) interact with the terminal to do valuable tasks, but do they currently work well enough to deploy en masse?
We’re excited to introduce Terminal-Bench: an evaluation environment and benchmark for AI agents on real-world terminal tasks.
I've finished my PhD and am starting a postdoc where I'll be working with an amazing team on this and other projects related to datasets and benchmarks for reasoning models. Hope to see some of you in the Bay Area soon ;)
We are announcing Open Thoughts, our large-scale open-source effort to curate the best open reasoning datasets!
DeepSeek-R1 is amazing but we still don't have access to high-quality open reasoning datasets. These datasets are crucial if you want to build your reasoning models!
The question below is pretty easy for humans. Why can't GPT-4 get it right? In our new preprint we introduce "time series reasoning" and show that modern language models are surprisingly bad at interpreting these critical data. arxiv.org/abs/2404.11757
Plenty of new papers use LLMs for time series forecasting📈. But we don't think it works! Our preprint demonstrates that the pretrained parameters of LLMs are not actually useful for (and can even hurt performance on!) forecasting tasks:
Thrilled to see Terminal-Bench on the Claude 4 model card. We're just getting started! Come join our community to help us build the best framework for evaluating agents on valuable tasks
Introducing the next generation: Claude Opus 4 and Claude Sonnet 4.
Claude Opus 4 is our most powerful model yet, and the world’s best coding model.
Claude Sonnet 4 is a significant upgrade from its predecessor, delivering superior coding and reasoning.
LLM-based agents could provide cheap, expedient, and personalized health guidance at a population scale. 🤖🩺
In our paper from my internship (highlighted on the Google Research blog!) we introduce PHIA - the Personal Health Insights Agent. (1/6)
Today on the blog, read about the latest from our two new research papers on how AI, particularly fine-tuned Gemini models, can create personalized health experiences that cater to individuals’ unique health journeys. → goo.gle/3RnwHbl #AI #healthcare #personalizedhealth
I love how counterintuitive rigorous empirical research can be. We found that the best models (R1) aren't necessarily the best teachers (QwQ), and that scaling answers per question is as efficient as scaling the number of questions. Great work team!
Announcing OpenThinker3-7B, the new SOTA open-data 7B reasoning model: improving over DeepSeek-R1-Distill-Qwen-7B by 33% on average over code, science, and math evals.
We also release our dataset, OpenThoughts3-1.2M, which is the best open reasoning dataset across all data
1/ 📢 Introducing Homekit2020: The first web-scale public benchmark for wearable sensor data! Our new paper offers an unprecedented dataset with 14 million+ hours of Fitbit data, symptom reports, & PCR influenza test results. 🔬
🚨 [Call for Papers] SEA Workshop @ NeurIPS 2025 🚨
📅 December 6, 2025 | 📍 San Diego, USA
🌐: sea-workshop.github.io
Environments are the "data" for training agents, which is largely missing in the open source ecosystem.
We are hosting Scaling Environments for Agents (SEA)
Turns out, it’s possible to outperform DeepSeekR1-32B with only SFT on open data and no RL: Announcing OpenThinker2-32B and OpenThinker2-7B. We also release the data, OpenThoughts2-1M, curated by selecting quality instructions from diverse sources. 🧵 (1/n)