Mike A. Merrill

388 posts

@Mike_A_Merrill

Evals at Anthropic, Co-Creator of Terminal Bench, Go Bills

San Francisco, CA

Joined February 2020

Pinned
Mike A. Merrill
@Mike_A_Merrill
Nov 7, 2025
It's here!
Alex Shaw
@alexgshaw
Nov 7, 2025
Today, we’re announcing the next chapter of Terminal-Bench with two releases: 1. Harbor, a new package for running sandboxed agent rollouts at scale 2. Terminal-Bench 2.0, a harder version of Terminal-Bench with increased verification
20K
Mike A. Merrill
@Mike_A_Merrill
May 19, 2025
Many agents (Claude Code, Codex CLI) interact with the terminal to do valuable tasks, but do they currently work well enough to deploy en masse? We’re excited to introduce Terminal-Bench: An evaluation environment and benchmark for AI agents on real-world terminal tasks. Tl;dr
52K
Mike A. Merrill
@Mike_A_Merrill
Jan 29, 2025
I've finished my PhD and am starting a postdoc where I'll be working with an amazing team on this and other projects related to datasets and benchmarks for reasoning models. Hope to see some of you in the Bay Area soon ;)
Mahesh Sathiamoorthy
@madiator
Jan 28, 2025
We are announcing Open Thoughts, our large-scale open-source effort to curate the best open reasoning datasets! DeepSeek-R1 is amazing but we still don't have access to high-quality open reasoning datasets. These datasets are crucial if you want to build your reasoning models!
12K
Mike A. Merrill
@Mike_A_Merrill
Apr 19, 2024
The question below is pretty easy for humans. Why can't GPT-4 get it right? In our new preprint we introduce "time series reasoning" and show that modern language models are surprisingly bad at interpreting these critical data. arxiv.org/abs/2404.11757
18K
Mike A. Merrill
@Mike_A_Merrill
Jul 2, 2024
Plenty of new papers use LLMs for time series forecasting📈. But we don't think it works! Our preprint demonstrates that the pretrained parameters of LLMs are not actually useful for (and can even hurt performance on!) forecasting tasks:
15K
Mike A. Merrill
@Mike_A_Merrill
May 22, 2025
Thrilled to see Terminal-Bench on the Claude 4 model card. We're just getting started! Come join our community to help us build the best framework for evaluating agents on valuable tasks
Anthropic
@AnthropicAI
May 22, 2025
Introducing the next generation: Claude Opus 4 and Claude Sonnet 4. Claude Opus 4 is our most powerful model yet, and the world’s best coding model. Claude Sonnet 4 is a significant upgrade from its predecessor, delivering superior coding and reasoning.
2K
Mike A. Merrill
@Mike_A_Merrill
Jun 14, 2025
Couldn’t have done it without my advisor @timalthoff (Not pictured)
1.6K
Mike A. Merrill
@Mike_A_Merrill
Jun 12, 2024
LLM-based agents could provide cheap, expedient, and personalized health guidance at a population scale. 🤖🩺 In our paper from my internship (highlighted on the Google Research blog!) we introduce PHIA - the Personal Health Insights Agent. (1/6)
Google AI
@GoogleAI
Jun 12, 2024
Today on the blog, read about the latest from our two new research papers on how AI, particularly fine-tuned Gemini models, can create personalized health experiences that cater to individuals’ unique health journeys. →goo.gle/3RnwHbl #AI #healthcare #personalizedhealth
4.4K
Mike A. Merrill
@Mike_A_Merrill
Jun 5, 2025
I love how counterintuitive rigorous empirical research can be. We found that the best models (R1) aren't necessarily the best teachers (QwQ), and that scaling answers per question is as efficient as scaling the number of questions. Great work team!
Ryan Marten
@ryanmart3n
Jun 5, 2025
Announcing OpenThinker3-7B, the new SOTA open-data 7B reasoning model: improving over DeepSeek-R1-Distill-Qwen-7B by 33% on average over code, science, and math evals. We also release our dataset, OpenThoughts3-1.2M, which is the best open reasoning dataset across all data
2.3K
Mike A. Merrill
@Mike_A_Merrill
Jun 7, 2023
1/ 📢 Introducing Homekit2020: The first web-scale public benchmark for wearable sensor data! Our new paper offers an unprecedented dataset with 14 million+ hours of Fitbit data, symptom reports, & PCR influenza test results. 🔬
6.2K
Mike A. Merrill
@Mike_A_Merrill
Jul 25, 2025
Super excited for this. If you’re working on evals/environments/training for agents consider submitting!
Guohao Li 🐫
@guohao_li
Jul 25, 2025
🚨 [Call for Papers] SEA Workshop @ NeurIPS 2025 🚨 📅 December 6, 2025 | 📍 San Diego, USA 🌐: sea-workshop.github.io Environments are the "data" for training agents, which is largely missing in the open source ecosystem. We are hosting Scaling Environments for Agents (SEA)
1.8K
Mike A. Merrill
@Mike_A_Merrill
Sep 20, 2024
Coming soon to #EMNLP 2024 🥳
Mike A. Merrill
@Mike_A_Merrill
Apr 19, 2024
The question below is pretty easy for humans. Why can't GPT-4 get it right? In our new preprint we introduce "time series reasoning" and show that modern language models are surprisingly bad at interpreting these critical data. arxiv.org/abs/2404.11757
1.1K
Mike A. Merrill
@Mike_A_Merrill
Jun 16, 2025
this is why we made terminal bench - just give the ai a bash shell, it'll be fine
Sasha Rush
@srush_nlp
Jun 16, 2025
folks, claude+grep is crazy good. I don't think "searching for stuff" is going to be the core ability that differentiates human intelligence.
11K
Mike A. Merrill
@Mike_A_Merrill
Apr 3, 2025
it’s a good model. Etash, Ryan, and the rest of the team are crushing it with these OpenThinker releases. Proud to have a (tiny) part in all this.
Etash Guha
@etash_guha
Apr 3, 2025
Turns out, it’s possible to outperform DeepSeekR1-32B with only SFT on open data and no RL: Announcing OpenThinker2-32B and OpenThinker2-7B. We also release the data, OpenThoughts2-1M, curated by selecting quality instructions from diverse sources. 🧵 (1/n)
866