Today, we’re announcing the next chapter of Terminal-Bench with two releases:
1. Harbor, a new package for running sandboxed agent rollouts at scale
2. Terminal-Bench 2.0, a harder version of Terminal-Bench with more rigorous verification
Evaluating agents on benchmarks is a pain. Each benchmark comes with its own harness, scoring scripts, and environments, and integrating them can take days.
We're introducing the Terminal-Bench dataset registry to solve this problem. Think of it as the npm of agent benchmarks.
Now
The craziest part of this chart is not how well the AI performs (although that is impressive).
It’s that the best physician has less than 40% accuracy.
Excited to share what I’ve been working on with @andykonwinski, @Mike_A_Merrill, and @lschmidt3 at Stanford & Laude.
Introducing Terminal-Bench! A benchmark and framework to quantify how well AI agents accomplish complex tasks in a terminal environment. We believe that the
Many agents (Claude Code, Codex CLI) interact with the terminal to do valuable tasks, but do they currently work well enough to deploy en masse?
We’re excited to introduce Terminal-Bench: An evaluation environment and benchmark for AI agents on real-world terminal tasks. Tl;dr
Harbor is the package we wish we had had while making Terminal-Bench. It’s for agent, model, and benchmark developers and researchers who want to evaluate and improve agents and models.
Great to see Warp putting up the top score on Terminal-Bench 2.0 just days after release! Even more exciting to hear that they've already made improvements to their agent based on the results.
Ultimately, we hope that Terminal-Bench 2.0 accelerates model and agent development in
Warp is back at the top.
Terminal-Bench 2.0 just launched and Warp secured the top spot with a score of 50.1%.
The best agent to go from prompt to production.
Just a few of the features I love about Harbor:
- Evaluate any agent that can be installed and run autonomously
- Scale up to thousands of concurrent containers using providers like @daytonaio and @modal
- Generate rollouts for SFT and RL
- Create your own benchmarks or use
.@supabase's integration of AI into the SQL editor is easily the most convenient use of AI I have found in a product other than @GitHubCopilot. It's not complicated, but it does exactly what I need it to: (1/2)
Additionally, Terminal-Bench wouldn’t be possible without its community. We’re so thankful to the over 1k members of our Discord who contributed and audited tasks, helped build and beta test Harbor, and made this such a fun project for everyone involved.
Had a great time talking about the history of Terminal-Bench and the future of agent evals with @alexgshaw, @swyx and @FanaHOVA on @latentspacepod.
🔗 Link below!