Alex Shaw (@alexgshaw) / X

Alex Shaw

644 posts

@alexgshaw

Hacking on @terminalbench and @harborframework. Founding MTS @LaudeInstitute. Formerly Google. BYU alum.

Joined October 2021

Pinned
Alex Shaw
@alexgshaw
Nov 7, 2025
Today, we’re announcing the next chapter of Terminal-Bench with two releases: 1. Harbor, a new package for running sandboxed agent rollouts at scale 2. Terminal-Bench 2.0, a harder version of Terminal-Bench with increased verification
144K
Alex Shaw
@alexgshaw
Nov 3, 2025
We're releasing Terminal-Bench 2.0 this week! Come to our meetup on Thursday @ Databricks to get early access :)
Terminal-Bench 2.0 · Luma
From luma.com
14K
Alex Shaw
@alexgshaw
Jul 16, 2025
Evaluating agents on benchmarks is a pain. Each benchmark comes with its own harness, scoring scripts, and environments, and integrating them can take days. We're introducing the Terminal-Bench dataset registry to solve this problem. Think of it as the npm of agent benchmarks. Now
14K
Alex Shaw
@alexgshaw
Jun 30, 2025
Replying to @NTFabiano
The craziest part of this chart is not how well the AI performs (although that is impressive). It’s that the best physician has less than 40% accuracy.
2.9K
Alex Shaw
@alexgshaw
May 19, 2025
Excited to share what I’ve been working on with @andykonwinski, @Mike_A_Merrill, and @lschmidt3 at Stanford & Laude. Introducing Terminal-Bench! A benchmark and framework to quantify how well AI agents accomplish complex tasks in a terminal environment. We believe that the
Mike A. Merrill
@Mike_A_Merrill
May 19, 2025
Many agents (Claude Code, Codex CLI) interact with the terminal to do valuable tasks, but do they currently work well enough to deploy en masse? We’re excited to introduce Terminal-Bench: An evaluation environment and benchmark for AI agents on real-world terminal tasks. Tl;dr
6.2K
Alex Shaw
@alexgshaw
Jul 24, 2025
Replying to @astrodanish
If you have $32B you don’t need to go to Turkey for a hair transplant 😂
7.6K
Alex Shaw
@alexgshaw
Nov 7, 2025
Replying to @alexgshaw
Harbor is the package we wish we had had while making Terminal-Bench. It’s for agent, model, and benchmark developers and researchers who want to evaluate and improve agents and models.
Harbor
From harborframework.com
3.6K
Alex Shaw
@alexgshaw
Nov 13, 2025
Great to see Warp putting up the top score on Terminal-Bench 2.0 just days after release! Even more exciting to hear that they've already made improvements to their agent based on the results. Ultimately, we hope that Terminal-Bench 2.0 accelerates model and agent development in
Warp
@warpdotdev
Nov 11, 2025
Warp is back at the top. Terminal-Bench 2.0 just launched and Warp secured the top spot with a score of 50.1%. The best agent to go from prompt to production.
2.8K
Alex Shaw
@alexgshaw
Nov 7, 2025
Replying to @alexgshaw
Just a few of the features I love about Harbor: - Evaluate any agent that can be installed and run autonomously - Scale up to thousands of concurrent containers using providers like @daytonaio and @modal - Generate rollouts for SFT and RL - Create your own benchmarks or use
3K
Alex Shaw
@alexgshaw
Apr 10, 2024
Replying to @karpathy
Be honest, did you use GitHub Copilot when you wrote this?
15K
Alex Shaw
@alexgshaw
Nov 7, 2025
Replying to @alexgshaw
At present, Codex CLI with GPT-5 sits at the top of our new leaderboard. tbench.ai/leaderboard
2.6K
Alex Shaw
@alexgshaw
Aug 9, 2023
.@supabase's integration of AI into the SQL editor is easily the most convenient use of AI I have found in a product other than @GitHubCopilot. It's not complicated, but it does exactly what I need it to: (1/2)
879
Alex Shaw
@alexgshaw
Nov 7, 2025
Replying to @alexgshaw
Additionally, Terminal-Bench wouldn’t be possible without its community. We’re so thankful to the over 1k members of our Discord who contributed and audited tasks, helped build and beta test Harbor, and made this such a fun project for everyone involved.
1.9K
Alex Shaw
@alexgshaw
Oct 20, 2025
Mike and I went on the @latentspacepod!
Mike A. Merrill
@Mike_A_Merrill
Oct 20, 2025
Had a great time talking about the history of Terminal-Bench and the future of agent evals with @alexgshaw, @swyx and @FanaHOVA on @latentspacepod. 🔗 Link below!
2.5K