Challenge and measure AI agents on economically valuable, real-world tasks.
Led by UC Berkeley RDI × RDI Foundation
Agents' Last Exam aims to build the broadest-coverage agent evaluation benchmark to date, measuring performance on long-horizon, economically valuable tasks with verifiable outcomes. Co-led by Berkeley RDI and 300+ industry experts, ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy).
| Broadest Coverage 55 industries |
Verifiable Outcomes Hidden references + deterministic graders |
Long-Horizon Multi-step workflows on real OS sandboxes |
Economically Valuable Sourced and validated by industry experts |
This repository is the open evaluation framework: the ale_run toolkit that
provisions sandboxes, runs agents, and grades them, plus 150 reference tasks
across 55 industries (the current public subset of a 1,500+ task corpus) and two reference agent harnesses.
One command boots a real cloud sandbox, runs an agent on a hello-world task, and grades the result — after a one-time Google Cloud setup (~10 min, covered by the $300 free trial).
→ docs/quickstart.md walks it end to end: create a project, copy the sandbox image, fill in two keys, and run your first task. You can manually setup your account then hand the doc to your coding agents to finish the rest.
ALE targets frontier agent systems — a harness orchestrating a foundation model, carrying its own action loop, tools, memory, and sub-agents. Rather than puppeteer such a system step by step — which would strip away the very machinery that makes it capable — ALE hands it only a task description, lets it work to completion on a real machine, and scores the artifacts it leaves behind. Each system keeps its full capabilities, and very different agents become comparable on the one axis that matters: did the work get done.
Every run is built from three interchangeable pieces:
- Agent harness — the system under test (Claude Code, Codex, Openclaw, …), a harness driving a foundation model through its own loop. Real workflows need both a terminal and a screen, so ALE evaluates what the paper calls Generalist CUA-agents: agents that combine CLI and GUI, not just one. Most harnesses are CLI-native, so ALE lifts any of them to that surface with a unified, cross-OS CUA MCP bridge — desktop actions (screenshot, click, type, scroll, …) exposed as ordinary tools in the agent's loop.
- Environment (sandbox) — a virtual machine that faithfully reproduces the real production work context: a full Windows or Linux OS with the actual professional software installed and the task's real data on disk — not a simplified or sanitized environment.
- Task — a unit of real professional work, written as an executable
main.py: an instruction, its input data, and a hidden reference that the graderevaluate()scores the output against, in [0, 1].
An experiment is just a pairing (one agent × one environment × one task) that the orchestrator runs through a fixed loop:
provision the sandbox → stage the task's inputs → run the agent to completion
→ stage the hidden reference → grade the output → score + collect logs and a unified trajectory
The hidden reference is staged only after the agent finishes, so the answer cannot leak into the run. Every run is then recorded in full: a uniform trajectory (each step, tool call, and observation in one schema), the agent's raw logs, the evaluation result, and the artifacts in play (files written, screenshots seen). A run can be replayed and audited end to end.
A harness reaches the sandbox in one of two shapes. In-sandbox harnesses are CLIs that run inside the VM; at launch ALE injects the CLI and the CUA MCP bridge into the freshly-booted machine. An out-of-sandbox harness runs in ALE's own process, outside the VM, driving it remotely through two MCP bridges (one CLI-based for shell and files, one GUI) while keeping its own memory, sub-agents, and context management alongside. ALE-Claw is the reference: ale_run/agents/ale_claw.
Deeper dives into the system design (the sandbox and providers, the executor/deployer split, task data and grading, the trajectory format) live in the docs site: agents-last-exam.org/docs.
Past the demo, ALE ships curated task lists across three difficulty tiers (near-term, full-spectrum, last-exam), plus an unlicensed track and a Linux-only slice. A full run is one experiment YAML wiring an agent matrix, an environment, and a task list, with outputs pushed to a GCS bucket.
The step-by-step (provider setup, configuring an experiment, choosing task lists) is in the docs site under Run experiments. Browse tasks and results at the tasks gallery.
To test your own agent harness or CLI on ALE, implement a small deployer. Guide: Build on ALE → Add an agent.
| Component | License | Scope |
|---|---|---|
| Software | Apache-2.0 | ale_run/, configs/, docs/, and other framework files |
| Data | CC-BY-4.0 | tasks/, selected_tasks/, sample_run/, and other benchmark content |
If you use ALE in published work, please cite the paper (arXiv:2606.05405):
@article{sun2026agentslastexam,
title = {Agents' Last Exam},
author = {Sun, Yiyou and Han, Xinyang and Zhang, Weichen and Pang, Yuanbo and Wang, Tianyu and Cao, Yuhan and Huang, Yixiao and Duroiu, Chris and Zhang, Haoyun and Lin, Jeffrey and Zhang, Weishu and Zeng, Tyler and Yan, Ying and Liu, Bo and Wen, Hanson and Xu, Mingyang and Liu, Xiaoyuan and Chen, Zimeng and Shi, Weiyan and Dsouza, Amanda and Chen, Vincent Sunn and Song, Dawn and Bryant, Patrick and Boettiger, Carl and Rangan, Yamini and Rothenberg, Bradley and Steinfeld, Kyle and Rao, Arvind and Schneider, Tapio and Yannakakis, Georgios and Zanna, Laure and Ozbay, Kaan and Sim, Ida and Zohdi, Tarek and Karniadakis, George Em and Gallant, Jack and Head-gordon, Teresa and others},
journal = {arXiv preprint arXiv:2606.05405},
year = {2026}
}Stay updated: Mailing list · Contact: rdi_research@berkeley.edu
