Agents' Last Exam

Challenge and measure AI agents on economically valuable, real-world tasks.

Led by UC Berkeley RDI × RDI Foundation

Agents' Last Exam aims to build the broadest-coverage agent evaluation benchmark to date, measuring performance on long-horizon, economically valuable tasks with verifiable outcomes. Co-led by Berkeley RDI and 300+ industry experts, ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy).

Broadest Coverage
_{55 industries}

Verifiable Outcomes
_{Hidden references
+ deterministic graders}

Long-Horizon
_{Multi-step workflows
on real OS sandboxes}

Economically Valuable
_{Sourced and validated
by industry experts}

This repository is the open evaluation framework: the ale_run toolkit that provisions sandboxes, runs agents, and grades them, plus 150 reference tasks across 55 industries (the current public subset of a 1,500+ task corpus) and two reference agent harnesses.

Quick start

One command boots a real cloud sandbox, runs an agent on a hello-world task, and grades the result — after a one-time Google Cloud setup (~10 min, covered by the $300 free trial).

→ docs/quickstart.md walks it end to end: create a project, copy the sandbox image, fill in two keys, and run your first task. You can manually setup your account then hand the doc to your coding agents to finish the rest.

How ALE works

ALE targets frontier agent systems — a harness orchestrating a foundation model, carrying its own action loop, tools, memory, and sub-agents. Rather than puppeteer such a system step by step — which would strip away the very machinery that makes it capable — ALE hands it only a task description, lets it work to completion on a real machine, and scores the artifacts it leaves behind. Each system keeps its full capabilities, and very different agents become comparable on the one axis that matters: did the work get done.

Every run is built from three interchangeable pieces:

Agent harness — the system under test (Claude Code, Codex, Openclaw, …), a harness driving a foundation model through its own loop. Real workflows need both a terminal and a screen, so ALE evaluates what the paper calls Generalist CUA-agents: agents that combine CLI and GUI, not just one. Most harnesses are CLI-native, so ALE lifts any of them to that surface with a unified, cross-OS CUA MCP bridge — desktop actions (screenshot, click, type, scroll, …) exposed as ordinary tools in the agent's loop.
Environment (sandbox) — a virtual machine that faithfully reproduces the real production work context: a full Windows or Linux OS with the actual professional software installed and the task's real data on disk — not a simplified or sanitized environment.
Task — a unit of real professional work, written as an executable main.py: an instruction, its input data, and a hidden reference that the grader evaluate() scores the output against, in [0, 1].

A run, end to end

An experiment is just a pairing (one agent × one environment × one task) that the orchestrator runs through a fixed loop:

  provision the sandbox  →  stage the task's inputs  →  run the agent to completion
       →  stage the hidden reference  →  grade the output  →  score + collect logs and a unified trajectory

The hidden reference is staged only after the agent finishes, so the answer cannot leak into the run. Every run is then recorded in full: a uniform trajectory (each step, tool call, and observation in one schema), the agent's raw logs, the evaluation result, and the artifacts in play (files written, screenshots seen). A run can be replayed and audited end to end.

A harness reaches the sandbox in one of two shapes. In-sandbox harnesses are CLIs that run inside the VM; at launch ALE injects the CLI and the CUA MCP bridge into the freshly-booted machine. An out-of-sandbox harness runs in ALE's own process, outside the VM, driving it remotely through two MCP bridges (one CLI-based for shell and files, one GUI) while keeping its own memory, sub-agents, and context management alongside. ALE-Claw is the reference: ale_run/agents/ale_claw.

Deeper dives into the system design (the sandbox and providers, the executor/deployer split, task data and grading, the trajectory format) live in the docs site: agents-last-exam.org/docs.

Running the benchmark

Past the demo, ALE ships curated task lists across three difficulty tiers (near-term, full-spectrum, last-exam), plus an unlicensed track and a Linux-only slice. A full run is one experiment YAML wiring an agent matrix, an environment, and a task list, with outputs pushed to a GCS bucket.

The step-by-step (provider setup, configuring an experiment, choosing task lists) is in the docs site under Run experiments. Browse tasks and results at the tasks gallery.

Build on ALE

To test your own agent harness or CLI on ALE, implement a small deployer. Guide: Build on ALE → Add an agent.

License

Component	License	Scope
Software	Apache-2.0	`ale_run/`, `configs/`, `docs/`, and other framework files
Data	CC-BY-4.0	`tasks/`, `selected_tasks/`, `sample_run/`, and other benchmark content

Citation

If you use ALE in published work, please cite the paper (arXiv:2606.05405):

@article{sun2026agentslastexam,
  title   = {Agents' Last Exam},
  author  = {Sun, Yiyou and Han, Xinyang and Zhang, Weichen and Pang, Yuanbo and Wang, Tianyu and Cao, Yuhan and Huang, Yixiao and Duroiu, Chris and Zhang, Haoyun and Lin, Jeffrey and Zhang, Weishu and Zeng, Tyler and Yan, Ying and Liu, Bo and Wen, Hanson and Xu, Mingyang and Liu, Xiaoyuan and Chen, Zimeng and Shi, Weiyan and Dsouza, Amanda and Chen, Vincent Sunn and Song, Dawn and Bryant, Patrick and Boettiger, Carl and Rangan, Yamini and Rothenberg, Bradley and Steinfeld, Kyle and Rao, Arvind and Schneider, Tapio and Yannakakis, Georgios and Zanna, Laure and Ozbay, Kaan and Sim, Ida and Zohdi, Tarek and Karniadakis, George Em and Gallant, Jack and Head-gordon, Teresa and others},
  journal = {arXiv preprint arXiv:2606.05405},
  year    = {2026}
}

Stay updated: Mailing list · Contact: rdi_research@berkeley.edu

Name		Name	Last commit message	Last commit date
Latest commit History 27 Commits
ale_run		ale_run
assets		assets
configs		configs
docs		docs
secret		secret
selected_tasks		selected_tasks
tasks		tasks
tests		tests
.dockerignore		.dockerignore
.gitignore		.gitignore
.gitmodules		.gitmodules
AGENTS.md		AGENTS.md
LICENSE		LICENSE
LICENSE-DATA		LICENSE-DATA
README.md		README.md
example_exp.yaml		example_exp.yaml
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Agents' Last Exam

Quick start

How ALE works

A run, end to end

Running the benchmark

Build on ALE

License

Citation

About

Licenses found

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Agents' Last Exam

Quick start

How ALE works

A run, end to end

Running the benchmark

Build on ALE

License

Citation

About

Resources

License

Licenses found

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages