To build AI agents that one day may become qualified for real-world work, we need to benchmark their performance on tasks that closely resemble those they will encounter in practice. This repository:
- Surveys and depicts the landscape of real-world tasks through the lens of domain and skills (real_work)
- Aggregates existing benchmarking efforts about LLM-based AI agents (benchmark)
- Maps individual benchmark examples to real-world work they represent (mapping)
- Profiles agent success and autonomy on varied benchmarks (profiling)
- Performs multi-faceted analysis regarding the gap between benchmarks and real-world work (analysis)
We understand real-world work through two main dimensions: domain and skills.
- Domains: We refer to the Job Family annotation from O*NET to define domains. Each job family represents a group of occupations that share similar work performed, skills, education, training, and credentials.
- Skills: We define skills as activities that human workers frequently perform in their daily work, and adopt the Work Activities annotation from O*NET. Because this annotation covers a wide range of work activities, both digital and physical, we specifically annotate the nodes that are more relevant to digital work.
We aggregate existing benchmarking efforts about AI agents that are relevant to real-world work (read more), in order to:
- map benchmark tasks to the domains and skills they represent in human work
- collect agent task-solving trajectories and profile their autonomy levels
We map individual benchmark examples to real-world work, in terms of domain and skills, using the taxonomy defined above.
The mapping is produced by an LLM and then validated by multiple human annotators. See mapping/ for more details.
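The LLM-then-human validation step can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `MappingRecord` schema, the `is_validated` helper, and the strict-majority rule are hypothetical and are not taken from the code in mapping/.

```python
from dataclasses import dataclass, field


@dataclass
class MappingRecord:
    """One benchmark example with an LLM-proposed label (hypothetical schema)."""
    example_id: str
    llm_domain: str   # e.g. an O*NET Job Family name
    llm_skill: str    # e.g. an O*NET Work Activity name
    # One vote per human annotator: True means "agrees with the LLM label".
    human_votes: list[bool] = field(default_factory=list)


def is_validated(record: MappingRecord, min_annotators: int = 2) -> bool:
    """Accept the LLM label only if enough annotators voted and a
    strict majority of them agree (assumed rule, not the repo's)."""
    if len(record.human_votes) < min_annotators:
        return False
    agree = sum(record.human_votes)
    return agree * 2 > len(record.human_votes)


record = MappingRecord(
    example_id="ex-001",  # made-up identifier
    llm_domain="Computer and Mathematical",
    llm_skill="Working with Computers",
    human_votes=[True, True, False],
)
print(is_validated(record))  # True: 2 of 3 annotators agree
```

A strict majority is only one possible acceptance rule; a project could equally require unanimous agreement or report an inter-annotator agreement statistic instead.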
We induce workflows from the collected agent trajectories and measure their autonomy levels across (i) work domains and skills, and (ii) agent frameworks and models.
This helps answer the questions "How should you interact with your agent?" and "What is the right level of autonomy for your agent?"
See profiling/ for more details.
Beyond work areas that current benchmarking efforts cover, we also explore:
- What skills and domains are over/under-represented in current benchmarking efforts?
- How should we create new examples to better reflect the diversity and complexity of real-world work?
- ... ...
See analysis/ for more details.
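One way to make the over/under-representation question concrete is to compare each category's share of benchmark examples with its share of real-world work. The sketch below is illustrative only: `representation_ratio`, the placeholder category names, and all numbers are made up, and it does not reflect how analysis/ actually measures the gap.

```python
from collections import Counter


def representation_ratio(benchmark_labels, real_world_share):
    """Return (benchmark share / real-world share) per category.

    A ratio above 1 suggests over-representation in benchmarks,
    below 1 under-representation, and 0 means no coverage at all.
    """
    counts = Counter(benchmark_labels)
    total = sum(counts.values())
    ratios = {}
    for category, share in real_world_share.items():
        if share <= 0:
            continue  # skip categories with no real-world mass
        bench_share = counts.get(category, 0) / total if total else 0.0
        ratios[category] = bench_share / share
    return ratios


# Placeholder data: one domain label per benchmark example.
labels = ["domain_a", "domain_a", "domain_a", "domain_b"]
# Placeholder real-world shares (e.g. a fraction of work per domain).
shares = {"domain_a": 0.5, "domain_b": 0.25, "domain_c": 0.25}

ratios = representation_ratio(labels, shares)
print(ratios)  # {'domain_a': 1.5, 'domain_b': 1.0, 'domain_c': 0.0}
```

The same function applies to skills as well as domains, since it only needs one label per benchmark example and a reference distribution to compare against.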