AI Agents for Real World Work

To build AI agents that may one day qualify for real-world work, we need to benchmark their performance on tasks that closely resemble those they will encounter in practice. This repository:

Surveys and depicts the landscape of real-world tasks through the lens of domain and skills (real_work)
Aggregates existing benchmarking efforts about LLM-based AI agents (benchmark)
Maps individual benchmark examples to real-world work they represent (mapping)
Profiles agent success and autonomy on varied benchmarks (profiling)
Performs multi-faceted analysis regarding the gap between benchmarks and real-world work (analysis)

💼 Taxonomy of Real-World Work

We understand real-world work through two main dimensions: domain and skills.

Domains: We refer to the Job Family annotation from O*NET to define domains. Each job family represents a group of occupations that share similar work performed, skills, education, training, and credentials.
Skills: We define skills as activities that human workers frequently perform in their daily work, adopting the Work Activities annotation from O*NET. Because this annotation covers a wide range of work activities, both digital and physical, we specifically annotated the nodes that are more relevant to digital work.
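
As a rough illustration of the two dimensions, a label in this taxonomy can be thought of as one domain plus a set of skills. The sketch below is an assumption about how such a label could be represented, not the format used in real_work; the `WorkLabel` class, the `make_label` helper, and the example label strings are hypothetical and should be checked against O*NET.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class WorkLabel:
    """One point in the two-dimensional taxonomy (hypothetical structure)."""

    domain: str             # intended to hold an O*NET Job Family name
    skills: frozenset[str]  # intended to hold O*NET Work Activity names


def make_label(domain: str, skills: list[str]) -> WorkLabel:
    # Strip whitespace and deduplicate skills so equivalent labels compare equal.
    return WorkLabel(domain.strip(), frozenset(s.strip() for s in skills))


# Illustrative label strings only; the exact names should come from O*NET.
label = make_label(
    "Computer and Mathematical",
    ["Analyzing Data or Information", "Working with Computers", "Working with Computers"],
)
print(label.domain, sorted(label.skills))
```

Using a frozen dataclass with a `frozenset` keeps labels hashable, so they can serve as dictionary keys when counting examples per domain or skill.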

📊 Agent Benchmarks

We aggregate existing benchmarking efforts on AI agents that are relevant to real-world work, in order to:

map benchmark tasks to the domains and skills they represent in human work
collect agent task-solving trajectories and profile their autonomy levels

🪄 Mapping Benchmarks to Real-World Work

We map individual benchmark examples to real-world work, in terms of domain and skills, using the taxonomy defined above.

The mapping is produced by an LLM and then validated by multiple human annotators. See mapping/ for more details.
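
The validation step can be pictured as checking each LLM-proposed label against the human annotations. The snippet below is a minimal sketch under an assumed rule (accept a label when at least `min_agree` annotators chose the same one); the actual protocol lives in mapping/ and may differ. The function name `is_validated` and the example task IDs and labels are hypothetical.

```python
from collections import Counter


def is_validated(llm_label: str, human_labels: list[str], min_agree: int = 2) -> bool:
    """Assumed rule: keep the LLM label if enough annotators independently agree with it."""
    return Counter(human_labels)[llm_label] >= min_agree


# Toy records: (task id, LLM-proposed domain, labels from three annotators).
records = [
    ("task-1", "domain_a", ["domain_a", "domain_a", "domain_b"]),
    ("task-2", "domain_a", ["domain_b", "domain_b", "domain_a"]),
]

accepted = [task for task, llm, humans in records if is_validated(llm, humans)]
print(accepted)
```

Labels that fail the check would then need a tie-breaking or re-annotation step, which this sketch leaves out.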

🎨 Profiling Agent Autonomy

We induce workflows from the collected agent trajectories and measure their autonomy levels across (i) work domains and skills, and (ii) agent frameworks and models.

This helps answer the questions "How should you interact with your agent?" and "What is the right level of autonomy for your agent?" See profiling/ for more details.

💭 Analysis: The Gap Between Benchmarks and Real-World Work

Beyond work areas that current benchmarking efforts cover, we also explore:

Which skills and domains are over- or under-represented in current benchmarking efforts?
How should we create new examples to better reflect the diversity and complexity of real-world work?
... ...

See analysis/ for more details.
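
One simple way to quantify over- and under-representation is to compare each domain's share of benchmark examples with its share of real-world work. The sketch below is an assumption about how such a ratio could be computed, not the method used in analysis/; the function `representation_ratio`, the domain names, and all numbers are made up for illustration.

```python
def representation_ratio(
    benchmark_counts: dict[str, int], work_share: dict[str, float]
) -> dict[str, float]:
    """Ratio > 1 suggests over-representation in benchmarks, < 1 under-representation."""
    total = sum(benchmark_counts.values())
    ratios = {}
    for domain, share in work_share.items():
        # Domains with no benchmark examples get a benchmark share of 0.
        bench_share = benchmark_counts.get(domain, 0) / total
        ratios[domain] = bench_share / share  # assumes every work share is positive
    return ratios


# Toy inputs: counts of benchmark examples and assumed shares of real-world work.
toy_counts = {"domain_a": 80, "domain_b": 20}
toy_shares = {"domain_a": 0.4, "domain_b": 0.5, "domain_c": 0.1}

ratios = representation_ratio(toy_counts, toy_shares)
for domain, ratio in sorted(ratios.items()):
    print(f"{domain}: {ratio:.2f}")
```

What counts as a domain's "share of real-world work" (employment, hours, economic value) is a modeling choice that this sketch leaves open.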

