Feature: Agent Autonomy Profiling — Complexity-Graded Evaluation Across Work Domains (inspired by AI4Work) #506

@teknium1

Overview

AI4Work (arXiv: 2603.01203) introduces a unified complexity scale and autonomy measurement framework for evaluating AI agents against real-world work. Rather than measuring binary pass/fail on isolated benchmarks, it measures the maximum task complexity an agent can handle reliably across different work domains and skills — what they call the agent's "autonomy frontier."

This issue proposes implementing AI4Work's autonomy profiling framework as an evaluation environment for Hermes Agent, providing a systematic way to measure and track our agent's real-world work capability across the O*NET occupational taxonomy.

Companion issue: #505 (Work-Aligned Capability Expansion — what to build based on AI4Work's gap analysis)
Related: #340 (YC-Bench — complementary long-horizon benchmark)


Research Findings

The Autonomy Framework

AI4Work defines agent autonomy formally:

Autonomy = max{k | SR(k) >= H}

Where:

  • k = task complexity level (number of distinct procedural steps/skills required)
  • SR(k) = success rate at complexity level k
  • H = target success threshold (e.g., 80%)

In plain terms: the maximum task complexity an agent can handle while maintaining an acceptable success rate. This is a much richer signal than "X% on SWE-bench."
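
As a minimal sketch of the definition above (not AI4Work's own evaluation script), the frontier can be computed from a list of per-task outcomes. The function name, the input shape of `(complexity_level, succeeded)` pairs, and the convention of returning 0 when no level reaches the threshold are all assumptions made for illustration; the toy data is made up.

```python
from collections import defaultdict


def autonomy_frontier(results, threshold=0.8):
    """Return max{k | SR(k) >= threshold}, or 0 if no level qualifies.

    `results` is an iterable of (complexity_level, succeeded) pairs.
    Returning 0 for "no qualifying level" is an assumed convention.
    """
    attempts = defaultdict(int)
    successes = defaultdict(int)
    for level, succeeded in results:
        attempts[level] += 1
        successes[level] += int(bool(succeeded))
    # Keep every level whose success rate SR(k) meets the threshold H.
    qualifying = [k for k in attempts if successes[k] / attempts[k] >= threshold]
    return max(qualifying, default=0)


# Toy data: SR(1) = 1.0, SR(3) = 0.9, SR(7) = 0.5
runs = (
    [(1, True)] * 10
    + [(3, True)] * 9 + [(3, False)]
    + [(7, True)] * 5 + [(7, False)] * 5
)
print(autonomy_frontier(runs))  # prints 3
```

Note that the literal definition takes the highest qualifying level even if some lower level falls below H; a stricter variant could require every level up to k to qualify.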

Complexity Scale Examples

| Level | Example Task |
| --- | --- |
| 1 | "Navigate to the Plus section of Cambridge Dictionary" |
| 3 | "Find the cheapest one-way flight from NYC to London" |
| 7 | "Inspect JSON file structure and update script to use correct field names" |
| 10 | "Set up a CI pipeline with testing, linting, and deployment stages" |
| 14 | "Implement a reinforcement learning algorithm for a multi-agent environment" |

Complexity is measured by workflow induction — decomposing agent trajectories into hierarchical action sequences and counting distinct procedural steps. AI4Work provides a Workflow Induction Toolkit (desktop app for macOS/Windows) and Python scripts (profiling/segment.py, profiling/induce.py) for this analysis.
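
To make the counting idea concrete, here is a rough sketch that counts distinct leaf steps in a nested workflow tree. The `{"name", "children"}` schema and the example workflow are hypothetical; they are not the format of AI4Work's workflow JSON files, and this is not the logic of profiling/induce.py.

```python
def count_distinct_steps(node):
    """Count distinct leaf step names in a nested workflow tree.

    Assumed schema: each node is {"name": str, "children": [node, ...]}.
    A node with no children is treated as one procedural step.
    """
    def leaves(n):
        children = n.get("children") or []
        if not children:
            yield n["name"]
        for child in children:
            yield from leaves(child)

    return len(set(leaves(node)))


# Hypothetical decomposition of a "fix the script" trajectory
workflow = {
    "name": "fix script field names",
    "children": [
        {"name": "inspect JSON", "children": [
            {"name": "open file"},
            {"name": "list keys"},
        ]},
        {"name": "update script", "children": [
            {"name": "open file"},  # repeated step, counted once
            {"name": "edit field names"},
            {"name": "run script"},
        ]},
    ],
}
print(count_distinct_steps(workflow))  # prints 4
```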

Key Performance Findings from the Paper

  • Outside coding and office tasks, most agents show zero coverage for complexity > 6
  • Even in software engineering, success rates drop sharply with complexity
  • OpenHands outperformed SWE-agent; Claude models showed higher autonomy than GPT for medium-complexity tasks
  • Agents perform best on self-contained "Work Output" tasks but struggle with "Information Input" and "Interacting with Others"
  • Framework matters: the same model achieves different autonomy levels under different agent frameworks, so the best model does not guarantee the best agent; the agent architecture matters as much as the LLM

The AI4Work Data Infrastructure

The companion repository (zorazrw/ai4work-resources) provides:

  • 72,342 mapped task instances across 43 benchmarks with O*NET domain + skill annotations
  • Domain mapping database (19MB JSON): benchmark tasks → O*NET occupational domains/tasks
  • Skill mapping database (25MB JSON): benchmark tasks → O*NET work activities/skills
  • 3,828 workflow JSON files from agent trajectories with hierarchical action decomposition
  • Autonomy evaluation scripts (evaluate/measure_autonomy.py)
  • Coverage analysis tools (coverage.py)

The mapping was performed via GPT-5 with 90.9% human-LM agreement for domains and 89.3% for skills.


Current State in Hermes Agent

Existing Evaluation Infrastructure

environments/
  hermes_base_env.py    — Abstract base class (HermesAgentBaseEnv)
  agent_loop.py         — Reusable multi-turn agent engine (HermesAgentLoop)
  tool_context.py       — Per-rollout tool access
  benchmarks/
    terminalbench_2/    — Per-task coding challenges (binary pass/fail)
    tblite/             — Terminal benchmark lite
  hermes_swe_env/       — Software engineering tasks
  terminal_test_env/    — Terminal capability testing

The Gap

All existing benchmarks measure domain-specific task completion (did the code work? did the test pass?). None measure:

  • Cross-domain capability — How does Hermes perform on management vs. legal vs. financial tasks?
  • Complexity frontier — At what complexity level does Hermes start failing?
  • Skill-level profiling — Which O*NET work activities can Hermes handle?
  • Autonomy tracking over time — Is Hermes getting better at real-world work with each release?

Implementation Plan

Skill vs. Tool Classification

This is an Atropos environment (neither a skill nor a tool). It lives under environments/benchmarks/autonomy_profile/ following the pattern of TerminalBench2 and the proposed YC-Bench (#340).

Architecture

┌─────────────────────────────────────────┐
│          Autonomy Profile Env           │
│                                         │
│  ┌──────────┐   ┌───────────────────┐   │
│  │ O*NET    │   │ Task Complexity   │   │
│  │ Taxonomy │   │ Grading Engine    │   │
│  │ (JSON)   │   │                   │   │
│  └────┬─────┘   └────────┬──────────┘   │
│       │                  │              │
│  ┌────▼──────────────────▼──────────┐   │
│  │   Evaluation Matrix Generator    │   │
│  │   (domain × skill × complexity)  │   │
│  └────────────┬─────────────────────┘   │
│               │                         │
│  ┌────────────▼─────────────────────┐   │
│  │   HermesAgentLoop Executor       │   │
│  │   (runs agent on each task)      │   │
│  └────────────┬─────────────────────┘   │
│               │                         │
│  ┌────────────▼─────────────────────┐   │
│  │   Autonomy Frontier Calculator   │   │
│  │   max{k | SR(k) >= H}           │   │
│  └────────────┬─────────────────────┘   │
│               │                         │
│  ┌────────────▼─────────────────────┐   │
│  │   Profile Report Generator       │   │
│  │   (domain radar, skill heatmap)  │   │
│  └──────────────────────────────────┘   │
└─────────────────────────────────────────┘
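
One possible shape for the Evaluation Matrix Generator stage is a grouping of task IDs by (domain, skill, complexity) cell. This sketch assumes tasks carry the `id`, `domain`, `skill`, and `complexity` fields proposed under Task Structure; the function name and the second example task are hypothetical.

```python
from collections import defaultdict


def build_matrix(tasks):
    """Group task IDs into (domain, skill, complexity) cells.

    A task tagged with several skills lands in one cell per skill.
    """
    matrix = defaultdict(list)
    for task in tasks:
        for skill in task["skill"]:
            cell = (task["domain"], skill, task["complexity"])
            matrix[cell].append(task["id"])
    return dict(matrix)


example_tasks = [
    {
        "id": "mgmt-003-L7",
        "domain": "Management",
        "skill": [
            "Organizing, Planning, and Prioritizing Work",
            "Making Decisions and Solving Problems",
        ],
        "complexity": 7,
    },
    {
        "id": "legal-001-L3",  # hypothetical task, for illustration only
        "domain": "Legal",
        "skill": ["Getting Information"],
        "complexity": 3,
    },
]

matrix = build_matrix(example_tasks)
for cell, ids in sorted(matrix.items()):
    print(cell, ids)
```

Empty cells then show directly where the task bank lacks coverage before any agent run is spent on it.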

What We'd Need

  1. O*NET taxonomy data — Import AI4Work's taxonomy_domain.json and taxonomy_skill.json (or generate our own subset)
  2. Task bank — Curated tasks at varying complexity levels across domains, sourced from:
    • AI4Work's mapped benchmark tasks (72K available)
    • Hand-crafted tasks for domains with poor benchmark coverage
    • Real-world task descriptions from O*NET adapted into agent-executable instructions
  3. Complexity grading — Either use AI4Work's workflow induction approach (post-hoc) or pre-grade tasks by complexity
  4. Environment class (autonomy_profile_env.py) extending HermesAgentBaseEnv
  5. Scoring functions — Per-task binary success + complexity-aware autonomy calculation
  6. Visualization — Domain radar chart, skill heatmap, autonomy frontier curve

Phased Rollout

Phase 1: Core Profiling Environment (MVP)

  • Import O*NET domain and skill taxonomies from AI4Work
  • Curate an initial task bank: 50-100 tasks per domain across 10 domains, spread over 5 complexity levels (500-1000 total eval items)
  • Focus on domains where tasks can be objectively evaluated:
    • Computer & Mathematical (baseline, should score high)
    • Office & Administrative (document processing, email drafting, scheduling)
    • Business & Financial (data analysis, report generation, budgeting)
    • Management (project planning, resource allocation)
    • Legal (document review, research queries)
  • Implement autonomy frontier calculation: max{k | SR(k) >= H}
  • Generate per-domain autonomy scores and an aggregate profile
  • JSONL output compatible with existing eval infrastructure
  • default.yaml config with task subset and complexity range
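
A rough sketch of the per-domain scoring and JSONL output steps follows. The record fields (`domain`, `complexity`, `success`), the output keys, and the convention of reporting 0 when no level qualifies are assumptions; the existing eval infrastructure may expect a different schema.

```python
import io
import json
from collections import defaultdict


def domain_profile(records, threshold=0.8):
    """Map each domain to max{k | SR(k) >= threshold} (0 if none qualifies)."""
    # domain -> complexity level -> [successes, attempts]
    stats = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for r in records:
        cell = stats[r["domain"]][r["complexity"]]
        cell[0] += int(r["success"])
        cell[1] += 1
    return {
        domain: max(
            (k for k, (s, n) in levels.items() if s / n >= threshold),
            default=0,
        )
        for domain, levels in stats.items()
    }


def write_jsonl(profile, stream):
    """Write one JSON object per domain, one per line."""
    for domain in sorted(profile):
        row = {"domain": domain, "autonomy": profile[domain]}
        stream.write(json.dumps(row) + "\n")


# Made-up outcomes for illustration
records = (
    [{"domain": "Management", "complexity": 7, "success": True}] * 4
    + [{"domain": "Management", "complexity": 7, "success": False}]
    + [{"domain": "Legal", "complexity": 1, "success": True}]
    + [{"domain": "Legal", "complexity": 3, "success": False}]
)
profile = domain_profile(records)
buffer = io.StringIO()
write_jsonl(profile, buffer)
print(buffer.getvalue(), end="")
```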

Phase 2: Full Taxonomy Coverage + Tracking

  • Expand task bank to all 23 O*NET domains (where feasible for a digital agent)
  • Add skill-level profiling (41 fine-grained skills)
  • Implement temporal tracking: run periodically, store results, show capability growth over time
  • Add model comparison mode: run same profile across different LLM backends
  • Generate the full "autonomy profile" visualization (radar chart + frontier curves)
  • wandb integration for tracking runs across releases
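
For temporal tracking, one simple option is to store each run's per-domain profile and diff consecutive runs. The sketch below assumes a profile is a plain dict from domain name to autonomy level and treats a domain missing from a run as 0; both are illustrative choices, and the numbers are made up.

```python
def profile_delta(previous, current):
    """Per-domain change in autonomy between two stored profiles.

    Domains missing from either run are treated as 0 (an assumption).
    """
    domains = sorted(set(previous) | set(current))
    return {d: current.get(d, 0) - previous.get(d, 0) for d in domains}


# Made-up profiles from two hypothetical releases
older = {"Computer & Mathematical": 10, "Legal": 1}
newer = {"Computer & Mathematical": 10, "Legal": 3, "Management": 7}
print(profile_delta(older, newer))
# prints {'Computer & Mathematical': 0, 'Legal': 2, 'Management': 7}
```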

Phase 3: Workflow Induction + Complexity Auto-Grading

  • Integrate AI4Work's workflow induction toolkit to auto-decompose agent trajectories
  • Post-hoc complexity measurement: instead of pre-grading tasks, measure actual workflow complexity from agent behavior
  • This enables evaluating ANY task (not just pre-graded ones) by analyzing how the agent solves it
  • Enables comparing agent efficiency: two agents solving the same task may exhibit different workflow complexity
  • Integrate with the new domain skills from #505 (Work-Aligned Capability Expansion) to create domain-specific evaluation suites

Task Bank Design

The hardest part is creating high-quality evaluation tasks across diverse domains. Design principles:

Task Structure

{
  "id": "mgmt-003-L7",
  "domain": "Management",
  "skill": ["Organizing, Planning, and Prioritizing Work", "Making Decisions and Solving Problems"],
  "complexity": 7,
  "instruction": "You have a team of 5 developers. Create a project plan for migrating a monolithic app to microservices over 3 months. Include: task decomposition, dependency graph, resource assignments, risk register, and milestone schedule. Save the plan as a markdown file.",
  "evaluation": {
    "type": "rubric",
    "criteria": [
      {"name": "task_decomposition", "weight": 0.2, "check": "file_contains_sections"},
      {"name": "dependencies", "weight": 0.2, "check": "has_dependency_graph"},
      {"name": "resource_allocation", "weight": 0.2, "check": "assigns_all_developers"},
      {"name": "risk_register", "weight": 0.2, "check": "identifies_risks"},
      {"name": "milestones", "weight": 0.2, "check": "has_dated_milestones"}
    ]
  }
}
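
A rubric like the one above could be reduced to a per-task binary result roughly as follows. This is a sketch under stated assumptions: the check names are looked up in a dict of already-computed booleans (how each check is actually run is left open), and the 0.8 pass mark is a hypothetical default, not something the proposal fixes.

```python
def rubric_score(criteria, check_results):
    """Weighted fraction of rubric criteria whose check passed.

    `check_results` maps a check name (e.g. "identifies_risks") to a bool;
    checks with no recorded result count as failed.
    """
    total = sum(c["weight"] for c in criteria)
    earned = sum(
        c["weight"] for c in criteria if check_results.get(c["check"], False)
    )
    return earned / total if total else 0.0


def is_success(score, pass_mark=0.8):
    """Binary success; the small tolerance guards against float rounding."""
    return score >= pass_mark - 1e-9


criteria = [
    {"name": "task_decomposition", "weight": 0.2, "check": "file_contains_sections"},
    {"name": "dependencies", "weight": 0.2, "check": "has_dependency_graph"},
    {"name": "resource_allocation", "weight": 0.2, "check": "assigns_all_developers"},
    {"name": "risk_register", "weight": 0.2, "check": "identifies_risks"},
    {"name": "milestones", "weight": 0.2, "check": "has_dated_milestones"},
]
# Made-up check outcomes: three of five criteria pass
outcomes = {
    "file_contains_sections": True,
    "has_dependency_graph": True,
    "assigns_all_developers": True,
    "identifies_risks": False,
}
score = rubric_score(criteria, outcomes)
print(round(score, 2), is_success(score))  # prints 0.6 False
```

Normalizing by the total weight keeps the score in [0, 1] even if a rubric's weights do not sum to exactly 1.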

Evaluation Methods by Domain

| Domain | Evaluation Approach |
| --- | --- |
| Computer & Mathematical | Existing: test execution, code compilation |
| Office & Admin | File output validation (document structure, formatting, completeness) |
| Business & Financial | Numerical accuracy + report structure + correct methodology |
| Management | Rubric-based: presence of required components, logical consistency |
| Legal | Factual accuracy (verifiable against public databases), document structure |
| Data Analysis | Output comparison (correct numbers, appropriate visualizations) |

Sourcing Tasks

  1. AI4Work's mapped tasks — 72K tasks already annotated by domain/skill. Filter for tasks that an agent with terminal/web access could realistically attempt.
  2. O*NET task descriptions — 5,806 computer-use task descriptions from O*NET occupations. Adapt into agent-executable instructions.
  3. Hand-crafted — For domains with poor coverage, write tasks based on actual job requirements.
  4. Community submissions — AI4Work's Google Forms for benchmark task submission could be a model for crowdsourcing.

Pros & Cons

Pros

  • First systematic real-world capability measurement — To our knowledge, no other agent framework measures capability across the full spectrum of human work
  • Actionable development signal — Shows exactly which domains/skills need improvement, unlike aggregate benchmark scores
  • Temporal tracking — Measure capability growth over releases, not just point-in-time snapshots
  • Research contribution — AI4Work explicitly calls for agents to submit trajectories, so Hermes profiles could be submitted back to their database
  • Model comparison — Same profile across different LLMs reveals which models are best for which work domains
  • Complements existing benchmarks — TerminalBench2 measures coding depth; this measures breadth across domains
  • Principled task design — Using O*NET's validated occupational taxonomy ensures tasks reflect real work, not synthetic puzzles

Cons / Risks

  • Task bank creation is expensive — Writing 500-1000 high-quality evaluation tasks across 10+ domains is a significant upfront investment
  • Evaluation subjectivity — Unlike code tests, many real-world tasks (management plans, legal analysis) don't have single correct answers. Rubric-based evaluation introduces subjectivity.
  • Cost per full profile — Running 500+ tasks × multiple turns each = thousands of API calls. A full profile run could cost $50-200+.
  • Domain expertise needed — Validating legal/financial/management task quality requires subject matter experts
  • Complexity grading — Pre-grading task complexity is subjective; post-hoc workflow induction (Phase 3) is more rigorous but harder to implement
  • Benchmark gaming — If we optimize for this profile, we might overfit to our own task bank rather than genuinely improving capabilities
  • No license on AI4Work data — The ai4work-resources repository has NO LICENSE file. Using their taxonomy data requires clarifying terms.

Open Questions

  1. Task bank scope for MVP? 50 tasks per domain × 10 domains = 500 tasks (10 per domain at each of 5 complexity levels) is ambitious. Should Phase 1 start smaller (e.g., 20 tasks per domain-level cell × 5 domains × 3 levels = 300)?
  2. LLM-as-judge for evaluation? For subjective domains (management, communication), should we use an LLM judge for rubric evaluation? This adds cost but enables richer assessment.
  3. AI4Work data licensing? The repository has no license. Should we contact the authors about using their taxonomy/mapping data, or build our own from O*NET directly (which is public domain)?
  4. Relationship to TerminalBench2? Should the autonomy profiler subsume TerminalBench2's coding tasks (tagged as Computer & Mathematical domain), or remain separate?
  5. Community benchmarking? Should we publish Hermes's autonomy profile publicly and invite other agents to run the same profile? This could become a community benchmark.
  6. Integration with AI4Work's submission system? They accept new benchmark tasks and agent trajectories via Google Forms. Should we submit Hermes's results back to their database?
