AI4Work (arXiv: 2603.01203) introduces a unified complexity scale and autonomy measurement framework for evaluating AI agents against real-world work. Rather than measuring binary pass/fail on isolated benchmarks, it measures the maximum task complexity an agent can handle reliably across different work domains and skills — what they call the agent's "autonomy frontier."
This issue proposes implementing AI4Work's autonomy profiling framework as an evaluation environment for Hermes Agent, providing a systematic way to measure and track our agent's real-world work capability across the O*NET occupational taxonomy.
Companion issue: #505 (Work-Aligned Capability Expansion: what to build based on AI4Work's gap analysis)
Related: #340 (YC-Bench, a complementary long-horizon benchmark)
Research Findings
The Autonomy Framework
AI4Work defines agent autonomy formally:
Autonomy = max{k | SR(k) >= H}
Where:
k = task complexity level (number of distinct procedural steps/skills required)
SR(k) = success rate at complexity level k
H = target success threshold (e.g., 80%)
In plain terms: the maximum task complexity an agent can handle while maintaining an acceptable success rate. This is a much richer signal than "X% on SWE-bench."
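As a minimal sketch of the definition above, the autonomy value can be computed directly from per-level success rates. The function names, the input shapes, and the sample outcomes below are illustrative assumptions, not AI4Work's code or data. The function follows the formula literally, so if success rates are not monotone in k it still returns the highest level that meets the threshold.

```python
from collections import defaultdict
from typing import Iterable, Mapping, Optional, Tuple


def success_rate_by_level(results: Iterable[Tuple[int, bool]]) -> dict:
    """Aggregate (complexity_level, passed) pairs into SR(k) per level."""
    attempts = defaultdict(int)
    passes = defaultdict(int)
    for level, passed in results:
        attempts[level] += 1
        passes[level] += int(passed)
    return {level: passes[level] / attempts[level] for level in attempts}


def autonomy(success_rates: Mapping[int, float], threshold: float = 0.8) -> Optional[int]:
    """Return max{k | SR(k) >= H}, or None if no level meets the threshold."""
    passing = [k for k, sr in success_rates.items() if sr >= threshold]
    return max(passing) if passing else None


# Made-up success rates, for illustration only.
sr = {1: 0.95, 3: 0.90, 7: 0.82, 10: 0.55, 14: 0.10}
print(autonomy(sr))        # 7
print(autonomy(sr, 0.9))   # 3
print(autonomy({1: 0.5}))  # None
```

Returning None rather than 0 when no level qualifies keeps "no reliable capability" distinct from a real complexity level.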
Complexity Scale Examples
| Level | Example Task |
| --- | --- |
| 1 | "Navigate to the Plus section of Cambridge Dictionary" |
| 3 | "Find the cheapest one-way flight from NYC to London" |
| 7 | "Inspect JSON file structure and update script to use correct field names" |
| 10 | "Set up a CI pipeline with testing, linting, and deployment stages" |
| 14 | "Implement a reinforcement learning algorithm for a multi-agent environment" |
Complexity is measured by workflow induction — decomposing agent trajectories into hierarchical action sequences and counting distinct procedural steps. AI4Work provides a Workflow Induction Toolkit (desktop app for macOS/Windows) and Python scripts (profiling/segment.py, profiling/induce.py) for this analysis.
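To make "counting distinct procedural steps" concrete, here is a toy sketch that treats an induced workflow as a tree and counts the distinct leaf steps. The `Step` class, the counting rule, and the example workflow are all assumptions made for illustration. This is not the algorithm in profiling/segment.py or profiling/induce.py, which may segment and count differently.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One node of a hypothetical induced workflow; leaves are atomic steps."""
    name: str
    children: list = field(default_factory=list)  # list of Step


def leaf_steps(node: Step) -> list:
    """Collect the names of all leaf steps below (or at) this node."""
    if not node.children:
        return [node.name]
    names = []
    for child in node.children:
        names.extend(leaf_steps(child))
    return names


def complexity(node: Step) -> int:
    """Toy complexity score: number of distinct leaf step names."""
    return len(set(leaf_steps(node)))


# Invented decomposition, for illustration only.
workflow = Step("fix field names", [
    Step("inspect data", [Step("open file"), Step("read JSON keys")]),
    Step("update script", [Step("open file"), Step("edit field names"), Step("run script")]),
])
print(complexity(workflow))  # 4 ("open file" is counted once)
```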
Key Performance Findings from the Paper
Outside coding and office tasks, most agents show zero coverage for complexity > 6
Even in software engineering, success rates drop sharply with complexity
OpenHands outperformed SWE-agent; Claude models showed higher autonomy than GPT for medium-complexity tasks
Agents perform best on self-contained "Work Output" tasks but struggle with "Information Input" and "Interacting with Others"
Framework matters: same model achieves different autonomy levels with different agent frameworks
The framework-model interaction is significant — best model ≠ best agent; the agent architecture matters as much as the LLM
The Gap
All existing benchmarks measure domain-specific task completion (did the code work? did the test pass?). None measure:
Cross-domain capability — How does Hermes perform on management vs. legal vs. financial tasks?
Complexity frontier — At what complexity level does Hermes start failing?
Skill-level profiling — Which O*NET work activities can Hermes handle?
Autonomy tracking over time — Is Hermes getting better at real-world work with each release?
Implementation Plan
Skill vs. Tool Classification
This is an Atropos environment (neither a skill nor a tool). It lives under environments/benchmarks/autonomy_profile/ following the pattern of TerminalBench2 and the proposed YC-Bench (#340).
Sourcing Tasks
AI4Work's mapped tasks — 72K tasks already graded by domain/skill. Filter for tasks that an agent with terminal/web access could realistically attempt.
O*NET task descriptions — 5,806 computer-use task descriptions from O*NET occupations. Adapt into agent-executable instructions.
Hand-crafted — For domains with poor coverage, write tasks based on actual job requirements.
Community submissions — AI4Work's Google Forms for benchmark task submission could be a model for crowdsourcing.
Pros & Cons
Pros
First systematic real-world capability measurement — No other agent framework measures capability across the full spectrum of human work
Actionable development signal — Shows exactly which domains/skills need improvement, unlike aggregate benchmark scores
Temporal tracking — Measure capability growth over releases, not just point-in-time snapshots
Research contribution — AI4Work explicitly calls for agents to submit trajectories; Hermes contributing profiles would be a research contribution
Model comparison — Same profile across different LLMs reveals which models are best for which work domains
Complements existing benchmarks — TerminalBench2 measures coding depth; this measures breadth across domains
Principled task design — Using O*NET's validated occupational taxonomy ensures tasks reflect real work, not synthetic puzzles
Cons / Risks
Task bank creation is expensive — Writing 500-1000 high-quality evaluation tasks across 10+ domains is a significant upfront investment
Evaluation subjectivity — Unlike code tests, many real-world tasks (management plans, legal analysis) don't have single correct answers. Rubric-based evaluation introduces subjectivity.
Cost per full profile — Running 500+ tasks × multiple turns each = thousands of API calls. A full profile run could cost $50-200+.
Complexity grading — Pre-grading task complexity is subjective; post-hoc workflow induction (Phase 3) is more rigorous but harder to implement
Benchmark gaming — If we optimize for this profile, we might overfit to our own task bank rather than genuinely improving capabilities
No license on AI4Work data — The ai4work-resources repository has NO LICENSE file. Using their taxonomy data requires clarifying terms.
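The cost-per-profile risk above can be sanity-checked with simple arithmetic. In this sketch the 500-task bank and the $50-200 range come from the list above, while the calls per task and the price per call are placeholder assumptions to be replaced with measured values.

```python
import math


def profile_cost(num_tasks: int, calls_per_task: int, cost_per_call: float) -> float:
    """Estimated cost of one full profile run, in dollars."""
    return num_tasks * calls_per_task * cost_per_call


def implied_cost_per_task(total_cost: float, num_tasks: int) -> float:
    """Per-task budget implied by a total run cost."""
    return total_cost / num_tasks


NUM_TASKS = 500          # from the estimate above
CALLS_PER_TASK = 10      # assumption: average API calls (turns) per task
COST_PER_CALL = 0.02     # assumption: average dollars per API call

estimate = profile_cost(NUM_TASKS, CALLS_PER_TASK, COST_PER_CALL)
low = implied_cost_per_task(50, NUM_TASKS)
high = implied_cost_per_task(200, NUM_TASKS)
print(f"estimated run cost: ${estimate:.2f}")
print(f"implied per-task budget: ${low:.2f} to ${high:.2f}")
```

Under these assumptions the $50-200 range works out to roughly $0.10 to $0.40 per task, which gives a per-task budget to check against once real turn counts are known.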
Open Questions
Task bank scope for MVP? 50 tasks per domain × 10 domains = 500 tasks, spread over 5 complexity levels (10 tasks per domain-level cell), is ambitious. Should Phase 1 start smaller (e.g., 20 tasks × 5 domains × 3 levels = 300)?
LLM-as-judge for evaluation? For subjective domains (management, communication), should we use an LLM judge for rubric evaluation? This adds cost but enables richer assessment.
AI4Work data licensing? The repository has no license. Should we contact the authors about using their taxonomy/mapping data, or build our own from O*NET directly (which is public domain)?
Relationship to TerminalBench2? Should the autonomy profiler subsume TerminalBench2's coding tasks (tagged as Computer & Mathematical domain), or remain separate?
Community benchmarking? Should we publish Hermes's autonomy profile publicly and invite other agents to run the same profile? This could become a community benchmark.
Integration with AI4Work's submission system? They accept new benchmark tasks and agent trajectories via Google Forms. Should we submit Hermes's results back to their database?
References
AI4Work Paper — "How Well Does Agent Development Reflect Real-World Work?" (Wang et al., March 2026)
The AI4Work Data Infrastructure
The companion repository (zorazrw/ai4work-resources) provides:
evaluate/measure_autonomy.py
coverage.py
The mapping of tasks to domains and skills was performed via GPT-5, with 90.9% human-LM agreement for domains and 89.3% for skills.
Current State in Hermes Agent
Existing Evaluation Infrastructure
Architecture
What We'd Need
taxonomy_domain.json and taxonomy_skill.json (or generate our own subset)
autonomy_profile_env.py extending HermesAgentBaseEnv
Phased Rollout
Phase 1: Core Profiling Environment (MVP)
Autonomy computed as max{k | SR(k) >= H}
default.yaml config with task subset and complexity range
Phase 2: Full Taxonomy Coverage + Tracking
Phase 3: Workflow Induction + Complexity Auto-Grading
Task Bank Design
The hardest part is creating high-quality evaluation tasks across diverse domains. Design principles:
Task Structure
{
  "id": "mgmt-003-L7",
  "domain": "Management",
  "skill": [
    "Organizing, Planning, and Prioritizing Work",
    "Making Decisions and Solving Problems"
  ],
  "complexity": 7,
  "instruction": "You have a team of 5 developers. Create a project plan for migrating a monolithic app to microservices over 3 months. Include: task decomposition, dependency graph, resource assignments, risk register, and milestone schedule. Save the plan as a markdown file.",
  "evaluation": {
    "type": "rubric",
    "criteria": [
      {"name": "task_decomposition", "weight": 0.2, "check": "file_contains_sections"},
      {"name": "dependencies", "weight": 0.2, "check": "has_dependency_graph"},
      {"name": "resource_allocation", "weight": 0.2, "check": "assigns_all_developers"},
      {"name": "risk_register", "weight": 0.2, "check": "identifies_risks"},
      {"name": "milestones", "weight": 0.2, "check": "has_dated_milestones"}
    ]
  }
}
Evaluation Methods by Domain
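For rubric-type tasks like the example record above, one plausible scoring rule is the weighted fraction of criteria whose check passed. The sketch below assumes that rule; `rubric_score` and the check outcomes are hypothetical, and the check names are used only as dictionary keys since their implementations are not specified here.

```python
import math

# Criteria copied from the example task record; the weights sum to 1.0.
CRITERIA = [
    {"name": "task_decomposition", "weight": 0.2, "check": "file_contains_sections"},
    {"name": "dependencies", "weight": 0.2, "check": "has_dependency_graph"},
    {"name": "resource_allocation", "weight": 0.2, "check": "assigns_all_developers"},
    {"name": "risk_register", "weight": 0.2, "check": "identifies_risks"},
    {"name": "milestones", "weight": 0.2, "check": "has_dated_milestones"},
]


def rubric_score(criteria: list, check_results: dict) -> float:
    """Weighted fraction of criteria that passed, from 0.0 to 1.0.

    check_results maps a check name to True/False; missing checks count as failed.
    """
    total = sum(c["weight"] for c in criteria)
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    earned = sum(c["weight"] for c in criteria if check_results.get(c["check"], False))
    return earned / total


# Hypothetical outcomes: every check passes except the risk register.
results = {
    "file_contains_sections": True,
    "has_dependency_graph": True,
    "assigns_all_developers": True,
    "identifies_risks": False,
    "has_dated_milestones": True,
}
score = rubric_score(CRITERIA, results)
print(f"rubric score: {score:.2f}")
```

Dividing by the total weight keeps the score in the 0 to 1 range even if a rubric's weights do not sum exactly to 1. A task would count as a success for SR(k) only after applying a pass cutoff to this score, which is a separate design decision.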