AI4Work (arXiv: 2603.01203) by Zora Wang et al. (CMU/Stanford, March 2026) maps 72,342 tasks from 43 agent benchmarks to 1,016 real-world U.S. occupations using O*NET domain and skill taxonomies. The paper reveals a massive misalignment between what AI agents are benchmarked on and where real human work (and economic value) actually lies.
The core finding: agent development is driven by "methodological convenience". Domains with easily specified natural-language instructions and easily verifiable rewards (programming, math) are disproportionately developed, while 92.4% of U.S. employment lies in domains with minimal agent capability coverage. Many of these underrepresented domains are highly digital (70-88% digital work ratio) and highly paid (average wages of $116-121K), so they are both automatable and economically valuable.
This issue proposes using AI4Work's findings to systematically prioritize Hermes Agent's skill development toward high-value, underrepresented work domains — expanding beyond the programming-centric capability set we have today.
Research Findings
The Benchmark-to-Reality Gap
AI4Work uses two parallel taxonomies from O*NET:
23 job families (domain taxonomy) mapped to BLS employment/wage data
41 fine-grained skills (skill taxonomy) organized into 4 categories: Information Input, Mental Processes, Interacting With Others, Work Output
Key statistics:
Computer & Mathematical: 8,622 benchmark tasks, but only 7.6% of U.S. employment
Management: 88% digital work ratio, avg wage $120,935 — only 676 benchmark tasks (1.4%)
Business & Financial Operations: 83% digital — minimal benchmark coverage
Office & Administrative Support: largest employment category — sparse coverage
Architecture & Engineering: 71% digital — only 0.7% of benchmark examples
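As a quick check on how the figures above fit together, the sketch below recomputes two derived numbers using only values quoted in this issue. The variable names are illustrative, and the inputs are taken as reported here rather than re-derived from the paper.

```python
# Figures quoted in the key statistics above (as reported in this issue).
COMPUTER_MATH_TASKS = 8_622         # benchmark tasks, Computer & Mathematical
MANAGEMENT_TASKS = 676              # benchmark tasks, Management
COMPUTER_MATH_EMPLOYMENT_PCT = 7.6  # share of U.S. employment

# Share of employment outside the Computer & Mathematical job family.
outside_employment_pct = round(100 - COMPUTER_MATH_EMPLOYMENT_PCT, 1)

# Number of Computer & Mathematical benchmark tasks per Management task.
task_ratio = COMPUTER_MATH_TASKS / MANAGEMENT_TASKS

print(f"Employment outside Computer & Mathematical: {outside_employment_pct}%")
print(f"Computer & Mathematical vs. Management benchmark tasks: {task_ratio:.1f}x")
```

The first value is numerically equal to the 92.4% employment figure quoted earlier. The second shows that Computer & Mathematical has roughly 13 times as many benchmark tasks as Management.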
Skill-Level Gaps
Benchmarks overfocus on "Getting Information" and "Working with Computers" — skills that account for <7% of human labor. Meanwhile:
"Interacting with Others" skills (Communicating, Coordinating, Negotiating, Coaching, Resolving Conflicts) are pervasive in the labor market but essentially absent from agent benchmarks
"Mental Processes" like Scheduling, Organizing, Making Decisions, and Thinking Creatively are underrepresented despite being core to management and operations work
Even within tested domains, complex multi-skill tasks are rarely evaluated
The Three Principles
AI4Work proposes three principles for better agent development:
❌ No structured communication skills (drafting, summarizing for audiences)
Making Decisions & Solving Problems: ⚠️ Implicit in agent behavior, no structured frameworks
Scheduling Work: ❌ Only cronjobs, no work/resource scheduling
Organizing/Planning Work: ❌ Only todo list, no project planning skills
Resolving Conflicts & Negotiating: ❌ None
Analyzing Data/Information: ⚠️ No dedicated data analysis skill (pandas, SQL, visualization)
Implementation Plan
Skill vs. Tool Classification
All of these are skills, published to the Skills Hub rather than bundled, because:
Each wraps existing tools (terminal, web_extract, execute_code) with domain-specific instructions
No custom Python integration or API key management needed in the harness
They're specialized by domain — not every user needs a legal research skill
They follow the pattern of existing skills (axolotl, google-workspace, etc.)
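The classification rule above can be sketched as a simple check: a proposal fits the skill pattern if it only needs tools the harness already provides. `SkillSpec`, its fields, and the example entries are hypothetical and do not reflect the actual Hermes Agent skill format. The tool names are the ones mentioned in this issue, and treating them as the full set is an assumption.

```python
from dataclasses import dataclass, field

# Tool names mentioned in this issue; treating them as the full set is an assumption.
EXISTING_TOOLS = {"terminal", "web_extract", "execute_code"}


@dataclass
class SkillSpec:
    """Hypothetical description of a Hub skill: instructions layered on existing tools."""

    name: str
    domain: str
    instructions: str
    tools: set[str] = field(default_factory=set)

    def missing_tools(self) -> set[str]:
        """Return requested tools that the harness does not already provide."""
        return self.tools - EXISTING_TOOLS

    def fits_as_skill(self) -> bool:
        """A proposal fits the skill pattern if it needs no new tool integration."""
        return bool(self.tools) and not self.missing_tools()


legal_research = SkillSpec(
    name="legal-research",
    domain="Legal & Compliance",
    instructions="Search public sources, cite each one, and flag uncertainty.",
    tools={"web_extract", "terminal"},
)
needs_new_api = SkillSpec(
    name="court-filing",
    domain="Legal & Compliance",
    instructions="File documents through a dedicated API.",
    tools={"web_extract", "filing_api"},  # "filing_api" is made up for illustration
)

print(legal_research.fits_as_skill())
print(needs_new_api.missing_tools())
```

Under this check, a proposal that needs a custom integration would be flagged as a tool rather than a skill.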
Priority 1: Management & Project Coordination
Rationale: 88% digital, highest economic value, largest gap.
Potential skills:
Project Management Skill — Task decomposition, milestone tracking, resource allocation using tools like todo, file-based project plans, and git-based tracking
Data Analysis & Reporting Skill — pandas/SQL data analysis, chart generation, business report creation (distinct from ML — focused on business intelligence)
These require domain-specific knowledge but are still terminal/web-based
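To make the Data Analysis & Reporting Skill listed above more concrete, here is a minimal sketch of the kind of SQL-based business report such a skill might produce through `execute_code`. It uses Python's standard-library `sqlite3` module instead of pandas so that it runs without extra dependencies. The table, column names, and figures are invented sample data.

```python
import sqlite3

# Invented sample data: (region, quarter, revenue in USD).
SALES = [
    ("North", "Q1", 120_000.0),
    ("North", "Q2", 135_000.0),
    ("South", "Q1", 90_000.0),
    ("South", "Q2", 81_000.0),
]


def revenue_report(rows):
    """Aggregate revenue per region and return (region, total, average) rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (region TEXT, quarter TEXT, revenue REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
        cursor = conn.execute(
            """
            SELECT region, SUM(revenue) AS total, AVG(revenue) AS average
            FROM sales
            GROUP BY region
            ORDER BY total DESC
            """
        )
        return cursor.fetchall()
    finally:
        conn.close()


report = revenue_report(SALES)
for region, total, average in report:
    print(f"{region}: total ${total:,.0f}, average per quarter ${average:,.0f}")
```

The same aggregation could be written with a pandas `groupby`. `sqlite3` is used here only to keep the sketch dependency-free.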
Phase 3: Interpersonal & Complex Skills
Communication drafting with audience adaptation
Coordination workflows for multi-stakeholder projects
These require more sophisticated prompting and potentially multi-agent patterns
Pros & Cons
Pros
Data-driven prioritization — Using rigorous research (72k tasks, 43 benchmarks, O*NET taxonomy) to guide development, not gut feelings
Massive untapped market — 92.4% of employment is outside our current focus area; even partial coverage of high-value domains expands Hermes's utility dramatically
Competitive differentiation — Most agent frameworks focus exclusively on coding/engineering; Hermes covering management, legal, and financial domains would be genuinely unique
Incremental effort — Each domain skill is independent; we can ship one at a time
Aligned with AI4Work's call to action — The paper explicitly calls for agents that cover broader work domains; building these skills helps answer that call
Cons / Risks
Domain expertise required — Building a useful legal research skill requires understanding legal databases, terminology, and workflows. Getting it wrong could be worse than not having it.
Validation difficulty — How do we know a management skill is actually good? Unlike code (which has tests), management/legal/business tasks have subjective quality measures.
Scope creep risk — "Cover all work domains" is an infinite project. Must be disciplined about which skills actually deliver value.
Not our core audience (yet) — Current Hermes users are primarily developers. Management/legal/financial skills target a different user base.
Quality vs. quantity — Better to have 3 excellent domain skills than 10 mediocre ones
Open Questions
Which domain first? Management has the highest economic value, but Data Analysis might be easier to build well and validate. What's the right starting point?
How to validate domain skills? We need subject matter experts to evaluate whether a legal research skill or project management skill actually produces useful outputs.
Bundled vs. Hub? Are any of these domains universal enough to bundle? Data Analysis might be — most users benefit from pandas/SQL workflows.
Should we build corresponding benchmarks? AI4Work's gap analysis suggests we should not only build skills but also create evaluation tasks for underrepresented domains. See companion issue for autonomy profiling.
Current State in Hermes Agent
What We Cover Well (Computer & Mathematical domain)
What We Partially Cover
What We Don't Cover (High-Value Gaps)
Skill-Level Gaps
Priority 2: Legal & Compliance
Rationale: 71% digital, second-highest wages, near-zero benchmark coverage.
Potential skills:
Priority 3: Business & Financial Operations
Rationale: 83% digital, large employment base.
Potential skills:
Priority 4: Interpersonal Skills
Rationale: AI4Work's biggest skill-level finding — "Interacting with Others" is pervasive in work but absent from benchmarks.
These are harder to implement as discrete skills but could manifest as:
Phased Rollout
Phase 1: Low-Hanging Fruit (Management + Data Analysis)
Phase 2: Domain Expertise Skills (Legal + Financial)
Phase 3: Interpersonal & Complex Skills