Feature: Work-Aligned Capability Expansion — Targeting Underrepresented High-Value Domains (inspired by AI4Work) #505

@teknium1

Overview

AI4Work (arXiv: 2603.01203) by Zora Wang et al. (CMU/Stanford, March 2026) maps 72,342 tasks from 43 agent benchmarks to 1,016 real-world U.S. occupations using O*NET domain and skill taxonomies. The paper reveals a massive misalignment between what AI agents are benchmarked on and where real human work (and economic value) actually lies.

The core finding: agent development is driven by "methodological convenience". Domains with easily specified natural-language instructions and easily verifiable rewards (programming, math) are disproportionately developed, while 92.4% of U.S. employment lies in domains with minimal agent capability coverage. Many of these underrepresented domains are highly digital (70-88% digital work ratio) and highly paid (avg $116-121K), meaning they are both automatable and economically valuable.

This issue proposes using AI4Work's findings to systematically prioritize Hermes Agent's skill development toward high-value, underrepresented work domains — expanding beyond the programming-centric capability set we have today.


Research Findings

The Benchmark-to-Reality Gap

AI4Work uses two parallel taxonomies from O*NET:

  • 23 job families (domain taxonomy) mapped to BLS employment/wage data
  • 41 fine-grained skills (skill taxonomy) organized into 4 categories: Information Input, Mental Processes, Interacting With Others, Work Output

Key statistics:

  • Computer & Mathematical: 8,622 benchmark tasks, but only 7.6% of U.S. employment
  • Management: 88% digital work ratio, avg wage $120,935 — only 676 benchmark tasks (1.4%)
  • Legal: 71% digital, avg wage $116,645 — only 71 benchmark tasks (0.3%)
  • Business & Financial Operations: 83% digital — minimal benchmark coverage
  • Office & Administrative Support: largest employment category — sparse coverage
  • Architecture & Engineering: 71% digital — only 0.7% of benchmark examples

Skill-Level Gaps

Benchmarks overfocus on "Getting Information" and "Working with Computers" — skills that account for <7% of human labor. Meanwhile:

  • "Interacting with Others" skills (Communicating, Coordinating, Negotiating, Coaching, Resolving Conflicts) are pervasive in the labor market but essentially absent from agent benchmarks
  • "Mental Processes" like Scheduling, Organizing, Making Decisions, and Thinking Creatively are underrepresented despite being core to management and operations work
  • Even within tested domains, complex multi-skill tasks are rarely evaluated

The Three Principles

AI4Work proposes three principles for better agent development:

  1. Domain & Skill Coverage: Target underrepresented, high-capital domains
  2. Realism & Complexity: Use human-annotated tasks spanning multiple domains/skills, not synthesized templates
  3. Granular Evaluation: Move beyond binary pass/fail to intermediate checkpoints and workflow-level assessment
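
The third principle can be illustrated with a minimal sketch of checkpoint-based scoring. Everything here is an assumption made for illustration: the `Checkpoint` class, the weights, and the example checkpoints are hypothetical and are not taken from AI4Work or from Hermes Agent.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    """One intermediate step of a workflow (hypothetical structure)."""
    name: str
    weight: float
    passed: bool


def workflow_score(checkpoints):
    """Return the weighted fraction of checkpoints passed, from 0.0 to 1.0."""
    total = sum(c.weight for c in checkpoints)
    if total == 0:
        return 0.0
    earned = sum(c.weight for c in checkpoints if c.passed)
    return earned / total


def binary_score(checkpoints):
    """Classic pass/fail: 1.0 only if every checkpoint passed."""
    return 1.0 if all(c.passed for c in checkpoints) else 0.0


# Invented example run: two of three steps succeeded.
run = [
    Checkpoint("located the relevant contract clause", 1.0, True),
    Checkpoint("extracted the termination terms", 2.0, True),
    Checkpoint("drafted the summary for the client", 1.0, False),
]

print(binary_score(run))    # pass/fail view of the run
print(workflow_score(run))  # partial-credit view of the same run
```

Under pass/fail this run counts as a plain failure, while the weighted score still records that most of the workflow was completed. That distinction is the kind of signal granular evaluation is meant to preserve.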

Current State in Hermes Agent

What We Cover Well (Computer & Mathematical domain)

  • GitHub skills: code review, PR workflow, issues, repo management, codebase inspection
  • MLOps skills: axolotl, unsloth, vllm, evaluating-llms-harness, weights-and-biases, modal, lambda-labs, etc. (19 skills)
  • Autonomous coding: claude-code, codex, hermes-agent-spawning
  • Benchmarks: TerminalBench2, TBLite, SWE environments

What We Partially Cover

  • Office & Admin: google-workspace, himalaya (email), notion, obsidian, powerpoint, nano-pdf
  • Arts/Design/Media: excalidraw, heartmula, songsee, youtube-content, gif-search

What We Don't Cover (High-Value Gaps)

Domain | Digital % | Avg Wage | Hermes Skills | Gap Severity
Management | 88% | $120,935 | 0 | 🔴 Critical
Legal | 71% | $116,645 | 0 | 🔴 Critical
Business & Financial Ops | 83% | - | 0 | 🔴 Critical
Architecture & Engineering | 71% | - | 0 (only tangential via diagrams) | 🟡 High
Sales & Related | 22% | - | 0 | 🟡 Medium
Life/Physical/Social Science | 52% | - | 0 (research skills exist but not domain-specific) | 🟡 Medium
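
As a rough sketch of how these rows could feed a prioritization pass, the snippet below sorts the domains by digital work ratio. The figures are the ones listed above; wages this issue does not state are kept as `None` rather than filled in. The function name and tuple layout are hypothetical.

```python
# (domain, digital work ratio in %, average wage in USD or None if not stated)
gaps = [
    ("Management", 88, 120935),
    ("Legal", 71, 116645),
    ("Business & Financial Ops", 83, None),
    ("Architecture & Engineering", 71, None),
    ("Sales & Related", 22, None),
    ("Life/Physical/Social Science", 52, None),
]


def rank_by_digital_share(rows):
    """Sort domains from most to least digital; ties keep their input order."""
    return sorted(rows, key=lambda row: row[1], reverse=True)


for domain, digital, wage in rank_by_digital_share(gaps):
    wage_text = f"${wage:,}" if wage is not None else "wage not stated"
    print(f"{domain}: {digital}% digital, {wage_text}")
```

Sorting on digital share alone places Business & Financial Ops ahead of Legal. The priority order proposed in this issue also weighs wages, which are only stated here for Management and Legal, so a fuller score would need the missing wage figures first.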

Skill-Level Gaps

Skill Category | Status
Coordinating Work | ❌ No workflow/project management skills
Communicating with Others | ❌ No structured communication skills (drafting, summarizing for audiences)
Making Decisions & Solving Problems | ⚠️ Implicit in agent behavior, no structured frameworks
Scheduling Work | ❌ Only cronjobs, no work/resource scheduling
Organizing/Planning Work | ❌ Only todo list, no project planning skills
Resolving Conflicts & Negotiating | ❌ None
Analyzing Data/Information | ⚠️ No dedicated data analysis skill (pandas, SQL, visualization)

Implementation Plan

Skill vs. Tool Classification

These are all skills (Skills Hub, not bundled) because:

  • Each wraps existing tools (terminal, web_extract, execute_code) with domain-specific instructions
  • No custom Python integration or API key management needed in the harness
  • They're specialized by domain — not every user needs a legal research skill
  • They follow the pattern of existing skills (axolotl, google-workspace, etc.)

Priority 1: Management & Project Coordination

Rationale: 88% digital, highest economic value, largest gap.

Potential skills:

  • Project Management Skill — Task decomposition, milestone tracking, resource allocation using tools like todo, file-based project plans, and git-based tracking
  • Meeting/Communication Skill — Agenda creation, meeting notes summarization, follow-up tracking, stakeholder communication drafts
  • Data Analysis & Reporting Skill — pandas/SQL data analysis, chart generation, business report creation (distinct from ML — focused on business intelligence)
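
To make the Data Analysis & Reporting idea concrete, here is a minimal sketch of the kind of snippet such a skill might guide the agent to run through execute_code. It uses the standard-library `sqlite3` module instead of pandas to stay dependency-free. The `orders` table, the sample rows, and the function name are invented for illustration.

```python
import sqlite3


def revenue_by_region(rows):
    """Load (region, amount) rows into an in-memory table and aggregate them."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT region, COUNT(*), SUM(amount) FROM orders "
        "GROUP BY region ORDER BY SUM(amount) DESC"
    ).fetchall()
    conn.close()
    return result


# Invented sample data standing in for a user's export.
sample = [("east", 100.0), ("west", 250.0), ("east", 50.0)]

for region, count, total in revenue_by_region(sample):
    print(f"{region}: {count} orders, {total:.2f} total")
```

A real skill would mostly consist of instructions on when to reach for SQL versus pandas and how to present the result. The code itself stays this small.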

Priority 2: Legal & Compliance

Rationale: 71% digital, second-highest wages, near-zero benchmark coverage.

Potential skills:

  • Legal Research Skill — Case law search, statute lookup, contract analysis using public legal databases (CourtListener, law.cornell.edu, SEC EDGAR)
  • Contract Review Skill — Clause identification, risk flagging, term extraction from legal documents
  • Compliance Checklist Skill — Regulatory requirement tracking for common frameworks (GDPR, SOC2, HIPAA)
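
A deliberately naive sketch of the clause identification and risk flagging step is shown below. The pattern names and regular expressions are hypothetical examples, not a vetted legal taxonomy. A usable Contract Review skill would need expert-reviewed categories and far more robust matching.

```python
import re

# Hypothetical risk categories and keyword patterns, for illustration only.
RISK_PATTERNS = {
    "indemnification": re.compile(r"\bindemnif(y|ies|ication)\b", re.IGNORECASE),
    "auto_renewal": re.compile(r"\bautomatically\s+renew(s|ed|al)?\b", re.IGNORECASE),
    "termination": re.compile(r"\bterminat(e|es|ion)\b", re.IGNORECASE),
}


def flag_clauses(text):
    """Split text on blank lines and return (paragraph_index, label) pairs."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    flags = []
    for index, paragraph in enumerate(paragraphs):
        for label, pattern in RISK_PATTERNS.items():
            if pattern.search(paragraph):
                flags.append((index, label))
    return flags


# Invented contract excerpt.
contract = (
    "1. The Supplier shall indemnify the Customer against third-party claims.\n\n"
    "2. This Agreement automatically renews for successive one-year terms.\n\n"
    "3. Fees are payable within 30 days."
)

for index, label in flag_clauses(contract):
    print(f"paragraph {index}: {label}")
```

Keyword matching like this only surfaces candidates for a human to review. It does not assess risk, which ties into the validation concern raised under Cons / Risks.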

Priority 3: Business & Financial Operations

Rationale: 83% digital, large employment base.

Potential skills:

  • Financial Analysis Skill — Spreadsheet analysis, ratio calculations, budget forecasting using execute_code + pandas
  • Invoice & Expense Skill — Receipt parsing, expense categorization, invoice generation (extends google-workspace)
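
The ratio calculations mentioned for the Financial Analysis skill are simple arithmetic. The sketch below shows three standard ones in plain Python. The function names and sample figures are assumptions for illustration, and a real skill would likely wrap similar logic around a user's spreadsheet via execute_code.

```python
def _require_nonzero(value, name):
    """Guard against division by zero with a readable error."""
    if value == 0:
        raise ValueError(f"{name} must be non-zero")


def current_ratio(current_assets, current_liabilities):
    """Liquidity: current assets divided by current liabilities."""
    _require_nonzero(current_liabilities, "current_liabilities")
    return current_assets / current_liabilities


def gross_margin(revenue, cost_of_goods_sold):
    """Share of revenue left after cost of goods sold."""
    _require_nonzero(revenue, "revenue")
    return (revenue - cost_of_goods_sold) / revenue


def budget_variance(actual, budget):
    """Relative deviation of actual spend from budget (positive = over)."""
    _require_nonzero(budget, "budget")
    return (actual - budget) / budget


# Invented sample figures.
print(f"current ratio: {current_ratio(150, 100):.2f}")
print(f"gross margin: {gross_margin(200, 150):.0%}")
print(f"budget variance: {budget_variance(110, 100):+.0%}")
```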

Priority 4: Interpersonal Skills

Rationale: AI4Work's biggest skill-level finding — "Interacting with Others" is pervasive in work but absent from benchmarks.

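These are harder to implement as discrete skills. The most likely forms are the ones listed under Phase 3 below: communication drafting with audience adaptation, and coordination workflows for multi-stakeholder projects.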

Phased Rollout

Phase 1: Low-Hanging Fruit (Management + Data Analysis)

  • Project Management skill (task planning, milestone tracking)
  • Data Analysis & Reporting skill (pandas, visualization, business reporting)
  • These use existing tools and require no new infrastructure

Phase 2: Domain Expertise Skills (Legal + Financial)

  • Legal Research skill (public database integration)
  • Financial Analysis skill (spreadsheet + calculation workflows)
  • These require domain-specific knowledge but are still terminal/web-based

Phase 3: Interpersonal & Complex Skills

  • Communication drafting with audience adaptation
  • Coordination workflows for multi-stakeholder projects
  • These require more sophisticated prompting and potentially multi-agent patterns

Pros & Cons

Pros

  • Data-driven prioritization — Using rigorous research (72k tasks, 43 benchmarks, O*NET taxonomy) to guide development, not gut feelings
  • Massive untapped market — 92.4% of employment is outside our current focus area; even partial coverage of high-value domains expands Hermes's utility dramatically
  • Competitive differentiation — Most agent frameworks focus exclusively on coding/engineering; Hermes covering management, legal, and financial domains would be genuinely unique
  • Incremental effort — Each domain skill is independent; we can ship one at a time
  • Aligned with AI4Work's call to action — The paper explicitly calls for agents that cover broader work domains; building these skills helps answer that call

Cons / Risks

  • Domain expertise required — Building a useful legal research skill requires understanding legal databases, terminology, and workflows. Getting it wrong could be worse than not having it.
  • Validation difficulty — How do we know a management skill is actually good? Unlike code (which has tests), management/legal/business tasks have subjective quality measures.
  • Scope creep risk — "Cover all work domains" is an infinite project. Must be disciplined about which skills actually deliver value.
  • Not our core audience (yet) — Current Hermes users are primarily developers. Management/legal/financial skills target a different user base.
  • Quality vs. quantity — Better to have 3 excellent domain skills than 10 mediocre ones

Open Questions

  1. Which domain first? Management has the highest economic value, but Data Analysis might be easier to build well and validate. What's the right starting point?
  2. How to validate domain skills? We need subject matter experts to evaluate whether a legal research skill or project management skill actually produces useful outputs.
  3. Bundled vs. Hub? Are any of these domains universal enough to bundle? Data Analysis might be — most users benefit from pandas/SQL workflows.
  4. Should we build corresponding benchmarks? AI4Work's gap analysis suggests we should not only build skills but also create evaluation tasks for underrepresented domains. See companion issue for autonomy profiling.
  5. Relationship to existing issues? Issue #340 (Feature: YC-Bench long-horizon agent benchmark environment) already targets the management/business domain via a benchmark. Should skills and benchmarks be developed together?
