Feature: Work-Aligned Capability Expansion — Targeting Underrepresented High-Value Domains (inspired by AI4Work)

## Overview

[AI4Work](https://zorazrw.github.io/ai4work/) (arXiv: [2603.01203](https://arxiv.org/abs/2603.01203)) by Zora Wang et al. (CMU/Stanford, March 2026) maps 72,342 tasks from 43 agent benchmarks to 1,016 real-world U.S. occupations using O\*NET domain and skill taxonomies. The paper reveals a massive misalignment between what AI agents are benchmarked on and where real human work (and economic value) actually lies.

**The core finding:** Agent development is driven by "methodological convenience" — domains with easily specified NL instructions and easily verifiable rewards (programming, math) are disproportionately developed, while 92.4% of U.S. employment lies in domains with minimal agent capability coverage. Many of these underrepresented domains are **highly digital** (70-88% digital work ratio) and **highly paid** (avg $116-121K), meaning they are both automatable and economically valuable.

This issue proposes using AI4Work's findings to systematically prioritize Hermes Agent's skill development toward high-value, underrepresented work domains — expanding beyond the programming-centric capability set we have today.

---

## Research Findings

### The Benchmark-to-Reality Gap

AI4Work uses two parallel taxonomies from O\*NET:
- **23 job families** (domain taxonomy) mapped to BLS employment/wage data
- **41 fine-grained skills** (skill taxonomy) organized into 4 categories: Information Input, Mental Processes, Interacting With Others, Work Output

Key statistics:
- Computer & Mathematical: 8,622 benchmark tasks, but only 7.6% of U.S. employment
- Management: 88% digital work ratio, avg wage $120,935 — only 676 benchmark tasks (1.4%)
- Legal: 71% digital, avg wage $116,645 — only 71 benchmark tasks (0.3%)
- Business & Financial Operations: 83% digital — minimal benchmark coverage
- Office & Administrative Support: largest employment category — sparse coverage
- Architecture & Engineering: 71% digital — only 0.7% of benchmark examples

### Skill-Level Gaps

Benchmarks overfocus on "Getting Information" and "Working with Computers" — skills that account for <7% of human labor. Meanwhile:
- **"Interacting with Others"** skills (Communicating, Coordinating, Negotiating, Coaching, Resolving Conflicts) are pervasive in the labor market but essentially absent from agent benchmarks
- **"Mental Processes"** like Scheduling, Organizing, Making Decisions, and Thinking Creatively are underrepresented despite being core to management and operations work
- Even within tested domains, complex multi-skill tasks are rarely evaluated

### The Three Principles

AI4Work proposes three principles for better agent development:
1. **Domain & Skill Coverage**: Target underrepresented, high-capital domains
2. **Realism & Complexity**: Use human-annotated tasks spanning multiple domains/skills, not synthesized templates
3. **Granular Evaluation**: Move beyond binary pass/fail to intermediate checkpoints and workflow-level assessment
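
To make the third principle concrete, here is a minimal sketch of how checkpoint-based scoring differs from binary pass/fail. The `Checkpoint` structure, the weights, and the example workflow are illustrative assumptions, not anything defined by AI4Work or Hermes.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    """One intermediate step in a task workflow (hypothetical structure)."""
    name: str
    weight: float
    passed: bool


def binary_score(checkpoints: list[Checkpoint]) -> float:
    """Classic pass/fail: 1.0 only if every checkpoint passed."""
    return 1.0 if all(c.passed for c in checkpoints) else 0.0


def granular_score(checkpoints: list[Checkpoint]) -> float:
    """Weighted fraction of checkpoints passed, in [0, 1]."""
    total = sum(c.weight for c in checkpoints)
    if total == 0:
        return 0.0
    return sum(c.weight for c in checkpoints if c.passed) / total


# Illustrative workflow for a contract-review task; names and weights are made up.
workflow = [
    Checkpoint("locate the contract file", 1.0, True),
    Checkpoint("extract the termination clause", 2.0, True),
    Checkpoint("flag the liability cap", 2.0, False),
]

print(binary_score(workflow), granular_score(workflow))
```

Under binary scoring this run counts as a plain failure, while the granular score still credits the two steps that succeeded, which is the kind of signal workflow-level assessment is meant to preserve.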

---

## Current State in Hermes Agent

### What We Cover Well (Computer & Mathematical domain)
- **GitHub skills**: code review, PR workflow, issues, repo management, codebase inspection
- **MLOps skills**: axolotl, unsloth, vllm, evaluating-llms-harness, weights-and-biases, modal, lambda-labs, etc. (19 skills)
- **Autonomous coding**: claude-code, codex, hermes-agent-spawning
- **Benchmarks**: TerminalBench2, TBLite, SWE environments

### What We Partially Cover
- **Office & Admin**: google-workspace, himalaya (email), notion, obsidian, powerpoint, nano-pdf
- **Arts/Design/Media**: excalidraw, heartmula, songsee, youtube-content, gif-search

### What We Don't Cover (High-Value Gaps)

| Domain | Digital % | Avg Wage | Hermes Skills | Gap Severity |
|--------|-----------|----------|---------------|-------------|
| **Management** | 88% | $120,935 | 0 | 🔴 Critical |
| **Legal** | 71% | $116,645 | 0 | 🔴 Critical |
| **Business & Financial Ops** | 83% | — | 0 | 🔴 Critical |
| **Architecture & Engineering** | 71% | — | 0 (only tangential via diagrams) | 🟡 High |
| **Sales & Related** | 22% | — | 0 | 🟡 Medium |
| **Life/Physical/Social Science** | 52% | — | 0 (research skills exist but not domain-specific) | 🟡 Medium |
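
The table can also be treated as data when comparing domains. The sketch below ranks them by digital work ratio and uses the average wage as a tie-breaker where the table provides one. The ranking heuristic is an assumption made for illustration; it is not a scoring method from AI4Work.

```python
from typing import NamedTuple, Optional


class DomainGap(NamedTuple):
    domain: str
    digital_pct: int              # digital work ratio from the table above
    avg_wage_usd: Optional[int]   # None where the table gives no wage


GAPS = [
    DomainGap("Management", 88, 120_935),
    DomainGap("Legal", 71, 116_645),
    DomainGap("Business & Financial Ops", 83, None),
    DomainGap("Architecture & Engineering", 71, None),
    DomainGap("Sales & Related", 22, None),
    DomainGap("Life/Physical/Social Science", 52, None),
]


def rank_by_digital_share(gaps):
    """Sort by digital ratio, highest first; a known wage breaks ties (assumed heuristic)."""
    return sorted(gaps, key=lambda g: (g.digital_pct, g.avg_wage_usd or 0), reverse=True)


for gap in rank_by_digital_share(GAPS):
    wage = f"${gap.avg_wage_usd:,}" if gap.avg_wage_usd is not None else "n/a"
    print(f"{gap.domain:<30} {gap.digital_pct:>3}%  {wage}")
```

Ranking on digital share alone places Business & Financial Ops ahead of Legal, so any real prioritization would need to decide how much weight wages and employment size should carry.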

### Skill-Level Gaps

| Skill Category | Status |
|---------------|--------|
| **Coordinating Work** | ❌ No workflow/project management skills |
| **Communicating with Others** | ❌ No structured communication skills (drafting, summarizing for audiences) |
| **Making Decisions & Solving Problems** | ⚠️ Implicit in agent behavior, no structured frameworks |
| **Scheduling Work** | ❌ Only cronjobs, no work/resource scheduling |
| **Organizing/Planning Work** | ❌ Only todo list, no project planning skills |
| **Resolving Conflicts & Negotiating** | ❌ None |
| **Analyzing Data/Information** | ⚠️ No dedicated data analysis skill (pandas, SQL, visualization) |

---

## Implementation Plan

### Skill vs. Tool Classification

These are all **skills** (Skills Hub, not bundled) because:
- Each wraps existing tools (terminal, web_extract, execute_code) with domain-specific instructions
- No custom Python integration or API key management needed in the harness
- They're specialized by domain — not every user needs a legal research skill
- They follow the pattern of existing skills (axolotl, google-workspace, etc.)
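
The classification argument can be expressed as a simple check: a proposal qualifies as a skill if it only needs tools the harness already has. The `SkillSpec` class below is a hypothetical descriptor written for this illustration and does not reflect the actual Skills Hub schema; only the three tool names come from this issue.

```python
from dataclasses import dataclass, field

# Tool names mentioned in this issue; the real harness may expose more.
EXISTING_TOOLS = {"terminal", "web_extract", "execute_code"}


@dataclass
class SkillSpec:
    """Hypothetical skill descriptor, not the real Skills Hub format."""
    name: str
    domain: str
    instructions: str
    tools: set[str] = field(default_factory=set)

    def needs_new_infrastructure(self) -> bool:
        """True if the skill depends on any tool outside EXISTING_TOOLS."""
        return not self.tools <= EXISTING_TOOLS


# Example values are placeholders chosen for illustration.
legal_research = SkillSpec(
    name="legal-research",
    domain="Legal",
    instructions="Search public legal databases and cite every source.",
    tools={"web_extract", "terminal"},
)

print(legal_research.needs_new_infrastructure())
```

A proposal for which `needs_new_infrastructure()` returns `True` would be a candidate for a tool rather than a skill under this reasoning.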

### Priority 1: Management & Project Coordination

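**Rationale:** 88% digital and the highest average wage among the domains listed ($120,935), yet Hermes has no skills for it today, which makes it the largest gap between economic value and current coverage.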

Potential skills:
- **Project Management Skill** — Task decomposition, milestone tracking, resource allocation using tools like `todo`, file-based project plans, and git-based tracking
- **Meeting/Communication Skill** — Agenda creation, meeting notes summarization, follow-up tracking, stakeholder communication drafts
- **Data Analysis & Reporting Skill** — pandas/SQL data analysis, chart generation, business report creation (distinct from ML — focused on business intelligence)
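
As a rough illustration of the file-based project plan idea, the sketch below models milestones and tasks, computes progress, and renders a Markdown checklist that could be committed to a repository. All class and function names are assumptions for this example; nothing here describes an existing Hermes skill.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Task:
    title: str
    done: bool = False


@dataclass
class Milestone:
    name: str
    due: date
    tasks: list[Task] = field(default_factory=list)

    def progress(self) -> float:
        """Fraction of tasks completed; an empty milestone counts as 0.0."""
        if not self.tasks:
            return 0.0
        return sum(t.done for t in self.tasks) / len(self.tasks)

    def is_overdue(self, today: date) -> bool:
        """Overdue means past the due date with work still open."""
        return today > self.due and self.progress() < 1.0


def render_plan(milestones: list[Milestone]) -> str:
    """Render milestones as a Markdown checklist suitable for a plan file."""
    lines = []
    for m in milestones:
        lines.append(f"## {m.name} (due {m.due.isoformat()}, {m.progress():.0%} done)")
        lines.extend(f"- [{'x' if t.done else ' '}] {t.title}" for t in m.tasks)
    return "\n".join(lines)


# Placeholder plan used only to exercise the functions above.
plan = [
    Milestone(
        "Draft skill",
        date(2026, 4, 1),
        [Task("Write instructions", True), Task("Add examples")],
    ),
]
print(render_plan(plan))
```

Keeping the plan as plain Markdown means progress can be tracked through ordinary git history, in line with the git-based tracking mentioned above.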

### Priority 2: Legal & Compliance

**Rationale:** 71% digital, the second-highest average wage among the domains listed ($116,645), and only 71 benchmark tasks.

Potential skills:
- **Legal Research Skill** — Case law search, statute lookup, contract analysis using public legal databases (CourtListener, law.cornell.edu, SEC EDGAR)
- **Contract Review Skill** — Clause identification, risk flagging, term extraction from legal documents
- **Compliance Checklist Skill** — Regulatory requirement tracking for common frameworks (GDPR, SOC2, HIPAA)
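
To show what clause identification could look like at its simplest, the sketch below flags sentences by keyword pattern. The patterns, labels, and sample text are illustrative assumptions; a usable Contract Review skill would need rules reviewed by legal experts and would likely rely on the model rather than regular expressions alone.

```python
import re

# Illustrative patterns only; these are not a vetted legal rule set.
CLAUSE_PATTERNS = {
    "termination": re.compile(r"\bterminat(e|es|ion)\b", re.IGNORECASE),
    "indemnification": re.compile(r"\bindemnif(y|ies|ication)\b", re.IGNORECASE),
    "limitation_of_liability": re.compile(r"\blimitation of liability\b", re.IGNORECASE),
    "auto_renewal": re.compile(r"\bautomatically renew(s|ed)?\b", re.IGNORECASE),
}


def flag_clauses(text: str) -> dict[str, list[str]]:
    """Map each clause label to the sentences that mention it."""
    # Naive sentence split on whitespace that follows a period or semicolon.
    sentences = [s.strip() for s in re.split(r"(?<=[.;])\s+", text) if s.strip()]
    hits: dict[str, list[str]] = {}
    for label, pattern in CLAUSE_PATTERNS.items():
        matched = [s for s in sentences if pattern.search(s)]
        if matched:
            hits[label] = matched
    return hits


# Made-up contract text for demonstration.
sample = (
    "This Agreement shall automatically renew for successive one-year terms. "
    "Either party may terminate this Agreement with 30 days written notice. "
    "Supplier shall indemnify Customer against third-party claims."
)

for label, sentences in flag_clauses(sample).items():
    print(label, "->", sentences)
```

Returning the matching sentences rather than a bare yes/no keeps each flag traceable to the contract text, which matters for the risk-flagging use case.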

### Priority 3: Business & Financial Operations

**Rationale:** 83% digital, large employment base.

Potential skills:
- **Financial Analysis Skill** — Spreadsheet analysis, ratio calculations, budget forecasting using execute_code + pandas
- **Invoice & Expense Skill** — Receipt parsing, expense categorization, invoice generation (extends google-workspace)
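
The ratio and forecasting workflows could start from helpers as small as the ones sketched below. This version uses plain Python to stay dependency-free, although pandas (as proposed above) would be the natural fit for spreadsheet-shaped input. The function names and the forecasting baseline are assumptions chosen for illustration.

```python
def current_ratio(current_assets: float, current_liabilities: float) -> float:
    """Liquidity: current assets divided by current liabilities."""
    if current_liabilities == 0:
        raise ValueError("current_liabilities must be non-zero")
    return current_assets / current_liabilities


def gross_margin(revenue: float, cost_of_goods_sold: float) -> float:
    """Share of revenue left after cost of goods sold."""
    if revenue == 0:
        raise ValueError("revenue must be non-zero")
    return (revenue - cost_of_goods_sold) / revenue


def naive_forecast(history: list[float], periods: int) -> list[float]:
    """Extend the series using the mean period-over-period change.

    This is a deliberately simple baseline, not a recommended budgeting model.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical points")
    step = (history[-1] - history[0]) / (len(history) - 1)
    return [history[-1] + step * (i + 1) for i in range(periods)]


# Made-up figures for demonstration.
print(current_ratio(200.0, 100.0))
print(gross_margin(500.0, 300.0))
print(naive_forecast([100.0, 110.0, 120.0], 2))
```

Explicit errors on zero denominators are intentional: a silent `inf` or `nan` in a financial report is harder to spot than a failed run.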

### Priority 4: Interpersonal Skills

**Rationale:** AI4Work's biggest skill-level finding — "Interacting with Others" is pervasive in work but absent from benchmarks.

These are harder to implement as discrete skills but could manifest as:
- Enhanced delegation prompts that model coordination patterns (#375 Inception Prompting)
- Multi-agent debate/negotiation modes (#376 Adversarial Debate)
- Communication style adaptation (formal/informal, technical/non-technical audience)

### Phased Rollout

**Phase 1: Low-Hanging Fruit (Management + Data Analysis)**
- Project Management skill (task planning, milestone tracking)
- Data Analysis & Reporting skill (pandas, visualization, business reporting)
- These use existing tools and require no new infrastructure

**Phase 2: Domain Expertise Skills (Legal + Financial)**
- Legal Research skill (public database integration)
- Financial Analysis skill (spreadsheet + calculation workflows)
- These require domain-specific knowledge but are still terminal/web-based

**Phase 3: Interpersonal & Complex Skills**
- Communication drafting with audience adaptation
- Coordination workflows for multi-stakeholder projects
- These require more sophisticated prompting and potentially multi-agent patterns

---

## Pros & Cons

### Pros
- **Data-driven prioritization** — Using rigorous research (72k tasks, 43 benchmarks, O\*NET taxonomy) to guide development, not gut feelings
- **Massive untapped market** — 92.4% of employment is outside our current focus area; even partial coverage of high-value domains expands Hermes's utility dramatically
- **Competitive differentiation** — Most agent frameworks focus exclusively on coding/engineering; Hermes covering management, legal, and financial domains would be genuinely unique
- **Incremental effort** — Each domain skill is independent; we can ship one at a time
- **Aligned with AI4Work's call to action** — The paper explicitly calls for agents that cover broader work domains; building these skills helps answer that call

### Cons / Risks
- **Domain expertise required** — Building a useful legal research skill requires understanding legal databases, terminology, and workflows. Getting it wrong could be worse than not having it.
- **Validation difficulty** — How do we know a management skill is actually good? Unlike code (which has tests), management/legal/business tasks have subjective quality measures.
- **Scope creep risk** — "Cover all work domains" is an infinite project. Must be disciplined about which skills actually deliver value.
- **Not our core audience (yet)** — Current Hermes users are primarily developers. Management/legal/financial skills target a different user base.
- **Quality vs. quantity** — Better to have 3 excellent domain skills than 10 mediocre ones

---

## Open Questions

1. **Which domain first?** Management has the highest economic value, but Data Analysis might be easier to build well and validate. What's the right starting point?
2. **How to validate domain skills?** We need subject matter experts to evaluate whether a legal research skill or project management skill actually produces useful outputs.
3. **Bundled vs. Hub?** Are any of these domains universal enough to bundle? Data Analysis might be — most users benefit from pandas/SQL workflows.
4. **Should we build corresponding benchmarks?** AI4Work's gap analysis suggests we should not only build skills but also create evaluation tasks for underrepresented domains. See companion issue for autonomy profiling.
5. **Relationship to existing issues?** #340 (YC-Bench) already targets the management/business domain via a benchmark. Should skills and benchmarks be developed together?

---

## References

- [AI4Work Paper](https://arxiv.org/abs/2603.01203) — "How Well Does Agent Development Reflect Real-World Work?" (Wang et al., March 2026)
- [AI4Work Resources](https://github.com/zorazrw/ai4work-resources) — Companion data repository (43 benchmarks, 72K mapped tasks, O\*NET taxonomy)
- [AI4Work Explorer](https://zorazrw.github.io/ai4work/) — Interactive benchmark-to-occupation mapping database
- [O\*NET Online](https://www.onetonline.org/) — U.S. occupational taxonomy
- [BLS Employment Data](https://www.bls.gov/oes/) — Bureau of Labor Statistics wage/employment data
- #340 — YC-Bench (business strategy benchmark, related domain)
- #344 — Multi-Agent Architecture (relevant for coordination/interpersonal skills)
- #375 — Inception Prompting (relevant for delegation/communication quality)
- #376 — Adversarial Debate (relevant for negotiation/conflict resolution skills)

