CoEvoSkills: Self-Evolving Agent Skills

01 / Abstract

The Skill Gap in Agentic LLMs

Anthropic proposes the concept of skills for LLM agents to tackle multi-step professional tasks that simple tool invocations cannot address. A tool is a single, self-contained function, whereas a skill is a structured bundle of interdependent multi-file artifacts. Currently, skill generation is label-intensive and suffers from human–machine cognitive misalignment, which degrades agent performance.

We propose CoEvoSkills, a self-evolving framework that enables agents to autonomously construct complex, multi-file skill packages. It couples a Skill Generator that iteratively refines skills with a Surrogate Verifier that co-evolves to provide informative and actionable feedback without access to ground-truth test content. On SkillsBench, CoEvoSkills achieves the highest pass rate among five baselines on both Claude Code and Codex, and also exhibits strong generalization to six additional LLMs.

baselines beaten

agent backends

transfer LLMs

domains

02 / Motivation

What is a Skill?

A tool is a single function. A skill is an orchestrated package of instructions, scripts, and references for long-horizon professional tasks.

**Figure 1.** Tool vs. Skill. A *tool* is a single self-contained function; a *skill* is a structured, multi-file package with instructions, scripts, and assets.
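The distinction can be made concrete with a small Python sketch. The `Skill` dataclass, its field names, and the `SKILL.md` / `scripts/` / `references/` layout are illustrative assumptions, not the package format defined by CoEvoSkills.

```python
from dataclasses import dataclass, field
from typing import Dict


# A tool: one self-contained function the agent can call directly.
def word_count(text: str) -> int:
    return len(text.split())


# A skill (hypothetical layout): instructions plus the scripts and
# reference files those instructions depend on.
@dataclass
class Skill:
    name: str
    instructions: str                                          # top-level guidance
    scripts: Dict[str, str] = field(default_factory=dict)      # file name -> source
    references: Dict[str, str] = field(default_factory=dict)   # file name -> text

    def files(self) -> Dict[str, str]:
        """Flatten the package into a relative-path -> content mapping."""
        out = {"SKILL.md": self.instructions}
        out.update({f"scripts/{p}": s for p, s in self.scripts.items()})
        out.update({f"references/{p}": r for p, r in self.references.items()})
        return out


report_skill = Skill(
    name="quarterly-report",
    instructions="1. Run scripts/aggregate.py. 2. Follow references/style.md.",
    scripts={"aggregate.py": "print('aggregating')"},
    references={"style.md": "Use plain language."},
)
```

The point of the sketch is the interdependence: the instructions refer to the script and the reference file by path, so the files are only useful as a bundle, whereas `word_count` stands alone.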

03 / Method

The Co-Evolution Loop

Two components evolve together through iterative generate-verify-refine cycles.

🛠️

Skill Generator

Iteratively produces and refines structured multi-file skill bundles from the verifier's dense diagnostic feedback.

🧪

Surrogate Verifier

Information-isolated; independently evolves test assertions to provide actionable failure signals without ground-truth leakage.

🔒

Opaque Oracle

Returns only a pass/fail signal, triggering test escalation while preserving strict information isolation.

**Figure 2.** Overview of the CoEvoSkills co-evolutionary framework. The Skill Generator and Surrogate Verifier co-evolve via iterative refinement; a ground-truth oracle returns only an opaque pass/fail signal, triggering test escalation and preserving information isolation.
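The control flow described above can be sketched as a toy Python loop. This is a simplified illustration under stated assumptions, not the authors' implementation: a skill is modeled as a set of capability names, the generator and verifier are plain functions and classes rather than LLM agents, and every name here (`SurrogateVerifier`, `refine`, `co_evolve`, `make_oracle`) is hypothetical.

```python
from typing import Callable, List, Optional, Set

Skill = Set[str]  # toy stand-in: the capabilities a skill implements


def make_oracle(required: Set[str]) -> Callable[[Skill], bool]:
    """Opaque oracle: keeps `required` hidden and exposes only pass/fail."""
    def oracle(skill: Skill) -> bool:
        return required <= skill
    return oracle


class SurrogateVerifier:
    """Maintains its own checks; never sees the oracle's test content."""

    def __init__(self, candidate_checks: List[str]) -> None:
        # Assumption: candidate checks come from the task description only.
        self._pending = list(candidate_checks)
        self.active: List[str] = []
        self.escalate()  # start with a single check

    def escalate(self) -> bool:
        """Strengthen the test suite; return False if nothing is left to add."""
        if not self._pending:
            return False
        self.active.append(self._pending.pop(0))
        return True

    def diagnose(self, skill: Skill) -> List[str]:
        """Dense feedback: the name of every active check that fails."""
        return [check for check in self.active if check not in skill]


def refine(skill: Skill, failures: List[str]) -> Skill:
    """Toy generator step: patch the skill to address each reported failure."""
    return skill | set(failures)


def co_evolve(verifier: SurrogateVerifier,
              oracle: Callable[[Skill], bool],
              max_rounds: int = 10) -> Optional[Skill]:
    skill: Skill = set()
    for _ in range(max_rounds):
        failures = verifier.diagnose(skill)
        if failures:                  # surrogate tests fail: refine the skill
            skill = refine(skill, failures)
            continue
        if oracle(skill):             # surrogate and oracle both pass: done
            return skill
        if not verifier.escalate():   # oracle failed: surrogate tests were too weak
            return None
    return None
```

Note that the only value crossing from the oracle into the loop is a boolean. A failing oracle verdict tells the verifier that its current checks are insufficient, but not which hidden test failed, which is how the sketch mirrors the information isolation described above.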

04 / Highlights

Why It Matters

🧩

First of its kind

First framework to produce structured, executable, multi-file skill packages via self-evolution.

🚫

No GT supervision

Dense diagnostic feedback without test-content leakage during co-evolution.

🏆

SOTA on SkillsBench

Highest pass rate among five baselines on Claude Code and Codex.

🌐

Cross-model transfer

Generated skills transfer effectively to six additional LLMs without retraining.

05 / Results

Experiments on SkillsBench

Main Results

Cross-Model Transferability

Per-Domain Breakdown

Evolution Trajectory

06 / Citation

Cite This Work

If you find CoEvoSkills useful, please consider citing:

@article{zhang2026coevoskills,
  title   = {CoEvoSkills: Self-Evolving Agent Skills via
             Co-Evolutionary Verification},
  author  = {Zhang, Hanrong and Fan, Shicheng and Zou, Henry Peng and
             Chen, Yankai and Wang, Zhenting and Zhou, Jiayu and
             Li, Chengze and Huang, Wei-Chieh and Yao, Yifei and
             Zheng, Kening and Liu, Xue and Li, Xiaoxiao and
             Yu, Philip S.},
  journal = {arXiv preprint arXiv:2604.01687},
  year    = {2026}
}