Feature: Evolutionary Self-Improvement — Auto-Evolving Skills & Prompts via LLM-Driven Search

## Overview

Imbue's [Darwinian Evolver](https://imbue.com/research/2026-02-27-darwinian-evolver/) demonstrates that LLM-driven evolutionary optimization can achieve **2-3x performance improvements** over base model capabilities by maintaining populations of solutions, applying targeted mutations, and selecting based on fitness. Their [ARC-AGI-2 results](https://imbue.com/research/2026-02-27-arc-agi-2-evolution/) (95.1% with Gemini 3.1 Pro) validate this approach for code and prompt optimization.

This feature proposes bringing evolutionary self-improvement **natively into Hermes Agent**: using the same evolutionary patterns (population management, fitness-weighted selection, LLM-driven mutation, post-mutation verification) to automatically improve Hermes Agent's own skills, system prompts, and tool-use patterns based on real usage data. Unlike the companion skill issue (#336), which wraps the external evolver CLI, this feature implements the evolutionary patterns directly in the Hermes codebase (MIT-licensed) and connects them to existing infrastructure (batch_runner, trajectories, RL environments).

**Research source:** [LLM-based Evolution as a Universal Optimizer](https://imbue.com/research/2026-02-27-darwinian-evolver/) and [Beating ARC-AGI-2 with Code Evolution](https://imbue.com/research/2026-02-27-arc-agi-2-evolution/)

---

## Research Findings

### The Evolutionary Pattern (What Makes It Work)

The core insight from Imbue's research is that evolution is remarkably robust: it works even when the mutator produces improvements only 20% of the time. The key components:

1. **Population + Weighted Selection** — Sigmoid-scaled fitness × novelty bonus. Dynamic midpoint (Nth percentile) keeps selection pressure meaningful as population improves. Novelty bonus prevents over-exploiting a single branch.

2. **Failure-Driven Mutation** — Mutations are targeted at specific failure cases, not random. The LLM sees concrete examples of what went wrong and proposes fixes. This is dramatically more effective than random perturbation.

3. **Learning Logs** — A history of "what was tried → what happened" provided as context to the mutator. Strategies: `ancestors` (direct lineage) or `neighborhood-N` (siblings/cousins). This prevents re-trying failed approaches.

4. **Post-Mutation Verification** — Test the mutation *only* on the failure cases it targeted before running full evaluation. Imbue reports >10x cost reduction from this filter alone.

5. **Crossover** — 25% of mutations combine logic from multiple parents, enabling recombination of independently discovered improvements.
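
As a rough illustration of the weighted selection in item 1, the sketch below combines a sigmoid-scaled fitness with a novelty bonus. The `Organism` class, the 75th-percentile midpoint, the `sharpness` constant, and the `1 / (1 + num_children)` form of the novelty bonus are all assumptions made for this example; the research summary above only states that selection uses sigmoid-scaled fitness times a novelty bonus with a dynamic percentile midpoint.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Organism:
    """One variant in the population (hypothetical structure)."""
    name: str
    fitness: float      # evaluation score, assumed to lie in [0, 1]
    num_children: int   # how often this variant was already used as a parent


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in [0, 100])."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def selection_weights(population, midpoint_pct=75.0, sharpness=10.0):
    """Return sigmoid-scaled fitness times a novelty bonus per organism."""
    # The midpoint moves with the population, so pressure stays meaningful
    # as overall fitness improves.
    midpoint = percentile([o.fitness for o in population], midpoint_pct)
    weights = []
    for o in population:
        scaled = 1.0 / (1.0 + math.exp(-sharpness * (o.fitness - midpoint)))
        novelty = 1.0 / (1.0 + o.num_children)  # assumed form of the bonus
        weights.append(scaled * novelty)
    return weights


def select_parents(population, k, rng=random):
    """Sample k parents (with replacement) in proportion to their weights."""
    return rng.choices(population, weights=selection_weights(population), k=k)
```

With this shape, a high-fitness variant that has already produced many children gradually loses weight to equally fit but less explored variants, which is the over-exploitation guard the novelty bonus is meant to provide.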

### Why This Matters for Hermes Agent

Hermes Agent's performance is determined by three layers:
- **Model capability** — Improved via RL training (Tinker-Atropos, already exists)
- **Instructions** — Skills, system prompts, tool descriptions. Currently hand-authored and static.
- **Tool quality** — Code quality of tool implementations. Currently hand-authored.

The **instructions layer** is the sweet spot for evolutionary optimization because:
- Instructions are text/code that LLMs can meaningfully mutate
- Changes can be evaluated against real tasks via batch_runner
- The search space is large enough that manual optimization misses good solutions
- Results are immediately deployable (update the skill file → improved performance)

### Key Design Decisions from the Evolver

Decisions we should adopt:
- **Separate train/score datasets** — Prevents overfitting to specific test cases
- **Atomic population updates** — No mid-iteration visibility prevents self-referential loops
- **Pickle/JSON snapshots** — Enables pause/resume of long optimization runs
- **Concurrent evaluation** — Parallel batch evaluation for throughput
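
A minimal sketch of the snapshot idea, using JSON only: the state is written to a temporary file and then renamed, so an interrupted run never leaves a half-written snapshot behind. The function names, the file layout, and the `{"iteration", "population"}` schema are hypothetical and not taken from the evolver or from Hermes Agent.

```python
import json
import os
import tempfile
from pathlib import Path


def save_snapshot(path, iteration, population):
    """Write the evolution state to `path` via a temp file plus rename."""
    path = Path(path)
    state = {"iteration": iteration, "population": population}
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        json.dump(state, tmp, indent=2)
    os.replace(tmp_name, path)  # atomic when both paths share a filesystem


def load_snapshot(path):
    """Return (iteration, population), or (0, []) if no snapshot exists."""
    path = Path(path)
    if not path.exists():
        return 0, []
    state = json.loads(path.read_text(encoding="utf-8"))
    return state["iteration"], state["population"]
```

A resumed run would call `load_snapshot` first and continue from the returned iteration instead of starting over.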

Decisions to adapt:
- **Git-based tracking** — Use git commits to track skill evolution (the evolver has `GitBasedOrganism` for exactly this)
- **Provider routing** — Use Hermes Agent's existing OpenRouter routing instead of the evolver's direct API calls
- **Evaluation signal** — Use batch_runner trajectories and compute_reward instead of custom evaluators

---

## Current State in Hermes Agent

### Existing Infrastructure (Integration Points)

| Component | File | Relevance |
|:---|:---|:---|
| **Batch Runner** | `batch_runner.py` | Runs agent on multiple prompts in parallel, saves trajectories. Natural evaluation harness. |
| **Trajectory Saving** | `agent/trajectory.py` | Records conversations in ShareGPT format. Provides raw data for scoring. |
| **RL Environments** | `environments/hermes_base_env.py` | `compute_reward()` methods already score agent rollouts. Reusable as fitness functions. |
| **Skill System** | `tools/skill_tools.py`, `skills/` | Skills are SKILL.md files — text that can be evolved. Git-tracked. |
| **System Prompts** | `agent/prompt_builder.py` | Assembles system prompts. Components could be evolved. |
| **Session Database** | `hermes_state.py` | FTS5 search over past conversations. Source of real usage patterns. |

### What's Missing

- No population management for skill/prompt variants
- No fitness scoring of skills (which skill version produces better outcomes?)
- No mutation loop (propose skill changes → test → select)
- No mechanism to A/B test skill variants

---

## Implementation Plan

### Skill vs. Tool Classification

This should be a **core codebase feature** (not a skill, not a tool) because:
- It requires deep integration with batch_runner, trajectory saving, and the skill system
- It needs persistent state management (populations, evaluation history, snapshots)
- It involves automated code changes to skill files that must be carefully controlled
- It is part of the agent's self-improvement loop, not a user-facing capability

### What We'd Need

1. **Population manager** — Track multiple variants of a skill/prompt, their scores, and lineage
2. **Fitness evaluator** — Run skill variants through batch_runner on test tasks, score results
3. **Skill mutator** — LLM-powered mutation of SKILL.md files based on failure analysis
4. **Evolution loop** — Orchestrate selection → mutation → evaluation → integration
5. **Git integration** — Track all skill variants as branches/commits for traceability
6. **CLI commands** — `hermes evolve-skill <skill-name>` to trigger optimization
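
To make the first four components concrete, here is a deliberately small sketch of a population with lineage tracking and a greedy evolution loop. All names (`Variant`, `Population`, `evolve`) are hypothetical, and `mutate` and `evaluate` are caller-supplied callbacks standing in for the LLM mutator and the batch_runner-based evaluator. Selection here is simply "take the best so far", which is a simplification of the weighted selection described under Research Findings.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Variant:
    """One version of a skill's text plus its score and parent link."""
    variant_id: int
    text: str
    parent_id: Optional[int] = None
    score: Optional[float] = None


@dataclass
class Population:
    variants: List[Variant] = field(default_factory=list)

    def add(self, text, parent_id=None):
        variant = Variant(len(self.variants), text, parent_id)
        self.variants.append(variant)
        return variant

    def best(self):
        scored = [v for v in self.variants if v.score is not None]
        return max(scored, key=lambda v: v.score)

    def lineage(self, variant_id):
        """Follow parent links back to the seed; ids are newest first."""
        chain = []
        current = variant_id
        while current is not None:
            chain.append(current)
            current = self.variants[current].parent_id
        return chain


def evolve(seed_text, mutate, evaluate, iterations):
    """Greedy loop: pick the best variant, mutate it, score the child."""
    population = Population()
    seed = population.add(seed_text)
    seed.score = evaluate(seed.text)
    for _ in range(iterations):
        parent = population.best()
        child = population.add(mutate(parent.text), parent_id=parent.variant_id)
        child.score = evaluate(child.text)
    return population
```

Keeping every variant, including the ones that scored worse, is what makes lineage queries and later learning-log construction possible.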

### Phased Rollout

**Phase 1: Skill-level evolution with manual evaluation**
- Implement population management for SKILL.md variants (store in `~/.hermes/evolution/`)
- LLM-based mutation of skills given failure case descriptions
- Manual fitness scoring (user provides "this version worked better")
- Git-tracked lineage of skill versions
- CLI: `hermes evolve-skill <skill-name> --failure "it did X wrong"`
- Deliverable: User can iteratively improve a skill through guided evolution
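
One possible shape for the failure-driven mutation step is a prompt builder that puts the current skill text, the reported failures, and (optionally) a learning log of earlier attempts in front of the mutator model. The function name, the prompt wording, and the `(change, outcome)` log format are assumptions for illustration; the actual model call is intentionally left out.

```python
def build_mutation_prompt(skill_text, failures, learning_log=()):
    """Assemble the text sent to the mutator LLM (model call not shown)."""
    lines = [
        "You are improving an agent skill file.",
        "",
        "Current SKILL.md:",
        skill_text.strip(),
        "",
        "Observed failures to fix:",
    ]
    lines += [f"- {failure}" for failure in failures]
    if learning_log:
        # Earlier attempts help the mutator avoid repeating failed ideas.
        lines += ["", "Previous attempts and their outcomes:"]
        lines += [f"- {change} -> {outcome}" for change, outcome in learning_log]
    lines += ["", "Return the complete revised SKILL.md and nothing else."]
    return "\n".join(lines)
```

In Phase 1 the `failures` list would come straight from the `--failure` argument; the learning log can stay empty until automated evaluation exists.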

**Phase 2: Automated evaluation via batch_runner**
- Connect evolution loop to batch_runner for automated fitness scoring
- Define evaluation datasets per skill (set of tasks + expected outcomes)
- Implement post-mutation verification (quick check before full eval)
- Learning log integration (track what mutations worked/failed)
- CLI: `hermes evolve-skill <skill-name> --dataset eval_tasks.jsonl --iterations 10`
- Deliverable: Fully automated skill optimization pipeline
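
The post-mutation verification step could look roughly like the following: a candidate is first run only on the failures it was meant to fix, and the full evaluation is skipped unless it fixes at least `min_fixed` of them. `run_task` is a hypothetical callback (in practice it would wrap a batch_runner run plus scoring), and the pass/fail fitness is a simplification chosen for this sketch.

```python
def verify_then_evaluate(candidate, targeted_failures, full_dataset, run_task,
                         min_fixed=1):
    """Cheap targeted check first; only survivors get a full evaluation.

    run_task(candidate, task) -> bool reports whether the candidate handled
    the task. Returns (fitness or None, number of task runs spent).
    """
    if not full_dataset:
        raise ValueError("full_dataset must not be empty")

    runs = 0
    fixed = 0
    for task in targeted_failures:
        runs += 1
        if run_task(candidate, task):
            fixed += 1
    if fixed < min_fixed:
        return None, runs  # rejected early: targeted failures are not fixed

    passed = 0
    for task in full_dataset:
        runs += 1
        if run_task(candidate, task):
            passed += 1
    return passed / len(full_dataset), runs
```

The second return value makes the saving visible: a rejected candidate costs only `len(targeted_failures)` runs instead of a full pass over the dataset.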

**Phase 3: Continuous improvement and prompt evolution**
- Extend beyond skills to system prompt components
- Use real session data (from hermes_state.py) as evaluation signal
- Implement crossover mutations between skill variants
- A/B testing framework for production skill variants
- Integration with Tinker-Atropos reward signals
- Deliverable: Self-improving agent system that gets better with use

---

## Pros & Cons

### Pros
- **High leverage** — Improving instructions is cheaper and faster than retraining models
- **Proven approach** — Imbue's results (2-3x improvement) validate the pattern
- **Builds on existing infra** — batch_runner, trajectories, RL environments already exist
- **Compounding returns** — Better skills → better agent → better trajectory data → better evolution
- **Transparent** — All evolution is git-tracked and human-readable (it's just text changes)
- **No license issues** — Native implementation under MIT (doesn't depend on AGPL evolver)

### Cons / Risks
- **Evaluation is hard** — Defining "good" for complex agent tasks is subjective. Bad evaluation → bad evolution. This is the critical challenge.
- **Cost** — Each evaluation run consumes API credits. 10 iterations × 5 parents × 50 eval tasks = 2,500 API calls per skill optimization.
- **Complexity** — Adds a meta-optimization layer to the codebase. Must be optional and well-contained.
- **Regression risk** — An evolved skill could perform better on the eval set but worse on real tasks (overfitting). Needs holdout sets and manual approval.
- **Scope creep** — "Self-improving agent" is an open-ended research problem. Phase 1 must be narrowly scoped.
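
The arithmetic in the Cost bullet can be made explicit, along with the effect of a verification filter. The helper below is a back-of-the-envelope estimator, not a measurement: it counts agent runs (each of which may involve more than one model call), and the 3 verification tasks and 20% survival rate in the usage example are illustrative assumptions.

```python
def estimate_agent_runs(iterations, parents_per_iteration, eval_tasks,
                        verification_tasks=0, survival_rate=1.0):
    """Estimate agent runs for one skill optimization.

    Every mutation is first run on `verification_tasks` targeted failures;
    only the fraction `survival_rate` of mutations goes on to the full
    `eval_tasks` evaluation. The defaults reproduce the unfiltered estimate.
    """
    mutations = iterations * parents_per_iteration
    verification_runs = mutations * verification_tasks
    full_eval_runs = mutations * survival_rate * eval_tasks
    return round(verification_runs + full_eval_runs)


# Unfiltered: 10 iterations x 5 parents x 50 eval tasks.
baseline = estimate_agent_runs(10, 5, 50)

# With a 3-task verification filter that lets 20% of mutations through.
filtered = estimate_agent_runs(10, 5, 50, verification_tasks=3, survival_rate=0.2)
```

Under these assumed numbers, `baseline` is 2,500 runs and `filtered` is 650 runs; the real saving depends on how often mutations actually pass verification.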

---

## Open Questions

- What evaluation datasets should we ship with for initial skill optimization? (e.g., test tasks for the arxiv skill, github-code-review skill)
- Should evolved skills require human approval before deployment, or can we trust automated evaluation?
- How do we prevent overfitting — evolved skills that ace the eval set but fail on novel tasks?
- Should this integrate with Tinker-Atropos (RL for weights) or remain independent (evolution for instructions)?
- What's the right population size and iteration count for practical skill optimization? (Imbue used 2 parents/iteration, 16 iterations for ARC-AGI-2)

---

## References

- [LLM-based Evolution as a Universal Optimizer](https://imbue.com/research/2026-02-27-darwinian-evolver/) — Imbue blog post
- [Beating ARC-AGI-2 with Code Evolution](https://imbue.com/research/2026-02-27-arc-agi-2-evolution/) — Applied example
- [imbue-ai/darwinian_evolver](https://github.com/imbue-ai/darwinian_evolver/) — Reference implementation (AGPL v3 — study, don't import)
- [Darwin Gödel Machines](https://arxiv.org/abs/2505.22954) — Open-ended self-improvement theory
- [AlphaEvolve](https://deepmind.google/discover/blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/) — DeepMind's evolutionary algorithm discovery
- [Promptbreeder](https://arxiv.org/abs/2309.16797) — Self-referential prompt evolution
- Hermes Agent #336 — Companion issue for the external evolver skill integration
