Feature: Evolutionary Self-Improvement — Auto-Evolving Skills & Prompts via LLM-Driven Search #337

@teknium1

Overview

Imbue's Darwinian Evolver demonstrates that LLM-driven evolutionary optimization can achieve 2-3x performance improvements over base model capabilities by maintaining populations of solutions, applying targeted mutations, and selecting based on fitness. Their ARC-AGI-2 results (95.1% with Gemini 3.1 Pro) validate this approach for code and prompt optimization.

This feature proposes bringing evolutionary self-improvement natively into Hermes Agent — using the same evolutionary patterns (population management, fitness-weighted selection, LLM-driven mutation, post-mutation verification) to automatically improve Hermes Agent's own skills, system prompts, and tool-use patterns based on real usage data. Unlike the companion skill issue (#336), which wraps the external evolver CLI, this feature implements the evolutionary patterns directly in the MIT-licensed Hermes codebase and connects them to existing infrastructure (batch_runner, trajectories, RL environments).

Research source: "LLM-based Evolution as a Universal Optimizer" and "Beating ARC-AGI-2 with Code Evolution"


Research Findings

The Evolutionary Pattern (What Makes It Work)

The core insight from Imbue's research is that evolution is remarkably robust — it works even when the mutator only produces improvements 20% of the time. The key components:

  1. Population + Weighted Selection — Sigmoid-scaled fitness × novelty bonus. Dynamic midpoint (Nth percentile) keeps selection pressure meaningful as population improves. Novelty bonus prevents over-exploiting a single branch.

  2. Failure-Driven Mutation — Mutations are targeted at specific failure cases, not random. The LLM sees concrete examples of what went wrong and proposes fixes. This is dramatically more effective than random perturbation.

  3. Learning Logs — A history of "what was tried → what happened" provided as context to the mutator. Strategies: ancestors (direct lineage) or neighborhood-N (siblings/cousins). This prevents re-trying failed approaches.

  4. Post-Mutation Verification — Test the mutation only on the failure cases it targeted before running full evaluation. Imbue reports >10x cost reduction from this filter alone.

  5. Crossover — 25% of mutations combine logic from multiple parents, enabling recombination of independently discovered improvements.
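
The weighted selection in item 1 can be sketched as below. This is a minimal illustration rather than Imbue's implementation: the function names, the `sharpness` constant, the 75th-percentile default, and the novelty formula `1 / (1 + times_selected)` are all assumptions chosen for readability, and scores are assumed to lie in the range 0 to 1.

```python
import math
import random


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def selection_weights(scores, times_selected, midpoint_pct=75, sharpness=10.0):
    """Sigmoid-scaled fitness multiplied by a novelty bonus.

    The midpoint is recomputed from the current population, so the sigmoid
    keeps discriminating between variants as scores improve. The novelty
    term 1 / (1 + times_selected) is an assumed formula that down-weights
    variants already picked as parents many times.
    """
    midpoint = percentile(scores, midpoint_pct)
    weights = []
    for score, picks in zip(scores, times_selected):
        fitness = 1.0 / (1.0 + math.exp(-sharpness * (score - midpoint)))
        novelty = 1.0 / (1.0 + picks)
        weights.append(fitness * novelty)
    return weights


def select_parents(population, scores, times_selected, k=2, rng=random):
    """Sample k parents (with replacement) in proportion to their weights."""
    weights = selection_weights(scores, times_selected)
    return rng.choices(population, weights=weights, k=k)
```

Because the midpoint tracks a percentile of the current scores, a variant that was strong early on gradually loses weight as better variants appear, and a frequently selected variant loses weight through the novelty term.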

Why This Matters for Hermes Agent

Hermes Agent's performance is determined by three layers:

  • Model capability — Improved via RL training (Tinker-Atropos, already exists)
  • Instructions — Skills, system prompts, tool descriptions. Currently hand-authored and static.
  • Tool quality — Code quality of tool implementations. Currently hand-authored.

The instructions layer is the sweet spot for evolutionary optimization because:

  • Instructions are text/code that LLMs can meaningfully mutate
  • Changes can be evaluated against real tasks via batch_runner
  • The search space is large enough that manual optimization misses good solutions
  • Results are immediately deployable (update the skill file → improved performance)

Key Design Decisions from the Evolver

Decisions we should adopt:

  • Separate train/score datasets — Prevents overfitting to specific test cases
  • Atomic population updates — New variants become visible to selection only once an iteration finishes, which prevents self-referential loops
  • Pickle/JSON snapshots — Enables pause/resume of long optimization runs
  • Concurrent evaluation — Parallel batch evaluation for throughput
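
Two of the decisions above, JSON snapshots and atomic population updates, could look roughly like this. It is a sketch under assumptions: the population is taken to be a list of JSON-serializable dicts, and the function names are hypothetical rather than taken from the evolver or from Hermes.

```python
import json
import os
import tempfile


def save_snapshot(population, path):
    """Write the population to `path` as JSON without leaving a partial file.

    The data goes to a temporary file in the same directory and is then
    moved into place with os.replace, which is atomic on one filesystem.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(population, handle, indent=2)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_snapshot(path):
    """Return the saved population, or an empty list if no snapshot exists."""
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def run_iteration(population, make_children):
    """Apply one iteration with an atomic population update.

    `make_children` only sees the population as it was at the start of the
    iteration; new variants are appended in a single step at the end.
    """
    frozen = tuple(population)
    children = make_children(frozen)
    return list(population) + list(children)
```

Saving a snapshot after every `run_iteration` call would let an interrupted run resume from the last completed iteration.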

Decisions to adapt:

  • Git-based tracking — Use git commits to track skill evolution (the evolver has GitBasedOrganism for exactly this)
  • Provider routing — Use Hermes Agent's existing OpenRouter routing instead of the evolver's direct API calls
  • Evaluation signal — Use batch_runner trajectories and compute_reward instead of custom evaluators

Current State in Hermes Agent

Existing Infrastructure (Integration Points)

| Component | File | Relevance |
| --- | --- | --- |
| Batch Runner | batch_runner.py | Runs agent on multiple prompts in parallel, saves trajectories. Natural evaluation harness. |
| Trajectory Saving | agent/trajectory.py | Records conversations in ShareGPT format. Provides raw data for scoring. |
| RL Environments | environments/hermes_base_env.py | compute_reward() methods already score agent rollouts. Reusable as fitness functions. |
| Skill System | tools/skill_tools.py, skills/ | Skills are SKILL.md files — text that can be evolved. Git-tracked. |
| System Prompts | agent/prompt_builder.py | Assembles system prompts. Components could be evolved. |
| Session Database | hermes_state.py | FTS5 search over past conversations. Source of real usage patterns. |

What's Missing

  • No population management for skill/prompt variants
  • No fitness scoring of skills (which skill version produces better outcomes?)
  • No mutation loop (propose skill changes → test → select)
  • No mechanism to A/B test skill variants

Implementation Plan

Skill vs. Tool Classification

This should be a core codebase feature (not a skill, not a tool) because:

  • It requires deep integration with batch_runner, trajectory saving, and the skill system
  • It needs persistent state management (populations, evaluation history, snapshots)
  • It involves automated code changes to skill files that must be carefully controlled
  • It is part of the agent's self-improvement loop, not a user-facing capability

What We'd Need

  1. Population manager — Track multiple variants of a skill/prompt, their scores, and lineage
  2. Fitness evaluator — Run skill variants through batch_runner on test tasks, score results
  3. Skill mutator — LLM-powered mutation of SKILL.md files based on failure analysis
  4. Evolution loop — Orchestrate selection → mutation → evaluation → integration
  5. Git integration — Track all skill variants as branches/commits for traceability
  6. CLI commands — hermes evolve-skill <skill-name> to trigger optimization
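
As a rough picture of how items 1 to 4 might fit together, here is a small sketch of a population record and an evolution loop. Everything in it is hypothetical: the `Variant` fields, the `evolve` signature, and the idea of passing `evaluate`, `mutate`, and `select` as plain callables are assumptions for illustration, not an existing Hermes API.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Variant:
    """One candidate version of a skill, with its score and lineage."""
    variant_id: int
    text: str
    score: Optional[float] = None
    parent_ids: List[int] = field(default_factory=list)


def evolve(
    seed_text: str,
    evaluate: Callable[[str], float],
    mutate: Callable[[Variant], str],
    select: Callable[[List[Variant]], Variant],
    iterations: int,
) -> List[Variant]:
    """Run selection -> mutation -> evaluation -> integration N times."""
    population = [Variant(0, seed_text, evaluate(seed_text))]
    for _ in range(iterations):
        parent = select(population)
        child_text = mutate(parent)
        child = Variant(
            variant_id=len(population),
            text=child_text,
            score=evaluate(child_text),
            parent_ids=[parent.variant_id],
        )
        population.append(child)
    return population


def best(population: List[Variant]) -> Variant:
    """Return the highest-scoring variant."""
    return max(population, key=lambda v: v.score)
```

In a real pipeline `evaluate` would wrap a batch_runner run and `mutate` would call an LLM; a toy run such as `evolve("seed", lambda t: t.count("x"), lambda v: v.text + "x", best, 3)` is enough to exercise the loop and the lineage tracking.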

Phased Rollout

Phase 1: Skill-level evolution with manual evaluation

  • Implement population management for SKILL.md variants (store in ~/.hermes/evolution/)
  • LLM-based mutation of skills given failure case descriptions
  • Manual fitness scoring (user provides "this version worked better")
  • Git-tracked lineage of skill versions
  • CLI: hermes evolve-skill <skill-name> --failure "it did X wrong"
  • Deliverable: User can iteratively improve a skill through guided evolution
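
The Phase 1 command line could be parsed along these lines. This is a standalone `argparse` sketch, since how Hermes wires its CLI is not described here; making `--failure` repeatable is also an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for `hermes evolve-skill <skill-name> --failure ...`."""
    parser = argparse.ArgumentParser(prog="hermes")
    subcommands = parser.add_subparsers(dest="command", required=True)

    evolve = subcommands.add_parser(
        "evolve-skill", help="Iteratively improve a skill from failure reports"
    )
    evolve.add_argument("skill_name", help="Name of the skill to evolve")
    evolve.add_argument(
        "--failure",
        action="append",
        default=[],
        help="Description of something the skill got wrong (repeatable)",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"Evolving {args.skill_name} with {len(args.failure)} failure report(s)")
```

Each `--failure` string would become a concrete failure case handed to the mutator.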

Phase 2: Automated evaluation via batch_runner

  • Connect evolution loop to batch_runner for automated fitness scoring
  • Define evaluation datasets per skill (set of tasks + expected outcomes)
  • Implement post-mutation verification (quick check before full eval)
  • Learning log integration (track what mutations worked/failed)
  • CLI: hermes evolve-skill <skill-name> --dataset eval_tasks.jsonl --iterations 10
  • Deliverable: Fully automated skill optimization pipeline
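
The post-mutation verification step might be sketched as a cheap filter in front of the full evaluation. The names here are hypothetical; in particular `run_task(candidate, task)` stands in for whatever runs one task with a candidate skill and reports success, and `full_dataset` is assumed to be non-empty.

```python
from typing import Callable, Iterable, List, Optional, Sequence


def passes_quick_check(
    candidate: str,
    targeted_failures: Iterable[str],
    run_task: Callable[[str, str], bool],
    min_fixed: int = 1,
) -> bool:
    """Return True once the candidate fixes `min_fixed` targeted failures."""
    fixed = 0
    for task in targeted_failures:
        if run_task(candidate, task):
            fixed += 1
            if fixed >= min_fixed:
                return True
    return False


def evaluate_with_filter(
    candidate: str,
    targeted_failures: Sequence[str],
    full_dataset: Sequence[str],
    run_task: Callable[[str, str], bool],
) -> Optional[float]:
    """Run the full evaluation only for candidates that pass the quick check.

    Returns the fraction of tasks solved on the full dataset, or None when
    the candidate is rejected early.
    """
    if not passes_quick_check(candidate, targeted_failures, run_task):
        return None
    results: List[bool] = [run_task(candidate, task) for task in full_dataset]
    return sum(results) / len(results)
```

A rejected candidate costs only as many runs as there are targeted failures, which is where the savings over always running the full dataset would come from.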

Phase 3: Continuous improvement and prompt evolution

  • Extend beyond skills to system prompt components
  • Use real session data (from hermes_state.py) as evaluation signal
  • Implement crossover mutations between skill variants
  • A/B testing framework for production skill variants
  • Integration with Tinker-Atropos reward signals
  • Deliverable: Self-improving agent system that gets better with use
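
For the crossover item, one possible shape is to decide per mutation whether to use one parent or two, then build the mutator prompt accordingly. This is only a sketch: the function names and prompt wording are illustrative, variants are represented as plain strings, and parents are drawn uniformly here for brevity where a real loop would use fitness-weighted selection.

```python
import random


def plan_mutation(population, crossover_rate=0.25, rng=random):
    """Pick parents for the next mutation.

    Returns ("crossover", [a, b]) with probability `crossover_rate` when at
    least two variants exist, otherwise ("mutate", [a]).
    """
    if len(population) >= 2 and rng.random() < crossover_rate:
        return "crossover", rng.sample(population, 2)
    return "mutate", [rng.choice(population)]


def build_prompt(kind, parents, failure):
    """Assemble the text sent to the mutator LLM (wording is illustrative)."""
    sections = [f"Parent {i + 1}:\n{text}" for i, text in enumerate(parents)]
    if kind == "crossover":
        instruction = "Combine the strengths of the parents into one skill."
    else:
        instruction = "Revise the skill so the failure below no longer occurs."
    return "\n\n".join(sections + [f"Observed failure:\n{failure}", instruction])
```

The default `crossover_rate` of 0.25 mirrors the 25% figure reported for the evolver.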

Pros & Cons

Pros

  • High leverage — Improving instructions is cheaper and faster than retraining models
  • Proven approach — Imbue's results (2-3x improvement) validate the pattern
  • Builds on existing infra — batch_runner, trajectories, RL environments already exist
  • Compounding returns — Better skills → better agent → better trajectory data → better evolution
  • Transparent — All evolution is git-tracked and human-readable (it's just text changes)
  • No license issues — Native implementation under MIT (doesn't depend on AGPL evolver)

Cons / Risks

  • Evaluation is hard — Defining "good" for complex agent tasks is subjective. Bad evaluation → bad evolution. This is the critical challenge.
  • Cost — Each evaluation run consumes API credits. 10 iterations × 5 parents × 50 eval tasks = 2,500 evaluation runs per skill optimization, and each run is an agent rollout that may itself make several API calls.
  • Complexity — Adds a meta-optimization layer to the codebase. Must be optional and well-contained.
  • Regression risk — An evolved skill could perform better on the eval set but worse on real tasks (overfitting). Needs holdout sets and manual approval.
  • Scope creep — "Self-improving agent" is an open-ended research problem. Phase 1 must be narrowly scoped.
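
The cost figure above is simple multiplication, and the same arithmetic shows how much a post-mutation filter could save. The helper below is a back-of-the-envelope sketch; the function name and the `quick_check_pass_rate` parameter are assumptions, and mutator calls and quick-check runs are not counted.

```python
def estimate_eval_runs(iterations, parents_per_iteration, eval_tasks,
                       quick_check_pass_rate=1.0):
    """Estimate full-evaluation runs for one skill optimization.

    Assumes each selected parent yields one child per iteration and every
    child that survives the quick check is run once on every eval task.
    """
    children = iterations * parents_per_iteration
    return round(children * quick_check_pass_rate * eval_tasks)


print(estimate_eval_runs(10, 5, 50))       # 2500
print(estimate_eval_runs(10, 5, 50, 0.2))  # 500
```

The 0.2 pass rate in the second call is an illustrative value, not a measured one.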

Open Questions

  • What evaluation datasets should we ship with for initial skill optimization? (e.g., test tasks for the arxiv skill, github-code-review skill)
  • Should evolved skills require human approval before deployment, or can we trust automated evaluation?
  • How do we prevent overfitting — evolved skills that ace the eval set but fail on novel tasks?
  • Should this integrate with Tinker-Atropos (RL for weights) or remain independent (evolution for instructions)?
  • What's the right population size and iteration count for practical skill optimization? (Imbue used 2 parents/iteration, 16 iterations for ARC-AGI-2)
