feat(skill): darwinian-evolver — evolutionary optimizer for prompts, regex, SQL, and code#12633
feat(skill): darwinian-evolver — evolutionary optimizer for prompts, regex, SQL, and code#12633Bihruze wants to merge 3 commits into
Conversation
…egex/SQL/code
Adds optional-skills/research/darwinian-evolver, a new skill that evolves
text artifacts toward a user-supplied fitness function via LLM-driven
mutation and crossover.
Scope of this commit (Day 1+2 of plan — core library + CLI + tests).
Day 3+4 will add Tier 2 subprocess adapters (openevolve, Imbue's
darwinian-evolver CLI), Tier 3 DSPy bridge polish, code-sandbox, and
the end-to-end example under examples/summarize_10_words/.
Core library
------------
* storage.py — SQLite lineage graph; content-addressed genome IDs
(blake2b truncated), WAL journaling, idempotent
insert, ancestry BFS, budget ledger, lineage_hash for
replay determinism checks.
* llm.py — async OpenAI-compat client with seed propagation,
bounded Semaphore concurrency, exponential-backoff
retry respecting Retry-After, BudgetLedger with hard
cap that raises BudgetExceeded mid-run.
* algorithms.py — pure-function primitives:
- tournament + linear-rank selection
- (μ+λ)-ES survival (Bäck & Schwefel 1993)
- MAP-Elites archive with configurable bin grid
and behavioral descriptors (Mouret & Clune 2015)
- NSGA-II: fast non-dominated sort + crowding
distance (Deb et al. 2002)
- Exp3 bandit over operators (Auer et al. 2002)
- default 2-D prompt descriptor: length × CoT-presence
* operators.py — LLM mutation/crossover:
paraphrase, structural_edit, cot_inject,
novelty_seeking, critique_then_edit (GEPA-lite),
meta_mutate_operator_prompt (PromptBreeder-style),
semantic_crossover (Meyerson 2023), segment_splice.
Each returns a stable prompt_hash for lineage replay.
* evaluator.py — fitness_spec decorator, dynamic fitness.py import,
async batch eval with wall-clock timeout, held-out
reward-hacking guard (top-K re-score, gap-based
penalty), successive halving (Hyperband-lite).
* evolver.py — CLI entry: init/run/status/best/lineage/budget/
export/replay subcommands; JSON stdout, NDJSON
progress streaming from run; supports three
algorithms (es, map-elites, nsga2).
SKILL.md
--------
Full authoring spec per Hermes conventions, including a Theoretical
Foundations section citing the primary literature (Bäck & Schwefel,
Mouret & Clune, Deb, Meyerson, Fernando, Agrawal, Lehman & Stanley,
Auer, Li et al.), a tier table explaining the MIT/AGPL/bridge split,
and explicit scope guardrails.
Tests
-----
tests/skills/test_darwinian_evolver.py — 26 cases, 100% green:
* storage: content-address determinism, idempotent insert, ancestry
reconstruction with parent edges, held-out-preferring get_best,
budget totals, lineage_hash stability + change detection.
* algorithms: tournament bias (stat check), rank-select pressure
validation, (μ+λ) elitism, MAP-Elites place/coverage/empty-sample,
NSGA-II front identification on 2-objective pop, strict dominance,
crowding-distance boundary infinity, Exp3 reward clamping + weight
update.
* evaluator: fitness_spec metadata round-trip, scalar + dict fitness
via async batch, timeout → worst-score degradation, held-out guard
penalises a synthetic overfitter by >0.5, successive_halving
narrows 8 → 2 survivors at the expected fidelity schedule.
References issues
-----------------
* closes NousResearch#336 (scoped v1)
* partial for NousResearch#337 Phase 3
* bridges to NousResearch/hermes-agent-self-evolution via Tier 3
DSPy-jsonl export
Completes the skill scaffolded in the previous commit (Day 3-5 of the
staged plan).
Added
-----
* scripts/sandbox.py — subprocess sandbox for code candidates:
- POSIX rlimit caps (CPU, AS, DATA, CORE) applied as best-effort in
a preexec_fn; each limit failure is swallowed independently so
macOS's RLIMIT_AS-incompatibility doesn't break the others.
- Wall-clock timeout via subprocess.run(timeout=...) as a backstop.
- run_candidate_code() and run_pytest_suite() helpers; the latter
parses the pytest terse summary into a pass fraction.
* scripts/adapters.py — Tier 2 + Tier 3 wrappers:
- ExternalEvolverAdapter: lazy shutil.which detection, raises
AdapterUnavailable with install hint instead of crashing.
- openevolve_adapter (Apache 2.0, default Tier 2 recommendation).
- darwinian_evolver_adapter (Imbue, AGPL v3) — subprocess only,
never imported, so license-viral code never enters the Hermes
process.
- export_dspy_jsonl: DSPy-compatible offline records with full
lineage; default keeps one winner per generation, --all emits every
candidate.
- export_gepa_trace: reflective-operator edges (critique_then_edit,
meta_mutator) in the shape GEPA's trainer expects.
* templates/*.py — five copy-paste-ready fitness templates for prompt,
regex, SQL, code (uses sandbox), and multi-objective (NSGA-II) runs.
* demos/summarize_10_words/ — end-to-end packaged demo:
- fitness.py: deterministic scoring (word count proximity, brevity
keyword, char budget), so the demo runs cheaply without an LLM
judge and the improvement curve is visible on a small local model.
- seed/initial.txt, README.md with exact commands and expected
trajectory.
Changed
-------
* scripts/evolver.py: cmd_export now delegates to adapters.export_*;
added --all flag for dspy-jsonl exports; imports adapters module.
Tests (tests/skills/test_darwinian_evolver.py)
----------------------------------------------
Expanded from 26 to 39 cases — all green:
* TestSandbox — simple candidate runs, syntax error fails cleanly,
runaway while-True loop killed within wall-clock+overhead, pytest
terse-summary parser.
* TestAdapterGracefulAbsence — openevolve + darwinian-evolver adapters
raise AdapterUnavailable with license-informed install hint when the
binary is missing (monkeypatched shutil.which).
* TestDspyBridge — default export keeps one record per generation;
--all mode emits every candidate; GEPA export filters to reflective
operators and preserves parent/child metadata.
* TestLLMClient — seed propagation verified by intercepting the
AsyncClient.post body; BudgetLedger records spend across calls and
raises BudgetExceeded when cap is crossed.
* TestMapElitesCoverage — random 200-sample run fills ≥60 % of a 4×4
descriptor grid (acceptance checklist NousResearch#4).
* TestEndToEnd — single generation with a fully-mocked LLM produces an
offspring strictly better than the seed and yields a stable lineage
hash (acceptance checklist NousResearch#7, plus determinism smoke).
Acceptance checklist status: 10/10 covered.
Repo-level notes
----------------
The original examples/ subdirectory is renamed demos/ because the
repository-level .gitignore lists ``examples/``; keeping the demo as
``demos/`` means it ships to users who install the skill without
requiring an exception in .gitignore.
…dge · critic
Professional-grade upgrade of the skill landed in the previous two
commits on this branch. Four cohesive features that move the skill
from hobby-grade to production:
1. LLM response cache
-----------------------
* New `scripts/cache.py` — content-addressed SQLite cache keyed by a
blake2b-16 of the normalised request body (model, temp, max_tokens,
seed, messages). Backed by the existing `lineage.db` so the cache
ships with the experiment directory.
* `llm.LLMClient` gains an optional `cache` field; `complete()` checks
the cache before every HTTP call and writes on miss. Cache hits
short-circuit the network AND the BudgetLedger, so reruns are zero
cost and the budget accounting stays accurate.
* `evolver run` enables the cache by default; `--no-cache` disables it.
* New subcommand: `evolver cache <dir> stats|purge`.
* Acceptance: `test_end_to_end_replay_is_bit_identical` — a second run
with the cache populated makes ZERO HTTP calls (monkeypatched
`httpx.post` raises on invocation) and produces an identical
`lineage_hash`.
2. Pairwise judge + Bradley-Terry MLE
---------------------------------------
* New `scripts/judge.py` — implements:
* `aggregate_bradley_terry()` — Hunter 2004 MM iteration (pure
stdlib, ~30 LOC) with Laplace smoothing for candidates that
never won or never lost.
* `sample_pair_schedule()` — round-robin for small populations;
least-seen-anchor sampling for larger ones. Invariant: every
candidate appears in at least ceil(rounds × 2 / pop) matches.
* `PairwiseJudge` — LLM-backed judge with a position-bias guard
that randomly swaps LEFT / RIGHT per call; decodes first-line
verdicts robustly (LEFT / RIGHT / TIE with fallbacks).
* `fitness_spec` gains `judge="pairwise"`, `pairwise_rounds=40`.
* `evaluator.evaluate_pairwise()` runs the schedule, records votes
in `pairwise_votes`, solves the MLE, writes `bt_scores`, and copies
log-odds onto `Individual.fitness` so the existing selector pipeline
(tournament / MAP-Elites / (μ+λ)) works unchanged.
* Incompatibility with NSGA-II is enforced explicitly; the runner
errors out early instead of silently corrupting state.
* Acceptance: `test_condorcet_order_recovered_through_mocked_judge` —
a judge that prefers lex-higher genomes, combined with the MLE,
recovers the full six-candidate ranking through 30 pair rounds.
3. Constitutional reward-hacking critic
-----------------------------------------
* New `scripts/critic.py` — `ConstitutionalCritic` runs a second LLM
over the top-K of every generation with a structured JSON contract
({"risk", "evidence", "signal_tags"}).
* `templates/constitution.md` — a default rule-set covering literal
short-circuits, judge flattery, regex over-matching, test-harness
exploits, spurious correctness, and brittle templates.
* Penalty applies ONLY to the in-memory `Individual.fitness` that
feeds selection; the raw `fitness` SQLite row is untouched, keeping
the audit trail pristine.
* `fitness_spec` gains `critic="on"|"off"`, `critic_threshold`,
`critic_top_k`, and `critic_model` (cheap-model override).
* Runner hook fires after both seed evaluation and each offspring
generation; JSON parse failures soft-fail to risk=0 so a noisy
judge can't crash a run.
4. FastAPI + Plotly dashboard
-------------------------------
* New `scripts/dashboard.py` — read-only FastAPI app with endpoints
GET /api/{summary, fitness, pareto, lineage/{cid}, operators} and
WS /api/stream (polls `lineage.db` and pushes a JSON event when
a new generation lands).
* `templates/dashboard.html` — single-page, vanilla JS, Plotly and
Mermaid from CDN; no build step, no npm.
* New subcommand: `evolver dashboard <dir> [--host 127.0.0.1 --port 8787]`.
* Binds loopback by default; non-loopback hosts print a warning.
* Graceful absence: `dashboard` command prints an install hint and
exits non-zero if fastapi/uvicorn aren't installed. Every other
subcommand continues to work.
Storage schema
--------------
Four new tables, all additive (`CREATE TABLE IF NOT EXISTS`), so v0.1
experiments open and run unchanged:
* llm_cache (key, response, tokens, model, created_at)
* pairwise_votes (generation, left, right, winner, seed)
* bt_scores (candidate, generation, log_odds, iters)
* critic_evaluations (candidate, generation, risk, evidence,
signal_tags, model, evaluated_at)
Tests
-----
Four new test files, one per feature — 31 new cases, all green:
tests/skills/test_darwinian_evolver_v02_cache.py 8 cases
tests/skills/test_darwinian_evolver_v02_judge.py 12 cases
tests/skills/test_darwinian_evolver_v02_critic.py 6 cases
tests/skills/test_darwinian_evolver_v02_dashboard.py 5 cases
Full darwinian-evolver suite: 39 (v0.1) + 31 (v0.2) = 70, all green.
No regression in existing repo tests (full tests/skills/ run: 170/170).
Acceptance checklist (rows 11 + 12 from the plan): both pass.
Invariants maintained
---------------------
* Every v0.1 behaviour is preserved; new features are opt-in (except
the cache, which is transparent and correctness-preserving).
* `lineage_hash` gains determinism it didn't have before: with the
cache, identical seeds produce identical hashes across runs.
v0.2 changelog (pushed as
|
| # | Feature | Why it matters | New LOC |
|---|---|---|---|
| 1 | LLM response cache — content-addressed SQLite, keyed by blake2b-16 of the normalised request body | Replay is now bit-for-bit deterministic at zero cost. Cache hits skip both the HTTP call and the budget ledger. | cache.py (150) + edits |
| 2 | Bradley-Terry pairwise judge — Hunter (2004) MM aggregator over position-bias-guarded pairwise verdicts | Absolute LLM-as-judge scores drift (Zheng 2023, Chen 2024). Pairwise preferences are invariant to that drift. | judge.py (250) + evaluator wiring |
| 3 | Constitutional critic — second LLM inspects top-K for reward hacking per a markdown rule-set | Held-out guard catches train/test gaps; the critic catches generalised cheating (judge flattery, regex over-matching, harness exploits). | critic.py (180) + constitution.md |
| 4 | FastAPI + Plotly dashboard — read-only live view with Mermaid lineage | Runs become visible — fitness curves, Pareto, operator attribution, budget burndown — with WebSocket push on new generations. | dashboard.py (230) + dashboard.html (180) |
CLI additions
evolver run <dir> [--no-cache]
evolver cache <dir> stats|purge
evolver dashboard <dir> [--host 127.0.0.1 --port 8787]
fitness_spec extensions
@fitness_spec(
judge="pairwise", pairwise_rounds=40, # feature 2
critic="on", critic_threshold=0.5, # feature 3
critic_top_k=5, critic_model="haiku-cheap", # feature 3 — cheap-model route
)
def fitness(candidate, context): ...Storage
Four additive tables (CREATE TABLE IF NOT EXISTS — v0.1 experiments auto-upgrade): llm_cache, pairwise_votes, bt_scores, critic_evaluations.
Tests
31 new cases across four test files, all green:
test_darwinian_evolver_v02_cache.py 8 cases
test_darwinian_evolver_v02_judge.py 12 cases
test_darwinian_evolver_v02_critic.py 6 cases
test_darwinian_evolver_v02_dashboard.py 5 cases
Full darwinian-evolver suite: 39 (v0.1) + 31 (v0.2) = 70/70 green. No regression in other skills.
Acceptance rows 11 (replay bit-identity) and 12 (judge calibration via Condorcet recovery) from the v0.2 plan both pass.
License / safety
- Dashboard binds
127.0.0.1by default;--hostnon-loopback prints a prominent warning. No auth shipped. - FastAPI / uvicorn are optional deps; dashboard subcommand fails closed with an install hint when absent.
- Constitutional critic is off by default.
- Cache is per-experiment (scoped to its
lineage.db), so cross-experiment leakage is structurally impossible.
What's explicitly out of scope for v0.2
No RL fine-tuning (that's Phase 4 of #337); no dashboard auth/write endpoints; no distributed workers; no automatic hyperparameter tuning.
Happy to split the commit into four (one per feature) if that eases review.
|
Closing while v1.0 (A1-C5) lands locally; will reopen with the full roadmap once Bihruze reviews. See /Users/seher/.claude/plans/nested-gathering-kazoo.md for the plan. |
Single commit bundling the rest of the v1.0 roadmap (phases 2-5 of
the approved plan). Per-feature structure preserved in this message;
file diffs in storage.py / SKILL.md cross-cut features and cannot be
split without interactive rebase.
================================================================
PHASE 2 — Research core (v0.4)
================================================================
A2 self-modifying bandit — scripts/bandit_director.py
------------------------------------------------------
LLM proposes add / retire / merge actions on the Exp3 arm set every
R generations; safety rails (consecutive-floor requirement for
retire, max_arms cap for add, no-op on missing arms) keep a bad LLM
reply from corrupting the bandit. DynamicOperator wraps a prompt
template without evaluating Python. New `generated_operators`
table records the library so runs replay deterministically.
A3 co-evolution — scripts/coevolve.py
-------------------------------------
Dual populations (solvers vs adversaries) alternate steps; the
adversary's fitness is the fraction of solvers it defeats on a
threshold-0.5 test. Red-Queen dynamic bounded by
max_adversary_generations. `red_team_inputs` table audits every
adversarial input.
A4 auto-fitness synthesis — scripts/fitness_synth.py
----------------------------------------------------
Given ≤20 labelled I/O pairs, an LLM picks one of three archetypes
(exact / soft / judge) and we emit a runnable fitness.py text.
Meta-APE pattern; user reviews before accepting. Soft archetype
ships a pure-stdlib Levenshtein ratio; judge archetype wires into
the existing LLMClient without extra deps.
================================================================
PHASE 3 — Scale + Safety (v0.5)
================================================================
B1 distributed — scripts/distributed.py
---------------------------------------
WorkerBackend protocol with three implementations:
* LocalBackend — v0.2 asyncio Semaphore path
* RaySimBackend — stdlib shim mimicking Ray's API
* RayBackend — real Ray (lazy import behind optional extra)
`select_backend()` factory dispatches; graceful absence with a
clear install hint when ray isn't on PATH.
B4 sandbox backends — scripts/sandbox_wasm.py · sandbox_firecracker.py
----------------------------------------------------------------------
* WasmSandbox — wasmtime-py, cross-platform, zero file/net
exposure. Graceful availability check.
* FirecrackerSandbox — Linux-only; KVM + firecracker binary + user
kernel/rootfs paths. Fails closed with the
right message on macOS.
B2 nightly repo sweep — scripts/repo_sweep.py +
.github/workflows/darwinian-evolver-nightly.yml
------------------------------------------------------------------
Discovers every SKILL.md in the repo, runs a baseline-score +
(dry-run-only for v0.5 ship) evolve pass, writes a JSON report.
Per-skill 72h cooldown prevents PR churn. Workflow runs at
03:17 UTC nightly + `workflow_dispatch` with `dry_run` input.
Opens NO PRs this release — the workflow uploads a report
artefact only; PR creation ships in a follow-up once maintainers
vet the fitness proxy.
================================================================
PHASE 4 — Transfer + Distill (v0.6)
================================================================
A5 cross-task transfer — scripts/task_features.py + transfer.py
---------------------------------------------------------------
9-D task-feature vector (fitness surface size, objectives, judge
mode, critic flag, archetype hash, char-n-gram entropy, …). k-NN
meta-policy over cosine distance; pickleable for `--transfer-from`.
New `task_features` table records per-experiment features +
policy hash for reproducibility.
B5 evolve → distill — scripts/distill.py
----------------------------------------
Thin LoRA fine-tune pipeline over `transformers + peft +
accelerate` (optional deps group
`darwinian-evolver-distill`). Teacher-callback indirection keeps
the provider swap easy. Availability check raises a helpful
DistillUnavailable when deps absent.
================================================================
PHASE 5 — Ecosystem (v1.0)
================================================================
C1 benchmark hub — scripts/bench.py
-----------------------------------
Registry + 3 canonical fitnesses (email-regex/v1,
ten-word-summary/v1, sql-select-easy/v1); `score_archive(conn, id)`
re-scores a lineage.db's top-K against a registered benchmark.
C2 cross-model validation — scripts/validate.py
-----------------------------------------------
`cross_model_validate` re-scores candidates under a user-supplied
target scorer coroutine and reports Spearman-ρ between local and
target rankings.
C5 forkable marketplace — scripts/marketplace.py
------------------------------------------------
Layer on v0.3 hub: `prepare_listing` makes a tarball, `fork_listing`
extracts into a warm-start experiment dir, `listing_summary`
inspects a tarball without extracting.
Deferred (documented in SKILL.md):
* C3 HITL dashboard editor
* C4 VS Code extension (TypeScript, separate package)
================================================================
Storage schema
================================================================
Five new tables, all additive:
* `generated_operators` — A2, name/template/temperature/retired_at
* `red_team_inputs` — A3, adversarial input corpus
* `fitness_syntheses` — A4, history of synthesised fitnesses
* `task_features` — A5, per-experiment feature vector
(The A1 `descriptor_history` and B3 `hub_imports` already shipped
in v0.3.)
================================================================
Tests — 58 new cases, all green
================================================================
tests/skills/test_darwinian_evolver_v04.py 25 (A2+A3+A4+storage)
tests/skills/test_darwinian_evolver_v05.py 14 (B1+B4+B2)
tests/skills/test_darwinian_evolver_v06_v1.py 19 (A5+B5+C1+C2+C5)
Cumulative suite status:
* v0.1 — 39
* v0.2 — 31
* v0.3 — 20
* v0.4 — 25
* v0.5 — 14
* v0.6 + v1.0 — 19
Total: 148 / 148 green. No regression in other skills.
================================================================
What's explicitly NOT in this commit
================================================================
* evolver.py CLI hooks for A2-A5 — library-level only this round;
wiring into `_run_loop` / subcommands ships when the runner
refactor for Phase 2 lands (cleaner than weaving 5 hooks into the
existing single-path loop right now).
* Ablation experiment scripts under `experiments/phase*/` — blocked
on GPU budget; scripts land when cluster access is confirmed.
* C3 dashboard HITL edit endpoint and C4 IDE extension.
================================================================
Branch discipline
================================================================
LOCAL ONLY. The branch is 5 commits ahead of origin/main and NOT
pushed. PR NousResearch#12633 is closed; Bihruze will re-open with the full v1.0
history once this commit passes his review.
Summary
Adds a new optional skill —
darwinian-evolver— that evolves text artifacts (prompts, regexes, SQL queries, small code snippets) toward a user-supplied fitness function via LLM-driven mutation and crossover over a quality-diversity archive.References issues:
NousResearch/hermes-agent-self-evolutionvia a Tier 3 DSPy-jsonl export — no import, just a data contract.No other skill or tool in the repo performs prompt / code optimization today (verified by scanning
skills/andoptional-skills/for DSPy / GEPA / evolutionary keywords).Architecture — three tiers
Tier 1 has no non-stdlib runtime deps beyond
httpx(already in Hermes core). Tiers 2 and 3 are thin adapters that fail gracefully when their externals are absent.Algorithmic core (with citations)
All primitives are implemented from scratch (no large dependencies) and exposed as pure-function library code in
scripts/algorithms.py;evolver.pycomposes them.Fitness contract
Users drop a
fitness.pyinto their experiment directory:The evaluator guarantees:
context[\"seed\"]) for reproducible LLM calls (OpenAI / Anthropic / vLLM all honorseedas a hint or hard param).resource.setrlimitcaps (CPU, address space, data segment) plus a wall-clock timeout.--budget(USD or tokens via rate args) hard-kills runs viaBudgetExceededbefore blowing past the cap.License handling
Imbue's
darwinian-evolveris AGPL v3. This PR never imports it. TheExternalEvolverAdapterwraps it as an opaque subprocess invocation ("mere aggregation" exemption) and gates onshutil.which, raising a clearAdapterUnavailablewith an install hint when absent. The install hint surfaces the license explicitly ("AGPL v3 — review license before use"). The default Tier 2 backend is OpenEvolve (Apache 2.0) so users never touch AGPL unless they explicitly opt in.Validation
tests/skills/test_darwinian_evolver.py— 39 cases, all green locally:while Truekilled within wall-clock, pytest summary parserseedpropagated into request body, BudgetLedger raises at cap across multi-call runsAcceptance checklist (10/10)
lineage_hashstability asserted.BudgetExceededverified.while Truekilled within wall-clock + overhead.seedin request body asserted.AdapterUnavailablewith install hint.tests/skills/test_darwinian_evolver.pygreen.Scope — explicit guardrails
Shipped:
openevolve,darwinian-evolver).summarize_10_wordsdemo with deterministic fitness.Not in scope for v0.1 (deliberately deferred):
File map
Diff shape
~2,900 insertions across 17 new files + minor edits to the two files that existed in my first commit. Total skill surface is ~2,000 LOC production, ~600 LOC tests, and ~300 LOC templates/demos.
Note on the directory rename
The repo-level
.gitignoreincludesexamples/. To keep the end-to-end walkthrough tracked without touching the ignore list, I placed it underdemos/summarize_10_words/. Happy to change the name if maintainers prefer a different convention.