feat(skill): darwinian-evolver — evolutionary optimizer for prompts, regex, SQL, and code by Bihruze · Pull Request #12633 · NousResearch/hermes-agent

Bihruze · 2026-04-19T17:32:01Z

Summary

Adds a new optional skill — darwinian-evolver — that evolves text artifacts (prompts, regexes, SQL queries, small code snippets) toward a user-supplied fitness function via LLM-driven mutation and crossover over a quality-diversity archive.

References issues:

#336, Feature: Darwinian Evolver Skill, Evolutionary Code & Prompt Optimization (this PR is the scoped v1 the issue asks for)
#337, Feature: Evolutionary Self-Improvement, Auto-Evolving Skills & Prompts via LLM-Driven Search (this PR lands Phase 3 in skill form)
Bridges to NousResearch/hermes-agent-self-evolution via a Tier 3 DSPy-jsonl export — no import, just a data contract.

No other skill or tool in the repo performs prompt / code optimization today (verified by scanning skills/ and optional-skills/ for DSPy / GEPA / evolutionary keywords).

Architecture — three tiers

┌──────────────────────────────────────────────────────────────────────┐
│ Tier 3: bridge                                                        │
│   export lineage → DSPy-compatible JSONL / GEPA reflective trace      │
│                 → hermes-agent-self-evolution ingests                 │
├──────────────────────────────────────────────────────────────────────┤
│ Tier 2: heavy (opt-in, AGPL-isolated)                                 │
│   subprocess wrapper around openevolve CLI (Apache 2.0)               │
│   subprocess wrapper around darwinian-evolver CLI (Imbue, AGPL v3)    │
│   NO python import — mere aggregation — license-clean                 │
├──────────────────────────────────────────────────────────────────────┤
│ Tier 1: core (MIT, default)                                           │
│   (μ+λ)-ES · MAP-Elites · NSGA-II · tournament + rank selection       │
│   LLM operators: paraphrase · structural-edit · semantic-crossover    │
│                   · CoT-inject · novelty-seeking · critique-then-edit │
│                   · PromptBreeder-style meta-mutator                  │
│   Evaluator: async batch · successive halving · held-out guard        │
│   Storage: SQLite lineage · content-addressed genomes · Mermaid DAGs  │
└──────────────────────────────────────────────────────────────────────┘
              reuses Hermes's existing OpenAI-compat client
              (prompt caching aware; respects per-slot context from #12595)

Tier 1 has no non-stdlib runtime deps beyond httpx (already in Hermes core). Tiers 2 and 3 are thin adapters that fail gracefully when their externals are absent.

Algorithmic core (with citations)

Primitive	Reference
(μ+λ)-ES survival	Bäck & Schwefel 1993
MAP-Elites quality-diversity archive	Mouret & Clune 2015
NSGA-II fast non-dominated sort + crowding distance	Deb et al. 2002
LLM semantic crossover	Meyerson et al. 2023
Self-referential meta-mutator	Fernando et al. 2023 (PromptBreeder)
Reflective critique-then-edit	Agrawal et al. 2025 (GEPA)
Novelty-seeking mutation bias	Lehman & Stanley 2011
Exp3 bandit over operators	Auer et al. 2002
Successive halving (Hyperband-lite)	Li et al. 2017

All primitives are implemented from scratch (no large dependencies) and exposed as pure-function library code in scripts/algorithms.py; evolver.py composes them.
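As an illustration of the NSGA-II row in the table, here is a minimal sketch of fast non-dominated sorting and crowding distance. It is not the code in scripts/algorithms.py; the function names and the convention that every objective is maximised are assumptions made for this example.

```python
from typing import Sequence

Objectives = Sequence[float]  # assumption: every objective is maximised


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is no worse than b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def fast_non_dominated_sort(points: Sequence[Objectives]) -> list[list[int]]:
    """Return fronts as lists of indices; front 0 is the Pareto front."""
    n = len(points)
    dominated = [[] for _ in range(n)]  # indices that point i dominates
    counts = [0] * n                    # number of points dominating i
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if dominates(points[i], points[j]):
                dominated[i].append(j)
            elif dominates(points[j], points[i]):
                counts[i] += 1
    fronts = [[i for i in range(n) if counts[i] == 0]]
    while fronts[-1]:
        nxt = []
        for i in fronts[-1]:
            for j in dominated[i]:
                counts[j] -= 1
                if counts[j] == 0:
                    nxt.append(j)
        fronts.append(nxt)
    return fronts[:-1]  # drop the trailing empty front


def crowding_distance(points: Sequence[Objectives], front: list[int]) -> dict[int, float]:
    """Boundary points get infinity; interior points sum normalised gaps."""
    dist = {i: 0.0 for i in front}
    for m in range(len(points[front[0]])):
        ordered = sorted(front, key=lambda i: points[i][m])
        lo, hi = points[ordered[0]][m], points[ordered[-1]][m]
        dist[ordered[0]] = dist[ordered[-1]] = float("inf")
        if hi == lo:
            continue
        for k in range(1, len(ordered) - 1):
            gap = points[ordered[k + 1]][m] - points[ordered[k - 1]][m]
            dist[ordered[k]] += gap / (hi - lo)
    return dist
```

Selection would then prefer lower front index first and larger crowding distance second, which is the usual NSGA-II tie-break.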

Fitness contract

Users drop a fitness.py into their experiment directory:

from evolver_sdk import fitness_spec

@fitness_spec(held_out_frac=0.2, timeout_s=30, objectives=["accuracy", "cost"])
def fitness(candidate: str, context: dict) -> float | dict[str, float]:
    ...
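To make the contract concrete, here is a hedged sketch of what a regex fitness might look like. Because `evolver_sdk` is only importable inside the skill, the example defines a stand-in `fitness_spec` that merely attaches metadata; the sample strings, the stand-in decorator, and the choice to report `cost` as a "higher is better" brevity score are all assumptions for illustration.

```python
from __future__ import annotations

import re


def fitness_spec(**meta):
    """Stand-in for evolver_sdk.fitness_spec (assumption): attaches metadata only."""
    def wrap(fn):
        fn.spec = meta
        return fn
    return wrap


# Hypothetical labelled examples for an e-mail-like pattern.
SHOULD_MATCH = ["a@b.co", "first.last@example.org"]
SHOULD_REJECT = ["no-at-sign", "@missing.local", "spaces in@address.com"]


@fitness_spec(held_out_frac=0.2, timeout_s=30, objectives=["accuracy", "cost"])
def fitness(candidate: str, context: dict) -> float | dict[str, float]:
    try:
        pattern = re.compile(candidate)
    except re.error:
        # An uncompilable candidate gets the worst score on every objective.
        return {"accuracy": 0.0, "cost": 0.0}
    hits = sum(bool(pattern.fullmatch(s)) for s in SHOULD_MATCH)
    rejects = sum(not pattern.fullmatch(s) for s in SHOULD_REJECT)
    accuracy = (hits + rejects) / (len(SHOULD_MATCH) + len(SHOULD_REJECT))
    cost = 1.0 / (1.0 + len(candidate))  # shorter patterns score higher
    return {"accuracy": accuracy, "cost": cost}
```

Returning a dict keyed by the declared objectives is what a multi-objective run would consume; a single-objective fitness can return a bare float instead.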

The evaluator guarantees:

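Seed propagation (context["seed"]) for reproducible LLM calls on backends that accept a seed: OpenAI treats seed as a best-effort hint and vLLM as a sampling parameter. Not every provider API exposes one, so check yours before relying on it.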
Code candidates execute in a subprocess sandbox with resource.setrlimit caps (CPU, address space, data segment) plus a wall-clock timeout.
A reward-hacking guard re-scores the top-K on a held-out split each generation; candidates with a >15 % generalization gap take a penalty so honest candidates outrank overfitters.
A global --budget (USD or tokens via rate args) hard-kills runs via BudgetExceeded before blowing past the cap.
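The held-out guard in the list above can be sketched as follows. The 15 % threshold comes from the description; the penalty shape (subtracting the relative gap itself), the `Scored` dataclass, and the function names are assumptions, since the PR text does not spell out the formula.

```python
from dataclasses import dataclass


@dataclass
class Scored:
    name: str
    train: float     # fitness on the training split
    held_out: float  # fitness on the held-out split


def gap_penalty(c: Scored, threshold: float = 0.15) -> float:
    """Relative generalisation gap, charged only when it exceeds the threshold."""
    if c.train <= 0:
        return 0.0
    gap = (c.train - c.held_out) / c.train
    return gap if gap > threshold else 0.0


def rerank(top_k: list[Scored]) -> list[Scored]:
    """Order candidates by training fitness minus the gap penalty."""
    return sorted(top_k, key=lambda c: c.train - gap_penalty(c), reverse=True)
```

With this shape, a candidate scoring 0.95 on train but 0.30 on held-out drops below an honest candidate at 0.80 / 0.78, which is the ordering the guard is meant to produce.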

License handling

Imbue's darwinian-evolver is AGPL v3. This PR never imports it. The ExternalEvolverAdapter wraps it as an opaque subprocess invocation ("mere aggregation" exemption) and gates on shutil.which, raising a clear AdapterUnavailable with an install hint when absent. The install hint surfaces the license explicitly ("AGPL v3 — review license before use"). The default Tier 2 backend is OpenEvolve (Apache 2.0) so users never touch AGPL unless they explicitly opt in.
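A minimal sketch of the subprocess-only gating described above, assuming a simplified constructor. The class and exception names follow the PR text, but the fields, method names, and hint wording here are illustrative rather than the shipped code.

```python
import shutil
import subprocess


class AdapterUnavailable(RuntimeError):
    """Raised when the external CLI is not installed."""


class ExternalEvolverAdapter:
    def __init__(self, binary: str, install_hint: str) -> None:
        self.binary = binary
        self.install_hint = install_hint

    def resolve(self) -> str:
        # Lazy detection: only look the binary up when it is actually needed.
        path = shutil.which(self.binary)
        if path is None:
            raise AdapterUnavailable(
                f"{self.binary!r} not found on PATH. {self.install_hint}"
            )
        return path

    def run(self, args: list[str], timeout_s: float = 3600.0) -> subprocess.CompletedProcess:
        # The tool runs as an opaque child process; nothing is imported from it.
        return subprocess.run(
            [self.resolve(), *args],
            capture_output=True, text=True, timeout=timeout_s, check=False,
        )
```

Because the adapter only ever shells out, the AGPL code never loads into the Hermes process, and a missing binary surfaces as a catchable exception rather than an ImportError.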

Validation

tests/skills/test_darwinian_evolver.py — 39 cases, all green locally:

area	cases	covers
storage	6	content-address determinism, idempotent insert, ancestry reconstruction, held-out-preferring best lookup, budget totals, lineage_hash stability + change detection
algorithms	9	tournament bias (statistical), rank-select invariant, (μ+λ) elitism, MAP-Elites place/coverage/empty-sample, NSGA-II front identification, strict dominance, crowding-distance boundary infinity, Exp3 reward clamp + weight update
evaluator	4	fitness_spec metadata round-trip, scalar + dict fitness via async batch, timeout → worst-score fallback
reward hacking	1	overfitter penalised >0.5 vs honest 0.0 penalty
successive halving	1	8 → 4 → 2 narrowing at the expected fidelity schedule
sandbox	4	simple program, syntax error, runaway `while True` killed within wall-clock, pytest summary parser
adapters	5	openevolve + Imbue graceful absence with license hint, DSPy-jsonl default-best-per-gen, DSPy-jsonl --all, GEPA trace filters to reflective edges
LLM client	2	`seed` propagated into request body, BudgetLedger raises at cap across multi-call runs
MAP-Elites coverage	1	random 200 samples fill ≥60 % of 4×4 grid
end-to-end (mocked LLM)	1	single generation improves over seed + stable lineage hash
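The MAP-Elites coverage row can be illustrated with a small archive sketch. The class name, the assumption that descriptors are already normalised to [0, 1), and the 4×4 grid default are choices made for this example, not the API of scripts/algorithms.py.

```python
import random


class MapElites:
    """Keeps the best genome seen so far in each cell of a 2-D descriptor grid."""

    def __init__(self, bins: int = 4) -> None:
        self.bins = bins
        self.cells: dict[tuple[int, ...], tuple[float, str]] = {}

    def _cell(self, descriptor: tuple[float, float]) -> tuple[int, ...]:
        # Assumption: each descriptor component lies in [0, 1).
        return tuple(min(int(d * self.bins), self.bins - 1) for d in descriptor)

    def place(self, genome: str, fitness: float, descriptor: tuple[float, float]) -> bool:
        """Insert if the cell is empty or the newcomer beats the incumbent."""
        key = self._cell(descriptor)
        incumbent = self.cells.get(key)
        if incumbent is None or fitness > incumbent[0]:
            self.cells[key] = (fitness, genome)
            return True
        return False

    def coverage(self) -> float:
        return len(self.cells) / self.bins ** 2


rng = random.Random(0)
archive = MapElites(bins=4)
for i in range(200):
    archive.place(f"g{i}", rng.random(), (rng.random(), rng.random()))
```

With 200 uniformly random descriptors over 16 cells, an empty cell is very unlikely, so a ≥60 % coverage assertion is a loose but cheap sanity check.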

Acceptance checklist (10/10)

Determinism — ✅ lineage_hash stability asserted.
Budget hard-kill — ✅ BudgetExceeded verified.
Multi-objective correctness — ✅ NSGA-II front non-domination asserted.
MAP-Elites coverage — ✅ ≥60 % on random 200-sample run.
Sandbox kills runaway — ✅ while True killed within wall-clock + overhead.
Held-out guard — ✅ overfitter penalty ≥0.5.
End-to-end improves — ✅ mocked-LLM generation improves fitness over seed.
Determinism of LLM ops — ✅ seed in request body asserted.
Tier 2 isolation — ✅ both adapters raise AdapterUnavailable with install hint.
CI regression — full tests/skills/test_darwinian_evolver.py green.
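For the budget hard-kill item, a ledger along these lines would give the asserted behaviour. Only the names `BudgetLedger` and `BudgetExceeded` come from the PR; the `charge` signature, the per-1k-token pricing arguments, and the choice to record the spend before raising are assumptions.

```python
class BudgetExceeded(RuntimeError):
    """Raised when cumulative spend crosses the configured cap."""


class BudgetLedger:
    def __init__(self, cap_usd: float) -> None:
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def charge(self, prompt_tokens: int, completion_tokens: int,
               usd_per_1k_in: float, usd_per_1k_out: float) -> float:
        """Record one call's cost and raise once the running total exceeds the cap."""
        cost = (prompt_tokens / 1000) * usd_per_1k_in \
             + (completion_tokens / 1000) * usd_per_1k_out
        self.spent_usd += cost
        if self.spent_usd > self.cap_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.4f} of ${self.cap_usd:.4f} cap"
            )
        return cost
```

A stricter variant could estimate the cost before the request and refuse to send it; this sketch charges after the fact, so the final call can overshoot the cap by at most its own cost.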

Scope — explicit guardrails

Shipped:

Tier 1 core, 7 LLM operators, MAP-Elites + ES + NSGA-II, held-out + budget guards, SQLite lineage, Mermaid export, replay.
Tier 2 adapters (openevolve, darwinian-evolver).
Tier 3 DSPy + GEPA exports.
5 fitness templates (prompt / regex / sql / code / multi-objective).
End-to-end summarize_10_words demo with deterministic fitness.

Not in scope for v0.1 (deliberately deferred):

RL fine-tuning (Phase 4 of #337).
UI dashboard; Mermaid + JSON only.
Distributed / multi-node execution.
Automatic hyperparameter search.
Importing Imbue's AGPL Python — CLI only, always.

File map

optional-skills/research/darwinian-evolver/
├── SKILL.md                        (full author spec + theoretical foundations)
├── scripts/
│   ├── evolver.py                  (CLI: init/run/status/best/lineage/budget/export/replay)
│   ├── algorithms.py               (ES, MAP-Elites, NSGA-II, Exp3, descriptors)
│   ├── operators.py                (7 LLM operators incl. meta-mutator, plus segment-splice)
│   ├── evaluator.py                (fitness_spec, async batch, held-out guard, halving)
│   ├── sandbox.py                  (subprocess + rlimit + timeout for code eval)
│   ├── storage.py                  (SQLite lineage, content-addressed IDs, hashing)
│   ├── llm.py                      (async OpenAI-compat client + BudgetLedger)
│   └── adapters.py                 (Tier 2 CLI wrappers + Tier 3 DSPy/GEPA export)
├── templates/                      (5 copy-paste fitness templates)
└── demos/summarize_10_words/       (end-to-end demo with deterministic fitness)
tests/skills/test_darwinian_evolver.py   (39 cases)

Diff shape

~2,900 insertions across 17 new files + minor edits to the two files that existed in my first commit. Total skill surface is ~2,000 LOC production, ~600 LOC tests, and ~300 LOC templates/demos.

Note on the directory rename

The repo-level .gitignore includes examples/. To keep the end-to-end walkthrough tracked without touching the ignore list, I placed it under demos/summarize_10_words/. Happy to change the name if maintainers prefer a different convention.

…egex/SQL/code

Adds optional-skills/research/darwinian-evolver, a new skill that evolves text artifacts toward a user-supplied fitness function via LLM-driven mutation and crossover.

Scope of this commit: Day 1+2 of plan (core library + CLI + tests). Day 3+4 will add Tier 2 subprocess adapters (openevolve, Imbue's darwinian-evolver CLI), Tier 3 DSPy bridge polish, code-sandbox, and the end-to-end example under examples/summarize_10_words/.

Core library
------------
* storage.py: SQLite lineage graph; content-addressed genome IDs (blake2b truncated), WAL journaling, idempotent insert, ancestry BFS, budget ledger, lineage_hash for replay determinism checks.
* llm.py: async OpenAI-compat client with seed propagation, bounded Semaphore concurrency, exponential-backoff retry respecting Retry-After, BudgetLedger with hard cap that raises BudgetExceeded mid-run.
* algorithms.py: pure-function primitives:
  - tournament + linear-rank selection
  - (μ+λ)-ES survival (Bäck & Schwefel 1993)
  - MAP-Elites archive with configurable bin grid and behavioral descriptors (Mouret & Clune 2015)
  - NSGA-II: fast non-dominated sort + crowding distance (Deb et al. 2002)
  - Exp3 bandit over operators (Auer et al. 2002)
  - default 2-D prompt descriptor: length × CoT-presence
* operators.py: LLM mutation/crossover: paraphrase, structural_edit, cot_inject, novelty_seeking, critique_then_edit (GEPA-lite), meta_mutate_operator_prompt (PromptBreeder-style), semantic_crossover (Meyerson 2023), segment_splice. Each returns a stable prompt_hash for lineage replay.
* evaluator.py: fitness_spec decorator, dynamic fitness.py import, async batch eval with wall-clock timeout, held-out reward-hacking guard (top-K re-score, gap-based penalty), successive halving (Hyperband-lite).
* evolver.py: CLI entry with init/run/status/best/lineage/budget/export/replay subcommands; JSON stdout, NDJSON progress streaming from run; supports three algorithms (es, map-elites, nsga2).

SKILL.md
--------
Full authoring spec per Hermes conventions, including a Theoretical Foundations section citing the primary literature (Bäck & Schwefel, Mouret & Clune, Deb, Meyerson, Fernando, Agrawal, Lehman & Stanley, Auer, Li et al.), a tier table explaining the MIT/AGPL/bridge split, and explicit scope guardrails.

Tests
-----
tests/skills/test_darwinian_evolver.py: 26 cases, 100% green:
* storage: content-address determinism, idempotent insert, ancestry reconstruction with parent edges, held-out-preferring get_best, budget totals, lineage_hash stability + change detection.
* algorithms: tournament bias (stat check), rank-select pressure validation, (μ+λ) elitism, MAP-Elites place/coverage/empty-sample, NSGA-II front identification on 2-objective pop, strict dominance, crowding-distance boundary infinity, Exp3 reward clamping + weight update.
* evaluator: fitness_spec metadata round-trip, scalar + dict fitness via async batch, timeout → worst-score degradation, held-out guard penalises a synthetic overfitter by >0.5, successive_halving narrows 8 → 2 survivors at the expected fidelity schedule.

References issues
-----------------
* closes NousResearch#336 (scoped v1)
* partial for NousResearch#337 Phase 3
* bridges to NousResearch/hermes-agent-self-evolution via Tier 3 DSPy-jsonl export
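The content-addressed IDs and lineage_hash mentioned for storage.py could work roughly as in the sketch below. The commit message only says "blake2b truncated"; the 8-byte and 16-byte digest sizes, the edge tuple layout, and the use of an empty string as the parent of a seed genome are assumptions for illustration.

```python
import hashlib

Edge = tuple[str, str, str]  # (parent_id, child_id, operator); "" parent = seed


def genome_id(text: str) -> str:
    """Same text always maps to the same ID, so re-inserting is idempotent."""
    return hashlib.blake2b(text.encode("utf-8"), digest_size=8).hexdigest()


def lineage_hash(edges: list[Edge]) -> str:
    """Order-independent digest of the whole lineage graph."""
    h = hashlib.blake2b(digest_size=16)
    for parent, child, operator in sorted(edges):
        h.update(f"{parent}>{child}:{operator}\n".encode("utf-8"))
    return h.hexdigest()


seed = genome_id("Summarize the text in ten words.")
child = genome_id("Summarize the text in exactly ten words.")
edges = [("", seed, "seed"), (seed, child, "paraphrase")]
```

Sorting the edges before hashing means two runs that produce the same graph in a different insertion order still agree on the hash, which is what a replay determinism check needs.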

Completes the skill scaffolded in the previous commit (Day 3-5 of the staged plan).

Added
-----
* scripts/sandbox.py: subprocess sandbox for code candidates:
  - POSIX rlimit caps (CPU, AS, DATA, CORE) applied as best-effort in a preexec_fn; each limit failure is swallowed independently so macOS's RLIMIT_AS-incompatibility doesn't break the others.
  - Wall-clock timeout via subprocess.run(timeout=...) as a backstop.
  - run_candidate_code() and run_pytest_suite() helpers; the latter parses the pytest terse summary into a pass fraction.
* scripts/adapters.py: Tier 2 + Tier 3 wrappers:
  - ExternalEvolverAdapter: lazy shutil.which detection, raises AdapterUnavailable with install hint instead of crashing.
  - openevolve_adapter (Apache 2.0, default Tier 2 recommendation).
  - darwinian_evolver_adapter (Imbue, AGPL v3): subprocess only, never imported, so license-viral code never enters the Hermes process.
  - export_dspy_jsonl: DSPy-compatible offline records with full lineage; default keeps one winner per generation, --all emits every candidate.
  - export_gepa_trace: reflective-operator edges (critique_then_edit, meta_mutator) in the shape GEPA's trainer expects.
* templates/*.py: five copy-paste-ready fitness templates for prompt, regex, SQL, code (uses sandbox), and multi-objective (NSGA-II) runs.
* demos/summarize_10_words/: end-to-end packaged demo:
  - fitness.py: deterministic scoring (word count proximity, brevity keyword, char budget), so the demo runs cheaply without an LLM judge and the improvement curve is visible on a small local model.
  - seed/initial.txt, README.md with exact commands and expected trajectory.

Changed
-------
* scripts/evolver.py: cmd_export now delegates to adapters.export_*; added --all flag for dspy-jsonl exports; imports adapters module.

Tests (tests/skills/test_darwinian_evolver.py)
----------------------------------------------
Expanded from 26 to 39 cases, all green:
* TestSandbox: simple candidate runs, syntax error fails cleanly, runaway while-True loop killed within wall-clock+overhead, pytest terse-summary parser.
* TestAdapterGracefulAbsence: openevolve + darwinian-evolver adapters raise AdapterUnavailable with license-informed install hint when the binary is missing (monkeypatched shutil.which).
* TestDspyBridge: default export keeps one record per generation; --all mode emits every candidate; GEPA export filters to reflective operators and preserves parent/child metadata.
* TestLLMClient: seed propagation verified by intercepting the AsyncClient.post body; BudgetLedger records spend across calls and raises BudgetExceeded when cap is crossed.
* TestMapElitesCoverage: random 200-sample run fills ≥60 % of a 4×4 descriptor grid (acceptance checklist item 4).
* TestEndToEnd: single generation with a fully-mocked LLM produces an offspring strictly better than the seed and yields a stable lineage hash (acceptance checklist item 7, plus determinism smoke).

Acceptance checklist status: 10/10 covered.

Repo-level notes
----------------
The original examples/ subdirectory is renamed demos/ because the repository-level .gitignore lists ``examples/``; keeping the demo as ``demos/`` means it ships to users who install the skill without requiring an exception in .gitignore.
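The sandbox approach described in this commit (best-effort rlimits in a preexec_fn plus a wall-clock backstop) can be sketched as below. This is a simplified illustration, not scripts/sandbox.py: it sets only CPU and CORE limits, the return shape is invented for the example, and it applies no limits at all on platforms without the `resource` module.

```python
import subprocess
import sys

try:
    import resource  # POSIX only; absent on Windows
except ImportError:
    resource = None


def _make_preexec(cpu_s: int):
    def preexec() -> None:
        # Each limit is attempted independently so one failure doesn't skip the rest.
        for limit, value in ((resource.RLIMIT_CPU, cpu_s), (resource.RLIMIT_CORE, 0)):
            try:
                resource.setrlimit(limit, (value, value))
            except (ValueError, OSError):
                pass
    return preexec


def run_candidate_code(code: str, cpu_s: int = 2, wall_s: float = 5.0) -> dict:
    """Run code in a child interpreter; report success, stdout, and a reason."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode
            capture_output=True, text=True, timeout=wall_s,
            preexec_fn=_make_preexec(cpu_s) if resource else None,
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "stdout": "", "reason": "wall-clock timeout"}
    reason = "exit 0" if proc.returncode == 0 else f"exit {proc.returncode}"
    return {"ok": proc.returncode == 0, "stdout": proc.stdout, "reason": reason}
```

A busy loop is stopped by whichever limit fires first: the kernel's CPU limit or the parent's wall-clock timeout. A sleeping process uses no CPU time, which is why the wall-clock backstop is still needed.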

…dge · critic

Professional-grade upgrade of the skill landed in the previous two commits on this branch. Four cohesive features that move the skill from hobby-grade to production:

1. LLM response cache
-----------------------
* New `scripts/cache.py`: content-addressed SQLite cache keyed by a blake2b-16 of the normalised request body (model, temp, max_tokens, seed, messages). Backed by the existing `lineage.db` so the cache ships with the experiment directory.
* `llm.LLMClient` gains an optional `cache` field; `complete()` checks the cache before every HTTP call and writes on miss. Cache hits short-circuit the network AND the BudgetLedger, so reruns are zero cost and the budget accounting stays accurate.
* `evolver run` enables the cache by default; `--no-cache` disables it.
* New subcommand: `evolver cache <dir> stats|purge`.
* Acceptance: `test_end_to_end_replay_is_bit_identical`. A second run with the cache populated makes ZERO HTTP calls (monkeypatched `httpx.post` raises on invocation) and produces an identical `lineage_hash`.

2. Pairwise judge + Bradley-Terry MLE
---------------------------------------
* New `scripts/judge.py` implements:
  * `aggregate_bradley_terry()`: Hunter 2004 MM iteration (pure stdlib, ~30 LOC) with Laplace smoothing for candidates that never won or never lost.
  * `sample_pair_schedule()`: round-robin for small populations; least-seen-anchor sampling for larger ones. Invariant: every candidate appears in at least floor(rounds × 2 / pop) matches.
  * `PairwiseJudge`: LLM-backed judge with a position-bias guard that randomly swaps LEFT / RIGHT per call; decodes first-line verdicts robustly (LEFT / RIGHT / TIE with fallbacks).
* `fitness_spec` gains `judge="pairwise"`, `pairwise_rounds=40`.
* `evaluator.evaluate_pairwise()` runs the schedule, records votes in `pairwise_votes`, solves the MLE, writes `bt_scores`, and copies log-odds onto `Individual.fitness` so the existing selector pipeline (tournament / MAP-Elites / (μ+λ)) works unchanged.
* Incompatibility with NSGA-II is enforced explicitly; the runner errors out early instead of silently corrupting state.
* Acceptance: `test_condorcet_order_recovered_through_mocked_judge`. A judge that prefers lex-higher genomes, combined with the MLE, recovers the full six-candidate ranking through 30 pair rounds.

3. Constitutional reward-hacking critic
-----------------------------------------
* New `scripts/critic.py`: `ConstitutionalCritic` runs a second LLM over the top-K of every generation with a structured JSON contract ({"risk", "evidence", "signal_tags"}).
* `templates/constitution.md`: a default rule-set covering literal short-circuits, judge flattery, regex over-matching, test-harness exploits, spurious correctness, and brittle templates.
* Penalty applies ONLY to the in-memory `Individual.fitness` that feeds selection; the raw `fitness` SQLite row is untouched, keeping the audit trail pristine.
* `fitness_spec` gains `critic="on"|"off"`, `critic_threshold`, `critic_top_k`, and `critic_model` (cheap-model override).
* Runner hook fires after both seed evaluation and each offspring generation; JSON parse failures soft-fail to risk=0 so a noisy judge can't crash a run.

4. FastAPI + Plotly dashboard
-------------------------------
* New `scripts/dashboard.py`: read-only FastAPI app with endpoints GET /api/{summary, fitness, pareto, lineage/{cid}, operators} and WS /api/stream (polls `lineage.db` and pushes a JSON event when a new generation lands).
* `templates/dashboard.html`: single-page, vanilla JS, Plotly and Mermaid from CDN; no build step, no npm.
* New subcommand: `evolver dashboard <dir> [--host 127.0.0.1 --port 8787]`.
* Binds loopback by default; non-loopback hosts print a warning.
* Graceful absence: `dashboard` command prints an install hint and exits non-zero if fastapi/uvicorn aren't installed. Every other subcommand continues to work.

Storage schema
--------------
Four new tables, all additive (`CREATE TABLE IF NOT EXISTS`), so v0.1 experiments open and run unchanged:
* llm_cache (key, response, tokens, model, created_at)
* pairwise_votes (generation, left, right, winner, seed)
* bt_scores (candidate, generation, log_odds, iters)
* critic_evaluations (candidate, generation, risk, evidence, signal_tags, model, evaluated_at)

Tests
-----
Four new test files, one per feature (31 new cases, all green):
tests/skills/test_darwinian_evolver_v02_cache.py       8 cases
tests/skills/test_darwinian_evolver_v02_judge.py      12 cases
tests/skills/test_darwinian_evolver_v02_critic.py      6 cases
tests/skills/test_darwinian_evolver_v02_dashboard.py   5 cases

Full darwinian-evolver suite: 39 (v0.1) + 31 (v0.2) = 70, all green. No regression in existing repo tests (full tests/skills/ run: 170/170).

Acceptance checklist (rows 11 + 12 from the plan): both pass.

Invariants maintained
---------------------
* Every v0.1 behaviour is preserved; new features are opt-in (except the cache, which is transparent and correctness-preserving).
* `lineage_hash` gains determinism it didn't have before: with the cache, identical seeds produce identical hashes across runs.

Bihruze · 2026-04-19T20:32:07Z

v0.2 changelog (pushed as `1deec86`)

Layered on top of the two v0.1 commits — four cohesive features that move the skill from credible-weekend to production-grade. All opt-in except the cache, which is transparent.

Summary

#	Feature	Why it matters	New LOC
1	LLM response cache — content-addressed SQLite, keyed by blake2b-16 of the normalised request body	Replay is now bit-for-bit deterministic at zero cost. Cache hits skip both the HTTP call and the budget ledger.	`cache.py` (150) + edits
2	Bradley-Terry pairwise judge — Hunter (2004) MM aggregator over position-bias-guarded pairwise verdicts	Absolute LLM-as-judge scores drift (Zheng 2023, Chen 2024). Pairwise preferences are invariant to that drift.	`judge.py` (250) + evaluator wiring
3	Constitutional critic — second LLM inspects top-K for reward hacking per a markdown rule-set	Held-out guard catches train/test gaps; the critic catches generalised cheating (judge flattery, regex over-matching, harness exploits).	`critic.py` (180) + `constitution.md`
4	FastAPI + Plotly dashboard — read-only live view with Mermaid lineage	Runs become visible — fitness curves, Pareto, operator attribution, budget burndown — with WebSocket push on new generations.	`dashboard.py` (230) + `dashboard.html` (180)
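For feature 2, the Bradley-Terry aggregation can be sketched with Hunter's MM update, where each strength is set to the candidate's win total divided by the sum over opponents of (comparisons with that opponent) / (sum of the two current strengths). The smoothing scheme below (a small pseudo-count added to every ordered pair) and the input format are assumptions; the PR's own `aggregate_bradley_terry()` may differ.

```python
import math


def aggregate_bradley_terry(wins: dict[tuple[int, int], int], n: int,
                            iters: int = 200, smoothing: float = 0.5) -> list[float]:
    """wins[(i, j)] = number of times i beat j. Returns one log-strength per candidate."""
    if n < 2:
        return [0.0] * n
    # Pseudo-counts keep strengths finite for candidates that never won or never lost.
    w = [[smoothing if i != j else 0.0 for j in range(n)] for i in range(n)]
    for (i, j), count in wins.items():
        w[i][j] += count
    p = [1.0] * n
    for _ in range(iters):
        updated = []
        for i in range(n):
            total_wins = sum(w[i])
            denom = sum((w[i][j] + w[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            updated.append(total_wins / denom)
        scale = sum(updated)
        p = [x * n / scale for x in updated]  # normalise so strengths average 1
    return [math.log(x) for x in p]
```

The returned log-strengths can be used directly as scalar fitness, since only their ordering and differences matter to tournament or rank selection.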

CLI additions

evolver run <dir> [--no-cache]
evolver cache <dir> stats|purge
evolver dashboard <dir> [--host 127.0.0.1 --port 8787]

`fitness_spec` extensions

@fitness_spec(
    judge="pairwise", pairwise_rounds=40,        # feature 2
    critic="on", critic_threshold=0.5,           # feature 3
    critic_top_k=5, critic_model="haiku-cheap",  # feature 3 — cheap-model route
)
def fitness(candidate, context): ...

Storage

Four additive tables (CREATE TABLE IF NOT EXISTS — v0.1 experiments auto-upgrade): llm_cache, pairwise_votes, bt_scores, critic_evaluations.

Tests

31 new cases across four test files, all green:

test_darwinian_evolver_v02_cache.py       8 cases
test_darwinian_evolver_v02_judge.py      12 cases
test_darwinian_evolver_v02_critic.py      6 cases
test_darwinian_evolver_v02_dashboard.py   5 cases

Full darwinian-evolver suite: 39 (v0.1) + 31 (v0.2) = 70/70 green. No regression in other skills.

Acceptance rows 11 (replay bit-identity) and 12 (judge calibration via Condorcet recovery) from the v0.2 plan both pass.

License / safety

Dashboard binds 127.0.0.1 by default; --host non-loopback prints a prominent warning. No auth shipped.
FastAPI / uvicorn are optional deps; dashboard subcommand fails closed with an install hint when absent.
Constitutional critic is off by default.
Cache is per-experiment (scoped to its lineage.db), so cross-experiment leakage is structurally impossible.

What's explicitly out of scope for v0.2

No RL fine-tuning (that's Phase 4 of #337); no dashboard auth/write endpoints; no distributed workers; no automatic hyperparameter tuning.

Happy to split the commit into four (one per feature) if that eases review.

Bihruze · 2026-04-19T21:28:34Z

Closing while v1.0 (A1-C5) lands locally; will reopen with the full roadmap once Bihruze reviews. See /Users/seher/.claude/plans/nested-gathering-kazoo.md for the plan.

Single commit bundling the rest of the v1.0 roadmap (phases 2-5 of the approved plan). Per-feature structure preserved in this message; file diffs in storage.py / SKILL.md cross-cut features and cannot be split without interactive rebase.

================================================================
PHASE 2: Research core (v0.4)
================================================================

A2 self-modifying bandit: scripts/bandit_director.py
------------------------------------------------------
LLM proposes add / retire / merge actions on the Exp3 arm set every R generations; safety rails (consecutive-floor requirement for retire, max_arms cap for add, no-op on missing arms) keep a bad LLM reply from corrupting the bandit. DynamicOperator wraps a prompt template without evaluating Python. New `generated_operators` table records the library so runs replay deterministically.

A3 co-evolution: scripts/coevolve.py
-------------------------------------
Dual populations (solvers vs adversaries) alternate steps; the adversary's fitness is the fraction of solvers it defeats on a threshold-0.5 test. Red-Queen dynamic bounded by max_adversary_generations. `red_team_inputs` table audits every adversarial input.

A4 auto-fitness synthesis: scripts/fitness_synth.py
----------------------------------------------------
Given ≤20 labelled I/O pairs, an LLM picks one of three archetypes (exact / soft / judge) and we emit a runnable fitness.py text. Meta-APE pattern; user reviews before accepting. Soft archetype ships a pure-stdlib Levenshtein ratio; judge archetype wires into the existing LLMClient without extra deps.

================================================================
PHASE 3: Scale + Safety (v0.5)
================================================================

B1 distributed: scripts/distributed.py
---------------------------------------
WorkerBackend protocol with three implementations:
* LocalBackend: v0.2 asyncio Semaphore path
* RaySimBackend: stdlib shim mimicking Ray's API
* RayBackend: real Ray (lazy import behind optional extra)
`select_backend()` factory dispatches; graceful absence with a clear install hint when ray isn't on PATH.

B4 sandbox backends: scripts/sandbox_wasm.py · sandbox_firecracker.py
----------------------------------------------------------------------
* WasmSandbox: wasmtime-py, cross-platform, zero file/net exposure. Graceful availability check.
* FirecrackerSandbox: Linux-only; KVM + firecracker binary + user kernel/rootfs paths. Fails closed with the right message on macOS.

B2 nightly repo sweep: scripts/repo_sweep.py + .github/workflows/darwinian-evolver-nightly.yml
------------------------------------------------------------------
Discovers every SKILL.md in the repo, runs a baseline-score + (dry-run-only for v0.5 ship) evolve pass, writes a JSON report. Per-skill 72h cooldown prevents PR churn. Workflow runs at 03:17 UTC nightly + `workflow_dispatch` with `dry_run` input. Opens NO PRs this release: the workflow uploads a report artefact only; PR creation ships in a follow-up once maintainers vet the fitness proxy.

================================================================
PHASE 4: Transfer + Distill (v0.6)
================================================================

A5 cross-task transfer: scripts/task_features.py + transfer.py
---------------------------------------------------------------
9-D task-feature vector (fitness surface size, objectives, judge mode, critic flag, archetype hash, char-n-gram entropy, …). k-NN meta-policy over cosine distance; pickleable for `--transfer-from`. New `task_features` table records per-experiment features + policy hash for reproducibility.

B5 evolve → distill: scripts/distill.py
----------------------------------------
Thin LoRA fine-tune pipeline over `transformers + peft + accelerate` (optional deps group `darwinian-evolver-distill`). Teacher-callback indirection keeps the provider swap easy. Availability check raises a helpful DistillUnavailable when deps absent.

================================================================
PHASE 5: Ecosystem (v1.0)
================================================================

C1 benchmark hub: scripts/bench.py
-----------------------------------
Registry + 3 canonical fitnesses (email-regex/v1, ten-word-summary/v1, sql-select-easy/v1); `score_archive(conn, id)` re-scores a lineage.db's top-K against a registered benchmark.

C2 cross-model validation: scripts/validate.py
-----------------------------------------------
`cross_model_validate` re-scores candidates under a user-supplied target scorer coroutine and reports Spearman-ρ between local and target rankings.

C5 forkable marketplace: scripts/marketplace.py
------------------------------------------------
Layer on v0.3 hub: `prepare_listing` makes a tarball, `fork_listing` extracts into a warm-start experiment dir, `listing_summary` inspects a tarball without extracting.

Deferred (documented in SKILL.md):
* C3 HITL dashboard editor
* C4 VS Code extension (TypeScript, separate package)

================================================================
Storage schema
================================================================
Five new tables, all additive:
* `generated_operators`: A2, name/template/temperature/retired_at
* `red_team_inputs`: A3, adversarial input corpus
* `fitness_syntheses`: A4, history of synthesised fitnesses
* `task_features`: A5, per-experiment feature vector
(The A1 `descriptor_history` and B3 `hub_imports` already shipped in v0.3.)

================================================================
Tests: 58 new cases, all green
================================================================
tests/skills/test_darwinian_evolver_v04.py      25 (A2+A3+A4+storage)
tests/skills/test_darwinian_evolver_v05.py      14 (B1+B4+B2)
tests/skills/test_darwinian_evolver_v06_v1.py   19 (A5+B5+C1+C2+C5)

Cumulative suite status:
* v0.1: 39
* v0.2: 31
* v0.3: 20
* v0.4: 25
* v0.5: 14
* v0.6 + v1.0: 19
Total: 148 / 148 green. No regression in other skills.

================================================================
What's explicitly NOT in this commit
================================================================
* evolver.py CLI hooks for A2-A5: library-level only this round; wiring into `_run_loop` / subcommands ships when the runner refactor for Phase 2 lands (cleaner than weaving 5 hooks into the existing single-path loop right now).
* Ablation experiment scripts under `experiments/phase*/`: blocked on GPU budget; scripts land when cluster access is confirmed.
* C3 dashboard HITL edit endpoint and C4 IDE extension.

================================================================
Branch discipline
================================================================
LOCAL ONLY. The branch is 5 commits ahead of origin/main and NOT pushed. PR NousResearch#12633 is closed; Bihruze will re-open with the full v1.0 history once this commit passes his review.

Bihruze added 3 commits April 19, 2026 18:59

Bihruze closed this Apr 19, 2026

Bihruze wants to merge 3 commits into NousResearch:main from Bihruze:feat/darwinian-evolver
