Skip to content

feat(skill): darwinian-evolver — evolutionary optimizer for prompts, regex, SQL, and code#12633

Closed
Bihruze wants to merge 3 commits into
NousResearch:mainfrom
Bihruze:feat/darwinian-evolver
Closed

feat(skill): darwinian-evolver — evolutionary optimizer for prompts, regex, SQL, and code#12633
Bihruze wants to merge 3 commits into
NousResearch:mainfrom
Bihruze:feat/darwinian-evolver

Conversation

@Bihruze

@Bihruze Bihruze commented Apr 19, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds a new optional skill — darwinian-evolver — that evolves text artifacts (prompts, regexes, SQL queries, small code snippets) toward a user-supplied fitness function via LLM-driven mutation and crossover over a quality-diversity archive.

References issues:

No other skill or tool in the repo performs prompt / code optimization today (verified by scanning skills/ and optional-skills/ for DSPy / GEPA / evolutionary keywords).

Architecture — three tiers

┌──────────────────────────────────────────────────────────────────────┐
│ Tier 3: bridge                                                        │
│   export lineage → DSPy-compatible JSONL / GEPA reflective trace      │
│                 → hermes-agent-self-evolution ingests                 │
├──────────────────────────────────────────────────────────────────────┤
│ Tier 2: heavy (opt-in, AGPL-isolated)                                 │
│   subprocess wrapper around openevolve CLI (Apache 2.0)               │
│   subprocess wrapper around darwinian-evolver CLI (Imbue, AGPL v3)    │
│   NO python import — mere aggregation — license-clean                 │
├──────────────────────────────────────────────────────────────────────┤
│ Tier 1: core (MIT, default)                                           │
│   (μ+λ)-ES · MAP-Elites · NSGA-II · tournament + rank selection       │
│   LLM operators: paraphrase · structural-edit · semantic-crossover    │
│                   · CoT-inject · novelty-seeking · critique-then-edit │
│                   · PromptBreeder-style meta-mutator                  │
│   Evaluator: async batch · successive halving · held-out guard        │
│   Storage: SQLite lineage · content-addressed genomes · Mermaid DAGs  │
└──────────────────────────────────────────────────────────────────────┘
              reuses Hermes's existing OpenAI-compat client
              (prompt caching aware; respects per-slot context from #12595)

Tier 1 has no non-stdlib runtime deps beyond httpx (already in Hermes core). Tiers 2 and 3 are thin adapters that fail gracefully when their externals are absent.

Algorithmic core (with citations)

Primitive Reference
(μ+λ)-ES survival Bäck & Schwefel 1993
MAP-Elites quality-diversity archive Mouret & Clune 2015
NSGA-II fast non-dominated sort + crowding distance Deb et al. 2002
LLM semantic crossover Meyerson et al. 2023
Self-referential meta-mutator Fernando et al. 2023 (PromptBreeder)
Reflective critique-then-edit Agrawal et al. 2025 (GEPA)
Novelty-seeking mutation bias Lehman & Stanley 2011
Exp3 bandit over operators Auer et al. 2002
Successive halving (Hyperband-lite) Li et al. 2017

All primitives are implemented from scratch (no large dependencies) and exposed as pure-function library code in scripts/algorithms.py; evolver.py composes them.

Fitness contract

Users drop a fitness.py into their experiment directory:

from evolver_sdk import fitness_spec

@fitness_spec(held_out_frac=0.2, timeout_s=30, objectives=[\"accuracy\", \"cost\"])
def fitness(candidate: str, context: dict) -> float | dict[str, float]:
    ...

The evaluator guarantees:

  • Seed propagation (context[\"seed\"]) for reproducible LLM calls (OpenAI / Anthropic / vLLM all honor seed as a hint or hard param).
  • Code candidates execute in a subprocess sandbox with resource.setrlimit caps (CPU, address space, data segment) plus a wall-clock timeout.
  • A reward-hacking guard re-scores the top-K on a held-out split each generation; candidates with a >15 % generalization gap take a penalty so honest candidates outrank overfitters.
  • A global --budget (USD or tokens via rate args) hard-kills runs via BudgetExceeded before blowing past the cap.

License handling

Imbue's darwinian-evolver is AGPL v3. This PR never imports it. The ExternalEvolverAdapter wraps it as an opaque subprocess invocation ("mere aggregation" exemption) and gates on shutil.which, raising a clear AdapterUnavailable with an install hint when absent. The install hint surfaces the license explicitly ("AGPL v3 — review license before use"). The default Tier 2 backend is OpenEvolve (Apache 2.0) so users never touch AGPL unless they explicitly opt in.

Validation

tests/skills/test_darwinian_evolver.py39 cases, all green locally:

area cases covers
storage 6 content-address determinism, idempotent insert, ancestry reconstruction, held-out-preferring best lookup, budget totals, lineage_hash stability + change detection
algorithms 9 tournament bias (statistical), rank-select invariant, (μ+λ) elitism, MAP-Elites place/coverage/empty-sample, NSGA-II front identification, strict dominance, crowding-distance boundary infinity, Exp3 reward clamp + weight update
evaluator 4 fitness_spec metadata round-trip, scalar + dict fitness via async batch, timeout → worst-score fallback
reward hacking 1 overfitter penalised >0.5 vs honest 0.0 penalty
successive halving 1 8 → 4 → 2 narrowing at the expected fidelity schedule
sandbox 4 simple program, syntax error, runaway while True killed within wall-clock, pytest summary parser
adapters 5 openevolve + Imbue graceful absence with license hint, DSPy-jsonl default-best-per-gen, DSPy-jsonl --all, GEPA trace filters to reflective edges
LLM client 2 seed propagated into request body, BudgetLedger raises at cap across multi-call runs
MAP-Elites coverage 1 random 200 samples fill ≥60 % of 4×4 grid
end-to-end (mocked LLM) 1 single generation improves over seed + stable lineage hash

Acceptance checklist (10/10)

  1. Determinism — ✅ lineage_hash stability asserted.
  2. Budget hard-kill — ✅ BudgetExceeded verified.
  3. Multi-objective correctness — ✅ NSGA-II front non-domination asserted.
  4. MAP-Elites coverage — ✅ ≥60 % on random 200-sample run.
  5. Sandbox kills runaway — ✅ while True killed within wall-clock + overhead.
  6. Held-out guard — ✅ overfitter penalty ≥0.5.
  7. End-to-end improves — ✅ mocked-LLM generation improves fitness over seed.
  8. Determinism of LLM ops — ✅ seed in request body asserted.
  9. Tier 2 isolation — ✅ both adapters raise AdapterUnavailable with install hint.
  10. CI regression — full tests/skills/test_darwinian_evolver.py green.

Scope — explicit guardrails

Shipped:

  • Tier 1 core, 7 LLM operators, MAP-Elites + ES + NSGA-II, held-out + budget guards, SQLite lineage, Mermaid export, replay.
  • Tier 2 adapters (openevolve, darwinian-evolver).
  • Tier 3 DSPy + GEPA exports.
  • 5 fitness templates (prompt / regex / sql / code / multi-objective).
  • End-to-end summarize_10_words demo with deterministic fitness.

Not in scope for v0.1 (deliberately deferred):

File map

optional-skills/research/darwinian-evolver/
├── SKILL.md                        (full author spec + theoretical foundations)
├── scripts/
│   ├── evolver.py                  (CLI: init/run/status/best/lineage/budget/export/replay)
│   ├── algorithms.py               (ES, MAP-Elites, NSGA-II, Exp3, descriptors)
│   ├── operators.py                (7 LLM operators + segment-splice + meta-mutator)
│   ├── evaluator.py                (fitness_spec, async batch, held-out guard, halving)
│   ├── sandbox.py                  (subprocess + rlimit + timeout for code eval)
│   ├── storage.py                  (SQLite lineage, content-addressed IDs, hashing)
│   ├── llm.py                      (async OpenAI-compat client + BudgetLedger)
│   └── adapters.py                 (Tier 2 CLI wrappers + Tier 3 DSPy/GEPA export)
├── templates/                      (5 copy-paste fitness templates)
└── demos/summarize_10_words/       (end-to-end demo with deterministic fitness)
tests/skills/test_darwinian_evolver.py   (39 cases)

Diff shape

~2,900 insertions across 17 new files + minor edits to the two files that existed in my first commit. Total skill surface is ~2,000 LOC production, ~600 LOC tests, and ~300 LOC templates/demos.

Note on the directory rename

The repo-level .gitignore includes examples/. To keep the end-to-end walkthrough tracked without touching the ignore list, I placed it under demos/summarize_10_words/. Happy to change the name if maintainers prefer a different convention.

Bihruze added 3 commits April 19, 2026 18:59
…egex/SQL/code

Adds optional-skills/research/darwinian-evolver, a new skill that evolves
text artifacts toward a user-supplied fitness function via LLM-driven
mutation and crossover.

Scope of this commit (Day 1+2 of plan — core library + CLI + tests).
Day 3+4 will add Tier 2 subprocess adapters (openevolve, Imbue's
darwinian-evolver CLI), Tier 3 DSPy bridge polish, code-sandbox, and
the end-to-end example under examples/summarize_10_words/.

Core library
------------
* storage.py    — SQLite lineage graph; content-addressed genome IDs
                  (blake2b truncated), WAL journaling, idempotent
                  insert, ancestry BFS, budget ledger, lineage_hash for
                  replay determinism checks.
* llm.py        — async OpenAI-compat client with seed propagation,
                  bounded Semaphore concurrency, exponential-backoff
                  retry respecting Retry-After, BudgetLedger with hard
                  cap that raises BudgetExceeded mid-run.
* algorithms.py — pure-function primitives:
                  - tournament + linear-rank selection
                  - (μ+λ)-ES survival (Bäck & Schwefel 1993)
                  - MAP-Elites archive with configurable bin grid
                    and behavioral descriptors (Mouret & Clune 2015)
                  - NSGA-II: fast non-dominated sort + crowding
                    distance (Deb et al. 2002)
                  - Exp3 bandit over operators (Auer et al. 2002)
                  - default 2-D prompt descriptor: length × CoT-presence
* operators.py  — LLM mutation/crossover:
                  paraphrase, structural_edit, cot_inject,
                  novelty_seeking, critique_then_edit (GEPA-lite),
                  meta_mutate_operator_prompt (PromptBreeder-style),
                  semantic_crossover (Meyerson 2023), segment_splice.
                  Each returns a stable prompt_hash for lineage replay.
* evaluator.py  — fitness_spec decorator, dynamic fitness.py import,
                  async batch eval with wall-clock timeout, held-out
                  reward-hacking guard (top-K re-score, gap-based
                  penalty), successive halving (Hyperband-lite).
* evolver.py    — CLI entry: init/run/status/best/lineage/budget/
                  export/replay subcommands; JSON stdout, NDJSON
                  progress streaming from run; supports three
                  algorithms (es, map-elites, nsga2).

SKILL.md
--------
Full authoring spec per Hermes conventions, including a Theoretical
Foundations section citing the primary literature (Bäck & Schwefel,
Mouret & Clune, Deb, Meyerson, Fernando, Agrawal, Lehman & Stanley,
Auer, Li et al.), a tier table explaining the MIT/AGPL/bridge split,
and explicit scope guardrails.

Tests
-----
tests/skills/test_darwinian_evolver.py — 26 cases, 100% green:
* storage: content-address determinism, idempotent insert, ancestry
  reconstruction with parent edges, held-out-preferring get_best,
  budget totals, lineage_hash stability + change detection.
* algorithms: tournament bias (stat check), rank-select pressure
  validation, (μ+λ) elitism, MAP-Elites place/coverage/empty-sample,
  NSGA-II front identification on 2-objective pop, strict dominance,
  crowding-distance boundary infinity, Exp3 reward clamping + weight
  update.
* evaluator: fitness_spec metadata round-trip, scalar + dict fitness
  via async batch, timeout → worst-score degradation, held-out guard
  penalises a synthetic overfitter by >0.5, successive_halving
  narrows 8 → 2 survivors at the expected fidelity schedule.

References issues
-----------------
* closes NousResearch#336 (scoped v1)
* partial for NousResearch#337 Phase 3
* bridges to NousResearch/hermes-agent-self-evolution via Tier 3
  DSPy-jsonl export
Completes the skill scaffolded in the previous commit (Day 3-5 of the
staged plan).

Added
-----
* scripts/sandbox.py — subprocess sandbox for code candidates:
  - POSIX rlimit caps (CPU, AS, DATA, CORE) applied as best-effort in
    a preexec_fn; each limit failure is swallowed independently so
    macOS's RLIMIT_AS-incompatibility doesn't break the others.
  - Wall-clock timeout via subprocess.run(timeout=...) as a backstop.
  - run_candidate_code() and run_pytest_suite() helpers; the latter
    parses the pytest terse summary into a pass fraction.

* scripts/adapters.py — Tier 2 + Tier 3 wrappers:
  - ExternalEvolverAdapter: lazy shutil.which detection, raises
    AdapterUnavailable with install hint instead of crashing.
  - openevolve_adapter (Apache 2.0, default Tier 2 recommendation).
  - darwinian_evolver_adapter (Imbue, AGPL v3) — subprocess only,
    never imported, so license-viral code never enters the Hermes
    process.
  - export_dspy_jsonl: DSPy-compatible offline records with full
    lineage; default keeps one winner per generation, --all emits every
    candidate.
  - export_gepa_trace: reflective-operator edges (critique_then_edit,
    meta_mutator) in the shape GEPA's trainer expects.

* templates/*.py — five copy-paste-ready fitness templates for prompt,
  regex, SQL, code (uses sandbox), and multi-objective (NSGA-II) runs.

* demos/summarize_10_words/ — end-to-end packaged demo:
  - fitness.py: deterministic scoring (word count proximity, brevity
    keyword, char budget), so the demo runs cheaply without an LLM
    judge and the improvement curve is visible on a small local model.
  - seed/initial.txt, README.md with exact commands and expected
    trajectory.

Changed
-------
* scripts/evolver.py: cmd_export now delegates to adapters.export_*;
  added --all flag for dspy-jsonl exports; imports adapters module.

Tests (tests/skills/test_darwinian_evolver.py)
----------------------------------------------
Expanded from 26 to 39 cases — all green:

* TestSandbox — simple candidate runs, syntax error fails cleanly,
  runaway while-True loop killed within wall-clock+overhead, pytest
  terse-summary parser.
* TestAdapterGracefulAbsence — openevolve + darwinian-evolver adapters
  raise AdapterUnavailable with license-informed install hint when the
  binary is missing (monkeypatched shutil.which).
* TestDspyBridge — default export keeps one record per generation;
  --all mode emits every candidate; GEPA export filters to reflective
  operators and preserves parent/child metadata.
* TestLLMClient — seed propagation verified by intercepting the
  AsyncClient.post body; BudgetLedger records spend across calls and
  raises BudgetExceeded when cap is crossed.
* TestMapElitesCoverage — random 200-sample run fills ≥60 % of a 4×4
  descriptor grid (acceptance checklist NousResearch#4).
* TestEndToEnd — single generation with a fully-mocked LLM produces an
  offspring strictly better than the seed and yields a stable lineage
  hash (acceptance checklist NousResearch#7, plus determinism smoke).

Acceptance checklist status: 10/10 covered.

Repo-level notes
----------------
The original examples/ subdirectory is renamed demos/ because the
repository-level .gitignore lists ``examples/``; keeping the demo as
``demos/`` means it ships to users who install the skill without
requiring an exception in .gitignore.
…dge · critic

Professional-grade upgrade of the skill landed in the previous two
commits on this branch. Four cohesive features that move the skill
from hobby-grade to production:

1. LLM response cache
-----------------------
* New `scripts/cache.py` — content-addressed SQLite cache keyed by a
  blake2b-16 of the normalised request body (model, temp, max_tokens,
  seed, messages). Backed by the existing `lineage.db` so the cache
  ships with the experiment directory.
* `llm.LLMClient` gains an optional `cache` field; `complete()` checks
  the cache before every HTTP call and writes on miss. Cache hits
  short-circuit the network AND the BudgetLedger, so reruns are zero
  cost and the budget accounting stays accurate.
* `evolver run` enables the cache by default; `--no-cache` disables it.
* New subcommand: `evolver cache <dir> stats|purge`.
* Acceptance: `test_end_to_end_replay_is_bit_identical` — a second run
  with the cache populated makes ZERO HTTP calls (monkeypatched
  `httpx.post` raises on invocation) and produces an identical
  `lineage_hash`.

2. Pairwise judge + Bradley-Terry MLE
---------------------------------------
* New `scripts/judge.py` — implements:
  * `aggregate_bradley_terry()` — Hunter 2004 MM iteration (pure
    stdlib, ~30 LOC) with Laplace smoothing for candidates that
    never won or never lost.
  * `sample_pair_schedule()` — round-robin for small populations;
    least-seen-anchor sampling for larger ones. Invariant: every
    candidate appears in at least ceil(rounds × 2 / pop) matches.
  * `PairwiseJudge` — LLM-backed judge with a position-bias guard
    that randomly swaps LEFT / RIGHT per call; decodes first-line
    verdicts robustly (LEFT / RIGHT / TIE with fallbacks).
* `fitness_spec` gains `judge="pairwise"`, `pairwise_rounds=40`.
* `evaluator.evaluate_pairwise()` runs the schedule, records votes
  in `pairwise_votes`, solves the MLE, writes `bt_scores`, and copies
  log-odds onto `Individual.fitness` so the existing selector pipeline
  (tournament / MAP-Elites / (μ+λ)) works unchanged.
* Incompatibility with NSGA-II is enforced explicitly; the runner
  errors out early instead of silently corrupting state.
* Acceptance: `test_condorcet_order_recovered_through_mocked_judge` —
  a judge that prefers lex-higher genomes, combined with the MLE,
  recovers the full six-candidate ranking through 30 pair rounds.

3. Constitutional reward-hacking critic
-----------------------------------------
* New `scripts/critic.py` — `ConstitutionalCritic` runs a second LLM
  over the top-K of every generation with a structured JSON contract
  ({"risk", "evidence", "signal_tags"}).
* `templates/constitution.md` — a default rule-set covering literal
  short-circuits, judge flattery, regex over-matching, test-harness
  exploits, spurious correctness, and brittle templates.
* Penalty applies ONLY to the in-memory `Individual.fitness` that
  feeds selection; the raw `fitness` SQLite row is untouched, keeping
  the audit trail pristine.
* `fitness_spec` gains `critic="on"|"off"`, `critic_threshold`,
  `critic_top_k`, and `critic_model` (cheap-model override).
* Runner hook fires after both seed evaluation and each offspring
  generation; JSON parse failures soft-fail to risk=0 so a noisy
  judge can't crash a run.

4. FastAPI + Plotly dashboard
-------------------------------
* New `scripts/dashboard.py` — read-only FastAPI app with endpoints
  GET /api/{summary, fitness, pareto, lineage/{cid}, operators} and
  WS /api/stream (polls `lineage.db` and pushes a JSON event when
  a new generation lands).
* `templates/dashboard.html` — single-page, vanilla JS, Plotly and
  Mermaid from CDN; no build step, no npm.
* New subcommand: `evolver dashboard <dir> [--host 127.0.0.1 --port 8787]`.
* Binds loopback by default; non-loopback hosts print a warning.
* Graceful absence: `dashboard` command prints an install hint and
  exits non-zero if fastapi/uvicorn aren't installed. Every other
  subcommand continues to work.

Storage schema
--------------
Four new tables, all additive (`CREATE TABLE IF NOT EXISTS`), so v0.1
experiments open and run unchanged:
  * llm_cache (key, response, tokens, model, created_at)
  * pairwise_votes (generation, left, right, winner, seed)
  * bt_scores (candidate, generation, log_odds, iters)
  * critic_evaluations (candidate, generation, risk, evidence,
    signal_tags, model, evaluated_at)

Tests
-----
Four new test files, one per feature — 31 new cases, all green:

  tests/skills/test_darwinian_evolver_v02_cache.py      8 cases
  tests/skills/test_darwinian_evolver_v02_judge.py     12 cases
  tests/skills/test_darwinian_evolver_v02_critic.py     6 cases
  tests/skills/test_darwinian_evolver_v02_dashboard.py  5 cases

Full darwinian-evolver suite: 39 (v0.1) + 31 (v0.2) = 70, all green.
No regression in existing repo tests (full tests/skills/ run: 170/170).

Acceptance checklist (rows 11 + 12 from the plan): both pass.

Invariants maintained
---------------------
* Every v0.1 behaviour is preserved; new features are opt-in (except
  the cache, which is transparent and correctness-preserving).
* `lineage_hash` gains determinism it didn't have before: with the
  cache, identical seeds produce identical hashes across runs.
@Bihruze

Bihruze commented Apr 19, 2026

Copy link
Copy Markdown
Contributor Author

v0.2 changelog (pushed as 1deec86)

Layered on top of the two v0.1 commits — four cohesive features that move the skill from credible-weekend to production-grade. All opt-in except the cache, which is transparent.

Summary

# Feature Why it matters New LOC
1 LLM response cache — content-addressed SQLite, keyed by blake2b-16 of the normalised request body Replay is now bit-for-bit deterministic at zero cost. Cache hits skip both the HTTP call and the budget ledger. cache.py (150) + edits
2 Bradley-Terry pairwise judge — Hunter (2004) MM aggregator over position-bias-guarded pairwise verdicts Absolute LLM-as-judge scores drift (Zheng 2023, Chen 2024). Pairwise preferences are invariant to that drift. judge.py (250) + evaluator wiring
3 Constitutional critic — second LLM inspects top-K for reward hacking per a markdown rule-set Held-out guard catches train/test gaps; the critic catches generalised cheating (judge flattery, regex over-matching, harness exploits). critic.py (180) + constitution.md
4 FastAPI + Plotly dashboard — read-only live view with Mermaid lineage Runs become visible — fitness curves, Pareto, operator attribution, budget burndown — with WebSocket push on new generations. dashboard.py (230) + dashboard.html (180)

CLI additions

evolver run <dir> [--no-cache]
evolver cache <dir> stats|purge
evolver dashboard <dir> [--host 127.0.0.1 --port 8787]

fitness_spec extensions

@fitness_spec(
    judge="pairwise", pairwise_rounds=40,        # feature 2
    critic="on", critic_threshold=0.5,           # feature 3
    critic_top_k=5, critic_model="haiku-cheap",  # feature 3 — cheap-model route
)
def fitness(candidate, context): ...

Storage

Four additive tables (CREATE TABLE IF NOT EXISTS — v0.1 experiments auto-upgrade): llm_cache, pairwise_votes, bt_scores, critic_evaluations.

Tests

31 new cases across four test files, all green:

test_darwinian_evolver_v02_cache.py       8 cases
test_darwinian_evolver_v02_judge.py      12 cases
test_darwinian_evolver_v02_critic.py      6 cases
test_darwinian_evolver_v02_dashboard.py   5 cases

Full darwinian-evolver suite: 39 (v0.1) + 31 (v0.2) = 70/70 green. No regression in other skills.

Acceptance rows 11 (replay bit-identity) and 12 (judge calibration via Condorcet recovery) from the v0.2 plan both pass.

License / safety

  • Dashboard binds 127.0.0.1 by default; --host non-loopback prints a prominent warning. No auth shipped.
  • FastAPI / uvicorn are optional deps; dashboard subcommand fails closed with an install hint when absent.
  • Constitutional critic is off by default.
  • Cache is per-experiment (scoped to its lineage.db), so cross-experiment leakage is structurally impossible.

What's explicitly out of scope for v0.2

No RL fine-tuning (that's Phase 4 of #337); no dashboard auth/write endpoints; no distributed workers; no automatic hyperparameter tuning.

Happy to split the commit into four (one per feature) if that eases review.

@Bihruze

Bihruze commented Apr 19, 2026

Copy link
Copy Markdown
Contributor Author

Closing while v1.0 (A1-C5) lands locally; will reopen with the full roadmap once Bihruze reviews. See /Users/seher/.claude/plans/nested-gathering-kazoo.md for the plan.

@Bihruze Bihruze closed this Apr 19, 2026
Bihruze added a commit to Bihruze/hermes-agent that referenced this pull request Apr 19, 2026
Single commit bundling the rest of the v1.0 roadmap (phases 2-5 of
the approved plan). Per-feature structure preserved in this message;
file diffs in storage.py / SKILL.md cross-cut features and cannot be
split without interactive rebase.

================================================================
PHASE 2 — Research core (v0.4)
================================================================

A2 self-modifying bandit — scripts/bandit_director.py
------------------------------------------------------
LLM proposes add / retire / merge actions on the Exp3 arm set every
R generations; safety rails (consecutive-floor requirement for
retire, max_arms cap for add, no-op on missing arms) keep a bad LLM
reply from corrupting the bandit. DynamicOperator wraps a prompt
template without evaluating Python. New `generated_operators`
table records the library so runs replay deterministically.

A3 co-evolution — scripts/coevolve.py
-------------------------------------
Dual populations (solvers vs adversaries) alternate steps; the
adversary's fitness is the fraction of solvers it defeats on a
threshold-0.5 test. Red-Queen dynamic bounded by
max_adversary_generations. `red_team_inputs` table audits every
adversarial input.

A4 auto-fitness synthesis — scripts/fitness_synth.py
----------------------------------------------------
Given ≤20 labelled I/O pairs, an LLM picks one of three archetypes
(exact / soft / judge) and we emit a runnable fitness.py text.
Meta-APE pattern; user reviews before accepting. Soft archetype
ships a pure-stdlib Levenshtein ratio; judge archetype wires into
the existing LLMClient without extra deps.

================================================================
PHASE 3 — Scale + Safety (v0.5)
================================================================

B1 distributed — scripts/distributed.py
---------------------------------------
WorkerBackend protocol with three implementations:
* LocalBackend     — v0.2 asyncio Semaphore path
* RaySimBackend    — stdlib shim mimicking Ray's API
* RayBackend       — real Ray (lazy import behind optional extra)
`select_backend()` factory dispatches; graceful absence with a
clear install hint when ray isn't on PATH.

B4 sandbox backends — scripts/sandbox_wasm.py · sandbox_firecracker.py
----------------------------------------------------------------------
* WasmSandbox      — wasmtime-py, cross-platform, zero file/net
                     exposure. Graceful availability check.
* FirecrackerSandbox — Linux-only; KVM + firecracker binary + user
                       kernel/rootfs paths. Fails closed with the
                       right message on macOS.

B2 nightly repo sweep — scripts/repo_sweep.py +
                         .github/workflows/darwinian-evolver-nightly.yml
------------------------------------------------------------------
Discovers every SKILL.md in the repo, runs a baseline-score +
(dry-run-only for v0.5 ship) evolve pass, writes a JSON report.
Per-skill 72h cooldown prevents PR churn. Workflow runs at
03:17 UTC nightly + `workflow_dispatch` with `dry_run` input.
Opens NO PRs this release — the workflow uploads a report
artefact only; PR creation ships in a follow-up once maintainers
vet the fitness proxy.

================================================================
PHASE 4 — Transfer + Distill (v0.6)
================================================================

A5 cross-task transfer — scripts/task_features.py + transfer.py
---------------------------------------------------------------
9-D task-feature vector (fitness surface size, objectives, judge
mode, critic flag, archetype hash, char-n-gram entropy, …). k-NN
meta-policy over cosine distance; pickleable for `--transfer-from`.
New `task_features` table records per-experiment features +
policy hash for reproducibility.

B5 evolve → distill — scripts/distill.py
----------------------------------------
Thin LoRA fine-tune pipeline over `transformers + peft +
accelerate` (optional deps group
`darwinian-evolver-distill`). Teacher-callback indirection keeps
the provider swap easy. Availability check raises a helpful
DistillUnavailable when deps absent.

================================================================
PHASE 5 — Ecosystem (v1.0)
================================================================

C1 benchmark hub — scripts/bench.py
-----------------------------------
Registry + 3 canonical fitnesses (email-regex/v1,
ten-word-summary/v1, sql-select-easy/v1); `score_archive(conn, id)`
re-scores a lineage.db's top-K against a registered benchmark.

C2 cross-model validation — scripts/validate.py
-----------------------------------------------
`cross_model_validate` re-scores candidates under a user-supplied
target scorer coroutine and reports Spearman-ρ between local and
target rankings.

C5 forkable marketplace — scripts/marketplace.py
------------------------------------------------
Layer on v0.3 hub: `prepare_listing` makes a tarball, `fork_listing`
extracts into a warm-start experiment dir, `listing_summary`
inspects a tarball without extracting.

Deferred (documented in SKILL.md):
* C3 HITL dashboard editor
* C4 VS Code extension (TypeScript, separate package)

================================================================
Storage schema
================================================================

Five new tables, all additive:
* `generated_operators`  — A2, name/template/temperature/retired_at
* `red_team_inputs`      — A3, adversarial input corpus
* `fitness_syntheses`    — A4, history of synthesised fitnesses
* `task_features`        — A5, per-experiment feature vector
(The A1 `descriptor_history` and B3 `hub_imports` already shipped
in v0.3.)

================================================================
Tests — 58 new cases, all green
================================================================

tests/skills/test_darwinian_evolver_v04.py       25 (A2+A3+A4+storage)
tests/skills/test_darwinian_evolver_v05.py       14 (B1+B4+B2)
tests/skills/test_darwinian_evolver_v06_v1.py    19 (A5+B5+C1+C2+C5)

Cumulative suite status:
* v0.1 — 39
* v0.2 — 31
* v0.3 — 20
* v0.4 — 25
* v0.5 — 14
* v0.6 + v1.0 — 19
Total: 148 / 148 green. No regression in other skills.

================================================================
What's explicitly NOT in this commit
================================================================

* evolver.py CLI hooks for A2-A5 — library-level only this round;
  wiring into `_run_loop` / subcommands ships when the runner
  refactor for Phase 2 lands (cleaner than weaving 5 hooks into the
  existing single-path loop right now).
* Ablation experiment scripts under `experiments/phase*/` — blocked
  on GPU budget; scripts land when cluster access is confirmed.
* C3 dashboard HITL edit endpoint and C4 IDE extension.

================================================================
Branch discipline
================================================================

LOCAL ONLY. The branch is 5 commits ahead of origin/main and NOT
pushed. PR NousResearch#12633 is closed; Bihruze will re-open with the full v1.0
history once this commit passes his review.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant