feat(evals): isolated eval environment with ephemeral daemon #569

## Problem

The eval suite runs against the same Netclaw instance used for real work. This creates two conflicts:

1. **Evals contaminate the production DB**: seeding test documents and forming LLM memories from eval conversations pollute it, and wiping the DB between runs destroys real user memories.
2. **Production state contaminates evals**: existing memories from real conversations compete with seeded eval documents for recall slots, causing false failures unrelated to code quality.

Observed in #560: memory-score eval runs accumulate LLM-formed memories across sequential runs. By run 3, documents like `doc-80680e...` (formed from eval conversations) crowd out seeded eval docs from the 3 recall slots, dropping recall hit rate from 100% to 30%.

## Proposed Solution

Run evals against an **ephemeral daemon instance** with its own `NETCLAW_HOME`, sharing only the LLM endpoint.

### What must be shared
- **LLM model endpoint** — the local GPU-hosted model (Qwen3.5-27B). Can't spin up a second instance. Both production and eval daemons connect to the same inference server.

### What gets isolated
- **SQLite database** — fresh per eval run, no cross-contamination
- **Session logs** — separate log directory
- **Persistence journals** — separate Akka persistence
- **Daemon port** — different HTTP port to avoid conflicts

### Eval lifecycle

```
1. Create temp dir:  EVAL_HOME=/tmp/netclaw-eval-$(uuidgen)
2. Bootstrap config: copy ~/.netclaw/config.yaml + identity/ to EVAL_HOME
                     (or generate minimal eval-specific config)
3. Start eval daemon: NETCLAW_HOME=$EVAL_HOME netclawd --urls http://127.0.0.1:$EVAL_PORT
4. Wait for healthy:  curl http://127.0.0.1:$EVAL_PORT/health
5. Seed DB:           insert eval documents into $EVAL_HOME/netclaw.db
6. Run eval cases:    NETCLAW_HOME=$EVAL_HOME netclaw -p "prompt..."
7. Collect results:   parse daemon logs from $EVAL_HOME/logs/
8. Tear down:         stop eval daemon, rm -rf $EVAL_HOME
```
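The setup half of the lifecycle (steps 1, 3 and 4) could be sketched in Python roughly as follows. This is a minimal sketch, not the project's implementation: the helper names are hypothetical, and it assumes `/health` answers HTTP 200 once the daemon is ready, which the issue does not specify. The `netclawd --urls` invocation and the `NETCLAW_HOME` variable are taken from the steps above.

```python
import os
import socket
import tempfile
import time
import urllib.error
import urllib.request
import uuid
from pathlib import Path


def make_eval_home() -> Path:
    """Create a unique, empty directory to serve as NETCLAW_HOME for one eval run."""
    home = Path(tempfile.gettempdir()) / f"netclaw-eval-{uuid.uuid4()}"
    home.mkdir(parents=True)
    return home


def pick_free_port() -> int:
    """Ask the OS for an unused loopback port (it may be taken again before use)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


def daemon_command(port: int) -> list[str]:
    """Command line for the eval daemon, mirroring step 3 of the lifecycle."""
    return ["netclawd", "--urls", f"http://127.0.0.1:{port}"]


def daemon_env(eval_home: Path) -> dict[str, str]:
    """Copy of the current environment with NETCLAW_HOME pointed at the eval dir."""
    env = dict(os.environ)
    env["NETCLAW_HOME"] = str(eval_home)
    return env


def wait_for_healthy(port: int, timeout_s: float = 30.0, interval_s: float = 0.5) -> bool:
    """Poll /health until it returns HTTP 200 (assumed readiness signal) or time runs out."""
    url = f"http://127.0.0.1:{port}/health"
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # daemon is not listening yet, or returned an error status
        time.sleep(interval_s)
    return False
```

A runner would presumably pass `daemon_command(port)` and `daemon_env(home)` to `subprocess.Popen`, call `wait_for_healthy(port)` before seeding, and finish with `proc.terminate()` and `shutil.rmtree(home)` in a `finally` block so the temp directory is removed even when an eval case fails.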

### Per-run isolation within a suite

For multi-run suites (5 iterations), clean non-seeded memories between runs to prevent LLM-formed memory accumulation:

```sql
DELETE FROM memory_documents WHERE document_id NOT LIKE 'doc-eval-%';
DELETE FROM memory_documents_fts WHERE document_id NOT LIKE 'doc-eval-%';
DELETE FROM memory_records;
```

This ensures each run starts with only the seeded eval documents, regardless of what the LLM formed during previous runs.
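To illustrate the effect of the cleanup, here is a small self-contained sketch that applies the three statements in a single transaction against an in-memory SQLite database. The table schemas and sample IDs are assumptions made up for the demo (the real `memory_documents_fts` is presumably an FTS virtual table; a plain table stands in for it here), so treat it as an illustration of the filter rather than the actual storage layout.

```python
import sqlite3

# The three cleanup statements from the issue, run between eval iterations.
CLEANUP_STATEMENTS = (
    "DELETE FROM memory_documents WHERE document_id NOT LIKE 'doc-eval-%'",
    "DELETE FROM memory_documents_fts WHERE document_id NOT LIKE 'doc-eval-%'",
    "DELETE FROM memory_records",
)


def reset_to_seeded(conn: sqlite3.Connection) -> None:
    """Remove everything except seeded eval documents, atomically."""
    with conn:  # commits on success, rolls back if any statement raises
        for statement in CLEANUP_STATEMENTS:
            conn.execute(statement)


def demo():
    conn = sqlite3.connect(":memory:")
    # Hypothetical minimal schema, only for this demonstration.
    conn.executescript(
        """
        CREATE TABLE memory_documents (document_id TEXT PRIMARY KEY, body TEXT);
        CREATE TABLE memory_documents_fts (document_id TEXT, body TEXT);
        CREATE TABLE memory_records (id INTEGER PRIMARY KEY, document_id TEXT);
        """
    )
    docs = [
        ("doc-eval-001", "seeded eval document"),
        ("doc-llm-formed-1", "memory formed during an eval conversation"),
    ]
    conn.executemany("INSERT INTO memory_documents VALUES (?, ?)", docs)
    conn.executemany("INSERT INTO memory_documents_fts VALUES (?, ?)", docs)
    conn.execute("INSERT INTO memory_records (document_id) VALUES ('doc-llm-formed-1')")
    conn.commit()

    reset_to_seeded(conn)

    remaining_docs = [r[0] for r in conn.execute(
        "SELECT document_id FROM memory_documents ORDER BY document_id")]
    remaining_fts = [r[0] for r in conn.execute(
        "SELECT document_id FROM memory_documents_fts ORDER BY document_id")]
    record_count = conn.execute("SELECT COUNT(*) FROM memory_records").fetchone()[0]
    conn.close()
    return remaining_docs, remaining_fts, record_count
```

Wrapping the deletes in one transaction means a failure partway through leaves the previous state intact instead of a half-cleaned DB. Note that the approach relies on seeded documents consistently using the `doc-eval-` prefix.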

### Configuration requirements

Need to identify the minimal config surface for an eval daemon:
- LLM provider endpoint (inherited from production config)
- Model selection (same model as production for meaningful results)
- Daemon port (unique per eval run)
- Memory/session tuning (match production defaults)
- Skills directory (can be shared read-only)

### Behavioral eval (`run-evals.sh`)

`run-evals.sh` currently uses `netclaw -p "prompt"`, which connects to whichever daemon is already running (today, the production one). To target the eval daemon, it would need to either:
- Set `NETCLAW_HOME` per-invocation so the CLI connects to the eval daemon
- Or use a `--endpoint` flag to override the daemon URL
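The first option might look like the sketch below when driven from Python. It assumes the CLI locates its daemon through `NETCLAW_HOME`, which is what the option proposes but is not confirmed here; the function names are hypothetical.

```python
import os
import subprocess


def eval_cli_env(eval_home, base_env=None):
    """Return a copy of the environment with NETCLAW_HOME set to the eval dir.

    The parent process environment is left untouched, so production
    invocations made from the same shell are unaffected.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["NETCLAW_HOME"] = str(eval_home)
    return env


def run_prompt(prompt, eval_home, timeout_s=300.0):
    """Run one behavioral eval prompt against the eval daemon.

    Assumption: `netclaw -p` resolves the daemon from NETCLAW_HOME.
    """
    return subprocess.run(
        ["netclaw", "-p", prompt],
        env=eval_cli_env(eval_home),
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
```

Scoping the variable to each subprocess, rather than exporting it in the calling shell, keeps an interrupted eval from leaving the user's session pointed at a deleted temp directory.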

### CI considerations

GitHub Actions runners don't have a GPU, so CI evals would need either:
- A remote LLM endpoint (API-based model)
- Or be limited to non-LLM assertions (log pattern checks, seeding verification)
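One way to support both options is to tag each eval case by whether it needs a model and filter at startup. The sketch below is purely illustrative: the `EvalCase` type, the `requires_llm` flag, the case names, and the `EVAL_LLM_ENDPOINT` variable are all hypothetical names, not part of the existing suite.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    name: str
    requires_llm: bool  # False for log-pattern checks, seeding verification, etc.


def select_cases(cases, llm_endpoint):
    """Run every case when an LLM endpoint is configured, else only non-LLM cases."""
    if llm_endpoint:
        return list(cases)
    return [case for case in cases if not case.requires_llm]


ALL_CASES = [
    EvalCase("seeding-verification", requires_llm=False),
    EvalCase("log-pattern-check", requires_llm=False),
    EvalCase("memory-score", requires_llm=True),
]

# EVAL_LLM_ENDPOINT is a hypothetical variable a CI job could set
# when a remote, API-based model is available.
runnable = select_cases(ALL_CASES, os.environ.get("EVAL_LLM_ENDPOINT"))
```

On a GPU-less runner with no endpoint configured, only the two non-LLM cases would be selected; setting the variable to a remote endpoint would enable the full list.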

## Related

- #560 — memory overfetch fix, where eval contamination was first observed
- Memory-score eval currently at 47/100 due to cross-run contamination
