
feat(evals): isolated eval environment with ephemeral daemon #569

@Aaronontheweb

Problem

The eval suite runs against the same Netclaw instance used for real work. This creates two conflicts:

  1. Evals contaminate the production DB: they seed test documents and form LLM memories from eval conversations, and wiping the DB between runs destroys real user memories.
  2. Production state contaminates evals: existing memories from real conversations compete with seeded eval documents for recall slots, causing false failures unrelated to code quality.

Observed in #560: memory-score eval runs accumulate LLM-formed memories across sequential runs. By run 3, documents like doc-80680e... (formed from eval conversations) crowd out seeded eval docs from the 3 recall slots, dropping recall hit rate from 100% to 30%.

Proposed Solution

Run evals against an ephemeral daemon instance with its own NETCLAW_HOME, sharing only the LLM endpoint.

What must be shared

  • LLM model endpoint — the local GPU-hosted model (Qwen3.5-27B). Can't spin up a second instance. Both production and eval daemons connect to the same inference server.

What gets isolated

  • SQLite database — fresh per eval run, no cross-contamination
  • Session logs — separate log directory
  • Persistence journals — separate Akka persistence
  • Daemon port — different HTTP port to avoid conflicts

Eval lifecycle

1. Create temp dir:  EVAL_HOME=/tmp/netclaw-eval-$(uuidgen)
2. Bootstrap config: copy ~/.netclaw/config.yaml + identity/ to EVAL_HOME
                     (or generate minimal eval-specific config)
3. Start eval daemon: NETCLAW_HOME=$EVAL_HOME netclawd --urls http://127.0.0.1:$EVAL_PORT
4. Wait for healthy:  curl http://127.0.0.1:$EVAL_PORT/health
5. Seed DB:           insert eval documents into $EVAL_HOME/netclaw.db
6. Run eval cases:    NETCLAW_HOME=$EVAL_HOME netclaw -p "prompt..."
7. Collect results:   parse daemon logs from $EVAL_HOME/logs/
8. Tear down:         stop eval daemon, rm -rf $EVAL_HOME
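
The lifecycle above could be scripted roughly as follows. This is a minimal sketch, not existing Netclaw tooling: the function names, the default port 5199, and the 30-second health timeout are all illustrative assumptions, and it uses mktemp -d instead of uuidgen to create the temp dir. Seeding, eval cases, and result collection (steps 5 to 7) are passed in as a command so they run before teardown.

```shell
# Steps 1-2: create an isolated home and copy production config into it.
make_eval_home() {
  local src="${1:-$HOME/.netclaw}"
  local home
  home="$(mktemp -d "${TMPDIR:-/tmp}/netclaw-eval-XXXXXX")" || return 1
  if [ -f "$src/config.yaml" ]; then cp "$src/config.yaml" "$home/"; fi
  if [ -d "$src/identity" ]; then cp -R "$src/identity" "$home/"; fi
  printf '%s\n' "$home"
}

# Step 4: poll the health URL once per second until it answers,
# giving up after the timeout (in seconds, default 30).
wait_for_health() {
  local url="$1" timeout="${2:-30}" waited=0
  until curl -fsS "$url" >/dev/null 2>&1; do
    waited=$((waited + 1))
    if [ "$waited" -ge "$timeout" ]; then return 1; fi
    sleep 1
  done
}

# Steps 3-8: start the eval daemon, run the caller's command with
# NETCLAW_HOME pointing at the eval home, then always tear down.
run_isolated_evals() {
  local port="${EVAL_PORT:-5199}"   # assumed default, not a Netclaw setting
  local home daemon_pid status
  home="$(make_eval_home)" || return 1
  NETCLAW_HOME="$home" netclawd --urls "http://127.0.0.1:$port" &
  daemon_pid=$!
  if wait_for_health "http://127.0.0.1:$port/health"; then
    NETCLAW_HOME="$home" "$@"
    status=$?
  else
    status=1
  fi
  kill "$daemon_pid" 2>/dev/null
  wait "$daemon_pid" 2>/dev/null
  rm -rf "$home"
  return "$status"
}
```

Usage would look like run_isolated_evals ./run-evals.sh, with the script reading NETCLAW_HOME to find the eval database and logs.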

Per-run isolation within a suite

For multi-run suites (5 iterations), clean non-seeded memories between runs to prevent LLM-formed memory accumulation:

DELETE FROM memory_documents WHERE document_id NOT LIKE 'doc-eval-%';
DELETE FROM memory_documents_fts WHERE document_id NOT LIKE 'doc-eval-%';
DELETE FROM memory_records;

This ensures each run starts with only the seeded eval documents, regardless of what the LLM formed during previous runs.
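
One way to apply this cleanup between iterations is to wrap the three statements in a single transaction, so a failed delete cannot leave memory_documents and its FTS table out of sync. The sketch below assumes the sqlite3 CLI is installed and that the eval daemon tolerates an external writer on its database file; reset_sql and reset_eval_memories are hypothetical helper names.

```shell
# Print the cleanup statements wrapped in one transaction.
reset_sql() {
  cat <<'SQL'
BEGIN;
DELETE FROM memory_documents WHERE document_id NOT LIKE 'doc-eval-%';
DELETE FROM memory_documents_fts WHERE document_id NOT LIKE 'doc-eval-%';
DELETE FROM memory_records;
COMMIT;
SQL
}

# Apply the cleanup to the database file given as $1.
# -bail stops at the first error, so COMMIT is never reached on failure.
reset_eval_memories() {
  reset_sql | sqlite3 -bail "$1"
}
```

A suite runner would call reset_eval_memories "$EVAL_HOME/netclaw.db" before each iteration.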

Configuration requirements

Need to identify the minimal config surface for an eval daemon:

  • LLM provider endpoint (inherited from production config)
  • Model selection (same model as production for meaningful results)
  • Daemon port (unique per eval run)
  • Memory/session tuning (match production defaults)
  • Skills directory (can be shared read-only)

Behavioral eval (run-evals.sh)

run-evals.sh currently uses netclaw -p "prompt", which connects to the running daemon. It would need to either:

  • Set NETCLAW_HOME per-invocation so the CLI connects to the eval daemon
  • Or use a --endpoint flag to override the daemon URL
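
The first option could be a thin wrapper inside run-evals.sh. This sketch assumes the CLI locates the daemon through NETCLAW_HOME, which this issue proposes but does not confirm; eval_prompt is a hypothetical helper name.

```shell
# Run one eval prompt against the eval daemon. NETCLAW_HOME is set only
# for this single netclaw invocation.
eval_prompt() {
  # Abort with a message if EVAL_HOME is unset or empty.
  : "${EVAL_HOME:?EVAL_HOME must point at the eval daemon home}"
  NETCLAW_HOME="$EVAL_HOME" netclaw -p "$1"
}
```

Because the variable is scoped to the one command, a production netclaw call made from the same shell would still use the default home.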

CI considerations

GitHub Actions runners don't have a GPU, so CI evals would need either:

  • A remote LLM endpoint (API-based model)
  • Or be limited to non-LLM assertions (log pattern checks, seeding verification)
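
A CI job could pick between these two tiers at runtime. The sketch below is only an illustration: EVAL_LLM_ENDPOINT is a hypothetical variable naming a remote model endpoint, and the helper names and the "llm" / "static" tier labels are assumptions.

```shell
# Report which eval tier this runner can execute: "llm" when a remote
# endpoint is configured, otherwise "static" (non-LLM assertions only).
eval_mode() {
  if [ -n "${EVAL_LLM_ENDPOINT:-}" ]; then
    echo "llm"
  else
    echo "static"
  fi
}

# Non-LLM assertion: succeed if any file under the log directory ($1)
# contains a line matching the extended regex ($2).
assert_log_contains() {
  local log_dir="$1" pattern="$2"
  grep -rqE -- "$pattern" "$log_dir"
}
```

In the static tier, checks such as assert_log_contains "$EVAL_HOME/logs" "some expected pattern" can still verify seeding and log output without calling the model.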

Labels

  • context-pipeline: LLM context assembly (prompt layers, dynamic injection, memory recall, temporal grounding)
  • memory: memory formation, recall, curation pipeline
