feature: formal eval suite for identity, skills, memory, and compaction quality #319

@Aaronontheweb

Problem

There is no systematic way to test that Netclaw's identity alignment, skill loading, memory recall, and compaction quality meet acceptable thresholds. Issues are discovered only through production use.

Proposal

Build a formal eval suite that can be run against any inference provider to measure:

1. Identity Alignment Evals

  • Does the bot know it is Netclaw?
  • Does it know its capabilities (can it reference the skill index)?
  • Does it correctly refuse requests outside its capabilities?
  • Does it maintain identity after compaction?
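
A minimal sketch of how one identity case could be graded, assuming the reply text has already been captured (for example from a `netclaw -p` run). The `IdentityCase` type, the `grade_identity` helper, and the substring-based grading are illustrative assumptions, not existing Netclaw code; a real suite may need a more robust grader.

```python
from dataclasses import dataclass, field


@dataclass
class IdentityCase:
    """Hypothetical shape of one identity eval case."""
    prompt: str
    must_contain: list = field(default_factory=list)      # substrings expected in the reply
    must_not_contain: list = field(default_factory=list)  # substrings that signal a failure


def grade_identity(case: IdentityCase, reply: str) -> bool:
    """Pass when every required substring appears and no forbidden one does.

    Matching is case-insensitive and purely lexical, which is crude but
    deterministic and cheap to run against any provider's output.
    """
    text = reply.lower()
    has_required = all(s.lower() in text for s in case.must_contain)
    has_forbidden = any(s.lower() in text for s in case.must_not_contain)
    return has_required and not has_forbidden


# Example case: the bot should name itself and not claim another identity.
who_are_you = IdentityCase(
    prompt="Who are you?",
    must_contain=["netclaw"],
    must_not_contain=["i am chatgpt"],
)
```

The same case can be replayed after a compaction to cover the last bullet above.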

2. Skill Auto-Loading Evals

  • Given a user message about scheduling, does `netclaw-manual` auto-load?
  • Given a message such as "what's wrong with my session", does `netclaw-diagnostics` auto-load?
  • Do skills re-load correctly after compaction?
  • Measure keyword match scores for canonical user intents
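
To make "keyword match scores" concrete, here is a rough sketch of one possible scoring scheme. The keyword sets, the score formula (fraction of a skill's keywords present in the message), and the threshold are all assumptions for illustration; the actual matcher in the skill registry may work differently.

```python
import re

# Hypothetical keyword sets; the real skill index is not shown in this issue.
SKILL_KEYWORDS = {
    "netclaw-manual": {"schedule", "scheduling", "reminder", "cron"},
    "netclaw-diagnostics": {"wrong", "session", "error", "diagnose"},
}


def tokenize(message: str) -> set:
    """Lowercase the message and split it into word tokens."""
    return set(re.findall(r"[a-z0-9']+", message.lower()))


def match_scores(message: str) -> dict:
    """Score each skill by the fraction of its keywords found in the message."""
    tokens = tokenize(message)
    return {
        skill: len(tokens & keywords) / len(keywords)
        for skill, keywords in SKILL_KEYWORDS.items()
    }


def best_skill(message: str, threshold: float = 0.25):
    """Return the top-scoring skill, or None if nothing clears the threshold."""
    scores = match_scores(message)
    skill, score = max(scores.items(), key=lambda kv: kv[1])
    return skill if score >= threshold else None
```

An eval would then assert that each canonical user intent maps to the expected skill and record the raw scores for comparison across runs.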

3. Memory Recall Evals

  • Seed database with known facts, query with natural language, verify recall
  • Test cross-domain widening behavior
  • Test recall under 300ms timeout pressure
  • Test recall after compaction (do memories about the conversation topic surface?)
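
A small sketch of the seed-then-recall shape, using an in-memory SQLite database. The `memories` table, the fixture rows, and the `LIKE`-based lookup are stand-ins I made up for illustration; the real recall pipeline and schema are not described here. Only the 300 ms budget comes from the list above.

```python
import sqlite3
import time

# Hypothetical fixture facts: (topic, content)
FIXTURE_FACTS = [
    ("user_timezone", "The user's timezone is UTC+2"),
    ("project_db", "The project uses SQLite for storage"),
]


def seed(conn: sqlite3.Connection, facts) -> None:
    """Create a toy memories table and load the fixture facts."""
    conn.execute("CREATE TABLE memories (topic TEXT, content TEXT)")
    conn.executemany("INSERT INTO memories VALUES (?, ?)", facts)
    conn.commit()


def recall(conn: sqlite3.Connection, term: str) -> list:
    """Stand-in for the real recall path: a simple substring search."""
    rows = conn.execute(
        "SELECT content FROM memories WHERE content LIKE ?", (f"%{term}%",)
    ).fetchall()
    return [row[0] for row in rows]


def run_recall_case(conn, term: str, expected: str, budget_s: float = 0.3) -> dict:
    """Check that the expected fact surfaces and that recall fits the time budget."""
    start = time.perf_counter()
    results = recall(conn, term)
    elapsed = time.perf_counter() - start
    return {
        "recalled": any(expected in r for r in results),
        "within_budget": elapsed <= budget_s,
        "elapsed_ms": round(elapsed * 1000, 3),
    }
```

Running the same case before and after a compaction would cover the last bullet.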

4. Compaction Quality Evals

  • Feed a known conversation history, compact, verify summary preserves key facts
  • Measure information loss (what was in the conversation vs what's in the summary)
  • Test that post-compaction turns have sufficient context to continue
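
One simple way to quantify information loss is a fact-retention score: list the key facts from the known conversation, then check which of them still appear in the summary. The sketch below assumes facts can be matched as literal substrings, which is a strong simplification; `fact_retention` is a hypothetical helper, and paraphrased facts would need a fuzzier or model-based check.

```python
def fact_retention(key_facts: list, summary: str) -> dict:
    """Report which key facts survive in the summary and the retained fraction."""
    text = summary.lower()
    kept = [fact for fact in key_facts if fact.lower() in text]
    lost = [fact for fact in key_facts if fact.lower() not in text]
    # An empty fact list counts as fully retained to avoid dividing by zero.
    score = len(kept) / len(key_facts) if key_facts else 1.0
    return {"score": score, "kept": kept, "lost": lost}


# Example: three facts planted in the fixture conversation.
facts = ["PostgreSQL 16", "Friday deadline", "staging cluster"]
summary = "The team will migrate to PostgreSQL 16 on the staging cluster."
report = fact_retention(facts, summary)
```

The `lost` list is useful in the JSON output because it shows exactly what the compaction dropped, not just the score.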

Infrastructure

  • Use `netclaw -p` headless mode for single-shot evaluations
  • Allow configuring the inference provider via environment variables (`NETCLAW_EVAL_PROVIDER_URL`, `NETCLAW_EVAL_MODEL`)
  • Eval database: separate SQLite file that can be seeded from fixtures
  • Eval results: structured output (JSON) with pass/fail per case + scores
  • Separate project: `src/Netclaw.Evals/` or `evals/` directory
  • Extend existing `MemoryRedesignedEvalSuiteTests` and `MemoryEvalSeedSuiteTests` patterns
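
As a rough illustration of the configuration and output bullets above, the sketch below reads the two proposed environment variables and renders per-case results as JSON. The `EvalConfig` and `CaseResult` types and the report layout are assumptions; only the variable names come from this issue.

```python
import json
import os
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class EvalConfig:
    provider_url: Optional[str]
    model: Optional[str]


def load_config(env=None) -> EvalConfig:
    """Read provider settings from the environment (or a dict, for testing)."""
    env = os.environ if env is None else env
    return EvalConfig(
        provider_url=env.get("NETCLAW_EVAL_PROVIDER_URL"),
        model=env.get("NETCLAW_EVAL_MODEL"),
    )


@dataclass
class CaseResult:
    """Hypothetical per-case record: pass/fail plus a numeric score."""
    suite: str
    case: str
    passed: bool
    score: float


def render_report(config: EvalConfig, results: list) -> str:
    """Serialize config, per-case results, and totals as a JSON document."""
    return json.dumps(
        {
            "config": asdict(config),
            "results": [asdict(r) for r in results],
            "passed": sum(1 for r in results if r.passed),
            "failed": sum(1 for r in results if not r.passed),
        },
        indent=2,
    )
```

Recording the config alongside the results keeps each report self-describing, which helps when comparing runs across providers.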

A/B Testing Support

  • Parameterize evals by model, system prompt variant, compaction settings
  • Output structured comparison data to enable prompt/config iteration
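
A possible shape for the parameterization: expand the three axes into a run matrix, then rank variants by mean score. The function names and the dictionary layout are illustrative assumptions, not an existing API.

```python
from itertools import product


def build_matrix(models, prompt_variants, compaction_settings) -> list:
    """Cartesian product of the three axes, one dict per eval run."""
    return [
        {"model": m, "prompt_variant": p, "compaction": c}
        for m, p, c in product(models, prompt_variants, compaction_settings)
    ]


def compare(scores_by_variant: dict) -> list:
    """Rank variants by mean per-case score, best first.

    scores_by_variant maps a variant name to its list of per-case scores;
    variants with no scores are skipped.
    """
    means = [
        (name, sum(scores) / len(scores))
        for name, scores in scores_by_variant.items()
        if scores
    ]
    return sorted(means, key=lambda item: item[1], reverse=True)
```

Note that the matrix grows multiplicatively, so LLM-backed suites may want to sample from it instead of running every combination.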

What Should NOT Require an Inference Provider

Many deterministic components can be tested without an LLM:

  • Skill keyword matching (already has `SkillRegistryMatchTests`)
  • Memory recall planning (already has `DeterministicRetrievalPlanningTests`)
  • Compaction reducer (Phase 1) behavior
  • Session state transitions

Only identity alignment and compaction summary quality require actual LLM inference.
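
One way to encode this split is to tag each case with whether it needs inference, so the deterministic cases still run when no provider is configured. The `EvalCase` type and the case names below are hypothetical labels for the components listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    name: str
    requires_llm: bool


# Hypothetical case names mirroring the split described above.
CASES = [
    EvalCase("skill-keyword-matching", requires_llm=False),
    EvalCase("memory-recall-planning", requires_llm=False),
    EvalCase("compaction-reducer-phase-1", requires_llm=False),
    EvalCase("session-state-transitions", requires_llm=False),
    EvalCase("identity-alignment", requires_llm=True),
    EvalCase("compaction-summary-quality", requires_llm=True),
]


def runnable_cases(cases, provider_configured: bool) -> list:
    """Keep every case when a provider is available, otherwise only deterministic ones."""
    return [c for c in cases if provider_configured or not c.requires_llm]
```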

Labels: enhancement