Problem
There is no systematic way to test that Netclaw's identity alignment, skill loading, memory recall, and compaction quality meet acceptable thresholds. Issues are discovered only through production use.
Proposal
Build a formal eval suite that can be run against any inference provider to measure:
1. Identity Alignment Evals
- Does the bot know it is Netclaw?
- Does it know its capabilities (can it reference the skill index)?
- Does it correctly refuse requests outside its capabilities?
- Does it maintain identity after compaction?
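One way an identity case could be expressed is as a prompt plus strings the reply must or must not contain. The sketch below is illustrative only: `IdentityCase`, `run_identity_case`, and the `ask` callable are hypothetical names, and `ask` stands in for whatever actually sends a prompt to the bot (for example a wrapper around headless mode). Substring checks are a crude proxy; a real suite might use a grader model instead.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class IdentityCase:
    """A single identity check (hypothetical shape, not an existing Netclaw type)."""
    name: str
    prompt: str
    must_contain: List[str] = field(default_factory=list)      # expected substrings
    must_not_contain: List[str] = field(default_factory=list)  # forbidden substrings


def run_identity_case(case: IdentityCase, ask: Callable[[str], str]) -> Dict:
    """Send the prompt through `ask` and apply case-insensitive substring checks."""
    reply = ask(case.prompt).lower()
    missing = [s for s in case.must_contain if s.lower() not in reply]
    forbidden = [s for s in case.must_not_contain if s.lower() in reply]
    return {
        "case": case.name,
        "passed": not missing and not forbidden,
        "missing": missing,
        "forbidden": forbidden,
    }


# Usage with a canned reply in place of a real provider call.
case = IdentityCase("knows-name", "Who are you?", ["netclaw"], ["chatgpt"])
result = run_identity_case(case, lambda prompt: "I am Netclaw.")
```

The same case can be run once on a fresh session and once after a forced compaction to cover the last bullet.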
2. Skill Auto-Loading Evals
- Given a user message about scheduling, does `netclaw-manual` auto-load?
- Given a message about "what's wrong with my session", does `netclaw-diagnostics` auto-load?
- Do skills re-load correctly after compaction?
- Measure keyword match scores for canonical user intents
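The keyword-score measurement could be prototyped as a table of canonical intents mapped to the skill expected to load. In this sketch the keyword sets, the scoring formula (fraction of a skill's keywords present in the message), and the threshold are all assumptions made for illustration; the real matcher in the skill registry may score differently.

```python
import re
from typing import Dict, Optional, Set

# Hypothetical keyword sets; the real skill index defines its own.
SKILL_KEYWORDS: Dict[str, Set[str]] = {
    "netclaw-manual": {"schedule", "scheduling", "reminder", "cron"},
    "netclaw-diagnostics": {"wrong", "session", "error", "diagnose"},
}


def keyword_score(message: str, keywords: Set[str]) -> float:
    """Fraction of the skill's keywords that appear as tokens in the message."""
    tokens = set(re.findall(r"[a-z']+", message.lower()))
    return len(tokens & keywords) / len(keywords)


def best_skill(message: str, threshold: float = 0.25) -> Optional[str]:
    """Return the highest-scoring skill, or None if nothing clears the threshold."""
    scores = {name: keyword_score(message, kw) for name, kw in SKILL_KEYWORDS.items()}
    name, score = max(scores.items(), key=lambda item: item[1])
    return name if score >= threshold else None


# Canonical intents paired with the skill expected to auto-load.
CANONICAL_INTENTS = [
    ("Can you help with scheduling a reminder?", "netclaw-manual"),
    ("what's wrong with my session", "netclaw-diagnostics"),
]

results = [(msg, expected, best_skill(msg)) for msg, expected in CANONICAL_INTENTS]
```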
3. Memory Recall Evals
- Seed database with known facts, query with natural language, verify recall
- Test cross-domain widening behavior
- Test recall under 300ms timeout pressure
- Test recall after compaction (do memories about the conversation topic surface?)
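A seeded recall case might look roughly like the following. The `memories` table, the `naive_recall` function, and the result shape are hypothetical stand-ins; the actual memory schema and retrieval logic are not described here, so only the seed, query, verify, and time-budget pattern should be read as the point.

```python
import sqlite3
import time
from typing import Dict, List


def seed(conn: sqlite3.Connection, facts: List[str]) -> None:
    """Create a toy memories table and insert known facts (assumed schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS memories (id INTEGER PRIMARY KEY, content TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO memories (content) VALUES (?)", [(f,) for f in facts])
    conn.commit()


def naive_recall(conn: sqlite3.Connection, query: str) -> List[str]:
    """Stand-in for real recall: return memories sharing any query word longer than 3 chars."""
    words = [w for w in query.lower().split() if len(w) > 3]
    hits = []
    for (content,) in conn.execute("SELECT content FROM memories"):
        if any(w in content.lower() for w in words):
            hits.append(content)
    return hits


def run_recall_case(conn: sqlite3.Connection, query: str, expected: str,
                    budget_ms: float = 300.0) -> Dict:
    """Check that the expected fact surfaces and record whether it fit the time budget."""
    start = time.perf_counter()
    hits = naive_recall(conn, query)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return {
        "recalled": any(expected.lower() in h.lower() for h in hits),
        "within_budget": elapsed_ms <= budget_ms,
        "elapsed_ms": elapsed_ms,
    }


conn = sqlite3.connect(":memory:")
seed(conn, ["The user's favorite editor is Neovim", "Deploys happen on Fridays"])
outcome = run_recall_case(conn, "which editor does the user prefer?", "Neovim")
```

Reporting `recalled` and `within_budget` separately keeps a slow but correct recall distinguishable from a fast miss.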
4. Compaction Quality Evals
- Feed a known conversation history, compact, verify summary preserves key facts
- Measure information loss (what was in the conversation vs what's in the summary)
- Test that post-compaction turns have sufficient context to continue
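Information loss could be approximated by annotating each fixture conversation with key facts and checking which of them survive in the summary. The function below is a minimal sketch under that assumption: `fact_retention` is a hypothetical helper, and exact substring matching will undercount facts that the summarizer paraphrases, so a real eval would likely need fuzzy or model-graded matching.

```python
from typing import Dict, List


def fact_retention(key_facts: List[str], summary: str) -> Dict:
    """Report which annotated facts appear verbatim (case-insensitive) in the summary."""
    summary_lower = summary.lower()
    kept = [f for f in key_facts if f.lower() in summary_lower]
    lost = [f for f in key_facts if f.lower() not in summary_lower]
    # With no annotated facts there is nothing to lose, so retention is 1.0.
    retention = len(kept) / len(key_facts) if key_facts else 1.0
    return {"retention": retention, "kept": kept, "lost": lost}


facts = ["postgres 16", "friday deploy", "api key rotation"]
summary = "The team agreed to upgrade to Postgres 16 and keep the Friday deploy."
report = fact_retention(facts, summary)
```

Here two of the three annotated facts survive, and the `lost` list names the one that did not, which is the part most useful when iterating on the compaction prompt.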
Infrastructure
- Use `netclaw -p` headless mode for single-shot evaluations
- Allow configuring inference provider via environment variable (`NETCLAW_EVAL_PROVIDER_URL`, `NETCLAW_EVAL_MODEL`)
- Eval database: separate SQLite file that can be seeded from fixtures
- Eval results: structured output (JSON) with pass/fail per case + scores
- Separate project: `src/Netclaw.Evals/` or `evals/` directory
- Extend existing `MemoryRedesignedEvalSuiteTests` and `MemoryEvalSeedSuiteTests` patterns
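As a rough illustration of the configuration and output bullets, the sketch below reads the two environment variables named above and serializes per-case results to JSON. The report field names (`total`, `passed`, `failed`, `cases`) and both function names are assumptions, not an agreed schema.

```python
import json
import os
from typing import Dict, List, Mapping, Optional


def load_eval_config(env: Optional[Mapping[str, str]] = None) -> Dict[str, Optional[str]]:
    """Read provider settings from the environment (or a supplied mapping for tests)."""
    env = os.environ if env is None else env
    return {
        "provider_url": env.get("NETCLAW_EVAL_PROVIDER_URL"),
        "model": env.get("NETCLAW_EVAL_MODEL"),
    }


def build_report(config: Dict[str, Optional[str]], results: List[Dict]) -> str:
    """Serialize pass/fail per case plus totals; the schema here is illustrative."""
    passed = sum(1 for r in results if r["passed"])
    return json.dumps(
        {
            "provider_url": config["provider_url"],
            "model": config["model"],
            "total": len(results),
            "passed": passed,
            "failed": len(results) - passed,
            "cases": results,
        },
        indent=2,
    )


config = load_eval_config({"NETCLAW_EVAL_MODEL": "example-model"})
report_json = build_report(config, [
    {"case": "knows-name", "passed": True, "score": 1.0},
    {"case": "recall-editor", "passed": False, "score": 0.4},
])
```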
A/B Testing Support
- Parameterize evals by model, system prompt variant, compaction settings
- Output structured comparison data to enable prompt/config iteration
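Parameterization can be sketched as a Cartesian product over the axes being varied, with a pass rate computed per variant for comparison. The axis names and values below (`model`, `prompt_variant`, `compaction_threshold`) are placeholders chosen for the example, as are the helper names.

```python
from itertools import product
from typing import Dict, List


def expand_matrix(axes: Dict[str, List]) -> List[Dict]:
    """Expand named axes into one settings dict per combination."""
    names = list(axes)
    return [dict(zip(names, combo)) for combo in product(*(axes[n] for n in names))]


def pass_rate(outcomes: List[bool]) -> float:
    """Fraction of passing cases; an empty run counts as 0.0."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Placeholder axes; real values would come from the eval configuration.
variants = expand_matrix({
    "model": ["model-a", "model-b"],
    "prompt_variant": ["v1", "v2"],
    "compaction_threshold": [0.5],
})

# Comparison data keyed by a readable label, using made-up outcomes.
comparison = {
    "model-a/v1": pass_rate([True, True, False, True]),
    "model-b/v1": pass_rate([True, False, False, True]),
}
```

Keeping each variant as a plain dict makes it straightforward to embed the settings in the JSON output next to the scores they produced.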
What Should NOT Require an Inference Provider
Many deterministic components can be tested without an LLM:
- Skill keyword matching (already has `SkillRegistryMatchTests`)
- Memory recall planning (already has `DeterministicRetrievalPlanningTests`)
- Compaction reducer (Phase 1) behavior
- Session state transitions
Only identity alignment and compaction summary quality require actual LLM inference.
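One way to encode this split is a `requires_llm` flag on each eval group, so that a run without a configured provider still executes the deterministic groups and reports the rest as skipped. The flag name, the group labels, and `select_runnable` are hypothetical; the grouping itself follows the list above.

```python
from typing import Dict, List, Optional, Tuple

# Eval groups from the list above, tagged by whether they need inference.
EVAL_GROUPS: List[Dict] = [
    {"name": "skill-keyword-matching", "requires_llm": False},
    {"name": "memory-recall-planning", "requires_llm": False},
    {"name": "compaction-reducer-phase-1", "requires_llm": False},
    {"name": "session-state-transitions", "requires_llm": False},
    {"name": "identity-alignment", "requires_llm": True},
    {"name": "compaction-summary-quality", "requires_llm": True},
]


def select_runnable(groups: List[Dict], provider_url: Optional[str]) -> Tuple[List[str], List[str]]:
    """Split groups into (runnable, skipped) based on whether a provider is configured."""
    runnable, skipped = [], []
    for group in groups:
        if group["requires_llm"] and not provider_url:
            skipped.append(group["name"])
        else:
            runnable.append(group["name"])
    return runnable, skipped


runnable, skipped = select_runnable(EVAL_GROUPS, provider_url=None)
```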