feature: formal eval suite for identity, skills, memory, and compaction quality

## Problem

There is no systematic way to test that Netclaw's identity alignment, skill loading, memory recall, and compaction quality meet acceptable thresholds. Issues are discovered only through production use.

## Proposal

Build a formal eval suite that can be run against any inference provider to measure:

### 1. Identity Alignment Evals
- Does the bot know it is Netclaw?
- Does it know its capabilities (can it reference the skill index)?
- Does it correctly refuse things outside its capability?
- Does it maintain identity after compaction?
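
The identity checks above could be expressed as data-driven cases. The sketch below is illustrative only: `IdentityCase`, its `must_contain` / `must_not_contain` fields, and the `ask` callable are hypothetical names. In a real run `ask` would wrap a `netclaw -p` invocation; here it is a stub so the pass/fail logic can be read in isolation.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class IdentityCase:
    """One identity check: a prompt plus phrases the reply must or must not contain."""
    name: str
    prompt: str
    must_contain: List[str] = field(default_factory=list)
    must_not_contain: List[str] = field(default_factory=list)


def run_identity_case(case: IdentityCase, ask: Callable[[str], str]) -> dict:
    """Send the prompt through `ask` and check the reply with case-insensitive substring tests."""
    reply = ask(case.prompt).lower()
    missing = [s for s in case.must_contain if s.lower() not in reply]
    forbidden = [s for s in case.must_not_contain if s.lower() in reply]
    return {
        "case": case.name,
        "passed": not missing and not forbidden,
        "missing": missing,
        "forbidden": forbidden,
    }


def stub_ask(prompt: str) -> str:
    """Stand-in for the model (assumption): always returns the same canned reply."""
    return "I am Netclaw. I can help using the skills listed in my skill index."


knows_name = IdentityCase(
    name="knows-own-name",
    prompt="Who are you?",
    must_contain=["Netclaw"],
    must_not_contain=["I am ChatGPT"],
)
result = run_identity_case(knows_name, stub_ask)
```

Substring checks are deliberately crude. Refusal behavior and post-compaction identity would probably need a rubric or a judge model, which this sketch does not attempt.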

### 2. Skill Auto-Loading Evals
- Given a user message about scheduling, does `netclaw-manual` auto-load?
- Given a message such as "what's wrong with my session", does `netclaw-diagnostics` auto-load?
- Do skills re-load correctly after compaction?
- Measure keyword match scores for canonical user intents
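
A minimal sketch of what "keyword match scores for canonical user intents" might look like follows. The keyword sets, the overlap-ratio scoring, and the `0.25` threshold are all assumptions made for illustration. They are not taken from the actual skill registry, whose matching logic may differ.

```python
import re

# Hypothetical skill index: skill name -> trigger keywords (illustrative only).
SKILL_KEYWORDS = {
    "netclaw-manual": {"schedule", "scheduling", "reminder", "cron"},
    "netclaw-diagnostics": {"wrong", "session", "error", "diagnose"},
}


def tokenize(text):
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def match_scores(message):
    """Score each skill by the fraction of its keywords present in the message."""
    tokens = tokenize(message)
    return {
        skill: len(tokens & keywords) / len(keywords)
        for skill, keywords in SKILL_KEYWORDS.items()
    }


def best_skill(message, threshold=0.25):
    """Return the top-scoring skill, or None if no skill reaches the threshold."""
    scores = match_scores(message)
    skill, score = max(scores.items(), key=lambda kv: kv[1])
    return skill if score >= threshold else None
```

An eval case would then pair a canonical message with the skill expected to load, for example `best_skill("what's wrong with my session")` against `netclaw-diagnostics`, and record the raw score alongside the pass/fail result.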

### 3. Memory Recall Evals
- Seed database with known facts, query with natural language, verify recall
- Test cross-domain widening behavior
- Test recall under 300ms timeout pressure
- Test recall after compaction (do memories about the conversation topic surface?)
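
The seed-then-query flow could be prototyped roughly as below. This is a sketch under stated assumptions: the `memories` table, its columns, the fixture facts, and the naive `LIKE`-based `recall` function are all hypothetical stand-ins for the real memory store and retrieval path. Only the 300 ms budget comes from the list above.

```python
import re
import sqlite3
import time

# Hypothetical fixture facts: (domain, fact).
FIXTURES = [
    ("user", "The user's favorite editor is Helix."),
    ("project", "The staging server runs on port 8443."),
]

TIMEOUT_SECONDS = 0.3  # the 300 ms recall budget


def seed(conn):
    """Create the illustrative schema and load the fixture facts."""
    conn.execute("CREATE TABLE memories (domain TEXT, fact TEXT)")
    conn.executemany("INSERT INTO memories VALUES (?, ?)", FIXTURES)
    conn.commit()


def recall(conn, query):
    """Stand-in retrieval: match any query word longer than three characters."""
    words = [w for w in re.findall(r"[a-z0-9]+", query.lower()) if len(w) > 3]
    hits = []
    for word in words:
        rows = conn.execute(
            "SELECT fact FROM memories WHERE lower(fact) LIKE ?", (f"%{word}%",)
        )
        for (fact,) in rows:
            if fact not in hits:
                hits.append(fact)
    return hits


def run_recall_case(conn, query, expected_substring):
    """Time one recall and check both the content and the latency budget."""
    start = time.perf_counter()
    hits = recall(conn, query)
    elapsed = time.perf_counter() - start
    recalled = any(expected_substring in fact for fact in hits)
    return {
        "recalled": recalled,
        "within_budget": elapsed <= TIMEOUT_SECONDS,
        "elapsed_seconds": elapsed,
    }


conn = sqlite3.connect(":memory:")
seed(conn)
recall_result = run_recall_case(conn, "Which port does staging use?", "8443")
```

Using `:memory:` keeps the example self-contained. A file-backed database seeded from fixture files would exercise the same flow.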

### 4. Compaction Quality Evals
- Feed a known conversation history, compact, verify summary preserves key facts
- Measure information loss (what was in the conversation vs what's in the summary)
- Test that post-compaction turns have sufficient context to continue
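
One simple way to quantify information loss is to annotate each fixture conversation with the key facts it contains and then count how many survive in the summary. The sketch below assumes that annotation exists. The function name `fact_retention` and the substring-based check are illustrative choices, and substring matching will miss paraphrased facts.

```python
def fact_retention(key_facts, summary):
    """Return the fraction of key facts that appear verbatim (case-insensitively) in the summary."""
    text = summary.lower()
    kept = [fact for fact in key_facts if fact.lower() in text]
    lost = [fact for fact in key_facts if fact.lower() not in text]
    retention = len(kept) / len(key_facts) if key_facts else 1.0
    return {"retention": retention, "kept": kept, "lost": lost}


# Hypothetical fixture: facts known to be in the conversation, and a compacted summary.
key_facts = ["port 8443", "deploy on Friday", "rollback plan"]
summary = "The team agreed to deploy on Friday; the staging server uses port 8443."
report = fact_retention(key_facts, summary)
```

Here the summary keeps two of the three facts, so `report["lost"]` names the one that was dropped. A threshold on `retention` would turn this score into a pass/fail result.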

### Infrastructure

- Use `netclaw -p` headless mode for single-shot evaluations
- Allow configuring the inference provider via environment variables (`NETCLAW_EVAL_PROVIDER_URL`, `NETCLAW_EVAL_MODEL`)
- Eval database: separate SQLite file that can be seeded from fixtures
- Eval results: structured output (JSON) with pass/fail per case + scores
- Separate project: `src/Netclaw.Evals/` or `evals/` directory
- Extend existing `MemoryRedesignedEvalSuiteTests` and `MemoryEvalSeedSuiteTests` patterns
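
To make the configuration and output bullets concrete, here is a rough sketch of reading the two environment variables and emitting a JSON report. The variable names come from the list above. The `EvalConfig` type, the empty-string defaults, and the report field names (`summary`, `cases`, `score`) are assumptions, not an agreed schema. Invoking `netclaw -p` itself is left out.

```python
import json
import os
from dataclasses import dataclass


@dataclass
class EvalConfig:
    provider_url: str
    model: str


def load_config(env=None):
    """Read provider settings from the environment (or from a supplied mapping)."""
    env = os.environ if env is None else env
    return EvalConfig(
        provider_url=env.get("NETCLAW_EVAL_PROVIDER_URL", ""),
        model=env.get("NETCLAW_EVAL_MODEL", ""),
    )


def render_report(config, results):
    """Serialize per-case results plus a pass/fail summary as JSON."""
    passed = sum(1 for r in results if r["passed"])
    report = {
        "provider_url": config.provider_url,
        "model": config.model,
        "summary": {
            "total": len(results),
            "passed": passed,
            "failed": len(results) - passed,
        },
        "cases": results,
    }
    return json.dumps(report, indent=2)


# Example with placeholder values; the URL and model name are made up.
config = load_config({
    "NETCLAW_EVAL_PROVIDER_URL": "http://localhost:8080",
    "NETCLAW_EVAL_MODEL": "example-model",
})
report_json = render_report(config, [
    {"case": "knows-own-name", "passed": True, "score": 1.0},
    {"case": "compaction-retention", "passed": False, "score": 0.67},
])
```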

### A/B Testing Support

- Parameterize evals by model, system prompt variant, compaction settings
- Output structured comparison data to enable prompt/config iteration
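
Parameterization could be as simple as taking the cross product of the three axes and ranking variants by pass rate. In the sketch below the model names, prompt variant labels, and the `threshold_tokens` compaction setting are placeholders invented for illustration.

```python
from itertools import product


def build_matrix(models, prompt_variants, compaction_settings):
    """Enumerate every combination of model, prompt variant, and compaction setting."""
    return [
        {"model": m, "prompt_variant": p, "compaction": c}
        for m, p, c in product(models, prompt_variants, compaction_settings)
    ]


def rank_variants(runs):
    """Sort runs by pass rate, best first. Each run is {'variant': ..., 'passed': int, 'total': int}."""
    def pass_rate(run):
        return run["passed"] / run["total"] if run["total"] else 0.0

    return sorted(runs, key=pass_rate, reverse=True)


matrix = build_matrix(
    models=["model-a", "model-b"],
    prompt_variants=["baseline", "short-identity"],
    compaction_settings=[{"threshold_tokens": 8000}],
)

# Hypothetical outcomes for two of the variants.
ranked = rank_variants([
    {"variant": matrix[0], "passed": 7, "total": 10},
    {"variant": matrix[1], "passed": 9, "total": 10},
])
```

Recording the full variant dictionary with each result keeps the comparison output self-describing when it is written as JSON.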

### What Should NOT Require an Inference Provider

Many deterministic components can be tested without an LLM:
- Skill keyword matching (already has `SkillRegistryMatchTests`)
- Memory recall planning (already has `DeterministicRetrievalPlanningTests`)
- Compaction reducer (Phase 1) behavior
- Session state transitions

Only identity alignment and compaction summary quality require actual LLM inference.
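
One possible way to enforce this split is to tag each case with whether it needs inference and skip the tagged ones when no provider is configured. The sketch below is an assumption-laden illustration: the case names and the `needs_llm` flag are hypothetical, and treating an unset `NETCLAW_EVAL_PROVIDER_URL` as "no provider" is a design choice, not established behavior.

```python
import os

# Hypothetical case registry mirroring the split described above.
CASES = [
    {"name": "skill-keyword-match", "needs_llm": False},
    {"name": "recall-planning", "needs_llm": False},
    {"name": "identity-alignment", "needs_llm": True},
    {"name": "compaction-summary-quality", "needs_llm": True},
]


def select_cases(cases, env=None):
    """Split cases into runnable and skipped based on provider availability."""
    env = os.environ if env is None else env
    has_provider = bool(env.get("NETCLAW_EVAL_PROVIDER_URL"))
    runnable = [c["name"] for c in cases if has_provider or not c["needs_llm"]]
    skipped = [c["name"] for c in cases if c["needs_llm"] and not has_provider]
    return runnable, skipped
```

With this shape, the deterministic cases can run on every commit, while the inference-backed cases run only where a provider is available.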
