fix(memory): reduce recall overfetch and formation over-production by Aaronontheweb · Pull Request #567 · netclaw-dev/netclaw

Aaronontheweb · 2026-04-08T02:30:42Z

Summary

Closes #560. Two compounding problems investigated via session signalr/e0a774299a5b40c48d85121e2303785b:

Recall overfetch: the deterministic candidate selector never rejected candidates — rawCount=30, selectedCount=30 on every turn. Baseline-only candidates (score 1.0, zero feature matches) consumed context slots.
Formation over-production: a 5-turn introductory chat produced 7 memories across 2 distillation runs, including semantic duplicates, raw transcript titles, and deeply personal content the specs say should not be stored.

Recall fixes

Raise minimum selector score from >0 to >=2.0 — candidates must have at least one feature match beyond baseline
Cap lexical terms at 12 (anchor-promoted, then longest-first) to prevent 70+ term FTS queries from long messages
Blend SelectorScore + RecallRank/100 instead of discarding relevance and re-sorting purely by memory class
Reduce CandidateLimit from 60/30 to 20/15
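
The lexical term cap could be sketched as follows. This is a minimal illustration, not the code in the PR: the function name `cap_lexical_terms` and the reading of "anchor-promoted" as "terms that match a known anchor sort first" are assumptions, and only the limit of 12 comes from the description above.

```python
MAX_LEXICAL_TERMS = 12  # cap stated in the PR description


def cap_lexical_terms(terms, anchor_terms, limit=MAX_LEXICAL_TERMS):
    """Keep at most `limit` distinct terms: anchor terms first, then longest-first.

    Hypothetical helper; the real selector may promote anchors differently.
    """
    anchors = {t.lower() for t in anchor_terms}
    # Deduplicate case-insensitively while preserving first-seen order.
    distinct = list(dict.fromkeys(t.lower() for t in terms))
    # Sort key: anchor terms first (False sorts before True), then longer terms.
    # sorted() is stable, so equal-length terms keep their original order.
    ranked = sorted(distinct, key=lambda t: (t not in anchors, -len(t)))
    return ranked[:limit]
```

With this ordering, a long message that tokenizes into 70+ terms still yields an FTS query of at most 12 terms, and short stop-word-like tokens are the first to be dropped.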

Formation fixes

Rewrite distillation prompt: curate-over-create, prefer updating existing memories via anchor reuse, yield 1-3 proposals (was 2-5)
Add sensitivity categories: store operational facts, never store credentials/health, skip family dynamics unless explicitly asked
Add title quality rules (no raw transcript, no class prefixes)
Track proposed memory context (anchor+title+content) via MemoriesDistilledV2 event for semantic dedup across distillation runs
Cap accepted proposals at 3 per run in MemoryProposalGate
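
A rough sketch of the per-run cap combined with cross-run deduplication is shown below. Everything except the cap of 3 and the anchor+title+content context is an assumption: `Proposal`, `gate`, and `_fingerprint` are hypothetical names, and the normalized-text comparison is only a crude stand-in for the semantic dedup the PR describes.

```python
from dataclasses import dataclass

MAX_ACCEPTED_PER_RUN = 3  # cap stated in the PR description


@dataclass(frozen=True)
class Proposal:
    anchor: str
    title: str
    content: str


def _fingerprint(p):
    # Stand-in for semantic comparison: case- and whitespace-insensitive text.
    return " ".join(f"{p.title} {p.content}".lower().split())


def gate(proposals, previously_accepted, limit=MAX_ACCEPTED_PER_RUN):
    """Accept at most `limit` proposals, skipping ones already seen in earlier runs."""
    seen = {_fingerprint(p) for p in previously_accepted}
    accepted = []
    for p in proposals:
        if len(accepted) >= limit:
            break
        fp = _fingerprint(p)
        if fp in seen:
            continue  # same title and content under a different anchor slug
        seen.add(fp)
        accepted.append(p)
    return accepted
```

Comparing on title and content rather than on the anchor alone is what lets a later run reject a duplicate that arrives under a new anchor slug.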

Eval updates

Add recall precision cases (kegerator, reddit) with forbiddenRecallIds to test cross-topic noise
Add noise suppression cases (common terms, conversational)
Add memory_recall_filters behavioral case
Fix eval seeding: populate FTS5 index, set boundary/audience columns, dynamic doc cleanup

Eval Results (5 runs per case, local Qwen3.5-27B)

Behavioral Suite — 19/23 (82.6%)

Category: Identity & Self-Awareness     4/4 GREEN  (all 5/5)
Category: Skill Discovery               0/4 RED    (pre-existing)
Category: Memory Pipeline               3/3 GREEN  (all 5/5) ✓
Category: Tool Discovery & Use          4/4 GREEN  (all 5/5)
Category: Grounding & Alignment         3/3 GREEN  (all 5/5)
Category: Autonomy & Execution          2/2 GREEN  (all 5/5)
Category: Complex Task Execution        3/3 GREEN  (all 5/5)

Memory Pipeline: memory_recall_active 5/5, memory_formation 5/5, memory_recall_filters 5/5.
Skill Discovery failures are pre-existing (not related to this PR).

Memory Score Suite — 47.3 (smoke, 5 runs)

Privacy leaks: 0 across all runs (hard gate pass)
Recall hit rate: 30% (impacted by LLM-formed memory accumulation across sequential eval runs — a pre-existing eval infrastructure issue, not a code regression)
Noise suppression (NONCE): 100% pass rate
Score spread: 0.3 (highly consistent)

Test plan

dotnet test — 806 tests pass, 0 failures
Behavioral eval suite (5 runs) — Memory Pipeline GREEN
Memory score suite (5 runs) — 0 privacy leaks
Manual: verify selectedCount < rawCount in daemon logs after code changes
Manual: verify distillation produces ≤3 proposals per run

Recall: the deterministic candidate selector never rejected candidates (baseline 1.0 passed the >0 threshold on every turn). Three fixes:
- Raise minimum selector score to >=2.0 so baseline-only candidates are filtered out
- Cap lexical terms at 12 (anchor-promoted, then longest-first) to prevent long messages from generating 70+ FTS query terms
- Blend SelectorScore with RecallRank/100 instead of discarding relevance and re-sorting purely by memory class

Formation: a 5-turn introductory session produced 7 memories including semantic duplicates and raw transcript titles. Three fixes:
- Rewrite distillation prompt to curate-over-create: prefer updating existing memories via anchor reuse, yield 1-3 proposals (was 2-5), add sensitivity categories and title quality rules
- Track proposed memory context (anchor+title+content) via new MemoriesDistilledV2 event so subsequent runs can deduplicate semantically, not just by anchor slug
- Cap accepted proposals at 3 per distillation run in the gate
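
The recall selector changes in this commit (the score floor and the blended ranking) might look roughly like the sketch below. The 2.0 threshold and the division by 100 come from the commit message; the `Candidate` type, the `select` function, and the assumption that a higher RecallRank means a more relevant document are illustrative only.

```python
from dataclasses import dataclass

MIN_SELECTOR_SCORE = 2.0       # threshold stated in the commit message
RECALL_RANK_DAMPENING = 100.0  # divisor stated in the commit message


@dataclass
class Candidate:
    doc_id: str
    selector_score: float  # 1.0 means baseline only, no feature matches
    recall_rank: float     # assumed: higher means more relevant


def select(candidates, limit):
    """Drop baseline-only candidates, then rank by the blended score."""
    kept = [c for c in candidates if c.selector_score >= MIN_SELECTOR_SCORE]
    # Compute each composite score once, before sorting.
    scored = [(c.selector_score + c.recall_rank / RECALL_RANK_DAMPENING, c) for c in kept]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:limit]]
```

Because the dampened rank only adds a small fraction to the selector score, it mostly acts as a tie-breaker among candidates with the same feature matches.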

Add kegerator and reddit-scanner seed documents to both smoke and realistic fixtures so precision can be tested: querying for one topic should recall the correct doc and not pull in unrelated ones.

New eval cases:
- recall-precision-kegerator / recall-precision-reddit (smoke)
- recall-precision-kegerator-indirect / recall-precision-reddit-indirect (realistic)
- noise-suppression-common-terms (smoke)
- noise-suppression-conversational (realistic)
- memory_recall_filters behavioral case

Update memory-score.py to check forbiddenRecallIds on recall_positive cases (was only checked on privacy cases) and dynamically count seeded documents instead of hardcoding the expected count.
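
A precision check of this kind could be sketched as below. Only the `forbiddenRecallIds` field is named in the commit message; the `expectedRecallIds` field, the `check_recall_case` function, and the document IDs in the usage note are assumptions made for illustration and are not taken from memory-score.py.

```python
def check_recall_case(case, recalled_ids):
    """Pass only if every expected doc was recalled and no forbidden doc was.

    `case` is a dict loaded from an eval fixture (hypothetical shape).
    """
    recalled = set(recalled_ids)
    expected = set(case.get("expectedRecallIds", []))
    forbidden = set(case.get("forbiddenRecallIds", []))
    missing = sorted(expected - recalled)   # recall misses
    leaked = sorted(forbidden & recalled)   # cross-topic noise
    return {"passed": not missing and not leaked, "missing": missing, "leaked": leaked}
```

For example, a kegerator case that expects `seed-kegerator` and forbids `seed-reddit` would fail if both documents came back, even though the expected one was found.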

…ience

The memory eval seeding was inserting documents into memory_documents but not populating the memory_documents_fts virtual table, so deterministic retrieval (which uses FTS5) could not find seeded documents. Also missing boundary and audience columns, causing SQL filter mismatches.

Fixes:
- Insert into memory_documents_fts alongside memory_documents
- Set boundary='boundary:trusted-instance' and audience='public'
- Dynamic cleanup of seeded docs from fixture (was hardcoded to 3 IDs)
- Dynamic seeded doc count for update_correctness scoring
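
The seeding fix can be illustrated with Python's built-in sqlite3 module. The table names and the boundary and audience values come from the commit message; the column layout, the helper names, and the sample documents in the usage note are assumptions, and the sketch requires an SQLite build with FTS5 enabled.

```python
import sqlite3


def create_schema(conn):
    # Hypothetical schema; the real tables almost certainly have more columns.
    conn.execute(
        "CREATE TABLE memory_documents "
        "(id TEXT PRIMARY KEY, title TEXT, content TEXT, boundary TEXT, audience TEXT)"
    )
    conn.execute(
        "CREATE VIRTUAL TABLE memory_documents_fts USING fts5(id UNINDEXED, title, content)"
    )


def seed_documents(conn, docs):
    """Insert each fixture doc into both the base table and the FTS5 index."""
    for doc in docs:
        conn.execute(
            "INSERT INTO memory_documents (id, title, content, boundary, audience) "
            "VALUES (?, ?, ?, ?, ?)",
            (doc["id"], doc["title"], doc["content"], "boundary:trusted-instance", "public"),
        )
        conn.execute(
            "INSERT INTO memory_documents_fts (id, title, content) VALUES (?, ?, ?)",
            (doc["id"], doc["title"], doc["content"]),
        )
    conn.commit()


def find_ids(conn, query):
    """Return IDs of documents whose indexed text matches the FTS5 query."""
    rows = conn.execute(
        "SELECT id FROM memory_documents_fts WHERE memory_documents_fts MATCH ? ORDER BY id",
        (query,),
    )
    return [row[0] for row in rows]


def count_seeded(conn, docs):
    """Count seeded docs with a parameterized query built from the fixture IDs."""
    ids = [d["id"] for d in docs]
    if not ids:
        return 0
    marks = ",".join("?" for _ in ids)
    row = conn.execute(f"SELECT COUNT(*) FROM memory_documents WHERE id IN ({marks})", ids)
    return row.fetchone()[0]


def cleanup_seeded(conn, docs):
    """Delete exactly the fixture's docs from both tables (no hardcoded ID list)."""
    ids = [d["id"] for d in docs]
    if not ids:
        return
    marks = ",".join("?" for _ in ids)
    conn.execute(f"DELETE FROM memory_documents WHERE id IN ({marks})", ids)
    conn.execute(f"DELETE FROM memory_documents_fts WHERE id IN ({marks})", ids)
    conn.commit()
```

Seeding two documents into an in-memory database and calling `find_ids(conn, "kegerator")` should return only the kegerator document, which is the behavior the precision cases rely on.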

…e hot-path waste

Code quality and efficiency fixes from review:
- Name all scoring weights in DeterministicCandidateSelector as constants (BaselineScore, LexicalMatchWeight, FacetMatchWeight, AnchorMatchWeight, SoftScopeWeight, DomainAffinityWeight)
- Eliminate double ToLowerInvariant in Score(); TextTokenizer.Tokenize already lowercases, and the HashSet uses OrdinalIgnoreCase
- Add AppendDocument to MemoryUpdateSemantics enum with wire value mapping, replacing the "append-document" magic string in RecallRank
- Pre-compute composite score before sort to avoid O(n log n) RecallRank calls; name the dampening factor as RecallRankDampeningFactor
- Name CandidateLimit values (RankedCandidateLimit, BundleCandidateLimit)
- Cache BuildDistillationSystemPrompt in a static readonly field
- Use parameterized SQL in eval script seeded doc count query
- Clarify _proposedAnchors as V1 backward-compat state

Aaronontheweb added 4 commits April 7, 2026 22:26

Aaronontheweb marked this pull request as draft April 8, 2026 02:45

fix(memory): persist accepted distillation context

6e2a673

Store only gated distillation proposals in observer dedup state so rejected proposals do not poison future recall, and tighten recall diagnostics to match the runtime ranking and filtering behavior.

Aaronontheweb marked this pull request as ready for review April 8, 2026 19:19

Aaronontheweb merged commit 5314144 into dev Apr 8, 2026
3 checks passed

Aaronontheweb deleted the fix/memory-recall-overfetch-560 branch April 8, 2026 19:19

Aaronontheweb merged 5 commits into dev from fix/memory-recall-overfetch-560
