fix(memory): reduce recall overfetch and formation over-production#567

Merged
Aaronontheweb merged 5 commits into dev from fix/memory-recall-overfetch-560 on Apr 8, 2026
@Aaronontheweb

Collaborator

Summary

Closes #560. Two compounding problems investigated via session signalr/e0a774299a5b40c48d85121e2303785b:

  • Recall overfetch: the deterministic candidate selector never rejected candidates — rawCount=30, selectedCount=30 on every turn. Baseline-only candidates (score 1.0, zero feature matches) consumed context slots.
  • Formation over-production: a 5-turn introductory chat produced 7 memories across 2 distillation runs, including semantic duplicates, raw transcript titles, and deeply personal content the specs say should not be stored.

Recall fixes

  • Raise minimum selector score from >0 to >=2.0 — candidates must have at least one feature match beyond baseline
  • Cap lexical terms at 12 (anchor-promoted, then longest-first) to prevent 70+ term FTS queries from long messages
  • Blend SelectorScore + RecallRank/100 instead of discarding relevance and re-sorting purely by memory class
  • Reduce CandidateLimit from 60/30 to 20/15
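
To illustrate why raising the threshold matters, here is a minimal Python sketch (not the project's C# code). The baseline of 1.0 and the `>= 2.0` threshold come from this PR; the `Candidate` type, the feature fields, and the two weight values are assumptions made up for the example.

```python
from dataclasses import dataclass

BASELINE_SCORE = 1.0        # from the PR: baseline-only candidates score 1.0
MIN_SELECTOR_SCORE = 2.0    # from the PR: new minimum (previously "> 0")
LEXICAL_MATCH_WEIGHT = 1.0  # assumed value, for illustration only
ANCHOR_MATCH_WEIGHT = 2.0   # assumed value, for illustration only


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate shape; the real selector tracks more features."""
    doc_id: str
    lexical_matches: int = 0
    anchor_matches: int = 0


def selector_score(c: Candidate) -> float:
    # Every candidate starts at the baseline; feature matches add on top.
    return (BASELINE_SCORE
            + LEXICAL_MATCH_WEIGHT * c.lexical_matches
            + ANCHOR_MATCH_WEIGHT * c.anchor_matches)


def select(candidates, min_score=MIN_SELECTOR_SCORE):
    # With min_score=2.0, a candidate needs at least one feature match.
    return [c for c in candidates if selector_score(c) >= min_score]


def select_old(candidates):
    # Old behavior: "> 0" is always true because the baseline is 1.0.
    return [c for c in candidates if selector_score(c) > 0]
```

Under these assumed weights, a candidate with zero matches scores exactly 1.0, so it passes the old `> 0` check on every turn but is rejected by the new one.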

Formation fixes

  • Rewrite distillation prompt: curate-over-create, prefer updating existing memories via anchor reuse, yield 1-3 proposals (was 2-5)
  • Add sensitivity categories: store operational facts, never store credentials/health, skip family dynamics unless explicitly asked
  • Add title quality rules (no raw transcript, no class prefixes)
  • Track proposed memory context (anchor+title+content) via MemoriesDistilledV2 event for semantic dedup across distillation runs
  • Cap accepted proposals at 3 per run in MemoryProposalGate
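
The per-run cap can be pictured with a short Python sketch. It is not the actual `MemoryProposalGate`; the `gate` function, the `passes_checks` hook, and the assumption that proposals arrive in priority order are all hypothetical.

```python
MAX_ACCEPTED_PER_RUN = 3  # from the PR: cap accepted proposals at 3 per run


def gate(proposals, passes_checks=lambda p: True, limit=MAX_ACCEPTED_PER_RUN):
    """Split proposals into (accepted, rejected).

    Assumes proposals are already in priority order, so anything past the
    cap is rejected even if it would otherwise pass the checks.
    """
    accepted, rejected = [], []
    for proposal in proposals:
        if len(accepted) < limit and passes_checks(proposal):
            accepted.append(proposal)
        else:
            rejected.append(proposal)
    return accepted, rejected
```

A hard cap in the gate acts as a backstop: even if the model ignores the "yield 1-3 proposals" instruction in the prompt, no more than three memories are accepted from one run.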

Eval updates

  • Add recall precision cases (kegerator, reddit) with forbiddenRecallIds to test cross-topic noise
  • Add noise suppression cases (common terms, conversational)
  • Add memory_recall_filters behavioral case
  • Fix eval seeding: populate FTS5 index, set boundary/audience columns, dynamic doc cleanup

Eval Results (5 runs per case, local Qwen3.5-27B)

Behavioral Suite — 19/23 (82.6%)

Category: Identity & Self-Awareness     4/4 GREEN  (all 5/5)
Category: Skill Discovery               0/4 RED    (pre-existing)
Category: Memory Pipeline               3/3 GREEN  (all 5/5) ✓
Category: Tool Discovery & Use          4/4 GREEN  (all 5/5)
Category: Grounding & Alignment         3/3 GREEN  (all 5/5)
Category: Autonomy & Execution          2/2 GREEN  (all 5/5)
Category: Complex Task Execution        3/3 GREEN  (all 5/5)

Memory Pipeline: memory_recall_active 5/5, memory_formation 5/5, memory_recall_filters 5/5.
Skill Discovery failures are pre-existing (not related to this PR).

Memory Score Suite — 47.3 (smoke, 5 runs)

  • Privacy leaks: 0 across all runs (hard gate pass)
  • Recall hit rate: 30% (impacted by LLM-formed memory accumulation across sequential eval runs — a pre-existing eval infrastructure issue, not a code regression)
  • Noise suppression (NONCE): 100% pass rate
  • Score spread: 0.3 (highly consistent)

Test plan

  • dotnet test — 806 tests pass, 0 failures
  • Behavioral eval suite (5 runs) — Memory Pipeline GREEN
  • Memory score suite (5 runs) — 0 privacy leaks
  • Manual: verify selectedCount < rawCount in daemon logs after code changes
  • Manual: verify distillation produces ≤3 proposals per run

Recall: the deterministic candidate selector never rejected candidates
(baseline 1.0 passed the >0 threshold on every turn). Three fixes:
- Raise minimum selector score to >=2.0 so baseline-only candidates
  are filtered out
- Cap lexical terms at 12 (anchor-promoted, then longest-first) to
  prevent long messages from generating 70+ FTS query terms
- Blend SelectorScore with RecallRank/100 instead of discarding
  relevance and re-sorting purely by memory class
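
The term cap described above could look roughly like this in Python. The limit of 12 and the "anchor-promoted, then longest-first" ordering come from the commit message; the function name, the deduplication step, and the tie-breaking by original order are assumptions.

```python
MAX_LEXICAL_TERMS = 12  # from the PR: cap lexical terms at 12


def cap_lexical_terms(terms, anchor_terms, limit=MAX_LEXICAL_TERMS):
    """Keep at most `limit` terms: anchor terms first, then longest first."""
    anchors = set(anchor_terms)
    unique = list(dict.fromkeys(terms))  # drop duplicates, keep first-seen order
    # Key sorts anchors first (False < True), then by descending length.
    # sorted() is stable, so equal-length terms keep their original order.
    ranked = sorted(unique, key=lambda t: (t not in anchors, -len(t)))
    return ranked[:limit]
```

Preferring longer terms is a cheap proxy for specificity: short words such as "is" or "low" tend to match many documents, while longer ones narrow the FTS query.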

Formation: a 5-turn introductory session produced 7 memories including
semantic duplicates and raw transcript titles. Three fixes:
- Rewrite distillation prompt to curate-over-create: prefer updating
  existing memories via anchor reuse, yield 1-3 proposals (was 2-5),
  add sensitivity categories and title quality rules
- Track proposed memory context (anchor+title+content) via new
  MemoriesDistilledV2 event so subsequent runs can deduplicate
  semantically, not just by anchor slug
- Cap accepted proposals at 3 per distillation run in the gate

Add kegerator and reddit-scanner seed documents to both smoke and
realistic fixtures so precision can be tested — querying for one topic
should recall the correct doc and not pull in unrelated ones.

New eval cases:
- recall-precision-kegerator / recall-precision-reddit (smoke)
- recall-precision-kegerator-indirect / recall-precision-reddit-indirect (realistic)
- noise-suppression-common-terms (smoke)
- noise-suppression-conversational (realistic)
- memory_recall_filters behavioral case

Update memory-score.py to check forbiddenRecallIds on recall_positive
cases (was only checked on privacy cases) and dynamically count seeded
documents instead of hardcoding the expected count.
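
A minimal sketch of what checking `forbiddenRecallIds` on a recall case might look like. This is not the code in memory-score.py: only the `forbiddenRecallIds` field name comes from the PR, while `expectedRecallIds`, the function name, and the result shape are hypothetical.

```python
def score_recall_case(case, recalled_ids):
    """Score one recall case against the IDs the system actually recalled."""
    recalled = set(recalled_ids)
    expected = set(case.get("expectedRecallIds", []))    # hypothetical field
    forbidden = set(case.get("forbiddenRecallIds", []))  # field named in the PR
    leaked = sorted(recalled & forbidden)  # forbidden docs that were recalled
    hit = expected <= recalled             # every expected doc was recalled
    return {"hit": hit, "leaked": leaked, "passed": hit and not leaked}
```

For example, a kegerator query that recalls the kegerator document but also pulls in the reddit-scanner document would count as a hit yet still fail, which is the cross-topic noise the precision cases are meant to catch.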
…ience

The memory eval seeding was inserting documents into memory_documents but
not populating the memory_documents_fts virtual table, so deterministic
retrieval (which uses FTS5) could not find seeded documents. Also missing
boundary and audience columns, causing SQL filter mismatches.

Fixes:
- Insert into memory_documents_fts alongside memory_documents
- Set boundary='boundary:trusted-instance' and audience='public'
- Dynamic cleanup of seeded docs from fixture (was hardcoded to 3 IDs)
- Dynamic seeded doc count for update_correctness scoring
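
The seeding fix can be sketched with Python's built-in `sqlite3` module, assuming the local SQLite build includes FTS5. The table names and the `boundary`/`audience` values come from this commit; the `id`, `title`, and `content` columns and all function names are assumptions, and the real schema is likely richer.

```python
import sqlite3


def create_schema(conn):
    # Hypothetical minimal schema for the sketch.
    conn.execute("CREATE TABLE memory_documents (id TEXT PRIMARY KEY, "
                 "title TEXT, content TEXT, boundary TEXT, audience TEXT)")
    conn.execute("CREATE VIRTUAL TABLE memory_documents_fts "
                 "USING fts5(id UNINDEXED, title, content)")


def seed(conn, docs):
    for d in docs:
        conn.execute(
            "INSERT INTO memory_documents (id, title, content, boundary, audience) "
            "VALUES (?, ?, ?, ?, ?)",
            (d["id"], d["title"], d["content"],
             "boundary:trusted-instance", "public"))
        # Without this second insert, FTS-based retrieval cannot see the doc.
        conn.execute(
            "INSERT INTO memory_documents_fts (id, title, content) VALUES (?, ?, ?)",
            (d["id"], d["title"], d["content"]))
    conn.commit()


def cleanup(conn, docs):
    # IDs come from the fixture itself instead of a hardcoded list.
    ids = [(d["id"],) for d in docs]
    conn.executemany("DELETE FROM memory_documents WHERE id = ?", ids)
    conn.executemany("DELETE FROM memory_documents_fts WHERE id = ?", ids)
    conn.commit()


def seeded_count(conn, docs):
    ids = [d["id"] for d in docs]
    if not ids:
        return 0
    placeholders = ",".join("?" for _ in ids)  # only "?" marks are interpolated
    row = conn.execute(
        f"SELECT COUNT(*) FROM memory_documents WHERE id IN ({placeholders})",
        ids).fetchone()
    return row[0]
```

Deriving both the cleanup list and the expected count from the fixture means new seed documents do not require touching the script.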
…e hot-path waste

Code quality and efficiency fixes from review:

- Name all scoring weights in DeterministicCandidateSelector as constants
  (BaselineScore, LexicalMatchWeight, FacetMatchWeight, AnchorMatchWeight,
  SoftScopeWeight, DomainAffinityWeight)
- Eliminate double ToLowerInvariant in Score() — TextTokenizer.Tokenize
  already lowercases, and the HashSet uses OrdinalIgnoreCase
- Add AppendDocument to MemoryUpdateSemantics enum with wire value mapping,
  replacing the "append-document" magic string in RecallRank
- Pre-compute composite score before sort to avoid O(n log n) RecallRank
  calls; name the dampening factor as RecallRankDampeningFactor
- Name CandidateLimit values (RankedCandidateLimit, BundleCandidateLimit)
- Cache BuildDistillationSystemPrompt in a static readonly field
- Use parameterized SQL in eval script seeded doc count query
- Clarify _proposedAnchors as V1 backward-compat state
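
The pre-computed composite score can be sketched as follows, in Python rather than the project's C#. The blend `SelectorScore + RecallRank/100` is from the PR; the function signature and the assumption that a larger `RecallRank` should sort earlier are hypothetical.

```python
RECALL_RANK_DAMPENING_FACTOR = 100.0  # from the PR: RecallRank / 100


def rank_candidates(candidates, selector_score, recall_rank):
    """Order candidates by selector score blended with a dampened recall rank.

    Each score is computed exactly once per candidate before sorting, so the
    sort compares plain floats instead of re-running recall_rank per comparison.
    """
    scored = [
        (selector_score(c) + recall_rank(c) / RECALL_RANK_DAMPENING_FACTOR, c)
        for c in candidates
    ]
    # Sort on the pre-computed float only; ties keep their input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored]
```

Dividing by 100 keeps the selector score dominant while still letting `RecallRank` separate candidates whose selector scores are close.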
@Aaronontheweb Aaronontheweb marked this pull request as draft April 8, 2026 02:45
Store only gated distillation proposals in observer dedup state so rejected proposals do not poison future recall, and tighten recall diagnostics to match the runtime ranking and filtering behavior.
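
One way to picture the dedup state described in this commit and in the `MemoriesDistilledV2` change is the Python sketch below. The anchor+title+content payload and the V1 anchor-only state come from the PR; the class names, the `record_run` helper, and the idea of rendering prior proposals as prompt lines are assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ProposedMemory:
    anchor: str
    title: str
    content: str


@dataclass
class ObserverState:
    proposed_anchors: set = field(default_factory=set)    # V1-style: anchors only
    proposed_context: dict = field(default_factory=dict)  # V2-style: anchor -> memory


def record_run(state, proposals, accepted_anchors):
    """Record only proposals that passed the gate, so rejected ones leave no trace."""
    for p in proposals:
        if p.anchor in accepted_anchors:
            state.proposed_anchors.add(p.anchor)
            state.proposed_context[p.anchor] = p


def prior_context_lines(state):
    # Hypothetical rendering a later distillation run could use to spot duplicates.
    ordered = sorted(state.proposed_context.values(), key=lambda m: m.anchor)
    return [f"- {m.anchor}: {m.title} | {m.content}" for m in ordered]
```

Keeping title and content alongside the anchor gives a later run enough text to notice that a differently named proposal covers the same fact, which anchor slugs alone cannot reveal.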
@Aaronontheweb Aaronontheweb marked this pull request as ready for review April 8, 2026 19:19
@Aaronontheweb Aaronontheweb merged commit 5314144 into dev Apr 8, 2026
3 checks passed
@Aaronontheweb Aaronontheweb deleted the fix/memory-recall-overfetch-560 branch April 8, 2026 19:19
Development

Successfully merging this pull request may close these issues.

fix(memory): automatic recall overfetches irrelevant memories at scale