fix(memory): reduce recall overfetch and formation over-production #567
Merged
Recall: the deterministic candidate selector never rejected candidates (baseline 1.0 passed the >0 threshold on every turn). Three fixes:
- Raise minimum selector score to >=2.0 so baseline-only candidates are filtered out
- Cap lexical terms at 12 (anchor-promoted, then longest-first) to prevent long messages from generating 70+ FTS query terms
- Blend SelectorScore with RecallRank/100 instead of discarding relevance and re-sorting purely by memory class

Formation: a 5-turn introductory session produced 7 memories including semantic duplicates and raw transcript titles. Three fixes:
- Rewrite distillation prompt to curate-over-create: prefer updating existing memories via anchor reuse, yield 1-3 proposals (was 2-5), add sensitivity categories and title quality rules
- Track proposed memory context (anchor+title+content) via new MemoriesDistilledV2 event so subsequent runs can deduplicate semantically, not just by anchor slug
- Cap accepted proposals at 3 per distillation run in the gate
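The first two recall fixes in this commit message can be sketched roughly as follows. This is a Python illustration, not the C# selector itself: the function names, the `(doc_id, score)` candidate shape, and the alphabetical tie-break are assumptions. Only the 2.0 threshold, the cap of 12, and the anchor-promoted, longest-first ordering come from the commit message.

```python
MIN_SELECTOR_SCORE = 2.0  # from the commit message; baseline-only candidates score 1.0
MAX_LEXICAL_TERMS = 12    # from the commit message


def cap_lexical_terms(terms, anchor_terms, limit=MAX_LEXICAL_TERMS):
    """Keep at most `limit` distinct terms: anchor terms first, then longest-first."""
    anchors = set(anchor_terms)
    unique = list(dict.fromkeys(terms))  # de-duplicate, preserving first-seen order
    # Sort key: anchors before non-anchors, then longer terms first.
    # The final alphabetical tie-break is an assumption, added for determinism.
    ordered = sorted(unique, key=lambda t: (t not in anchors, -len(t), t))
    return ordered[:limit]


def select(candidates, min_score=MIN_SELECTOR_SCORE):
    """Drop candidates below the minimum selector score.

    `candidates` is a list of (doc_id, selector_score) pairs (hypothetical shape).
    """
    return [(doc_id, score) for doc_id, score in candidates if score >= min_score]
```

With the old `>0` threshold every candidate in `select` would survive, because the baseline alone contributes 1.0; at `>=2.0` a candidate needs at least one feature match on top of the baseline.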
Add kegerator and reddit-scanner seed documents to both smoke and realistic fixtures so precision can be tested: querying for one topic should recall the correct doc and not pull in unrelated ones.

New eval cases:
- recall-precision-kegerator / recall-precision-reddit (smoke)
- recall-precision-kegerator-indirect / recall-precision-reddit-indirect (realistic)
- noise-suppression-common-terms (smoke)
- noise-suppression-conversational (realistic)
- memory_recall_filters behavioral case

Update memory-score.py to check forbiddenRecallIds on recall_positive cases (was only checked on privacy cases) and dynamically count seeded documents instead of hardcoding the expected count.
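A precision check of the kind described above might look like the sketch below. It is not the code in memory-score.py: `score_recall_case`, the `expectedRecallIds` key, and the shape of the result are hypothetical; only `forbiddenRecallIds` is named in this commit.

```python
def score_recall_case(case, recalled_ids):
    """Pass only if every expected doc was recalled and no forbidden doc leaked in."""
    recalled = set(recalled_ids)
    expected = set(case.get("expectedRecallIds", []))    # hypothetical key name
    forbidden = set(case.get("forbiddenRecallIds", []))  # key named in the commit
    missing = sorted(expected - recalled)
    leaked = sorted(forbidden & recalled)
    return {"passed": not missing and not leaked, "missing": missing, "leaked": leaked}
```

Reporting `missing` and `leaked` separately keeps a recall failure (the right doc was not found) distinguishable from a precision failure (an unrelated doc was pulled in).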
…ience

The memory eval seeding was inserting documents into memory_documents but not populating the memory_documents_fts virtual table, so deterministic retrieval (which uses FTS5) could not find seeded documents. Also missing boundary and audience columns, causing SQL filter mismatches.

Fixes:
- Insert into memory_documents_fts alongside memory_documents
- Set boundary='boundary:trusted-instance' and audience='public'
- Dynamic cleanup of seeded docs from fixture (was hardcoded to 3 IDs)
- Dynamic seeded doc count for update_correctness scoring
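The seeding fix can be illustrated with Python's built-in sqlite3 module. The schema below is a hypothetical minimal stand-in (the real tables certainly have more columns), and the helper names and fixture shape are assumptions; the table names and the boundary and audience values are taken from the commit message. The sketch also assumes the local SQLite build includes FTS5.

```python
import sqlite3

BOUNDARY = "boundary:trusted-instance"
AUDIENCE = "public"


def create_schema(conn):
    # Hypothetical minimal schema, for illustration only.
    conn.execute(
        "CREATE TABLE memory_documents ("
        "id TEXT PRIMARY KEY, title TEXT, content TEXT, boundary TEXT, audience TEXT)"
    )
    conn.execute(
        "CREATE VIRTUAL TABLE memory_documents_fts USING fts5(id UNINDEXED, title, content)"
    )


def seed_documents(conn, fixture_docs):
    """Insert each fixture doc into both the base table and the FTS index."""
    for doc in fixture_docs:
        conn.execute(
            "INSERT INTO memory_documents (id, title, content, boundary, audience) "
            "VALUES (?, ?, ?, ?, ?)",
            (doc["id"], doc["title"], doc["content"], BOUNDARY, AUDIENCE),
        )
        conn.execute(
            "INSERT INTO memory_documents_fts (id, title, content) VALUES (?, ?, ?)",
            (doc["id"], doc["title"], doc["content"]),
        )


def search(conn, term):
    """Full-text match, then apply the boundary and audience filters."""
    rows = conn.execute(
        "SELECT id FROM memory_documents "
        "WHERE id IN (SELECT id FROM memory_documents_fts WHERE memory_documents_fts MATCH ?) "
        "AND boundary = ? AND audience = ? ORDER BY id",
        (term, BOUNDARY, AUDIENCE),
    ).fetchall()
    return [row[0] for row in rows]


def cleanup_seeded(conn, fixture_docs):
    """Remove whatever the fixture seeded, instead of a hardcoded ID list."""
    ids = [(doc["id"],) for doc in fixture_docs]
    conn.executemany("DELETE FROM memory_documents WHERE id = ?", ids)
    conn.executemany("DELETE FROM memory_documents_fts WHERE id = ?", ids)
```

If the second insert in `seed_documents` is skipped, `search` returns nothing even though the row exists in memory_documents, which matches the failure this commit describes.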
…e hot-path waste

Code quality and efficiency fixes from review:
- Name all scoring weights in DeterministicCandidateSelector as constants (BaselineScore, LexicalMatchWeight, FacetMatchWeight, AnchorMatchWeight, SoftScopeWeight, DomainAffinityWeight)
- Eliminate double ToLowerInvariant in Score(): TextTokenizer.Tokenize already lowercases, and the HashSet uses OrdinalIgnoreCase
- Add AppendDocument to MemoryUpdateSemantics enum with wire value mapping, replacing the "append-document" magic string in RecallRank
- Pre-compute composite score before sort to avoid O(n log n) RecallRank calls; name the dampening factor as RecallRankDampeningFactor
- Name CandidateLimit values (RankedCandidateLimit, BundleCandidateLimit)
- Cache BuildDistillationSystemPrompt in a static readonly field
- Use parameterized SQL in eval script seeded doc count query
- Clarify _proposedAnchors as V1 backward-compat state
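The pre-computed composite score can be sketched as below, in Python, not the project's C#. The dict-shaped candidates, the `recall_rank` callable, and the assumption that a higher composite ranks first are all hypothetical; the formula SelectorScore + RecallRank/100 and the constant's name come from this PR. Python's `sort(key=...)` already evaluates the key once per element, so the explicit list here only mirrors what the C# change does by hand.

```python
RECALL_RANK_DAMPENING_FACTOR = 100.0  # mirrors the RecallRank/100 blend


def rank_candidates(candidates, recall_rank):
    """Sort by SelectorScore + RecallRank / 100, computing each composite once."""
    scored = [
        (c["selector_score"] + recall_rank(c) / RECALL_RANK_DAMPENING_FACTOR, c)
        for c in candidates
    ]
    # Assumption: higher composite scores rank first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored]
```

Computing the composite inside a comparison callback would call `recall_rank` on every comparison, roughly n log n times; building the list first calls it exactly n times.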
Store only gated distillation proposals in observer dedup state so rejected proposals do not poison future recall, and tighten recall diagnostics to match the runtime ranking and filtering behavior.
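One way to picture this change, together with the cap of 3 accepted proposals per run, is the sketch below. The class and function names, the `is_acceptable` predicate, and the choice to keep the first three acceptable proposals in order are assumptions made for illustration; the real gate and observer are C# and may differ.

```python
MAX_ACCEPTED_PER_RUN = 3  # cap on accepted proposals per distillation run


def gate_proposals(proposals, is_acceptable, limit=MAX_ACCEPTED_PER_RUN):
    """Return at most `limit` proposals that pass the gate, in proposal order."""
    accepted = []
    for proposal in proposals:
        if len(accepted) == limit:
            break
        if is_acceptable(proposal):
            accepted.append(proposal)
    return accepted


class ObserverDedupState:
    """Hypothetical dedup state: remembers context of gated proposals only."""

    def __init__(self):
        self.proposed = []

    def record_run(self, accepted):
        # Only proposals that passed the gate are stored, so rejected ones
        # never show up as "already proposed" in later runs.
        for p in accepted:
            self.proposed.append(
                {"anchor": p["anchor"], "title": p["title"], "content": p["content"]}
            )
```

Recording after the gate, not before it, is the point of the change: the dedup state then reflects what was actually accepted.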
Summary
Closes #560. Two compounding problems investigated via session signalr/e0a774299a5b40c48d85121e2303785b: rawCount=30, selectedCount=30 on every turn. Baseline-only candidates (score 1.0, zero feature matches) consumed context slots.

Recall fixes
- Raise the minimum selector score from >0 to >=2.0: candidates must have at least one feature match beyond baseline
- Blend SelectorScore + RecallRank/100 instead of discarding relevance and re-sorting purely by memory class

Formation fixes
- MemoriesDistilledV2 event for semantic dedup across distillation runs
- MemoryProposalGate caps accepted proposals at 3 per distillation run

Eval updates
- Check forbiddenRecallIds to test cross-topic noise
- memory_recall_filters behavioral case

Eval Results (5 runs per case, local Qwen3.5-27B)
Behavioral Suite — 19/23 (82.6%)
Memory Pipeline: memory_recall_active 5/5, memory_formation 5/5, memory_recall_filters 5/5.

Skill Discovery failures are pre-existing (not related to this PR).
Memory Score Suite — 47.3 (smoke, 5 runs)
Test plan
- dotnet test: 806 tests pass, 0 failures
- selectedCount < rawCount in daemon logs after code changes