Description
Summary
When a single session.commit() produces multiple candidate memories about the same entity, the deduplicator fails to detect them as duplicates. This results in near-identical memory files being created in the same category directory.
Root Cause
The dedup flow in `compressor.py` processes candidates sequentially. For each candidate, it:
- Calls `deduplicator.deduplicate(candidate)`, which uses vector search to find similar existing memories
- If the decision is CREATE, calls `extractor.create_memory()`, then `_index_memory()`
- `_index_memory()` enqueues the new memory for async vectorization
The problem: when candidate N+1 runs its dedup vector search, candidate N's vectors have not been indexed yet (still in the async embedding queue). So the search cannot find the just-created memory, and the LLM receives an incomplete picture of existing memories.
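The race can be reduced to a toy model. All names below (`ToyVectorIndex`, `dedup_batch`, etc.) are hypothetical stand-ins for illustration, not the actual OpenViking API:

```python
# Minimal sketch of the race: indexing is deferred to a queue, so the
# search that decides CREATE vs MERGE never sees its own batch-mates.
from collections import deque

class ToyVectorIndex:
    def __init__(self):
        self.entries = []       # indexed (searchable) memories
        self.pending = deque()  # memories awaiting async embedding

    def enqueue(self, memory):  # roughly what _index_memory() does
        self.pending.append(memory)

    def search(self, query):    # only sees already-indexed entries
        return [m for m in self.entries if query in m]

def dedup_batch(index, candidates):
    decisions = []
    for cand in candidates:
        hits = index.search(cand.split()[0])  # crude stand-in for similarity search
        decisions.append("MERGE" if hits else "CREATE")
        if decisions[-1] == "CREATE":
            index.enqueue(cand)  # async: not searchable yet
    return decisions

index = ToyVectorIndex()
batch = ["AcmeCorp org structure", "AcmeCorp business model", "AcmeCorp strategy"]
print(dedup_batch(index, batch))
# → ['CREATE', 'CREATE', 'CREATE']  (expected: 1 CREATE, then MERGEs)
```

Every candidate misses its batch-mates because the vectors are still in `pending` when the next search runs.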
Reproduction
- Have a conversation that covers multiple aspects of a single entity (e.g., a company's org structure, business model, and strategy)
- Commit the session
- The extractor generates multiple candidate memories (e.g., 3-4 candidates about the same subject)
- Observe: instead of 1 merged memory, you get 3-4 near-duplicate files in `viking://user/.../memories/entities/`
Observed Behavior
After a single commit, the entities directory contained 4 memory files with 80%+ overlapping content about the same subject. The extraction log showed:
`Memory extraction: created=1, merged=4, deleted=0, skipped=0`
Yet 4 separate entity files remained, each with slightly different detail but the same core information.
Contributing Factor
When the embedding pipeline has previously failed (e.g., due to oversized input — see #686), the vector index accumulates corrupted entries (`Candidate data is None for label index N, skipping`). This further reduces dedup's ability to find existing memories, compounding the duplicate problem.
Expected Behavior
Within a single commit batch, dedup should be aware of candidates already processed in the same batch, by one of the following:
- Maintaining an in-memory index of candidates processed so far in the current batch
- Synchronously indexing each memory before processing the next candidate
- Or performing intra-batch dedup before the vector search step
Suggested Fix
Option A: In-memory batch dedup — Before running vector search, compare the current candidate against all previously processed candidates in the same batch (e.g., via embedding cosine similarity or LLM comparison).
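A sketch of what Option A could look like. `embed()` and `merge()` are placeholders for whatever the pipeline actually uses (the real merge would likely be an LLM consolidation call), not the OpenViking API:

```python
# Hypothetical Option A: check each candidate against the batch's own
# embeddings before (or instead of) the vector-index search.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def merge(a, b):
    # placeholder: real merging would likely be an LLM consolidation call
    return a + "\n" + b

def batch_dedup(candidates, embed, threshold=0.85):
    """Compare each candidate against batch-mates kept so far."""
    kept, kept_vecs = [], []
    for cand in candidates:
        vec = embed(cand)
        dup_of = next((i for i, v in enumerate(kept_vecs)
                       if cosine(vec, v) >= threshold), None)
        if dup_of is None:
            kept.append(cand)
            kept_vecs.append(vec)
        else:
            kept[dup_of] = merge(kept[dup_of], cand)
    return kept
```

This costs one embedding per candidate (already paid today) plus O(batch²) cosine comparisons, which is negligible for typical batch sizes of 3-10 candidates.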
Option B: Synchronous vectorization within batch — Wait for each memory's embedding to complete before processing the next candidate in the same batch.
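A sketch of Option B, assuming the embedding job can return a future that the batch loop blocks on. `Index`, `process_batch`, `embed`, and `decide` are illustrative names, not the actual OpenViking API:

```python
# Hypothetical Option B: keep embedding on a worker pool, but wait for
# each future inside the batch loop so candidate N's vector is
# searchable before candidate N+1 is deduplicated.
from concurrent.futures import ThreadPoolExecutor

class Index:
    def __init__(self):
        self.vectors = {}  # memory text -> embedding

    def add(self, memory, vec):
        self.vectors[memory] = vec

def process_batch(index, candidates, embed, decide):
    pool = ThreadPoolExecutor(max_workers=2)
    try:
        for cand in candidates:
            if decide(index, cand) == "CREATE":
                future = pool.submit(embed, cand)  # still async, as today
                index.add(cand, future.result())   # but wait within the batch
    finally:
        pool.shutdown()
```

The trade-off is latency: each commit now serializes on embedding calls, which matters more with a remote embedding backend than with local Ollama.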
Option C: Pre-merge candidates — After extraction but before dedup, group candidates by semantic similarity and merge them into consolidated candidates.
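A sketch of Option C as a greedy clustering pass. `similar` is a placeholder (embedding cosine or an LLM comparison in practice), and the concatenation-based consolidation is deliberately naive:

```python
# Hypothetical Option C: greedily cluster candidates before dedup, then
# emit one consolidated candidate per cluster.
def pre_merge(candidates, similar):
    groups = []
    for cand in candidates:
        for group in groups:
            if similar(group[0], cand):  # compare to the group's representative
                group.append(cand)
                break
        else:
            groups.append([cand])
    # naive consolidation; an LLM merge prompt per group would fit here too
    return ["\n\n".join(g) for g in groups]
```

This keeps the existing dedup flow untouched: it just receives fewer, already-consolidated candidates, so the async-indexing race never gets a chance to manifest within a batch.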
Environment
- OpenViking: latest via pipx (as of 2026-03-17)
- Embedding: Ollama nomic-embed-text (local)
- VLM: Alibaba Qwen (via OpenAI-compatible API)
- Integration: OpenClaw memory-openviking plugin
Related Issues
- [Bug]: Memory commit triggers oversized embedding input → unhandled exception hangs uvicorn #686 — Embedding overflow causes vector index corruption, which worsens dedup recall
- Memory extraction triggers O(n²) semantic reprocessing — token cost grows quadratically with memory count #505 — O(n²) semantic reprocessing (same commit pipeline)
- Embedding truncation and chunking should have clearer responsibilities across memory, file, and directory vectorization #531 — Embedding truncation/chunking responsibilities