[Bug]: Dedup misses duplicates within same commit batch due to async vectorization gap #687

@laofahai

Summary

When a single session.commit() produces multiple candidate memories about the same entity, the deduplicator fails to detect them as duplicates. This results in near-identical memory files being created in the same category directory.

Root Cause

The dedup flow in compressor.py processes candidates sequentially. For each candidate, it:

  1. Calls deduplicator.deduplicate(candidate) which uses vector search to find similar existing memories
  2. If the decision is CREATE, calls extractor.create_memory() then _index_memory()
  3. _index_memory() enqueues the new memory for async vectorization

The problem: when candidate N+1 runs its dedup vector search, candidate N's vectors have not been indexed yet (still in the async embedding queue). So the search cannot find the just-created memory, and the LLM receives an incomplete picture of existing memories.
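The race can be illustrated with a minimal sketch. All class and function names here are hypothetical stand-ins, not the actual `compressor.py` API; the point is only that search consults the vector index while creation merely enqueues:

```python
class AsyncIndexGap:
    """Toy store illustrating the dedup race: created memories are only
    *enqueued* for embedding, so the next candidate's vector search
    cannot see them yet. Names are illustrative, not the real API."""

    def __init__(self):
        self.vector_index = {}  # memory_id -> embedding (visible to search)
        self.embed_queue = []   # memories still awaiting async vectorization

    def dedup_search(self, candidate):
        # Only memories whose embeddings have landed in the index are found.
        return list(self.vector_index)

    def create_and_enqueue(self, candidate):
        # Mirrors _index_memory(): the memory file exists, but its vector
        # sits in the async queue, invisible to the next search.
        self.embed_queue.append(candidate)


def process_batch(candidates, store):
    created = []
    for cand in candidates:
        hits = store.dedup_search(cand)  # misses everything still queued
        if not hits:                     # decision looks like CREATE
            store.create_and_enqueue(cand)
            created.append(cand)
    return created


# Four near-duplicate candidates about the same entity:
dupes = ["acme org chart", "acme business model", "acme strategy", "acme overview"]
print(len(process_batch(dupes, AsyncIndexGap())))  # 4 — none deduped
```

Because `vector_index` never gains an entry during the batch, every candidate "looks new" and all four are created.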

Reproduction

  1. Have a conversation that covers multiple aspects of a single entity (e.g., a company's org structure, business model, and strategy)
  2. Commit the session
  3. The extractor generates multiple candidate memories (e.g., 3-4 entity candidates about the same subject)
  4. Observe: instead of 1 merged memory, you get 3-4 near-duplicate files in viking://user/.../memories/entities/

Observed Behavior

After a single commit, the entities directory contained 4 memory files with 80%+ overlapping content about the same subject. The extraction log showed:

Memory extraction: created=1, merged=4, deleted=0, skipped=0

Yet 4 separate entity files remained, each with slightly different detail but the same core information.

Contributing Factor

When the embedding pipeline has previously failed (e.g., due to oversized input — see #686), the vector index accumulates corrupted entries (Candidate data is None for label index N, skipping). This further reduces dedup's ability to find existing memories, compounding the duplicate problem.

Expected Behavior

Within a single commit batch, dedup should be aware of candidates already processed in the same batch, via one of:

  • Maintaining an in-memory index of candidates processed so far in the current batch
  • Synchronously indexing each memory before processing the next candidate
  • Performing intra-batch dedup before the vector search step

Suggested Fix

Option A: In-memory batch dedup — Before running vector search, compare the current candidate against all previously processed candidates in the same batch (e.g., via embedding cosine similarity or LLM comparison).
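A minimal sketch of Option A, assuming each candidate's embedding is already available (the threshold, pair format, and function names are illustrative):

```python
import math


def cosine(a, b):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def batch_dedup(candidates, threshold=0.9):
    """Compare each candidate against those already accepted in this
    batch, before any vector-index search. `candidates` is a list of
    (text, embedding) pairs; the threshold is illustrative."""
    accepted = []
    for text, emb in candidates:
        is_dup = any(cosine(emb, e) >= threshold for _, e in accepted)
        if not is_dup:
            accepted.append((text, emb))
        # else: route to the merge path instead of CREATE
    return accepted


cands = [
    ("acme org structure", [1.0, 0.1, 0.0]),
    ("acme org chart",     [0.98, 0.12, 0.01]),  # near-duplicate
    ("weather note",       [0.0, 0.0, 1.0]),
]
print([t for t, _ in batch_dedup(cands)])  # ['acme org structure', 'weather note']
```

In practice the comparison could also be an LLM judgment rather than raw cosine similarity, as the option above suggests.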

Option B: Synchronous vectorization within batch — Wait for each memory's embedding to complete before processing the next candidate in the same batch.
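Option B can be sketched with `asyncio`: await each embedding and write it into the index before the next candidate runs its search. The embedding call and index shape are stand-ins, not the real pipeline:

```python
import asyncio


async def embed(text):
    # Stand-in for the async embedding call (e.g., a local Ollama request).
    await asyncio.sleep(0)      # simulate I/O
    return [float(len(text))]   # dummy vector


async def process_batch_sync_index(candidates, index):
    """Option B sketch: block on each memory's embedding so the next
    candidate's dedup search sees it. Names are hypothetical."""
    for cand in candidates:
        visible = list(index)   # search now sees all prior batch memories
        if cand not in visible: # placeholder for the real dedup decision
            index[cand] = await embed(cand)  # wait until indexed
    return index


index = asyncio.run(process_batch_sync_index(["a", "b", "a"], {}))
print(sorted(index))  # ['a', 'b'] — the duplicate 'a' was visible and skipped
```

The trade-off is latency: each candidate now pays the embedding round-trip before the batch can continue.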

Option C: Pre-merge candidates — After extraction but before dedup, group candidates by semantic similarity and merge them into consolidated candidates.
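Option C could look like a greedy grouping pass between extraction and dedup. The similarity function is pluggable (embedding cosine, LLM comparison), and the trivial string join below stands in for a real LLM merge; all names are illustrative:

```python
def group_candidates(candidates, sim, threshold=0.9):
    """Option C sketch: greedily group candidates whose similarity to a
    group's first member exceeds the threshold, then merge each group
    into one consolidated candidate."""
    groups = []
    for cand in candidates:
        for group in groups:
            if sim(cand, group[0]) >= threshold:
                group.append(cand)
                break
        else:
            groups.append([cand])
    # Merge each group; a plain join stands in for an LLM consolidation.
    return [" / ".join(group) for group in groups]


# Toy similarity: candidates sharing a leading token count as similar.
sim = lambda a, b: 1.0 if a.split()[0] == b.split()[0] else 0.0
merged = group_candidates(
    ["acme org", "acme model", "acme plan", "unrelated note"], sim)
print(merged)  # ['acme org / acme model / acme plan', 'unrelated note']
```

This keeps the downstream dedup-vs-existing-memories flow unchanged: it only ever sees one consolidated candidate per subject.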

Environment

  • OpenViking: latest via pipx (as of 2026-03-17)
  • Embedding: Ollama nomic-embed-text (local)
  • VLM: Alibaba Qwen (via OpenAI-compatible API)
  • Integration: OpenClaw memory-openviking plugin
