[Bug]: Dedup misses duplicates within same commit batch due to async vectorization gap #687

@laofahai

Summary

When a single session.commit() produces multiple candidate memories about the same entity, the deduplicator fails to detect them as duplicates. This results in near-identical memory files being created in the same category directory.

Root Cause

The dedup flow in compressor.py processes candidates sequentially. For each candidate, it:

  1. Calls deduplicator.deduplicate(candidate) which uses vector search to find similar existing memories
  2. If the decision is CREATE, calls extractor.create_memory() then _index_memory()
  3. _index_memory() enqueues the new memory for async vectorization

The problem: when candidate N+1 runs its dedup vector search, candidate N's vectors have not been indexed yet (still in the async embedding queue). So the search cannot find the just-created memory, and the LLM receives an incomplete picture of existing memories.
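The race can be illustrated with a minimal sketch. All class and function names here are hypothetical stand-ins, not the actual `compressor.py` API; the point is only that search consults the vector index while creation merely enqueues:

```python
class AsyncIndexGap:
    """Toy store illustrating the dedup race: created memories are only
    *enqueued* for embedding, so the next candidate's vector search
    cannot see them yet. Names are illustrative, not the real API."""

    def __init__(self):
        self.vector_index = {}  # memory_id -> embedding (visible to search)
        self.embed_queue = []   # memories still awaiting async vectorization

    def dedup_search(self, candidate):
        # Only memories whose embeddings have landed in the index are found.
        return list(self.vector_index)

    def create_and_enqueue(self, candidate):
        # Mirrors _index_memory(): the memory file exists, but its vector
        # sits in the async queue, invisible to the next search.
        self.embed_queue.append(candidate)


def process_batch(candidates, store):
    created = []
    for cand in candidates:
        hits = store.dedup_search(cand)  # misses everything still queued
        if not hits:                     # decision looks like CREATE
            store.create_and_enqueue(cand)
            created.append(cand)
    return created


# Four near-duplicate candidates about the same entity:
dupes = ["acme org chart", "acme business model", "acme strategy", "acme overview"]
print(len(process_batch(dupes, AsyncIndexGap())))  # 4 — none deduped
```

Because `vector_index` never gains an entry during the batch, every candidate "looks new" and all four are created.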

Reproduction

  1. Have a conversation that covers multiple aspects of a single entity (e.g., a company's org structure, business model, and strategy)
  2. Commit the session
  3. The extractor generates multiple candidate memories (e.g., 3-4 entity candidates about the same subject)
  4. Observe: instead of 1 merged memory, you get 3-4 near-duplicate files in viking://user/.../memories/entities/

Observed Behavior

After a single commit, the entities directory contained 4 memory files with 80%+ overlapping content about the same subject. The extraction log showed:

Memory extraction: created=1, merged=4, deleted=0, skipped=0

Yet 4 separate entity files remained, each with slightly different detail but the same core information.

Contributing Factor

When the embedding pipeline has previously failed (e.g., due to oversized input — see #686), the vector index accumulates corrupted entries (Candidate data is None for label index N, skipping). This further reduces dedup's ability to find existing memories, compounding the duplicate problem.

Expected Behavior

Within a single commit batch, dedup should be aware of candidates already processed in the same batch, via one of:

  • Maintaining an in-memory index of candidates processed so far in the current batch
  • Synchronously indexing each memory before processing the next candidate
  • Performing intra-batch dedup before the vector search step

Suggested Fix

Option A: In-memory batch dedup — Before running vector search, compare the current candidate against all previously processed candidates in the same batch (e.g., via embedding cosine similarity or LLM comparison).
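A minimal sketch of Option A, assuming each candidate's embedding is already available (the threshold, pair format, and function names are illustrative):

```python
import math


def cosine(a, b):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def batch_dedup(candidates, threshold=0.9):
    """Compare each candidate against those already accepted in this
    batch, before any vector-index search. `candidates` is a list of
    (text, embedding) pairs; the threshold is illustrative."""
    accepted = []
    for text, emb in candidates:
        is_dup = any(cosine(emb, e) >= threshold for _, e in accepted)
        if not is_dup:
            accepted.append((text, emb))
        # else: route to the merge path instead of CREATE
    return accepted


cands = [
    ("acme org structure", [1.0, 0.1, 0.0]),
    ("acme org chart",     [0.98, 0.12, 0.01]),  # near-duplicate
    ("weather note",       [0.0, 0.0, 1.0]),
]
print([t for t, _ in batch_dedup(cands)])  # ['acme org structure', 'weather note']
```

In practice the comparison could also be an LLM judgment rather than raw cosine similarity, as the option above suggests.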

Option B: Synchronous vectorization within batch — Wait for each memory's embedding to complete before processing the next candidate in the same batch.
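Option B can be sketched with `asyncio`: await each embedding and write it into the index before the next candidate runs its search. The embedding call and index shape are stand-ins, not the real pipeline:

```python
import asyncio


async def embed(text):
    # Stand-in for the async embedding call (e.g., a local Ollama request).
    await asyncio.sleep(0)      # simulate I/O
    return [float(len(text))]   # dummy vector


async def process_batch_sync_index(candidates, index):
    """Option B sketch: block on each memory's embedding so the next
    candidate's dedup search sees it. Names are hypothetical."""
    for cand in candidates:
        visible = list(index)   # search now sees all prior batch memories
        if cand not in visible: # placeholder for the real dedup decision
            index[cand] = await embed(cand)  # wait until indexed
    return index


index = asyncio.run(process_batch_sync_index(["a", "b", "a"], {}))
print(sorted(index))  # ['a', 'b'] — the duplicate 'a' was visible and skipped
```

The trade-off is latency: each candidate now pays the embedding round-trip before the batch can continue.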

Option C: Pre-merge candidates — After extraction but before dedup, group candidates by semantic similarity and merge them into consolidated candidates.
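Option C could look like a greedy grouping pass between extraction and dedup. The similarity function is pluggable (embedding cosine, LLM comparison), and the trivial string join below stands in for a real LLM merge; all names are illustrative:

```python
def group_candidates(candidates, sim, threshold=0.9):
    """Option C sketch: greedily group candidates whose similarity to a
    group's first member exceeds the threshold, then merge each group
    into one consolidated candidate."""
    groups = []
    for cand in candidates:
        for group in groups:
            if sim(cand, group[0]) >= threshold:
                group.append(cand)
                break
        else:
            groups.append([cand])
    # Merge each group; a plain join stands in for an LLM consolidation.
    return [" / ".join(group) for group in groups]


# Toy similarity: candidates sharing a leading token count as similar.
sim = lambda a, b: 1.0 if a.split()[0] == b.split()[0] else 0.0
merged = group_candidates(
    ["acme org", "acme model", "acme plan", "unrelated note"], sim)
print(merged)  # ['acme org / acme model / acme plan', 'unrelated note']
```

This keeps the downstream dedup-vs-existing-memories flow unchanged: it only ever sees one consolidated candidate per subject.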

Environment

  • OpenViking: latest via pipx (as of 2026-03-17)
  • Embedding: Ollama nomic-embed-text (local)
  • VLM: Alibaba Qwen (via OpenAI-compatible API)
  • Integration: OpenClaw memory-openviking plugin
