
fix(session): add batch-internal dedup to prevent duplicates within same commit (#687)#701

Merged
qin-ctx merged 1 commit into volcengine:main from Astro-Han:worktree-fix-batch-dedup-687 on Mar 17, 2026

Conversation

@Astro-Han
Contributor

Description

When session.commit() produces multiple candidate memories about the same entity, dedup fails to detect duplicates because earlier candidates' embeddings are still in the async vectorization queue and not yet indexed. This PR adds in-memory batch tracking so candidates within the same batch are compared against each other via cosine similarity before consulting the vector DB.
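The batch-internal comparison described above can be sketched as follows. This is an illustrative sketch, not the actual OpenViking implementation: the function names, the `(vector, memory)` tuple shape, and the `0.85` threshold are assumptions; only the idea of comparing a candidate against earlier in-batch embeddings via cosine similarity comes from the PR.

```python
import math
from typing import List, Tuple

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Plain cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def find_similar_in_batch(
    query_vector: List[float],
    batch_memories: List[Tuple[List[float], object]],
    threshold: float = 0.85,
) -> List[object]:
    """Compare a candidate's embedding against earlier candidates from the
    same commit batch, before any of them have been indexed by the async
    vectorization queue."""
    return [
        memory
        for vector, memory in batch_memories
        if cosine_similarity(query_vector, vector) >= threshold
    ]
```

Matches found this way would then be handed to the same downstream LLM decision path as hits from the vector DB.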

Related Issue

Fixes #687

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test update

Changes Made

  • Add batch_memories parameter to MemoryDeduplicator.deduplicate() and _find_similar_memories() for batch-internal cosine comparison
  • Return query_vector in DedupResult so compressor can track embeddings
  • Maintain batch_memories list in SessionCompressor.extract_long_term_memories() loop: append after CREATE, remove after DELETE or MERGE
  • Use copy.copy() for batch Context to preserve tenant metadata (account_id, owner_space, etc.)
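A minimal sketch of how the tracking list in the loop above might be maintained. Only `query_vector`, the append-after-CREATE / remove-after-DELETE-or-MERGE rules, and the use of `copy.copy()` for tenant metadata come from the PR description; `Memory`, `removed_uris`, and the rest of the glue are simplified stand-ins.

```python
import copy
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Memory:
    """Stand-in for a stored Context; only the fields this sketch needs."""
    uri: str
    account_id: str  # tenant metadata that copy.copy() must preserve

@dataclass
class DedupResult:
    """Mirrors the PR's DedupResult: query_vector is returned so the
    compressor can track the candidate's embedding."""
    query_vector: List[float]
    created: Optional[Memory] = None  # memory CREATEd for this candidate
    removed_uris: List[str] = field(default_factory=list)  # DELETE/MERGE targets

def apply_dedup_result(
    batch_memories: List[Tuple[List[float], Memory]],
    result: DedupResult,
) -> List[Tuple[List[float], Memory]]:
    """Update the in-batch tracking list after one candidate is processed:
    drop entries whose memory was DELETEd or MERGEd away, then append the
    newly CREATEd memory with its embedding so later candidates in the same
    commit can compare against it before the vector DB catches up."""
    kept = [
        (vec, mem) for vec, mem in batch_memories
        if mem.uri not in result.removed_uris
    ]
    if result.created is not None:
        # copy.copy() keeps tenant metadata (account_id, owner_space, ...)
        # intact on the batch-local entry without sharing mutable state.
        kept.append((result.query_vector, copy.copy(result.created)))
    return kept
```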

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have tested this on the following platforms:
    • macOS

New tests (6):

  • test_find_similar_includes_batch_memories - batch memory with high cosine similarity appears in results
  • test_find_similar_excludes_dissimilar_batch_memories - opposite-direction vector is excluded
  • test_find_similar_deduplicates_batch_and_db_by_uri - DB+batch URI overlap keeps DB version only
  • test_deduplicate_returns_query_vector_in_result - DedupResult carries embedding for batch tracking
  • test_batch_dedup_passes_batch_memories_to_deduplicate - compressor correctly threads batch_memories
  • test_batch_dedup_real_cosine_path - end-to-end with _llm_decision spy confirming batch match triggers LLM

Checklist

  • My code follows the project's coding style
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings

Additional Notes

  • All new parameters use keyword-only with None default, fully backward compatible
  • Profile/Tool/Skill categories intentionally skip batch tracking (they bypass dedup entirely)
  • After MERGE, stale batch entries are removed but not re-added (would require extra embedding API call); subsequent candidates rely on async vector DB for merged memories

Collaborator

@qin-ctx left a comment

Clean fix for #687. Batch-internal cosine dedup effectively addresses the async vectorization gap. Well tested with 6 new test cases. Two minor non-blocking suggestions below.

# We intentionally do NOT re-add the merged memory because
# we lack its updated embedding vector (would need an extra
# API call). A 3rd+ candidate in the same batch will rely
# on the async vector DB for this memory.

[Design] (non-blocking) After MERGE, the stale batch entry is correctly removed, but the merged memory is not re-added because its updated embedding is unknown. This means if a 3rd+ candidate in the same batch is similar to the merged memory, it won't find it in either batch_memories or the async vector DB.

This trade-off is reasonable and well-documented here. If the 3+ duplicate scenario turns out to be common in practice, one option would be to re-embed the merged content synchronously (one extra embedder call per merge) and re-add to batch_memories:

# After successful merge:
new_text = f"{target_memory.abstract} {payload.content}"
new_embed = self.deduplicator.embedder.embed(new_text)
batch_memories.append((new_embed.dense_vector, target_memory))

  candidate: CandidateMemory
- similar_memories: List[Context]  # Similar existing memories
+ similar_memories: list[Context]  # Similar existing memories
  actions: Optional[List[ExistingMemoryAction]] = None

[Suggestion] (non-blocking) The modified fields now use PEP 604 syntax (list[X], X | None) while the rest of the file still uses typing.List, typing.Optional, etc. (e.g. _cosine_similarity(vec_a: List[float], ...)). Consider either keeping the old style for consistency within this PR, or updating the whole file in a separate commit.

…ame commit (volcengine#687)

When session.commit() produces multiple candidate memories about the same
entity, dedup misses duplicates because earlier candidates' embeddings are
still in the async vectorization queue. Add in-memory batch tracking so
candidates within the same batch are compared via cosine similarity.

- Add batch_memories parameter to deduplicate() / _find_similar_memories()
- Return query_vector in DedupResult for batch tracking
- Maintain batch_memories in extract_long_term_memories loop: append after
  CREATE, remove after DELETE or MERGE
- Use copy.copy() for batch Context to preserve tenant metadata
@Astro-Han force-pushed the worktree-fix-batch-dedup-687 branch from ecb3e75 to c134d9e on March 17, 2026 at 12:43
@Astro-Han
Contributor Author

@qin-ctx Thanks for the thorough review! Both suggestions adopted:

  1. Re-embed after MERGE: After successful merge, we now re-embed the merged content and re-add to batch_memories, so 3rd+ candidates in the same batch can still find the merged memory. Used action.memory.abstract (updated by _merge_into_existing) + candidate.content as the embedding text.

  2. Type annotation style: Reverted DedupResult's existing fields to their original List[Context] / Optional[List[ExistingMemoryAction]] style. New additions keep PEP 604 to match the file's forward direction (e.g. _llm_decision return type).
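The adopted re-embed step might look roughly like this. `abstract`, `content`, `embed()`, and `dense_vector` follow the reviewer's snippet and the reply above; the function wrapper and parameter names are hypothetical.

```python
from typing import List, Tuple

def readd_after_merge(
    batch_memories: List[Tuple[List[float], object]],
    merged_memory,   # action.memory, whose .abstract was updated by the merge
    candidate,       # the candidate just merged in, exposing .content
    embedder,        # object exposing embed(text) -> result with .dense_vector
) -> List[Tuple[List[float], object]]:
    """Re-embed the merged content and put it back into batch_memories so
    3rd+ candidates in the same commit can still match the merged memory."""
    merged_text = f"{merged_memory.abstract} {candidate.content}"
    embedding = embedder.embed(merged_text)  # one extra embedder call per merge
    batch_memories.append((embedding.dense_vector, merged_memory))
    return batch_memories
```

This trades one synchronous embedder call per merge for correct dedup of 3rd+ duplicates in the same batch.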

@qin-ctx merged commit 1e475c2 into volcengine:main on Mar 17, 2026
5 of 6 checks passed
github-project-automation bot moved this from Backlog to Done in OpenViking project on Mar 17, 2026
ZaynJarvis pushed a commit to ZaynJarvis/OpenViking that referenced this pull request Mar 17, 2026
…ame commit (volcengine#687) (volcengine#701)

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[Bug]: Dedup misses duplicates within same commit batch due to async vectorization gap

2 participants