
fix(session): add batch-internal dedup to prevent duplicates within same commit (#687)#701

Merged
qin-ctx merged 1 commit into volcengine:main from Astro-Han:worktree-fix-batch-dedup-687 on Mar 17, 2026

Conversation

@Astro-Han
Contributor

Description

When session.commit() produces multiple candidate memories about the same entity, dedup fails to detect duplicates because earlier candidates' embeddings are still in the async vectorization queue and not yet indexed. This PR adds in-memory batch tracking so candidates within the same batch are compared against each other via cosine similarity before consulting the vector DB.
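The batch-internal comparison described above can be sketched as follows. This is an illustrative sketch, not the actual OpenViking implementation: the function names, the `(vector, memory)` tuple shape, and the `0.85` threshold are assumptions; only the idea of comparing a candidate against earlier in-batch embeddings via cosine similarity comes from the PR.

```python
import math
from typing import List, Tuple

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Plain cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def find_similar_in_batch(
    query_vector: List[float],
    batch_memories: List[Tuple[List[float], object]],
    threshold: float = 0.85,
) -> List[object]:
    """Compare a candidate's embedding against earlier candidates from the
    same commit batch, before any of them have been indexed by the async
    vectorization queue."""
    return [
        memory
        for vector, memory in batch_memories
        if cosine_similarity(query_vector, vector) >= threshold
    ]
```

Matches found this way would then be handed to the same downstream LLM decision path as hits from the vector DB.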

Related Issue

Fixes #687

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test update

Changes Made

  • Add batch_memories parameter to MemoryDeduplicator.deduplicate() and _find_similar_memories() for batch-internal cosine comparison
  • Return query_vector in DedupResult so compressor can track embeddings
  • Maintain batch_memories list in SessionCompressor.extract_long_term_memories() loop: append after CREATE, remove after DELETE or MERGE
  • Use copy.copy() for batch Context to preserve tenant metadata (account_id, owner_space, etc.)
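A minimal sketch of how the tracking list in the loop above might be maintained. Only `query_vector`, the append-after-CREATE / remove-after-DELETE-or-MERGE rules, and the use of `copy.copy()` for tenant metadata come from the PR description; `Memory`, `removed_uris`, and the rest of the glue are simplified stand-ins.

```python
import copy
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Memory:
    """Stand-in for a stored Context; only the fields this sketch needs."""
    uri: str
    account_id: str  # tenant metadata that copy.copy() must preserve

@dataclass
class DedupResult:
    """Mirrors the PR's DedupResult: query_vector is returned so the
    compressor can track the candidate's embedding."""
    query_vector: List[float]
    created: Optional[Memory] = None  # memory CREATEd for this candidate
    removed_uris: List[str] = field(default_factory=list)  # DELETE/MERGE targets

def apply_dedup_result(
    batch_memories: List[Tuple[List[float], Memory]],
    result: DedupResult,
) -> List[Tuple[List[float], Memory]]:
    """Update the in-batch tracking list after one candidate is processed:
    drop entries whose memory was DELETEd or MERGEd away, then append the
    newly CREATEd memory with its embedding so later candidates in the same
    commit can compare against it before the vector DB catches up."""
    kept = [
        (vec, mem) for vec, mem in batch_memories
        if mem.uri not in result.removed_uris
    ]
    if result.created is not None:
        # copy.copy() keeps tenant metadata (account_id, owner_space, ...)
        # intact on the batch-local entry without sharing mutable state.
        kept.append((result.query_vector, copy.copy(result.created)))
    return kept
```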

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have tested this on the following platforms:
    • macOS

New tests (6):

  • test_find_similar_includes_batch_memories - batch memory with high cosine similarity appears in results
  • test_find_similar_excludes_dissimilar_batch_memories - opposite-direction vector is excluded
  • test_find_similar_deduplicates_batch_and_db_by_uri - DB+batch URI overlap keeps DB version only
  • test_deduplicate_returns_query_vector_in_result - DedupResult carries embedding for batch tracking
  • test_batch_dedup_passes_batch_memories_to_deduplicate - compressor correctly threads batch_memories
  • test_batch_dedup_real_cosine_path - end-to-end with _llm_decision spy confirming batch match triggers LLM

Checklist

  • My code follows the project's coding style
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings

Additional Notes

  • All new parameters use keyword-only with None default, fully backward compatible
  • Profile/Tool/Skill categories intentionally skip batch tracking (they bypass dedup entirely)
  • After MERGE, stale batch entries are removed but not re-added (would require extra embedding API call); subsequent candidates rely on async vector DB for merged memories

Collaborator

@qin-ctx left a comment

Clean fix for #687. Batch-internal cosine dedup effectively addresses the async vectorization gap. Well tested with 6 new test cases. Two minor non-blocking suggestions below.

# We intentionally do NOT re-add the merged memory because
# we lack its updated embedding vector (would need an extra
# API call). A 3rd+ candidate in the same batch will rely
# on the async vector DB for this memory.

[Design] (non-blocking) After MERGE, the stale batch entry is correctly removed, but the merged memory is not re-added because its updated embedding is unknown. This means if a 3rd+ candidate in the same batch is similar to the merged memory, it won't find it in either batch_memories or the async vector DB.

This trade-off is reasonable and well-documented here. If the 3+ duplicate scenario turns out to be common in practice, one option would be to re-embed the merged content synchronously (one extra embedder call per merge) and re-add to batch_memories:

# After successful merge:
new_text = f"{target_memory.abstract} {payload.content}"
new_embed = self.deduplicator.embedder.embed(new_text)
batch_memories.append((new_embed.dense_vector, target_memory))

  candidate: CandidateMemory
- similar_memories: List[Context]  # Similar existing memories
+ similar_memories: list[Context]  # Similar existing memories
  actions: Optional[List[ExistingMemoryAction]] = None

[Suggestion] (non-blocking) The modified fields now use PEP 604 syntax (list[X], X | None) while the rest of the file still uses typing.List, typing.Optional, etc. (e.g. _cosine_similarity(vec_a: List[float], ...)). Consider either keeping the old style for consistency within this PR, or updating the whole file in a separate commit.

…ame commit (volcengine#687)

When session.commit() produces multiple candidate memories about the same
entity, dedup misses duplicates because earlier candidates' embeddings are
still in the async vectorization queue. Add in-memory batch tracking so
candidates within the same batch are compared via cosine similarity.

- Add batch_memories parameter to deduplicate() / _find_similar_memories()
- Return query_vector in DedupResult for batch tracking
- Maintain batch_memories in extract_long_term_memories loop: append after
  CREATE, remove after DELETE or MERGE
- Use copy.copy() for batch Context to preserve tenant metadata
@Astro-Han force-pushed the worktree-fix-batch-dedup-687 branch from ecb3e75 to c134d9e on March 17, 2026 at 12:43
@Astro-Han
Contributor Author

@qin-ctx Thanks for the thorough review! Both suggestions adopted:

  1. Re-embed after MERGE: After successful merge, we now re-embed the merged content and re-add to batch_memories, so 3rd+ candidates in the same batch can still find the merged memory. Used action.memory.abstract (updated by _merge_into_existing) + candidate.content as the embedding text.

  2. Type annotation style: Reverted DedupResult's existing fields to their original List[Context] / Optional[List[ExistingMemoryAction]] style. New additions keep PEP 604 to match the file's forward direction (e.g. _llm_decision return type).
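The adopted re-embed step might look roughly like this. `abstract`, `content`, `embed()`, and `dense_vector` follow the reviewer's snippet and the reply above; the function wrapper and parameter names are hypothetical.

```python
from typing import List, Tuple

def readd_after_merge(
    batch_memories: List[Tuple[List[float], object]],
    merged_memory,   # action.memory, whose .abstract was updated by the merge
    candidate,       # the candidate just merged in, exposing .content
    embedder,        # object exposing embed(text) -> result with .dense_vector
) -> List[Tuple[List[float], object]]:
    """Re-embed the merged content and put it back into batch_memories so
    3rd+ candidates in the same commit can still match the merged memory."""
    merged_text = f"{merged_memory.abstract} {candidate.content}"
    embedding = embedder.embed(merged_text)  # one extra embedder call per merge
    batch_memories.append((embedding.dense_vector, merged_memory))
    return batch_memories
```

This trades one synchronous embedder call per merge for correct dedup of 3rd+ duplicates in the same batch.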

@qin-ctx merged commit 1e475c2 into volcengine:main on Mar 17, 2026
5 of 6 checks passed
github-project-automation bot moved this from Backlog to Done in OpenViking project on Mar 17, 2026
ZaynJarvis pushed a commit to ZaynJarvis/OpenViking that referenced this pull request Mar 17, 2026
…ame commit (volcengine#687) (volcengine#701)

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[Bug]: Dedup misses duplicates within same commit batch due to async vectorization gap

2 participants