fix(session): add batch-internal dedup to prevent duplicates within same commit (#687)#701
Conversation
f8352eb to
ecb3e75
Compare
openviking/session/compressor.py
Outdated
| # We intentionally do NOT re-add the merged memory because | ||
| # we lack its updated embedding vector (would need an extra | ||
| # API call). A 3rd+ candidate in the same batch will rely | ||
| # on the async vector DB for this memory. |
There was a problem hiding this comment.
[Design] (non-blocking) After MERGE, the stale batch entry is correctly removed, but the merged memory is not re-added because its updated embedding is unknown. This means if a 3rd+ candidate in the same batch is similar to the merged memory, it won't find it in either batch_memories or the async vector DB.
This trade-off is reasonable and well-documented here. If the 3+ duplicate scenario turns out to be common in practice, one option would be to re-embed the merged content synchronously (one extra embedder call per merge) and re-add to batch_memories:
# After successful merge:
new_text = f"{target_memory.abstract} {payload.content}"
new_embed = self.deduplicator.embedder.embed(new_text)
batch_memories.append((new_embed.dense_vector, target_memory))| candidate: CandidateMemory | ||
| similar_memories: List[Context] # Similar existing memories | ||
| actions: Optional[List[ExistingMemoryAction]] = None | ||
| similar_memories: list[Context] # Similar existing memories |
There was a problem hiding this comment.
[Suggestion] (non-blocking) The modified fields now use PEP 604 syntax (list[X], X | None) while the rest of the file still uses typing.List, typing.Optional, etc. (e.g. _cosine_similarity(vec_a: List[float], ...)). Consider either keeping the old style for consistency within this PR, or updating the whole file in a separate commit.
…ame commit (volcengine#687) When session.commit() produces multiple candidate memories about the same entity, dedup misses duplicates because earlier candidates' embeddings are still in the async vectorization queue. Add in-memory batch tracking so candidates within the same batch are compared via cosine similarity. - Add batch_memories parameter to deduplicate() / _find_similar_memories() - Return query_vector in DedupResult for batch tracking - Maintain batch_memories in extract_long_term_memories loop: append after CREATE, remove after DELETE or MERGE - Use copy.copy() for batch Context to preserve tenant metadata
ecb3e75 to
c134d9e
Compare
|
@qin-ctx Thanks for the thorough review! Both suggestions adopted:
|
…ame commit (volcengine#687) (volcengine#701) When session.commit() produces multiple candidate memories about the same entity, dedup misses duplicates because earlier candidates' embeddings are still in the async vectorization queue. Add in-memory batch tracking so candidates within the same batch are compared via cosine similarity. - Add batch_memories parameter to deduplicate() / _find_similar_memories() - Return query_vector in DedupResult for batch tracking - Maintain batch_memories in extract_long_term_memories loop: append after CREATE, remove after DELETE or MERGE - Use copy.copy() for batch Context to preserve tenant metadata
Description
When
session.commit()produces multiple candidate memories about the same entity, dedup fails to detect duplicates because earlier candidates' embeddings are still in the async vectorization queue and not yet indexed. This PR adds in-memory batch tracking so candidates within the same batch are compared against each other via cosine similarity before consulting the vector DB.Related Issue
Fixes #687
Type of Change
Changes Made
batch_memoriesparameter toMemoryDeduplicator.deduplicate()and_find_similar_memories()for batch-internal cosine comparisonquery_vectorinDedupResultso compressor can track embeddingsbatch_memorieslist inSessionCompressor.extract_long_term_memories()loop: append after CREATE, remove after DELETE or MERGEcopy.copy()for batch Context to preserve tenant metadata (account_id, owner_space, etc.)Testing
New tests (6):
test_find_similar_includes_batch_memories- batch memory with high cosine similarity appears in resultstest_find_similar_excludes_dissimilar_batch_memories- opposite-direction vector is excludedtest_find_similar_deduplicates_batch_and_db_by_uri- DB+batch URI overlap keeps DB version onlytest_deduplicate_returns_query_vector_in_result- DedupResult carries embedding for batch trackingtest_batch_dedup_passes_batch_memories_to_deduplicate- compressor correctly threads batch_memoriestest_batch_dedup_real_cosine_path- end-to-end with _llm_decision spy confirming batch match triggers LLMChecklist
Additional Notes
Nonedefault, fully backward compatible