feat(session): add chunked vectorization for long memories #734
Conversation
Long memories are now split into multiple vector records for better retrieval precision. Each chunk gets its own URI (`memory_uri#chunk_NNNN`) and is independently searchable, while the base record retains the abstract as its vector. Chunking prefers paragraph boundaries and uses configurable overlap for context continuity. Short memories (under `memory_chunk_chars`) are unchanged: a single vector as before.

New `SemanticConfig` fields:
- `memory_chunk_chars` (default 2000): chunk threshold
- `memory_chunk_overlap` (default 200): overlap between chunks

Fixes volcengine#530.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
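A rough sketch of the paragraph-aware chunking described above (illustrative only; `chunk_text`, its defaults, and its exact boundary rules are assumptions, not the PR's actual `_chunk_text` implementation):

```python
def chunk_text(text: str, chunk_chars: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into chunks, preferring paragraph ("\\n\\n") boundaries."""
    if len(text) <= chunk_chars:
        return [text]  # short memories keep a single vector, as before
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_chars, len(text))
        if end < len(text):
            # Prefer the last paragraph break in the window, but only if it
            # sits past the midpoint, to avoid producing tiny chunks.
            boundary = text.rfind("\n\n", start, end)
            if boundary > start + chunk_chars // 2:
                end = boundary + 2  # include the double newline
        chunks.append(text[start:end].strip())
        if end >= len(text):
            break
        start = end - overlap  # overlap carries context across chunks
    return chunks
```

With the defaults, a 5000-character memory with no paragraph breaks yields three chunks, and a two-paragraph text splits cleanly on the `\n\n` boundary.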
Thank you so much for your contribution! The issue with the averaged vector you mentioned is indeed lossy; this problem was introduced by the commit at #642, and we will roll it back.
Understood. The 1:1 URI-to-embedding design makes sense as the long-term direction, and splitting memories into smaller semantic units during preprocessing is cleaner than chunking at embedding time. Happy to close this PR, since the memory refactoring will address the root issue properly. The lossy-averaging rollback from #642 should resolve the immediate problem. If there's anything I can help with on the memory module refactoring when it starts, let me know.
We have an ongoing discussion in #712, and your participation and suggestions are warmly welcome. The current solution does not yet include a mechanism for splitting memory files.
We have discussed this issue internally. In fact, we already enforce size limits per URI for PDF, Markdown, and HTML files. However, this constraint has not yet been implemented for code and memory. Until this constraint is in place, the 1:N vectorization approach is better than nothing — so we have decided to accept this change. Thank you again for your contribution! |
qin-ptr
left a comment
Review Summary
This PR introduces chunked vectorization for long memories to improve retrieval precision. The approach is sound, but there's a critical implementation bug: the base record still embeds the full content instead of just the abstract, defeating the purpose of chunking.
Main Issue: The code comment claims "uses abstract as vector text" for the base record, but `EmbeddingMsgConverter.from_context(memory)` actually uses `memory.vectorize.text`, which contains the full content (set in `memory_extractor.py:456`). This needs to be fixed before merge.
See inline comments for details.
🤖 I am a bot owned by @qin-ctx.
```python
        change_type: One of "added" or "modified"
    """
    from openviking.storage.queuefs.embedding_msg_converter import EmbeddingMsgConverter
    from openviking_cli.utils.config import get_openviking_config
```
[Bug] (blocking)
The base record still uses full content instead of abstract, contradicting the PR description and comment.
Root cause: When a memory is created, `memory_extractor.py:456` sets `memory.set_vectorize(Vectorize(text=payload.content))`, so `memory.vectorize.text` contains the full content. When you call `EmbeddingMsgConverter.from_context(memory)` without resetting it, the base record will still try to embed the full long content.
Impact:
- Base record may exceed embedding token limits and fail
- Even if successful, produces a lossy vector (same problem this PR aims to solve)
- Code behavior contradicts the comment "uses abstract as vector text"
Fix: Before enqueueing the base record, reset the vectorization text to the abstract:

```python
if vectorize_text and len(vectorize_text) > semantic.memory_chunk_chars:
    # ... existing chunk logic ...
    # Reset base record to use abstract
    memory.set_vectorize(Vectorize(text=memory.abstract))

# Now base record uses abstract as intended
embedding_msg = EmbeddingMsgConverter.from_context(memory)
await self.vikingdb.enqueue_embedding_msg(embedding_msg)
```

```diff
@@ -80,3 +83,54 @@ def test_batch_splitting():
     assert len(batches[0]) == 50
```
[Suggestion] (non-blocking)
Current tests only cover `_chunk_text()` unit behavior. Consider adding integration tests to verify:
- Long memories are split into the correct number of chunk records
- The base record's vectorization text is the abstract (not the full content)
- Chunk URIs follow the `memory_uri#chunk_NNNN` format
- `parent_uri` correctly points to the original memory URI

This would catch issues like the base record vectorization bug.
```python
if boundary > start + chunk_size // 2:
    end = boundary + 2  # Include the double newline
# ...
chunks.append(text[start:end].strip())
```
[Suggestion] (non-blocking)
The overlap calculation may be slightly inaccurate because:
- Line 212: `chunks.append(text[start:end].strip())` strips whitespace
- Line 213: `start = end - overlap` uses the original `end` position

The actual overlap between chunks may therefore differ from the configured value. Consider either:
- documenting that the overlap is approximate, or
- calculating boundaries first, then stripping uniformly.
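A minimal sketch of the second option: compute span boundaries first, so the next chunk's start is derived from the unstripped `end` and the raw overlap stays exactly the configured value, then strip each slice uniformly at the end (`chunk_exact_overlap` is a hypothetical name, not proposed API):

```python
def chunk_exact_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    # Phase 1: fix all (start, end) boundaries on the raw text.
    spans = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        spans.append((start, end))
        if end >= len(text):
            break
        start = end - overlap  # exact, since end is never shifted by strip()
    # Phase 2: strip uniformly after boundaries are fixed, so the recorded
    # spans (and thus the configured overlap) stay accurate.
    return [text[s:e].strip() for s, e in spans]
```

On whitespace-free input the trailing characters of one chunk are exactly the leading characters of the next, which is easy to assert in a test.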
Summary
- Adds `SemanticConfig.memory_chunk_chars` and `memory_chunk_overlap`
- Fixes #530
Problem
When a memory exceeds the embedder's max token limit, the current code chunks it internally but merges all chunks into a single averaged vector (via `_chunk_and_embed` in `base.py`). This averaging is lossy: a long memory covering "Python debugging" AND "Docker networking" becomes one blended vector that matches neither topic precisely. Retrieval quality degrades as memories grow.
Solution
Chunk long memories before enqueuing to the EmbeddingQueue. Each chunk becomes a separate vector record:
- Chunk URIs: `memory_uri#chunk_0001`, `memory_uri#chunk_0002`, etc.
- `parent_uri` points to the original memory for deduplication
- Chunking prefers paragraph boundaries (`\n\n`) with configurable overlap
Changes Made
- `openviking_cli/utils/config/parser_config.py`: Added `memory_chunk_chars` (2000) and `memory_chunk_overlap` (200) to `SemanticConfig`
- `openviking/session/compressor.py`: Modified `_index_memory()` to chunk long memories before enqueuing. Added `_chunk_text()` static method with paragraph-aware splitting.
- `tests/misc/test_semantic_config.py`: 5 new tests for chunking logic (short text, long text, overlap, paragraph boundaries, custom config)
Configuration
```json
{
  "semantic": {
    "memory_chunk_chars": 2000,
    "memory_chunk_overlap": 200
  }
}
```
Type of Change
Testing
Checklist