fix(audit): chunk content before embedding upsert (#1539)#1540
Conversation
There was a problem hiding this comment.
Code Review
This pull request implements per-entry chunking for diary entries and general memories to ensure document sizes remain within embedding model limits. It introduces a new drawer ID scheme, updates the "diary_ingest" and "mcp_server" modules to respect a configurable chunk size, and includes logic to purge legacy drawers during full rebuilds. Comprehensive tests were added to verify the new chunking and indexing behavior. Feedback was provided regarding the efficiency of database operations in "diary_ingest.py", specifically suggesting that individual "upsert" calls be batched to improve performance and ensure atomicity.
a68bd1b to
4364d6f
Compare
b35f2f2 to
1155064
Compare
84fc486 to
39789d9
Compare
39789d9 to
725d73c
Compare
…e#1539) Four sites passed content to `collection.upsert(documents=[...])` or `collection.add(documents=[...])` without per-drawer size cap, hitting the same `RuntimeError: Invalid buffer size` crash class from the embedding model's attention buffer: - `general_extractor.extract_memories`: post-classification slicer that propagates `memory_type` to every slice. New `chunk_size` parameter defaulting to `DEFAULT_CHUNK_SIZE` and resolved from `MempalaceConfig.chunk_size` by the caller. - `diary_ingest.ingest_diaries`: per-entry drawers via existing `_split_entries`, with character-chunk fallback for any single entry larger than `chunk_size`. New `_diary_drawer_id_entry` helper carries (entry_idx, entry_chunk_idx). `chunk_index` in metadata is a global counter across the file so `searcher._expand_with_neighbors` stitches sibling chunks regardless of entry boundary. Upsert is batched atomic per file (one call carrying every entry/chunk) so a mid-pass embedding failure cannot half-write the day. Auto-purge on full rebuild deletes prior-pass drawers via `where={"source_file": ...}`, which also migrates pre-MemPalace#1539 legacy `drawer_diary_` IDs as a side effect of normal use. - `mcp_server.tool_diary_write`: split oversized entries into bounded per-chunk drawers via a single batched `col.add` (atomic, no half-write on embedding failure). `col.add` is intentional: `entry_id` is timestamp-based with microsecond precision, so a duplicate is a same-microsecond clash that should surface rather than silently overwrite. - `mcp_server.tool_add_drawer`: same crash class on the more common add-drawer surface (100 KB sanitize cap, 125x `CHUNK_SIZE`). Chunked path mirrors Site 3 with batched atomic `col.upsert`, per-chunk `parent_drawer_id` + `chunk_index` metadata, and a dual-id idempotency probe (last chunk for atomicity, legacy `drawer_id` for pre-MemPalace#1539 single-row backwards-compat). Return shape additive: `chunks` always present, `chunk_ids` on the chunked path. `tool_get_drawer` / `tool_delete_drawer` against the logical handle report "not found" on the chunked path; callers iterate `chunk_ids` or query `parent_drawer_id`. Chunk id width is `:06d` so even a single-digit `chunk_size` config cannot lex-sort chunks out of order.
725d73c to
6658a4d
Compare
igorls
left a comment
There was a problem hiding this comment.
Approval with one follow-up
Thanks for the thorough audit @mvalentsev — the PR cleanly addresses all four #1539 sites, the verbatim-store mandate is preserved, the batched single-call upsert gives the right atomicity guarantee, and the v2_ prefix + per-source delete(where=...) migration runs as a side effect of normal use without a separate command.
Idempotency probe in tool_add_drawer (probing both legacy drawer_id and the last chunk id) is the right call; boundary off-by-ones check out at len(content) ∈ {chunk_size, chunk_size+1, 2·chunk_size}. The 6-digit zero-padded chunk suffix correctly prevents lex-sort breakage at small chunk_size configs.
One concern — opening as a follow-up rather than blocking merge
tool_add_drawer(content=<oversized>, source_file=None) stores every chunk with source_file="". Across multiple such calls, all chunks share that empty key. A search hit on chunk 0 of drawer A can pull "neighbors" that are actually chunks from drawer B — searcher._expand_with_neighbors queries by source_file + chunk_index and doesn't filter by parent_drawer_id.
Pre-PR this couldn't happen because oversized MCP content was filed as one drawer (the bug being fixed). Post-PR, the chunks are siblings under a shared empty source_file. Scope is narrow (MCP paste-in over chunk_size without an explicit source_file) and only affects neighbor expansion — direct get_drawer still returns correct verbatim content.
Fix shape is one line in _expand_with_neighbors: skip neighbor expansion when source_file is empty, or add parent_drawer_id to the $and filter when present. Opening as a separate issue so this can land now.
Smaller notes (no action required for merge)
- Diary incremental re-embeds all entries every pass — the batched upsert loops
entries, notnew_entries. ChromaDB doesn't dedup by content. Pre-PR also re-embedded the whole file as one truncated doc, so net wall-time may not regress, but embedding-API call count does. Acceptable trade-off; worth a follow-up. - API contract change —
tool_get_drawer(drawer_id)/tool_delete_drawer(drawer_id)return "not found" on the chunked path. Documented in the new docstring; worth a CHANGELOG entry. MempalaceConfig()constructed fresh insideingest_diariesvs._configsingleton used inmcp_server.py. Inconsistent; harmless if construction is cheap.entry_text = f"{header}\n{body}" if body else headersilently drops the trailing newline for header-only entries vs. the pre-PRf"{header}\n{body}". Trivial verbatim diff.- Comment in
tool_diary_writesays "if the embedding model fails mid-loop" but the code is a single batchedcol.add()(no loop). Cosmetic.
Local test run (the subset touched by this PR): 261 passed. The single failure was a httpx.ConnectTimeout during first-run embedding-model SSL handshake on a clean machine — unrelated to this PR, and the upstream CI is fully green across all 5 OS/Python combos.
Merging.
Bumps version 3.3.5 → 3.3.6 across pyproject.toml, version.py, plugin manifests (.claude-plugin/plugin.json, .claude-plugin/marketplace.json, .codex-plugin/plugin.json), README badge, and uv.lock. Flips CHANGELOG.md from ``[Unreleased]`` to ``[3.3.6] — 2026-05-24`` and backfills the major user-facing entries that landed without changelog entries during the cycle: Features: - MemPalace#1555 office-document mining via --mode extract + virtual line numbers - MemPalace#1584 surgical closet pointers with date+line locators (Tier 6a) - MemPalace#1558 + MemPalace#1560 within-wing hallways (entity co-occurrence graph) - MemPalace#1565 cross-wing tunnels auto-promoted from hallways - MemPalace#1578 Hebbian potentiation + Ebbinghaus decay on hallways/tunnels - MemPalace#1236 API-tool transcripts auto-route to wing_api - MemPalace#711 hooks.auto_save toggle for silent-mode sessions - MemPalace#1605 COCA content-word filter for entity detection - MemPalace#1557 case-insensitive entity matching at mine time - MemPalace#1483 multilingual embeddings (embeddinggemma-300m) by default Bug Fixes (selected, user-visible): - MemPalace#1540 silent data loss in three unchunked upsert sites - MemPalace#1538 paragraph chunker oversized chunks - MemPalace#1554 per-file chunk cap too low for transcripts - MemPalace#1562 Windows hook subprocess/ChromaDB deadlock - MemPalace#1529 create_tunnel corrupted hyphenated wing names - MemPalace#1424 save-hook truncated hyphenated project folders - MemPalace#1383 KG cache duplicated graphs for symlinked/cased paths - MemPalace#1466 silent symlink skip now logged - MemPalace#1441 macOS stock-bash 3.2 hook compatibility - MemPalace#1500 / MemPalace#1513 structured JSON-RPC errors on bad MCP input - MemPalace#1523 VACUUM + FTS5 rebuild after repair - MemPalace#1548 FTS5 validation at end of mine - plus MemPalace#1216, MemPalace#1408, MemPalace#1438, MemPalace#1439, MemPalace#1445, MemPalace#1452, MemPalace#1459, MemPalace#1461, MemPalace#1466, MemPalace#1470, MemPalace#1477, MemPalace#1485, MemPalace#1500, MemPalace#1513, MemPalace#1528, MemPalace#1532, MemPalace#1543, MemPalace#1546, MemPalace#1585 Performance: - MemPalace#1474 convo miner pre-fetches mined-set - MemPalace#1487 rebuild_index progress callback - MemPalace#1530 MCP cold-start diagnostics + opt-in warmup Lint passes (ruff 0.15.14); mempalace-mcp entry point alignment verified per RELEASING.md.
What does this PR do?
Closes #1539. Fixes four sites that fed unbounded content to
collection.upsert(documents=[...])orcollection.add(documents=[...]). On the currentmempalace/develop95caf80the upstream embedder accepts the oversized input and silently truncates it at its 256-token window, so the stored drawer's search vector represents only the first ~512 characters and the rest of the content becomes undiscoverable by semantic search. The same code path still hitsRuntimeError: Invalid buffer sizeon multi-MB single-document inputs (the crash class #1534 fixed inconvo_miner._chunk_by_paragraph); the four sites here were catalogued in #1539 during that audit.Site 1:
general_extractor.extract_memories(mempalace/general_extractor.py:426-444)A 200,000-char single segment with classification markers used to become one oversized memory whose search vector represented only the first ~512 characters of the segment (and crashed embedding upsert outright on the pre-#1534 attention-buffer behavior). The fix adds a
chunk_size: int = DEFAULT_CHUNK_SIZEparameter and a post-classification slicer that propagatesmemory_typeto every slice. Reclassifying sub-slices would risk dropping data (markers may live in only one slice), so the label is approximated for sub-slices rather than re-derived.convo_miner.py:552(the only in-tree caller) now passeschunk_size=cfg_chunk_sizesoMempalaceConfig.chunk_sizereaches this path.Site 2:
diary_ingest.ingest_diaries(mempalace/diary_ingest.py:230-273)A multi-MB daily diary file used to upsert as ONE document. The fix introduces per-entry drawers via the existing
_split_entries(text)call, with character-chunk fallback for any single entry larger thanchunk_size. New_diary_drawer_id_entry(wing, date_str, entry_idx, entry_chunk_idx)helper carries the per-entry/per-chunk position.chunk_indexin drawer metadata is a global counter across the file sosearcher._expand_with_neighbors(which queries bysource_file + chunk_index) stitches sibling chunks back together regardless of entry boundary.Upsert is batched atomic per file: one
drawers_col.upsert(ids=..., documents=..., metadatas=...)call carries every entry/chunk for the file, so a mid-pass embedding failure either commits the day in full or leaves no partial drawers behind. The state file is updated only after the upsert returns, so a failed batch is retried on the next run rather than silently skipped.Auto-purge on full rebuild deletes prior-pass drawers via
where={"source_file": ...}, which also migrates pre-#1539 legacydrawer_diary_IDs as a side effect of normal use. No separate migration command needed; existing palaces auto-clean on the firstingest_diariescall that triggers a full rebuild (force=Trueor detected content change).Site 3:
mcp_server.tool_diary_write(mempalace/mcp_server.py:1599-1710)The MCP tool accepted entries up to 100 KB (the
sanitize_contentcap) and filed them as one document. The fix splits oversized entries into bounded per-chunk drawers via a single batchedcol.add(ids=..., documents=..., metadatas=...)so the embedding pass is atomic (no half-written palace if the embedding model fails mid-loop). Each chunk carriesparent_entry_idlinkage andchunk_indexmetadata. Return shape is additive:chunksfield is always present,chunk_idsis added on the chunked path. Single-chunk entries returnchunks=1and anentry_idthat matches the stored drawer id (preserves the pre-#1539 contract).Site 4:
mcp_server.tool_add_drawer(mempalace/mcp_server.py:1103-1215)The MCP
mempalace_add_drawertool acceptedcontentup to 100 KB throughsanitize_contentand filed it as one document, reproducing the Site 3 design defect on the more commonadd_drawersurface. The fix mirrors Site 3: content abovechunk_sizeis split into bounded per-chunk drawers via a single batchedcol.upsert(ids=..., documents=..., metadatas=...). Each chunk carriesparent_drawer_idlinkage andchunk_indexmetadata. Chunk ids use a 6-digit{drawer_id}_chunk_NNNNNNsuffix so even a low-chunk_sizeconfig (down to single-digit) cannot lex-sort chunks out of order.Idempotency on the chunked path probes both the last chunk id (its presence implies the whole batched upsert committed atomically) and the legacy logical
drawer_id(so a pre-#1539 single-row write of oversized content under the same id is detected and skipped rather than getting co-resident chunk siblings on the next call).Return shape is additive:
chunksis always present;chunk_idsis added on the chunked path. Single-chunk content returnschunks=1and adrawer_idthat matches the stored drawer id (preserves the pre-#1539 contract).tool_get_drawer(drawer_id)andtool_delete_drawer(drawer_id)against the logical handle return "not found" on the chunked path; callers must iteratechunk_idsor query byparent_drawer_idmetadata. The docstring ontool_add_drawerdocuments this contract.Adjacent known issue, NOT closed by this PR
#390 (open) reports that the default
CHUNK_SIZE = 800exceeds the default embedding model's 256-token / ~512-char window, causing silent truncation in embeddings. That is a different failure mode than the crash this PR fixes: crash on multi-MB input vs. silent quality degradation on chunks above ~512 chars. The PR honours whateverchunk_sizeis configured (MempalaceConfig.chunk_size, via #1024), so users hitting the silent-truncation case today can dropchunk_sizein their config. Default change and per-model binding (#442) are out of scope here.Out-of-scope sites surfaced during the audit (filed as follow-ups, not fixed here)
mempalace/miner.py:add_drawerlegacy shim: no in-tree caller passes unbounded content; recommend a size-cap if a future external caller wires raw documents through.mempalace/sources/context.py:upsert_drawerinterface (PalaceContext): RFC 002 adapter surface, no in-tree adapter currently constructs records of unbounded size; recommend a separate issue to enforcelen(record.content) <= chunk_sizeat the boundary before the adapter surface stabilises.mempalace/sweeper.py:parse_claude_jsonlper-messagecontent: a transcript message carrying a multi-MBtool_resultblock reachescollection.upsert(documents=[...])at line 251 unbounded. Separate code path (Stop hook only), file a dedicated follow-up issue.Thanks to @davidglidden for the upstream #1534 report that surfaced the audit.
How to test
Manual reproducers (all four exit 0 and emit
CHUNK_SIZE-bounded drawers on this branch; onmempalace/develop95caf80the same calls also exit 0 but file the input as a single oversized drawer whose search vector represents only the first ~512 characters. Observechunksfield absent or equal to 1 on develop, versus multiple bounded chunks on this branch):Unit suite (21 new tests across the four sites, including atomicity coverage for Site 2 and contract coverage for Site 4):
Coverage added:
extract_memoriesoversized-segment slicing withmemory_typepropagation + customchunk_sizeplumbingdiary_ingestper-entry drawer count, oversized-entry character chunking, file-shrink orphan purge, header-only entry, globalchunk_indexcontiguity, incremental append idempotence, batched-upsert atomicity on embedding failuretool_diary_writenormal vs. chunked return shape,chunk_indexmetadata,parent_entry_idlinkage, joined verbatim equalitytool_add_drawernormal vs. chunked return shape,chunk_index+parent_drawer_idmetadata, idempotency on re-call with identical oversized content (last-chunk + legacydrawer_idprobe), boundary atlen == chunk_size, documented contract thattool_get_drawer/tool_delete_draweragainst the logicaldrawer_idreport "not found" on the chunked pathChecklist
uv run pytest tests/ -v --ignore=tests/benchmarks)uvx --from 'ruff==0.15.9' ruff check . && uvx --from 'ruff==0.15.9' ruff format --check .per CI ruff pin)