fix(audit): chunk content before embedding upsert (#1539) by mvalentsev · Pull Request #1540 · MemPalace/mempalace

mvalentsev · 2026-05-17T23:49:00Z

What does this PR do?

Closes #1539. Fixes four sites that fed unbounded content to collection.upsert(documents=[...]) or collection.add(documents=[...]). On the current mempalace/develop 95caf80 the upstream embedder accepts the oversized input and silently truncates it at its 256-token window, so the stored drawer's search vector represents only the first ~512 characters and the rest of the content becomes undiscoverable by semantic search. The same code path still hits RuntimeError: Invalid buffer size on multi-MB single-document inputs (the crash class #1534 fixed in convo_miner._chunk_by_paragraph); the four sites here were catalogued in #1539 during that audit.

Site 1: general_extractor.extract_memories (mempalace/general_extractor.py:426-444)

A 200,000-char single segment with classification markers used to become one oversized memory whose search vector represented only the first ~512 characters of the segment (and crashed embedding upsert outright on the pre-#1534 attention-buffer behavior). The fix adds a chunk_size: int = DEFAULT_CHUNK_SIZE parameter and a post-classification slicer that propagates memory_type to every slice. Reclassifying sub-slices would risk dropping data (markers may live in only one slice), so the label is approximated for sub-slices rather than re-derived. convo_miner.py:552 (the only in-tree caller) now passes chunk_size=cfg_chunk_size so MempalaceConfig.chunk_size reaches this path.

Site 2: diary_ingest.ingest_diaries (mempalace/diary_ingest.py:230-273)

A multi-MB daily diary file used to upsert as ONE document. The fix introduces per-entry drawers via the existing _split_entries(text) call, with character-chunk fallback for any single entry larger than chunk_size. New _diary_drawer_id_entry(wing, date_str, entry_idx, entry_chunk_idx) helper carries the per-entry/per-chunk position.

chunk_index in drawer metadata is a global counter across the file so searcher._expand_with_neighbors (which queries by source_file + chunk_index) stitches sibling chunks back together regardless of entry boundary.

Upsert is batched atomic per file: one drawers_col.upsert(ids=..., documents=..., metadatas=...) call carries every entry/chunk for the file, so a mid-pass embedding failure either commits the day in full or leaves no partial drawers behind. The state file is updated only after the upsert returns, so a failed batch is retried on the next run rather than silently skipped.

Auto-purge on full rebuild deletes prior-pass drawers via where={"source_file": ...}, which also migrates pre-#1539 legacy drawer_diary_ IDs as a side effect of normal use. No separate migration command needed; existing palaces auto-clean on the first ingest_diaries call that triggers a full rebuild (force=True or detected content change).

Site 3: mcp_server.tool_diary_write (mempalace/mcp_server.py:1599-1710)

The MCP tool accepted entries up to 100 KB (the sanitize_content cap) and filed them as one document. The fix splits oversized entries into bounded per-chunk drawers via a single batched col.add(ids=..., documents=..., metadatas=...) so the embedding pass is atomic (no half-written palace if the embedding model fails mid-loop). Each chunk carries parent_entry_id linkage and chunk_index metadata. Return shape is additive: chunks field is always present, chunk_ids is added on the chunked path. Single-chunk entries return chunks=1 and an entry_id that matches the stored drawer id (preserves the pre-#1539 contract).

Site 4: mcp_server.tool_add_drawer (mempalace/mcp_server.py:1103-1215)

The MCP mempalace_add_drawer tool accepted content up to 100 KB through sanitize_content and filed it as one document, reproducing the Site 3 design defect on the more common add_drawer surface. The fix mirrors Site 3: content above chunk_size is split into bounded per-chunk drawers via a single batched col.upsert(ids=..., documents=..., metadatas=...). Each chunk carries parent_drawer_id linkage and chunk_index metadata. Chunk ids use a 6-digit {drawer_id}_chunk_NNNNNN suffix so even a low-chunk_size config (down to single-digit) cannot lex-sort chunks out of order.

Idempotency on the chunked path probes both the last chunk id (its presence implies the whole batched upsert committed atomically) and the legacy logical drawer_id (so a pre-#1539 single-row write of oversized content under the same id is detected and skipped rather than getting co-resident chunk siblings on the next call).

Return shape is additive: chunks is always present; chunk_ids is added on the chunked path. Single-chunk content returns chunks=1 and a drawer_id that matches the stored drawer id (preserves the pre-#1539 contract). tool_get_drawer(drawer_id) and tool_delete_drawer(drawer_id) against the logical handle return "not found" on the chunked path; callers must iterate chunk_ids or query by parent_drawer_id metadata. The docstring on tool_add_drawer documents this contract.

Adjacent known issue, NOT closed by this PR

#390 (open) reports that the default CHUNK_SIZE = 800 exceeds the default embedding model's 256-token / ~512-char window, causing silent truncation in embeddings. That is a different failure mode than the crash this PR fixes: crash on multi-MB input vs. silent quality degradation on chunks above ~512 chars. The PR honours whatever chunk_size is configured (MempalaceConfig.chunk_size, via #1024), so users hitting the silent-truncation case today can drop chunk_size in their config. Default change and per-model binding (#442) are out of scope here.

Out-of-scope sites surfaced during the audit (filed as follow-ups, not fixed here)

mempalace/miner.py:add_drawer legacy shim: no in-tree caller passes unbounded content; recommend a size-cap if a future external caller wires raw documents through.
mempalace/sources/context.py:upsert_drawer interface (PalaceContext): RFC 002 adapter surface, no in-tree adapter currently constructs records of unbounded size; recommend a separate issue to enforce len(record.content) <= chunk_size at the boundary before the adapter surface stabilises.
mempalace/sweeper.py:parse_claude_jsonl per-message content: a transcript message carrying a multi-MB tool_result block reaches collection.upsert(documents=[...]) at line 251 unbounded. Separate code path (Stop hook only), file a dedicated follow-up issue.

Thanks to @davidglidden for the upstream #1534 report that surfaced the audit.

How to test

Manual reproducers (all four exit 0 and emit CHUNK_SIZE-bounded drawers on this branch; on mempalace/develop 95caf80 the same calls also exit 0 but file the input as a single oversized drawer whose search vector represents only the first ~512 characters. Observe chunks field absent or equal to 1 on develop, versus multiple bounded chunks on this branch):

# Site 1: extract_memories
TMPDIR=$(mktemp -d)
uv run python -c "open('$TMPDIR/long.txt','w').write('We decided to migrate to PostgreSQL because performance matters. ' * 200)"
uv run python -m mempalace mine "$TMPDIR" --mode convos --extract general --wing repro-1539-extract

# Site 2: diary_ingest with multi-MB single entry
DIARY=$(mktemp -d)/2026-05-18.md
PALACE=$(mktemp -d)
uv run python -c "
content = '# 2026-05-18\n\n## 10:00\nshort entry one.\n\n## 11:00\n' + ('A' * 50000) + '\n\n## 12:00\nshort entry two.\n'
open('$DIARY','w').write(content)
"
uv run python -c "
from mempalace.diary_ingest import ingest_diaries
print(ingest_diaries('$(dirname $DIARY)', '$PALACE', wing='repro-1539-diary', force=True))
"

# Site 3: tool_diary_write via direct invocation
uv run python -c "
from mempalace.mcp_server import tool_diary_write
result = tool_diary_write(agent_name='audit', entry='Z' * 5000, wing='repro-1539-mcp')
assert result['success'] and result['chunks'] > 1
print(result)
"

# Site 4: tool_add_drawer via direct invocation
uv run python -c "
from mempalace.mcp_server import tool_add_drawer
result = tool_add_drawer(wing='repro-1539-add', room='paste', content='X' * 10000)
assert result['success'] and result.get('chunks', 1) > 1
print(result)
"

Unit suite (21 new tests across the four sites, including atomicity coverage for Site 2 and contract coverage for Site 4):

uv run pytest tests/test_general_extractor.py tests/test_closets.py::TestDiaryIngest tests/test_mcp_server.py::TestWriteTools tests/test_mcp_server.py::TestDiaryTools -v

Coverage added:

extract_memories oversized-segment slicing with memory_type propagation + custom chunk_size plumbing
diary_ingest per-entry drawer count, oversized-entry character chunking, file-shrink orphan purge, header-only entry, global chunk_index contiguity, incremental append idempotence, batched-upsert atomicity on embedding failure
tool_diary_write normal vs. chunked return shape, chunk_index metadata, parent_entry_id linkage, joined verbatim equality
tool_add_drawer normal vs. chunked return shape, chunk_index + parent_drawer_id metadata, idempotency on re-call with identical oversized content (last-chunk + legacy drawer_id probe), boundary at len == chunk_size, documented contract that tool_get_drawer / tool_delete_drawer against the logical drawer_id report "not found" on the chunked path

Checklist

Tests pass (uv run pytest tests/ -v --ignore=tests/benchmarks)
No hardcoded paths
Linter passes (uvx --from 'ruff==0.15.9' ruff check . && uvx --from 'ruff==0.15.9' ruff format --check . per CI ruff pin)

gemini-code-assist

Code Review

This pull request implements per-entry chunking for diary entries and general memories to ensure document sizes remain within embedding model limits. It introduces a new drawer ID scheme, updates the "diary_ingest" and "mcp_server" modules to respect a configurable chunk size, and includes logic to purge legacy drawers during full rebuilds. Comprehensive tests were added to verify the new chunking and indexing behavior. Feedback was provided regarding the efficiency of database operations in "diary_ingest.py", specifically suggesting that individual "upsert" calls be batched to improve performance and ensure atomicity.

…e#1539) Four sites passed content to `collection.upsert(documents=[...])` or `collection.add(documents=[...])` without per-drawer size cap, hitting the same `RuntimeError: Invalid buffer size` crash class from the embedding model's attention buffer: - `general_extractor.extract_memories`: post-classification slicer that propagates `memory_type` to every slice. New `chunk_size` parameter defaulting to `DEFAULT_CHUNK_SIZE` and resolved from `MempalaceConfig.chunk_size` by the caller. - `diary_ingest.ingest_diaries`: per-entry drawers via existing `_split_entries`, with character-chunk fallback for any single entry larger than `chunk_size`. New `_diary_drawer_id_entry` helper carries (entry_idx, entry_chunk_idx). `chunk_index` in metadata is a global counter across the file so `searcher._expand_with_neighbors` stitches sibling chunks regardless of entry boundary. Upsert is batched atomic per file (one call carrying every entry/chunk) so a mid-pass embedding failure cannot half-write the day. Auto-purge on full rebuild deletes prior-pass drawers via `where={"source_file": ...}`, which also migrates pre-MemPalace#1539 legacy `drawer_diary_` IDs as a side effect of normal use. - `mcp_server.tool_diary_write`: split oversized entries into bounded per-chunk drawers via a single batched `col.add` (atomic, no half-write on embedding failure). `col.add` is intentional: `entry_id` is timestamp-based with microsecond precision, so a duplicate is a same-microsecond clash that should surface rather than silently overwrite. - `mcp_server.tool_add_drawer`: same crash class on the more common add-drawer surface (100 KB sanitize cap, 125x `CHUNK_SIZE`). Chunked path mirrors Site 3 with batched atomic `col.upsert`, per-chunk `parent_drawer_id` + `chunk_index` metadata, and a dual-id idempotency probe (last chunk for atomicity, legacy `drawer_id` for pre-MemPalace#1539 single-row backwards-compat). Return shape additive: `chunks` always present, `chunk_ids` on the chunked path. `tool_get_drawer` / `tool_delete_drawer` against the logical handle report "not found" on the chunked path; callers iterate `chunk_ids` or query `parent_drawer_id`. Chunk id width is `:06d` so even a single-digit `chunk_size` config cannot lex-sort chunks out of order.

igorls

Approval with one follow-up

Thanks for the thorough audit @mvalentsev — the PR cleanly addresses all four #1539 sites, the verbatim-store mandate is preserved, the batched single-call upsert gives the right atomicity guarantee, and the v2_ prefix + per-source delete(where=...) migration runs as a side effect of normal use without a separate command.

Idempotency probe in tool_add_drawer (probing both legacy drawer_id and the last chunk id) is the right call; boundary off-by-ones check out at len(content) ∈ {chunk_size, chunk_size+1, 2·chunk_size}. The 6-digit zero-padded chunk suffix correctly prevents lex-sort breakage at small chunk_size configs.

One concern — opening as a follow-up rather than blocking merge

tool_add_drawer(content=<oversized>, source_file=None) stores every chunk with source_file="". Across multiple such calls, all chunks share that empty key. A search hit on chunk 0 of drawer A can pull "neighbors" that are actually chunks from drawer B — searcher._expand_with_neighbors queries by source_file + chunk_index and doesn't filter by parent_drawer_id.

Pre-PR this couldn't happen because oversized MCP content was filed as one drawer (the bug being fixed). Post-PR, the chunks are siblings under a shared empty source_file. Scope is narrow (MCP paste-in over chunk_size without an explicit source_file) and only affects neighbor expansion — direct get_drawer still returns correct verbatim content.

Fix shape is one line in _expand_with_neighbors: skip neighbor expansion when source_file is empty, or add parent_drawer_id to the $and filter when present. Opening as a separate issue so this can land now.

Smaller notes (no action required for merge)

Diary incremental re-embeds all entries every pass — the batched upsert loops entries, not new_entries. ChromaDB doesn't dedup by content. Pre-PR also re-embedded the whole file as one truncated doc, so net wall-time may not regress, but embedding-API call count does. Acceptable trade-off; worth a follow-up.
API contract change — tool_get_drawer(drawer_id) / tool_delete_drawer(drawer_id) return "not found" on the chunked path. Documented in the new docstring; worth a CHANGELOG entry.
MempalaceConfig() constructed fresh inside ingest_diaries vs. _config singleton used in mcp_server.py. Inconsistent; harmless if construction is cheap.
entry_text = f"{header}\n{body}" if body else header silently drops the trailing newline for header-only entries vs. the pre-PR f"{header}\n{body}". Trivial verbatim diff.
Comment in tool_diary_write says "if the embedding model fails mid-loop" but the code is a single batched col.add() (no loop). Cosmetic.

Local test run (the subset touched by this PR): 261 passed. The single failure was a httpx.ConnectTimeout during first-run embedding-model SSL handshake on a clean machine — unrelated to this PR, and the upstream CI is fully green across all 5 OS/Python combos.

Merging.

Bumps version 3.3.5 → 3.3.6 across pyproject.toml, version.py, plugin manifests (.claude-plugin/plugin.json, .claude-plugin/marketplace.json, .codex-plugin/plugin.json), README badge, and uv.lock. Flips CHANGELOG.md from ``[Unreleased]`` to ``[3.3.6] — 2026-05-24`` and backfills the major user-facing entries that landed without changelog entries during the cycle: Features: - MemPalace#1555 office-document mining via --mode extract + virtual line numbers - MemPalace#1584 surgical closet pointers with date+line locators (Tier 6a) - MemPalace#1558 + MemPalace#1560 within-wing hallways (entity co-occurrence graph) - MemPalace#1565 cross-wing tunnels auto-promoted from hallways - MemPalace#1578 Hebbian potentiation + Ebbinghaus decay on hallways/tunnels - MemPalace#1236 API-tool transcripts auto-route to wing_api - MemPalace#711 hooks.auto_save toggle for silent-mode sessions - MemPalace#1605 COCA content-word filter for entity detection - MemPalace#1557 case-insensitive entity matching at mine time - MemPalace#1483 multilingual embeddings (embeddinggemma-300m) by default Bug Fixes (selected, user-visible): - MemPalace#1540 silent data loss in three unchunked upsert sites - MemPalace#1538 paragraph chunker oversized chunks - MemPalace#1554 per-file chunk cap too low for transcripts - MemPalace#1562 Windows hook subprocess/ChromaDB deadlock - MemPalace#1529 create_tunnel corrupted hyphenated wing names - MemPalace#1424 save-hook truncated hyphenated project folders - MemPalace#1383 KG cache duplicated graphs for symlinked/cased paths - MemPalace#1466 silent symlink skip now logged - MemPalace#1441 macOS stock-bash 3.2 hook compatibility - MemPalace#1500 / MemPalace#1513 structured JSON-RPC errors on bad MCP input - MemPalace#1523 VACUUM + FTS5 rebuild after repair - MemPalace#1548 FTS5 validation at end of mine - plus MemPalace#1216, MemPalace#1408, MemPalace#1438, MemPalace#1439, MemPalace#1445, MemPalace#1452, MemPalace#1459, MemPalace#1461, MemPalace#1466, MemPalace#1470, MemPalace#1477, MemPalace#1485, MemPalace#1500, MemPalace#1513, MemPalace#1528, MemPalace#1532, MemPalace#1543, MemPalace#1546, MemPalace#1585 Performance: - MemPalace#1474 convo miner pre-fetches mined-set - MemPalace#1487 rebuild_index progress callback - MemPalace#1530 MCP cold-start diagnostics + opt-in warmup Lint passes (ruff 0.15.14); mempalace-mcp entry point alignment verified per RELEASING.md.

gemini-code-assist Bot reviewed May 17, 2026

View reviewed changes

Comment thread mempalace/diary_ingest.py

mvalentsev force-pushed the fix/1539-three-unbounded-upsert-sites branch from a68bd1b to 4364d6f Compare May 18, 2026 00:46

mvalentsev marked this pull request as ready for review May 18, 2026 01:12

mvalentsev requested review from igorls and milla-jovovich as code owners May 18, 2026 01:12

mvalentsev force-pushed the fix/1539-three-unbounded-upsert-sites branch 3 times, most recently from b35f2f2 to 1155064 Compare May 18, 2026 18:56

igorls added this to the v3.3.6 milestone May 18, 2026

mvalentsev force-pushed the fix/1539-three-unbounded-upsert-sites branch 3 times, most recently from 84fc486 to 39789d9 Compare May 21, 2026 06:19

This was referenced May 21, 2026

Audit: CHUNK_SIZE not enforced before embedding upsert in general_extractor.py and diary_ingest.py (same class as #1534) #1539

Closed

fix: enforce CHUNK_SIZE before embedding upsert at 4 crash sites (#1539) #1569

Closed

mvalentsev force-pushed the fix/1539-three-unbounded-upsert-sites branch from 39789d9 to 725d73c Compare May 21, 2026 18:23

mvalentsev force-pushed the fix/1539-three-unbounded-upsert-sites branch from 725d73c to 6658a4d Compare May 21, 2026 19:21

igorls approved these changes May 21, 2026

View reviewed changes

igorls merged commit fa7f87f into MemPalace:develop May 21, 2026
6 checks passed

igorls mentioned this pull request May 21, 2026

searcher: _expand_with_neighbors stitches unrelated chunks across MCP drawers sharing empty source_file #1580

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(audit): chunk content before embedding upsert (#1539)#1540

fix(audit): chunk content before embedding upsert (#1539)#1540
igorls merged 1 commit into
MemPalace:developfrom
mvalentsev:fix/1539-three-unbounded-upsert-sites

mvalentsev commented May 17, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

igorls left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

mvalentsev commented May 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

How to test

Checklist

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

igorls left a comment

Choose a reason for hiding this comment

Approval with one follow-up

One concern — opening as a follow-up rather than blocking merge

Smaller notes (no action required for merge)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

mvalentsev commented May 17, 2026 •

edited

Loading