Skip to content

fix(audit): chunk content before embedding upsert (#1539)#1540

Merged
igorls merged 1 commit into
MemPalace:developfrom
mvalentsev:fix/1539-three-unbounded-upsert-sites
May 21, 2026
Merged

fix(audit): chunk content before embedding upsert (#1539)#1540
igorls merged 1 commit into
MemPalace:developfrom
mvalentsev:fix/1539-three-unbounded-upsert-sites

Conversation

@mvalentsev

@mvalentsev mvalentsev commented May 17, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

Closes #1539. Fixes four sites that fed unbounded content to collection.upsert(documents=[...]) or collection.add(documents=[...]). On the current mempalace/develop 95caf80 the upstream embedder accepts the oversized input and silently truncates it at its 256-token window, so the stored drawer's search vector represents only the first ~512 characters and the rest of the content becomes undiscoverable by semantic search. The same code path still hits RuntimeError: Invalid buffer size on multi-MB single-document inputs (the crash class #1534 fixed in convo_miner._chunk_by_paragraph); the four sites here were catalogued in #1539 during that audit.

Site 1: general_extractor.extract_memories (mempalace/general_extractor.py:426-444)

A 200,000-char single segment with classification markers used to become one oversized memory whose search vector represented only the first ~512 characters of the segment (and crashed embedding upsert outright on the pre-#1534 attention-buffer behavior). The fix adds a chunk_size: int = DEFAULT_CHUNK_SIZE parameter and a post-classification slicer that propagates memory_type to every slice. Reclassifying sub-slices would risk dropping data (markers may live in only one slice), so the label is approximated for sub-slices rather than re-derived. convo_miner.py:552 (the only in-tree caller) now passes chunk_size=cfg_chunk_size so MempalaceConfig.chunk_size reaches this path.

Site 2: diary_ingest.ingest_diaries (mempalace/diary_ingest.py:230-273)

A multi-MB daily diary file used to upsert as ONE document. The fix introduces per-entry drawers via the existing _split_entries(text) call, with character-chunk fallback for any single entry larger than chunk_size. New _diary_drawer_id_entry(wing, date_str, entry_idx, entry_chunk_idx) helper carries the per-entry/per-chunk position.

chunk_index in drawer metadata is a global counter across the file so searcher._expand_with_neighbors (which queries by source_file + chunk_index) stitches sibling chunks back together regardless of entry boundary.

Upsert is batched atomic per file: one drawers_col.upsert(ids=..., documents=..., metadatas=...) call carries every entry/chunk for the file, so a mid-pass embedding failure either commits the day in full or leaves no partial drawers behind. The state file is updated only after the upsert returns, so a failed batch is retried on the next run rather than silently skipped.

Auto-purge on full rebuild deletes prior-pass drawers via where={"source_file": ...}, which also migrates pre-#1539 legacy drawer_diary_ IDs as a side effect of normal use. No separate migration command needed; existing palaces auto-clean on the first ingest_diaries call that triggers a full rebuild (force=True or detected content change).

Site 3: mcp_server.tool_diary_write (mempalace/mcp_server.py:1599-1710)

The MCP tool accepted entries up to 100 KB (the sanitize_content cap) and filed them as one document. The fix splits oversized entries into bounded per-chunk drawers via a single batched col.add(ids=..., documents=..., metadatas=...) so the embedding pass is atomic (no half-written palace if the embedding model fails mid-loop). Each chunk carries parent_entry_id linkage and chunk_index metadata. Return shape is additive: chunks field is always present, chunk_ids is added on the chunked path. Single-chunk entries return chunks=1 and an entry_id that matches the stored drawer id (preserves the pre-#1539 contract).

Site 4: mcp_server.tool_add_drawer (mempalace/mcp_server.py:1103-1215)

The MCP mempalace_add_drawer tool accepted content up to 100 KB through sanitize_content and filed it as one document, reproducing the Site 3 design defect on the more common add_drawer surface. The fix mirrors Site 3: content above chunk_size is split into bounded per-chunk drawers via a single batched col.upsert(ids=..., documents=..., metadatas=...). Each chunk carries parent_drawer_id linkage and chunk_index metadata. Chunk ids use a 6-digit {drawer_id}_chunk_NNNNNN suffix so even a low-chunk_size config (down to single-digit) cannot lex-sort chunks out of order.

Idempotency on the chunked path probes both the last chunk id (its presence implies the whole batched upsert committed atomically) and the legacy logical drawer_id (so a pre-#1539 single-row write of oversized content under the same id is detected and skipped rather than getting co-resident chunk siblings on the next call).

Return shape is additive: chunks is always present; chunk_ids is added on the chunked path. Single-chunk content returns chunks=1 and a drawer_id that matches the stored drawer id (preserves the pre-#1539 contract). tool_get_drawer(drawer_id) and tool_delete_drawer(drawer_id) against the logical handle return "not found" on the chunked path; callers must iterate chunk_ids or query by parent_drawer_id metadata. The docstring on tool_add_drawer documents this contract.

Adjacent known issue, NOT closed by this PR

#390 (open) reports that the default CHUNK_SIZE = 800 exceeds the default embedding model's 256-token / ~512-char window, causing silent truncation in embeddings. That is a different failure mode than the crash this PR fixes: crash on multi-MB input vs. silent quality degradation on chunks above ~512 chars. The PR honours whatever chunk_size is configured (MempalaceConfig.chunk_size, via #1024), so users hitting the silent-truncation case today can drop chunk_size in their config. Default change and per-model binding (#442) are out of scope here.

Out-of-scope sites surfaced during the audit (filed as follow-ups, not fixed here)

  • mempalace/miner.py:add_drawer legacy shim: no in-tree caller passes unbounded content; recommend a size-cap if a future external caller wires raw documents through.
  • mempalace/sources/context.py:upsert_drawer interface (PalaceContext): RFC 002 adapter surface, no in-tree adapter currently constructs records of unbounded size; recommend a separate issue to enforce len(record.content) <= chunk_size at the boundary before the adapter surface stabilises.
  • mempalace/sweeper.py:parse_claude_jsonl per-message content: a transcript message carrying a multi-MB tool_result block reaches collection.upsert(documents=[...]) at line 251 unbounded. Separate code path (Stop hook only), file a dedicated follow-up issue.

Thanks to @davidglidden for the upstream #1534 report that surfaced the audit.

How to test

Manual reproducers (all four exit 0 and emit CHUNK_SIZE-bounded drawers on this branch; on mempalace/develop 95caf80 the same calls also exit 0 but file the input as a single oversized drawer whose search vector represents only the first ~512 characters. Observe chunks field absent or equal to 1 on develop, versus multiple bounded chunks on this branch):

# Site 1: extract_memories
TMPDIR=$(mktemp -d)
uv run python -c "open('$TMPDIR/long.txt','w').write('We decided to migrate to PostgreSQL because performance matters. ' * 200)"
uv run python -m mempalace mine "$TMPDIR" --mode convos --extract general --wing repro-1539-extract

# Site 2: diary_ingest with multi-MB single entry
DIARY=$(mktemp -d)/2026-05-18.md
PALACE=$(mktemp -d)
uv run python -c "
content = '# 2026-05-18\n\n## 10:00\nshort entry one.\n\n## 11:00\n' + ('A' * 50000) + '\n\n## 12:00\nshort entry two.\n'
open('$DIARY','w').write(content)
"
uv run python -c "
from mempalace.diary_ingest import ingest_diaries
print(ingest_diaries('$(dirname $DIARY)', '$PALACE', wing='repro-1539-diary', force=True))
"

# Site 3: tool_diary_write via direct invocation
uv run python -c "
from mempalace.mcp_server import tool_diary_write
result = tool_diary_write(agent_name='audit', entry='Z' * 5000, wing='repro-1539-mcp')
assert result['success'] and result['chunks'] > 1
print(result)
"

# Site 4: tool_add_drawer via direct invocation
uv run python -c "
from mempalace.mcp_server import tool_add_drawer
result = tool_add_drawer(wing='repro-1539-add', room='paste', content='X' * 10000)
assert result['success'] and result.get('chunks', 1) > 1
print(result)
"

Unit suite (21 new tests across the four sites, including atomicity coverage for Site 2 and contract coverage for Site 4):

uv run pytest tests/test_general_extractor.py tests/test_closets.py::TestDiaryIngest tests/test_mcp_server.py::TestWriteTools tests/test_mcp_server.py::TestDiaryTools -v

Coverage added:

  • extract_memories oversized-segment slicing with memory_type propagation + custom chunk_size plumbing
  • diary_ingest per-entry drawer count, oversized-entry character chunking, file-shrink orphan purge, header-only entry, global chunk_index contiguity, incremental append idempotence, batched-upsert atomicity on embedding failure
  • tool_diary_write normal vs. chunked return shape, chunk_index metadata, parent_entry_id linkage, joined verbatim equality
  • tool_add_drawer normal vs. chunked return shape, chunk_index + parent_drawer_id metadata, idempotency on re-call with identical oversized content (last-chunk + legacy drawer_id probe), boundary at len == chunk_size, documented contract that tool_get_drawer / tool_delete_drawer against the logical drawer_id report "not found" on the chunked path

Checklist

  • Tests pass (uv run pytest tests/ -v --ignore=tests/benchmarks)
  • No hardcoded paths
  • Linter passes (uvx --from 'ruff==0.15.9' ruff check . && uvx --from 'ruff==0.15.9' ruff format --check . per CI ruff pin)

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements per-entry chunking for diary entries and general memories to ensure document sizes remain within embedding model limits. It introduces a new drawer ID scheme, updates the "diary_ingest" and "mcp_server" modules to respect a configurable chunk size, and includes logic to purge legacy drawers during full rebuilds. Comprehensive tests were added to verify the new chunking and indexing behavior. Feedback was provided regarding the efficiency of database operations in "diary_ingest.py", specifically suggesting that individual "upsert" calls be batched to improve performance and ensure atomicity.

Comment thread mempalace/diary_ingest.py
@mvalentsev mvalentsev force-pushed the fix/1539-three-unbounded-upsert-sites branch from a68bd1b to 4364d6f Compare May 18, 2026 00:46
@mvalentsev mvalentsev marked this pull request as ready for review May 18, 2026 01:12
@mvalentsev mvalentsev force-pushed the fix/1539-three-unbounded-upsert-sites branch 3 times, most recently from b35f2f2 to 1155064 Compare May 18, 2026 18:56
@igorls igorls added this to the v3.3.6 milestone May 18, 2026
@mvalentsev mvalentsev force-pushed the fix/1539-three-unbounded-upsert-sites branch 3 times, most recently from 84fc486 to 39789d9 Compare May 21, 2026 06:19
@mvalentsev mvalentsev force-pushed the fix/1539-three-unbounded-upsert-sites branch from 39789d9 to 725d73c Compare May 21, 2026 18:23
…e#1539)

Four sites passed content to `collection.upsert(documents=[...])` or
`collection.add(documents=[...])` without per-drawer size cap, hitting
the same `RuntimeError: Invalid buffer size` crash class from the
embedding model's attention buffer:

- `general_extractor.extract_memories`: post-classification slicer
  that propagates `memory_type` to every slice. New `chunk_size`
  parameter defaulting to `DEFAULT_CHUNK_SIZE` and resolved from
  `MempalaceConfig.chunk_size` by the caller.
- `diary_ingest.ingest_diaries`: per-entry drawers via existing
  `_split_entries`, with character-chunk fallback for any single
  entry larger than `chunk_size`. New `_diary_drawer_id_entry`
  helper carries (entry_idx, entry_chunk_idx). `chunk_index` in
  metadata is a global counter across the file so
  `searcher._expand_with_neighbors` stitches sibling chunks
  regardless of entry boundary. Upsert is batched atomic per file
  (one call carrying every entry/chunk) so a mid-pass embedding
  failure cannot half-write the day. Auto-purge on full rebuild
  deletes prior-pass drawers via `where={"source_file": ...}`,
  which also migrates pre-MemPalace#1539 legacy `drawer_diary_` IDs as a
  side effect of normal use.
- `mcp_server.tool_diary_write`: split oversized entries into
  bounded per-chunk drawers via a single batched `col.add` (atomic,
  no half-write on embedding failure). `col.add` is intentional:
  `entry_id` is timestamp-based with microsecond precision, so a
  duplicate is a same-microsecond clash that should surface rather
  than silently overwrite.
- `mcp_server.tool_add_drawer`: same crash class on the more common
  add-drawer surface (100 KB sanitize cap, 125x `CHUNK_SIZE`).
  Chunked path mirrors Site 3 with batched atomic `col.upsert`,
  per-chunk `parent_drawer_id` + `chunk_index` metadata, and a
  dual-id idempotency probe (last chunk for atomicity, legacy
  `drawer_id` for pre-MemPalace#1539 single-row backwards-compat). Return
  shape additive: `chunks` always present, `chunk_ids` on the
  chunked path. `tool_get_drawer` / `tool_delete_drawer` against
  the logical handle report "not found" on the chunked path;
  callers iterate `chunk_ids` or query `parent_drawer_id`.

Chunk id width is `:06d` so even a single-digit `chunk_size`
config cannot lex-sort chunks out of order.
@mvalentsev mvalentsev force-pushed the fix/1539-three-unbounded-upsert-sites branch from 725d73c to 6658a4d Compare May 21, 2026 19:21

@igorls igorls left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approval with one follow-up

Thanks for the thorough audit @mvalentsev — the PR cleanly addresses all four #1539 sites, the verbatim-store mandate is preserved, the batched single-call upsert gives the right atomicity guarantee, and the v2_ prefix + per-source delete(where=...) migration runs as a side effect of normal use without a separate command.

Idempotency probe in tool_add_drawer (probing both legacy drawer_id and the last chunk id) is the right call; boundary off-by-ones check out at len(content) ∈ {chunk_size, chunk_size+1, 2·chunk_size}. The 6-digit zero-padded chunk suffix correctly prevents lex-sort breakage at small chunk_size configs.

One concern — opening as a follow-up rather than blocking merge

tool_add_drawer(content=<oversized>, source_file=None) stores every chunk with source_file="". Across multiple such calls, all chunks share that empty key. A search hit on chunk 0 of drawer A can pull "neighbors" that are actually chunks from drawer B — searcher._expand_with_neighbors queries by source_file + chunk_index and doesn't filter by parent_drawer_id.

Pre-PR this couldn't happen because oversized MCP content was filed as one drawer (the bug being fixed). Post-PR, the chunks are siblings under a shared empty source_file. Scope is narrow (MCP paste-in over chunk_size without an explicit source_file) and only affects neighbor expansion — direct get_drawer still returns correct verbatim content.

Fix shape is one line in _expand_with_neighbors: skip neighbor expansion when source_file is empty, or add parent_drawer_id to the $and filter when present. Opening as a separate issue so this can land now.

Smaller notes (no action required for merge)

  • Diary incremental re-embeds all entries every pass — the batched upsert loops entries, not new_entries. ChromaDB doesn't dedup by content. Pre-PR also re-embedded the whole file as one truncated doc, so net wall-time may not regress, but embedding-API call count does. Acceptable trade-off; worth a follow-up.
  • API contract changetool_get_drawer(drawer_id) / tool_delete_drawer(drawer_id) return "not found" on the chunked path. Documented in the new docstring; worth a CHANGELOG entry.
  • MempalaceConfig() constructed fresh inside ingest_diaries vs. _config singleton used in mcp_server.py. Inconsistent; harmless if construction is cheap.
  • entry_text = f"{header}\n{body}" if body else header silently drops the trailing newline for header-only entries vs. the pre-PR f"{header}\n{body}". Trivial verbatim diff.
  • Comment in tool_diary_write says "if the embedding model fails mid-loop" but the code is a single batched col.add() (no loop). Cosmetic.

Local test run (the subset touched by this PR): 261 passed. The single failure was a httpx.ConnectTimeout during first-run embedding-model SSL handshake on a clean machine — unrelated to this PR, and the upstream CI is fully green across all 5 OS/Python combos.

Merging.

@igorls igorls merged commit fa7f87f into MemPalace:develop May 21, 2026
6 checks passed
arnoldwender pushed a commit to arnoldwender/mempalace that referenced this pull request May 24, 2026
Bumps version 3.3.5 → 3.3.6 across pyproject.toml, version.py, plugin
manifests (.claude-plugin/plugin.json, .claude-plugin/marketplace.json,
.codex-plugin/plugin.json), README badge, and uv.lock. Flips CHANGELOG.md
from ``[Unreleased]`` to ``[3.3.6] — 2026-05-24`` and backfills the
major user-facing entries that landed without changelog entries during
the cycle:

Features:
- MemPalace#1555 office-document mining via --mode extract + virtual line numbers
- MemPalace#1584 surgical closet pointers with date+line locators (Tier 6a)
- MemPalace#1558 + MemPalace#1560 within-wing hallways (entity co-occurrence graph)
- MemPalace#1565 cross-wing tunnels auto-promoted from hallways
- MemPalace#1578 Hebbian potentiation + Ebbinghaus decay on hallways/tunnels
- MemPalace#1236 API-tool transcripts auto-route to wing_api
- MemPalace#711 hooks.auto_save toggle for silent-mode sessions
- MemPalace#1605 COCA content-word filter for entity detection
- MemPalace#1557 case-insensitive entity matching at mine time
- MemPalace#1483 multilingual embeddings (embeddinggemma-300m) by default

Bug Fixes (selected, user-visible):
- MemPalace#1540 silent data loss in three unchunked upsert sites
- MemPalace#1538 paragraph chunker oversized chunks
- MemPalace#1554 per-file chunk cap too low for transcripts
- MemPalace#1562 Windows hook subprocess/ChromaDB deadlock
- MemPalace#1529 create_tunnel corrupted hyphenated wing names
- MemPalace#1424 save-hook truncated hyphenated project folders
- MemPalace#1383 KG cache duplicated graphs for symlinked/cased paths
- MemPalace#1466 silent symlink skip now logged
- MemPalace#1441 macOS stock-bash 3.2 hook compatibility
- MemPalace#1500 / MemPalace#1513 structured JSON-RPC errors on bad MCP input
- MemPalace#1523 VACUUM + FTS5 rebuild after repair
- MemPalace#1548 FTS5 validation at end of mine
- plus MemPalace#1216, MemPalace#1408, MemPalace#1438, MemPalace#1439, MemPalace#1445, MemPalace#1452, MemPalace#1459, MemPalace#1461, MemPalace#1466,
  MemPalace#1470, MemPalace#1477, MemPalace#1485, MemPalace#1500, MemPalace#1513, MemPalace#1528, MemPalace#1532, MemPalace#1543, MemPalace#1546, MemPalace#1585

Performance:
- MemPalace#1474 convo miner pre-fetches mined-set
- MemPalace#1487 rebuild_index progress callback
- MemPalace#1530 MCP cold-start diagnostics + opt-in warmup

Lint passes (ruff 0.15.14); mempalace-mcp entry point alignment
verified per RELEASING.md.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Audit: CHUNK_SIZE not enforced before embedding upsert in general_extractor.py and diary_ingest.py (same class as #1534)

2 participants