Skip to content

perf: batch ChromaDB inserts in miner - 10-30x faster mining#1085

Closed
midweste wants to merge 2 commits into
MemPalace:developfrom
midweste:midweste
Closed

perf: batch ChromaDB inserts in miner - 10-30x faster mining#1085
midweste wants to merge 2 commits into
MemPalace:developfrom
midweste:midweste

Conversation

@midweste

@midweste midweste commented Apr 21, 2026

Copy link
Copy Markdown

perf: batch ChromaDB inserts in miner — 10-30x faster mining

Problem

process_file() called add_drawer() once per chunk, resulting in:

  • N separate ONNX embedding passes per file (one per chunk)
  • N separate ChromaDB upsert calls per file
  • N redundant datetime.now() and os.path.getmtime() syscalls per file

On large files (1000+ chunks), this made mining slow.

Solution

1. DRY metadata construction via _build_drawer() helper

Extracted the drawer ID generation and metadata assembly into a shared _build_drawer() function used by both:

  • add_drawer() — single-insert path (unchanged contract, used by MCP tools)
  • add_drawers() — new batch-insert path (used by process_file())

2. Batched upserts with sub-batching

add_drawers() collects all chunks into batch lists and upserts in groups of CHROMA_BATCH_LIMIT (5000) to stay under ChromaDB's hard cap of 5,461.

3. Hoisted syscalls

datetime.now() and os.path.getmtime() are called once per file and shared across all chunks via _build_drawer(now=..., source_mtime=...).

Results

Metric Before After
ONNX passes per file N (per chunk) ⌈N/5000⌉
DB writes per file N ⌈N/5000⌉
Syscalls per file 2N 2
Metadata construction duplicated in add_drawer + process_file single _build_drawer helper

Tested against a 499-file project (~70k drawers) with zero errors, including an 18,803-chunk benchmark file.

Breaking Changes

None. add_drawer() retains its existing signature and behavior. add_drawers() is additive.

@igorls igorls added performance Performance improvements area/mining File and conversation mining storage labels Apr 24, 2026
jphein added a commit to techempower-org/mempalace that referenced this pull request Apr 26, 2026
…nserts)

README fork-change-queue gets a Performance row above the L1
importance pre-filter row. CLAUDE.md gets a row 27 noting the
conflict-resolution choices (kept fork's DRAWER_UPSERT_BATCH_SIZE=1000,
aliased upstream's CHROMA_BATCH_LIMIT to it).

The cherry-pick is already on upstream develop, so this row exists
mainly to document the merge-conflict resolution for the next
upstream→main sync. Will become a no-op when develop ships.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jphein added a commit to techempower-org/mempalace that referenced this pull request Apr 26, 2026
I wrote "already in upstream develop" without running gh pr view, and
JP caught it. MemPalace#1085 was created 2026-04-21, still open against develop.
Re-framed: fork cherry-picked an open upstream PR; the row clears when
MemPalace#1085 lands and we sync develop→main. We don't file a competing
fork-side PR — the proposal is @midweste's; we wait.

Lesson re-applied from feedback_verify_merged_not_described.md: always
read upstream state via gh, never infer from the README's PR list.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jphein added a commit to techempower-org/mempalace that referenced this pull request Apr 26, 2026
Upstream CHANGELOG.md tracks released versions of MemPalace and is
merged from upstream — mixing fork-ahead changes into it would create
upstream-sync conflicts.

This file is the supplement: date-based sections (no semver — fork
doesn't cut its own release tags), Keep-a-Changelog buckets (Added /
Changed / Fixed / Performance). When a fork-ahead row lands upstream,
the entry moves to the "Merged into upstream" section at the bottom,
kept ~2 weeks then trimmed.

Backfilled with 2026-04-25 + 2026-04-26 entries covering: the
checkpoint collection split (phases A–D, PreCompact incorporation,
the new MCP tool), the kind= filter (transitional safety net), the
HNSW cold-start gate + integrity gate, drawer_id surfacing, the
deploy script, the MemPalace#1085 cherry-pick, palace_graph None guard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
midwestE added 2 commits May 5, 2026 04:10
Prevent HNSW index corruption by deferring Ctrl+C until the current
file's ChromaDB upsert completes, then flushing the background
compactor before process exit.

- Add safe_mine_session context manager (palace.py) centralizing
  SIGINT deferral + close_palace flush + signal restoration order
- Apply to both mine() and mine_convos() entry points
- Replace silent except-pass with logger.warning on close failure
- Add 18 tests covering lifecycle, signal handling, and close paths

Closes: HNSW corruption on Ctrl+C during mining
- Register atexit.register(_DEFAULT_BACKEND.close) in palace.py to flush
  ChromaDB on MCP server process termination (SIGTERM, sys.exit, etc.)
- Add ChromaBackend._cache_key() to normalize palace paths via
  Path.resolve(), preventing silent cache misses when close_palace()
  receives a different path format than _client() used
- Add 8 tests covering atexit, close idempotency, and path normalization
@midweste

midweste commented May 5, 2026

Copy link
Copy Markdown
Author

Closing this PR — the core optimization (batched ChromaDB upserts) was independently implemented in #1185 (perf/batched-upsert-gpu), which expanded on the same approach with GPU acceleration support and hardware-aware embedding functions.

All key improvements from this PR are now covered in develop:

DRY metadata builder (_build_drawer_metadata)
Batched collection.upsert() per file instead of per-chunk
Hoisted os.path.getmtime() out of the per-chunk loop
Sub-batching via DRAWER_UPSERT_BATCH_SIZE to stay under ChromaDB's hard cap
Thanks for landing the expanded version! 🎉

@midweste midweste closed this May 5, 2026
jphein added a commit to techempower-org/mempalace that referenced this pull request May 16, 2026
…MemPalace#1185 (#36) (#89)

Upstream PR MemPalace#1085 was closed by @midweste on 2026-05-05 in favor of
[MemPalace#1185](MemPalace#1185) which merged
to develop on 2026-04-24. MemPalace#1185 is a wider-scope superseder: same
per-chunk batch-insert path plus optional GPU acceleration.

Our cherry-pick at 6be6fff is now functionally redundant with develop.
Updated three fork docs that previously claimed MemPalace#1085 was open and
implied the cherry-pick would become a no-op only when MemPalace#1085 itself
merged:
  - FORK_CHANGELOG.md (the canonical cherry-pick entry)
  - README.md "Composition layer" callout
  - README.md "Cherry-picks" comparison table row

Cherry-pick can be dropped on the next upstream develop sync; flagged
that on all three rows.

Closes #36

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
jphein added a commit to techempower-org/mempalace that referenced this pull request May 22, 2026
* docs: update jphein/ URLs to techempower-org/ (#118)

- render-docs.py: changelog header and commit_link() now point at
  techempower-org/mempalace so the FORK_CHANGELOG regenerates with
  the correct repo slug.
- check-docs.sh: cross-repo URL filter accepts both jphein/ and
  techempower-org/ slugs while the YAML body still carries historical
  jphein/ narrative; the retired-check comment updated for symmetry.
- README.md: sweep all https://github.com/jphein/mempalace and
  https://github.com/jphein/palace-daemon URLs to their techempower-org
  equivalents (PR links, commit links, deployment-runs-at slug).
- FORK_CHANGELOG.md: regenerated from docs/fork-changes.yaml so every
  rendered commit link resolves to techempower-org/mempalace; the
  historical body text inside YAML entries is unchanged.

Historical narrative ("transferred from jphein/mempalace in May 2026")
and links to other personal forks (jphein/familiar.realm.watch,
jphein/notebook, etc.) stay as-is — only the two repos that actually
moved to techempower-org are rewritten.

Closes #118

* docs: mark PR MemPalace#1024 merged, MemPalace#1085 closed; drop stale ChromaDB references

- docs/fork-changes.yaml: PR MemPalace#1024 (configurable chunking) moved from
  OPEN to MERGED with merged_date 2026-05-15; added to merged_upstream
  entries so the FORK_CHANGELOG recently-merged section reflects the
  2026-05-15 upstream land.
- docs/fork-changes.yaml: PR MemPalace#1085 (batched ChromaDB inserts cherry-pick)
  moved from OPEN to CLOSED with body note that @midweste closed it
  2026-05-16, superseded by merged upstream MemPalace#1185 (wider GPU-acceleration
  scope). Fork's 6be6fff is now a no-op against develop.
- CLAUDE.md: Key Files block rewrites the stale "ChromaDB vector store.
  The actual data." line to point at postgres on disks.jphe.in via
  palace-daemon and reference the 2026-05-14 cutover + RETIRED marker.
- CLAUDE.md: Two-Layer Memory Architecture line updates MemPalace
  storage from "~/.mempalace/palace/, ~183K drawers" to the production
  postgres + pgvector + AGE on disks.jphe.in with a normalized "300K+"
  drawer count.

README.md drawer-count normalization (160K/273K/274K -> 300K+) and
ChromaDB->pgvector edits landed in efcb5c6 alongside the #118 URL
sweep. FORK_CHANGELOG.md regenerates clean from the updated YAML.

Closes #115, #116

* fix(docs): unbreak check-docs test-count lint and converge on .venv

The README test-count drift check in scripts/check-docs.sh anchored its
regex at start-of-line, but the count lives mid-paragraph on README:28.
The check silently no-op'd, letting "~1850 tests pass on main" drift to
2486 actual without a single warning (#114).

De-anchor the regex; the rest of the pipeline already does the right
thing (counts via pytest --collect-only).

Also converge on .venv/ — the project's actual venv path — instead of
the dead venv/ path hard-coded in scripts/check-docs.sh and
scripts/preflight.sh. Fixes the silent fallback-to-PATH behavior that
masked the wrong path for so long (#117). CLAUDE.md:59 was already
inconsistent with :46 — both now say .venv.

Refs: #114, #117
jphein added a commit to techempower-org/mempalace that referenced this pull request May 25, 2026
…) (#197)

Issue #165 asked to drop the MemPalace#1085 cherry-pick on the next upstream
sync, on the premise that upstream MemPalace#1185 (merged) had superseded it.
That premise held on 2026-05-16 but is no longer accurate: in the 73
commits to miner.py / convo_miner.py / format_miner.py since the
cherry-pick landed, the fork built on top of these primitives in ways
upstream MemPalace#1185 does not provide:

1. add_drawers() is a fork-only public API that returns
   (added, batch_ids, warnings) and is wired to room-taxonomy
   validation (#86). Consumed by test_room_taxonomy.py.
2. DRAWER_UPSERT_BATCH_SIZE / CHROMA_BATCH_LIMIT sub-batching knob —
   upstream does one giant upsert per file (OOM risk on pathological
   files); fork sub-batches in groups of 1000. Used from miner.py,
   convo_miner.py, format_miner.py and monkeypatched by tests.
3. _build_drawer_metadata carries fork-only Tier 6a extensions
   (line_start, line_end, content_date) that closet pointers depend
   on. Tested at tests/test_miner.py.

Updates the fork-changes.yaml entry to mark the cherry-pick as
"absorbed into fork architecture" rather than "drop on next sync",
and regenerates FORK_CHANGELOG.md from the YAML.

Closes #165.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/mining File and conversation mining performance Performance improvements storage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants