feat: add configurable multilingual embedding model support#442
feat: add configurable multilingual embedding model support#442NickShtefan wants to merge 5 commits into
Conversation
Real-World Testing ResultsWe applied this patch to our production MemPalace instance and re-mined ~150 files across 3 projects. Here are the findings: 📊 Impact
🔍 Search Quality ImprovementBefore: Searching for "extension config" returned mypy type cache data ( After: Same query returns actual extension code and documentation with match scores around 🐛 Additional Fix AppliedWe also excluded mypy cache files which were polluting search results: # miner.py line 304
) or filename.endswith((".data.json", ".meta.json")):
continueThis reduced our palace size by an additional ~24,000 drawers of type metadata that have no semantic search value. ✅ RecommendationMerge this PR. The smaller chunk size:
The overlap reduction from 100→50 is also appropriate since chunks are now half the size. Test environment:
|
|
This PR covers the embedding function centralization part of #231 — configurable model via env var / config.json, all 7 ChromaDB consumer modules patched, optional I tested this on 245 Russian blog posts (5 years of content). With If @EndeavorYen's more comprehensive implementation lands first (room classification, entity detection, spellcheck), happy to close this — it can serve as a reference for the embedding centralization pattern. Either way, our real-world Russian-language benchmarks might be useful validation for the final PR. |
|
Nice PR! The env → config.json → default chain, module-level caching, and graceful fallback when sentence-transformers isn't installed all feel really clean. The CHUNK_SIZE 800→450 fix is a great bonus too. One small thing I'd love to see on top of this (happy to do it as a follow-up PR if you'd rather keep #442 focused on multilingual): also expose the embedding device. Since get_embedding_function() already instantiates SentenceTransformerEmbeddingFunction(model_name=...), and that class accepts a device kwarg, it's a pretty small addition. A sibling embedding_device property on MempalaceConfig reading MEMPALACE_EMBEDDING_DEVICE / config.json, and conditionally passing it through: kwargs = {"model_name": model_name} The reason I'm asking: MemPalace currently disables CoreML in init.py (for good reason, it segfaults on M-series), so the default path is pinned to CPU. On a large corpus like /.claude/projects/ (1.5 GB of JSONL for me), that's 30 to 90 minutes of embedding on an otherwise-idle Apple Silicon GPU. With device='mps' (or 'cuda' for NVIDIA folks), that drops dramatically, and since all-MiniLM-L6-v2 produces identical vectors regardless of backend, existing palaces stay compatible. It should be pretty low-risk: One minor gotcha worth mentioning in docs: SentenceTransformerEmbeddingFunction caches models by name only (not (name, device)), so switching devices mid-process returns a stale cached model. Only matters for tests/benchmarks, but good to know. Very happy to send a follow-up PR if you'd prefer to keep this one scoped. Either way, thanks for the contribution, this unlocks a lot! |
…ression Builds on top of MemPalace#442 to add two improvements: 1. Expose embedding device via MEMPALACE_EMBEDDING_DEVICE env var / embedding_device config key. This lets Apple Silicon users set device='mps' and NVIDIA users set device='cuda' to dramatically speed up embedding generation during mempalace mine (5-15x measured on M-series). 2. Ergonomic default: setting only MEMPALACE_EMBEDDING_DEVICE automatically uses sentence-transformers/all-MiniLM-L6-v2 (same weights as ChromaDB's default ONNX embedder), so users don't have to know the model name to get GPU acceleration, and existing palaces remain vector-compatible. 3. Fix a regression in MemPalace#442: when no model is configured, get_embedding_function() used to return None, which newer ChromaDB rejects with 'You must provide an embedding function' at collection.add() time. Now returns ChromaDB's DefaultEmbeddingFunction() explicitly, restoring the pre-MemPalace#442 default behavior and making tests/test_convo_miner.py pass again. All 552 tests pass, including 7 new tests covering: - embedding_device property reads from env var and config.json - device is passed through to SentenceTransformerEmbeddingFunction when set - device alone activates the default model - device is NOT passed when unset (preserves original MemPalace#442 call signature) - device can be set via config.json independent of model
|
Hey, one more thing while testing this branch locally. I think there's a small regression in the default path. When MEMPALACE_EMBEDDING_MODEL isn't set (the case for every existing user post-upgrade), get_embedding_function() returns None, which then gets passed explicitly to every ChromaDB call site: col = client.get_collection(collection_name, embedding_function=ef) # ef is None In chromadb 0.6.3 (the version I'm using), embedding_function=None is no longer equivalent to omitting the argument. The collection gets created with a null embedder, and then collection.add() raises: ValueError: You must provide an embedding function to compute embeddings. This breaks tests/test_convo_miner.py::test_convo_mining on this branch. Repro: git fetch upstream pull/442/head:pr-442 && git checkout pr-442 Passes on main, fails on pr-442. The fix is a one-liner in config.py: return DefaultEmbeddingFunction() instead of None for the "no model configured" path, so the 14 call sites you patched can keep passing ef unconditionally: if not model_name: Same pattern in the except block for the missing-sentence-transformers case. Happy to send a small follow-up PR against your branch if easier, or you can patch it directly |
Port PR MemPalace#416's parallelization pattern from miner.py to convo_miner.py. PR MemPalace#416 only batched project mining (mempalace mine <dir>), leaving conversation mining (mempalace mine <dir> --mode convos) on the slow per-chunk collection.add() path. Since convo mining is the primary ingest path for large corpora like ~/.claude/projects/ (~1.5 GB of Claude Code JSONL), this omission negated most of the speedup potential from both PR MemPalace#416 (batching) and PR MemPalace#442 (GPU embedding). Changes: 1. Extract process_convo_file_cpu() — pure CPU worker that normalizes, chunks, detects room, and builds drawer records. Thread-safe by construction: no ChromaDB calls, no shared state, all inputs passed explicitly. Returns (source_file, room, records, room_counts_delta) or None for skipped files. 2. Rewrite mine_convos() ingest loop: - Pre-filter pending files with file_already_mined() (sequential, matches PR MemPalace#416's pattern) - Submit all pending files to ThreadPoolExecutor with MAX_WORKERS = min(32, cpu_count()*2) for parallel normalize/chunk - Main thread accumulates records into batch_ids/docs/metas lists and flushes via collection.upsert() every BATCH_SIZE (128) records - try/finally around the executor guarantees final flush_batch() runs even if a worker raises, preventing silent loss of up to BATCH_SIZE-1 pending drawers - Per-worker exceptions are caught and logged instead of aborting the whole run (each file is independent) 3. Keep dry_run path sequential — matches miner.py, preserves original output formatting (per-file [DRY RUN] lines, room distribution), uses the same extracted worker for consistency. 4. Switch collection.add() -> collection.upsert() — idempotent, removes the try/except 'already exists' dance, matches miner.py. Performance expectations (M-series Mac with MEMPALACE_EMBEDDING_DEVICE=mps): Before (single-file sequential loop + per-chunk .add()): 502 drawers in 12.3s = 40.7 drawers/s After (parallel reads + batched upserts): Expected ~3-5x improvement from batching alone (GPU finally gets meaningful batch sizes), plus another ~2-3x from parallelizing the normalize() step on large JSONL files. Combined: ~5-15x. All 556 tests still pass, including tests/test_convo_miner.py which exercises the real ChromaDB write path end-to-end. No changes required to the public API. Callers (cli.py, mcp_server.py) are unaffected.
…ression Builds on top of MemPalace#442 to add two improvements: 1. Expose embedding device via MEMPALACE_EMBEDDING_DEVICE env var / embedding_device config key. This lets Apple Silicon users set device='mps' and NVIDIA users set device='cuda' to dramatically speed up embedding generation during mempalace mine (5-15x measured on M-series). 2. Ergonomic default: setting only MEMPALACE_EMBEDDING_DEVICE automatically uses sentence-transformers/all-MiniLM-L6-v2 (same weights as ChromaDB's default ONNX embedder), so users don't have to know the model name to get GPU acceleration, and existing palaces remain vector-compatible. 3. Fix a regression in MemPalace#442: when no model is configured, get_embedding_function() used to return None, which newer ChromaDB rejects with 'You must provide an embedding function' at collection.add() time. Now returns ChromaDB's DefaultEmbeddingFunction() explicitly, restoring the pre-MemPalace#442 default behavior and making tests/test_convo_miner.py pass again. All 552 tests pass, including 7 new tests covering: - embedding_device property reads from env var and config.json - device is passed through to SentenceTransformerEmbeddingFunction when set - device alone activates the default model - device is NOT passed when unset (preserves original MemPalace#442 call signature) - device can be set via config.json independent of model
|
Thanks @FabioLissi — great catch on the Co-authored-by: Fabio Lissi fabio.lissi@gmail.com |
|
@NickShtefan Thanks again for pulling in the device support commit. One follow-up thought I wanted to float while it's still fresh.
|
web3guru888
left a comment
There was a problem hiding this comment.
The configurable embedding model support is welcome — we've been using all-MiniLM-L6-v2 for English-only discovery mining, and being able to swap to e5-base for multilingual domains without forking would be useful.
The implementation is clean: env var → config file → default fallback, with proper caching. The device support (mps/cuda) is a nice ergonomic touch.
However, this PR includes an unrelated breaking change that should be split out:
-CHUNK_SIZE = 800 # chars per drawer
-CHUNK_OVERLAP = 100 # overlap between chunks
+CHUNK_SIZE = 450 # chars per drawer
+CHUNK_OVERLAP = 50 # overlap between chunksHalving the chunk size from 800→450 and the overlap from 100→50 is a significant behavioral change that:
- Will produce ~2x more drawers for the same corpus (more ChromaDB storage, slower queries)
- Makes existing palaces inconsistent — old drawers are 800-char, new ones 450-char
- Has nothing to do with multilingual embedding support
If smaller chunks improve retrieval for multilingual models (plausible — shorter contexts embed more precisely), that's worth doing, but it should be a separate PR with its own benchmarks showing the improvement. Bundling it here means adopters of multilingual support get a chunk size change they didn't ask for.
Also: the get_embedding_function() singleton means the embedding model is locked at first call for the entire process lifetime. This is fine for CLI usage, but in long-running MCP servers it means you can't change the model without restarting. Might be worth documenting that constraint.
Real-World Production FindingsI opened issue #390 reporting the truncation problem, and this PR directly addresses it. We've already deployed Our Setup
Measured Results
Why Drawers Decreased (Not Increased)The reviewer's concern about "2x more drawers" didn't materialize because:
Chunk Size & Multilingual EmbeddingsImportant: Even with the new multilingual embedding support, the default embedding issue still applies:
The 450-char default ensures all users get working embeddings out of the box, regardless of which model they choose. If With custom embedding providers, users can set whatever chunk size matches their model's token limit. But the default Quality ImprovementThe +28% relevance gain comes from:
RecommendationMerge this PR. The chunk size change isn't just related to multilingual support — it's a critical fix for the default
The singleton constraint noted by the reviewer is worth documenting, but shouldn't block this fix. Users with long-running MCP TL;DR: We reported the problem in #390, this PR fixes it, and our real-world deployment proves it works better than the |
|
Good point about the singleton — worth clarifying why it's intentional rather than a limitation. Switching the embedding model isn't a hot-swap operation. Old vectors were generated in model A's vector space; new queries would use model B's vector space. Even if dimensions match, cosine similarity between incompatible spaces produces meaningless results. If dimensions differ, ChromaDB rejects the insert outright. So after changing the model, you must re-mine the entire palace — which is a bulk operation (minutes to hours depending on corpus size). Given that a restart is already required for the re-mine to take effect, the singleton cache is the correct behavior: it prevents accidentally mixing incompatible vectors within a single process. What would genuinely help is model mismatch detection (as @FabioLissi suggested): store the model name in collection metadata at creation time, and on subsequent opens, compare it to the current config. On mismatch → clear error with recovery instructions. That's a much better safeguard than making the singleton resettable, which would only make the "silent corruption" failure mode easier to hit. Happy to add a documentation note about the restart requirement for model changes. |
|
@FabioLissi Great suggestion — this is exactly the kind of safeguard that should ship alongside configurable embedding models. Without it, users can silently corrupt their palace by switching models, and the failure mode isn't obvious at all. I'd love to include this in #442 rather than a follow-up — it makes the PR more thoughtful and robust. If you're up for writing it, go ahead! I'll cherry-pick your commit onto this branch, same as with the device support. Thanks for flagging this and for the continued contributions! |
|
Three commits: 6a23c8f: Chunk size env vars 7aa96cc: Embedding model mismatch detection 94412e8 - Re-mine command The recovery path for intentional model switches. Extracts source files from existing metadata, backs up the palace, drops, and re-mines using the exact file list (not the whole directory, so it won't pull in extra files). Supports --dry-run. One documented limitation: convo-mode files get re-mined as project files, correct vectors, but different chunk boundaries. Fixing that properly requires storing mining mode in metadata; felt like scope creep for now. Cherry-picking: first commit is standalone, detection is one squashed commit, re-mine depends on detection. Happy to answer questions or make any adjustments! |
…ression Builds on top of MemPalace#442 to add two improvements: 1. Expose embedding device via MEMPALACE_EMBEDDING_DEVICE env var / embedding_device config key. This lets Apple Silicon users set device='mps' and NVIDIA users set device='cuda' to dramatically speed up embedding generation during mempalace mine (5-15x measured on M-series). 2. Ergonomic default: setting only MEMPALACE_EMBEDDING_DEVICE automatically uses sentence-transformers/all-MiniLM-L6-v2 (same weights as ChromaDB's default ONNX embedder), so users don't have to know the model name to get GPU acceleration, and existing palaces remain vector-compatible. 3. Fix a regression in MemPalace#442: when no model is configured, get_embedding_function() used to return None, which newer ChromaDB rejects with 'You must provide an embedding function' at collection.add() time. Now returns ChromaDB's DefaultEmbeddingFunction() explicitly, restoring the pre-MemPalace#442 default behavior and making tests/test_convo_miner.py pass again. All 552 tests pass, including 7 new tests covering: - embedding_device property reads from env var and config.json - device is passed through to SentenceTransformerEmbeddingFunction when set - device alone activates the default model - device is NOT passed when unset (preserves original MemPalace#442 call signature) - device can be set via config.json independent of model
e1ce332 to
826e3bc
Compare
|
@FabioLissi Thanks for the excellent contributions! I've cherry-picked all three commits into this PR:
During integration I found and fixed two issues:
Full test cycle passed: mine → search → model switch (mismatch detected) → force bypass (dimension error, expected) → re-mine → search with new model. Everything works. Rebased on latest upstream/main, all clean. |
|
@NickShtefan — great catch on the 1.
|
@NickShtefan fixed the CLI path in PR MemPalace#442 (searcher.py) by re-raising EmbeddingModelMismatchError past the generic except clause. The MCP path had the same bug via two additional swallows: - mcp_server.py:_get_collection() wraps _palace_get_collection in except Exception: return None, so every tool using _get_collection (status, list_wings, list_rooms, get_taxonomy, check_duplicate, search, etc.) silently showed "No palace found" instead of the actual mismatch error. - The top-level dispatcher in handle_request() catches Exception and returns hardcoded "Internal tool error", so even if the wrapper re-raises, the actionable message (stored vs current model + recovery instructions) never reaches the caller. Fix in three spots: 1. searcher.py: re-raise EmbeddingModelMismatchError in search() and search_memories() before the generic except (matches @NickShtefan's fix). 2. mcp_server.py:_get_collection(): re-raise EmbeddingModelMismatchError before the generic swallow. 3. mcp_server.py handle_request(): catch EmbeddingModelMismatchError explicitly and return the message as a JSON result so MCP clients see the real error and recovery instructions.
826e3bc to
f18187b
Compare
…ression Builds on top of MemPalace#442 to add two improvements: 1. Expose embedding device via MEMPALACE_EMBEDDING_DEVICE env var / embedding_device config key. This lets Apple Silicon users set device='mps' and NVIDIA users set device='cuda' to dramatically speed up embedding generation during mempalace mine (5-15x measured on M-series). 2. Ergonomic default: setting only MEMPALACE_EMBEDDING_DEVICE automatically uses sentence-transformers/all-MiniLM-L6-v2 (same weights as ChromaDB's default ONNX embedder), so users don't have to know the model name to get GPU acceleration, and existing palaces remain vector-compatible. 3. Fix a regression in MemPalace#442: when no model is configured, get_embedding_function() used to return None, which newer ChromaDB rejects with 'You must provide an embedding function' at collection.add() time. Now returns ChromaDB's DefaultEmbeddingFunction() explicitly, restoring the pre-MemPalace#442 default behavior and making tests/test_convo_miner.py pass again. All 552 tests pass, including 7 new tests covering: - embedding_device property reads from env var and config.json - device is passed through to SentenceTransformerEmbeddingFunction when set - device alone activates the default model - device is NOT passed when unset (preserves original MemPalace#442 call signature) - device can be set via config.json independent of model
@NickShtefan fixed the CLI path in PR MemPalace#442 (searcher.py) by re-raising EmbeddingModelMismatchError past the generic except clause. The MCP path had the same bug via two additional swallows: - mcp_server.py:_get_collection() wraps _palace_get_collection in except Exception: return None, so every tool using _get_collection (status, list_wings, list_rooms, get_taxonomy, check_duplicate, search, etc.) silently showed "No palace found" instead of the actual mismatch error. - The top-level dispatcher in handle_request() catches Exception and returns hardcoded "Internal tool error", so even if the wrapper re-raises, the actionable message (stored vs current model + recovery instructions) never reaches the caller. Fix in three spots: 1. searcher.py: re-raise EmbeddingModelMismatchError in search() and search_memories() before the generic except (matches @NickShtefan's fix). 2. mcp_server.py:_get_collection(): re-raise EmbeddingModelMismatchError before the generic swallow. 3. mcp_server.py handle_request(): catch EmbeddingModelMismatchError explicitly and return the message as a JSON result so MCP clients see the real error and recovery instructions.
|
@FabioLissi cherry-picked your MCP fix (e64cdd8) — thanks. PR rebased on latest upstream/main, all 604 tests pass. Ready for review. |
|
Hey @NickShtefan We would like to include for the next release once the plugin based storage layer is added, please wait to rebase |
Embedding model is bound to the palace at init time via collection metadata and changeable only through re-mine. Removes all embedding- related environment variables in favor of explicit CLI flags and auto-detection. Features: - `mempalace init --model <name>` binds embedding model at creation - `mempalace re-mine --model <new>` migrates to a different model - `mempalace init --chunk-size N` configures chunk size per palace - Auto-detect device: cuda > mps (arm64 only) > cpu - Embedding model mismatch detection with clear error messages - Legacy palace auto-migration (stamps "chromadb-default") - Backend seam abstraction for storage layer - MCP ping health checks Removed env vars: - MEMPALACE_EMBEDDING_MODEL (use --model flag) - MEMPALACE_EMBEDDING_DEVICE (auto-detected) - MEMPALACE_CHUNK_SIZE / MEMPALACE_CHUNK_OVERLAP (use --chunk-size) - MEMPALACE_FORCE_EMBEDDING (use --force flag) Benchmarked on 247 Russian blog posts (intfloat/multilingual-e5-base): 758 drawers, 49ms avg search, 0.719 avg similarity, 100% high-relevance Co-authored-by: Fabio Lissi <fabio.lissi@gmail.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…lection After PR MemPalace#442 moved the embedding model from MEMPALACE_EMBEDDING_MODEL env var to collection metadata, mcp_server._get_collection() stopped passing embedding_function to ChromaDB. This made the MCP server silently fall back to ChromaDB's built-in default (all-MiniLM-L6-v2, dim 384), regardless of what model the palace was actually built with. For palaces created via `mempalace init --model intfloat/multilingual-e5-base` (the headline use case of MemPalace#442), this means inserts and queries through MCP would attempt 384-dim embeddings against 768-dim stored vectors and fail with a dimension mismatch. searcher.py + cli.py work correctly because they go through palace.get_collection() — only the MCP path was broken. This commit makes _get_collection mirror palace.get_collection's behaviour: - Read embedding_model from collection metadata via read_collection_metadata - Instantiate matching embedding function via get_embedding_function - Pass that function explicitly to ChromaDB - Stamp embedding_model into metadata for legacy palaces (matches the legacy migration path in palace.get_collection) Local-only commit for now; will be evaluated for inclusion into PR MemPalace#442 or a follow-up upstream PR.
37d6177 to
8e4319c
Compare
|
Quick context on why I came back to this now: I wanted to update my personal install to the latest Everything verified against my live palace (15K+ drawers, Following @igorls's earlier note that this should land "once the plugin based storage layer is added" — happy to wait for #1196 (SQLiteVec backend) before merging if that's still the plan, or merge sooner if the storage layer timing has shifted. Whatever works for you. |
|
Following over from #390 where @NickShtefan tied the threads together — a few thoughts on direction. The bind-chunk-size-to-collection-metadata property is the load-bearing piece. The reason is the silent-drift failure mode: change Composition with #1024. Different layers; both wanted:
If maintainer direction is to wait for #1196 (SQLiteVec backend) per @igorls's earlier note, that's reasonable — the plugin-storage layer changes the metadata-binding shape and rebasing now risks a third rebase later. The intermediate state where users have #1024 (config override) but not #442 (collection-bound) is acceptable as long as the gap is documented in CHANGELOG: "changing The (For context: the fork has #1024-equivalent in production but not #442-equivalent; the silent-drift failure mode you describe hasn't bitten us yet only because we haven't changed embedding model. So I'm reading #442 as the durable fix, #1024 as the bridge.) |
|
Adding production-data point + use case in support of this PR. I've been running a local patch that effectively does the minimum-viable version of what this PR proposes — hardcoded Use case — literary / contemplative semantic memory:
Why
Trade-off vs e5-base: heavier (~2.3 GB vs ~1.1 GB), longer initial mining time, but the larger context window matters when chunks are long-form literary rather than short-form note-taking. Empirical:
Small suggestion for the PR: README's recommended-models section could note `BAAI/bge-m3` as the option for users with longer-context literary / scholarly content. `e5-base` for most, `bge-m3` for the long-passage case. Two recommendations cover the main use-case axis (note-taking vs long-form). The init-time binding via collection metadata is the right architecture — strongly agreed. Avoids the env-var-drift class of bugs. Looking forward to retiring my local patch. (Happy to test the PR branch against this corpus once my current mine completes and report back.) |
Surfaces the model-selection axis introduced by this PR (chromadb-default vs intfloat/multilingual-e5-base vs BAAI/bge-m3) in the README so users land on the model that fits their content shape (short-form English vs multilingual notes vs long-form literary). Notes that the binding is stamped into collection metadata at init time and that switching models is done via re-mine. No code changes. Per @davidglidden's suggestion in the PR thread. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Pushed @davidglidden — if you have time after your current mine completes, a test report on the PR branch against your literary corpus would be a useful production-data point alongside @sidonsoft's earlier findings; happy to incorporate any wording tweaks if |
|
@NickShtefan -- Preparing a curated mine covering English, French, and Spanish to run overnight. Will post a test report tomorrow. Thanks for reaching out. |
PR #442 test report — bge-m3 on literary / multilingual corpusTest branch: Following up on @NickShtefan's request for a production-data point against my literary corpus alongside @sidonsoft's earlier findings. Reporting here so the PR thread has a second independent confirmation that bge-m3 binding works end-to-end. SetupInstalled PR branch in a dedicated venv ( Corpus11 markdown files, ~8 MB, 21,735 drawers, 25 min mining wall-clock on M-series CoreML. Curated to exercise bge-m3's stated multilingual + long-form claims:
Results — cross-language semantic retrieval worksTen queries crossing language boundaries. Trimmed scores; full output available. Q: "italic for emphasis" (English) →
Pure-semantic match (BM25=0.0) at higher cosine than the lexical English hit. The PR's central claim verified on first query. Q: "intimate space of dwelling" (English) → Bachelard La Poétique de l'espace (French) ×4, cos 0.585-0.612, all BM25=0.0. English query, French book, no lexical overlap — the book retrieves entirely on semantic similarity. Q: "approche typographique" (French — the French technical term for letter-spacing) →
Q: "letter spacing between pairs of characters" (English) → top 3 English (Saenger, Hochuli), but Q: "lineas viudas y huerfanas" (Spanish for "widow and orphan lines") →
Q: "comillas españolas tipos" (Spanish, "Spanish quotation mark types") → 4 Sousa hits, cos 0.609-0.636 — clean native-language retrieval. Q: "the dead and the unborn" (English, Harrison's signature phrase) → 4 Harrison hits (Forests + Juvenescence) cos 0.52-0.53 — long-form scholarly retrieval on a specialized conceptual vocabulary. Q: "humic burial dead" (Harrison's coinage) → 4 Forests hits cos 0.498-0.536 — even his idiosyncratic specialist vocabulary retrieves coherently. Q: "restraint in ornament" (English, Tschichold/Bringhurst posture) →
Cross-tradition AND cross-language. Notable that "restraint in ornament" retrieves Harrison's Forests passage on civilizational restraint — the embedding space is bridging conceptual register across domains, not just lexical surface. What this corpus shows that wasn't obvious beforeThree patterns worth flagging for the README / PR description: 1. The 2. The 3. Cosine scores on bge-m3 cluster lower than on default MiniLM (typical range 0.50-0.70 for clear conceptual matches vs MiniLM's 0.7-0.9 on the same kind of corpus). Worth a note in the README that bge-m3's score range is different — operators accustomed to MiniLM scoring may misread a 0.60 cosine as "weak" when it actually represents strong semantic match in bge-m3's space. The empirical scores above suggest a working threshold of cosine ≥ 0.50 (vs MiniLM's typical ≥ 0.65) for relevance on this corpus. ConclusionPR #442 delivers what it promises on this corpus. The cross-language semantic retrieval works on first query, even on technical typographic vocabulary that's domain-specific in each language (approche / kerning / interletraje, cursiva / italique / italic, viuda / veuve / widow). Long-form scholarly content (Harrison trilogy, Saenger, Bachelard) retrieves coherently with the 8192-token context window. Retires my local hardcoded bge-m3 patch in Two small footnotes (the sentence-transformers install gate and the Happy to test on a larger corpus (full ~600 MB literary library) once #442 lands on |
|
Quick follow-up to the test report above, with a finding that might be useful for the path forward. After 3.3.5 shipped on I expected this to be the kind of thing where I'd carry forward my old hardcoded patch from April. It turned out to be much smaller than that — and the smallness is the point. The whole patch is two hunksHunk 1 — Hunk 2 — That's it. Combined diff against End-to-end verified: collection metadata stamps What I noticed
In that architecture, swapping the embedding model is mostly just a swap at the seam. The reason #442 is a ~999-line PR isn't the binding itself — it's everything around it:
All of that is real work and worth having. Just noticing that it's somewhat independent of the underlying binding. One thought, only if it'd be useful: depending on how the rebase against current A caveatThe patch I described is local-only for our chamber-library mine. Not proposing it as the path forward upstream — your architecture in #442 (configurable model + stamping + migration + mismatch detection) is the right shape for a general-purpose tool. The minimal patch only works because we have a fixed corpus and a single model in mind. Sharing it as evidence that the core piece is small in 3.3.5's architecture, in case that's useful context. Will report back on the chamber-library full mine when it completes. Plan: ~700k drawers across French/Spanish/English typography sources, paleography, scholarly philosophy. Mining on the local-patched 3.3.5 with single-writer discipline (MCP off during the mine), using 3.3.5's Thanks again for #442 — the bridge it builds between English-leaning ChromaDB defaults and genuinely multilingual corpora matters a lot for what this corpus is. |
Two follow-ups to PR MemPalace#442 surfaced by davidglidden's 8 MB French/Spanish/ English typography corpus test on commit c652c2f (MemPalace#442 (comment)). 1) `--auto-mine` ignored `--palace`. `_maybe_run_mine_after_init` read `cfg.palace_path` while the matching init step honoured `args.palace`, so `mempalace init --palace X --auto-mine` initialised X but mined into the default palace — leaving X empty and polluting the default. Use the same `getattr(args, "palace", None)` fallback pattern as `cmd_init`. 2) Silent fallback to MiniLM when sentence-transformers was missing for a non-default model. `init --model BAAI/bge-m3` against a venv without the `[multilingual]` extra succeeded, stamped `embedding_model = BAAI/bge-m3` into collection metadata, and bound the embedding function to all-MiniLM-L6-v2 (384d, English-leaning) — every later query returned the wrong model's neighbours under a multilingual label. Probe `import sentence_transformers` directly and raise ImportError with the install hint. (ChromaDB transmutes the missing dep to a generic ValueError inside `SentenceTransformerEmbeddingFunction` so catching ImportError around the chromadb call alone is insufficient.) Cascading fix in `palace.get_collection`: the non-silent embedder change exposed that chromadb's internal EF-conflict validator fires before our domain `EmbeddingModelMismatchError` could, masking the actionable message. Read `embedding_model` from collection metadata up front and raise `EmbeddingModelMismatchError` (or take the explicit force-path that opens the collection with the stored EF and re-stamps metadata) before instantiating the new EF. Also avoids loading a heavy model only to discard it on mismatch. Tests: - Replace `TestGetEmbeddingFunctionFallback` (asserted silent fallback) with `TestGetEmbeddingFunctionMissingDependency` (asserts hard raise + install hint), plus a guard that the default `chromadb-default` path is unaffected when sentence-transformers is missing. - Add a `sentence_transformers` sys.modules stub fixture so the existing mocked-`SentenceTransformerEmbeddingFunction` tests still pass on a stock `[dev]` env (no `[multilingual]` extra required for unit tests). - Add `test_maybe_run_mine_honours_args_palace_over_cfg` and `test_maybe_run_mine_falls_back_to_cfg_when_palace_unset` regression guards for fix #1. Full suite green (1339 passed). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@davidglidden — thanks for the production test report and the minimal-patch follow-up. Pushed Both bugs fixed:
One cascading fix the non-silent embedder exposed: dropping the fallback meant chromadb's internal EF-conflict validator ( Tests:
Full suite green (1339 passed). README intentionally left for a separate docs-only follow-up — the |
|
Following up on the 2026-05-11 validation report. A 1.29M-drawer mine on this branch surfaced two findings worth feeding back — neither critiques the bge-m3 integration itself, only the scale validation methodology and one small ancillary question. Honest amendment to the 2026-05-11 validationRe-examining the test palace post-hoc: the 21.7k mine succeeded for search purposes (FTS5 + brute-force vector fallback returned correct results) but the VECTOR segment's The Stage-2 integrity gate in Surfacing it now in case future PR #442 validations want to include a Scale finding (filed separately upstream)A subsequent 1.29M-drawer mine of a larger corpus on this same branch did not complete cleanly. 37 quarantine events compounded into a 38,966-row Full forensic + reproduction recipe + suggested fixes filed as #1526. The mechanism is embedding-model-adjacent but the fix lives in the chromadb integration layer ( Suggestion for scale validation methodologyThe 11-book subset used in 2026-05-11 was all small files (largest ~5 MB). The cascade pattern doesn't surface on small-file workloads regardless of total drawer count — it surfaces specifically when one source's embed time exceeds the 300s threshold. If PR #442 wants to characterize bge-m3 behavior beyond the current validation: including at least one ~25 MB plain-text single-source file in the validation corpus would surface the cascade before merge. The drawer-count axis (21k → 100k → 1M) is fine but secondary; the single-source-size axis is what triggers Mode 1. (The 21.7k validation also implicitly demonstrated that the Stage-2 integrity gate handles small benign drift correctly — that part of the gate works. The cascade is specifically when the gap is sustained for hours with the pickle in mid-write states.) One side questionThe 21.7k test palace's (If intentional: ignore. If oversight: the closets-collection creation path may need the same Environment details + the local commits atop Thank you for the PR work — the bge-m3 integration itself produces clean multilingual results at the scales that worked (cross-language retrieval verified at 21.7k; semantic similarity ranges sensible). The above is scale-and-methodology context, offered as a useful early indicator. |
|
@davidglidden thanks for the 1.29M-scale run and the careful write-up.
Thanks again for the production data points. |
…compressed) get_closets_collection and the compressed-collection path call get_collection without an explicit model, so secondary collections silently fell back to the ChromaDB default while drawers used the configured model. In a multilingual palace this left closets/compressed on MiniLM, degrading non-English search on the index layer (closets is the searchable index). get_collection now inherits the palace's canonical model (read from the drawers collection) when creating a non-default collection that has no stored model and no explicit override. Stored models are never overridden, so existing palaces are unaffected and a re-mine migrates them. Adds tests for closets/compressed inheritance and the default-stays-default case. Addresses @davidglidden's closets finding on MemPalace#442.
|
@davidglidden pushed the closets/compressed stamp fix (2b28cf7), closing the loop on your finding above. Secondary collections were created without threading the palace's model, so they fell back to the ChromaDB default while drawers used the configured one. get_collection now inherits the palace's canonical model (from the drawers collection) when creating a non-default collection that has no stored model and no explicit override; stored models are never overridden, so existing palaces are unaffected (a re-mine migrates them). Added tests for closets/compressed inheritance and the default-stays-default case. |
What does this PR do?
Adds configurable multilingual embedding model support with init-time binding and production-tested infrastructure for safe model switching. The embedding model becomes a property of the palace itself (stamped in collection metadata at
init), not an environment variable that can drift between processes.Closes #903 — bug: embedding model mismatch — MCP server uses MiniLM (384-dim) while ingest can use mpnet (768-dim). The fix routes
mcp_server._get_collection()through the same path the rest of the codebase already uses, so MCP queries always use the embedding function the palace was built with.This PR is the embedding infrastructure layer that #488 (multilingual classification by @EndeavorYen) builds on top of. Agreed coordination: merge #442 first, then rebase #488 on top.
Related parallel work: #912 (@OmkarKirpan) targets the same
_get_collectionmismatch bug with a narrower scope. This PR covers that fix plus configurable model name, init-time binding, mismatch detection,re-minecommand, and MCP error propagation. See cross-linked discussion for coordination.Changes
Init-time embedding model binding (replaces env vars):
mempalace init <dir> --model intfloat/multilingual-e5-base— model is stamped in ChromaDB collection metadata at palace creationmempalace re-mine --model <new>— backup → drop → re-mine with a different modelmempalace init --chunk-size N --chunk-overlap M— chunk size is per-palace, no env varcuda>mps(arm64 only) >cpu(noMEMPALACE_EMBEDDING_DEVICEneeded)MEMPALACE_EMBEDDING_MODEL,MEMPALACE_EMBEDDING_DEVICE,MEMPALACE_CHUNK_SIZE,MEMPALACE_CHUNK_OVERLAP,MEMPALACE_FORCE_EMBEDDINGMCP fix (closes #903):
mcp_server._get_collection()now readsembedding_modelfrom collection metadata and resolves the matchingembedding_function— previously it silently fell back to ChromaDB'sall-MiniLM-L6-v2regardless of the palace's actual modelEmbeddingModelMismatchErrorpropagated through_get_collection()andhandle_request()(not swallowed by genericexcept)searcher.pypaths were already correct viapalace.get_collection()— only the MCP path was broken; this commit mirrors that behaviour for MCPOther:
mempalace/backends/chroma.py) for pluggable storage layer (RFC 001 §10)embedding_modelmetadata are stamped aschromadb-defaultEmbeddingModelMismatchErrorwith clear recovery instructionspyproject.toml—[multilingual]optional dependencytests/test_multilingual.py(28 tests),tests/test_remine.py(re-mine flow)Backward compatible: existing palaces continue working with their original model (auto-stamped as
chromadb-default); no re-mining required.Production data (Russian): search scores 0.19→0.77 with
intfloat/multilingual-e5-baseon the same 247-blog-post corpus. Note:paraphrase-multilingual-MiniLM-L12-v2(proposed in #231) has a 128-token window that truncates chunks and degrades quality —e5-base(512 tokens) is the recommended default for multilingual.How to test
Checklist
1336 passed, 0 failedon rebased branch (~6 min, full suite excluding benchmarks)ruff check .)tool_diary_readreturns full resultsContributors
_get_collectionfixCloses #903
Refs #390 (chunk size), #231 (multilingual config), #488 (multilingual classification builds on top), #912 (parallel narrower fix for same bug)