Skip to content

feat: add configurable multilingual embedding model support#442

Open
NickShtefan wants to merge 5 commits into
MemPalace:developfrom
NickShtefan:feat/multilingual-embedding
Open

feat: add configurable multilingual embedding model support#442
NickShtefan wants to merge 5 commits into
MemPalace:developfrom
NickShtefan:feat/multilingual-embedding

Conversation

@NickShtefan

@NickShtefan NickShtefan commented Apr 9, 2026

Copy link
Copy Markdown

What does this PR do?

Adds configurable multilingual embedding model support with init-time binding and production-tested infrastructure for safe model switching. The embedding model becomes a property of the palace itself (stamped in collection metadata at init), not an environment variable that can drift between processes.

Closes #903 — bug: embedding model mismatch — MCP server uses MiniLM (384-dim) while ingest can use mpnet (768-dim). The fix routes mcp_server._get_collection() through the same path the rest of the codebase already uses, so MCP queries always use the embedding function the palace was built with.

This PR is the embedding infrastructure layer that #488 (multilingual classification by @EndeavorYen) builds on top of. Agreed coordination: merge #442 first, then rebase #488 on top.

Related parallel work: #912 (@OmkarKirpan) targets the same _get_collection mismatch bug with a narrower scope. This PR covers that fix plus configurable model name, init-time binding, mismatch detection, re-mine command, and MCP error propagation. See cross-linked discussion for coordination.

Changes

Init-time embedding model binding (replaces env vars):

  • mempalace init <dir> --model intfloat/multilingual-e5-base — model is stamped in ChromaDB collection metadata at palace creation
  • mempalace re-mine --model <new> — backup → drop → re-mine with a different model
  • mempalace init --chunk-size N --chunk-overlap M — chunk size is per-palace, no env var
  • Auto-detect device: cuda > mps (arm64 only) > cpu (no MEMPALACE_EMBEDDING_DEVICE needed)
  • Removed env vars: MEMPALACE_EMBEDDING_MODEL, MEMPALACE_EMBEDDING_DEVICE, MEMPALACE_CHUNK_SIZE, MEMPALACE_CHUNK_OVERLAP, MEMPALACE_FORCE_EMBEDDING

MCP fix (closes #903):

  • mcp_server._get_collection() now reads embedding_model from collection metadata and resolves the matching embedding_function — previously it silently fell back to ChromaDB's all-MiniLM-L6-v2 regardless of the palace's actual model
  • EmbeddingModelMismatchError propagated through _get_collection() and handle_request() (not swallowed by generic except)
  • The CLI and searcher.py paths were already correct via palace.get_collection() — only the MCP path was broken; this commit mirrors that behaviour for MCP

Other:

  • Backend seam abstraction (mempalace/backends/chroma.py) for pluggable storage layer (RFC 001 §10)
  • Legacy palace auto-migration: palaces without embedding_model metadata are stamped as chromadb-default
  • EmbeddingModelMismatchError with clear recovery instructions
  • pyproject.toml[multilingual] optional dependency
  • New tests: tests/test_multilingual.py (28 tests), tests/test_remine.py (re-mine flow)

Backward compatible: existing palaces continue working with their original model (auto-stamped as chromadb-default); no re-mining required.

Production data (Russian): search scores 0.19→0.77 with intfloat/multilingual-e5-base on the same 247-blog-post corpus. Note: paraphrase-multilingual-MiniLM-L12-v2 (proposed in #231) has a 128-token window that truncates chunks and degrades quality — e5-base (512 tokens) is the recommended default for multilingual.

How to test

# Install with multilingual support
pip install -e ".[dev,multilingual]"

# Run tests
python -m pytest tests/ -v --ignore=tests/benchmarks

# Functional test — init with multilingual model
mempalace init ~/some-content/ --model intfloat/multilingual-e5-base
mempalace mine ~/some-content/
mempalace search "запрос на вашем языке"

# Test model switching via re-mine
mempalace re-mine --model all-mpnet-base-v2
mempalace search "english query"

# MCP path (closes #903)
# Connect via MCP client; queries now use the palace's actual embedding model,
# not ChromaDB's default. No more silent dimension mismatch.

Checklist

  • Tests pass — 1336 passed, 0 failed on rebased branch (~6 min, full suite excluding benchmarks)
  • Linter passes (ruff check .)
  • Backward compatible (legacy palaces auto-stamped on first read)
  • MCP fix verified against live palace (15K+ drawers, e5-base) — tool_diary_read returns full results

Contributors

  • @NickShtefan — core embedding config, init-time binding refactor, chunk size, production testing, MCP _get_collection fix
  • @FabioLissi — device support, mismatch detection, re-mine, original MCP error propagation prototype

Closes #903
Refs #390 (chunk size), #231 (multilingual config), #488 (multilingual classification builds on top), #912 (parallel narrower fix for same bug)

@sidonsoft

Copy link
Copy Markdown

Real-World Testing Results

We applied this patch to our production MemPalace instance and re-mined ~150 files across 3 projects. Here are the findings:

📊 Impact

Metric Before (800/100) After (400/50) Change
Total drawers 10,000+ 5,873 -41%
HexClamp project 30,653 drawers 665 drawers -98%
Search relevance Poor (truncation artifacts) Good

🔍 Search Quality Improvement

Before: Searching for "extension config" returned mypy type cache data (.data.json files) with match scores around -0.36

After: Same query returns actual extension code and documentation with match scores around 0.19 (positive correlation = better semantic match)

🐛 Additional Fix Applied

We also excluded mypy cache files which were polluting search results:

# miner.py line 304
) or filename.endswith((".data.json", ".meta.json")):
    continue

This reduced our palace size by an additional ~24,000 drawers of type metadata that have no semantic search value.

✅ Recommendation

Merge this PR. The smaller chunk size:

  1. Fits within the embedding model's 256-token limit (~512 chars)
  2. Improves search relevance significantly
  3. Reduces storage requirements
  4. Prevents silent truncation of important context

The overlap reduction from 100→50 is also appropriate since chunks are now half the size.


Test environment:

  • mempalace 3.0.0
  • ChromaDB 1.5.7
  • Default embedding: all-MiniLM-L6-v2 (ONNX)
  • Projects mined: pi agent extensions, HexClamp CLI, openclaw workspace

@NickShtefan

Copy link
Copy Markdown
Author

This PR covers the embedding function centralization part of #231 — configurable model via env var / config.json, all 7 ChromaDB consumer modules patched, optional [multilingual] dependency, and chunk size fix for #390.

I tested this on 245 Russian blog posts (5 years of content). With intfloat/multilingual-e5-base, relevance scores jumped from 0.19–0.40 to 0.70–0.77. Happy to share the full benchmark data if useful.

If @EndeavorYen's more comprehensive implementation lands first (room classification, entity detection, spellcheck), happy to close this — it can serve as a reference for the embedding centralization pattern. Either way, our real-world Russian-language benchmarks might be useful validation for the final PR.

@FabioLissi

Copy link
Copy Markdown

Nice PR! The env → config.json → default chain, module-level caching, and graceful fallback when sentence-transformers isn't installed all feel really clean. The CHUNK_SIZE 800→450 fix is a great bonus too.

One small thing I'd love to see on top of this (happy to do it as a follow-up PR if you'd rather keep #442 focused on multilingual): also expose the embedding device.

Since get_embedding_function() already instantiates SentenceTransformerEmbeddingFunction(model_name=...), and that class accepts a device kwarg, it's a pretty small addition. A sibling embedding_device property on MempalaceConfig reading MEMPALACE_EMBEDDING_DEVICE / config.json, and conditionally passing it through:

kwargs = {"model_name": model_name}
if cfg.embedding_device:
kwargs["device"] = cfg.embedding_device
_embedding_function = SentenceTransformerEmbeddingFunction(**kwargs)

The reason I'm asking: MemPalace currently disables CoreML in init.py (for good reason, it segfaults on M-series), so the default path is pinned to CPU. On a large corpus like /.claude/projects/ (1.5 GB of JSONL for me), that's 30 to 90 minutes of embedding on an otherwise-idle Apple Silicon GPU. With device='mps' (or 'cuda' for NVIDIA folks), that drops dramatically, and since all-MiniLM-L6-v2 produces identical vectors regardless of backend, existing palaces stay compatible.

It should be pretty low-risk:
• Backward compatible (no device set = current behavior exactly)
• Fully opt-in
• Uses the PyTorch path, not the ONNX/CoreML one, so it sidesteps the segfault entirely
• Your existing try/except already handles the "device unavailable" failure mode gracefully

One minor gotcha worth mentioning in docs: SentenceTransformerEmbeddingFunction caches models by name only (not (name, device)), so switching devices mid-process returns a stale cached model. Only matters for tests/benchmarks, but good to know.

Very happy to send a follow-up PR if you'd prefer to keep this one scoped. Either way, thanks for the contribution, this unlocks a lot!

FabioLissi added a commit to FabioLissi/mempalace that referenced this pull request Apr 9, 2026
…ression

Builds on top of MemPalace#442 to add two improvements:

1. Expose embedding device via MEMPALACE_EMBEDDING_DEVICE env var / embedding_device config key.
   This lets Apple Silicon users set device='mps' and NVIDIA users set device='cuda' to
   dramatically speed up embedding generation during mempalace mine (5-15x measured on M-series).

2. Ergonomic default: setting only MEMPALACE_EMBEDDING_DEVICE automatically uses
   sentence-transformers/all-MiniLM-L6-v2 (same weights as ChromaDB's default ONNX embedder),
   so users don't have to know the model name to get GPU acceleration, and existing palaces
   remain vector-compatible.

3. Fix a regression in MemPalace#442: when no model is configured, get_embedding_function() used to
   return None, which newer ChromaDB rejects with 'You must provide an embedding function'
   at collection.add() time. Now returns ChromaDB's DefaultEmbeddingFunction() explicitly,
   restoring the pre-MemPalace#442 default behavior and making tests/test_convo_miner.py pass again.

All 552 tests pass, including 7 new tests covering:
- embedding_device property reads from env var and config.json
- device is passed through to SentenceTransformerEmbeddingFunction when set
- device alone activates the default model
- device is NOT passed when unset (preserves original MemPalace#442 call signature)
- device can be set via config.json independent of model
@FabioLissi

Copy link
Copy Markdown

Hey, one more thing while testing this branch locally. I think there's a small regression in the default path.

When MEMPALACE_EMBEDDING_MODEL isn't set (the case for every existing user post-upgrade), get_embedding_function() returns None, which then gets passed explicitly to every ChromaDB call site:

col = client.get_collection(collection_name, embedding_function=ef) # ef is None

In chromadb 0.6.3 (the version I'm using), embedding_function=None is no longer equivalent to omitting the argument. The collection gets created with a null embedder, and then collection.add() raises:

ValueError: You must provide an embedding function to compute embeddings.

This breaks tests/test_convo_miner.py::test_convo_mining on this branch. Repro:

git fetch upstream pull/442/head:pr-442 && git checkout pr-442
pytest tests/test_convo_miner.py -x

Passes on main, fails on pr-442.

The fix is a one-liner in config.py: return DefaultEmbeddingFunction() instead of None for the "no model configured" path, so the 14 call sites you patched can keep passing ef unconditionally:

if not model_name:
from chromadb.utils.embedding_functions import DefaultEmbeddingFunction
_embedding_function = DefaultEmbeddingFunction()
return _embedding_function

Same pattern in the except block for the missing-sentence-transformers case.

Happy to send a small follow-up PR against your branch if easier, or you can patch it directly

FabioLissi added a commit to FabioLissi/mempalace that referenced this pull request Apr 9, 2026
Port PR MemPalace#416's parallelization pattern from miner.py to convo_miner.py.
PR MemPalace#416 only batched project mining (mempalace mine <dir>), leaving
conversation mining (mempalace mine <dir> --mode convos) on the slow
per-chunk collection.add() path. Since convo mining is the primary
ingest path for large corpora like ~/.claude/projects/ (~1.5 GB of
Claude Code JSONL), this omission negated most of the speedup potential
from both PR MemPalace#416 (batching) and PR MemPalace#442 (GPU embedding).

Changes:

1. Extract process_convo_file_cpu() — pure CPU worker that normalizes,
   chunks, detects room, and builds drawer records. Thread-safe by
   construction: no ChromaDB calls, no shared state, all inputs
   passed explicitly. Returns (source_file, room, records, room_counts_delta)
   or None for skipped files.

2. Rewrite mine_convos() ingest loop:
   - Pre-filter pending files with file_already_mined() (sequential,
     matches PR MemPalace#416's pattern)
   - Submit all pending files to ThreadPoolExecutor with
     MAX_WORKERS = min(32, cpu_count()*2) for parallel normalize/chunk
   - Main thread accumulates records into batch_ids/docs/metas lists
     and flushes via collection.upsert() every BATCH_SIZE (128) records
   - try/finally around the executor guarantees final flush_batch()
     runs even if a worker raises, preventing silent loss of up to
     BATCH_SIZE-1 pending drawers
   - Per-worker exceptions are caught and logged instead of aborting
     the whole run (each file is independent)

3. Keep dry_run path sequential — matches miner.py, preserves original
   output formatting (per-file [DRY RUN] lines, room distribution),
   uses the same extracted worker for consistency.

4. Switch collection.add() -> collection.upsert() — idempotent, removes
   the try/except 'already exists' dance, matches miner.py.

Performance expectations (M-series Mac with MEMPALACE_EMBEDDING_DEVICE=mps):

Before (single-file sequential loop + per-chunk .add()):
  502 drawers in 12.3s = 40.7 drawers/s

After (parallel reads + batched upserts):
  Expected ~3-5x improvement from batching alone (GPU finally gets
  meaningful batch sizes), plus another ~2-3x from parallelizing
  the normalize() step on large JSONL files. Combined: ~5-15x.

All 556 tests still pass, including tests/test_convo_miner.py which
exercises the real ChromaDB write path end-to-end.

No changes required to the public API. Callers (cli.py, mcp_server.py)
are unaffected.
NickShtefan pushed a commit to NickShtefan/mempalace that referenced this pull request Apr 9, 2026
…ression

Builds on top of MemPalace#442 to add two improvements:

1. Expose embedding device via MEMPALACE_EMBEDDING_DEVICE env var / embedding_device config key.
   This lets Apple Silicon users set device='mps' and NVIDIA users set device='cuda' to
   dramatically speed up embedding generation during mempalace mine (5-15x measured on M-series).

2. Ergonomic default: setting only MEMPALACE_EMBEDDING_DEVICE automatically uses
   sentence-transformers/all-MiniLM-L6-v2 (same weights as ChromaDB's default ONNX embedder),
   so users don't have to know the model name to get GPU acceleration, and existing palaces
   remain vector-compatible.

3. Fix a regression in MemPalace#442: when no model is configured, get_embedding_function() used to
   return None, which newer ChromaDB rejects with 'You must provide an embedding function'
   at collection.add() time. Now returns ChromaDB's DefaultEmbeddingFunction() explicitly,
   restoring the pre-MemPalace#442 default behavior and making tests/test_convo_miner.py pass again.

All 552 tests pass, including 7 new tests covering:
- embedding_device property reads from env var and config.json
- device is passed through to SentenceTransformerEmbeddingFunction when set
- device alone activates the default model
- device is NOT passed when unset (preserves original MemPalace#442 call signature)
- device can be set via config.json independent of model
@NickShtefan

Copy link
Copy Markdown
Author

Thanks @FabioLissi — great catch on the None regression, and the device support is a clean addition. Cherry-picked your commit (5cf62ee) onto this branch. PR now includes both the multilingual embedding centralization and the DefaultEmbeddingFunction + device fix.

Co-authored-by: Fabio Lissi fabio.lissi@gmail.com

@FabioLissi

FabioLissi commented Apr 10, 2026

Copy link
Copy Markdown

@NickShtefan Thanks again for pulling in the device support commit. One follow-up thought I wanted to float while it's still fresh.
Now that this PR lets users configure the embedding model, there's an adjacent footgun that's worth considering: if a user switches models on a populated palace, either it silently corrupts search (same-dim change → incompatible vectors in the same collection), or it crashes mid-mine (dim change → ChromaDB rejects the insert). Neither failure mode is obvious, and recovery means rm -rf /.mempalace/palace/ and re-mining everything.
A tiny detection mechanism would close this: store the embedding model name in collection metadata at creation, and on subsequent opens, compare it to the currently configured model. On a mismatch, raise a clear error with recovery instructions.
I'm happy to write it either way, but want to check first: does this fit better as:

  • Part of this PR (natural extension of "this PR lets you configure embedding models"), or
  • A follow-up PR after this one merges (keeps feat: add configurable multilingual embedding model support #442 scoped to the config feature), or
  • Not something you want at all (maybe there's a reason the current behavior is intentional that I'm missing)?
    No strong preference from my end. Just want to avoid duplicating work or stepping on toes.

@web3guru888 web3guru888 left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The configurable embedding model support is welcome — we've been using all-MiniLM-L6-v2 for English-only discovery mining, and being able to swap to e5-base for multilingual domains without forking would be useful.

The implementation is clean: env var → config file → default fallback, with proper caching. The device support (mps/cuda) is a nice ergonomic touch.

However, this PR includes an unrelated breaking change that should be split out:

-CHUNK_SIZE = 800  # chars per drawer
-CHUNK_OVERLAP = 100  # overlap between chunks
+CHUNK_SIZE = 450  # chars per drawer
+CHUNK_OVERLAP = 50  # overlap between chunks

Halving the chunk size from 800→450 and the overlap from 100→50 is a significant behavioral change that:

  1. Will produce ~2x more drawers for the same corpus (more ChromaDB storage, slower queries)
  2. Makes existing palaces inconsistent — old drawers are 800-char, new ones 450-char
  3. Has nothing to do with multilingual embedding support

If smaller chunks improve retrieval for multilingual models (plausible — shorter contexts embed more precisely), that's worth doing, but it should be a separate PR with its own benchmarks showing the improvement. Bundling it here means adopters of multilingual support get a chunk size change they didn't ask for.

Also: the get_embedding_function() singleton means the embedding model is locked at first call for the entire process lifetime. This is fine for CLI usage, but in long-running MCP servers it means you can't change the model without restarting. Might be worth documenting that constraint.

@sidonsoft

Copy link
Copy Markdown

Real-World Production Findings

I opened issue #390 reporting the truncation problem, and this PR directly addresses it. We've already deployed
this change locally and have empirical data that contradicts the reviewer's concerns.

Our Setup

  • Corpus: ~6 months of AI conversation logs + project documentation
  • Original: CHUNK_SIZE = 800, OVERLAP = 100
  • Updated: CHUNK_SIZE = 450, OVERLAP = 50
  • Embedding: Default (all-MiniLM-L6-v2)

Measured Results

Metric Before (800) After (450) Change
Total drawers 5,991 5,247 -12.4%
Avg search relevance (1-5) 3.2 4.1 +28%
Truncated embeddings ~35% of chunks 0% Fixed
Search latency (p50) 240ms 180ms -25%

Why Drawers Decreased (Not Increased)

The reviewer's concern about "2x more drawers" didn't materialize because:

  1. No wasted drawers: With 800-char chunks, ~35% were being truncated at embedding time, creating ineffective drawers that
    didn't contribute to retrieval
  2. Better semantic boundaries: 450 chars fits complete thoughts better than 800 chars (which often split mid-context)
  3. Reduced duplication: Overlap of 50 is sufficient for context continuity without bloating storage

Chunk Size & Multilingual Embeddings

Important: Even with the new multilingual embedding support, the default embedding issue still applies:

  • Default model (all-MiniLM-L6-v2): 256 token limit (~512 chars) → 450-char chunks are required
  • Multilingual models (e.g., e5-base, paraphrase-multilingual): Typically 512 tokens (~1024 chars) → Could use larger
    chunks

The 450-char default ensures all users get working embeddings out of the box, regardless of which model they choose. If
someone switches to a multilingual model with a higher token limit, they can adjust the chunk size accordingly.

With custom embedding providers, users can set whatever chunk size matches their model's token limit. But the default
needs to work safely for everyone, and 450 chars is the safe upper bound for the most common embedding models.

Quality Improvement

The +28% relevance gain comes from:

  • Every embedding now contains its full intended content (no silent truncation)
  • Search vectors match against complete semantic units
  • Fewer false negatives from corrupted/partial embeddings

Recommendation

Merge this PR. The chunk size change isn't just related to multilingual support — it's a critical fix for the default
embedding model's token limit
. Our production data shows:

  • ✅ Storage decreased (not increased)
  • ✅ Search quality improved significantly
  • ✅ Works for both default and custom embedding providers
  • ✅ No palace inconsistency issues (change is backward-compatible)

The singleton constraint noted by the reviewer is worth documenting, but shouldn't block this fix. Users with long-running MCP
servers can restart to change models — that's a reasonable tradeoff for fixing silent data loss.


TL;DR: We reported the problem in #390, this PR fixes it, and our real-world deployment proves it works better than the
original. The 450-char default ensures all embedding models (default or multilingual) work without truncation.

@NickShtefan

Copy link
Copy Markdown
Author

Good point about the singleton — worth clarifying why it's intentional rather than a limitation.

Switching the embedding model isn't a hot-swap operation. Old vectors were generated in model A's vector space; new queries would use model B's vector space. Even if dimensions match, cosine similarity between incompatible spaces produces meaningless results. If dimensions differ, ChromaDB rejects the insert outright.

So after changing the model, you must re-mine the entire palace — which is a bulk operation (minutes to hours depending on corpus size). Given that a restart is already required for the re-mine to take effect, the singleton cache is the correct behavior: it prevents accidentally mixing incompatible vectors within a single process.

What would genuinely help is model mismatch detection (as @FabioLissi suggested): store the model name in collection metadata at creation time, and on subsequent opens, compare it to the current config. On mismatch → clear error with recovery instructions. That's a much better safeguard than making the singleton resettable, which would only make the "silent corruption" failure mode easier to hit.

Happy to add a documentation note about the restart requirement for model changes.

@NickShtefan

Copy link
Copy Markdown
Author

@FabioLissi Great suggestion — this is exactly the kind of safeguard that should ship alongside configurable embedding models. Without it, users can silently corrupt their palace by switching models, and the failure mode isn't obvious at all.

I'd love to include this in #442 rather than a follow-up — it makes the PR more thoughtful and robust. If you're up for writing it, go ahead! I'll cherry-pick your commit onto this branch, same as with the device support.

Thanks for flagging this and for the continued contributions!

@FabioLissi

Copy link
Copy Markdown

Three commits:

6a23c8f: Chunk size env vars
Made your 800→450 chunk size reduction configurable via MEMPALACE_CHUNK_SIZE and MEMPALACE_CHUNK_OVERLAP env vars. Your PR defaulted to 450 to fit MiniLM's 256-token window, which is the right call for the default model. On our side, we switched to BAAI/bge-base-en-v1.5 (better retrieval quality, ~63 vs ~56 on MTEB) and are running it on MPS for GPU acceleration. bge-base has a 512-token context window, double that of MiniLM, so we set MEMPALACE_CHUNK_SIZE=900 (~225 tokens, safely within the 512 limit). The 450 default was leaving half the model's capacity on the table. Making it configurable means users can size chunks to their model's actual token window, rather than being constrained by MiniLM's limits.

7aa96cc: Embedding model mismatch detection
If someone switches to a model with the same vector dimension, ChromaDB silently mixes incompatible vectors; search returns results, they're just garbage. The dimension-mismatch case is actually easy because ChromaDB rejects the insert.
The fix stores the model name in collection metadata at creation time and checks on every open. Mismatch raises a hard error with recovery instructions. MEMPALACE_FORCE_EMBEDDING=true bypasses it for power users if you switch to models with compatible vectors. Legacy palaces get silently stamped on upgrade. All seven direct ChromaDB callers now go through a single palace.get_collection() chokepoint where the check lives, rather than some going directly to ChromaDB.
Also fixed a pre-existing bug: your tests used patch.dict, which doesn't clear existing env vars from the host shell, so tests that assume a clean environment break if the machine has any MEMPALACE_EMBEDDING_* variables set. Switched to monkeypatch.delenv().

94412e8 - Re-mine command The recovery path for intentional model switches. Extracts source files from existing metadata, backs up the palace, drops, and re-mines using the exact file list (not the whole directory, so it won't pull in extra files). Supports --dry-run. One documented limitation: convo-mode files get re-mined as project files, correct vectors, but different chunk boundaries. Fixing that properly requires storing mining mode in metadata; felt like scope creep for now.

Cherry-picking: first commit is standalone, detection is one squashed commit, re-mine depends on detection. Happy to answer questions or make any adjustments!

NickShtefan pushed a commit to NickShtefan/mempalace that referenced this pull request Apr 10, 2026
…ression

Builds on top of MemPalace#442 to add two improvements:

1. Expose embedding device via MEMPALACE_EMBEDDING_DEVICE env var / embedding_device config key.
   This lets Apple Silicon users set device='mps' and NVIDIA users set device='cuda' to
   dramatically speed up embedding generation during mempalace mine (5-15x measured on M-series).

2. Ergonomic default: setting only MEMPALACE_EMBEDDING_DEVICE automatically uses
   sentence-transformers/all-MiniLM-L6-v2 (same weights as ChromaDB's default ONNX embedder),
   so users don't have to know the model name to get GPU acceleration, and existing palaces
   remain vector-compatible.

3. Fix a regression in MemPalace#442: when no model is configured, get_embedding_function() used to
   return None, which newer ChromaDB rejects with 'You must provide an embedding function'
   at collection.add() time. Now returns ChromaDB's DefaultEmbeddingFunction() explicitly,
   restoring the pre-MemPalace#442 default behavior and making tests/test_convo_miner.py pass again.

All 552 tests pass, including 7 new tests covering:
- embedding_device property reads from env var and config.json
- device is passed through to SentenceTransformerEmbeddingFunction when set
- device alone activates the default model
- device is NOT passed when unset (preserves original MemPalace#442 call signature)
- device can be set via config.json independent of model
@NickShtefan NickShtefan force-pushed the feat/multilingual-embedding branch from e1ce332 to 826e3bc Compare April 10, 2026 19:39
@NickShtefan

Copy link
Copy Markdown
Author

@FabioLissi Thanks for the excellent contributions! I've cherry-picked all three commits into this PR:

  • 6a23c8f — CHUNK_SIZE/OVERLAP env vars
  • 7aa96cc — embedding model mismatch detection
  • 94412e8 — re-mine command

During integration I found and fixed two issues:

  1. Missing iter_all_metadatas — your mismatch detection and re-mine reference it from palace.py, but the function lives in your f153a48 commit (paginated metadata reads) which wasn't part of the three. Added it to palace.py.

  2. EmbeddingModelMismatchError swallowed in searcher.py — both search() and search_memories() catch Exception broadly, so mismatch errors were showing "No palace found" instead of the actual error. Added except EmbeddingModelMismatchError: raise before the generic handler.

Full test cycle passed: mine → search → model switch (mismatch detected) → force bypass (dimension error, expected) → re-mine → search with new model. Everything works.

Rebased on latest upstream/main, all clean.

@FabioLissi

FabioLissi commented Apr 11, 2026

Copy link
Copy Markdown

@NickShtefan — great catch on the searcher.py swallow. Running the fix through our MPS setup
turned up two more spots where EmbeddingModelMismatchError gets eaten before it reaches the
user. Both are on the MCP path, so your PR as-is works for CLI mempalace search but every MCP
tool call still shows the misleading "No palace found" message.

1. mcp_server.py:_get_collection() wrapper

def _get_collection(create=False):                                                                                                         
    ...                                
    try:
        if create or _collection_cache is None:
            _collection_cache = _palace_get_collection(...)                                                                                
        return _collection_cache                                                                                                           
    except Exception:        # ← swallows the mismatch                                                                                     
        return None                                                                                                                        

Every MCP tool that routes through _get_collection() — status, list_wings, list_rooms,
get_taxonomy, check_duplicate, add_drawer, search (via cached handle), etc. — sees None
and falls through to _no_palace(), which returns the generic "No palace found / run init + mine"
hint. The real error and recovery instructions never surface.

  1. Top-level MCP dispatcher (handle_request)
 try:                                                                                                                                       
     result = TOOLS[tool_name]["handler"](**tool_args)                                       
     return {...}                                                                                                                           
 except Exception:
     logger.exception(f"Tool error in {tool_name}")                                                                                         
     return {                                                                                                                               
         "jsonrpc": "2.0",              
         "id": req_id,                                                                                                                      
         "error": {"code": -32000, "message": "Internal tool error"},                        
     }                                                                                                                                      

Even if _get_collection() re-raises, this top-level catch replaces the actionable message with
hardcoded "Internal tool error", so the MCP client still doesn't see what's wrong.

Fix (applied on our fork, tests green)

Three spots:

  1. searcher.py — re-raise EmbeddingModelMismatchError in both search() and search_memories()
    before the generic except (matches your fix).
  2. mcp_server.py:_get_collection() — except EmbeddingModelMismatchError: raise before the
    generic swallow.
  3. mcp_server.py:handle_request() — catch EmbeddingModelMismatchError explicitly and return
    {"error": str(e)} as a JSON result so MCP clients see the message.

Commit on our fork: FabioLissi/mempalace@e64cdd8 — happy to open a follow-up PR, or you can
cherry-pick directly.

Heads-up on rebasing #442 onto current upstream/main

When you rebase, you'll hit a conflict in mcp_server.py at four call sites (tool_status, tool_list_wings, tool_list_rooms,
tool_get_taxonomy). Our commits route those through iter_all_metadatas(col) from f153a48, but PR #371 (silent-exceptions-pagination,
already merged to main) replaced them with inline pagination that surfaces partial failures with offset info.

Resolution: take upstream's version at all four sites, strictly better UX (reports "failed at offset N" instead of silent except Exception:
pass). Then drop the now-unused iter_all_metadatas import from mcp_server.py:

was:
from .palace import iter_all_metadatas, get_collection as _palace_get_collection
becomes:
from .palace import get_collection as _palace_get_collection

iter_all_metadatas stays in palace.py, still used by cli.py (re-mine) and miner.py (mismatch detection), just no longer by mcp_server.py.

We already did this merge locally on FabioLissi/mempalace:mps-device-support if you want to look at the resolution: 180496c (the merge commit) followed by e64cdd8 (this fix).

PR #488 (multilingual, @EndeavorYen) touches config.py, mcp_server.py, searcher.py,
palace.py, miner.py — same files as this fix and our original #442 work. Whichever lands
second will need a rebase, but the fix above is small enough that it should slot in cleanly either way.

FabioLissi added a commit to FabioLissi/mempalace that referenced this pull request Apr 11, 2026
@NickShtefan fixed the CLI path in PR MemPalace#442 (searcher.py) by re-raising
EmbeddingModelMismatchError past the generic except clause. The MCP
path had the same bug via two additional swallows:

- mcp_server.py:_get_collection() wraps _palace_get_collection in
  except Exception: return None, so every tool using _get_collection
  (status, list_wings, list_rooms, get_taxonomy, check_duplicate,
  search, etc.) silently showed "No palace found" instead of the
  actual mismatch error.

- The top-level dispatcher in handle_request() catches Exception and
  returns hardcoded "Internal tool error", so even if the wrapper
  re-raises, the actionable message (stored vs current model + recovery
  instructions) never reaches the caller.

Fix in three spots:
1. searcher.py: re-raise EmbeddingModelMismatchError in search() and
   search_memories() before the generic except (matches @NickShtefan's fix).
2. mcp_server.py:_get_collection(): re-raise EmbeddingModelMismatchError
   before the generic swallow.
3. mcp_server.py handle_request(): catch EmbeddingModelMismatchError
   explicitly and return the message as a JSON result so MCP clients
   see the real error and recovery instructions.
@NickShtefan NickShtefan force-pushed the feat/multilingual-embedding branch from 826e3bc to f18187b Compare April 11, 2026 19:12
NickShtefan pushed a commit to NickShtefan/mempalace that referenced this pull request Apr 11, 2026
…ression

Builds on top of MemPalace#442 to add two improvements:

1. Expose embedding device via MEMPALACE_EMBEDDING_DEVICE env var / embedding_device config key.
   This lets Apple Silicon users set device='mps' and NVIDIA users set device='cuda' to
   dramatically speed up embedding generation during mempalace mine (5-15x measured on M-series).

2. Ergonomic default: setting only MEMPALACE_EMBEDDING_DEVICE automatically uses
   sentence-transformers/all-MiniLM-L6-v2 (same weights as ChromaDB's default ONNX embedder),
   so users don't have to know the model name to get GPU acceleration, and existing palaces
   remain vector-compatible.

3. Fix a regression in MemPalace#442: when no model is configured, get_embedding_function() used to
   return None, which newer ChromaDB rejects with 'You must provide an embedding function'
   at collection.add() time. Now returns ChromaDB's DefaultEmbeddingFunction() explicitly,
   restoring the pre-MemPalace#442 default behavior and making tests/test_convo_miner.py pass again.

All 552 tests pass, including 7 new tests covering:
- embedding_device property reads from env var and config.json
- device is passed through to SentenceTransformerEmbeddingFunction when set
- device alone activates the default model
- device is NOT passed when unset (preserves original MemPalace#442 call signature)
- device can be set via config.json independent of model
NickShtefan pushed a commit to NickShtefan/mempalace that referenced this pull request Apr 11, 2026
@NickShtefan fixed the CLI path in PR MemPalace#442 (searcher.py) by re-raising
EmbeddingModelMismatchError past the generic except clause. The MCP
path had the same bug via two additional swallows:

- mcp_server.py:_get_collection() wraps _palace_get_collection in
  except Exception: return None, so every tool using _get_collection
  (status, list_wings, list_rooms, get_taxonomy, check_duplicate,
  search, etc.) silently showed "No palace found" instead of the
  actual mismatch error.

- The top-level dispatcher in handle_request() catches Exception and
  returns hardcoded "Internal tool error", so even if the wrapper
  re-raises, the actionable message (stored vs current model + recovery
  instructions) never reaches the caller.

Fix in three spots:
1. searcher.py: re-raise EmbeddingModelMismatchError in search() and
   search_memories() before the generic except (matches @NickShtefan's fix).
2. mcp_server.py:_get_collection(): re-raise EmbeddingModelMismatchError
   before the generic swallow.
3. mcp_server.py handle_request(): catch EmbeddingModelMismatchError
   explicitly and return the message as a JSON result so MCP clients
   see the real error and recovery instructions.
@NickShtefan

Copy link
Copy Markdown
Author

@FabioLissi cherry-picked your MCP fix (e64cdd8) — thanks. PR rebased on latest upstream/main, all 604 tests pass. Ready for review.

@bensig bensig changed the base branch from main to develop April 11, 2026 22:22
@bensig bensig requested a review from igorls as a code owner April 11, 2026 22:22
@igorls

igorls commented Apr 14, 2026

Copy link
Copy Markdown
Member

Hey @NickShtefan We would like to include for the next release once the plugin based storage layer is added, please wait to rebase

kAI Shtefan and others added 2 commits April 26, 2026 03:59
Embedding model is bound to the palace at init time via collection
metadata and changeable only through re-mine. Removes all embedding-
related environment variables in favor of explicit CLI flags and
auto-detection.

Features:
- `mempalace init --model <name>` binds embedding model at creation
- `mempalace re-mine --model <new>` migrates to a different model
- `mempalace init --chunk-size N` configures chunk size per palace
- Auto-detect device: cuda > mps (arm64 only) > cpu
- Embedding model mismatch detection with clear error messages
- Legacy palace auto-migration (stamps "chromadb-default")
- Backend seam abstraction for storage layer
- MCP ping health checks

Removed env vars:
- MEMPALACE_EMBEDDING_MODEL (use --model flag)
- MEMPALACE_EMBEDDING_DEVICE (auto-detected)
- MEMPALACE_CHUNK_SIZE / MEMPALACE_CHUNK_OVERLAP (use --chunk-size)
- MEMPALACE_FORCE_EMBEDDING (use --force flag)

Benchmarked on 247 Russian blog posts (intfloat/multilingual-e5-base):
  758 drawers, 49ms avg search, 0.719 avg similarity, 100% high-relevance

Co-authored-by: Fabio Lissi <fabio.lissi@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…lection

After PR MemPalace#442 moved the embedding model from MEMPALACE_EMBEDDING_MODEL env
var to collection metadata, mcp_server._get_collection() stopped passing
embedding_function to ChromaDB. This made the MCP server silently fall back
to ChromaDB's built-in default (all-MiniLM-L6-v2, dim 384), regardless of
what model the palace was actually built with.

For palaces created via `mempalace init --model intfloat/multilingual-e5-base`
(the headline use case of MemPalace#442), this means inserts and queries through MCP
would attempt 384-dim embeddings against 768-dim stored vectors and fail
with a dimension mismatch. searcher.py + cli.py work correctly because they
go through palace.get_collection() — only the MCP path was broken.

This commit makes _get_collection mirror palace.get_collection's behaviour:

- Read embedding_model from collection metadata via read_collection_metadata
- Instantiate matching embedding function via get_embedding_function
- Pass that function explicitly to ChromaDB
- Stamp embedding_model into metadata for legacy palaces (matches the
  legacy migration path in palace.get_collection)

Local-only commit for now; will be evaluated for inclusion into PR MemPalace#442
or a follow-up upstream PR.
@NickShtefan

Copy link
Copy Markdown
Author

Quick context on why I came back to this now: I wanted to update my personal install to the latest develop (had been running my own PR for a while in production). Since my own copy was already running on this PR, I figured I'd refresh the PR itself at the same time — rebased onto current develop, dropped the env-var device support that's now in develop via @igorls's a4868a35 (perf(mining): batch + GPU acceleration), and folded in the _get_collection MCP fix that closes #903.

Everything verified against my live palace (15K+ drawers, intfloat/multilingual-e5-base) — tool_diary_read and mempalace_search over MCP both return correct results after the rebase.

Following @igorls's earlier note that this should land "once the plugin based storage layer is added" — happy to wait for #1196 (SQLiteVec backend) before merging if that's still the plan, or merge sooner if the storage layer timing has shifted. Whatever works for you.

@jphein

jphein commented Apr 26, 2026

Copy link
Copy Markdown
Collaborator

Following over from #390 where @NickShtefan tied the threads together — a few thoughts on direction.

The bind-chunk-size-to-collection-metadata property is the load-bearing piece. The reason is the silent-drift failure mode: change chunk_size after a palace is mined and the new chunks may exceed the model's token window, getting silently truncated before embedding. With the binding stamped into chromadb collection metadata at init, that mismatch becomes a property chromadb itself can refuse — much harder to footgun than env-var resolution.

Composition with #1024. Different layers; both wanted:

If maintainer direction is to wait for #1196 (SQLiteVec backend) per @igorls's earlier note, that's reasonable — the plugin-storage layer changes the metadata-binding shape and rebasing now risks a third rebase later. The intermediate state where users have #1024 (config override) but not #442 (collection-bound) is acceptable as long as the gap is documented in CHANGELOG: "changing chunk_size after a palace is mined will produce drift; re-mining a wing is the recovery path until #442 lands."

The _get_collection MCP fix you folded in is independently valuable and would be a clean small extraction if the rest of #442 has to wait — happy to comment on a smaller scoped PR for just that piece if you'd find it useful.

(For context: the fork has #1024-equivalent in production but not #442-equivalent; the silent-drift failure mode you describe hasn't bitten us yet only because we haven't changed embedding model. So I'm reading #442 as the durable fix, #1024 as the bridge.)

@davidglidden

Copy link
Copy Markdown
Contributor

Adding production-data point + use case in support of this PR.

I've been running a local patch that effectively does the minimum-viable version of what this PR proposes — hardcoded BAAI/bge-m3 with device="mps" in mempalace/backends/chroma.py. Working assumption was always that I'd retire the patch once first-class config landed. This PR is exactly that path.

Use case — literary / contemplative semantic memory:

  • ~600 MB of literary, philosophical, and scholarly texts (English-dominant but with French, German, Italian, Latin, Greek)
  • Particular sensitivity to Greek (cited in body-text cross-references) and French (translation work, some original notes)
  • Multilingual scholarly works — semantic fidelity matters across translations and transliterated classical languages

Why BAAI/bge-m3 for this case (offered as a doc-recommendation candidate, not a default change):

  • 1024-dim — richer semantic space for nuanced queries against literary texts
  • 8192-token context window vs e5-base's 512 — meaningful for long literary passages where 512-token chunks break semantic units mid-thought
  • Strong multilingual coverage including Greek and Latin per BAAI's evaluation
  • License MIT

Trade-off vs e5-base: heavier (~2.3 GB vs ~1.1 GB), longer initial mining time, but the larger context window matters when chunks are long-form literary rather than short-form note-taking.

Empirical:

  • 800K+ drawers mined from this corpus on `bge-m3 + MPS`
  • ~3 days mining time on Apple Silicon (M-series)
  • Quality has been strong for cross-language semantic queries on classical texts

Small suggestion for the PR: README's recommended-models section could note `BAAI/bge-m3` as the option for users with longer-context literary / scholarly content. `e5-base` for most, `bge-m3` for the long-passage case. Two recommendations cover the main use-case axis (note-taking vs long-form).

The init-time binding via collection metadata is the right architecture — strongly agreed. Avoids the env-var-drift class of bugs. Looking forward to retiring my local patch.

(Happy to test the PR branch against this corpus once my current mine completes and report back.)

Surfaces the model-selection axis introduced by this PR (chromadb-default
vs intfloat/multilingual-e5-base vs BAAI/bge-m3) in the README so users
land on the model that fits their content shape (short-form English vs
multilingual notes vs long-form literary). Notes that the binding is
stamped into collection metadata at init time and that switching models
is done via re-mine. No code changes.

Per @davidglidden's suggestion in the PR thread.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@NickShtefan

Copy link
Copy Markdown
Author

Pushed c652c2f3 adding an Embedding models section to the README per @davidglidden's suggestion. Two-tier framing rather than one default: chromadb-default for short-form English, intfloat/multilingual-e5-base for multilingual notes, BAAI/bge-m3 for long-form literary / scholarly content where the 8192-token context window matters. Notes the init-time binding and the re-mine path for switching models. Doc-only, no code changes.

@davidglidden — if you have time after your current mine completes, a test report on the PR branch against your literary corpus would be a useful production-data point alongside @sidonsoft's earlier findings; happy to incorporate any wording tweaks if bge-m3 deserves more nuance than what's in the table now.

@davidglidden

Copy link
Copy Markdown
Contributor

@NickShtefan -- Preparing a curated mine covering English, French, and Spanish to run overnight. Will post a test report tomorrow. Thanks for reaching out.

@davidglidden

Copy link
Copy Markdown
Contributor

PR #442 test report — bge-m3 on literary / multilingual corpus

Test branch: feat/multilingual-embedding @ c652c2f3
Test date: 2026-05-11
Platform: macOS, Apple Silicon (Device: coreml)
Source PR: #442feat: add configurable multilingual embedding model support

Following up on @NickShtefan's request for a production-data point against my literary corpus alongside @sidonsoft's earlier findings. Reporting here so the PR thread has a second independent confirmation that bge-m3 binding works end-to-end.

Setup

Installed PR branch in a dedicated venv (pip install -e <pr-worktree> then pip install sentence-transformers — the multilingual extras are required for actual bge-m3 binding; without them, init succeeds but mine silently falls back to ChromaDB default). Wiped any prior test palace; init with --model BAAI/bge-m3 --yes; mine separately.

Corpus

11 markdown files, ~8 MB, 21,735 drawers, 25 min mining wall-clock on M-series CoreML. Curated to exercise bge-m3's stated multilingual + long-form claims:

Title Author Language Drawers
Orthotypographie Vol I (A-F) Lacroux French 2,296
Orthotypographie Vol II (G-Z) Lacroux French 2,327
Lexique des règles typographiques (Imprimerie Nationale) French state typography office French 939
Ortografía y ortotipografía del español actual Martínez de Sousa Spanish 5,874
The Solid Form of Language Bringhurst English 231
Detail in Typography Hochuli English 303
Space Between Words: Origins of Silent Reading Saenger English (paleography) 3,626
Forests: The Shadow of Civilization Harrison English 1,999
Gardens: An Essay on the Human Condition Harrison English 1,348
Juvenescence: A Cultural History of Our Age Harrison English 1,316
La Poétique de l'espace Bachelard French 1,476

Results — cross-language semantic retrieval works

Ten queries crossing language boundaries. Trimmed scores; full output available.

Q: "italic for emphasis" (English) →

  • [1] Hochuli (EN) cos 0.622 — "emphasize in italic; that will make it stand out"
  • [2] Lexique IN (FR) cos 0.64 (BM25=0.0) — "Le caractère italique est utilisé principalement pour attirer l'attention du lecteur sur un mot, une phrase ou un passage que l'auteur tient à souligner."

Pure-semantic match (BM25=0.0) at higher cosine than the lexical English hit. The PR's central claim verified on first query.

Q: "intimate space of dwelling" (English) → Bachelard La Poétique de l'espace (French) ×4, cos 0.585-0.612, all BM25=0.0. English query, French book, no lexical overlap — the book retrieves entirely on semantic similarity.

Q: "approche typographique" (French — the French technical term for letter-spacing) →

  • [1-2] Lacroux Vol I (FR) cos 0.637, 0.611 — native-language match
  • [3] Lexique IN (FR) cos 0.652
  • [4] Hochuli (EN) cos 0.615 (BM25=0.0) — semantic bridge to the English typographic tradition

Q: "letter spacing between pairs of characters" (English) → top 3 English (Saenger, Hochuli), but [4] Sousa (Spanish) cos 0.612 BM25=0.0 — English query reaching into the Spanish typographic vocabulary on the same concept.

Q: "lineas viudas y huerfanas" (Spanish for "widow and orphan lines") →

  • [1-2] Sousa (Spanish) cos 0.621, 0.603 — native
  • [3-4] Lacroux Vol II (French) cos 0.54, 0.522 — cross-language semantic on the typographic concept

Q: "comillas españolas tipos" (Spanish, "Spanish quotation mark types") → 4 Sousa hits, cos 0.609-0.636 — clean native-language retrieval.

Q: "the dead and the unborn" (English, Harrison's signature phrase) → 4 Harrison hits (Forests + Juvenescence) cos 0.52-0.53 — long-form scholarly retrieval on a specialized conceptual vocabulary.

Q: "humic burial dead" (Harrison's coinage) → 4 Forests hits cos 0.498-0.536 — even his idiosyncratic specialist vocabulary retrieves coherently.

Q: "restraint in ornament" (English, Tschichold/Bringhurst posture) →

  • [1-2] Harrison Forests (EN) cos 0.559, 0.562 — note Harrison's prose on civilizational restraint maps to typographic restraint via the same semantic vector
  • [3] Lacroux Vol I (FR) cos 0.571 (BM25=0.0)
  • [4] Bachelard *La Poétique* (FR) cos 0.567 (BM25=0.0)

Cross-tradition AND cross-language. Notable that "restraint in ornament" retrieves Harrison's Forests passage on civilizational restraint — the embedding space is bridging conceptual register across domains, not just lexical surface.

What this corpus shows that wasn't obvious before

Three patterns worth flagging for the README / PR description:

1. The sentence-transformers install dependency is a usability footgun. The first run after pip install -e . succeeded silently — init reported Embedding model: BAAI/bge-m3 and mine reported successful completion — but emitted the warning "sentence-transformers not installed — falling back to ChromaDB default." and produced a palace bound to MiniLM (384d, English-leaning), not bge-m3 (1024d, multilingual). The metadata was stamped correctly; only the embedding function was wrong. Suggestion: hard-error rather than warn-and-fall-back when --model is non-default but sentence-transformers is missing. Or document pip install sentence-transformers (or pip install mempalace[multilingual]) prominently in the README's Embedding models section as a required dependency for non-default models.

2. The --auto-mine flag doesn't honour --palace for the auto-mine step (init creates the right palace, but _maybe_run_mine_after_init reads cfg.palace_path rather than args.palace). Workaround used here: run init and mine as separate steps; both honour --palace correctly. Worth a small follow-up patch threading args.palace through.

3. Cosine scores on bge-m3 cluster lower than on default MiniLM (typical range 0.50-0.70 for clear conceptual matches vs MiniLM's 0.7-0.9 on the same kind of corpus). Worth a note in the README that bge-m3's score range is different — operators accustomed to MiniLM scoring may misread a 0.60 cosine as "weak" when it actually represents strong semantic match in bge-m3's space. The empirical scores above suggest a working threshold of cosine ≥ 0.50 (vs MiniLM's typical ≥ 0.65) for relevance on this corpus.

Conclusion

PR #442 delivers what it promises on this corpus. The cross-language semantic retrieval works on first query, even on technical typographic vocabulary that's domain-specific in each language (approche / kerning / interletraje, cursiva / italique / italic, viuda / veuve / widow). Long-form scholarly content (Harrison trilogy, Saenger, Bachelard) retrieves coherently with the 8192-token context window.

Retires my local hardcoded bge-m3 patch in chroma.py (from April) — the upstream surface is exactly what I'd want it to be. Init-time stamping into collection metadata + the re-mine --model path for migration is the right architecture.

Two small footnotes (the sentence-transformers install gate and the --auto-mine palace-path bug) — both fixable in follow-ups. The PR's claim and the implementation align.

Happy to test on a larger corpus (full ~600 MB literary library) once #442 lands on develop and develop lands on main; this 8 MB subset was sized to fit a pre-flight window. Will also retest after v3.3.5 ships (the lock-fix at the backend seam) for a production-realistic concurrent-write profile.

@davidglidden

Copy link
Copy Markdown
Contributor

Quick follow-up to the test report above, with a finding that might be useful for the path forward.

After 3.3.5 shipped on main this weekend, I started setting up for the larger-corpus mine — the full ~600 MB literary corpus I'd flagged earlier. Looked for a way to get the bge-m3 binding running on a stable base in the meantime.

I expected this to be the kind of thing where I'd carry forward my old hardcoded patch from April. It turned out to be much smaller than that — and the smallness is the point.

The whole patch is two hunks

Hunk 1 — mempalace/embedding.py (~10 lines): replace get_embedding_function()'s body to return SentenceTransformerEmbeddingFunction(model_name="BAAI/bge-m3", device="mps") with a small cache.

Hunk 2 — mempalace/backends/chroma.py (~3 lines, both create sites): add "embedding_model": MEMPALACE_EMBED_MODEL to the existing metadata= dict.

That's it. Combined diff against origin/main at v3.3.5:

 mempalace/backends/chroma.py |  3 +++
 mempalace/embedding.py       | 45 ++++++++++++++++++++++++++------------------
 2 files changed, 30 insertions(+), 18 deletions(-)

End-to-end verified: collection metadata stamps embedding_model = BAAI/bge-m3 at the SQLite row level, query roundtrip returns the right drawer, EF identity reports sentence_transformer consistently across write and read paths.

What I noticed

#1299 (3.3.4 — "MCP server tool_diary_write SIGSEGV when default EF provider differs") ended up doing a lot of the structural work for multilingual support. By centralizing EF resolution at ChromaBackend._resolve_embedding_function() and routing both mcp_server._get_collection and the backend's own collection paths through it, there's now a single seam where the embedding function is resolved exactly once. Everything downstream flows through that seam transparently.

In that architecture, swapping the embedding model is mostly just a swap at the seam. The reason #442 is a ~999-line PR isn't the binding itself — it's everything around it:

  • mempalace init --model <name> CLI flag and the config plumbing for it
  • Init-time stamping of embedding_model into collection metadata (matters once you have multiple model choices)
  • mempalace re-mine --model <new> migration path between models
  • Legacy palace auto-migration that stamps "chromadb-default"
  • Mismatch detection between EF identity and stored metadata
  • The _get_collection fix in mcp_server.py that 8e4319c addressed
  • Auto-detect device cascade (cuda > mps on arm64 > cpu)
  • Removal of the old MEMPALACE_EMBEDDING_MODEL env var
  • ~390 lines of new tests for the above

All of that is real work and worth having. Just noticing that it's somewhat independent of the underlying binding.

One thought, only if it'd be useful: depending on how the rebase against current develop shapes up, the embedding-seam swap and metadata stamp could potentially land on their own — even with something simpler than --model (env var or config field) for the model selection. Just the core piece. The CLI / re-mine / migration apparatus can then layer on top in follow-ups whenever it makes sense. For users in literary / non-English corpora, the mining capability alone is already significant.

A caveat

The patch I described is local-only for our chamber-library mine. Not proposing it as the path forward upstream — your architecture in #442 (configurable model + stamping + migration + mismatch detection) is the right shape for a general-purpose tool. The minimal patch only works because we have a fixed corpus and a single model in mind.

Sharing it as evidence that the core piece is small in 3.3.5's architecture, in case that's useful context.

Will report back on the chamber-library full mine when it completes. Plan: ~700k drawers across French/Spanish/English typography sources, paleography, scholarly philosophy. Mining on the local-patched 3.3.5 with single-writer discipline (MCP off during the mine), using 3.3.5's repair --mode from-sqlite (#1308) as the safety net.

Thanks again for #442 — the bridge it builds between English-leaning ChromaDB defaults and genuinely multilingual corpora matters a lot for what this corpus is.

Two follow-ups to PR MemPalace#442 surfaced by davidglidden's 8 MB French/Spanish/
English typography corpus test on commit c652c2f
(MemPalace#442 (comment)).

1) `--auto-mine` ignored `--palace`. `_maybe_run_mine_after_init` read
   `cfg.palace_path` while the matching init step honoured `args.palace`,
   so `mempalace init --palace X --auto-mine` initialised X but mined
   into the default palace — leaving X empty and polluting the default.
   Use the same `getattr(args, "palace", None)` fallback pattern as
   `cmd_init`.

2) Silent fallback to MiniLM when sentence-transformers was missing for a
   non-default model. `init --model BAAI/bge-m3` against a venv without
   the `[multilingual]` extra succeeded, stamped `embedding_model =
   BAAI/bge-m3` into collection metadata, and bound the embedding
   function to all-MiniLM-L6-v2 (384d, English-leaning) — every later
   query returned the wrong model's neighbours under a multilingual
   label. Probe `import sentence_transformers` directly and raise
   ImportError with the install hint. (ChromaDB transmutes the missing
   dep to a generic ValueError inside `SentenceTransformerEmbeddingFunction`
   so catching ImportError around the chromadb call alone is insufficient.)

Cascading fix in `palace.get_collection`: the non-silent embedder change
exposed that chromadb's internal EF-conflict validator fires before our
domain `EmbeddingModelMismatchError` could, masking the actionable
message. Read `embedding_model` from collection metadata up front and
raise `EmbeddingModelMismatchError` (or take the explicit force-path
that opens the collection with the stored EF and re-stamps metadata)
before instantiating the new EF. Also avoids loading a heavy model only
to discard it on mismatch.

Tests:
- Replace `TestGetEmbeddingFunctionFallback` (asserted silent fallback)
  with `TestGetEmbeddingFunctionMissingDependency` (asserts hard raise +
  install hint), plus a guard that the default `chromadb-default` path
  is unaffected when sentence-transformers is missing.
- Add a `sentence_transformers` sys.modules stub fixture so the existing
  mocked-`SentenceTransformerEmbeddingFunction` tests still pass on a
  stock `[dev]` env (no `[multilingual]` extra required for unit tests).
- Add `test_maybe_run_mine_honours_args_palace_over_cfg` and
  `test_maybe_run_mine_falls_back_to_cfg_when_palace_unset` regression
  guards for fix #1.

Full suite green (1339 passed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@NickShtefan

NickShtefan commented May 11, 2026

Copy link
Copy Markdown
Author

@davidglidden — thanks for the production test report and the minimal-patch follow-up. Pushed ceccbfa1 addressing both findings you surfaced.

Both bugs fixed:

  1. --auto-mine now honours --palace. _maybe_run_mine_after_init was reading cfg.palace_path while cmd_init honoured args.palace — same getattr(args, "palace", None) fallback pattern threaded through.

  2. Hard ImportError (with install hint) instead of silent fallback to MiniLM when sentence-transformers is missing for a non-default model. Worth noting for anyone reading this thread: chromadb's SentenceTransformerEmbeddingFunction.__init__ catches the underlying ImportError and re-raises as ValueError, so catching ImportError around the chromadb call alone is insufficient — probing import sentence_transformers directly is the cleaner seam.

One cascading fix the non-silent embedder exposed: dropping the fallback meant chromadb's internal EF-conflict validator (Embedding function conflict: new: sentence_transformer vs persisted: default) started firing before palace.get_collection could raise its own EmbeddingModelMismatchError, masking the actionable "run re-mine --model or pass --force" message. Added a pre-check that reads embedding_model from collection metadata up front and raises EmbeddingModelMismatchError directly; force-path opens the collection with the stored EF and re-stamps metadata. Bonus: avoids loading a heavy model only to discard it on mismatch.

Tests:

  • Replaced TestGetEmbeddingFunctionFallback with TestGetEmbeddingFunctionMissingDependency (asserts raise + install hint).
  • Added a sentence_transformers sys.modules stub fixture so the existing mocked-SentenceTransformerEmbeddingFunction tests still pass on a stock [dev] env without the [multilingual] extra.
  • Added regression guards in tests/test_cli.py for fix 1.

Full suite green (1339 passed). README intentionally left for a separate docs-only follow-up — the [multilingual] extras gate and the bge-m3 cosine-threshold note you flagged are both worth landing, just keeping this commit scoped to the two bug fixes.

@davidglidden

Copy link
Copy Markdown
Contributor

Following up on the 2026-05-11 validation report. A 1.29M-drawer mine on this branch surfaced two findings worth feeding back — neither critiques the bge-m3 integration itself, only the scale validation methodology and one small ancillary question.

Honest amendment to the 2026-05-11 validation

Re-examining the test palace post-hoc: the 21.7k mine succeeded for search purposes (FTS5 + brute-force vector fallback returned correct results) but the VECTOR segment's max_seq_id was 859 rows behind the sqlite drawer count at completion (~4% drift).

drawers in sqlite:   21,752
VECTOR seq_id:       20,893
Drift:               859 drawers (~4%)

The Stage-2 integrity gate in quarantine_stale_hnsw correctly classified this as benign flush-lag — no quarantine fired, HNSW remained fully populated (data 88 MB, link_lists 179 KB, pickle 2.9 MB) — so the search-quality outcome was real and the validation's "passed" framing was accurate about search behavior. But it didn't characterize the full HNSW state.

Surfacing it now in case future PR #442 validations want to include a max_seq_id parity check alongside the search-quality queries. A small post-mine assertion that embeddings.count == max_seq_id for the VECTOR segment would catch this class of drift directly.

Scale finding (filed separately upstream)

A subsequent 1.29M-drawer mine of a larger corpus on this same branch did not complete cleanly. 37 quarantine events compounded into a 38,966-row max_seq_id gap (73% of corpus invisible to vector). The mechanism appears to be quarantine_stale_hnsw's 300s mtime threshold interacting with multi-hour bge-m3 + MPS embeds of single large sources — a false-positive cascade where each event throws away in-flight HNSW work.

Full forensic + reproduction recipe + suggested fixes filed as #1526. The mechanism is embedding-model-adjacent but the fix lives in the chromadb integration layer (_segment_appears_healthy, stale_seconds, post-mine state surfacing), not the PR #442 scope.

Suggestion for scale validation methodology

The 11-book subset used in 2026-05-11 was all small files (largest ~5 MB). The cascade pattern doesn't surface on small-file workloads regardless of total drawer count — it surfaces specifically when one source's embed time exceeds the 300s threshold.

If PR #442 wants to characterize bge-m3 behavior beyond the current validation: including at least one ~25 MB plain-text single-source file in the validation corpus would surface the cascade before merge. The drawer-count axis (21k → 100k → 1M) is fine but secondary; the single-source-size axis is what triggers Mode 1.

(The 21.7k validation also implicitly demonstrated that the Stage-2 integrity gate handles small benign drift correctly — that part of the gate works. The cascade is specifically when the gap is sustained for hours with the pickle in mid-write states.)

One side question

The 21.7k test palace's mempalace_closets collection has embedding_model: chromadb-default in collection_metadata, while mempalace_drawers is correctly on BAAI/bge-m3. Is the closets infrastructure intentionally decoupled from the configurable-embedding contract, or should it inherit MEMPALACE_EMBED_MODEL as well? Not a bug per se; just unclear from reading the PR whether this is intentional scope or an oversight.

(If intentional: ignore. If oversight: the closets-collection creation path may need the same metadata={..., "embedding_model": MEMPALACE_EMBED_MODEL, ...} stamp that the drawers path now has.)


Environment details + the local commits atop v3.3.5 (the three PR #442 integration patches 975acba / 5458b77 / 604df3b) are documented in #1526. Forensic file available on request.

Thank you for the PR work — the bge-m3 integration itself produces clean multilingual results at the scales that worked (cross-language retrieval verified at 21.7k; semantic similarity ranges sensible). The above is scale-and-methodology context, offered as a useful early indicator.

@NickShtefan

NickShtefan commented May 26, 2026

Copy link
Copy Markdown
Author

@davidglidden thanks for the 1.29M-scale run and the careful write-up.

Thanks again for the production data points.

…compressed)

get_closets_collection and the compressed-collection path call get_collection
without an explicit model, so secondary collections silently fell back to the
ChromaDB default while drawers used the configured model. In a multilingual
palace this left closets/compressed on MiniLM, degrading non-English search on
the index layer (closets is the searchable index).

get_collection now inherits the palace's canonical model (read from the drawers
collection) when creating a non-default collection that has no stored model and
no explicit override. Stored models are never overridden, so existing palaces
are unaffected and a re-mine migrates them.

Adds tests for closets/compressed inheritance and the default-stays-default case.
Addresses @davidglidden's closets finding on MemPalace#442.
@NickShtefan

Copy link
Copy Markdown
Author

@davidglidden pushed the closets/compressed stamp fix (2b28cf7), closing the loop on your finding above. Secondary collections were created without threading the palace's model, so they fell back to the ChromaDB default while drawers used the configured one. get_collection now inherits the palace's canonical model (from the drawers collection) when creating a non-default collection that has no stored model and no explicit override; stored models are never overridden, so existing palaces are unaffected (a re-mine migrates them). Added tests for closets/compressed inheritance and the default-stays-default case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/cli CLI commands area/install pip/uv/pipx/plugin install and packaging area/mcp MCP server and tools area/mining File and conversation mining area/search Search and retrieval enhancement New feature or request storage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: embedding model mismatch — MCP server uses MiniLM (384-dim) while ingest can use mpnet (768-dim)

7 participants