
feat(benchmarks): multilingual datasets + parity controls (embed model, num_ctx, language)#1483

Merged: igorls merged 12 commits into develop from feat/benchmark-multilingual on May 24, 2026


igorls (Member) commented May 12, 2026

Summary

  • Adds multilingual benchmarking to benchmarks.model_eval so shipping decisions reflect non-English users. New --language / --languages flags load dataset.{lang}.jsonl alongside dataset.jsonl; pt-BR / es / zh translations included (633 samples across 4 tasks). Inputs translated; proper nouns and labels stay English so cross-lingual scoring works without re-translating ground truth.
  • Adds --num-ctx to force options.num_ctx per Ollama request, overriding the model's Modelfile default. Required for apples-to-apples VRAM/TPS comparison across candidates with mismatched defaults (e.g. qwen3:4b-instruct-2507-q8_0 defaults to 32k = 9.7 GB resident; at 8k it's 5.6 GB).
  • Adds --embed-model with default flipped to embeddinggemma. The v1 nomic-embed-text cosine on EN↔PT-BR same-meaning pairs lands at ~0.607 (right at the 0.6 match threshold), so any phrasing drift collapses to false-negative. embeddinggemma lands ~0.766 with 2.7× the signal/noise spread. PT-BR memory_extraction recovered from 0.150 → 0.850 on the same model outputs after the swap — the previous score was a methodology artifact, not a model regression.
  • load_candidates() now synthesizes entries for tags not in candidates.yaml, so ad-hoc tags (e.g. igorls/gemma4-e4b-classifier:Q4_K_M) work via --candidates without yaml edits.
  • CSV gains a language column.
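
To make the threshold argument in the `--embed-model` bullet concrete, here is a minimal sketch of how little headroom a mean similarity of ~0.607 leaves over the 0.6 match threshold compared with ~0.766. The function names and the 0.05 drift value are illustrative assumptions, not part of the harness.

```python
MATCH_THRESHOLD = 0.6  # match threshold quoted in the PR description


def margin(mean_sim: float, threshold: float = MATCH_THRESHOLD) -> float:
    """Headroom between a typical same-meaning similarity and the threshold."""
    return mean_sim - threshold


def survives_drift(mean_sim: float, drift: float,
                   threshold: float = MATCH_THRESHOLD) -> bool:
    """True if a pair still counts as a match after losing `drift` similarity.

    `drift` is a hypothetical stand-in for phrasing differences between the
    model output and the ground truth.
    """
    return mean_sim - drift >= threshold


if __name__ == "__main__":
    for name, sim in [("nomic-embed-text", 0.607), ("embeddinggemma", 0.766)]:
        print(name, round(margin(sim), 3), survives_drift(sim, drift=0.05))
```

With the numbers from the bullet, the first embedder has a margin of about 0.007 and the second about 0.166, so the same small drift flips a true match to a false negative only under the first.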

Why now

We needed to pick between igorls/gemma4-e4b-classifier:Q4_K_M and qwen3:4b-instruct-2507-q8_0 for MemPalace's 8 GB-VRAM tier. The English-only harness at n=30 couldn't separate them (deltas 1-6 points = noise-adjacent). The multilingual matrix (4 models × 5 tasks × 4 languages = 80 runs) produced clear signal: Gemma family owns open-set room cls in every language, Qwen wins entity by 5-7 points everywhere, classifier matches official Gemma within noise on every cell. The methodology fixes here are what made that signal trustworthy.

What this does NOT change

  • No production code touched outside the benchmark harness and mempalace/llm_client.py (added a num_ctx kwarg + **provider_kwargs forwarding in get_provider; no behavioral change unless the new kwarg is set).
  • Default --embed-model change does affect score numbers vs historical EN-only runs. The relative ranking is unchanged but absolute coverage/similarity scores will be slightly higher under the new default. Worth flagging when comparing new results to pre-this-PR baselines.

Test plan

  • uv run python -m benchmarks.model_eval.orchestrator --candidates qwen3:4b-instruct-2507-q4_K_M --tasks all --languages en,pt-BR --num-ctx 8192 --dataset-dir benchmarks/model_eval/datasets --output /tmp/pr-smoke.csv produces 10 rows, language column populated, no errors
  • Verify --num-ctx 8192 is actually applied: curl -s localhost:11434/api/ps shows context_length: 8192 for a model whose Modelfile default is higher
  • Verify --embed-model override: run with --embed-model nomic-embed-text and confirm scores match historical EN baselines on memory_extraction
  • Multilingual datasets load and produce non-zero accuracy on all 3 languages × 4 tasks
  • Existing single-language runs (no --languages flag) produce the same output as before this PR
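
For reference, a hedged sketch of what forcing `options.num_ctx` per Ollama request looks like at the payload level. `build_generate_payload` is a hypothetical helper written for this note; the actual plumbing lives in `mempalace/llm_client.py` and may differ.

```python
from typing import Any, Optional


def build_generate_payload(model: str, prompt: str,
                           num_ctx: Optional[int] = None) -> dict[str, Any]:
    """Build a JSON body for Ollama's /api/generate endpoint.

    When num_ctx is None the model's Modelfile default applies; otherwise the
    value is sent under "options" and overrides it for this request.
    """
    payload: dict[str, Any] = {"model": model, "prompt": prompt, "stream": False}
    if num_ctx is not None:
        payload["options"] = {"num_ctx": num_ctx}
    return payload


if __name__ == "__main__":
    print(build_generate_payload("qwen3:4b-instruct-2507-q8_0", "hello", num_ctx=8192))
```

Leaving `options` out entirely when the flag is unset keeps the "no behavioral change unless the new kwarg is set" property described above.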

…l, num_ctx, language)

Enables shipping decisions for non-English users and fair comparison across
candidates whose Modelfile defaults disagree.

- --language / --languages: load dataset.{lang}.jsonl alongside the base
  dataset.jsonl. CSV gains a language column. Synthesized candidate
  entries let ad-hoc model tags run without editing candidates.yaml.
- --num-ctx: force Ollama options.num_ctx per request, overriding the
  model's Modelfile default. Required for apples-to-apples VRAM/TPS
  (qwen3:4b-q8 defaults to 32k = 9.7 GB resident; at 8k it's 5.6 GB).
- --embed-model: thread the semantic-similarity embedding model through
  scoring. Default flips to embeddinggemma (was nomic-embed-text v1).
  Reason: v1 cosine on EN<->PT-BR same-meaning pairs sits at ~0.607
  (right at the 0.6 match threshold), so any phrasing drift collapses
  to false-negative. embeddinggemma lands ~0.766 with 2.7x the
  signal/noise spread. PT-BR memory_extraction recovered 0.15 -> 0.85
  on the same outputs after the swap.

Datasets: 12 new files (pt-BR/es/zh x 4 tasks, 633 samples). Input text
translated; proper nouns and labels stay English so cross-lingual
scoring against the existing labels.jsonl works without re-translation.
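
A rough sketch of the "synthesized candidate entries" idea, assuming candidates are plain dicts keyed by model tag. The field names (`tag`, `tier`, `synthesized`) and the function name are hypothetical; the real `load_candidates()` schema is not shown in this PR text.

```python
from typing import Any


def resolve_candidates(requested: list[str],
                       known: dict[str, dict[str, Any]]) -> list[dict[str, Any]]:
    """Return one entry per requested tag, synthesizing entries for unknown tags.

    `known` stands in for the parsed candidates.yaml content.
    """
    resolved = []
    for tag in requested:
        entry = known.get(tag)
        if entry is None:
            # Ad-hoc tag: build a minimal placeholder entry instead of failing.
            entry = {"tag": tag, "tier": "adhoc", "synthesized": True}
        resolved.append(entry)
    return resolved


if __name__ == "__main__":
    known = {"qwen3:4b-instruct-2507-q8_0": {"tag": "qwen3:4b-instruct-2507-q8_0",
                                              "tier": "official"}}
    for entry in resolve_candidates(
            ["qwen3:4b-instruct-2507-q8_0", "igorls/gemma4-e4b-classifier:Q4_K_M"],
            known):
        print(entry)
```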
Copilot AI review requested due to automatic review settings May 12, 2026 22:01
@igorls igorls requested a review from milla-jovovich as a code owner May 12, 2026 22:01

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces multi-language support for model evaluation by adding Spanish, Portuguese, and Chinese datasets and updating the orchestrator to process these variants. It also enhances the evaluation pipeline with support for multiple LLM providers, configurable embedding models, and context window overrides. A review comment correctly identified that the --embed-endpoint argument does not implement the defaulting logic described in its help text, which could lead to configuration issues when using remote Ollama instances.

Comment thread benchmarks/model_eval/orchestrator.py Outdated
Comment on lines +151 to +152
parser.add_argument("--embed-endpoint", default="http://localhost:11434",
help="Endpoint for the embedding model (always Ollama). Defaults to --endpoint when using ollama provider.")


medium

The help text for --embed-endpoint states that it defaults to --endpoint when using the ollama provider, but this logic is not implemented. Currently, it always defaults to http://localhost:11434 regardless of the LLM endpoint configuration, which will cause issues when benchmarking remote Ollama instances.

Suggested change
- parser.add_argument("--embed-endpoint", default="http://localhost:11434",
- help="Endpoint for the embedding model (always Ollama). Defaults to --endpoint when using ollama provider.")
+ parser.add_argument("--embed-endpoint", default=None,
+ help="Endpoint for the embedding model (always Ollama). Defaults to --endpoint when using ollama provider, otherwise http://localhost:11434.")
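
The suggestion only changes the default; the fallback itself still has to be resolved after parsing. A minimal sketch of that step, assuming the flag names `--llm-provider`, `--endpoint`, and `--embed-endpoint` and a hypothetical `resolve_embed_endpoint` helper:

```python
import argparse

DEFAULT_OLLAMA = "http://localhost:11434"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument("--llm-provider", default="ollama")
    parser.add_argument("--endpoint", default=DEFAULT_OLLAMA)
    # None means "not given"; the real value is decided after parsing.
    parser.add_argument("--embed-endpoint", default=None)
    return parser


def resolve_embed_endpoint(args: argparse.Namespace) -> str:
    """Explicit flag wins; else reuse --endpoint for ollama; else localhost."""
    if args.embed_endpoint is not None:
        return args.embed_endpoint
    if args.llm_provider == "ollama":
        return args.endpoint
    return DEFAULT_OLLAMA


if __name__ == "__main__":
    args = build_parser().parse_args(["--endpoint", "http://gpu-box:11434"])
    print(resolve_embed_endpoint(args))
```

Using `None` as the sentinel is what lets the code tell "user passed localhost explicitly" apart from "user passed nothing".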

Copilot AI left a comment


Pull request overview

Adds multilingual benchmarking support to the benchmarks.model_eval harness and introduces parity controls so benchmark runs can be compared fairly across models/providers and context-window defaults.

Changes:

  • Add --language/--languages dataset selection and record language in JSON/CSV outputs.
  • Add --embed-model (default now embeddinggemma) for embedding-scored tasks, and --num-ctx to override Ollama options.num_ctx per request.
  • Allow --candidates to accept ad-hoc model tags not present in candidates.yaml (synthesized candidate entries).

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.

| File | Description |
|---|---|
| mempalace/llm_client.py | Adds num_ctx support to OllamaProvider and forwards extra provider kwargs via get_provider(). |
| benchmarks/model_eval/runner.py | Adds language selection, embed model/endpoint plumbing, provider selection, and num_ctx forwarding into the single-run harness. |
| benchmarks/model_eval/orchestrator.py | Extends matrix runner to iterate languages, add language CSV column, embed model/endpoint flags, and ad-hoc candidate tag support. |
| benchmarks/model_eval/datasets/room_classification/dataset.zh.jsonl | Adds Chinese room_classification dataset variant. |
| benchmarks/model_eval/datasets/room_classification/dataset.pt-BR.jsonl | Adds pt-BR room_classification dataset variant. |
| benchmarks/model_eval/datasets/room_classification/dataset.es.jsonl | Adds Spanish room_classification dataset variant. |
| benchmarks/model_eval/datasets/memory_extraction/dataset.zh.jsonl | Adds Chinese memory_extraction dataset variant. |
| benchmarks/model_eval/datasets/memory_extraction/dataset.pt-BR.jsonl | Adds pt-BR memory_extraction dataset variant. |
| benchmarks/model_eval/datasets/memory_extraction/dataset.es.jsonl | Adds Spanish memory_extraction dataset variant. |
| benchmarks/model_eval/datasets/entity_extraction/dataset.zh.jsonl | Adds Chinese entity_extraction dataset variant. |
| benchmarks/model_eval/datasets/entity_extraction/dataset.pt-BR.jsonl | Adds pt-BR entity_extraction dataset variant. |
| benchmarks/model_eval/datasets/entity_extraction/dataset.es.jsonl | Adds Spanish entity_extraction dataset variant. |
| benchmarks/model_eval/datasets/calibration/dataset.zh.jsonl | Adds Chinese calibration dataset variant. |
| benchmarks/model_eval/datasets/calibration/dataset.pt-BR.jsonl | Adds pt-BR calibration dataset variant. |
| benchmarks/model_eval/datasets/calibration/dataset.es.jsonl | Adds Spanish calibration dataset variant. |


Comment on lines 250 to +253
task_dir = dataset_dir / task
samples = load_jsonl(task_dir / "dataset.jsonl")
dataset_file = "dataset.jsonl" if language == "en" else f"dataset.{language}.jsonl"
dataset_path = task_dir / dataset_file
if not dataset_path.exists():
Comment on lines 418 to +422
strip_thinking=not args.no_strip_thinking,
llm_provider=args.llm_provider,
language=args.language,
embed_model=args.embed_model,
num_ctx=args.num_ctx,
Comment thread benchmarks/model_eval/orchestrator.py Outdated
igorls and others added 2 commits May 12, 2026 19:19
… defaulting

Addresses Copilot + gemini-code-assist review on #1483.

1. Path-traversal guard for --language. The value is interpolated into
   the dataset filename (`dataset.{language}.jsonl`), so unvalidated
   input could escape `task_dir`. Now:
   - regex `^[A-Za-z][A-Za-z0-9]*(?:[_-][A-Za-z0-9]+)?$` accepts en,
     pt-BR, zh-CN, fr_CA, etc. and rejects anything with path separators
     or `..`
   - belt-and-suspenders `Path.resolve().is_relative_to(task_dir)` check
     before opening the file

2. --embed-endpoint now defaults to None and is resolved after parsing:
   uses --endpoint when --llm-provider=ollama (so remote benchmark
   runs score against the same host), else http://localhost:11434.
   Help text now matches behavior. runner.py's CLI was also missing the
   flag entirely — added so single-task runs honor remote endpoints.
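
A minimal sketch of the two-layer `--language` guard described in item 1, using the regex from the commit message. `dataset_path` is a hypothetical helper name, and `fullmatch` is used here instead of `match` so a trailing newline cannot slip past `$`; the harness itself may be structured differently.

```python
import re
from pathlib import Path

# Regex quoted in the commit message: one base tag plus at most one
# "-" or "_" separated subtag (en, pt-BR, zh-CN, fr_CA, ...).
LANG_RE = re.compile(r"^[A-Za-z][A-Za-z0-9]*(?:[_-][A-Za-z0-9]+)?$")


def dataset_path(task_dir: Path, language: str) -> Path:
    """Return the dataset file for `language`, refusing unsafe values."""
    if not LANG_RE.fullmatch(language):
        raise ValueError(f"invalid language code: {language!r}")
    name = "dataset.jsonl" if language == "en" else f"dataset.{language}.jsonl"
    root = task_dir.resolve()
    path = (task_dir / name).resolve()
    # Second layer: even if the regex were loosened later, the resolved
    # path must still sit under the task directory (Python 3.9+).
    if not path.is_relative_to(root):
        raise ValueError(f"dataset path escapes {root}: {path}")
    return path


if __name__ == "__main__":
    base = Path("benchmarks/model_eval/datasets/room_classification")
    print(dataset_path(base, "pt-BR").name)
    try:
        dataset_path(base, "../../etc/passwd")
    except ValueError as exc:
        print("rejected:", exc)
```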
…nslated labels

Adds 6 new language datasets (German, French, Hindi, Italian, Korean, Russian)
across all 4 benchmark tasks (calibration, entity_extraction, memory_extraction,
room_classification) — 630 samples total, same conventions as the existing
pt-BR/es/zh datasets: inputs translated, labels/ground-truth stay English
except where noted.

Changes:
- 24 new dataset.{de,fr,hi,it,ko,ru}.jsonl files across all 4 tasks
- labels.ko.jsonl for memory_extraction: Korean ground-truth so the scorer
  compares Korean model output against Korean expected content instead of
  English (fixes ~20pp score gap identified during testing — see report)
- runner.py: loads labels.{lang}.jsonl when present, falls back to labels.jsonl
- orchestrator.py: adds --output-dir (writes <dir>/<lang>/YYYY-MM-DD-<host>.csv
  per language); --output single-file mode unchanged
- candidates.yaml: adds community tier (igorls classifier variants, heretic)
  and local tier (gemma4:e4b)
- translate_datasets.py: script used to generate the translations via Ollama;
  included so contributors can extend to new languages without manual work
- reports/2026-05-13-multilingual.md: 210-run benchmark report across
  6 models × 7 languages × 5 tasks on RTX 3080 Laptop 8 GB
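
The label lookup described for runner.py ("loads labels.{lang}.jsonl when present, falls back to labels.jsonl") can be sketched as below. The function name is a hypothetical stand-in, not the code in this commit.

```python
from pathlib import Path


def resolve_labels_path(task_dir: Path, language: str) -> Path:
    """Prefer language-specific ground truth, else the English labels file."""
    if language != "en":
        localized = task_dir / f"labels.{language}.jsonl"
        if localized.exists():
            return localized
    # English runs, and languages without translated labels, share this file.
    return task_dir / "labels.jsonl"


if __name__ == "__main__":
    task_dir = Path("benchmarks/model_eval/datasets/memory_extraction")
    for lang in ("en", "ko", "de"):
        print(lang, resolve_labels_path(task_dir, lang).name)
```

Only memory_extraction ships a translated label file (Korean) in this commit, so every other non-English run ends up on the English fallback.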
lealbrunocalhau and others added 3 commits May 14, 2026 00:11
…ted samples, KO labels

Addresses the review feedback from igorls, gemini-code-assist, and Copilot.

HIGH:
- orchestrator: --output single-file mode now shares ONE (fh, writer) across
  all languages instead of opening N handles to the same path. The old code
  caused interleaved buffer corruption: first language opened "w", subsequent
  ones opened "a", and writes from independent file offsets could overwrite
  each other. Verified with a multi-language --output smoke test (4 rows
  written, all distinct).
- 19 untranslated/empty samples re-translated:
  - dataset.de.jsonl: cal_017
  - dataset.hi.jsonl entity_extraction: ent_020, ent_025, ent_032, ent_038
  - dataset.hi.jsonl room_classification: rc_017, rc_026, rc_028, rc_040,
    rc_064, rc_089, rc_091
  - dataset.ko.jsonl room_classification: rc_027, rc_067
  - dataset.it.jsonl room_classification: rc_029, rc_030, rc_031, rc_032,
    rc_053 (previously empty strings)
- labels.ko.jsonl: restored all proper nouns to English (Doreth, Saela, Ivora,
  Ren Solanke, Pol Krisat, Pell Halloran, Bramble, Hollowmounts Institute,
  Wendelsea, Bridgewater Community Garden, Wends, Drukar, Aerwyn cycle,
  Jaccard, Mason bee, Markdown). Also fixed mistranslation 유전자 사과
  (genetic apple) → 재래종 사과 (heirloom apple).

MEDIUM:
- runner.py: refactored label-resolution one-liner into 3 readable lines
  and added an info log when falling back to English ground truth, so
  readers don't misread "score collapse" as model failure.

LOW:
- orchestrator: moved `import socket` to module top (PEP 8); removed
  unused `out_path` from the unpacking tuple.
- translate_datasets.py: renamed loop variable `l` → `code` (ruff E741);
  made the _translate_one fallback return path explicit instead of relying
  on for-loop fall-through; added a privacy warning in the docstring
  flagging that the default `kimi-k2.6:cloud` sends prose to a remote
  endpoint and should not be used over real palace data.
- 2026-05-13-multilingual.md: converted analytical paragraphs from
  Portuguese to English to match the existing repo convention.
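
For the HIGH item on `--output` single-file mode, a sketch of the "one (fh, writer) shared across all languages" shape. Column names other than `language` and the function name are assumptions made for the example.

```python
import csv

FIELDS = ["model", "task", "language", "score"]  # illustrative column set


def write_results(path: str, rows_by_language: dict[str, list[dict]]) -> int:
    """Write every language's rows through a single file handle and writer.

    Opening the file once avoids several handles with independent offsets
    writing to the same path.
    """
    written = 0
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        for language, rows in rows_by_language.items():
            for row in rows:
                writer.writerow({**row, "language": language})
                written += 1
    return written


if __name__ == "__main__":
    demo = {
        "en": [{"model": "m", "task": "calibration", "score": 0.96}],
        "pt-BR": [{"model": "m", "task": "calibration", "score": 0.94}],
    }
    print(write_results("demo-results.csv", demo), "rows written")
```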
…tended-languages

feat(benchmarks): add DE/FR/HI/IT/KO/RU datasets + --output-dir + translated labels
…rison

Without an explicit num_ctx, each candidate ran at its Modelfile default
(32k for the Gemma4 variants, larger for qwen3), so VRAM and latency
weren't comparable across families — a 32k-default model pre-allocates
KV cache a 4k-default model doesn't. The flag's own docstring promised
"apples-to-apples" but defaulted to None, defeating the intent.

All current benchmark prompts fit comfortably under 4k tokens
(memory_extraction is the longest at ~500). Users with longer prompts
can still pass --num-ctx <larger>.

Adds a methodology note to the 2026-05-13 multilingual report so its
VRAM/latency numbers aren't conflated with future runs at the new default.
igorls (Member, Author) commented May 14, 2026

Validation run: top-2 models × 7 languages with --num-ctx 4096

Ran the top-2 models from the 2026-05-13 report (igorls/gemma4-e4b-classifier:Q8_0 + qwen3:4b-instruct-2507-q8_0) across all 7 languages to validate Bruno's #1503 fixes against this PR's harness. Two methodology findings worth flagging here on the parent PR.

Setup: NVIDIA RTX 3090, Ollama 0.23.2. Different hardware from the report's Mercurio (RTX 3080 Laptop), so absolute latency numbers aren't comparable across runs — but accuracy and VRAM-per-model still are.

1. embeddinggemma largely solves the non-EN memory_extraction collapse

The current code default in runner.py:64 is _EMBED_MODEL = "embeddinggemma", but the 2026-05-13 report header says nomic-embed-text. So this run effectively re-tests the report's "critical methodology issue" with a different embedder.

Compare qwen3-4b-q8 memory_extraction:

| Run | EN | DE | FR | HI | IT | KO | RU |
|---|---|---|---|---|---|---|---|
| Report (nomic-embed-text) | 0.950 | 0.287 | 0.438 | 0.463 | 0.463 | 0.400 | 0.212 |
| This run (embeddinggemma) | 0.925 | 0.925 | 0.812 | 0.887 | 0.838 | 0.863 | 0.800 |

Same dataset, same labels.jsonl fallback for everything except KO. The "all non-EN scores collapse ~0.52-0.63 pp" finding in the report was mostly an embedding-model artifact, not a model-quality issue. The hypothesis in the methodology note ("re-run with embeddinggemma to separate scoring vs model effect") was correct.

2. --num-ctx 4096 brings VRAM into line with model size

Without an explicit --num-ctx, each candidate used its Modelfile default (32k for Gemma4 variants). ollama ps showed classifier-q8 at 9.7 GB resident at the start of this run. After restarting with --num-ctx 4096:

| Setting | classifier-q8 VRAM | qwen3-4b-q8 VRAM |
|---|---|---|
| Modelfile defaults (ollama ps) | 9.7 GB | (not measured this run) |
| --num-ctx 4096 (this run) | 8693 MB | 5196 MB |

~1 GB shaved off classifier-q8 by not pre-allocating 32k KV cache it never uses. Accuracy unaffected (all prompts fit comfortably under 4k). Pushed as commit 8591d13 on feat/benchmark-multilingual: --num-ctx now defaults to 4096 in both orchestrator.py and runner.py.

Latency presumably improved too, but I can't quantify it from this run alone — would need a same-hardware before/after on z690-ex-glacial to isolate from the hardware delta vs Mercurio.

Accuracy ranking (this run, EN baseline)

| task | classifier-q8 | qwen3-4b-q8 |
|---|---|---|
| room_classification closed | 0.622 | 0.564 |
| room_classification open | 0.652 | 0.576 |
| entity_extraction | 0.746 | 0.768 |
| memory_extraction | 0.855 | 0.864 |
| calibration | 0.964 | 0.950 |

Same overall picture as the report: classifier-q8 wins room tasks, qwen3-4b-q8 wins extraction tasks. On this hardware, qwen3-4b-q8 ran at e2e_p50 ~125 ms on room tasks vs classifier-q8's ~239 ms, with 5.2 GB VRAM vs 8.7 GB.

Follow-ups worth doing before this PR lands on develop

  • Update the 2026-05-13 report. Its "memory_extraction collapses non-EN" narrative is now misleading — the collapse was specific to nomic-embed-text and largely disappears with embeddinggemma. A short addendum noting this would prevent the report being quoted out of context after merge.
  • Reconcile embed-model default with the report. Either change the code default back to nomic-embed-text so it matches the report's header, or re-run the full 210-row matrix with embeddinggemma and replace the report. Right now a fresh run produces numbers that don't match the report.
  • The silent-fallback log Bruno added in #1503 (feat(benchmarks): add DE/FR/HI/IT/KO/RU datasets + --output-dir + translated labels) is working great: "info: language=de labels=labels.jsonl (no labels.de.jsonl found ...)" showed up for every non-EN+non-KO run on memory_extraction. Easy to spot which numbers are methodology-bound.

igorls (Member, Author) commented May 14, 2026

Embedding model deep-dive: production gap + winner

Following from the prior comment, dug into the embedder question all the way through. Headline: MemPalace's production embedder cannot find multilingual memories at all, and there's a drop-in zero-migration upgrade that fixes it. Details below.

The production gap (this is the real finding)

MemPalace ships with all-MiniLM-L6-v2 (ChromaDB default, English-only training, 384-dim, via ONNX in mempalace/embedding.py). Tested it on the multilingual datasets in this PR by computing cos(emb(EN_text), emb(translation)): same content, different language. A capable multilingual embedder gives ~0.85+; a monolingual one gives roughly random similarity:

| embedder | de | fr | hi | it | ko | ru | non-EN avg |
|---|---|---|---|---|---|---|---|
| all-MiniLM-L6-v2 (prod) | 0.435 | 0.488 | 0.284 | 0.361 | 0.333 | 0.169 | 0.345 |
| nomic-v2-moe | 0.829 | 0.820 | 0.773 | 0.810 | 0.731 | 0.778 | 0.790 |
| bge-m3 (no prefix) | 0.888 | 0.885 | 0.867 | 0.883 | 0.850 | 0.866 | 0.873 |
| embeddinggemma + sim prefix | 0.897 | 0.899 | 0.884 | 0.868 | 0.864 | 0.879 | 0.882 |

RU at 0.169 means a Russian conversation and its identical English translation embed to nearly orthogonal vectors. A Russian-speaking MemPalace user effectively cannot find their own stored memories. Same story (less severe) for Korean, Hindi, Italian, French, German.

This breaks the "100% recall is the design requirement" promise from CLAUDE.md for any non-English content.
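
For anyone reproducing the table, a sketch of the measurement itself: mean cosine between the embedding of an English text and the embedding of its translation. `embed` is a caller-supplied function (hypothetical here); nothing below talks to Ollama or ONNX.

```python
import math
from typing import Callable, Iterable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_parallel_similarity(embed: Callable[[str], Vector],
                             pairs: Iterable[tuple[str, str]]) -> float:
    """Average cos(embed(en), embed(translation)) over parallel pairs."""
    sims = [cosine(embed(en), embed(tr)) for en, tr in pairs]
    if not sims:
        raise ValueError("no pairs given")
    return sum(sims) / len(sims)


if __name__ == "__main__":
    # Toy 2-d "embedder" purely for demonstration.
    toy = {"hello": [1.0, 0.0], "ola": [1.0, 0.0],
           "cat": [1.0, 0.0], "gato": [0.0, 1.0]}
    print(mean_parallel_similarity(toy.__getitem__,
                                   [("hello", "ola"), ("cat", "gato")]))
```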

The fix: embeddinggemma-300m-ONNX at 384-dim via MRL

Validated the ONNX port (onnx-community/embeddinggemma-300m-ONNX) directly against Ollama's gguf to confirm production-format parity:

| lang | Ollama gguf 768d | ONNX q8 768d | ONNX q8 384d (MRL) | Δ q8 vs gguf |
|---|---|---|---|---|
| de | 0.897 | 0.897 | 0.902 | -0.001 |
| fr | 0.899 | 0.899 | 0.904 | +0.000 |
| hi | 0.884 | 0.883 | 0.893 | -0.001 |
| it | 0.868 | 0.866 | 0.889 | -0.002 |
| ko | 0.864 | 0.864 | 0.879 | -0.000 |
| ru | 0.879 | 0.879 | 0.888 | +0.000 |
| avg | 0.882 | 0.881 | 0.893 | -0.000 |

Two surprises worth noting:

  1. q8 ONNX is lossless vs gguf (max delta 0.002). No quantization-loss risk. Ships at ~300 MB.
  2. 384-dim MRL truncation outperforms full 768-dim (0.893 vs 0.881). Known Matryoshka property: the first dims are trained to carry the most semantically dense signal. So we keep ChromaDB's 384-dim collections and get higher quality than full 768d.
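
Mechanically, the MRL step is just "keep the first 384 dims, then re-normalize". A dependency-free sketch, with a hypothetical function name and plain lists standing in for the model output:

```python
import math
from typing import Sequence


def mrl_truncate(vec: Sequence[float], dim: int = 384) -> list[float]:
    """Keep the first `dim` components and L2-normalize the result.

    Re-normalizing matters: a prefix of a unit vector is generally shorter
    than unit length, which would skew dot-product style distances.
    """
    if len(vec) < dim:
        raise ValueError(f"vector has {len(vec)} dims, need at least {dim}")
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head
    return [x / norm for x in head]


if __name__ == "__main__":
    full = [0.5] * 768  # stand-in for a 768-d model output
    short = mrl_truncate(full)
    print(len(short), round(sum(x * x for x in short), 6))
```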

Net change for production: all-MiniLM-L6-v2 → embeddinggemma-300m-ONNX q8 @ 384d (MRL truncated). Same dim, no schema change, no re-index. +210 MB on disk. +0.548 absolute non-EN similarity.

Methodology notes worth bundling here

  • The prefix matters. embeddinggemma without prefix scores 0.829 cos; with "task: sentence similarity | query: " it jumps to 0.882. The current runner.py doesn't apply prefixes, so the benchmark mildly under-counts embeddinggemma's quality. Worth a small follow-up to add an embedder-specific prefix table.
  • e5-small couldn't be benchmarked. Both Ollama community ports tried (qllama/, jeffh/intfloat-) crash with EOF during embedding (known issue with some community GGUF embedding ports). embeddinggemma at 384-dim via MRL covers e5-small's main appeal (zero re-index) anyway.
  • bge-m3 is the close runner-up. 0.873 vs embeddinggemma's 0.882, MIT license vs Gemma (custom), no prefix needed. Real backup if the Gemma license becomes a blocker for distribution.
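
One possible shape for the embedder-specific prefix table suggested in the first note. Only the embeddinggemma prefix string comes from this comment; the table keys, the empty-string default, and the helper name are assumptions.

```python
# Only the embeddinggemma prefix is taken from the comment above; other
# embedders default to no prefix (bge-m3 is reported to need none).
EMBED_PREFIXES: dict[str, str] = {
    "embeddinggemma": "task: sentence similarity | query: ",
}


def apply_prefix(model: str, text: str) -> str:
    """Prepend the model's similarity prefix, if it has one."""
    return EMBED_PREFIXES.get(model, "") + text


if __name__ == "__main__":
    print(repr(apply_prefix("embeddinggemma", "The garden opens at nine.")))
    print(repr(apply_prefix("bge-m3", "The garden opens at nine.")))
```

A real table would likely need to normalize tags (for example stripping a `:latest` suffix) before lookup; that is left out here.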

Implemented

Initial implementation landed on this branch as commit 51702e9:

  • New EmbeddinggemmaONNX class in mempalace/embedding.py (lazy hf_hub_download, onnxruntime inference, sim prefix, MRL→384d, L2-normalized).
  • New MEMPALACE_EMBEDDING_MODEL env var. Default stays minilm for back-compat. Opt-in embeddinggemma.
  • New [multilingual] extra in pyproject (huggingface_hub + tokenizers + numpy). Core deps unchanged.
  • 9 existing test_embedding.py tests + 120 config/embedding-touched tests pass. Lint clean.

Follow-ups still open: offline tests for the new EF (mock hf_hub_download), docs note on running mempalace repair rebuild-index after switching, and a friendlier startup warning when ChromaDB rejects reads from an EF-name mismatch.

igorls added 3 commits May 14, 2026 04:41
…embedder

MemPalace's default embedder (all-MiniLM-L6-v2) is English-only-trained.
Cross-lingual cosine similarity on parallel-translated text averages 0.35
across DE/FR/HI/IT/KO/RU — vs 0.88 for embeddinggemma-300m ONNX (q8) with
the semantic-similarity prefix. RU is the worst at 0.17, meaning a Russian
memory and its identical English translation embed to nearly orthogonal
vectors. Multilingual users effectively cannot retrieve their own memories.

This commit adds embeddinggemma-300m as an opt-in alternative:

* New EmbeddinggemmaONNX class implementing ChromaDB's EF protocol.
  Lazy-downloads model_quantized.onnx (~300 MB) via huggingface_hub on
  first use; cached under ~/.cache/huggingface/. Applies the sim prefix,
  runs onnxruntime inference, truncates to 384 dims via Matryoshka
  (MRL), L2-normalizes.

* MRL truncation to 384d is intentional: matches MiniLM's vector width
  so collection schemas don't change, and validation showed 384d MRL
  actually outperforms full 768d on these similarity tasks (0.893 vs
  0.881 avg) — known property of MRL training.

* MEMPALACE_EMBEDDING_MODEL env (default "minilm" for back-compat).
  Switching models on an existing palace requires re-embedding —
  ChromaDB rejects reads with a mismatched EF name. Run
  `mempalace repair rebuild-index` after changing the value.

* New optional dep group: pip install mempalace[multilingual]
  Adds huggingface_hub + tokenizers + numpy. Core deps unchanged.

ONNX q8 validated lossless vs the Ollama gguf benchmarked previously
(max delta 0.002 cos across 240 parallel pairs).
Three follow-ups bundled for the embeddinggemma EF added in 51702e9:

1. Offline tests for EmbeddinggemmaONNX (10 tests, 0.08s, no network).
   Mocks huggingface_hub.hf_hub_download, tokenizers.Tokenizer.from_file,
   and onnxruntime.InferenceSession so CI never pulls the 300 MB model.
   Guarded with pytest.importorskip so the file is skipped when the
   multilingual extra isn't installed. Covers: stable name(), lazy-load
   runs exactly once, output shape (n, 384) after MRL truncation, L2
   normalization, sim prefix applied, dispatch from
   get_embedding_function(model="embeddinggemma"), cache key separates
   models, helpful ImportError when deps missing, env override.

2. Friendlier ChromaDB EF-name-mismatch error. Switching
   MEMPALACE_EMBEDDING_MODEL on an existing palace previously surfaced
   ChromaDB's bare "Embedding function conflict: new: X vs persisted: Y"
   ValueError. Now ChromaBackend.get_collection() wraps that error and
   points users at the two recovery paths: revert the env var, or run
   `mempalace repair rebuild-index --palace <path>`. New
   _explain_ef_mismatch helper + 3 tests (unit + end-to-end).

3. Docs: CHANGELOG [Unreleased] entry covers both the new EF and the
   error wrapper. README Requirements section mentions the multilingual
   extra and points at the embedding.py docstring for the migration note.
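
A sketch of the error-wrapping idea in item 2, assuming the conflict can be recognized by the "Embedding function conflict" text quoted above. The helper names and message wording are illustrative, not the actual `_explain_ef_mismatch` implementation.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def explain_ef_mismatch(exc: ValueError, palace_path: str) -> Optional[str]:
    """Return a friendlier message for an EF-name conflict, else None."""
    if "Embedding function conflict" not in str(exc):
        return None
    return (
        f"{exc}\n"
        "The palace was indexed with a different embedding model. Either "
        "revert MEMPALACE_EMBEDDING_MODEL to its previous value, or run:\n"
        f"  mempalace repair rebuild-index --palace {palace_path}"
    )


def get_collection_safely(fetch: Callable[[], T], palace_path: str) -> T:
    """Call `fetch`, rewriting only the EF-mismatch ValueError."""
    try:
        return fetch()
    except ValueError as exc:
        hint = explain_ef_mismatch(exc, palace_path)
        if hint is None:
            raise  # unrelated ValueError: leave untouched
        raise ValueError(hint) from exc


if __name__ == "__main__":
    def fake_fetch():
        raise ValueError("Embedding function conflict: new: X vs persisted: Y")

    try:
        get_collection_safely(fake_fetch, "~/palace")
    except ValueError as err:
        print(err)
```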
Onboarding now asks the user once, on first run, whether to use the
multilingual embedding model. The default answer is yes — defaulting to
English-only made the recall promise effectively unreachable for any
non-English content (cross-lingual cos ~0.35 vs ~0.88 for the multilingual
model). The choice is written to config.json so subsequent runs pick the
right EF without re-prompting; existing installs that never set the env
var or ran onboarding stay on minilm for back-compat. MEMPALACE_EMBEDDING_MODEL
still overrides both.

Multilingual deps (huggingface_hub, tokenizers, numpy) move from the
[multilingual] extra into core. The extra is kept as a no-op alias so
existing install scripts keep working. The 300 MB ONNX model is still
lazy-downloaded on first use, not at install time.

`quick_setup` (the programmatic non-interactive path) grows an optional
`embedding_model` arg so tests and benchmark scripts can pick a model
without writing config.json by accident.

EmbeddinggemmaONNX's "missing deps" error now points at the right
recovery path (reinstall mempalace, since the deps are core) rather
than the obsolete pip install mempalace[multilingual] hint.

Tests: 9 new (3 _ask_embedding_model variants + 2 run_onboarding
persistence + 2 quick_setup + 2 set_embedding_model round-trips). The
existing 2 run_onboarding tests now patch _ask_embedding_model so they
don't print to stdout.
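
The precedence described in this commit (env var over the persisted onboarding choice over the minilm back-compat default) can be sketched as follows. The `embedding_model` config key and the function name are assumptions about config.json, not confirmed by the commit text.

```python
import os
from typing import Mapping, Optional

ENV_VAR = "MEMPALACE_EMBEDDING_MODEL"
LEGACY_DEFAULT = "minilm"  # installs that never ran onboarding stay here


def resolve_embedding_model(config: Mapping[str, str],
                            environ: Optional[Mapping[str, str]] = None) -> str:
    """Pick the embedding model: env var, then config.json, then legacy default."""
    env = os.environ if environ is None else environ
    return env.get(ENV_VAR) or config.get("embedding_model") or LEGACY_DEFAULT


if __name__ == "__main__":
    print(resolve_embedding_model({}, environ={}))
    print(resolve_embedding_model({"embedding_model": "embeddinggemma"}, environ={}))
    print(resolve_embedding_model({"embedding_model": "embeddinggemma"},
                                  environ={ENV_VAR: "minilm"}))
```

Passing `environ` explicitly keeps the function testable without mutating the process environment.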
@igorls igorls added enhancement New feature or request area/i18n Multilingual, Unicode, non-English embeddings labels May 15, 2026
igorls added 3 commits May 18, 2026 17:58
Resolve conflicts:
- backends/chroma.py: keep both new except handlers in get_collection
  (CollectionNotInitializedError from develop + EF-mismatch helper from
  this branch), ordered _ChromaNotFoundError before ValueError to match
  the create-branch handler order.
- uv.lock: regenerated from merged pyproject.toml.

Fix lint: ruff format mempalace/embedding.py + tests/test_embeddinggemma.py
(CI now pins ruff==0.15.9 via develop's workflow update).

Full suite: 1923 passed, 1 skipped.
Resolves conflicts in CHANGELOG.md and pyproject.toml by combining
the multilingual-embedder additions (huggingface_hub/tokenizers/numpy
core deps, [multilingual] alias, Features section) with develop's
additions (python-dateutil core dep, [extract] extra, tunnel Bug
Fixes and Internal sections).

Prepares PR #1483 for merge into v3.3.6.
@igorls igorls merged commit df36eb3 into develop May 24, 2026
7 checks passed
@igorls igorls mentioned this pull request May 24, 2026
3 tasks
arnoldwender pushed a commit to arnoldwender/mempalace that referenced this pull request May 24, 2026
Bumps version 3.3.5 → 3.3.6 across pyproject.toml, version.py, plugin
manifests (.claude-plugin/plugin.json, .claude-plugin/marketplace.json,
.codex-plugin/plugin.json), README badge, and uv.lock. Flips CHANGELOG.md
from ``[Unreleased]`` to ``[3.3.6] — 2026-05-24`` and backfills the
major user-facing entries that landed without changelog entries during
the cycle:

Features:
- MemPalace#1555 office-document mining via --mode extract + virtual line numbers
- MemPalace#1584 surgical closet pointers with date+line locators (Tier 6a)
- MemPalace#1558 + MemPalace#1560 within-wing hallways (entity co-occurrence graph)
- MemPalace#1565 cross-wing tunnels auto-promoted from hallways
- MemPalace#1578 Hebbian potentiation + Ebbinghaus decay on hallways/tunnels
- MemPalace#1236 API-tool transcripts auto-route to wing_api
- MemPalace#711 hooks.auto_save toggle for silent-mode sessions
- MemPalace#1605 COCA content-word filter for entity detection
- MemPalace#1557 case-insensitive entity matching at mine time
- MemPalace#1483 multilingual embeddings (embeddinggemma-300m) by default

Bug Fixes (selected, user-visible):
- MemPalace#1540 silent data loss in three unchunked upsert sites
- MemPalace#1538 paragraph chunker oversized chunks
- MemPalace#1554 per-file chunk cap too low for transcripts
- MemPalace#1562 Windows hook subprocess/ChromaDB deadlock
- MemPalace#1529 create_tunnel corrupted hyphenated wing names
- MemPalace#1424 save-hook truncated hyphenated project folders
- MemPalace#1383 KG cache duplicated graphs for symlinked/cased paths
- MemPalace#1466 silent symlink skip now logged
- MemPalace#1441 macOS stock-bash 3.2 hook compatibility
- MemPalace#1500 / MemPalace#1513 structured JSON-RPC errors on bad MCP input
- MemPalace#1523 VACUUM + FTS5 rebuild after repair
- MemPalace#1548 FTS5 validation at end of mine
- plus MemPalace#1216, MemPalace#1408, MemPalace#1438, MemPalace#1439, MemPalace#1445, MemPalace#1452, MemPalace#1459, MemPalace#1461, MemPalace#1466,
  MemPalace#1470, MemPalace#1477, MemPalace#1485, MemPalace#1500, MemPalace#1513, MemPalace#1528, MemPalace#1532, MemPalace#1543, MemPalace#1546, MemPalace#1585

Performance:
- MemPalace#1474 convo miner pre-fetches mined-set
- MemPalace#1487 rebuild_index progress callback
- MemPalace#1530 MCP cold-start diagnostics + opt-in warmup

Lint passes (ruff 0.15.14); mempalace-mcp entry point alignment
verified per RELEASING.md.
Labels

area/i18n Multilingual, Unicode, non-English embeddings enhancement New feature or request


3 participants