feat(benchmarks): multilingual datasets + parity controls (embed model, num_ctx, language) by igorls · Pull Request #1483 · MemPalace/mempalace

igorls · 2026-05-12T22:01:47Z

Summary

Adds multilingual benchmarking to benchmarks.model_eval so shipping decisions reflect non-English users. New --language / --languages flags load dataset.{lang}.jsonl alongside dataset.jsonl; pt-BR / es / zh translations included (633 samples across 4 tasks). Inputs translated; proper nouns and labels stay English so cross-lingual scoring works without re-translating ground truth.
Adds --num-ctx to force options.num_ctx per Ollama request, overriding the model's Modelfile default. Required for apples-to-apples VRAM/TPS comparison across candidates with mismatched defaults (e.g. qwen3:4b-instruct-2507-q8_0 defaults to 32k = 9.7 GB resident; at 8k it's 5.6 GB).
Adds --embed-model with default flipped to embeddinggemma. The v1 nomic-embed-text cosine on EN↔PT-BR same-meaning pairs lands at ~0.607 (right at the 0.6 match threshold), so any phrasing drift collapses to false-negative. embeddinggemma lands ~0.766 with 2.7× the signal/noise spread. PT-BR memory_extraction recovered from 0.150 → 0.850 on the same model outputs after the swap — the previous score was a methodology artifact, not a model regression.
load_candidates() now synthesizes entries for tags not in candidates.yaml, so ad-hoc tags (e.g. igorls/gemma4-e4b-classifier:Q4_K_M) work via --candidates without yaml edits.
CSV gains a language column.
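As a rough illustration of the --num-ctx behavior described above, the sketch below builds the JSON body for Ollama's /api/generate endpoint and sets options.num_ctx only when a value is given. The function name build_generate_payload is hypothetical and is not the code in mempalace/llm_client.py.

```python
from typing import Any, Optional


def build_generate_payload(model: str, prompt: str,
                           num_ctx: Optional[int] = None) -> dict:
    """Build a request body for Ollama's POST /api/generate.

    Hypothetical helper for illustration; the real provider code may differ.
    """
    payload: dict[str, Any] = {"model": model, "prompt": prompt, "stream": False}
    if num_ctx is not None:
        # A per-request option overrides the num_ctx baked into the Modelfile.
        payload["options"] = {"num_ctx": num_ctx}
    return payload


forced = build_generate_payload("qwen3:4b-instruct-2507-q8_0", "hello", num_ctx=8192)
default = build_generate_payload("qwen3:4b-instruct-2507-q8_0", "hello")
```

Leaving the key out when num_ctx is None keeps the "no behavioral change unless the new kwarg is set" property mentioned in the PR description.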

Why now

We needed to pick between igorls/gemma4-e4b-classifier:Q4_K_M and qwen3:4b-instruct-2507-q8_0 for MemPalace's 8 GB-VRAM tier. The English-only harness at n=30 couldn't separate them (deltas 1-6 points = noise-adjacent). The multilingual matrix (4 models × 5 tasks × 4 languages = 80 runs) produced clear signal: Gemma family owns open-set room cls in every language, Qwen wins entity by 5-7 points everywhere, classifier matches official Gemma within noise on every cell. The methodology fixes here are what made that signal trustworthy.

What this does NOT change

No production code touched outside the benchmark harness and mempalace/llm_client.py (added a num_ctx kwarg + **provider_kwargs forwarding in get_provider; no behavioral change unless the new kwarg is set).
Default --embed-model change does affect score numbers vs historical EN-only runs. The relative ranking is unchanged but absolute coverage/similarity scores will be slightly higher under the new default. Worth flagging when comparing new results to pre-this-PR baselines.

Test plan

uv run python -m benchmarks.model_eval.orchestrator --candidates qwen3:4b-instruct-2507-q4_K_M --tasks all --languages en,pt-BR --num-ctx 8192 --dataset-dir benchmarks/model_eval/datasets --output /tmp/pr-smoke.csv produces 10 rows, language column populated, no errors
Verify --num-ctx 8192 is actually applied: curl -s localhost:11434/api/ps shows context_length: 8192 for a model whose Modelfile default is higher
Verify --embed-model override: run with --embed-model nomic-embed-text and confirm scores match historical EN baselines on memory_extraction
Multilingual datasets load and produce non-zero accuracy on all 3 languages × 4 tasks
Existing single-language runs (no --languages flag) still produce identical output as before this PR
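The expected row count in the smoke test follows from the run matrix: one row per (candidate, task, language) combination. A minimal sketch, assuming the five task names used in the accuracy table later in this thread (the harness's internal task identifiers may differ):

```python
from itertools import product

# Assumed task names, taken from the accuracy table in this thread.
tasks = [
    "room_classification closed",
    "room_classification open",
    "entity_extraction",
    "memory_extraction",
    "calibration",
]
candidates = ["qwen3:4b-instruct-2507-q4_K_M"]
languages = ["en", "pt-BR"]

# One CSV row per combination, each carrying its language.
rows = [
    {"model": model, "task": task, "language": language}
    for model, task, language in product(candidates, tasks, languages)
]
print(len(rows))  # → 10
```

The same product with 4 models, 5 tasks and 4 languages gives the 80 runs quoted under "Why now".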

…l, num_ctx, language)

Enables shipping decisions for non-English users and fair comparison across candidates whose Modelfile defaults disagree.

- --language / --languages: load dataset.{lang}.jsonl alongside the base dataset.jsonl. CSV gains a language column. Synthesized candidate entries let ad-hoc model tags run without editing candidates.yaml.
- --num-ctx: force Ollama options.num_ctx per request, overriding the model's Modelfile default. Required for apples-to-apples VRAM/TPS (qwen3:4b-q8 defaults to 32k = 9.7 GB resident; at 8k it's 5.6 GB).
- --embed-model: thread the semantic-similarity embedding model through scoring. Default flips to embeddinggemma (was nomic-embed-text v1). Reason: v1 cosine on EN<->PT-BR same-meaning pairs sits at ~0.607 (right at the 0.6 match threshold), so any phrasing drift collapses to false-negative. embeddinggemma lands ~0.766 with 2.7x the signal/noise spread. PT-BR memory_extraction recovered 0.15 -> 0.85 on the same outputs after the swap.

Datasets: 12 new files (pt-BR/es/zh x 4 tasks, 633 samples). Input text translated; proper nouns and labels stay English so cross-lingual scoring against the existing labels.jsonl works without re-translation.

gemini-code-assist

Code Review

This pull request introduces multi-language support for model evaluation by adding Spanish, Portuguese, and Chinese datasets and updating the orchestrator to process these variants. It also enhances the evaluation pipeline with support for multiple LLM providers, configurable embedding models, and context window overrides. A review comment correctly identified that the --embed-endpoint argument does not implement the defaulting logic described in its help text, which could lead to configuration issues when using remote Ollama instances.

gemini-code-assist · 2026-05-12T22:04:21Z

+    parser.add_argument("--embed-endpoint", default="http://localhost:11434",
+                        help="Endpoint for the embedding model (always Ollama). Defaults to --endpoint when using ollama provider.")


The help text for --embed-endpoint states that it defaults to --endpoint when using the ollama provider, but this logic is not implemented. Currently, it always defaults to http://localhost:11434 regardless of the LLM endpoint configuration, which will cause issues when benchmarking remote Ollama instances.

Suggested change

-    parser.add_argument("--embed-endpoint", default="http://localhost:11434",
-                        help="Endpoint for the embedding model (always Ollama). Defaults to --endpoint when using ollama provider.")
+    parser.add_argument("--embed-endpoint", default=None,
+                        help="Endpoint for the embedding model (always Ollama). Defaults to --endpoint when using ollama provider, otherwise http://localhost:11434.")
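The suggested default of None only helps if the value is resolved after parsing. A minimal sketch of that resolution step follows; the defaults shown for --llm-provider and --endpoint, and the helper name resolve_embed_endpoint, are assumptions rather than the harness's actual code.

```python
import argparse

DEFAULT_OLLAMA = "http://localhost:11434"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Defaults for these two flags are assumed for the sake of the example.
    parser.add_argument("--llm-provider", default="ollama")
    parser.add_argument("--endpoint", default=DEFAULT_OLLAMA)
    parser.add_argument("--embed-endpoint", default=None)
    return parser


def resolve_embed_endpoint(args: argparse.Namespace) -> str:
    """Pick the embedding endpoint once all flags are known."""
    if args.embed_endpoint is not None:
        return args.embed_endpoint      # explicit value always wins
    if args.llm_provider == "ollama":
        return args.endpoint            # score against the same Ollama host
    return DEFAULT_OLLAMA               # non-Ollama LLM: embed locally


remote = build_parser().parse_args(["--endpoint", "http://gpu-box:11434"])
print(resolve_embed_endpoint(remote))  # → http://gpu-box:11434
```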

Copilot

Pull request overview

Adds multilingual benchmarking support to the benchmarks.model_eval harness and introduces parity controls so benchmark runs can be compared fairly across models/providers and context-window defaults.

Changes:

Add --language/--languages dataset selection and record language in JSON/CSV outputs.
Add --embed-model (default now embeddinggemma) for embedding-scored tasks, and --num-ctx to override Ollama options.num_ctx per request.
Allow --candidates to accept ad-hoc model tags not present in candidates.yaml (synthesized candidate entries).
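The ad-hoc tag support can be pictured as a lookup with a synthesized fallback. This is only a sketch: the entry fields ("tag", "tier", "synthesized") and the function name select_candidates are hypothetical, since the schema of candidates.yaml is not shown in this thread.

```python
def select_candidates(requested: list[str], known: dict[str, dict]) -> list[dict]:
    """Return one entry per requested tag, synthesizing any that are unknown."""
    selected = []
    for tag in requested:
        entry = known.get(tag)
        if entry is None:
            # Tag is not listed in the YAML: build a minimal entry on the fly.
            entry = {"tag": tag, "tier": "adhoc", "synthesized": True}
        selected.append(entry)
    return selected


known = {"qwen3:4b-instruct-2507-q8_0": {"tag": "qwen3:4b-instruct-2507-q8_0", "tier": "local"}}
picked = select_candidates(
    ["qwen3:4b-instruct-2507-q8_0", "igorls/gemma4-e4b-classifier:Q4_K_M"], known
)
```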

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.


File	Description
mempalace/llm_client.py	Adds `num_ctx` support to `OllamaProvider` and forwards extra provider kwargs via `get_provider()`.
benchmarks/model_eval/runner.py	Adds language selection, embed model/endpoint plumbing, provider selection, and `num_ctx` forwarding into the single-run harness.
benchmarks/model_eval/orchestrator.py	Extends matrix runner to iterate languages, add `language` CSV column, embed model/endpoint flags, and ad-hoc candidate tag support.
benchmarks/model_eval/datasets/room_classification/dataset.zh.jsonl	Adds Chinese room_classification dataset variant.
benchmarks/model_eval/datasets/room_classification/dataset.pt-BR.jsonl	Adds pt-BR room_classification dataset variant.
benchmarks/model_eval/datasets/room_classification/dataset.es.jsonl	Adds Spanish room_classification dataset variant.
benchmarks/model_eval/datasets/memory_extraction/dataset.zh.jsonl	Adds Chinese memory_extraction dataset variant.
benchmarks/model_eval/datasets/memory_extraction/dataset.pt-BR.jsonl	Adds pt-BR memory_extraction dataset variant.
benchmarks/model_eval/datasets/memory_extraction/dataset.es.jsonl	Adds Spanish memory_extraction dataset variant.
benchmarks/model_eval/datasets/entity_extraction/dataset.zh.jsonl	Adds Chinese entity_extraction dataset variant.
benchmarks/model_eval/datasets/entity_extraction/dataset.pt-BR.jsonl	Adds pt-BR entity_extraction dataset variant.
benchmarks/model_eval/datasets/entity_extraction/dataset.es.jsonl	Adds Spanish entity_extraction dataset variant.
benchmarks/model_eval/datasets/calibration/dataset.zh.jsonl	Adds Chinese calibration dataset variant.
benchmarks/model_eval/datasets/calibration/dataset.pt-BR.jsonl	Adds pt-BR calibration dataset variant.
benchmarks/model_eval/datasets/calibration/dataset.es.jsonl	Adds Spanish calibration dataset variant.


    task_dir = dataset_dir / task
-    samples = load_jsonl(task_dir / "dataset.jsonl")
+    dataset_file = "dataset.jsonl" if language == "en" else f"dataset.{language}.jsonl"
+    dataset_path = task_dir / dataset_file
+    if not dataset_path.exists():


        strip_thinking=not args.no_strip_thinking,
+        llm_provider=args.llm_provider,
+        language=args.language,
+        embed_model=args.embed_model,
+        num_ctx=args.num_ctx,


… defaulting

Addresses Copilot + gemini-code-assist review on #1483.

1. Path-traversal guard for --language. The value is interpolated into the dataset filename (`dataset.{language}.jsonl`), so unvalidated input could escape `task_dir`. Now:
   - regex `^[A-Za-z][A-Za-z0-9]*(?:[_-][A-Za-z0-9]+)?$` accepts en, pt-BR, zh-CN, fr_CA, etc. and rejects anything with path separators or `..`
   - belt-and-suspenders `Path.resolve().is_relative_to(task_dir)` check before opening the file
2. --embed-endpoint now defaults to None and is resolved after parsing: uses --endpoint when --llm-provider=ollama (so remote benchmark runs score against the same host), else http://localhost:11434. Help text now matches behavior. runner.py's CLI was also missing the flag entirely; it was added so single-task runs honor remote endpoints.
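The two-layer guard from this commit message can be sketched as follows. The regex is the one quoted above; the function name dataset_path_for and the error messages are illustrative and not taken from runner.py.

```python
import re
from pathlib import Path

# Regex quoted in the commit message. fullmatch() is used so that a trailing
# newline cannot slip past the `$` anchor.
LANGUAGE_RE = re.compile(r"^[A-Za-z][A-Za-z0-9]*(?:[_-][A-Za-z0-9]+)?$")


def dataset_path_for(task_dir: Path, language: str) -> Path:
    """Map a language code to its dataset file, refusing anything outside task_dir."""
    if not LANGUAGE_RE.fullmatch(language):
        raise ValueError(f"invalid language code: {language!r}")
    name = "dataset.jsonl" if language == "en" else f"dataset.{language}.jsonl"
    base = task_dir.resolve()
    candidate = (base / name).resolve()
    # Second layer: even a code that passed the regex must stay under task_dir.
    if not candidate.is_relative_to(base):  # Path.is_relative_to needs Python 3.9+
        raise ValueError(f"dataset path escapes {base}: {candidate}")
    return candidate


print(dataset_path_for(Path("datasets/calibration"), "pt-BR").name)  # → dataset.pt-BR.jsonl
```

The regex alone already rejects `/` and `..`; the resolved-path check is there in case the filename template changes later.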

…nslated labels

Adds 6 new language datasets (German, French, Hindi, Italian, Korean, Russian) across all 4 benchmark tasks (calibration, entity_extraction, memory_extraction, room_classification), 630 samples total, same conventions as the existing pt-BR/es/zh datasets: inputs translated, labels/ground-truth stay English except where noted.

Changes:
- 24 new dataset.{de,fr,hi,it,ko,ru}.jsonl files across all 4 tasks
- labels.ko.jsonl for memory_extraction: Korean ground-truth so the scorer compares Korean model output against Korean expected content instead of English (fixes ~20pp score gap identified during testing; see report)
- runner.py: loads labels.{lang}.jsonl when present, falls back to labels.jsonl
- orchestrator.py: adds --output-dir (writes <dir>/<lang>/YYYY-MM-DD-<host>.csv per language); --output single-file mode unchanged
- candidates.yaml: adds community tier (igorls classifier variants, heretic) and local tier (gemma4:e4b)
- translate_datasets.py: script used to generate the translations via Ollama; included so contributors can extend to new languages without manual work
- reports/2026-05-13-multilingual.md: 210-run benchmark report across 6 models × 7 languages × 5 tasks on RTX 3080 Laptop 8 GB
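The label lookup described for runner.py amounts to "prefer the localized file, otherwise fall back and say so". A minimal sketch, with a hypothetical function name and a log message modeled on the line quoted later in this thread:

```python
import logging
from pathlib import Path

logger = logging.getLogger("model_eval")


def resolve_labels_path(task_dir: Path, language: str) -> Path:
    """Prefer labels.{language}.jsonl; otherwise use the English labels.jsonl."""
    fallback = task_dir / "labels.jsonl"
    if language == "en":
        return fallback
    localized = task_dir / f"labels.{language}.jsonl"
    if localized.exists():
        return localized
    # Make the fallback visible so a low score is not mistaken for model failure.
    logger.info("language=%s labels=%s (no %s found)",
                language, fallback.name, localized.name)
    return fallback
```

With only labels.jsonl and labels.ko.jsonl on disk, "ko" resolves to the Korean file and every other non-English code resolves to labels.jsonl with an info log.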

…ted samples, KO labels

Addresses the review feedback from igorls, gemini-code-assist, and Copilot.

HIGH:
- orchestrator: --output single-file mode now shares ONE (fh, writer) across all languages instead of opening N handles to the same path. The old code caused interleaved buffer corruption: first language opened "w", subsequent ones opened "a", and writes from independent file offsets could overwrite each other. Verified with a multi-language --output smoke test (4 rows written, all distinct).
- 19 untranslated/empty samples re-translated:
  - dataset.de.jsonl: cal_017
  - dataset.hi.jsonl entity_extraction: ent_020, ent_025, ent_032, ent_038
  - dataset.hi.jsonl room_classification: rc_017, rc_026, rc_028, rc_040, rc_064, rc_089, rc_091
  - dataset.ko.jsonl room_classification: rc_027, rc_067
  - dataset.it.jsonl room_classification: rc_029, rc_030, rc_031, rc_032, rc_053 (previously empty strings)
- labels.ko.jsonl: restored all proper nouns to English (Doreth, Saela, Ivora, Ren Solanke, Pol Krisat, Pell Halloran, Bramble, Hollowmounts Institute, Wendelsea, Bridgewater Community Garden, Wends, Drukar, Aerwyn cycle, Jaccard, Mason bee, Markdown). Also fixed mistranslation 유전자 사과 (genetic apple) → 재래종 사과 (heirloom apple).

MEDIUM:
- runner.py: refactored label-resolution one-liner into 3 readable lines and added an info log when falling back to English ground truth, so readers don't misread "score collapse" as model failure.

LOW:
- orchestrator: moved `import socket` to module top (PEP 8); removed unused `out_path` from the unpacking tuple.
- translate_datasets.py: renamed loop variable `l` → `code` (ruff E741); made the _translate_one fallback return path explicit instead of relying on for-loop fall-through; added a privacy warning in the docstring flagging that the default `kimi-k2.6:cloud` sends prose to a remote endpoint and should not be used over real palace data.
- 2026-05-13-multilingual.md: converted analytical paragraphs from Portuguese to English to match the existing repo convention.
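The HIGH fix for --output comes down to opening the CSV once and passing the same writer to every language. A minimal sketch of that pattern, using an in-memory buffer; the column names other than language, and the dummy rows, are placeholders rather than the orchestrator's real schema or results.

```python
import csv
import io

FIELDS = ["model", "task", "language", "score"]  # assumed columns


def write_results(fh, results_by_language: dict) -> None:
    """Write every language's rows through one shared writer on one handle."""
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    writer.writeheader()
    for language, rows in results_by_language.items():
        for row in rows:
            writer.writerow({**row, "language": language})


# Dummy rows purely for illustration (not benchmark results).
buf = io.StringIO()
write_results(buf, {
    "en": [{"model": "model-a", "task": "calibration", "score": 0.5}],
    "de": [{"model": "model-a", "task": "calibration", "score": 0.5}],
})
```

For a real file the handle would come from a single `open(path, "w", newline="")` wrapped around the whole multi-language loop, so no second handle ever points at the same path.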

…tended-languages feat(benchmarks): add DE/FR/HI/IT/KO/RU datasets + --output-dir + translated labels

…rison

Without an explicit num_ctx, each candidate ran at its Modelfile default (32k for the Gemma4 variants, larger for qwen3), so VRAM and latency weren't comparable across families: a 32k-default model pre-allocates KV cache a 4k-default model doesn't. The flag's own docstring promised "apples-to-apples" but defaulted to None, defeating the intent.

All current benchmark prompts fit comfortably under 4k tokens (memory_extraction is the longest at ~500). Users with longer prompts can still pass --num-ctx <larger>.

Adds a methodology note to the 2026-05-13 multilingual report so its VRAM/latency numbers aren't conflated with future runs at the new default.

igorls · 2026-05-14T05:40:24Z

Validation run: top-2 models × 7 languages with `--num-ctx 4096`

Ran the top-2 models from the 2026-05-13 report (igorls/gemma4-e4b-classifier:Q8_0 + qwen3:4b-instruct-2507-q8_0) across all 7 languages to validate Bruno's #1503 fixes against this PR's harness. Two methodology findings worth flagging here on the parent PR.

Setup: NVIDIA RTX 3090, Ollama 0.23.2. Different hardware from the report's Mercurio (RTX 3080 Laptop), so absolute latency numbers aren't comparable across runs — but accuracy and VRAM-per-model still are.

1. `embeddinggemma` largely solves the non-EN memory_extraction collapse

The current code default in runner.py:64 is _EMBED_MODEL = "embeddinggemma", but the 2026-05-13 report header says nomic-embed-text. So this run effectively re-tests the report's "critical methodology issue" with a different embedder.

Compare qwen3-4b-q8 memory_extraction:

	EN	DE	FR	HI	IT	KO	RU
Report (nomic-embed-text)	0.950	0.287	0.438	0.463	0.463	0.400	0.212
This run (embeddinggemma)	0.925	0.925	0.812	0.887	0.838	0.863	0.800

Same dataset, same labels.jsonl fallback for everything except KO. The "all non-EN scores collapse ~0.52-0.63 pp" finding in the report was mostly an embedding-model artifact, not a model-quality issue. The hypothesis in the methodology note ("re-run with embeddinggemma to separate scoring vs model effect") was correct.

2. `--num-ctx 4096` brings VRAM into line with model size

Without an explicit --num-ctx, each candidate used its Modelfile default (32k for Gemma4 variants). ollama ps showed classifier-q8 at 9.7 GB resident at the start of this run. After restarting with --num-ctx 4096:

	classifier-q8 VRAM	qwen3-4b-q8 VRAM
Modelfile defaults (`ollama ps`)	9.7 GB	(not measured this run)
`--num-ctx 4096` (this run)	8.7 GB (8693 MB)	5.2 GB (5196 MB)

~1 GB shaved off classifier-q8 by not pre-allocating 32k KV cache it never uses. Accuracy unaffected (all prompts fit comfortably under 4k). Pushed as commit 8591d13 on feat/benchmark-multilingual — --num-ctx now defaults to 4096 in both orchestrator.py and runner.py.

Latency presumably improved too, but I can't quantify it from this run alone — would need a same-hardware before/after on z690-ex-glacial to isolate from the hardware delta vs Mercurio.

Accuracy ranking (this run, EN baseline)

task	classifier-q8	qwen3-4b-q8
room_classification closed	0.622	0.564
room_classification open	0.652	0.576
entity_extraction	0.746	0.768
memory_extraction	0.855	0.864
calibration	0.964	0.950

Same overall picture as the report: classifier-q8 wins room tasks, qwen3-4b-q8 wins extraction tasks. On this hardware, qwen3-4b-q8 ran at e2e_p50 ~125 ms on room tasks vs classifier-q8's ~239 ms, with 5.2 GB VRAM vs 8.7 GB.

Follow-ups worth doing before this PR lands on develop

Update the 2026-05-13 report. Its "memory_extraction collapses non-EN" narrative is now misleading — the collapse was specific to nomic-embed-text and largely disappears with embeddinggemma. A short addendum noting this would prevent the report being quoted out of context after merge.
Reconcile embed-model default with the report. Either change the code default back to nomic-embed-text so it matches the report's header, or re-run the full 210-row matrix with embeddinggemma and replace the report. Right now a fresh run produces numbers that don't match the report.
The silent-fallback log Bruno added in #1503 (feat(benchmarks): add DE/FR/HI/IT/KO/RU datasets + --output-dir + translated labels) is working great: `info: language=de labels=labels.jsonl (no labels.de.jsonl found ...)` showed up for every non-EN, non-KO run on memory_extraction. Easy to spot which numbers are methodology-bound.

igorls · 2026-05-14T07:35:31Z

Embedding model deep-dive: production gap + winner

Following from the prior comment, dug into the embedder question all the way through. Headline: MemPalace's production embedder cannot find multilingual memories at all, and there's a drop-in zero-migration upgrade that fixes it. Details below.

The production gap (this is the real finding)

MemPalace ships with all-MiniLM-L6-v2 (ChromaDB default, English-only training, 384-dim, via ONNX in mempalace/embedding.py). Tested it on the multilingual datasets in this PR by computing cos(emb(EN_text), emb(translation)): same content, different language. A capable multilingual embedder gives ~0.85+; a monolingual one gives roughly random:

embedder	de	fr	hi	it	ko	ru	non-EN avg
all-MiniLM-L6-v2 (prod)	0.435	0.488	0.284	0.361	0.333	0.169	0.345
nomic-v2-moe	0.829	0.820	0.773	0.810	0.731	0.778	0.790
bge-m3 (no prefix)	0.888	0.885	0.867	0.883	0.850	0.866	0.873
embeddinggemma + sim prefix	0.897	0.899	0.884	0.868	0.864	0.879	0.882

RU at 0.169 means a Russian conversation and its identical English translation embed to nearly orthogonal vectors. A Russian-speaking MemPalace user effectively cannot find their own stored memories. Same story (less severe) for Korean, Hindi, Italian, French, German.
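The measurement behind the table above is a mean cosine similarity over parallel (English, translation) pairs. A minimal sketch of that metric, where `embed` stands in for whichever embedder is under test (a hypothetical callable that maps text to a list of floats):

```python
import math
from typing import Callable, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def mean_parallel_cosine(embed: Callable[[str], Sequence[float]],
                         pairs: Sequence[tuple]) -> float:
    """Average cos(embed(en), embed(translation)) over parallel pairs."""
    sims = [cosine(embed(en), embed(tr)) for en, tr in pairs]
    return sum(sims) / len(sims)
```

A value near 0 means the two vectors are close to orthogonal, which is how the 0.169 figure for RU should be read.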

This breaks the "100% recall is the design requirement" promise from CLAUDE.md for any non-English content.

The fix: embeddinggemma-300m-ONNX at 384-dim via MRL

Validated the ONNX port (onnx-community/embeddinggemma-300m-ONNX) directly against Ollama's gguf to confirm production-format parity:

lang	Ollama gguf 768d	ONNX q8 768d	ONNX q8 384d (MRL)	Δ q8 vs gguf
de	0.897	0.897	0.902	-0.001
fr	0.899	0.899	0.904	+0.000
hi	0.884	0.883	0.893	-0.001
it	0.868	0.866	0.889	-0.002
ko	0.864	0.864	0.879	-0.000
ru	0.879	0.879	0.888	+0.000
avg	0.882	0.881	0.893	-0.000

Two surprises worth noting:

q8 ONNX is lossless vs gguf (max delta 0.002). No quantization-loss risk. Ships at ~300 MB.
384-dim MRL truncation outperforms full 768-dim (0.893 vs 0.881). Known Matryoshka property: the first dims are trained to carry the most semantically dense signal. So we keep ChromaDB's 384-dim collections and get higher quality than full 768d.

Net change for production: all-MiniLM-L6-v2 → embeddinggemma-300m-ONNX q8 @ 384d (MRL truncated). Same dim, no schema change (stored vectors still have to be re-embedded, since the two models' vector spaces differ). +210 MB on disk. +0.548 absolute non-EN similarity.
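MRL truncation as used here means keeping the leading 384 components and re-normalizing to unit length. A minimal pure-Python sketch (the function name is illustrative; the real class works on onnxruntime output arrays):

```python
import math
from typing import Sequence


def mrl_truncate(vec: Sequence[float], dim: int = 384) -> list:
    """Keep the first `dim` components, then L2-normalize the result."""
    if len(vec) < dim:
        raise ValueError(f"need at least {dim} dimensions, got {len(vec)}")
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


full = [float(i % 7 + 1) for i in range(768)]  # stand-in for a 768-d embedding
short = mrl_truncate(full)
print(len(short))  # → 384
```

Re-normalizing after the cut matters: the truncated head of a unit vector is generally shorter than unit length, so the norm has to be recomputed on the head alone.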

Methodology notes worth bundling here

The prefix matters. embeddinggemma without prefix scores 0.829 cos; with "task: sentence similarity | query: " it jumps to 0.882. The current runner.py doesn't apply prefixes, so the benchmark mildly under-counts embeddinggemma's quality. Worth a small follow-up to add an embedder-specific prefix table.
e5-small couldn't be benchmarked. Both Ollama community ports tried (qllama/, jeffh/intfloat-) crash with EOF during embedding (known issue with some community GGUF embedding ports). embeddinggemma at 384-dim via MRL covers e5-small's main appeal (zero re-index) anyway.
bge-m3 is the close runner-up. 0.873 vs embeddinggemma's 0.882, MIT license vs Gemma (custom), no prefix needed. Real backup if the Gemma license becomes a blocker for distribution.
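The embedder-specific prefix table proposed in the first note could be as small as a dictionary lookup. In this sketch only the embeddinggemma prefix comes from the thread; the table layout and function name are assumptions, and models without an entry are passed through unchanged (which matches the "no prefix needed" remark for bge-m3).

```python
# Prefix string quoted in the methodology note above.
EMBED_PREFIXES = {
    "embeddinggemma": "task: sentence similarity | query: ",
}


def with_prefix(embed_model: str, text: str) -> str:
    """Prepend the model's similarity prefix, if it has one."""
    return EMBED_PREFIXES.get(embed_model, "") + text


print(with_prefix("embeddinggemma", "hello"))  # → task: sentence similarity | query: hello
```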

Implemented

Initial implementation landed on this branch as commit 51702e9:

New EmbeddinggemmaONNX class in mempalace/embedding.py (lazy hf_hub_download, onnxruntime inference, sim prefix, MRL→384d, L2-normalized).
New MEMPALACE_EMBEDDING_MODEL env var. Default stays minilm for back-compat. Opt-in embeddinggemma.
New [multilingual] extra in pyproject (huggingface_hub + tokenizers + numpy). Core deps unchanged.
9 existing test_embedding.py tests + 120 config/embedding-touched tests pass. Lint clean.

Follow-ups still open: offline tests for the new EF (mock hf_hub_download), docs note on running mempalace repair rebuild-index after switching, and a friendlier startup warning when ChromaDB rejects reads from an EF-name mismatch.

…embedder

MemPalace's default embedder (all-MiniLM-L6-v2) is English-only-trained. Cross-lingual cosine similarity on parallel-translated text averages 0.35 across DE/FR/HI/IT/KO/RU, vs 0.88 for embeddinggemma-300m ONNX (q8) with the semantic-similarity prefix. RU is the worst at 0.17, meaning a Russian memory and its identical English translation embed to nearly orthogonal vectors. Multilingual users effectively cannot retrieve their own memories.

This commit adds embeddinggemma-300m as an opt-in alternative:

* New EmbeddinggemmaONNX class implementing ChromaDB's EF protocol. Lazy-downloads model_quantized.onnx (~300 MB) via huggingface_hub on first use; cached under ~/.cache/huggingface/. Applies the sim prefix, runs onnxruntime inference, truncates to 384 dims via Matryoshka (MRL), L2-normalizes.
* MRL truncation to 384d is intentional: matches MiniLM's vector width so collection schemas don't change, and validation showed 384d MRL actually outperforms full 768d on these similarity tasks (0.893 vs 0.881 avg), a known property of MRL training.
* MEMPALACE_EMBEDDING_MODEL env (default "minilm" for back-compat). Switching models on an existing palace requires re-embedding: ChromaDB rejects reads with a mismatched EF name. Run `mempalace repair rebuild-index` after changing the value.
* New optional dep group: pip install mempalace[multilingual]. Adds huggingface_hub + tokenizers + numpy. Core deps unchanged.

ONNX q8 validated lossless vs the Ollama gguf benchmarked previously (max delta 0.002 cos across 240 parallel pairs).

Three follow-ups bundled for the embeddinggemma EF added in 51702e9:

1. Offline tests for EmbeddinggemmaONNX (10 tests, 0.08s, no network). Mocks huggingface_hub.hf_hub_download, tokenizers.Tokenizer.from_file, and onnxruntime.InferenceSession so CI never pulls the 300 MB model. Guarded with pytest.importorskip so the file is skipped when the multilingual extra isn't installed. Covers: stable name(), lazy-load runs exactly once, output shape (n, 384) after MRL truncation, L2 normalization, sim prefix applied, dispatch from get_embedding_function(model="embeddinggemma"), cache key separates models, helpful ImportError when deps missing, env override.
2. Friendlier ChromaDB EF-name-mismatch error. Switching MEMPALACE_EMBEDDING_MODEL on an existing palace previously surfaced ChromaDB's bare "Embedding function conflict: new: X vs persisted: Y" ValueError. Now ChromaBackend.get_collection() wraps that error and points users at the two recovery paths: revert the env var, or run `mempalace repair rebuild-index --palace <path>`. New _explain_ef_mismatch helper + 3 tests (unit + end-to-end).
3. Docs: CHANGELOG [Unreleased] entry covers both the new EF and the error wrapper. README Requirements section mentions the multilingual extra and points at the embedding.py docstring for the migration note.
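The error wrapper in item 2 can be imagined as a helper that recognizes ChromaDB's conflict message and returns friendlier guidance. This is a sketch only: the name explain_ef_mismatch_sketch, its signature and the wording are hypothetical, and the real _explain_ef_mismatch helper may look quite different.

```python
from typing import Optional


def explain_ef_mismatch_sketch(exc: ValueError, palace_path: str) -> Optional[str]:
    """Return a recovery hint if `exc` is an embedding-function conflict."""
    message = str(exc)
    if "Embedding function conflict" not in message:
        return None  # unrelated ValueError: let the caller re-raise it as-is
    return (
        f"{message}\n"
        "The palace was built with a different embedding model. Either revert "
        "MEMPALACE_EMBEDDING_MODEL to its previous value, or run "
        f"`mempalace repair rebuild-index --palace {palace_path}`."
    )


hint = explain_ef_mismatch_sketch(
    ValueError("Embedding function conflict: new: X vs persisted: Y"), "/data/palace"
)
```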

Onboarding now asks the user once, on first run, whether to use the multilingual embedding model. The default answer is yes: defaulting to English-only made the recall promise effectively unreachable for any non-English content (cross-lingual cos ~0.35 vs ~0.88 for the multilingual model). The choice is written to config.json so subsequent runs pick the right EF without re-prompting; existing installs that never set the env var or ran onboarding stay on minilm for back-compat. MEMPALACE_EMBEDDING_MODEL still overrides both.

Multilingual deps (huggingface_hub, tokenizers, numpy) move from the [multilingual] extra into core. The extra is kept as a no-op alias so existing install scripts keep working. The 300 MB ONNX model is still lazy-downloaded on first use, not at install time.

`quick_setup` (the programmatic non-interactive path) grows an optional `embedding_model` arg so tests and benchmark scripts can pick a model without writing config.json by accident.

EmbeddinggemmaONNX's "missing deps" error now points at the right recovery path (reinstall mempalace, since the deps are core) rather than the obsolete pip install mempalace[multilingual] hint.

Tests: 9 new (3 _ask_embedding_model variants + 2 run_onboarding persistence + 2 quick_setup + 2 set_embedding_model round-trips). The existing 2 run_onboarding tests now patch _ask_embedding_model so they don't print to stdout.
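The precedence described in this commit (env var over config.json over the minilm default) can be written as a small resolver. The config key "embedding_model" and the function name are assumptions; only the env var name and the "minilm" default come from the commit message.

```python
from typing import Mapping


def resolve_embedding_model(env: Mapping[str, str], config: Mapping[str, str]) -> str:
    """Env var wins, then the onboarding choice in config, then the legacy default."""
    from_env = env.get("MEMPALACE_EMBEDDING_MODEL")
    if from_env:
        return from_env
    from_config = config.get("embedding_model")  # hypothetical config.json key
    if from_config:
        return from_config
    return "minilm"  # installs that never ran onboarding stay on MiniLM


print(resolve_embedding_model({}, {"embedding_model": "embeddinggemma"}))  # → embeddinggemma
```

In practice `env` would be `os.environ` and `config` the parsed config.json; passing both in as mappings keeps the function easy to test.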

Resolve conflicts:
- backends/chroma.py: keep both new except handlers in get_collection (CollectionNotInitializedError from develop + EF-mismatch helper from this branch), ordered _ChromaNotFoundError before ValueError to match the create-branch handler order.
- uv.lock: regenerated from merged pyproject.toml.

Fix lint: ruff format mempalace/embedding.py + tests/test_embeddinggemma.py (CI now pins ruff==0.15.9 via develop's workflow update).

Full suite: 1923 passed, 1 skipped.

Resolves conflicts in CHANGELOG.md and pyproject.toml by combining the multilingual-embedder additions (huggingface_hub/tokenizers/numpy core deps, [multilingual] alias, Features section) with develop's additions (python-dateutil core dep, [extract] extra, tunnel Bug Fixes and Internal sections). Prepares PR #1483 for merge into v3.3.6.

…1590, #1605)

Bumps version 3.3.5 → 3.3.6 across pyproject.toml, version.py, plugin manifests (.claude-plugin/plugin.json, .claude-plugin/marketplace.json, .codex-plugin/plugin.json), README badge, and uv.lock. Flips CHANGELOG.md from ``[Unreleased]`` to ``[3.3.6] — 2026-05-24`` and backfills the major user-facing entries that landed without changelog entries during the cycle:

Features:
- MemPalace#1555 office-document mining via --mode extract + virtual line numbers
- MemPalace#1584 surgical closet pointers with date+line locators (Tier 6a)
- MemPalace#1558 + MemPalace#1560 within-wing hallways (entity co-occurrence graph)
- MemPalace#1565 cross-wing tunnels auto-promoted from hallways
- MemPalace#1578 Hebbian potentiation + Ebbinghaus decay on hallways/tunnels
- MemPalace#1236 API-tool transcripts auto-route to wing_api
- MemPalace#711 hooks.auto_save toggle for silent-mode sessions
- MemPalace#1605 COCA content-word filter for entity detection
- MemPalace#1557 case-insensitive entity matching at mine time
- MemPalace#1483 multilingual embeddings (embeddinggemma-300m) by default

Bug Fixes (selected, user-visible):
- MemPalace#1540 silent data loss in three unchunked upsert sites
- MemPalace#1538 paragraph chunker oversized chunks
- MemPalace#1554 per-file chunk cap too low for transcripts
- MemPalace#1562 Windows hook subprocess/ChromaDB deadlock
- MemPalace#1529 create_tunnel corrupted hyphenated wing names
- MemPalace#1424 save-hook truncated hyphenated project folders
- MemPalace#1383 KG cache duplicated graphs for symlinked/cased paths
- MemPalace#1466 silent symlink skip now logged
- MemPalace#1441 macOS stock-bash 3.2 hook compatibility
- MemPalace#1500 / MemPalace#1513 structured JSON-RPC errors on bad MCP input
- MemPalace#1523 VACUUM + FTS5 rebuild after repair
- MemPalace#1548 FTS5 validation at end of mine
- plus MemPalace#1216, MemPalace#1408, MemPalace#1438, MemPalace#1439, MemPalace#1445, MemPalace#1452, MemPalace#1459, MemPalace#1461, MemPalace#1466, MemPalace#1470, MemPalace#1477, MemPalace#1485, MemPalace#1500, MemPalace#1513, MemPalace#1528, MemPalace#1532, MemPalace#1543, MemPalace#1546, MemPalace#1585

Performance:
- MemPalace#1474 convo miner pre-fetches mined-set
- MemPalace#1487 rebuild_index progress callback
- MemPalace#1530 MCP cold-start diagnostics + opt-in warmup

Lint passes (ruff 0.15.14); mempalace-mcp entry point alignment verified per RELEASING.md.

Copilot AI review requested due to automatic review settings May 12, 2026 22:01

igorls requested a review from milla-jovovich as a code owner May 12, 2026 22:01

Copilot started reviewing on behalf of igorls May 12, 2026 22:02

gemini-code-assist Bot reviewed May 12, 2026

Copilot AI reviewed May 12, 2026

igorls and others added 2 commits May 12, 2026 19:19

lealbrunocalhau mentioned this pull request May 13, 2026

feat(benchmarks): add DE/FR/HI/IT/KO/RU datasets + --output-dir + translated labels #1503

Merged

5 tasks

lealbrunocalhau and others added 3 commits May 14, 2026 00:11

Merge pull request #1503 from workblac/feat/benchmark-multilingual-ex…

6873426

…tended-languages feat(benchmarks): add DE/FR/HI/IT/KO/RU datasets + --output-dir + translated labels

igorls added 3 commits May 14, 2026 04:41

igorls added the enhancement (New feature or request) and area/i18n (Multilingual, Unicode, non-English embeddings) labels May 15, 2026

igorls added 3 commits May 18, 2026 17:58

Merge origin/develop into feat/benchmark-multilingual (#1548, #711, #…

b931151

…1590, #1605)

igorls merged commit df36eb3 into develop May 24, 2026
7 checks passed

igorls mentioned this pull request May 24, 2026

Release v3.3.6 #1610

Merged

3 tasks

milla-jovovich mentioned this pull request May 25, 2026

feat(entity): opt-in spaCy NER augmentation via mempalace[nlp] extra #1616

Open

5 tasks

igorls merged 12 commits into develop from feat/benchmark-multilingual

3 participants
