feat(benchmarks): multilingual datasets + parity controls (embed model, num_ctx, language)#1483
Enables shipping decisions for non-English users and fair comparison across
candidates whose Modelfile defaults disagree.
- --language / --languages: load dataset.{lang}.jsonl alongside the base
dataset.jsonl. CSV gains a language column. Synthesized candidate
entries let ad-hoc model tags run without editing candidates.yaml.
- --num-ctx: force Ollama options.num_ctx per request, overriding the
model's Modelfile default. Required for apples-to-apples VRAM/TPS
(qwen3:4b-q8 defaults to 32k = 9.7 GB resident; at 8k it's 5.6 GB).
- --embed-model: thread the semantic-similarity embedding model through
scoring. Default flips to embeddinggemma (was nomic-embed-text v1).
Reason: v1 cosine on EN<->PT-BR same-meaning pairs sits at ~0.607
(right at the 0.6 match threshold), so any phrasing drift collapses
to false-negative. embeddinggemma lands ~0.766 with 2.7x the
signal/noise spread. PT-BR memory_extraction recovered 0.15 -> 0.85
on the same outputs after the swap.
Datasets: 12 new files (pt-BR/es/zh x 4 tasks, 633 samples). Input text
translated; proper nouns and labels stay English so cross-lingual
scoring against the existing labels.jsonl works without re-translation.
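The threshold argument above can be made concrete with a small sketch. It assumes a pair counts as a match when its cosine similarity is at or above 0.6; the helper names and the `>=` comparison are illustrative assumptions, not the harness's actual scorer.

```python
import math

MATCH_THRESHOLD = 0.6  # match threshold quoted in the commit message


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_match(similarity, threshold=MATCH_THRESHOLD):
    """Assumed rule: a pair matches when similarity reaches the threshold."""
    return similarity >= threshold


def margin(similarity, threshold=MATCH_THRESHOLD):
    """Headroom above the threshold; small values flip on minor phrasing drift."""
    return similarity - threshold


# Typical same-meaning EN<->PT-BR similarities reported above.
nomic_margin = margin(0.607)  # about 0.007 of headroom
gemma_margin = margin(0.766)  # about 0.166 of headroom
```

With roughly 0.007 of headroom, a small drop in similarity pushes a correct answer below the threshold, which is the false-negative collapse the commit message describes.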
Code Review
This pull request introduces multi-language support for model evaluation by adding Spanish, Portuguese, and Chinese datasets and updating the orchestrator to process these variants. It also enhances the evaluation pipeline with support for multiple LLM providers, configurable embedding models, and context window overrides. A review comment correctly identified that the --embed-endpoint argument does not implement the defaulting logic described in its help text, which could lead to configuration issues when using remote Ollama instances.
parser.add_argument("--embed-endpoint", default="http://localhost:11434",
                    help="Endpoint for the embedding model (always Ollama). Defaults to --endpoint when using ollama provider.")
The help text for --embed-endpoint states that it defaults to --endpoint when using the ollama provider, but this logic is not implemented. Currently, it always defaults to http://localhost:11434 regardless of the LLM endpoint configuration, which will cause issues when benchmarking remote Ollama instances.
Suggested change, replacing:
parser.add_argument("--embed-endpoint", default="http://localhost:11434",
                    help="Endpoint for the embedding model (always Ollama). Defaults to --endpoint when using ollama provider.")
with:
parser.add_argument("--embed-endpoint", default=None,
                    help="Endpoint for the embedding model (always Ollama). Defaults to --endpoint when using ollama provider, otherwise http://localhost:11434.")
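The suggestion only changes the argparse default, so the fallback still has to be resolved after parsing. A minimal sketch of that step follows; `resolve_embed_endpoint` and the defaults given to `--llm-provider` and `--endpoint` are assumptions for illustration, not the code that landed.

```python
import argparse

DEFAULT_OLLAMA = "http://localhost:11434"


def build_parser():
    parser = argparse.ArgumentParser()
    # Defaults for these two flags are assumed for the sake of the example.
    parser.add_argument("--llm-provider", default="ollama")
    parser.add_argument("--endpoint", default=DEFAULT_OLLAMA)
    parser.add_argument(
        "--embed-endpoint",
        default=None,
        help="Endpoint for the embedding model (always Ollama). Defaults to "
        "--endpoint when using ollama provider, otherwise " + DEFAULT_OLLAMA + ".",
    )
    return parser


def resolve_embed_endpoint(args):
    """Hypothetical helper: apply the documented fallback after parsing."""
    if args.embed_endpoint is not None:
        return args.embed_endpoint  # an explicit flag always wins
    if args.llm_provider == "ollama":
        return args.endpoint  # score against the same host as the LLM
    return DEFAULT_OLLAMA
```

Using `None` as the sentinel is what lets the code distinguish "flag not given" from "flag given with the localhost value".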
Pull request overview
Adds multilingual benchmarking support to the benchmarks.model_eval harness and introduces parity controls so benchmark runs can be compared fairly across models/providers and context-window defaults.
Changes:
- Add `--language`/`--languages` dataset selection and record `language` in JSON/CSV outputs.
- Add `--embed-model` (default now `embeddinggemma`) for embedding-scored tasks, and `--num-ctx` to override Ollama `options.num_ctx` per request.
- Allow `--candidates` to accept ad-hoc model tags not present in `candidates.yaml` (synthesized candidate entries).
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| mempalace/llm_client.py | Adds num_ctx support to OllamaProvider and forwards extra provider kwargs via get_provider(). |
| benchmarks/model_eval/runner.py | Adds language selection, embed model/endpoint plumbing, provider selection, and num_ctx forwarding into the single-run harness. |
| benchmarks/model_eval/orchestrator.py | Extends matrix runner to iterate languages, add language CSV column, embed model/endpoint flags, and ad-hoc candidate tag support. |
| benchmarks/model_eval/datasets/room_classification/dataset.zh.jsonl | Adds Chinese room_classification dataset variant. |
| benchmarks/model_eval/datasets/room_classification/dataset.pt-BR.jsonl | Adds pt-BR room_classification dataset variant. |
| benchmarks/model_eval/datasets/room_classification/dataset.es.jsonl | Adds Spanish room_classification dataset variant. |
| benchmarks/model_eval/datasets/memory_extraction/dataset.zh.jsonl | Adds Chinese memory_extraction dataset variant. |
| benchmarks/model_eval/datasets/memory_extraction/dataset.pt-BR.jsonl | Adds pt-BR memory_extraction dataset variant. |
| benchmarks/model_eval/datasets/memory_extraction/dataset.es.jsonl | Adds Spanish memory_extraction dataset variant. |
| benchmarks/model_eval/datasets/entity_extraction/dataset.zh.jsonl | Adds Chinese entity_extraction dataset variant. |
| benchmarks/model_eval/datasets/entity_extraction/dataset.pt-BR.jsonl | Adds pt-BR entity_extraction dataset variant. |
| benchmarks/model_eval/datasets/entity_extraction/dataset.es.jsonl | Adds Spanish entity_extraction dataset variant. |
| benchmarks/model_eval/datasets/calibration/dataset.zh.jsonl | Adds Chinese calibration dataset variant. |
| benchmarks/model_eval/datasets/calibration/dataset.pt-BR.jsonl | Adds pt-BR calibration dataset variant. |
| benchmarks/model_eval/datasets/calibration/dataset.es.jsonl | Adds Spanish calibration dataset variant. |
task_dir = dataset_dir / task
samples = load_jsonl(task_dir / "dataset.jsonl")
dataset_file = "dataset.jsonl" if language == "en" else f"dataset.{language}.jsonl"
dataset_path = task_dir / dataset_file
if not dataset_path.exists():
    strip_thinking=not args.no_strip_thinking,
    llm_provider=args.llm_provider,
    language=args.language,
    embed_model=args.embed_model,
    num_ctx=args.num_ctx,
… defaulting
Addresses Copilot + gemini-code-assist review on #1483.
1. Path-traversal guard for --language. The value is interpolated into the dataset filename (`dataset.{language}.jsonl`), so unvalidated input could escape `task_dir`. Now:
   - regex `^[A-Za-z][A-Za-z0-9]*(?:[_-][A-Za-z0-9]+)?$` accepts en, pt-BR, zh-CN, fr_CA, etc. and rejects anything with path separators or `..`
   - belt-and-suspenders `Path.resolve().is_relative_to(task_dir)` check before opening the file
2. --embed-endpoint now defaults to None and is resolved after parsing: uses --endpoint when --llm-provider=ollama (so remote benchmark runs score against the same host), else http://localhost:11434. Help text now matches behavior. runner.py's CLI was also missing the flag entirely; it was added so single-task runs honor remote endpoints.
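The two-layer guard described in item 1 can be sketched as follows. The regex is copied from the commit message; the function name `dataset_path_for` and its error messages are hypothetical, and `Path.is_relative_to` requires Python 3.9 or newer.

```python
import re
from pathlib import Path

# Pattern quoted in the commit message.
LANGUAGE_RE = re.compile(r"^[A-Za-z][A-Za-z0-9]*(?:[_-][A-Za-z0-9]+)?$")


def dataset_path_for(task_dir, language):
    """Hypothetical helper: validate the language tag, then build the dataset path."""
    # Layer 1: reject anything that is not a plain language tag.
    if not LANGUAGE_RE.fullmatch(language):
        raise ValueError(f"invalid language code: {language!r}")

    name = "dataset.jsonl" if language == "en" else f"dataset.{language}.jsonl"
    root = Path(task_dir).resolve()
    candidate = (root / name).resolve()

    # Layer 2: even if the regex were loosened later, refuse paths outside task_dir.
    if not candidate.is_relative_to(root):
        raise ValueError(f"dataset path escapes task directory: {candidate}")
    return candidate
```

The second check is redundant while the regex holds, which is the point of a belt-and-suspenders guard: it keeps the file access safe if the validation pattern changes.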
…nslated labels
Adds 6 new language datasets (German, French, Hindi, Italian, Korean, Russian)
across all 4 benchmark tasks (calibration, entity_extraction, memory_extraction,
room_classification) — 630 samples total, same conventions as the existing
pt-BR/es/zh datasets: inputs translated, labels/ground-truth stay English
except where noted.
Changes:
- 24 new dataset.{de,fr,hi,it,ko,ru}.jsonl files across all 4 tasks
- labels.ko.jsonl for memory_extraction: Korean ground-truth so the scorer
compares Korean model output against Korean expected content instead of
English (fixes ~20pp score gap identified during testing — see report)
- runner.py: loads labels.{lang}.jsonl when present, falls back to labels.jsonl
- orchestrator.py: adds --output-dir (writes <dir>/<lang>/YYYY-MM-DD-<host>.csv
per language); --output single-file mode unchanged
- candidates.yaml: adds community tier (igorls classifier variants, heretic)
and local tier (gemma4:e4b)
- translate_datasets.py: script used to generate the translations via Ollama;
included so contributors can extend to new languages without manual work
- reports/2026-05-13-multilingual.md: 210-run benchmark report across
6 models × 7 languages × 5 tasks on RTX 3080 Laptop 8 GB
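The `--output-dir` layout listed above (`<dir>/<lang>/YYYY-MM-DD-<host>.csv`) can be sketched as a small path builder. The function name and the choice of `socket.gethostname()` for `<host>` are assumptions; the orchestrator may derive the host label differently.

```python
import socket
from datetime import date
from pathlib import Path


def per_language_csv(output_dir, language, today=None, host=None):
    """Hypothetical helper: build <dir>/<lang>/YYYY-MM-DD-<host>.csv."""
    today = today or date.today()
    host = host or socket.gethostname()
    # date.isoformat() yields YYYY-MM-DD, matching the documented pattern.
    return Path(output_dir) / language / f"{today.isoformat()}-{host}.csv"
```

For example, `per_language_csv("results", "ko", date(2026, 5, 13), "bench-host")` gives `results/ko/2026-05-13-bench-host.csv`, where `bench-host` is a made-up host name.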
…ted samples, KO labels
Addresses the review feedback from igorls, gemini-code-assist, and Copilot.
HIGH:
- orchestrator: --output single-file mode now shares ONE (fh, writer) across
all languages instead of opening N handles to the same path. The old code
caused interleaved buffer corruption: first language opened "w", subsequent
ones opened "a", and writes from independent file offsets could overwrite
each other. Verified with a multi-language --output smoke test (4 rows
written, all distinct).
- 19 untranslated/empty samples re-translated:
- dataset.de.jsonl: cal_017
- dataset.hi.jsonl entity_extraction: ent_020, ent_025, ent_032, ent_038
- dataset.hi.jsonl room_classification: rc_017, rc_026, rc_028, rc_040,
rc_064, rc_089, rc_091
- dataset.ko.jsonl room_classification: rc_027, rc_067
- dataset.it.jsonl room_classification: rc_029, rc_030, rc_031, rc_032,
rc_053 (previously empty strings)
- labels.ko.jsonl: restored all proper nouns to English (Doreth, Saela, Ivora,
Ren Solanke, Pol Krisat, Pell Halloran, Bramble, Hollowmounts Institute,
Wendelsea, Bridgewater Community Garden, Wends, Drukar, Aerwyn cycle,
Jaccard, Mason bee, Markdown). Also fixed mistranslation 유전자 사과
(genetic apple) → 재래종 사과 (heirloom apple).
MEDIUM:
- runner.py: refactored label-resolution one-liner into 3 readable lines
and added an info log when falling back to English ground truth, so
readers don't misread "score collapse" as model failure.
LOW:
- orchestrator: moved `import socket` to module top (PEP 8); removed
unused `out_path` from the unpacking tuple.
- translate_datasets.py: renamed loop variable `l` → `code` (ruff E741);
made the _translate_one fallback return path explicit instead of relying
on for-loop fall-through; added a privacy warning in the docstring
flagging that the default `kimi-k2.6:cloud` sends prose to a remote
endpoint and should not be used over real palace data.
- 2026-05-13-multilingual.md: converted analytical paragraphs from
Portuguese to English to match the existing repo convention.
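The HIGH fix for `--output` single-file mode amounts to opening the file once and passing one writer through every language. A minimal sketch of that pattern, with hypothetical column names and a hypothetical `write_matrix` function:

```python
import csv

# Hypothetical column set; the real CSV has more fields.
FIELDS = ["language", "task", "score"]


def write_matrix(path, rows_by_language):
    """Write all languages through ONE file handle and ONE writer."""
    written = 0
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        for language, rows in rows_by_language.items():
            for row in rows:
                # Same writer for every language, so rows are appended in order
                # instead of being written from independent file offsets.
                writer.writerow({"language": language, **row})
                written += 1
    return written
```

Because there is a single handle, there is no second buffer that could flush over rows written by the first.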
…tended-languages feat(benchmarks): add DE/FR/HI/IT/KO/RU datasets + --output-dir + translated labels
…rison
Without an explicit num_ctx, each candidate ran at its Modelfile default (32k for the Gemma4 variants, larger for qwen3), so VRAM and latency weren't comparable across families: a 32k-default model pre-allocates KV cache a 4k-default model doesn't. The flag's own docstring promised "apples-to-apples" but defaulted to None, defeating the intent.
All current benchmark prompts fit comfortably under 4k tokens (memory_extraction is the longest at ~500). Users with longer prompts can still pass --num-ctx <larger>.
Adds a methodology note to the 2026-05-13 multilingual report so its VRAM/latency numbers aren't conflated with future runs at the new default.
Validation run: top-2 models × 7 languages with
| Run (embed model) | EN | DE | FR | HI | IT | KO | RU |
|---|---|---|---|---|---|---|---|
| Report (nomic-embed-text) | 0.950 | 0.287 | 0.438 | 0.463 | 0.463 | 0.400 | 0.212 |
| This run (embeddinggemma) | 0.925 | 0.925 | 0.812 | 0.887 | 0.838 | 0.863 | 0.800 |
Same dataset, same labels.jsonl fallback for everything except KO. The "all non-EN scores collapse ~0.52-0.63 pp" finding in the report was mostly an embedding-model artifact, not a model-quality issue. The hypothesis in the methodology note ("re-run with embeddinggemma to separate scoring vs model effect") was correct.
2. --num-ctx 4096 brings VRAM into line with model size
Without an explicit --num-ctx, each candidate used its Modelfile default (32k for Gemma4 variants). ollama ps showed classifier-q8 at 9.7 GB resident at the start of this run. After restarting with --num-ctx 4096:
| | classifier-q8 VRAM | qwen3-4b-q8 VRAM |
|---|---|---|
| Modelfile defaults (ollama ps) | 9.7 GB | (not measured this run) |
| --num-ctx 4096 (this run) | 8693 MB | 5196 MB |
~1 GB shaved off classifier-q8 by not pre-allocating 32k KV cache it never uses. Accuracy unaffected (all prompts fit comfortably under 4k). Pushed as commit 8591d13 on feat/benchmark-multilingual — --num-ctx now defaults to 4096 in both orchestrator.py and runner.py.
Latency presumably improved too, but I can't quantify it from this run alone — would need a same-hardware before/after on z690-ex-glacial to isolate from the hardware delta vs Mercurio.
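For readers unfamiliar with how a per-request context override reaches Ollama: the generate API accepts an `options` object, and `num_ctx` inside it takes precedence over the Modelfile default for that request. The sketch below only builds the request body; the helper name is hypothetical and it is not the `OllamaProvider` code from this PR.

```python
import json


def build_generate_payload(model, prompt, num_ctx=4096):
    """Hypothetical helper: JSON body for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if num_ctx is not None:
        # Per-request override of the Modelfile's context window.
        payload["options"] = {"num_ctx": num_ctx}
    return payload


body = json.dumps(build_generate_payload("qwen3:4b-instruct-2507-q8_0", "Classify this room."))
```

Passing `num_ctx=None` leaves `options` out entirely, which reproduces the old behavior of running each candidate at its own default.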
Accuracy ranking (this run, EN baseline)
| task | classifier-q8 | qwen3-4b-q8 |
|---|---|---|
| room_classification closed | 0.622 | 0.564 |
| room_classification open | 0.652 | 0.576 |
| entity_extraction | 0.746 | 0.768 |
| memory_extraction | 0.855 | 0.864 |
| calibration | 0.964 | 0.950 |
Same overall picture as the report: classifier-q8 wins room tasks, qwen3-4b-q8 wins extraction tasks. On this hardware, qwen3-4b-q8 ran at e2e_p50 ~125 ms on room tasks vs classifier-q8's ~239 ms, with 5.2 GB VRAM vs 8.7 GB.
Follow-ups worth doing before this PR lands on develop
- Update the 2026-05-13 report. Its "memory_extraction collapses non-EN" narrative is now misleading: the collapse was specific to `nomic-embed-text` and largely disappears with `embeddinggemma`. A short addendum noting this would prevent the report being quoted out of context after merge.
- Reconcile embed-model default with the report. Either change the code default back to `nomic-embed-text` so it matches the report's header, or re-run the full 210-row matrix with `embeddinggemma` and replace the report. Right now a fresh run produces numbers that don't match the report.
- The silent-fallback log Bruno added in feat(benchmarks): add DE/FR/HI/IT/KO/RU datasets + --output-dir + translated labels #1503 is working great: `info: language=de labels=labels.jsonl (no labels.de.jsonl found ...)` showed up for every non-EN+non-KO run on memory_extraction. Easy to spot which numbers are methodology-bound.
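The label fallback behind that log line can be sketched in a few lines. The function name, logger name, and exact message format are assumptions modeled on the log excerpt quoted above, not the runner's actual code.

```python
import logging
from pathlib import Path

logger = logging.getLogger("model_eval")  # hypothetical logger name


def resolve_labels_path(task_dir, language):
    """Prefer labels.{lang}.jsonl; fall back to the English labels.jsonl."""
    task_dir = Path(task_dir)
    default = task_dir / "labels.jsonl"
    if language == "en":
        return default
    localized = task_dir / f"labels.{language}.jsonl"
    if localized.exists():
        return localized
    # Make the fallback visible so a low score is not misread as model failure.
    logger.info("language=%s labels=labels.jsonl (no %s found)", language, localized.name)
    return default
```

With only `labels.jsonl` and `labels.ko.jsonl` on disk, Korean resolves to the localized file and every other non-English language logs the fallback.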
Embedding model deep-dive: production gap + winner
Following from the prior comment, dug into the embedder question all the way through. Headline: MemPalace's production embedder cannot find multilingual memories at all, and there's a drop-in zero-migration upgrade that fixes it. Details below.
The production gap (this is the real finding)
MemPalace ships with
RU at 0.169 means a Russian conversation and its identical English translation embed to nearly orthogonal vectors. A Russian-speaking MemPalace user effectively cannot find their own stored memories. Same story (less severe) for Korean, Hindi, Italian, French, German. This breaks the "100% recall is the design requirement" promise from CLAUDE.md for any non-English content.
The fix: embeddinggemma-300m-ONNX at 384-dim via MRL
Validated the ONNX port (
Two surprises worth noting:
Net change for production:
Methodology notes worth bundling here
Implemented
Initial implementation landed on this branch as commit
Follow-ups still open: offline tests for the new EF (mock hf_hub_download), docs note on running
…embedder
MemPalace's default embedder (all-MiniLM-L6-v2) is English-only-trained. Cross-lingual cosine similarity on parallel-translated text averages 0.35 across DE/FR/HI/IT/KO/RU, vs 0.88 for embeddinggemma-300m ONNX (q8) with the semantic-similarity prefix. RU is the worst at 0.17, meaning a Russian memory and its identical English translation embed to nearly orthogonal vectors. Multilingual users effectively cannot retrieve their own memories.
This commit adds embeddinggemma-300m as an opt-in alternative:
* New EmbeddinggemmaONNX class implementing ChromaDB's EF protocol. Lazy-downloads model_quantized.onnx (~300 MB) via huggingface_hub on first use; cached under ~/.cache/huggingface/. Applies the sim prefix, runs onnxruntime inference, truncates to 384 dims via Matryoshka (MRL), L2-normalizes.
* MRL truncation to 384d is intentional: matches MiniLM's vector width so collection schemas don't change, and validation showed 384d MRL actually outperforms full 768d on these similarity tasks (0.893 vs 0.881 avg), a known property of MRL training.
* MEMPALACE_EMBEDDING_MODEL env (default "minilm" for back-compat). Switching models on an existing palace requires re-embedding: ChromaDB rejects reads with a mismatched EF name. Run `mempalace repair rebuild-index` after changing the value.
* New optional dep group: pip install mempalace[multilingual]. Adds huggingface_hub + tokenizers + numpy. Core deps unchanged.
ONNX q8 validated lossless vs the Ollama gguf benchmarked previously (max delta 0.002 cos across 240 parallel pairs).
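The last two steps of that pipeline, Matryoshka truncation and L2 normalization, are simple enough to sketch in plain Python. This is an illustration of the technique under the assumption that truncation keeps the leading dimensions; it is not the `EmbeddinggemmaONNX` implementation, which the commit message suggests operates on numpy arrays.

```python
import math


def mrl_truncate(vector, dims=384):
    """Keep the first `dims` components, then rescale to unit length."""
    head = [float(x) for x in vector[:dims]]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # avoid dividing by zero on an all-zero vector
    return [x / norm for x in head]


# A 768-d vector shrinks to 384-d, matching MiniLM's width.
short = mrl_truncate([1.0] * 768)
```

Re-normalizing after truncation matters because the leading 384 components of a unit-length 768-d vector are generally not unit length themselves, and cosine-based stores assume normalized inputs.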
Three follow-ups bundled for the embeddinggemma EF added in 51702e9:
1. Offline tests for EmbeddinggemmaONNX (10 tests, 0.08s, no network). Mocks huggingface_hub.hf_hub_download, tokenizers.Tokenizer.from_file, and onnxruntime.InferenceSession so CI never pulls the 300 MB model. Guarded with pytest.importorskip so the file is skipped when the multilingual extra isn't installed. Covers: stable name(), lazy-load runs exactly once, output shape (n, 384) after MRL truncation, L2 normalization, sim prefix applied, dispatch from get_embedding_function(model="embeddinggemma"), cache key separates models, helpful ImportError when deps missing, env override.
2. Friendlier ChromaDB EF-name-mismatch error. Switching MEMPALACE_EMBEDDING_MODEL on an existing palace previously surfaced ChromaDB's bare "Embedding function conflict: new: X vs persisted: Y" ValueError. Now ChromaBackend.get_collection() wraps that error and points users at the two recovery paths: revert the env var, or run `mempalace repair rebuild-index --palace <path>`. New _explain_ef_mismatch helper + 3 tests (unit + end-to-end).
3. Docs: CHANGELOG [Unreleased] entry covers both the new EF and the error wrapper. README Requirements section mentions the multilingual extra and points at the embedding.py docstring for the migration note.
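The error wrapper in item 2 follows a common pattern: catch the library's generic `ValueError`, recognize it by its message, and re-raise with recovery instructions. The sketch below is a guess at the shape of such a helper; `explain_ef_mismatch`, `get_collection_checked`, and `EmbeddingModelMismatchError` are hypothetical names, and only the quoted ChromaDB message text comes from the commit.

```python
class EmbeddingModelMismatchError(ValueError):
    """Hypothetical error type carrying recovery instructions."""


CONFLICT_MARKER = "Embedding function conflict"  # text from ChromaDB's ValueError


def explain_ef_mismatch(exc, palace_path):
    """Return a friendlier message for EF conflicts, or None for other errors."""
    if CONFLICT_MARKER not in str(exc):
        return None
    return (
        f"{exc}\n"
        "The palace was built with a different embedding model. Either revert "
        "MEMPALACE_EMBEDDING_MODEL to its previous value, or re-embed with:\n"
        f"  mempalace repair rebuild-index --palace {palace_path}"
    )


def get_collection_checked(open_collection, palace_path):
    """Call `open_collection()` and translate only EF-conflict errors."""
    try:
        return open_collection()
    except ValueError as exc:
        hint = explain_ef_mismatch(exc, palace_path)
        if hint is None:
            raise  # unrelated ValueError: leave it untouched
        raise EmbeddingModelMismatchError(hint) from exc
```

Chaining with `from exc` keeps the original ChromaDB traceback available for debugging while the user sees the two recovery paths first.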
Onboarding now asks the user once, on first run, whether to use the multilingual embedding model. The default answer is yes: defaulting to English-only made the recall promise effectively unreachable for any non-English content (cross-lingual cos ~0.35 vs ~0.88 for the multilingual model). The choice is written to config.json so subsequent runs pick the right EF without re-prompting; existing installs that never set the env var or ran onboarding stay on minilm for back-compat. MEMPALACE_EMBEDDING_MODEL still overrides both.
Multilingual deps (huggingface_hub, tokenizers, numpy) move from the [multilingual] extra into core. The extra is kept as a no-op alias so existing install scripts keep working. The 300 MB ONNX model is still lazy-downloaded on first use, not at install time.
`quick_setup` (the programmatic non-interactive path) grows an optional `embedding_model` arg so tests and benchmark scripts can pick a model without writing config.json by accident.
EmbeddinggemmaONNX's "missing deps" error now points at the right recovery path (reinstall mempalace, since the deps are core) rather than the obsolete pip install mempalace[multilingual] hint.
Tests: 9 new (3 _ask_embedding_model variants + 2 run_onboarding persistence + 2 quick_setup + 2 set_embedding_model round-trips). The existing 2 run_onboarding tests now patch _ask_embedding_model so they don't print to stdout.
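The precedence described here (env var over config.json over the back-compat default) can be sketched as a single lookup. The function name and the `embedding_model` config key are assumptions for illustration; only the env var name and the `minilm` default come from the commit messages.

```python
import os


def resolve_embedding_model(config, env=None):
    """Hypothetical resolver: env var > config.json value > 'minilm'."""
    env = os.environ if env is None else env
    override = env.get("MEMPALACE_EMBEDDING_MODEL")
    if override:
        return override  # the env var overrides everything else
    # Installs that never ran onboarding have no stored choice and stay on minilm.
    return config.get("embedding_model", "minilm")
```

Accepting `env` as a parameter keeps the function testable without mutating the process environment.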
Resolve conflicts:
- backends/chroma.py: keep both new except handlers in get_collection (CollectionNotInitializedError from develop + EF-mismatch helper from this branch), ordered _ChromaNotFoundError before ValueError to match the create-branch handler order.
- uv.lock: regenerated from merged pyproject.toml.
Fix lint: ruff format mempalace/embedding.py + tests/test_embeddinggemma.py (CI now pins ruff==0.15.9 via develop's workflow update).
Full suite: 1923 passed, 1 skipped.
Resolves conflicts in CHANGELOG.md and pyproject.toml by combining the multilingual-embedder additions (huggingface_hub/tokenizers/numpy core deps, [multilingual] alias, Features section) with develop's additions (python-dateutil core dep, [extract] extra, tunnel Bug Fixes and Internal sections). Prepares PR #1483 for merge into v3.3.6.
Bumps version 3.3.5 → 3.3.6 across pyproject.toml, version.py, plugin manifests (.claude-plugin/plugin.json, .claude-plugin/marketplace.json, .codex-plugin/plugin.json), README badge, and uv.lock. Flips CHANGELOG.md from ``[Unreleased]`` to ``[3.3.6] — 2026-05-24`` and backfills the major user-facing entries that landed without changelog entries during the cycle:
Features:
- MemPalace#1555 office-document mining via --mode extract + virtual line numbers
- MemPalace#1584 surgical closet pointers with date+line locators (Tier 6a)
- MemPalace#1558 + MemPalace#1560 within-wing hallways (entity co-occurrence graph)
- MemPalace#1565 cross-wing tunnels auto-promoted from hallways
- MemPalace#1578 Hebbian potentiation + Ebbinghaus decay on hallways/tunnels
- MemPalace#1236 API-tool transcripts auto-route to wing_api
- MemPalace#711 hooks.auto_save toggle for silent-mode sessions
- MemPalace#1605 COCA content-word filter for entity detection
- MemPalace#1557 case-insensitive entity matching at mine time
- MemPalace#1483 multilingual embeddings (embeddinggemma-300m) by default
Bug Fixes (selected, user-visible):
- MemPalace#1540 silent data loss in three unchunked upsert sites
- MemPalace#1538 paragraph chunker oversized chunks
- MemPalace#1554 per-file chunk cap too low for transcripts
- MemPalace#1562 Windows hook subprocess/ChromaDB deadlock
- MemPalace#1529 create_tunnel corrupted hyphenated wing names
- MemPalace#1424 save-hook truncated hyphenated project folders
- MemPalace#1383 KG cache duplicated graphs for symlinked/cased paths
- MemPalace#1466 silent symlink skip now logged
- MemPalace#1441 macOS stock-bash 3.2 hook compatibility
- MemPalace#1500 / MemPalace#1513 structured JSON-RPC errors on bad MCP input
- MemPalace#1523 VACUUM + FTS5 rebuild after repair
- MemPalace#1548 FTS5 validation at end of mine
- plus MemPalace#1216, MemPalace#1408, MemPalace#1438, MemPalace#1439, MemPalace#1445, MemPalace#1452, MemPalace#1459, MemPalace#1461, MemPalace#1466, MemPalace#1470, MemPalace#1477, MemPalace#1485, MemPalace#1500, MemPalace#1513, MemPalace#1528, MemPalace#1532, MemPalace#1543, MemPalace#1546, MemPalace#1585
Performance:
- MemPalace#1474 convo miner pre-fetches mined-set
- MemPalace#1487 rebuild_index progress callback
- MemPalace#1530 MCP cold-start diagnostics + opt-in warmup
Lint passes (ruff 0.15.14); mempalace-mcp entry point alignment verified per RELEASING.md.
Summary
- Multilingual support for `benchmarks.model_eval` so shipping decisions reflect non-English users. New `--language`/`--languages` flags load `dataset.{lang}.jsonl` alongside `dataset.jsonl`; pt-BR / es / zh translations included (633 samples across 4 tasks). Inputs translated; proper nouns and labels stay English so cross-lingual scoring works without re-translating ground truth.
- `--num-ctx` to force `options.num_ctx` per Ollama request, overriding the model's Modelfile default. Required for apples-to-apples VRAM/TPS comparison across candidates with mismatched defaults (e.g. `qwen3:4b-instruct-2507-q8_0` defaults to 32k = 9.7 GB resident; at 8k it's 5.6 GB).
- `--embed-model` with default flipped to `embeddinggemma`. The v1 `nomic-embed-text` cosine on EN↔PT-BR same-meaning pairs lands at ~0.607 (right at the 0.6 match threshold), so any phrasing drift collapses to false-negative. embeddinggemma lands ~0.766 with 2.7× the signal/noise spread. PT-BR `memory_extraction` recovered from 0.150 → 0.850 on the same model outputs after the swap; the previous score was a methodology artifact, not a model regression.
- `load_candidates()` now synthesizes entries for tags not in `candidates.yaml`, so ad-hoc tags (e.g. `igorls/gemma4-e4b-classifier:Q4_K_M`) work via `--candidates` without yaml edits.
- CSV output gains a `language` column.
Why now
We needed to pick between `igorls/gemma4-e4b-classifier:Q4_K_M` and `qwen3:4b-instruct-2507-q8_0` for MemPalace's 8 GB-VRAM tier. The English-only harness at n=30 couldn't separate them (deltas 1-6 points = noise-adjacent). The multilingual matrix (4 models × 5 tasks × 4 languages = 80 runs) produced clear signal: Gemma family owns open-set room cls in every language, Qwen wins entity by 5-7 points everywhere, classifier matches official Gemma within noise on every cell. The methodology fixes here are what made that signal trustworthy.
What this does NOT change
- `mempalace/llm_client.py` (added a `num_ctx` kwarg + `**provider_kwargs` forwarding in `get_provider`; no behavioral change unless the new kwarg is set).
- The `--embed-model` change does affect score numbers vs historical EN-only runs. The relative ranking is unchanged but absolute coverage/similarity scores will be slightly higher under the new default. Worth flagging when comparing new results to pre-this-PR baselines.
Test plan
- `uv run python -m benchmarks.model_eval.orchestrator --candidates qwen3:4b-instruct-2507-q4_K_M --tasks all --languages en,pt-BR --num-ctx 8192 --dataset-dir benchmarks/model_eval/datasets --output /tmp/pr-smoke.csv` produces 10 rows, language column populated, no errors
- `--num-ctx 8192` is actually applied: `curl -s localhost:11434/api/ps` shows `context_length: 8192` for a model whose Modelfile default is higher
- `--embed-model` override: run with `--embed-model nomic-embed-text` and confirm scores match historical EN baselines on `memory_extraction`
- Existing runs (no `--languages` flag) still produce identical output as before this PR
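The row counts quoted in this description (10 rows for the smoke test, 80 runs for the decision matrix) follow from a plain cross product of candidates, tasks, and languages. A small sketch, assuming the five task rows are the ones listed in the accuracy table earlier in the thread; `expand_matrix` and the placeholder model names are hypothetical.

```python
from itertools import product

# Assumed expansion of "--tasks all", taken from the accuracy table in this thread.
TASKS = [
    "room_classification closed",
    "room_classification open",
    "entity_extraction",
    "memory_extraction",
    "calibration",
]


def expand_matrix(candidates, tasks, languages):
    """One run per (model, task, language) combination."""
    return [
        {"model": model, "task": task, "language": lang}
        for model, task, lang in product(candidates, tasks, languages)
    ]


smoke = expand_matrix(["qwen3:4b-instruct-2507-q4_K_M"], TASKS, ["en", "pt-BR"])
full = expand_matrix(["m1", "m2", "m3", "m4"], TASKS, ["en", "pt-BR", "es", "zh"])
```

Here `len(smoke)` is 1 × 5 × 2 = 10 and `len(full)` is 4 × 5 × 4 = 80, matching the figures in the test plan and the "Why now" section.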