feat(benchmarks): add DE/FR/HI/IT/KO/RU datasets + --output-dir + translated labels by lealbrunocalhau · Pull Request #1503 · MemPalace/mempalace

lealbrunocalhau · 2026-05-13T23:28:37Z

Summary

Builds on top of #1483 with 6 additional languages and a few methodology fixes discovered during testing.

New datasets — 630 samples

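| Language | calibration | entity_extraction | memory_extraction | room_classification |
| --- | --- | --- | --- | --- |
| DE (German) | ✅ | ✅ | ✅ | ✅ |
| FR (French) | ✅ | ✅ | ✅ | ✅ |
| HI (Hindi) | ✅ | ✅ | ✅ | ✅ |
| IT (Italian) | ✅ | ✅ | ✅ | ✅ |
| KO (Korean) | ✅ | ✅ | ✅ | ✅ |
| RU (Russian) | ✅ | ✅ | ✅ | ✅ |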

Same conventions as pt-BR/es/zh: inputs translated, proper nouns and system labels (entity types, room slugs, memory types) stay English.

labels.ko.jsonl for memory_extraction

During a 210-run matrix (6 models × 7 languages × 5 tasks) we noticed memory_extraction scores collapsed by roughly 0.5 (on the 0–1 score scale) for non-EN inputs. Investigation showed models were correctly extracting memories in the input language, but we were scoring against English ground truth, which produced an artificial score drop.

Added labels.ko.jsonl with Korean ground-truth content as a starting point. runner.py now loads labels.{lang}.jsonl when present, falling back to labels.jsonl. Quick test: KO score went from 0.40 → 0.60 after the fix (EN baseline 1.0). The remaining gap is real model difficulty, not methodology noise.

The other languages still use English labels for now — contributed as-is so the datasets are available for testing. Generating all translated labels is a follow-up.

--output-dir

Alternative to --output for multilingual matrix runs. Writes <dir>/<lang>/YYYY-MM-DD-<host>.csv per language instead of one flat file. The existing --output single-file mode is unchanged and --num-ctx, --llm-provider, --embed-endpoint all work exactly as before.
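A minimal sketch of the per-language layout described above. It assumes the date part is an ISO date and `<host>` is the machine hostname; `per_language_csv_path` and its parameters are hypothetical names for illustration, not the orchestrator's actual helper.

```python
import socket
from datetime import date
from pathlib import Path
from typing import Optional


def per_language_csv_path(
    output_dir: Path,
    lang: str,
    day: Optional[date] = None,
    host: Optional[str] = None,
) -> Path:
    """Build <dir>/<lang>/YYYY-MM-DD-<host>.csv (hypothetical helper)."""
    day = day or date.today()            # assumption: the run date is "today"
    host = host or socket.gethostname()  # assumption: <host> is the hostname
    return output_dir / lang / f"{day.isoformat()}-{host}.csv"
```

Calling it once per language code (for example `en`, `de`, `ko`) yields one distinct path per locale, which is what keeps the per-language files from colliding.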

Other additions

translate_datasets.py: script used to generate translations via Ollama — contributors can extend to new languages
candidates.yaml: community tier (igorls classifier variants, heretic) and local tier
reports/2026-05-13-multilingual.md: full 210-run benchmark report

Test plan

--languages en,de,fr,hi,it,ko,ru --output /tmp/test.csv produces 7 rows per model/task, no errors
--output-dir /tmp/results creates en/, de/, ... subfolders each with a CSV
--num-ctx 8192 still works (Igor's flag, unchanged)
KO memory_extraction score is higher with labels.ko.jsonl present than without
Existing EN-only runs produce identical output (backward compat)

…nslated labels

Adds 6 new language datasets (German, French, Hindi, Italian, Korean, Russian) across all 4 benchmark tasks (calibration, entity_extraction, memory_extraction, room_classification): 630 samples total, same conventions as the existing pt-BR/es/zh datasets. Inputs are translated; labels/ground-truth stay English except where noted.

Changes:
- 24 new dataset.{de,fr,hi,it,ko,ru}.jsonl files across all 4 tasks
- labels.ko.jsonl for memory_extraction: Korean ground-truth so the scorer compares Korean model output against Korean expected content instead of English (fixes ~20pp score gap identified during testing; see report)
- runner.py: loads labels.{lang}.jsonl when present, falls back to labels.jsonl
- orchestrator.py: adds --output-dir (writes <dir>/<lang>/YYYY-MM-DD-<host>.csv per language); --output single-file mode unchanged
- candidates.yaml: adds community tier (igorls classifier variants, heretic) and local tier (gemma4:e4b)
- translate_datasets.py: script used to generate the translations via Ollama; included so contributors can extend to new languages without manual work
- reports/2026-05-13-multilingual.md: 210-run benchmark report across 6 models × 7 languages × 5 tasks on RTX 3080 Laptop 8 GB

gemini-code-assist

Code Review

This pull request introduces multilingual support to the model evaluation benchmark suite, adding translated datasets for six languages and a new script to automate translations via LLM. The orchestrator has been updated to support per-language output directories. The code review identifies a high-severity issue in the orchestrator where multiple file handles are opened for the same path in single-file mode, which could lead to data corruption. Additional feedback includes moving imports to the module level for PEP 8 compliance and using safer dictionary access in the translation script to prevent potential KeyError exceptions.

gemini-code-assist · 2026-05-13T23:34:06Z

+        if p not in seen_paths:
+            fh = open(p, "w", newline="", encoding="utf-8")
+            w = csv.DictWriter(fh, fieldnames=CSV_COLUMNS)
+            w.writeheader()
+            fh.flush()
+            seen_paths.add(p)
+        else:
+            fh = open(p, "a", newline="", encoding="utf-8")
+            w = csv.DictWriter(fh, fieldnames=CSV_COLUMNS)
+        lang_files[lang] = (fh, w, p)


When --output is used (single file mode), this logic opens the same file multiple times in different modes ('w' for the first language, 'a' for subsequent ones). This results in multiple active file handles to the same path, which can lead to data corruption or interleaved writes due to buffering. It is better to reuse the same file handle for a given path.

Suggested change

-        if p not in seen_paths:
-            fh = open(p, "w", newline="", encoding="utf-8")
-            w = csv.DictWriter(fh, fieldnames=CSV_COLUMNS)
-            w.writeheader()
-            fh.flush()
-            seen_paths.add(p)
-        else:
-            fh = open(p, "a", newline="", encoding="utf-8")
-            w = csv.DictWriter(fh, fieldnames=CSV_COLUMNS)
-        lang_files[lang] = (fh, w, p)
+    # Pre-create all output files (headers written before the first run starts).
+    lang_files: dict[str, tuple] = {}
+    path_to_fh: dict[Path, tuple] = {}
+    for lang in languages:
+        p = _csv_path(lang)
+        p.parent.mkdir(parents=True, exist_ok=True)
+        if p not in path_to_fh:
+            fh = open(p, "w", newline="", encoding="utf-8")
+            w = csv.DictWriter(fh, fieldnames=CSV_COLUMNS)
+            w.writeheader()
+            fh.flush()
+            path_to_fh[p] = (fh, w)
+        lang_files[lang] = (*path_to_fh[p], p)
+        print(f" {lang} → {p}")

gemini-code-assist · 2026-05-13T23:34:06Z

-    args.output.parent.mkdir(parents=True, exist_ok=True)
+    # Resolve output path(s). --output-dir writes one CSV per language so results
+    # from long multilingual runs stay grouped by locale and are easier to diff.
+    import socket


The import socket statement should be moved to the top of the file with other imports to follow PEP 8 guidelines and ensure consistency.

References

Imports should always be put at the top of the file, just after any module comments and docstrings, and before module globals and constants.

gemini-code-assist · 2026-05-13T23:34:06Z

+    results = [None] * len(samples)
+
+    work = [
+        (i, s["id"], s.get(field, ""), language_name, model, endpoint)


Accessing s["id"] directly assumes that every sample in the dataset contains an 'id' field. If a sample is missing this key, the script will crash with a KeyError. Using .get() with a fallback is safer.

Suggested change

-        (i, s["id"], s.get(field, ""), language_name, model, endpoint)
+        (i, s.get("id", f"idx_{i}"), s.get(field, ""), language_name, model, endpoint)

Copilot

Pull request overview

Adds multilingual benchmark coverage and related harness improvements for evaluating model behavior across DE/FR/HI/IT/KO/RU datasets.

Changes:

Adds translated benchmark datasets across calibration, entity extraction, memory extraction, and room classification.
Adds --output-dir support for per-language CSV output.
Adds translated Korean memory labels, translation tooling, new candidate tiers, and a multilingual benchmark report.

Reviewed changes

Copilot reviewed 30 out of 30 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| `benchmarks/model_eval/orchestrator.py` | Adds `--output-dir` and per-language CSV handling. |
| `benchmarks/model_eval/runner.py` | Loads language-specific labels when available. |
| `benchmarks/model_eval/translate_datasets.py` | Adds Ollama-based dataset translation utility. |
| `benchmarks/model_eval/candidates.yaml` | Adds community/local candidate entries. |
| `benchmarks/model_eval/reports/2026-05-13-multilingual.md` | Adds multilingual benchmark report. |
| `benchmarks/model_eval/datasets/calibration/dataset.de.jsonl` | Adds German calibration dataset. |
| `benchmarks/model_eval/datasets/calibration/dataset.fr.jsonl` | Adds French calibration dataset. |
| `benchmarks/model_eval/datasets/calibration/dataset.hi.jsonl` | Adds Hindi calibration dataset. |
| `benchmarks/model_eval/datasets/calibration/dataset.it.jsonl` | Adds Italian calibration dataset. |
| `benchmarks/model_eval/datasets/calibration/dataset.ko.jsonl` | Adds Korean calibration dataset. |
| `benchmarks/model_eval/datasets/calibration/dataset.ru.jsonl` | Adds Russian calibration dataset. |
| `benchmarks/model_eval/datasets/entity_extraction/dataset.de.jsonl` | Adds German entity extraction dataset. |
| `benchmarks/model_eval/datasets/entity_extraction/dataset.fr.jsonl` | Adds French entity extraction dataset. |
| `benchmarks/model_eval/datasets/entity_extraction/dataset.hi.jsonl` | Adds Hindi entity extraction dataset. |
| `benchmarks/model_eval/datasets/entity_extraction/dataset.it.jsonl` | Adds Italian entity extraction dataset. |
| `benchmarks/model_eval/datasets/entity_extraction/dataset.ko.jsonl` | Adds Korean entity extraction dataset. |
| `benchmarks/model_eval/datasets/entity_extraction/dataset.ru.jsonl` | Adds Russian entity extraction dataset. |
| `benchmarks/model_eval/datasets/memory_extraction/dataset.de.jsonl` | Adds German memory extraction dataset. |
| `benchmarks/model_eval/datasets/memory_extraction/dataset.fr.jsonl` | Adds French memory extraction dataset. |
| `benchmarks/model_eval/datasets/memory_extraction/dataset.hi.jsonl` | Adds Hindi memory extraction dataset. |
| `benchmarks/model_eval/datasets/memory_extraction/dataset.it.jsonl` | Adds Italian memory extraction dataset. |
| `benchmarks/model_eval/datasets/memory_extraction/dataset.ko.jsonl` | Adds Korean memory extraction dataset. |
| `benchmarks/model_eval/datasets/memory_extraction/dataset.ru.jsonl` | Adds Russian memory extraction dataset. |
| `benchmarks/model_eval/datasets/memory_extraction/labels.ko.jsonl` | Adds Korean memory extraction labels. |
| `benchmarks/model_eval/datasets/room_classification/dataset.de.jsonl` | Adds German room classification dataset. |
| `benchmarks/model_eval/datasets/room_classification/dataset.fr.jsonl` | Adds French room classification dataset. |
| `benchmarks/model_eval/datasets/room_classification/dataset.hi.jsonl` | Adds Hindi room classification dataset. |
| `benchmarks/model_eval/datasets/room_classification/dataset.it.jsonl` | Adds Italian room classification dataset. |
| `benchmarks/model_eval/datasets/room_classification/dataset.ko.jsonl` | Adds Korean room classification dataset. |
| `benchmarks/model_eval/datasets/room_classification/dataset.ru.jsonl` | Adds Russian room classification dataset. |

Comments suppressed due to low confidence (6)

benchmarks/model_eval/datasets/entity_extraction/dataset.hi.jsonl:25

This Hindi dataset row is still entirely English, so the hi entity-extraction benchmark includes untranslated input and can overstate Hindi performance for this sample.

{"id": "ent_025", "text": "Aria and the user had a long conversation about the Embedding Spaces visualization tool. The user, Saela, wants to add a temporal slider so she can replay how clusters formed over the indexing period. Sketched the API. Coordinating with Brennan Lyle on the rendering side — he has experience with WebGL from his time at Bridgewater Visualization."}

benchmarks/model_eval/datasets/entity_extraction/dataset.hi.jsonl:32

This Hindi dataset row is still entirely English, which contradicts the translated-input methodology and makes this hi sample non-comparable with the rest of the Hindi dataset.

{"id": "ent_032", "text": "Bramble visited the Hollowmounts Institute experimental garden. Their Native Species program is doing remarkable work with seed-saving for regional ecotypes. Met with their lead horticulturist, Iset Karadzic — same Iset as the nursery, she consults with the Institute as well. She gave me a packet of cardinal flower seed from the local provenance."}

benchmarks/model_eval/datasets/entity_extraction/dataset.hi.jsonl:38

This Hindi dataset row is still entirely English, so this sample does not exercise Hindi entity extraction and can skew the reported hi score.

{"id": "ent_038", "text": "Solas worked through a thread-safety audit for the Distributed Tracing client library. Found two race conditions that ThreadSanitizer hadn't caught — they only manifested at very high concurrency on the Crestmoor production stack. Mette Olafsen confirmed she could reproduce. Pushed fixes; both reviewed by Brennan Lyle."}

benchmarks/model_eval/datasets/room_classification/dataset.hi.jsonl:26

This Hindi room-classification row remains in English, which contradicts the translated-input methodology and makes this sample non-comparable with the rest of the Hindi dataset.

{"id": "rc_026", "agent": "Solas", "session_summary": "Wrote a small LLVM pass that hoists loop-invariant GEPs above their containing loops. The trick was getting the alias analysis to confirm the base pointer doesn't escape across the loop boundary. Tested on a few microbenchmarks; saw 4-12% speedups on the inner loops where it triggered.", "include_messy_features": false}

benchmarks/model_eval/datasets/room_classification/dataset.hi.jsonl:28

Most of this Hindi sample is still English prose. Even with include_messy_features, the natural-language input should be translated so the Hindi benchmark is not partially measuring English performance.

{"id": "rc_028", "agent": "Solas", "session_summary": "long debug — parser was inf-looping on certain inputs. left-recursive rule i'd added during refactoring. combinators dont handle left recursion natively (packrat+memoization does, but ours isnt packrat). restructured the affected rules into pratt-style operator precedence:\n```\nlet parse_expr min_prec =\n  let lhs = ref (parse_atom ()) in\n  while peek_prec () >= min_prec do ...\n```\nworks now. ALSO need to add a fuzzer corpus check before tagging the release.", "include_messy_features": true}

benchmarks/model_eval/datasets/memory_extraction/labels.ko.jsonl:15

This label no longer preserves the proper noun Doreth from the source/input text, so a correct Korean extraction that keeps Doreth can be scored against mismatched ground truth.

{"id": "mem_015", "memories": [{"type": "commitment", "content": "주말 종료 전에 독스에게 유전자 사과의 가지치기 일정을 이메일로 보내겠다."}]}

+  # ── Local tier: models already pulled, no extra disk needed ──────────
+  - tag: gemma4:e4b
+    family: gemma4
+    size_b: 4.0
+    variant: instruct
+    quantization: default
+    expected_vram_mb: 4510
+    tier: local


+{"id": "ent_017", "text": "Bramble ने Pollinator Paths के लिए Bridgewater Schools की बगीचा समिति के साथ परामर्श किया। वे केवल देशी पौधों का ही रोपण चाहते हैं लेकिन उनका बजट सख्त है। Wendelsea Native Plant Nursery के plug trays से शुरुआत करने की सलाह दी गई — ये gallon containers से सस्ते हैं और जल्दी जड़ पकड़ लेते हैं। उनके संस्थापक Iset Karadzic से बात की गई, जिन्होंने स्कूलों को अपनी गैर-लाभकारी दर देने के लिए सहमति दी।"}
+{"id": "ent_018", "text": "Solas ने Pol Krisat के साथ मिलकर नई parser combinator library को Aerwyn Labs codebase में एकीकृत किया। एकीकरण साफ-सुथरा था, सिवाय एक जगह के जहाँ उनका custom error type हमारे Distributed Tracing context से टकरा रहा था। एक छोटे adapter trait को पेश करके इसे हल किया गया। आसान review के लिए एकीकरण को एक अलग PR में push किया गया।"}
+{"id": "ent_019", "text": "Thresh Bridgewater Studio के नेतृत्व के साथ एक योजना बैठक में शामिल हुआ। उनके CFO Ivora Tinn कार्यकारी टीम के लिए एक नया Cashflow Models dashboard लॉन्च कर रही हैं। संचालन निदेशक Karis Tornau विभाग-वार drill-down चाहती हैं; Ivora केवल उच्च-स्तरीय सारांश चाहती हैं। समझौता: दो views, एक ही अंतर्निहित data।"}
+{"id": "ent_020", "text": "Aria explored the relationship between attention entropy and reasoning faithfulness. The Hollowmounts Institute paper from last year claimed they were positively correlated; the Aerwyn Labs replication found the opposite. Drafted a synthesis that argues both papers are right but for different model scales. Need to verify this on the Crestmoor cluster before claiming anything."}


+{"id": "rc_014", "agent": "Aria", "session_summary": "परिणाम तालिका के एक ड्राफ्ट की समीक्षा की। उपयोगकर्ता ने एक paired t-test चलाया था, लेकिन अंतर स्पष्ट रूप से bimodal हैं — छोटे सुधारों का एक समूह और बड़े सुधारों का एक समूह है। Wilcoxon signed-rank test पर स्विच करने और माध्यिका अंतर तथा bootstrap CI दोनों की रिपोर्टिंग करने की सिफारिश की।", "include_messy_features": false}
+{"id": "rc_015", "agent": "Aria", "session_summary": "आते हुए दस्तावेज़ एम्बेडिंग्स पर स्ट्रीमिंग HDBSCAN के लिए कोड का मसौदा तैयार किया। तरकीब यह है कि न्यूनतम स्पैनिंग ट्री को क्रमिक रूप से बनाए रखना हो; हर बैच पर पूर्ण पुनः क्लस्टरिंग बहुत धीमी है। Crestmoor समूह का एक 2022 का पेपर मिला जिसमें सही एल्गोरिदम है। कल लागू करेंगे।", "include_messy_features": false}
+{"id": "rc_016", "agent": "Aria", "session_summary": "tex compile error फिर से। \\usepackage{algorithm2e} का \\usepackage{algorithmic} से संघर्ष होता है जब दोनों acmart द्वारा लोड किए जाते हैं। समाधान: केवल algorithm2e लोड करें और \\SetAlgoNoLine का उपयोग करें। साथ ही bibliography style [smith2023] प्रविष्टि में अनुपस्थित 'doi' field की शिकायत कर रहा था — इसे दस्ती जोड़ दिया।", "include_messy_features": true}
+{"id": "rc_017", "agent": "Aria", "session_summary": "Talked through whether log-uniform or beta priors are better for hyperparameter search in a small-data regime. The user was running a Bayesian optimization and getting weird convergence. The issue was their search space was huge (5 dims, broad bounds) and they only had 30 evals — that's just not enough budget. Suggested either narrowing the bounds or switching to random search to diagnose.", "include_messy_features": false}


+{"id": "mem_013", "memories": [{"type": "fact", "content": "웨인셀라 식물원의 올해 첫 피어나는 날짜는 장기 평균보다 8일 더 일찍 나타났다."}, {"type": "fact", "content": "웨인셀라 식물원의 첫 피어나는 기록은 18년 전부터 시작되었으며, 이는 기록상 가장 조기의 봄이다."}]}
+{"id": "mem_014", "memories": [{"type": "opinion", "content": "언어적 편견(웨인스는 드루카르보다 더 부드러운 말을 한다)을 가상의 민족에 적용하면 페이지에서 단순화되고 인물들을 평평하게 만들 수 있다."}]}
+{"id": "mem_015", "memories": [{"type": "commitment", "content": "주말 종료 전에 독스에게 유전자 사과의 가지치기 일정을 이메일로 보내겠다."}]}


+            completed += 1
+            print(f"  [{completed}/{len(samples)}] {sample_id}", flush=True)


+            if attempt == 2:
+                print(f"    ERROR on {sample_id}: {e}", file=sys.stderr, flush=True)
+                return idx, sample_id, text  # fall back to English


+        for task in tasks:
+            field = TASK_TEXT_FIELD[task]
+            src = args.dataset_dir / task / "dataset.jsonl"


+        if p not in seen_paths:
+            fh = open(p, "w", newline="", encoding="utf-8")
+            w = csv.DictWriter(fh, fieldnames=CSV_COLUMNS)
+            w.writeheader()
+            fh.flush()
+            seen_paths.add(p)
+        else:
+            fh = open(p, "a", newline="", encoding="utf-8")


        )
    samples = load_jsonl(dataset_path)
-    labels = load_jsonl(task_dir / "labels.jsonl")
+    labels_path = task_dir / (f"labels.{language}.jsonl" if language != "en" and (task_dir / f"labels.{language}.jsonl").exists() else "labels.jsonl")


igorls · 2026-05-14T01:36:15Z

Overview

Stacks on feat/benchmark-multilingual (#1483). Adds 6 translated dataset languages (DE/FR/HI/IT/KO/RU), --output-dir for per-language CSV output, labels.ko.jsonl to fix the cross-language scoring bug on memory_extraction, a translate_datasets.py Ollama helper, new community/local candidate tiers, and a 210-run multilingual benchmark report.

The methodology fix (per-language labels) is well-motivated and the report is genuinely useful. Most concerns are localized.

Issues

High — `--output` single-file mode opens N handles to the same path

In orchestrator.py:

for lang in languages:
    p = _csv_path(lang)
    p.parent.mkdir(parents=True, exist_ok=True)
    if p not in seen_paths:
        fh = open(p, "w", ...)
        ...
        seen_paths.add(p)
    else:
        fh = open(p, "a", ...)   # ← second handle to the SAME file
    lang_files[lang] = (fh, w, p)

When --output one.csv is combined with multiple languages, _csv_path() returns the same Path for every language. The first iteration opens a "w" handle; subsequent iterations open additional "a" handles to the same file. All handles stay open simultaneously, each with its own buffer and offset. Interleaved writer.writerow(...) + fh.flush() from different handles can overwrite each other (the "w" handle's position advances independently of "a" appends). gemini-code-assist flagged this — it's a real correctness bug, not just style.

Fix: detect single-file mode once and share one (fh, writer) pair across all languages:

if args.output:
    # Single-file mode: one handle and one writer shared by every language.
    fh = open(args.output, "w", newline="", encoding="utf-8")
    w = csv.DictWriter(fh, fieldnames=CSV_COLUMNS)
    w.writeheader()
    fh.flush()
    lang_files = {lang: (fh, w, args.output) for lang in languages}
else:
    # per-language files as today (no path collisions possible)
    ...

The test plan item "Existing EN-only runs produce identical output" passes only because a single-language run never hits the collision — a multi-language --output run would.
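One way to make the sharing hold in both modes is to key open handles by path, so a repeated path reuses the existing writer and each handle is closed exactly once. The following is a sketch under assumptions: `open_writers`, `close_all`, and the three-column `CSV_COLUMNS` are placeholders for illustration, not the orchestrator's real names or schema.

```python
import csv
from pathlib import Path

CSV_COLUMNS = ["language", "model", "score"]  # placeholder columns for this sketch


def open_writers(paths_by_lang):
    """Map each language to (fh, writer), opening every distinct path only once."""
    by_path = {}
    lang_files = {}
    for lang, path in paths_by_lang.items():
        path = Path(path)
        if path not in by_path:
            path.parent.mkdir(parents=True, exist_ok=True)
            fh = open(path, "w", newline="", encoding="utf-8")
            writer = csv.DictWriter(fh, fieldnames=CSV_COLUMNS)
            writer.writeheader()
            by_path[path] = (fh, writer)
        # Languages that resolve to the same path share the same pair.
        lang_files[lang] = by_path[path]
    handles = [fh for fh, _ in by_path.values()]
    return lang_files, handles


def close_all(handles):
    """Close each distinct handle once (the list holds no duplicates)."""
    for fh in handles:
        fh.close()
```

With `--output`, every language maps to the same pair and rows land in one file in write order; with `--output-dir`, each language gets its own pair. Either way the cleanup loop sees each handle once.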

Medium — Untranslated samples in `dataset.hi.jsonl`

ent_020, ent_025, ent_032, ent_038 in datasets/entity_extraction/dataset.hi.jsonl are still entirely English, and Copilot flagged rc_026, rc_028 in the Hindi room-classification set. These inflate the Hindi numbers (the model isn't actually processing Hindi for those rows). Worth a quick sweep — grep -L '[ऀ-ॿ]'-style — across all six new languages before the report's numbers get cited downstream. The Russian set is worth double-checking for the same reason (the report calls RU the lowest scorer).
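The sweep could also be scripted rather than grepped. The sketch below flags rows whose text field contains no character from the expected script; `SCRIPT_RANGES`, `has_script`, and `untranslated_ids` are hypothetical names. It only covers the non-Latin languages (HI, KO, RU), since DE/FR/IT share the Latin script with English and would need a different heuristic. Empty strings are flagged too.

```python
import json

# Unicode block ranges for the non-Latin scripts among the new languages.
SCRIPT_RANGES = {
    "hi": (0x0900, 0x097F),  # Devanagari
    "ko": (0xAC00, 0xD7A3),  # Hangul syllables
    "ru": (0x0400, 0x04FF),  # Cyrillic
}


def has_script(text: str, lang: str) -> bool:
    """True if any character of text falls in the script range for lang."""
    lo, hi = SCRIPT_RANGES[lang]
    return any(lo <= ord(ch) <= hi for ch in text)


def untranslated_ids(jsonl_lines, lang, field):
    """Yield ids of rows whose `field` has no character in the expected script."""
    for line in jsonl_lines:
        if not line.strip():
            continue
        row = json.loads(line)
        if not has_script(row.get(field, ""), lang):
            yield row.get("id")
```

For the files in this PR the field would be `text` for entity_extraction and `session_summary` for room_classification, going by the rows quoted above.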

Medium — `labels.ko.jsonl` doesn't honor the proper-noun rule

The translation prompt and the input datasets are explicit that proper nouns stay English. The Korean labels don't follow the same rule:

labels.ko.jsonl:15 replaces Doreth with 독스 (transliteration) and translates "heritage apple" as 유전자 사과 ("genetic apple" — a mistranslation; should be 재래종/고대 사과).
Other lines transliterate names inconsistently: 세라 for Saela, 이보라 for Ivora, 렌 소란케 for Ren Solanke, 브램블 for Bramble, 폴 크리사트 for Pol Krisat, while keeping Solas/Ivora/Tartine Lab in English elsewhere.

Cosine scoring with nomic-embed-text is somewhat robust to this, but inconsistency in the ground truth still adds noise to exactly the metric this file was added to fix. Worth one pass to normalize names back to English before more contributors copy this format for DE/FR/IT/HI/RU labels.

Medium — `runner.py` label-resolution one-liner

labels_path = task_dir / (f"labels.{language}.jsonl" if language != "en" and (task_dir / f"labels.{language}.jsonl").exists() else "labels.jsonl")

Two readability nits and one real concern:

.exists() is computed inside a conditional expression that builds the path string twice. Split it:

candidate = task_dir / f"labels.{language}.jsonl"
labels_path = candidate if language != "en" and candidate.exists() else task_dir / "labels.jsonl"

Silent fallback. When DE/FR/HI/IT/RU runs hit memory_extraction, they fall back to labels.jsonl (English ground truth) without any log line. That's exactly the bug the PR was written to address, and the report (§ Memory Extraction) acknowledges the resulting score collapse. At minimum, print a one-line info: language=de labels=labels.jsonl (no labels.de.jsonl found) so future readers don't draw "the model is bad at German" from numbers that are actually methodology-bound.
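Combining the split above with the suggested log line might look like the sketch below. `resolve_labels_path` is a hypothetical function name, and writing the message to stderr is an assumption; the review only asks for a one-line info message.

```python
import sys
from pathlib import Path


def resolve_labels_path(task_dir: Path, language: str) -> Path:
    """Prefer labels.<language>.jsonl; otherwise fall back to labels.jsonl, loudly."""
    default = task_dir / "labels.jsonl"
    if language == "en":
        return default
    candidate = task_dir / f"labels.{language}.jsonl"
    if candidate.exists():
        return candidate
    # Make the fallback visible so low scores are not misread as model failure.
    print(
        f"info: language={language} labels={default.name} "
        f"(no {candidate.name} found)",
        file=sys.stderr,
    )
    return default
```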

Low

import socket inside main() in orchestrator — move to module top (gemini's comment, PEP 8).
Unused out_path in fh, writer, out_path = lang_files[language].
l as loop variable in langs = [l.strip() for l in args.languages.split(",") ...] in translate_datasets.py — ruff E741 (ambiguous with 1). Use lang or code.
_translate_one implicit-None return path is hard to read on inspection. The function always returns in practice (attempt 0/1 raise → continue; attempt 2 always returns either the success or the fallback), but a for/else or explicit return at the bottom would make that obvious.
Report file is bilingual PT/EN — the analytical paragraphs in reports/2026-05-13-multilingual.md are in Portuguese while tables and headers are English. Existing benchmark reports in the repo are English-only; worth aligning for searchability.
_SYSTEM_PROMPT hardcodes proper-noun lists. Reliable, but if datasets grow, the list will drift. A --proper-nouns-file argument (one name per line) would future-proof it without complicating the common case.
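For the `_translate_one` item, an explicit return at the bottom of the retry loop could look like the following sketch. `translate_with_retry` and its `translate` callable are hypothetical stand-ins; the real function talks to Ollama, which is not reproduced here.

```python
import sys
from typing import Callable, Optional, Tuple


def translate_with_retry(
    idx: int,
    sample_id: str,
    text: str,
    translate: Callable[[str], str],
    attempts: int = 3,
) -> Tuple[int, str, str]:
    """Call translate(text) up to `attempts` times; fall back to the source text."""
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        try:
            return idx, sample_id, translate(text)
        except Exception as exc:  # broad on purpose: any failure triggers a retry
            last_error = exc
    # Every attempt failed: report once and return the untranslated text.
    print(f"    ERROR on {sample_id}: {last_error}", file=sys.stderr, flush=True)
    return idx, sample_id, text
```

Every path through the function now ends in a visible `return`, so no reader has to reason about loop fall-through.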

Style / convention

Translation script timeouts (30s connect / 120s read) and temperature=0.1 are sensible. Streaming approach is correct for cloud models with high TTFT.
--output-dir schema (<dir>/<lang>/YYYY-MM-DD-<host>.csv) matches existing single-file naming. Good.
The community / local tier additions to candidates.yaml are well-documented with notes.

Security / privacy

translate_datasets.py defaults to kimi-k2.6:cloud. Cloud Ollama tags send dataset prose to a remote endpoint. The benchmark inputs in this PR are synthetic, so it's fine here, but the script's docstring should call out "do not run this on real user data" — someone will repurpose it on their own palace eventually, and that would violate the local-first principle in CLAUDE.md.

Test coverage

No new unit tests, which is reasonable for benchmark-tooling changes. The PR's test plan is a runbook rather than automated tests. Worth at least one regression: a 1-line check that _csv_path("en") returns the right path under both --output and --output-dir — that would have caught the file-handle bug above.

Summary

Solid contribution; the methodology insight (extract-in-input-language → need translated labels) is the kind of finding that justifies the whole PR. Blockers: the orchestrator single-file-mode bug and the untranslated Hindi samples. The rest can be follow-ups, including extending labels.{lang}.jsonl to the remaining five languages (already called out as future work).

…untranslated samples, KO labels

Addresses the review feedback from igorls, gemini-code-assist, and Copilot.

HIGH:
- orchestrator: --output single-file mode now shares ONE (fh, writer) across all languages instead of opening N handles to the same path. The old code caused interleaved buffer corruption: first language opened "w", subsequent ones opened "a", and writes from independent file offsets could overwrite each other. Verified with a multi-language --output smoke test (4 rows written, all distinct).
- 19 untranslated/empty samples re-translated:
  - dataset.de.jsonl: cal_017
  - dataset.hi.jsonl entity_extraction: ent_020, ent_025, ent_032, ent_038
  - dataset.hi.jsonl room_classification: rc_017, rc_026, rc_028, rc_040, rc_064, rc_089, rc_091
  - dataset.ko.jsonl room_classification: rc_027, rc_067
  - dataset.it.jsonl room_classification: rc_029, rc_030, rc_031, rc_032, rc_053 (previously empty strings)
- labels.ko.jsonl: restored all proper nouns to English (Doreth, Saela, Ivora, Ren Solanke, Pol Krisat, Pell Halloran, Bramble, Hollowmounts Institute, Wendelsea, Bridgewater Community Garden, Wends, Drukar, Aerwyn cycle, Jaccard, Mason bee, Markdown). Also fixed mistranslation 유전자 사과 (genetic apple) → 재래종 사과 (heirloom apple).

MEDIUM:
- runner.py: refactored label-resolution one-liner into 3 readable lines and added an info log when falling back to English ground truth, so readers don't misread "score collapse" as model failure.

LOW:
- orchestrator: moved `import socket` to module top (PEP 8); removed unused `out_path` from the unpacking tuple.
- translate_datasets.py: renamed loop variable `l` → `code` (ruff E741); made the _translate_one fallback return path explicit instead of relying on for-loop fall-through; added a privacy warning in the docstring flagging that the default `kimi-k2.6:cloud` sends prose to a remote endpoint and should not be used over real palace data.
- 2026-05-13-multilingual.md: converted analytical paragraphs from Portuguese to English to match the existing repo convention.

lealbrunocalhau · 2026-05-14T03:17:23Z

Review fixes pushed ✓

All feedback from your review has been addressed:

orchestrator.py

Fixed critical file-handle bug in --output mode (was opening multiple handles to same file with independent buffers)
Now shares single (fh, writer) pair across all languages in single-file mode
Added explicit deduplication in finally block to prevent double-close

runner.py

Refactored label resolution from one-liner to readable 3-line version
Added explicit fallback logging when no language-specific labels exist
Log: "info: language=de labels=labels.jsonl (no labels.de.jsonl found — scoring against English ground truth)"

datasets

Fixed 19 untranslated samples across 4 languages (DE, HI, IT, KO)
All samples re-translated via Ollama with proper noun preservation rules

labels.ko.jsonl

Restored all transliterated proper nouns to English (30+ entries)
Fixed mistranslation: 유전자 사과 (genetic apple) → 재래종 사과 (heirloom apple)

Code style

Moved import socket to module top (PEP 8)
Fixed ruff E741 (ambiguous variable name)
Made _translate_one return path explicit

reports

Converted 2026-05-13-multilingual.md to English-only (was Portuguese)
Updated methodology note explaining Korean labels fix

Ready for next review pass.

lealbrunocalhau requested review from igorls and milla-jovovich as code owners May 13, 2026 23:28

igorls requested a review from Copilot May 13, 2026 23:30

Copilot started reviewing on behalf of igorls May 13, 2026 23:31

gemini-code-assist Bot reviewed May 13, 2026

Copilot AI reviewed May 13, 2026

igorls merged commit 6873426 into MemPalace:feat/benchmark-multilingual May 14, 2026

igorls mentioned this pull request May 14, 2026

feat(benchmarks): multilingual datasets + parity controls (embed model, num_ctx, language) #1483

Merged

5 tasks

igorls merged 2 commits into MemPalace:feat/benchmark-multilingual from workblac:feat/benchmark-multilingual-extended-languages
