feat(benchmarks): add DE/FR/HI/IT/KO/RU datasets + --output-dir + translated labels #1503

Merged
igorls merged 2 commits into MemPalace:feat/benchmark-multilingual from workblac:feat/benchmark-multilingual-extended-languages
May 14, 2026

@lealbrunocalhau

Contributor

Summary

Builds on top of #1483 with 6 additional languages and a few methodology fixes discovered during testing.

New datasets — 630 samples

Language calibration entity_extraction memory_extraction room_classification
DE (German)
FR (French)
HI (Hindi)
IT (Italian)
KO (Korean)
RU (Russian)

Same conventions as pt-BR/es/zh: inputs translated, proper nouns and system labels (entity types, room slugs, memory types) stay English.

labels.ko.jsonl for memory_extraction

During a 210-run matrix (6 models × 7 languages × 5 tasks) we noticed that memory_extraction scores dropped by roughly 0.5 (on the 0 to 1 score scale) for non-EN inputs. Investigation showed the models were correctly extracting memories in the input language, but we were scoring them against English ground truth → artificial score drop.

Added labels.ko.jsonl with Korean ground-truth content as a starting point. runner.py now loads labels.{lang}.jsonl when present, falling back to labels.jsonl. Quick test: KO score went from 0.40 → 0.60 after the fix (EN baseline 1.0). The remaining gap is real model difficulty, not methodology noise.

The other languages still use English labels for now — contributed as-is so the datasets are available for testing. Generating all translated labels is a follow-up.

--output-dir

Alternative to --output for multilingual matrix runs. Writes <dir>/<lang>/YYYY-MM-DD-<host>.csv per language instead of one flat file. The existing --output single-file mode is unchanged and --num-ctx, --llm-provider, --embed-endpoint all work exactly as before.
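
The layout described above can be sketched as a small path helper. This is a minimal illustration, not the PR's code: `csv_path_for` and its parameters are hypothetical names, and giving `output_dir` precedence when both options are set is an assumption.

```python
import socket
from datetime import date
from pathlib import Path
from typing import Optional


def csv_path_for(lang: str,
                 output: Optional[Path],
                 output_dir: Optional[Path],
                 today: Optional[date] = None,
                 host: Optional[str] = None) -> Path:
    """Return the CSV path for one language (hypothetical helper)."""
    if output_dir is not None:
        today = today or date.today()
        host = host or socket.gethostname()
        # Per-language mode: <dir>/<lang>/YYYY-MM-DD-<host>.csv
        return output_dir / lang / f"{today.isoformat()}-{host}.csv"
    if output is None:
        raise ValueError("either output or output_dir must be given")
    # Single-file mode: every language maps to the same file.
    return output
```

Passing `today` and `host` explicitly keeps the helper deterministic, which makes the naming scheme easy to check without touching the filesystem.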

Other additions

  • translate_datasets.py: script used to generate translations via Ollama — contributors can extend to new languages
  • candidates.yaml: community tier (igorls classifier variants, heretic) and local tier
  • reports/2026-05-13-multilingual.md: full 210-run benchmark report

Test plan

  • --languages en,de,fr,hi,it,ko,ru --output /tmp/test.csv produces 7 rows per model/task, no errors
  • --output-dir /tmp/results creates en/, de/, ... subfolders each with a CSV
  • --num-ctx 8192 still works (Igor's flag, unchanged)
  • KO memory_extraction score is higher with labels.ko.jsonl present than without
  • Existing EN-only runs produce identical output (backward compat)

…nslated labels

Adds 6 new language datasets (German, French, Hindi, Italian, Korean, Russian)
across all 4 benchmark tasks (calibration, entity_extraction, memory_extraction,
room_classification) — 630 samples total, same conventions as the existing
pt-BR/es/zh datasets: inputs translated, labels/ground-truth stay English
except where noted.

Changes:
- 24 new dataset.{de,fr,hi,it,ko,ru}.jsonl files across all 4 tasks
- labels.ko.jsonl for memory_extraction: Korean ground-truth so the scorer
  compares Korean model output against Korean expected content instead of
  English (fixes ~20pp score gap identified during testing — see report)
- runner.py: loads labels.{lang}.jsonl when present, falls back to labels.jsonl
- orchestrator.py: adds --output-dir (writes <dir>/<lang>/YYYY-MM-DD-<host>.csv
  per language); --output single-file mode unchanged
- candidates.yaml: adds community tier (igorls classifier variants, heretic)
  and local tier (gemma4:e4b)
- translate_datasets.py: script used to generate the translations via Ollama;
  included so contributors can extend to new languages without manual work
- reports/2026-05-13-multilingual.md: 210-run benchmark report across
  6 models × 7 languages × 5 tasks on RTX 3080 Laptop 8 GB

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces multilingual support to the model evaluation benchmark suite, adding translated datasets for six languages and a new script to automate translations via LLM. The orchestrator has been updated to support per-language output directories. The code review identifies a high-severity issue in the orchestrator where multiple file handles are opened for the same path in single-file mode, which could lead to data corruption. Additional feedback includes moving imports to the module level for PEP 8 compliance and using safer dictionary access in the translation script to prevent potential KeyError exceptions.

Comment thread benchmarks/model_eval/orchestrator.py Outdated
Comment on lines +223 to +232
if p not in seen_paths:
    fh = open(p, "w", newline="", encoding="utf-8")
    w = csv.DictWriter(fh, fieldnames=CSV_COLUMNS)
    w.writeheader()
    fh.flush()
    seen_paths.add(p)
else:
    fh = open(p, "a", newline="", encoding="utf-8")
    w = csv.DictWriter(fh, fieldnames=CSV_COLUMNS)
lang_files[lang] = (fh, w, p)

Contributor

high

When --output is used (single file mode), this logic opens the same file multiple times in different modes ('w' for the first language, 'a' for subsequent ones). This results in multiple active file handles to the same path, which can lead to data corruption or interleaved writes due to buffering. It is better to reuse the same file handle for a given path.

Suggested change
- if p not in seen_paths:
-     fh = open(p, "w", newline="", encoding="utf-8")
-     w = csv.DictWriter(fh, fieldnames=CSV_COLUMNS)
-     w.writeheader()
-     fh.flush()
-     seen_paths.add(p)
- else:
-     fh = open(p, "a", newline="", encoding="utf-8")
-     w = csv.DictWriter(fh, fieldnames=CSV_COLUMNS)
- lang_files[lang] = (fh, w, p)
+ # Pre-create all output files (headers written before the first run starts).
+ lang_files: dict[str, tuple] = {}
+ path_to_fh: dict[Path, tuple] = {}
+ for lang in languages:
+     p = _csv_path(lang)
+     p.parent.mkdir(parents=True, exist_ok=True)
+     if p not in path_to_fh:
+         fh = open(p, "w", newline="", encoding="utf-8")
+         w = csv.DictWriter(fh, fieldnames=CSV_COLUMNS)
+         w.writeheader()
+         fh.flush()
+         path_to_fh[p] = (fh, w)
+     lang_files[lang] = (*path_to_fh[p], p)
+     print(f" {lang}{p}")

Comment thread benchmarks/model_eval/orchestrator.py Outdated
args.output.parent.mkdir(parents=True, exist_ok=True)
# Resolve output path(s). --output-dir writes one CSV per language so results
# from long multilingual runs stay grouped by locale and are easier to diff.
import socket

Contributor

medium

The import socket statement should be moved to the top of the file with other imports to follow PEP 8 guidelines and ensure consistency.

References
  1. Imports should always be put at the top of the file, just after any module comments and docstrings, and before module globals and constants. (link)

results = [None] * len(samples)

work = [
(i, s["id"], s.get(field, ""), language_name, model, endpoint)

Contributor

medium

Accessing s["id"] directly assumes that every sample in the dataset contains an 'id' field. If a sample is missing this key, the script will crash with a KeyError. Using .get() with a fallback is safer.

Suggested change
- (i, s["id"], s.get(field, ""), language_name, model, endpoint)
+ (i, s.get("id", f"idx_{i}"), s.get(field, ""), language_name, model, endpoint)

Copilot AI left a comment

Contributor

Pull request overview

Adds multilingual benchmark coverage and related harness improvements for evaluating model behavior across DE/FR/HI/IT/KO/RU datasets.

Changes:

  • Adds translated benchmark datasets across calibration, entity extraction, memory extraction, and room classification.
  • Adds --output-dir support for per-language CSV output.
  • Adds translated Korean memory labels, translation tooling, new candidate tiers, and a multilingual benchmark report.

Reviewed changes

Copilot reviewed 30 out of 30 changed files in this pull request and generated 9 comments.

File Description
benchmarks/model_eval/orchestrator.py Adds --output-dir and per-language CSV handling.
benchmarks/model_eval/runner.py Loads language-specific labels when available.
benchmarks/model_eval/translate_datasets.py Adds Ollama-based dataset translation utility.
benchmarks/model_eval/candidates.yaml Adds community/local candidate entries.
benchmarks/model_eval/reports/2026-05-13-multilingual.md Adds multilingual benchmark report.
benchmarks/model_eval/datasets/calibration/dataset.de.jsonl Adds German calibration dataset.
benchmarks/model_eval/datasets/calibration/dataset.fr.jsonl Adds French calibration dataset.
benchmarks/model_eval/datasets/calibration/dataset.hi.jsonl Adds Hindi calibration dataset.
benchmarks/model_eval/datasets/calibration/dataset.it.jsonl Adds Italian calibration dataset.
benchmarks/model_eval/datasets/calibration/dataset.ko.jsonl Adds Korean calibration dataset.
benchmarks/model_eval/datasets/calibration/dataset.ru.jsonl Adds Russian calibration dataset.
benchmarks/model_eval/datasets/entity_extraction/dataset.de.jsonl Adds German entity extraction dataset.
benchmarks/model_eval/datasets/entity_extraction/dataset.fr.jsonl Adds French entity extraction dataset.
benchmarks/model_eval/datasets/entity_extraction/dataset.hi.jsonl Adds Hindi entity extraction dataset.
benchmarks/model_eval/datasets/entity_extraction/dataset.it.jsonl Adds Italian entity extraction dataset.
benchmarks/model_eval/datasets/entity_extraction/dataset.ko.jsonl Adds Korean entity extraction dataset.
benchmarks/model_eval/datasets/entity_extraction/dataset.ru.jsonl Adds Russian entity extraction dataset.
benchmarks/model_eval/datasets/memory_extraction/dataset.de.jsonl Adds German memory extraction dataset.
benchmarks/model_eval/datasets/memory_extraction/dataset.fr.jsonl Adds French memory extraction dataset.
benchmarks/model_eval/datasets/memory_extraction/dataset.hi.jsonl Adds Hindi memory extraction dataset.
benchmarks/model_eval/datasets/memory_extraction/dataset.it.jsonl Adds Italian memory extraction dataset.
benchmarks/model_eval/datasets/memory_extraction/dataset.ko.jsonl Adds Korean memory extraction dataset.
benchmarks/model_eval/datasets/memory_extraction/dataset.ru.jsonl Adds Russian memory extraction dataset.
benchmarks/model_eval/datasets/memory_extraction/labels.ko.jsonl Adds Korean memory extraction labels.
benchmarks/model_eval/datasets/room_classification/dataset.de.jsonl Adds German room classification dataset.
benchmarks/model_eval/datasets/room_classification/dataset.fr.jsonl Adds French room classification dataset.
benchmarks/model_eval/datasets/room_classification/dataset.hi.jsonl Adds Hindi room classification dataset.
benchmarks/model_eval/datasets/room_classification/dataset.it.jsonl Adds Italian room classification dataset.
benchmarks/model_eval/datasets/room_classification/dataset.ko.jsonl Adds Korean room classification dataset.
benchmarks/model_eval/datasets/room_classification/dataset.ru.jsonl Adds Russian room classification dataset.
Comments suppressed due to low confidence (6)

benchmarks/model_eval/datasets/entity_extraction/dataset.hi.jsonl:25

  • This Hindi dataset row is still entirely English, so the hi entity-extraction benchmark includes untranslated input and can overstate Hindi performance for this sample.
{"id": "ent_025", "text": "Aria and the user had a long conversation about the Embedding Spaces visualization tool. The user, Saela, wants to add a temporal slider so she can replay how clusters formed over the indexing period. Sketched the API. Coordinating with Brennan Lyle on the rendering side — he has experience with WebGL from his time at Bridgewater Visualization."}

benchmarks/model_eval/datasets/entity_extraction/dataset.hi.jsonl:32

  • This Hindi dataset row is still entirely English, which contradicts the translated-input methodology and makes this hi sample non-comparable with the rest of the Hindi dataset.
{"id": "ent_032", "text": "Bramble visited the Hollowmounts Institute experimental garden. Their Native Species program is doing remarkable work with seed-saving for regional ecotypes. Met with their lead horticulturist, Iset Karadzic — same Iset as the nursery, she consults with the Institute as well. She gave me a packet of cardinal flower seed from the local provenance."}

benchmarks/model_eval/datasets/entity_extraction/dataset.hi.jsonl:38

  • This Hindi dataset row is still entirely English, so this sample does not exercise Hindi entity extraction and can skew the reported hi score.
{"id": "ent_038", "text": "Solas worked through a thread-safety audit for the Distributed Tracing client library. Found two race conditions that ThreadSanitizer hadn't caught — they only manifested at very high concurrency on the Crestmoor production stack. Mette Olafsen confirmed she could reproduce. Pushed fixes; both reviewed by Brennan Lyle."}

benchmarks/model_eval/datasets/room_classification/dataset.hi.jsonl:26

  • This Hindi room-classification row remains in English, which contradicts the translated-input methodology and makes this sample non-comparable with the rest of the Hindi dataset.
{"id": "rc_026", "agent": "Solas", "session_summary": "Wrote a small LLVM pass that hoists loop-invariant GEPs above their containing loops. The trick was getting the alias analysis to confirm the base pointer doesn't escape across the loop boundary. Tested on a few microbenchmarks; saw 4-12% speedups on the inner loops where it triggered.", "include_messy_features": false}

benchmarks/model_eval/datasets/room_classification/dataset.hi.jsonl:28

  • Most of this Hindi sample is still English prose. Even with include_messy_features, the natural-language input should be translated so the Hindi benchmark is not partially measuring English performance.
{"id": "rc_028", "agent": "Solas", "session_summary": "long debug — parser was inf-looping on certain inputs. left-recursive rule i'd added during refactoring. combinators dont handle left recursion natively (packrat+memoization does, but ours isnt packrat). restructured the affected rules into pratt-style operator precedence:\n```\nlet parse_expr min_prec =\n  let lhs = ref (parse_atom ()) in\n  while peek_prec () >= min_prec do ...\n```\nworks now. ALSO need to add a fuzzer corpus check before tagging the release.", "include_messy_features": true}

benchmarks/model_eval/datasets/memory_extraction/labels.ko.jsonl:15

  • This label no longer preserves the proper noun Doreth from the source/input text, so a correct Korean extraction that keeps Doreth can be scored against mismatched ground truth.
{"id": "mem_015", "memories": [{"type": "commitment", "content": "주말 종료 전에 독스에게 유전자 사과의 가지치기 일정을 이메일로 보내겠다."}]}

Comment on lines +313 to +320
# ── Local tier: models already pulled, no extra disk needed ──────────
- tag: gemma4:e4b
family: gemma4
size_b: 4.0
variant: instruct
quantization: default
expected_vram_mb: 4510
tier: local
{"id": "ent_017", "text": "Bramble ने Pollinator Paths के लिए Bridgewater Schools की बगीचा समिति के साथ परामर्श किया। वे केवल देशी पौधों का ही रोपण चाहते हैं लेकिन उनका बजट सख्त है। Wendelsea Native Plant Nursery के plug trays से शुरुआत करने की सलाह दी गई — ये gallon containers से सस्ते हैं और जल्दी जड़ पकड़ लेते हैं। उनके संस्थापक Iset Karadzic से बात की गई, जिन्होंने स्कूलों को अपनी गैर-लाभकारी दर देने के लिए सहमति दी।"}
{"id": "ent_018", "text": "Solas ने Pol Krisat के साथ मिलकर नई parser combinator library को Aerwyn Labs codebase में एकीकृत किया। एकीकरण साफ-सुथरा था, सिवाय एक जगह के जहाँ उनका custom error type हमारे Distributed Tracing context से टकरा रहा था। एक छोटे adapter trait को पेश करके इसे हल किया गया। आसान review के लिए एकीकरण को एक अलग PR में push किया गया।"}
{"id": "ent_019", "text": "Thresh Bridgewater Studio के नेतृत्व के साथ एक योजना बैठक में शामिल हुआ। उनके CFO Ivora Tinn कार्यकारी टीम के लिए एक नया Cashflow Models dashboard लॉन्च कर रही हैं। संचालन निदेशक Karis Tornau विभाग-वार drill-down चाहती हैं; Ivora केवल उच्च-स्तरीय सारांश चाहती हैं। समझौता: दो views, एक ही अंतर्निहित data।"}
{"id": "ent_020", "text": "Aria explored the relationship between attention entropy and reasoning faithfulness. The Hollowmounts Institute paper from last year claimed they were positively correlated; the Aerwyn Labs replication found the opposite. Drafted a synthesis that argues both papers are right but for different model scales. Need to verify this on the Crestmoor cluster before claiming anything."}
{"id": "rc_014", "agent": "Aria", "session_summary": "परिणाम तालिका के एक ड्राफ्ट की समीक्षा की। उपयोगकर्ता ने एक paired t-test चलाया था, लेकिन अंतर स्पष्ट रूप से bimodal हैं — छोटे सुधारों का एक समूह और बड़े सुधारों का एक समूह है। Wilcoxon signed-rank test पर स्विच करने और माध्यिका अंतर तथा bootstrap CI दोनों की रिपोर्टिंग करने की सिफारिश की।", "include_messy_features": false}
{"id": "rc_015", "agent": "Aria", "session_summary": "आते हुए दस्तावेज़ एम्बेडिंग्स पर स्ट्रीमिंग HDBSCAN के लिए कोड का मसौदा तैयार किया। तरकीब यह है कि न्यूनतम स्पैनिंग ट्री को क्रमिक रूप से बनाए रखना हो; हर बैच पर पूर्ण पुनः क्लस्टरिंग बहुत धीमी है। Crestmoor समूह का एक 2022 का पेपर मिला जिसमें सही एल्गोरिदम है। कल लागू करेंगे।", "include_messy_features": false}
{"id": "rc_016", "agent": "Aria", "session_summary": "tex compile error फिर से। \\usepackage{algorithm2e} का \\usepackage{algorithmic} से संघर्ष होता है जब दोनों acmart द्वारा लोड किए जाते हैं। समाधान: केवल algorithm2e लोड करें और \\SetAlgoNoLine का उपयोग करें। साथ ही bibliography style [smith2023] प्रविष्टि में अनुपस्थित 'doi' field की शिकायत कर रहा था — इसे दस्ती जोड़ दिया।", "include_messy_features": true}
{"id": "rc_017", "agent": "Aria", "session_summary": "Talked through whether log-uniform or beta priors are better for hyperparameter search in a small-data regime. The user was running a Bayesian optimization and getting weird convergence. The issue was their search space was huge (5 dims, broad bounds) and they only had 30 evals — that's just not enough budget. Suggested either narrowing the bounds or switching to random search to diagnose.", "include_messy_features": false}
Comment on lines +13 to +15
{"id": "mem_013", "memories": [{"type": "fact", "content": "웨인셀라 식물원의 올해 첫 피어나는 날짜는 장기 평균보다 8일 더 일찍 나타났다."}, {"type": "fact", "content": "웨인셀라 식물원의 첫 피어나는 기록은 18년 전부터 시작되었으며, 이는 기록상 가장 조기의 봄이다."}]}
{"id": "mem_014", "memories": [{"type": "opinion", "content": "언어적 편견(웨인스는 드루카르보다 더 부드러운 말을 한다)을 가상의 민족에 적용하면 페이지에서 단순화되고 인물들을 평평하게 만들 수 있다."}]}
{"id": "mem_015", "memories": [{"type": "commitment", "content": "주말 종료 전에 독스에게 유전자 사과의 가지치기 일정을 이메일로 보내겠다."}]}
Comment on lines +153 to +154
completed += 1
print(f" [{completed}/{len(samples)}] {sample_id}", flush=True)
Comment on lines +116 to +118
if attempt == 2:
print(f" ERROR on {sample_id}: {e}", file=sys.stderr, flush=True)
return idx, sample_id, text # fall back to English
Comment on lines +186 to +188
for task in tasks:
field = TASK_TEXT_FIELD[task]
src = args.dataset_dir / task / "dataset.jsonl"
Comment thread benchmarks/model_eval/orchestrator.py Outdated
Comment on lines +223 to +230
if p not in seen_paths:
    fh = open(p, "w", newline="", encoding="utf-8")
    w = csv.DictWriter(fh, fieldnames=CSV_COLUMNS)
    w.writeheader()
    fh.flush()
    seen_paths.add(p)
else:
    fh = open(p, "a", newline="", encoding="utf-8")
Comment thread benchmarks/model_eval/runner.py Outdated
)
samples = load_jsonl(dataset_path)
labels = load_jsonl(task_dir / "labels.jsonl")
labels_path = task_dir / (f"labels.{language}.jsonl" if language != "en" and (task_dir / f"labels.{language}.jsonl").exists() else "labels.jsonl")
@igorls

igorls commented May 14, 2026

Member

Overview

Stacks on feat/benchmark-multilingual (#1483). Adds 6 translated dataset languages (DE/FR/HI/IT/KO/RU), --output-dir for per-language CSV output, labels.ko.jsonl to fix the cross-language scoring bug on memory_extraction, a translate_datasets.py Ollama helper, new community/local candidate tiers, and a 210-run multilingual benchmark report.

The methodology fix (per-language labels) is well-motivated and the report is genuinely useful. Most concerns are localized.

Issues

High — --output single-file mode opens N handles to the same path

In orchestrator.py:

for lang in languages:
    p = _csv_path(lang)
    p.parent.mkdir(parents=True, exist_ok=True)
    if p not in seen_paths:
        fh = open(p, "w", ...)
        ...
        seen_paths.add(p)
    else:
        fh = open(p, "a", ...)   # ← second handle to the SAME file
    lang_files[lang] = (fh, w, p)

When --output one.csv is combined with multiple languages, _csv_path() returns the same Path for every language. The first iteration opens a "w" handle; subsequent iterations open additional "a" handles to the same file. All handles stay open simultaneously, each with its own buffer and offset. Interleaved writer.writerow(...) + fh.flush() from different handles can overwrite each other (the "w" handle's position advances independently of "a" appends). gemini-code-assist flagged this — it's a real correctness bug, not just style.

Fix: detect single-file mode once and share one (fh, writer) pair across all languages:

if args.output:
    fh = open(args.output, "w", newline="", encoding="utf-8")
    w = csv.DictWriter(fh, fieldnames=CSV_COLUMNS); w.writeheader(); fh.flush()
    lang_files = {lang: (fh, w, args.output) for lang in languages}
else:
    # per-language files as today (no path collisions possible)

The test plan item "Existing EN-only runs produce identical output" passes only because a single-language run never hits the collision — a multi-language --output run would.
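
The fix sketched above leaves the per-language branch elided. One way to cover both modes with a single code path is to key the open handles by path, so that languages resolving to the same file share one `(fh, writer)` pair. The following is a self-contained sketch under stated assumptions: `open_writers`, `close_writers`, and the column list are illustrative names, not the orchestrator's actual code.

```python
import csv
from pathlib import Path

# Assumed columns, for illustration only; the real CSV_COLUMNS lives in orchestrator.py.
CSV_COLUMNS = ["language", "model", "task", "score"]


def open_writers(paths_by_lang):
    """Open each distinct path once and map every language to its (fh, writer)."""
    by_path = {}
    lang_files = {}
    for lang, path in paths_by_lang.items():
        path = Path(path)
        if path not in by_path:
            path.parent.mkdir(parents=True, exist_ok=True)
            fh = open(path, "w", newline="", encoding="utf-8")
            writer = csv.DictWriter(fh, fieldnames=CSV_COLUMNS)
            writer.writeheader()
            fh.flush()
            by_path[path] = (fh, writer)
        # Languages that resolve to the same path reuse the same pair.
        lang_files[lang] = by_path[path]
    return lang_files


def close_writers(lang_files):
    """Close each underlying file once, even when several languages share it."""
    seen = set()
    for fh, _ in lang_files.values():
        if id(fh) not in seen:
            seen.add(id(fh))
            fh.close()
```

With one handle per path there is a single buffer and a single file offset, so rows from different languages are appended in the order they are written. Calling `close_writers` from a `finally` block keeps the cleanup in one place.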

Medium — Untranslated samples in dataset.hi.jsonl

ent_020, ent_025, ent_032, ent_038 in datasets/entity_extraction/dataset.hi.jsonl are still entirely English, and Copilot flagged rc_026, rc_028 in the Hindi room-classification set. These inflate the Hindi numbers (the model isn't actually processing Hindi for those rows). Worth a quick sweep — grep -L '[ऀ-ॿ]'-style — across all six new languages before the report's numbers get cited downstream. The Russian set is worth double-checking for the same reason (the report calls RU the lowest scorer).
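
The suggested sweep can also be scripted. The sketch below flags rows whose text contains almost no characters from the script expected for the language. It is an illustration under assumptions: the function names, the language-to-script mapping, and the 0.1 threshold are all hypothetical, and the approach only works for languages with a non-Latin script (HI, KO, RU); DE/FR/IT rows would need a different check.

```python
import json
import unicodedata

# Unicode name prefix of the script expected for each language (assumed mapping).
EXPECTED_SCRIPT = {"hi": "DEVANAGARI", "ko": "HANGUL", "ru": "CYRILLIC"}


def script_ratio(text: str, script: str) -> float:
    """Fraction of alphabetic characters whose Unicode name starts with `script`."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(1 for ch in letters if unicodedata.name(ch, "").startswith(script))
    return hits / len(letters)


def untranslated_ids(jsonl_lines, lang, field, threshold=0.1):
    """Return ids of rows whose `field` looks untranslated (or is empty)."""
    script = EXPECTED_SCRIPT[lang]
    flagged = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        if script_ratio(row.get(field, ""), script) < threshold:
            flagged.append(row["id"])
    return flagged
```

The threshold is deliberately low because translated rows keep proper nouns and technical terms in English, so a legitimate Hindi sample can still be mostly Latin letters.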

Medium — labels.ko.jsonl doesn't honor the proper-noun rule

The translation prompt and the input datasets are explicit that proper nouns stay English. The Korean labels don't follow the same rule:

  • labels.ko.jsonl:15 replaces Doreth with 독스 (transliteration) and translates "heritage apple" as 유전자 사과 ("genetic apple" — a mistranslation; should be 재래종/고대 사과).
  • Other lines transliterate names inconsistently: 세라 for Saela, 이보라 for Ivora, 렌 소란케 for Ren Solanke, 브램블 for Bramble, 폴 크리사트 for Pol Krisat, while keeping Solas/Ivora/Tartine Lab in English elsewhere.

Cosine scoring with nomic-embed-text is somewhat robust to this, but inconsistency in the ground truth still adds noise to exactly the metric this file was added to fix. Worth one pass to normalize names back to English before more contributors copy this format for DE/FR/IT/HI/RU labels.

Medium — runner.py label-resolution one-liner

labels_path = task_dir / (f"labels.{language}.jsonl" if language != "en" and (task_dir / f"labels.{language}.jsonl").exists() else "labels.jsonl")

Two readability nits and one real concern:

  1. .exists() is computed inside a conditional expression that builds the path string twice. Split it:
    candidate = task_dir / f"labels.{language}.jsonl"
    labels_path = candidate if language != "en" and candidate.exists() else task_dir / "labels.jsonl"
  2. Silent fallback. When DE/FR/HI/IT/RU runs hit memory_extraction, they fall back to labels.jsonl (English ground truth) without any log line. That's exactly the bug the PR was written to address, and the report (§ Memory Extraction) acknowledges the resulting score collapse. At minimum, print a one-line info: language=de labels=labels.jsonl (no labels.de.jsonl found) so future readers don't draw "the model is bad at German" from numbers that are actually methodology-bound.
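
Both points can be combined in a small helper. This is a sketch, assuming a function name (`resolve_labels_path`) and a log destination (stderr) that are not taken from runner.py; the message format follows the one proposed above.

```python
import sys
from pathlib import Path


def resolve_labels_path(task_dir: Path, language: str) -> Path:
    """Prefer labels.{language}.jsonl; fall back to labels.jsonl with a log line."""
    default = task_dir / "labels.jsonl"
    if language == "en":
        return default
    candidate = task_dir / f"labels.{language}.jsonl"
    if candidate.exists():
        return candidate
    # Make the fallback visible so low scores are not misread as model failure.
    print(
        f"info: language={language} labels={default.name} "
        f"(no {candidate.name} found)",
        file=sys.stderr,
    )
    return default
```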

Low

  • import socket inside main() in orchestrator — move to module top (gemini's comment, PEP 8).
  • Unused out_path in fh, writer, out_path = lang_files[language].
  • l as loop variable in langs = [l.strip() for l in args.languages.split(",") ...] in translate_datasets.py — ruff E741 (ambiguous with 1). Use lang or code.
  • _translate_one implicit-None return path is hard to read on inspection. The function always returns in practice (attempt 0/1 raise → continue; attempt 2 always returns either the success or the fallback), but a for/else or explicit return at the bottom would make that obvious.
  • Report file is bilingual PT/EN — the analytical paragraphs in reports/2026-05-13-multilingual.md are in Portuguese while tables and headers are English. Existing benchmark reports in the repo are English-only; worth aligning for searchability.
  • _SYSTEM_PROMPT hardcodes proper-noun lists. Reliable, but if datasets grow, the list will drift. A --proper-nouns-file argument (one name per line) would future-proof it without complicating the common case.
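
For the `_translate_one` point, an explicit return after the retry loop makes the fallback path visible. The sketch below is hypothetical: the function name, the `translate` callable, and the attempt count stand in for the script's real internals, and only the returned tuple shape and the fallback to the source text follow the snippet quoted in the review.

```python
import sys
from typing import Callable, Tuple


def translate_with_fallback(idx: int, sample_id: str, text: str,
                            translate: Callable[[str], str],
                            attempts: int = 3) -> Tuple[int, str, str]:
    """Try `translate` up to `attempts` times, then return the source text."""
    last_error = None
    for _ in range(attempts):
        try:
            return idx, sample_id, translate(text)
        except Exception as exc:  # broad on purpose in this sketch
            last_error = exc
    # Reached only when every attempt failed: report and fall back explicitly.
    print(f"  ERROR on {sample_id}: {last_error}", file=sys.stderr, flush=True)
    return idx, sample_id, text
```

Because the fallback returns untranslated text, a caller may want to collect the affected sample ids and report them at the end of the run.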

Style / convention

  • Translation script timeouts (30s connect / 120s read) and temperature=0.1 are sensible. Streaming approach is correct for cloud models with high TTFT.
  • --output-dir schema (<dir>/<lang>/YYYY-MM-DD-<host>.csv) matches existing single-file naming. Good.
  • The community / local tier additions to candidates.yaml are well-documented with notes.

Security / privacy

translate_datasets.py defaults to kimi-k2.6:cloud. Cloud Ollama tags send dataset prose to a remote endpoint. The benchmark inputs in this PR are synthetic, so it's fine here, but the script's docstring should call out "do not run this on real user data" — someone will repurpose it on their own palace eventually, and that would violate the local-first principle in CLAUDE.md.

Test coverage

No new unit tests, which is reasonable for benchmark-tooling changes. The PR's test plan is a runbook rather than automated tests. Worth at least one regression: a 1-line check that _csv_path("en") returns the right path under both --output and --output-dir — that would have caught the file-handle bug above.
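
A regression check along these lines could group languages by the path they resolve to, which shows directly how many file handles a run should open. This is a sketch under assumptions: `group_languages_by_path` is a hypothetical helper, and the lambdas stand in for the orchestrator's `_csv_path`, whose real signature is not shown in this conversation.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def group_languages_by_path(csv_path: Callable[[str], Path],
                            languages: Iterable[str]) -> Dict[Path, List[str]]:
    """Map each output path to the languages that write to it."""
    groups: Dict[Path, List[str]] = {}
    for lang in languages:
        groups.setdefault(csv_path(lang), []).append(lang)
    return groups


def test_single_file_mode_maps_all_languages_to_one_path():
    # Stand-in for _csv_path under --output: same file for every language.
    single = lambda lang: Path("/tmp/test.csv")
    groups = group_languages_by_path(single, ["en", "de", "ko"])
    assert groups == {Path("/tmp/test.csv"): ["en", "de", "ko"]}


def test_output_dir_mode_gives_one_path_per_language():
    # Stand-in for _csv_path under --output-dir: one file per language.
    per_lang = lambda lang: Path("/tmp/results") / lang / "2026-05-13-host.csv"
    groups = group_languages_by_path(per_lang, ["en", "de", "ko"])
    assert len(groups) == 3
```

Any group with more than one language marks a path where the handle has to be shared rather than reopened.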

Summary

Solid contribution; the methodology insight (extract-in-input-language → need translated labels) is the kind of finding that justifies the whole PR. Blockers: the orchestrator single-file-mode bug and the untranslated Hindi samples. The rest can be follow-ups, including extending labels.{lang}.jsonl to the remaining five languages (already called out as future work).

…untranslated samples, KO labels

Addresses the review feedback from igorls, gemini-code-assist, and Copilot.

HIGH:
- orchestrator: --output single-file mode now shares ONE (fh, writer) across
  all languages instead of opening N handles to the same path. The old code
  caused interleaved buffer corruption: first language opened "w", subsequent
  ones opened "a", and writes from independent file offsets could overwrite
  each other. Verified with a multi-language --output smoke test (4 rows
  written, all distinct).
- 19 untranslated/empty samples re-translated:
  - dataset.de.jsonl: cal_017
  - dataset.hi.jsonl entity_extraction: ent_020, ent_025, ent_032, ent_038
  - dataset.hi.jsonl room_classification: rc_017, rc_026, rc_028, rc_040,
    rc_064, rc_089, rc_091
  - dataset.ko.jsonl room_classification: rc_027, rc_067
  - dataset.it.jsonl room_classification: rc_029, rc_030, rc_031, rc_032,
    rc_053 (previously empty strings)
- labels.ko.jsonl: restored all proper nouns to English (Doreth, Saela, Ivora,
  Ren Solanke, Pol Krisat, Pell Halloran, Bramble, Hollowmounts Institute,
  Wendelsea, Bridgewater Community Garden, Wends, Drukar, Aerwyn cycle,
  Jaccard, Mason bee, Markdown). Also fixed mistranslation 유전자 사과
  (genetic apple) → 재래종 사과 (heirloom apple).

MEDIUM:
- runner.py: refactored label-resolution one-liner into 3 readable lines
  and added an info log when falling back to English ground truth, so
  readers don't misread "score collapse" as model failure.

LOW:
- orchestrator: moved `import socket` to module top (PEP 8); removed
  unused `out_path` from the unpacking tuple.
- translate_datasets.py: renamed loop variable `l` → `code` (ruff E741);
  made the _translate_one fallback return path explicit instead of relying
  on for-loop fall-through; added a privacy warning in the docstring
  flagging that the default `kimi-k2.6:cloud` sends prose to a remote
  endpoint and should not be used over real palace data.
- 2026-05-13-multilingual.md: converted analytical paragraphs from
  Portuguese to English to match the existing repo convention.
@lealbrunocalhau

Contributor Author

Review fixes pushed ✓

All feedback from your review has been addressed:

orchestrator.py

  • Fixed critical file-handle bug in --output mode (was opening multiple handles to same file with independent buffers)
  • Now shares single (fh, writer) pair across all languages in single-file mode
  • Added explicit deduplication in finally block to prevent double-close

runner.py

  • Refactored label resolution from one-liner to readable 3-line version
  • Added explicit fallback logging when no language-specific labels exist
  • Log: "info: language=de labels=labels.jsonl (no labels.de.jsonl found — scoring against English ground truth)"

datasets

  • Fixed 19 untranslated samples across 4 languages (DE, HI, IT, KO)
  • All samples re-translated via Ollama with proper noun preservation rules

labels.ko.jsonl

  • Restored all transliterated proper nouns to English (30+ entries)
  • Fixed mistranslation: 유전자 사과 (genetic apple) → 재래종 사과 (heirloom apple)

Code style

  • Moved import socket to module top (PEP 8)
  • Fixed ruff E741 (ambiguous variable name)
  • Made _translate_one return path explicit

reports

  • Converted 2026-05-13-multilingual.md to English-only (the analytical paragraphs were in Portuguese)
  • Updated methodology note explaining Korean labels fix

Ready for next review pass.

@igorls igorls merged commit 6873426 into MemPalace:feat/benchmark-multilingual May 14, 2026