feat(benchmarks): add DE/FR/HI/IT/KO/RU datasets + --output-dir + translated labels #1503
Adds 6 new language datasets (German, French, Hindi, Italian, Korean, Russian)
across all 4 benchmark tasks (calibration, entity_extraction, memory_extraction,
room_classification) — 630 samples total, same conventions as the existing
pt-BR/es/zh datasets: inputs translated, labels/ground-truth stay English
except where noted.
Changes:
- 24 new dataset.{de,fr,hi,it,ko,ru}.jsonl files across all 4 tasks
- labels.ko.jsonl for memory_extraction: Korean ground-truth so the scorer
compares Korean model output against Korean expected content instead of
English (fixes ~20pp score gap identified during testing — see report)
- runner.py: loads labels.{lang}.jsonl when present, falls back to labels.jsonl
- orchestrator.py: adds --output-dir (writes <dir>/<lang>/YYYY-MM-DD-<host>.csv
per language); --output single-file mode unchanged
- candidates.yaml: adds community tier (igorls classifier variants, heretic)
and local tier (gemma4:e4b)
- translate_datasets.py: script used to generate the translations via Ollama;
included so contributors can extend to new languages without manual work
- reports/2026-05-13-multilingual.md: 210-run benchmark report across
6 models × 7 languages × 5 tasks on RTX 3080 Laptop 8 GB
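The per-language layout that `--output-dir` produces can be sketched as below. This is a minimal illustration, not the code in `orchestrator.py`: the helper name `csv_path_for` and its `day`/`host` parameters are hypothetical, and the only thing taken from the description above is the `<dir>/<lang>/YYYY-MM-DD-<host>.csv` pattern.

```python
from __future__ import annotations

import socket
from datetime import date
from pathlib import Path


def csv_path_for(
    output_dir: Path,
    lang: str,
    day: date | None = None,
    host: str | None = None,
) -> Path:
    """Build <dir>/<lang>/YYYY-MM-DD-<host>.csv (hypothetical helper)."""
    # Default to today's date and the local hostname when not given explicitly.
    day = day or date.today()
    host = host or socket.gethostname()
    return output_dir / lang / f"{day.isoformat()}-{host}.csv"


if __name__ == "__main__":
    for lang in ("en", "de", "ko"):
        print(csv_path_for(Path("results"), lang))
```

Passing `day` and `host` explicitly keeps the helper deterministic, which makes the path layout easy to check in a test.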
Code Review
This pull request introduces multilingual support to the model evaluation benchmark suite, adding translated datasets for six languages and a new script to automate translations via LLM. The orchestrator has been updated to support per-language output directories. The code review identifies a high-severity issue in the orchestrator where multiple file handles are opened for the same path in single-file mode, which could lead to data corruption. Additional feedback includes moving imports to the module level for PEP 8 compliance and using safer dictionary access in the translation script to prevent potential KeyError exceptions.
| if p not in seen_paths: | ||
| fh = open(p, "w", newline="", encoding="utf-8") | ||
| w = csv.DictWriter(fh, fieldnames=CSV_COLUMNS) | ||
| w.writeheader() | ||
| fh.flush() | ||
| seen_paths.add(p) | ||
| else: | ||
| fh = open(p, "a", newline="", encoding="utf-8") | ||
| w = csv.DictWriter(fh, fieldnames=CSV_COLUMNS) | ||
| lang_files[lang] = (fh, w, p) |
When --output is used (single file mode), this logic opens the same file multiple times in different modes ('w' for the first language, 'a' for subsequent ones). This results in multiple active file handles to the same path, which can lead to data corruption or interleaved writes due to buffering. It is better to reuse the same file handle for a given path.
| if p not in seen_paths: | |
| fh = open(p, "w", newline="", encoding="utf-8") | |
| w = csv.DictWriter(fh, fieldnames=CSV_COLUMNS) | |
| w.writeheader() | |
| fh.flush() | |
| seen_paths.add(p) | |
| else: | |
| fh = open(p, "a", newline="", encoding="utf-8") | |
| w = csv.DictWriter(fh, fieldnames=CSV_COLUMNS) | |
| lang_files[lang] = (fh, w, p) | |
| # Pre-create all output files (headers written before the first run starts). | |
| lang_files: dict[str, tuple] = {} | |
| path_to_fh: dict[Path, tuple] = {} | |
| for lang in languages: | |
| p = _csv_path(lang) | |
| p.parent.mkdir(parents=True, exist_ok=True) | |
| if p not in path_to_fh: | |
| fh = open(p, "w", newline="", encoding="utf-8") | |
| w = csv.DictWriter(fh, fieldnames=CSV_COLUMNS) | |
| w.writeheader() | |
| fh.flush() | |
| path_to_fh[p] = (fh, w) | |
| lang_files[lang] = (*path_to_fh[p], p) | |
| print(f" {lang} → {p}") |
| args.output.parent.mkdir(parents=True, exist_ok=True) | ||
| # Resolve output path(s). --output-dir writes one CSV per language so results | ||
| # from long multilingual runs stay grouped by locale and are easier to diff. | ||
| import socket |
The import socket statement should be moved to the top of the file with other imports to follow PEP 8 guidelines and ensure consistency.
References
- Imports should always be put at the top of the file, just after any module comments and docstrings, and before module globals and constants. (PEP 8)
| results = [None] * len(samples) | ||
|
|
||
| work = [ | ||
| (i, s["id"], s.get(field, ""), language_name, model, endpoint) |
Accessing s["id"] directly assumes that every sample in the dataset contains an 'id' field. If a sample is missing this key, the script will crash with a KeyError. Using .get() with a fallback is safer.
| (i, s["id"], s.get(field, ""), language_name, model, endpoint) | |
| (i, s.get("id", f"idx_{i}"), s.get(field, ""), language_name, model, endpoint) |
Pull request overview
Adds multilingual benchmark coverage and related harness improvements for evaluating model behavior across DE/FR/HI/IT/KO/RU datasets.
Changes:
- Adds translated benchmark datasets across calibration, entity extraction, memory extraction, and room classification.
- Adds `--output-dir` support for per-language CSV output.
- Adds translated Korean memory labels, translation tooling, new candidate tiers, and a multilingual benchmark report.
Reviewed changes
Copilot reviewed 30 out of 30 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| `benchmarks/model_eval/orchestrator.py` | Adds --output-dir and per-language CSV handling. |
| `benchmarks/model_eval/runner.py` | Loads language-specific labels when available. |
| `benchmarks/model_eval/translate_datasets.py` | Adds Ollama-based dataset translation utility. |
| `benchmarks/model_eval/candidates.yaml` | Adds community/local candidate entries. |
| `benchmarks/model_eval/reports/2026-05-13-multilingual.md` | Adds multilingual benchmark report. |
| `benchmarks/model_eval/datasets/calibration/dataset.de.jsonl` | Adds German calibration dataset. |
| `benchmarks/model_eval/datasets/calibration/dataset.fr.jsonl` | Adds French calibration dataset. |
| `benchmarks/model_eval/datasets/calibration/dataset.hi.jsonl` | Adds Hindi calibration dataset. |
| `benchmarks/model_eval/datasets/calibration/dataset.it.jsonl` | Adds Italian calibration dataset. |
| `benchmarks/model_eval/datasets/calibration/dataset.ko.jsonl` | Adds Korean calibration dataset. |
| `benchmarks/model_eval/datasets/calibration/dataset.ru.jsonl` | Adds Russian calibration dataset. |
| `benchmarks/model_eval/datasets/entity_extraction/dataset.de.jsonl` | Adds German entity extraction dataset. |
| `benchmarks/model_eval/datasets/entity_extraction/dataset.fr.jsonl` | Adds French entity extraction dataset. |
| `benchmarks/model_eval/datasets/entity_extraction/dataset.hi.jsonl` | Adds Hindi entity extraction dataset. |
| `benchmarks/model_eval/datasets/entity_extraction/dataset.it.jsonl` | Adds Italian entity extraction dataset. |
| `benchmarks/model_eval/datasets/entity_extraction/dataset.ko.jsonl` | Adds Korean entity extraction dataset. |
| `benchmarks/model_eval/datasets/entity_extraction/dataset.ru.jsonl` | Adds Russian entity extraction dataset. |
| `benchmarks/model_eval/datasets/memory_extraction/dataset.de.jsonl` | Adds German memory extraction dataset. |
| `benchmarks/model_eval/datasets/memory_extraction/dataset.fr.jsonl` | Adds French memory extraction dataset. |
| `benchmarks/model_eval/datasets/memory_extraction/dataset.hi.jsonl` | Adds Hindi memory extraction dataset. |
| `benchmarks/model_eval/datasets/memory_extraction/dataset.it.jsonl` | Adds Italian memory extraction dataset. |
| `benchmarks/model_eval/datasets/memory_extraction/dataset.ko.jsonl` | Adds Korean memory extraction dataset. |
| `benchmarks/model_eval/datasets/memory_extraction/dataset.ru.jsonl` | Adds Russian memory extraction dataset. |
| `benchmarks/model_eval/datasets/memory_extraction/labels.ko.jsonl` | Adds Korean memory extraction labels. |
| `benchmarks/model_eval/datasets/room_classification/dataset.de.jsonl` | Adds German room classification dataset. |
| `benchmarks/model_eval/datasets/room_classification/dataset.fr.jsonl` | Adds French room classification dataset. |
| `benchmarks/model_eval/datasets/room_classification/dataset.hi.jsonl` | Adds Hindi room classification dataset. |
| `benchmarks/model_eval/datasets/room_classification/dataset.it.jsonl` | Adds Italian room classification dataset. |
| `benchmarks/model_eval/datasets/room_classification/dataset.ko.jsonl` | Adds Korean room classification dataset. |
| `benchmarks/model_eval/datasets/room_classification/dataset.ru.jsonl` | Adds Russian room classification dataset. |
Comments suppressed due to low confidence (6)
benchmarks/model_eval/datasets/entity_extraction/dataset.hi.jsonl:25
- This Hindi dataset row is still entirely English, so the `hi` entity-extraction benchmark includes untranslated input and can overstate Hindi performance for this sample.
{"id": "ent_025", "text": "Aria and the user had a long conversation about the Embedding Spaces visualization tool. The user, Saela, wants to add a temporal slider so she can replay how clusters formed over the indexing period. Sketched the API. Coordinating with Brennan Lyle on the rendering side — he has experience with WebGL from his time at Bridgewater Visualization."}
benchmarks/model_eval/datasets/entity_extraction/dataset.hi.jsonl:32
- This Hindi dataset row is still entirely English, which contradicts the translated-input methodology and makes this `hi` sample non-comparable with the rest of the Hindi dataset.
{"id": "ent_032", "text": "Bramble visited the Hollowmounts Institute experimental garden. Their Native Species program is doing remarkable work with seed-saving for regional ecotypes. Met with their lead horticulturist, Iset Karadzic — same Iset as the nursery, she consults with the Institute as well. She gave me a packet of cardinal flower seed from the local provenance."}
benchmarks/model_eval/datasets/entity_extraction/dataset.hi.jsonl:38
- This Hindi dataset row is still entirely English, so this sample does not exercise Hindi entity extraction and can skew the reported `hi` score.
{"id": "ent_038", "text": "Solas worked through a thread-safety audit for the Distributed Tracing client library. Found two race conditions that ThreadSanitizer hadn't caught — they only manifested at very high concurrency on the Crestmoor production stack. Mette Olafsen confirmed she could reproduce. Pushed fixes; both reviewed by Brennan Lyle."}
benchmarks/model_eval/datasets/room_classification/dataset.hi.jsonl:26
- This Hindi room-classification row remains in English, which contradicts the translated-input methodology and makes this sample non-comparable with the rest of the Hindi dataset.
{"id": "rc_026", "agent": "Solas", "session_summary": "Wrote a small LLVM pass that hoists loop-invariant GEPs above their containing loops. The trick was getting the alias analysis to confirm the base pointer doesn't escape across the loop boundary. Tested on a few microbenchmarks; saw 4-12% speedups on the inner loops where it triggered.", "include_messy_features": false}
benchmarks/model_eval/datasets/room_classification/dataset.hi.jsonl:28
- Most of this Hindi sample is still English prose. Even with `include_messy_features`, the natural-language input should be translated so the Hindi benchmark is not partially measuring English performance.
{"id": "rc_028", "agent": "Solas", "session_summary": "long debug — parser was inf-looping on certain inputs. left-recursive rule i'd added during refactoring. combinators dont handle left recursion natively (packrat+memoization does, but ours isnt packrat). restructured the affected rules into pratt-style operator precedence:\n```\nlet parse_expr min_prec =\n let lhs = ref (parse_atom ()) in\n while peek_prec () >= min_prec do ...\n```\nworks now. ALSO need to add a fuzzer corpus check before tagging the release.", "include_messy_features": true}
benchmarks/model_eval/datasets/memory_extraction/labels.ko.jsonl:15
- This label no longer preserves the proper noun `Doreth` from the source/input text, so a correct Korean extraction that keeps `Doreth` can be scored against mismatched ground truth.
{"id": "mem_015", "memories": [{"type": "commitment", "content": "주말 종료 전에 독스에게 유전자 사과의 가지치기 일정을 이메일로 보내겠다."}]}
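Untranslated rows like the ones flagged above can be caught mechanically for the non-Latin target languages. The following is a rough heuristic sketch, not part of the PR: the names `SCRIPT_RANGES`, `looks_untranslated`, and `untranslated_ids` are hypothetical. It flags a row when its text contains no character from the script expected for the language, which also catches empty strings. It cannot detect untranslated German, French, or Italian rows, since those share the Latin script with English, and it will not catch rows that are only mostly English.

```python
from __future__ import annotations

import json

# Inclusive Unicode code point ranges for the non-Latin target scripts.
SCRIPT_RANGES = {
    "hi": (0x0900, 0x097F),  # Devanagari
    "ko": (0xAC00, 0xD7A3),  # Hangul syllables
    "ru": (0x0400, 0x04FF),  # Cyrillic
}


def looks_untranslated(text: str, lang: str) -> bool:
    """True if `text` has no character from the script expected for `lang`."""
    lo, hi = SCRIPT_RANGES[lang]
    return not any(lo <= ord(ch) <= hi for ch in text)


def untranslated_ids(jsonl_lines, lang: str, field: str) -> list[str]:
    """Return ids of rows whose `field` contains no target-script characters."""
    flagged = []
    for line in jsonl_lines:
        if not line.strip():
            continue  # skip blank lines in the JSONL file
        row = json.loads(line)
        if looks_untranslated(row.get(field, ""), lang):
            flagged.append(row.get("id", "?"))
    return flagged
```

For the Hindi entity-extraction file this would be called with `field="text"`, and for room classification with `field="session_summary"`, matching the keys visible in the rows quoted above.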
| # ── Local tier: models already pulled, no extra disk needed ────────── | ||
| - tag: gemma4:e4b | ||
| family: gemma4 | ||
| size_b: 4.0 | ||
| variant: instruct | ||
| quantization: default | ||
| expected_vram_mb: 4510 | ||
| tier: local |
| {"id": "ent_017", "text": "Bramble ने Pollinator Paths के लिए Bridgewater Schools की बगीचा समिति के साथ परामर्श किया। वे केवल देशी पौधों का ही रोपण चाहते हैं लेकिन उनका बजट सख्त है। Wendelsea Native Plant Nursery के plug trays से शुरुआत करने की सलाह दी गई — ये gallon containers से सस्ते हैं और जल्दी जड़ पकड़ लेते हैं। उनके संस्थापक Iset Karadzic से बात की गई, जिन्होंने स्कूलों को अपनी गैर-लाभकारी दर देने के लिए सहमति दी।"} | ||
| {"id": "ent_018", "text": "Solas ने Pol Krisat के साथ मिलकर नई parser combinator library को Aerwyn Labs codebase में एकीकृत किया। एकीकरण साफ-सुथरा था, सिवाय एक जगह के जहाँ उनका custom error type हमारे Distributed Tracing context से टकरा रहा था। एक छोटे adapter trait को पेश करके इसे हल किया गया। आसान review के लिए एकीकरण को एक अलग PR में push किया गया।"} | ||
| {"id": "ent_019", "text": "Thresh Bridgewater Studio के नेतृत्व के साथ एक योजना बैठक में शामिल हुआ। उनके CFO Ivora Tinn कार्यकारी टीम के लिए एक नया Cashflow Models dashboard लॉन्च कर रही हैं। संचालन निदेशक Karis Tornau विभाग-वार drill-down चाहती हैं; Ivora केवल उच्च-स्तरीय सारांश चाहती हैं। समझौता: दो views, एक ही अंतर्निहित data।"} | ||
| {"id": "ent_020", "text": "Aria explored the relationship between attention entropy and reasoning faithfulness. The Hollowmounts Institute paper from last year claimed they were positively correlated; the Aerwyn Labs replication found the opposite. Drafted a synthesis that argues both papers are right but for different model scales. Need to verify this on the Crestmoor cluster before claiming anything."} |
| {"id": "rc_014", "agent": "Aria", "session_summary": "परिणाम तालिका के एक ड्राफ्ट की समीक्षा की। उपयोगकर्ता ने एक paired t-test चलाया था, लेकिन अंतर स्पष्ट रूप से bimodal हैं — छोटे सुधारों का एक समूह और बड़े सुधारों का एक समूह है। Wilcoxon signed-rank test पर स्विच करने और माध्यिका अंतर तथा bootstrap CI दोनों की रिपोर्टिंग करने की सिफारिश की।", "include_messy_features": false} | ||
| {"id": "rc_015", "agent": "Aria", "session_summary": "आते हुए दस्तावेज़ एम्बेडिंग्स पर स्ट्रीमिंग HDBSCAN के लिए कोड का मसौदा तैयार किया। तरकीब यह है कि न्यूनतम स्पैनिंग ट्री को क्रमिक रूप से बनाए रखना हो; हर बैच पर पूर्ण पुनः क्लस्टरिंग बहुत धीमी है। Crestmoor समूह का एक 2022 का पेपर मिला जिसमें सही एल्गोरिदम है। कल लागू करेंगे।", "include_messy_features": false} | ||
| {"id": "rc_016", "agent": "Aria", "session_summary": "tex compile error फिर से। \\usepackage{algorithm2e} का \\usepackage{algorithmic} से संघर्ष होता है जब दोनों acmart द्वारा लोड किए जाते हैं। समाधान: केवल algorithm2e लोड करें और \\SetAlgoNoLine का उपयोग करें। साथ ही bibliography style [smith2023] प्रविष्टि में अनुपस्थित 'doi' field की शिकायत कर रहा था — इसे दस्ती जोड़ दिया।", "include_messy_features": true} | ||
| {"id": "rc_017", "agent": "Aria", "session_summary": "Talked through whether log-uniform or beta priors are better for hyperparameter search in a small-data regime. The user was running a Bayesian optimization and getting weird convergence. The issue was their search space was huge (5 dims, broad bounds) and they only had 30 evals — that's just not enough budget. Suggested either narrowing the bounds or switching to random search to diagnose.", "include_messy_features": false} |
| {"id": "mem_013", "memories": [{"type": "fact", "content": "웨인셀라 식물원의 올해 첫 피어나는 날짜는 장기 평균보다 8일 더 일찍 나타났다."}, {"type": "fact", "content": "웨인셀라 식물원의 첫 피어나는 기록은 18년 전부터 시작되었으며, 이는 기록상 가장 조기의 봄이다."}]} | ||
| {"id": "mem_014", "memories": [{"type": "opinion", "content": "언어적 편견(웨인스는 드루카르보다 더 부드러운 말을 한다)을 가상의 민족에 적용하면 페이지에서 단순화되고 인물들을 평평하게 만들 수 있다."}]} | ||
| {"id": "mem_015", "memories": [{"type": "commitment", "content": "주말 종료 전에 독스에게 유전자 사과의 가지치기 일정을 이메일로 보내겠다."}]} |
| completed += 1 | ||
| print(f" [{completed}/{len(samples)}] {sample_id}", flush=True) |
| if attempt == 2: | ||
| print(f" ERROR on {sample_id}: {e}", file=sys.stderr, flush=True) | ||
| return idx, sample_id, text # fall back to English |
| for task in tasks: | ||
| field = TASK_TEXT_FIELD[task] | ||
| src = args.dataset_dir / task / "dataset.jsonl" |
| if p not in seen_paths: | ||
| fh = open(p, "w", newline="", encoding="utf-8") | ||
| w = csv.DictWriter(fh, fieldnames=CSV_COLUMNS) | ||
| w.writeheader() | ||
| fh.flush() | ||
| seen_paths.add(p) | ||
| else: | ||
| fh = open(p, "a", newline="", encoding="utf-8") |
| ) | ||
| samples = load_jsonl(dataset_path) | ||
| labels = load_jsonl(task_dir / "labels.jsonl") | ||
| labels_path = task_dir / (f"labels.{language}.jsonl" if language != "en" and (task_dir / f"labels.{language}.jsonl").exists() else "labels.jsonl") |
Overview
Stacks on
The methodology fix (per-language labels) is well-motivated and the report is genuinely useful. Most concerns are localized.
Issues
High —
…untranslated samples, KO labels
Addresses the review feedback from igorls, gemini-code-assist, and Copilot.
HIGH:
- orchestrator: --output single-file mode now shares ONE (fh, writer) across
all languages instead of opening N handles to the same path. The old code
caused interleaved buffer corruption: first language opened "w", subsequent
ones opened "a", and writes from independent file offsets could overwrite
each other. Verified with a multi-language --output smoke test (4 rows
written, all distinct).
- 19 untranslated/empty samples re-translated:
- dataset.de.jsonl: cal_017
- dataset.hi.jsonl entity_extraction: ent_020, ent_025, ent_032, ent_038
- dataset.hi.jsonl room_classification: rc_017, rc_026, rc_028, rc_040,
rc_064, rc_089, rc_091
- dataset.ko.jsonl room_classification: rc_027, rc_067
- dataset.it.jsonl room_classification: rc_029, rc_030, rc_031, rc_032,
rc_053 (previously empty strings)
- labels.ko.jsonl: restored all proper nouns to English (Doreth, Saela, Ivora,
Ren Solanke, Pol Krisat, Pell Halloran, Bramble, Hollowmounts Institute,
Wendelsea, Bridgewater Community Garden, Wends, Drukar, Aerwyn cycle,
Jaccard, Mason bee, Markdown). Also fixed mistranslation 유전자 사과
(genetic apple) → 재래종 사과 (heirloom apple).
MEDIUM:
- runner.py: refactored label-resolution one-liner into 3 readable lines
and added an info log when falling back to English ground truth, so
readers don't misread "score collapse" as model failure.
LOW:
- orchestrator: moved `import socket` to module top (PEP 8); removed
unused `out_path` from the unpacking tuple.
- translate_datasets.py: renamed loop variable `l` → `code` (ruff E741);
made the _translate_one fallback return path explicit instead of relying
on for-loop fall-through; added a privacy warning in the docstring
flagging that the default `kimi-k2.6:cloud` sends prose to a remote
endpoint and should not be used over real palace data.
- 2026-05-13-multilingual.md: converted analytical paragraphs from
Portuguese to English to match the existing repo convention.
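The label-resolution refactor described under MEDIUM might look roughly like the sketch below. It is an illustration under assumptions, not the code that was pushed: the function name `resolve_labels_path` and the log wording are hypothetical, while the file naming (`labels.{language}.jsonl` with a fallback to `labels.jsonl`) follows the one-liner shown in the `runner.py` diff.

```python
from __future__ import annotations

import logging
from pathlib import Path

log = logging.getLogger(__name__)


def resolve_labels_path(task_dir: Path, language: str) -> Path:
    """Prefer labels.<language>.jsonl; otherwise fall back to labels.jsonl."""
    localized = task_dir / f"labels.{language}.jsonl"
    if language != "en" and localized.exists():
        return localized
    if language != "en":
        # Make the fallback visible so a low score is not misread as model failure.
        log.info(
            "no %s found for %r; scoring against English ground truth",
            localized.name,
            language,
        )
    return task_dir / "labels.jsonl"
```

Keeping the resolution in one small function makes the fallback easy to test in isolation from dataset loading.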
Review fixes pushed ✓
All feedback from your review has been addressed:
- orchestrator.py
- runner.py
- datasets
- labels.ko.jsonl
- Code style
- reports

Ready for next review pass.
Summary
Builds on top of #1483 with 6 additional languages and a few methodology fixes discovered during testing.
New datasets — 630 samples
Same conventions as pt-BR/es/zh: inputs translated, proper nouns and system labels (entity types, room slugs, memory types) stay English.
labels.ko.jsonl for memory_extraction
During a 210-run matrix (6 models × 7 languages × 5 tasks) we noticed `memory_extraction` scores collapsed ~0.5 pp for non-EN. Investigation showed models were correctly extracting memories in the input language, but we were scoring against English ground truth → artificial score drop.

Added `labels.ko.jsonl` with Korean ground-truth content as a starting point. `runner.py` now loads `labels.{lang}.jsonl` when present, falling back to `labels.jsonl`. Quick test: KO score went from 0.40 → 0.60 after the fix (EN baseline 1.0). The remaining gap is real model difficulty, not methodology noise.

The other languages still use English labels for now; they are contributed as-is so the datasets are available for testing. Generating all translated labels is a follow-up.
--output-dir
Alternative to `--output` for multilingual matrix runs. Writes `<dir>/<lang>/YYYY-MM-DD-<host>.csv` per language instead of one flat file. The existing `--output` single-file mode is unchanged and `--num-ctx`, `--llm-provider`, `--embed-endpoint` all work exactly as before.
Other additions
- `translate_datasets.py`: script used to generate translations via Ollama; contributors can extend to new languages
- `candidates.yaml`: `community` tier (igorls classifier variants, heretic) and `local` tier
- `reports/2026-05-13-multilingual.md`: full 210-run benchmark report

Test plan
- `--languages en,de,fr,hi,it,ko,ru --output /tmp/test.csv` produces 7 rows per model/task, no errors
- `--output-dir /tmp/results` creates `en/`, `de/`, ... subfolders each with a CSV
- `--num-ctx 8192` still works (Igor's flag, unchanged)
- `memory_extraction` score is higher with `labels.ko.jsonl` present than without
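The first test-plan item (7 rows per model/task in single-file mode) could be checked with a small script like the one below. This is a sketch under assumptions: the column names `model` and `task` are guesses, since `CSV_COLUMNS` is not shown here, and the helper names are hypothetical.

```python
from __future__ import annotations

import csv
from collections import Counter


def rows_per_model_task(csv_file) -> Counter:
    """Count CSV rows per (model, task) pair; column names are assumed."""
    reader = csv.DictReader(csv_file)
    return Counter((row["model"], row["task"]) for row in reader)


def incomplete_cells(counts: Counter, expected: int) -> dict:
    """Return the (model, task) pairs whose row count differs from `expected`."""
    return {key: n for key, n in counts.items() if n != expected}


if __name__ == "__main__":
    # /tmp/test.csv is the path used in the test plan above.
    with open("/tmp/test.csv", newline="", encoding="utf-8") as fh:
        bad = incomplete_cells(rows_per_model_task(fh), expected=7)
    print("all cells complete" if not bad else f"incomplete: {bad}")
```

A check like this would also surface rows lost to the shared-handle bug, because overwritten rows reduce the per-pair count.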