Migrate multilingual parallel-sentences scripts to hf datasets by omkar-334 · Pull Request #3784 · huggingface/sentence-transformers

omkar-334 · 2026-05-24T13:04:35Z

The three get_parallel_data_*.py prep scripts downloaded raw parallel corpora (sbert.net, tatoeba.org, fbaipublicfiles WikiMatrix) and wrote per-language-pair tsv files. The actual training script make_multilingual.py already loads the maintained Hugging Face datasets directly, so these prep scripts were the only remaining raw downloads. This rewrites them to source from the same HF datasets, producing the same tsv outputs.

Changes

Each script now iterates over sentence-transformers/parallel-sentences-{talks,wikimatrix,tatoeba}, one language-pair config at a time (en-de, en-es, ...), and writes the stripped english/non_english columns (skipping empty pairs) to the same tsv filenames as before. The WikiMatrix dataset is already filtered by LASER similarity score, and tatoeba's 3-letter output filenames are mapped to the dataset's 2-letter config names.
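For illustration, the per-config export described above could look roughly like the sketch below. It is not the code from this PR: the helper names (`write_pairs`, `export_config`, `config_name`), the contents of the code-mapping table, and the plain-text output path are assumptions made for the example. Only the dataset names and the english/non_english columns come from the description.

```python
import io
from typing import Iterable, Mapping, Optional, TextIO

# ISO 639-3 -> ISO 639-1 codes for the six target languages.
# Assumption: the real scripts may use a different or larger table.
TATOEBA_TO_CONFIG = {
    "deu": "de", "spa": "es", "fra": "fr",
    "ita": "it", "ara": "ar", "tur": "tr",
}


def config_name(tatoeba_code: str) -> str:
    """Map a 3-letter Tatoeba code to a 2-letter config name such as 'en-de'."""
    return f"en-{TATOEBA_TO_CONFIG[tatoeba_code]}"


def write_pairs(rows: Iterable[Mapping[str, Optional[str]]], out: TextIO) -> int:
    """Write stripped english/non_english pairs as TSV lines.

    Rows where either side is missing or empty after stripping are skipped.
    Returns the number of pairs actually written.
    """
    written = 0
    for row in rows:
        english = (row.get("english") or "").strip()
        non_english = (row.get("non_english") or "").strip()
        if not english or not non_english:
            continue
        out.write(f"{english}\t{non_english}\n")
        written += 1
    return written


def export_config(dataset_name: str, config: str, split: str, path: str) -> int:
    """Load one language-pair config from the Hub and write it to `path`.

    Hypothetical wrapper: needs the `datasets` package and network access,
    so the import is done lazily and the function is not called here.
    """
    from datasets import load_dataset

    dataset = load_dataset(dataset_name, config, split=split)
    with open(path, "w", encoding="utf-8") as handle:
        return write_pairs(dataset, handle)


# Offline demonstration with in-memory rows instead of a Hub download.
demo_rows = [
    {"english": " Hello ", "non_english": "Hallo"},
    {"english": "", "non_english": "leer"},        # skipped: empty english side
    {"english": "Thanks", "non_english": None},    # skipped: missing translation
]
demo_buffer = io.StringIO()
demo_count = write_pairs(demo_rows, demo_buffer)
```

Keeping the writer separate from the download makes the strip-and-skip logic easy to check without network access. A call such as `export_config("sentence-transformers/parallel-sentences-talks", config_name("deu"), "train", "talks-en-de-train.tsv")` would then produce one output file, where the split name and the filename are placeholders.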

Testing the changes

End-to-end: ran the old (raw download) and new (HF) versions of each script, then compared the combined train+dev pairs per language as a set (an md5 of the sorted pairs, which ignores the dev/train split-boundary difference).
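The set comparison can be sketched as follows. This is a minimal reconstruction under assumptions, not the exact check that was run: the function name `pair_set_md5` and the tab/newline serialization fed to the hash are choices made for the example.

```python
import hashlib
from typing import Iterable, Tuple

Pair = Tuple[str, str]


def pair_set_md5(*splits: Iterable[Pair]) -> str:
    """Return an md5 over the sorted set of stripped pairs from all given splits.

    Merging the splits into one set first means the result does not depend on
    which pair landed in train and which in dev, nor on row order.
    """
    unique = {
        (english.strip(), other.strip())
        for split in splits
        for english, other in split
    }
    digest = hashlib.md5()
    for english, other in sorted(unique):
        digest.update(f"{english}\t{other}\n".encode("utf-8"))
    return digest.hexdigest()


# Same three pairs, partitioned differently between train and dev.
old_train = [("Thanks", "Danke"), ("Yes", "Ja")]
old_dev = [("Hello", "Hallo")]
new_train = [("Hello", "Hallo"), ("Thanks", "Danke")]
new_dev = [("Yes ", "Ja")]  # trailing space is removed by the stripping step

old_fingerprint = pair_set_md5(old_train, old_dev)
new_fingerprint = pair_set_md5(new_train, new_dev)
fingerprints_match = old_fingerprint == new_fingerprint
```

Because the pairs are deduplicated and sorted before hashing, two exports agree on the fingerprint exactly when they contain the same set of stripped pairs.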

talks: byte-identical for all 6 target languages (en-de/es/fr/it/ar/tr) once the values are stripped, matching the old script's normalization. Same unique-pair count and same md5 per language.
wikimatrix: 100% of sampled raw pairs (above the 1.075 score threshold) are present in the HF dataset.
tatoeba: same source; a fresh live download has a few extra newer pairs (Tatoeba is updated continuously), so the HF snapshot is a subset of today's live data, not the other way around.
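The wikimatrix and tatoeba checks are containment checks rather than equality checks. A rough sketch of both is given below; the function names, the `(score, english, other)` row layout, and the strict `>` comparison against the threshold are assumptions for the example and are not taken from the scripts.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str]
ScoredPair = Tuple[float, str, str]


def normalize(pairs: Iterable[Pair]) -> Set[Pair]:
    """Strip both sides of every pair and collect them into a set."""
    return {(english.strip(), other.strip()) for english, other in pairs}


def above_threshold(rows: Iterable[ScoredPair], threshold: float = 1.075) -> Set[Pair]:
    """Keep raw pairs whose score exceeds the threshold (strict '>' is assumed)."""
    return normalize((english, other) for score, english, other in rows if score > threshold)


def coverage(sample: Set[Pair], reference: Iterable[Pair]) -> float:
    """Fraction of sampled pairs that also occur in the reference data."""
    if not sample:
        return 1.0
    return len(sample & normalize(reference)) / len(sample)


def is_subset(snapshot: Iterable[Pair], live: Iterable[Pair]) -> bool:
    """True if every pair in the snapshot also appears in the live data."""
    return normalize(snapshot) <= normalize(live)


# Toy data standing in for a raw WikiMatrix sample and the HF dataset.
raw_sample = [
    (1.20, "Hello", "Hallo"),
    (1.10, "Thanks ", "Danke"),
    (1.01, "noisy", "pair"),  # below the threshold, so not expected in HF
]
hf_pairs = [("Hello", "Hallo"), ("Thanks", "Danke"), ("Yes", "Ja")]
sample_coverage = coverage(above_threshold(raw_sample), hf_pairs)

# Toy data standing in for the HF Tatoeba snapshot and a fresh live download.
hf_snapshot = [("Hello", "Hallo")]
live_download = [("Hello", "Hallo"), ("Newer", "Neuer")]
snapshot_is_subset = is_subset(hf_snapshot, live_download)
```

A coverage of 1.0 corresponds to the "100% of sampled raw pairs" result, and `is_subset` returning true in one direction only corresponds to the tatoeba observation.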

Note: a direct comparison of the split files will show differences, because the new scripts use the HF train/dev split (dev ~991) rather than the old "first 1000 = dev" boundary. The data is the same; only the train/dev partition differs.

tomaarsen · 2026-06-02T10:20:24Z

Hello!

Thanks for this, looks good! Merging this now, I think the test failures are due to rate limits, which should be fine.

Tom Aarsen

omkar-334 and others added 2 commits May 24, 2026 18:32

Migrate multilingual parallel-sentences scripts to hf datasets

5dae441

Print real pair counts

aff42b8

tomaarsen merged commit af7acbf into huggingface:main Jun 2, 2026
16 of 17 checks passed

tomaarsen merged 2 commits into huggingface:main from omkar-334:migrate-parallel-sentences-to-hf-datasets