Migrate multilingual parallel-sentences scripts to hf datasets #3784

Merged
tomaarsen merged 2 commits into huggingface:main from omkar-334:migrate-parallel-sentences-to-hf-datasets
Jun 2, 2026
Conversation

@omkar-334

Contributor

The three get_parallel_data_*.py prep scripts downloaded raw parallel corpora (sbert.net, tatoeba.org, fbaipublicfiles WikiMatrix) and wrote per-language-pair tsv files. The actual training script make_multilingual.py already loads the maintained Hugging Face datasets directly, so these prep scripts were the only remaining raw downloads. This rewrites them to source from the same HF datasets, producing the same tsv outputs.

Changes

Each script now iterates sentence-transformers/parallel-sentences-{talks,wikimatrix,tatoeba} per language-pair config (en-de, en-es, ...) and writes the stripped english/non_english columns (skipping empty pairs) to the same tsv filenames. WikiMatrix is already filtered by LASER similarity score, and tatoeba's 3-letter output filenames are mapped to the dataset's 2-letter config names.
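A minimal sketch of the per-config export described above, not the merged code. The `english`/`non_english` column names and the dataset names come from this description; the helper names, the excerpt of the tatoeba code mapping, and the assumption that sentences contain no embedded tabs or newlines are illustrative only.

```python
from typing import Iterable, Mapping

# Hypothetical excerpt of the tatoeba mapping: 3-letter codes used in the
# output filenames -> 2-letter config suffixes used by the HF dataset.
TATOEBA_TO_CONFIG = {"deu": "de", "spa": "es", "fra": "fr"}


def write_pairs_tsv(rows: Iterable[Mapping[str, str]], out_path: str) -> int:
    """Write stripped english/non_english pairs as TSV; return rows written."""
    written = 0
    with open(out_path, "w", encoding="utf-8", newline="") as f:
        for row in rows:
            en = row["english"].strip()
            other = row["non_english"].strip()
            if not en or not other:
                continue  # skip pairs where either side is empty
            # Assumes neither sentence contains a tab or newline.
            f.write(f"{en}\t{other}\n")
            written += 1
    return written


def export_config(dataset_name: str, config: str, split: str, out_path: str) -> int:
    """Load one language-pair config from the Hub and export it (needs network)."""
    from datasets import load_dataset  # imported lazily so the helper above has no dependency

    rows = load_dataset(dataset_name, config, split=split)
    return write_pairs_tsv(rows, out_path)
```

For example, `export_config("sentence-transformers/parallel-sentences-talks", "en-de", "train", "talks-en-de-train.tsv")` would export one pair; the output filename here is a placeholder, not the name the scripts use.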

Testing the changes

End-to-end: ran the old (raw download) and new (HF) versions of each script, then compared the combined train+dev pairs per language as a set (an md5 of the sorted pairs, which ignores the dev/train split-boundary difference).

  • talks: byte-identical for all 6 target languages (en-de/es/fr/it/ar/tr) once the values are
    stripped, matching the old script's normalization. Same unique-pair count and same md5 per
    language.
  • wikimatrix: 100% of sampled raw pairs (those scoring above the 1.075 threshold) are present in the HF dataset.
  • tatoeba: same source; a fresh live download has a few extra newer pairs (Tatoeba is updated
    continuously), so the HF snapshot is a subset of today's live data, not the other way around.
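The wikimatrix and tatoeba checks above are subset checks rather than equality checks. A rough sketch of that kind of check, assuming raw rows arrive as `(score, english, non_english)` tuples and that "above" means a strict comparison; both function names are hypothetical.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]


def pairs_above_threshold(
    raw_rows: Iterable[Tuple[float, str, str]], threshold: float
) -> List[Pair]:
    """Keep (english, non_english) for rows scoring strictly above threshold."""
    return [(en.strip(), other.strip()) for score, en, other in raw_rows if score > threshold]


def coverage(sample: Iterable[Pair], reference: Iterable[Pair]) -> float:
    """Fraction of sampled pairs that also appear in the reference pairs."""
    sample = list(sample)
    if not sample:
        raise ValueError("sample is empty")
    reference_set = set(reference)
    hits = sum(1 for pair in sample if pair in reference_set)
    return hits / len(sample)
```

A coverage of 1.0 with the raw sample as `sample` and the HF pairs as `reference` corresponds to the wikimatrix result; swapping the arguments gives the direction reported for tatoeba.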

Note: a direct comparison of the split files will show differences, because the new scripts use the HF train/dev split (dev ~991) rather than the old "first 1000 = dev" boundary. The data is the same; only the train/dev partition differs.
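For reference, a sketch of a split-independent comparison along the lines of the md5 check used in testing. The serialization of each pair (tab-separated, newline-terminated, UTF-8) is an assumption, and `fingerprint` is a hypothetical name.

```python
import hashlib
from typing import Iterable, Tuple

Pair = Tuple[str, str]


def fingerprint(train_pairs: Iterable[Pair], dev_pairs: Iterable[Pair]) -> Tuple[int, str]:
    """Return (unique pair count, md5 hex digest) over the combined, sorted pairs."""
    # Merging into a set removes duplicates and any dependence on where the
    # train/dev boundary falls; sorting makes the digest order-independent.
    unique = sorted(set(train_pairs) | set(dev_pairs))
    digest = hashlib.md5()
    for en, other in unique:
        digest.update(f"{en}\t{other}\n".encode("utf-8"))
    return len(unique), digest.hexdigest()
```

Two exports with the same pairs but a different train/dev partition produce the same count and digest, which is why this comparison is unaffected by the boundary change.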

@tomaarsen

Member

Hello!

Thanks for this, looks good! Merging this now, I think the test failures are due to rate limits, which should be fine.

- Tom Aarsen

@tomaarsen tomaarsen merged commit af7acbf into huggingface:main Jun 2, 2026
16 of 17 checks passed