Migrate multilingual parallel-sentences scripts to hf datasets #3784
Merged
tomaarsen merged 2 commits intoJun 2, 2026
Hello! Thanks for this, looks good! Merging this now, I think the test failures are due to rate limits, which should be fine.
The three get_parallel_data_*.py prep scripts downloaded raw parallel corpora (sbert.net, tatoeba.org, fbaipublicfiles WikiMatrix) and wrote per-language-pair tsv files. The actual training script make_multilingual.py already loads the maintained Hugging Face datasets directly, so these prep scripts were the only remaining raw downloads. This rewrites them to source from the same HF datasets, producing the same tsv outputs.

Changes
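A minimal sketch of the rewritten flow, not the code from this PR: load one language-pair config, strip the english and non_english columns, skip empty pairs, and write a tsv. The helper names (write_pairs_tsv, export_config), the output path, and the excerpt of the 3-letter to 2-letter code mapping are hypothetical; only datasets.load_dataset and the dataset and column names come from the description.

```python
from pathlib import Path
from typing import Iterable, Mapping, Optional

# Hypothetical excerpt of a 3-letter (tatoeba filename) to 2-letter (HF config) mapping.
TATOEBA_TO_CONFIG = {"deu": "de", "spa": "es", "fra": "fr"}


def write_pairs_tsv(rows: Iterable[Mapping[str, Optional[str]]], out_path: Path) -> int:
    """Write stripped (english, non_english) pairs as tab-separated lines.

    Rows where either side is missing or empty after stripping are skipped.
    Returns the number of pairs written.
    """
    out_path.parent.mkdir(parents=True, exist_ok=True)
    written = 0
    with out_path.open("w", encoding="utf-8", newline="") as f:
        for row in rows:
            en = (row.get("english") or "").strip()
            other = (row.get("non_english") or "").strip()
            if not en or not other:
                continue
            f.write(f"{en}\t{other}\n")
            written += 1
    return written


def export_config(dataset_name: str, config: str, split: str, out_path: Path) -> int:
    """Load one language-pair config from the Hugging Face Hub and export it."""
    # Imported lazily so write_pairs_tsv stays usable without the `datasets` package.
    from datasets import load_dataset

    rows = load_dataset(dataset_name, config, split=split)
    return write_pairs_tsv(rows, out_path)
```

Under these assumptions, a call such as export_config("sentence-transformers/parallel-sentences-talks", "en-de", "train", Path("out/en-de-train.tsv")) would write one split of one language pair; the split name and path here are illustrative and the real scripts keep their existing filenames.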
Each script now iterates sentence-transformers/parallel-sentences-{talks,wikimatrix,tatoeba} per language-pair config (en-de, en-es, ...) and writes the stripped english/non_english columns (skipping empty pairs) to the same tsv filenames. WikiMatrix is already filtered by LASER similarity score, and tatoeba's 3-letter output filenames are mapped to the dataset's 2-letter config names.

Testing the changes
End-to-end: ran the old (raw download) and new (HF) versions of each script, then compared the combined train+dev pairs per language as a set (an md5 of the sorted pairs, which ignores the dev/train split-boundary difference).
stripped, matching the old script's normalization. Same unique-pair count and same md5 per language.
continuously), so the HF snapshot is a subset of today's live data, not the other way around.
Note: a direct comparison of the split files will show differences, because the new scripts use the HF train/dev split (dev ~991) rather than the old "first 1000 = dev" boundary. The data is the same; only the train/dev partition differs.
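The set-based comparison described above can be sketched roughly as follows. This is an illustration, not the script used for the PR: the function names are hypothetical, and it assumes uncompressed two-column tsv files. Because train and dev are pooled into one set before hashing, moving a pair from one split to the other does not change the digest.

```python
import hashlib
from pathlib import Path
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str]


def read_pairs(paths: Iterable[Path]) -> Set[Pair]:
    """Pool the (english, non_english) pairs from several tsv files into one set."""
    pairs: Set[Pair] = set()
    for path in paths:
        with Path(path).open(encoding="utf-8") as f:
            for line in f:
                cols = line.rstrip("\n").split("\t")
                if len(cols) != 2:
                    continue  # ignore malformed lines
                en, other = cols[0].strip(), cols[1].strip()
                if en and other:
                    pairs.add((en, other))
    return pairs


def pairs_digest(pairs: Iterable[Pair]) -> str:
    """md5 over the sorted unique pairs, so file order and split boundaries do not matter."""
    h = hashlib.md5()
    for en, other in sorted(set(pairs)):
        h.update(f"{en}\t{other}\n".encode("utf-8"))
    return h.hexdigest()
```

For one language, the old and new outputs would count as matching when len(read_pairs(old_files)) equals len(read_pairs(new_files)) and the two pairs_digest values are identical.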