Extend MTEB with French datasets #218
Masakhane dataset and french script for classification
* add Opusparcus dataset * multilingual usage * use eval_split of config files * change eval_split according to data --------- Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr>
HAL S2S dataset creation and evaluation on clustering task.
* Add DiaBLa dataset for bitext mining * Add DiaBLa dataset for bitext mining * deduplicate bitext task * add Flores * format files * add flores to evaluation script * remove prints * add revision --------- Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr>
…pusparcuspc Inherit OpusparcusPC init from MultilingualTask
remove train split from evaluation
* put script on HF dataset repos * remove scripts
* add trust remote code arg * leave corpus as dict * remove trust remote code
add bucc and tatoeba bitextmining tasks
I think we are good with the changes, and the results are here: https://huggingface.co/datasets/mteb/results/discussions/28
Amazing! I think this one would still be nice: #218 (comment), if that's okay with you. Other than that, we can merge both!
* add other language to clustering tasks * fix main score and S2S task * update run fr benchmark script * Update run_mteb_french.py * Update AbsTaskClustering.py * remove train and validation splits
@Muennighoff we made the changes for #218 (comment), can you please have a look at
Yes, let's merge! 💯
Still need to update MasakhaNEWS files in https://huggingface.co/datasets/mteb/results/discussions/28 I think & then I will add a French leaderboard tab if you want! 🙌 Should I just link to your GitHub accounts / this PR for the
I think the leaderboard(s) would consist of the following:
TASK_LIST_CLASSIFICATION_FR = [
"AmazonReviewsClassification (fr)",
"MasakhaNEWSClassification (fra)",
"MassiveIntentClassification (fr)",
"MassiveScenarioClassification (fr)",
]
TASK_LIST_CLUSTERING_FR = [
"AlloProfClusteringP2P",
"AlloProfClusteringS2S",
"HALClusteringS2S",
"MLSUMClusteringP2P",
"MLSUMClusteringS2S",
"MasakhaNEWSClusteringP2P (fra)",
"MasakhaNEWSClusteringS2S (fra)",
]
TASK_LIST_PAIR_CLASSIFICATION_FR = [
"OpusparcusPC (fr)",
"PawsX (fr)",
]
TASK_LIST_RERANKING_FR = [
"AlloprofReranking",
"SyntecReranking",
]
TASK_LIST_RETRIEVAL_FR = [
"AlloprofRetrieval",
"BSARDRetrieval",
"MintakaRetrieval",
"MultiLongDocRetrieval",
"SyntecRetrieval",
"XPQARetrieval",
]
TASK_LIST_STS_FR = [
"STS22 (fr)",
"STSBenchmarkMultilingualSTS (fr)",
"SICKFr",
]
TASK_LIST_SUMMARIZATION_FR = ["SummEvalFr"]
It's a total of 25, so also enough for an overall leaderboard tab imo. Also, I added a few other French tasks like xPQA etc. that are not from this PR. It would be great if you could eval on them too 🙌
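As a quick check on the count quoted above, the seven lists can be grouped and tallied. This is only a minimal sketch: the list names and task names are copied from the comment, while `TASK_LISTS_FR` and `count_tasks` are hypothetical names introduced here for illustration and are not part of MTEB.

```python
# Task lists copied from the comment above.
TASK_LIST_CLASSIFICATION_FR = [
    "AmazonReviewsClassification (fr)",
    "MasakhaNEWSClassification (fra)",
    "MassiveIntentClassification (fr)",
    "MassiveScenarioClassification (fr)",
]
TASK_LIST_CLUSTERING_FR = [
    "AlloProfClusteringP2P",
    "AlloProfClusteringS2S",
    "HALClusteringS2S",
    "MLSUMClusteringP2P",
    "MLSUMClusteringS2S",
    "MasakhaNEWSClusteringP2P (fra)",
    "MasakhaNEWSClusteringS2S (fra)",
]
TASK_LIST_PAIR_CLASSIFICATION_FR = ["OpusparcusPC (fr)", "PawsX (fr)"]
TASK_LIST_RERANKING_FR = ["AlloprofReranking", "SyntecReranking"]
TASK_LIST_RETRIEVAL_FR = [
    "AlloprofRetrieval",
    "BSARDRetrieval",
    "MintakaRetrieval",
    "MultiLongDocRetrieval",
    "SyntecRetrieval",
    "XPQARetrieval",
]
TASK_LIST_STS_FR = ["STS22 (fr)", "STSBenchmarkMultilingualSTS (fr)", "SICKFr"]
TASK_LIST_SUMMARIZATION_FR = ["SummEvalFr"]

# Hypothetical grouping by task type (illustration only).
TASK_LISTS_FR = {
    "Classification": TASK_LIST_CLASSIFICATION_FR,
    "Clustering": TASK_LIST_CLUSTERING_FR,
    "PairClassification": TASK_LIST_PAIR_CLASSIFICATION_FR,
    "Reranking": TASK_LIST_RERANKING_FR,
    "Retrieval": TASK_LIST_RETRIEVAL_FR,
    "STS": TASK_LIST_STS_FR,
    "Summarization": TASK_LIST_SUMMARIZATION_FR,
}


def count_tasks(task_lists):
    """Return (number of tasks per type, total number of tasks)."""
    per_type = {name: len(tasks) for name, tasks in task_lists.items()}
    return per_type, sum(per_type.values())


per_type, total = count_tasks(TASK_LISTS_FR)
print(per_type)
print(total)
```

Grouping by task type keeps the per-type counts visible, which makes it easy to see how the total changes when tasks are added to a list.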
We will re-run the evaluations on MasakhaNEWS, we will also run evals on the datasets you added. The PR will be updated soon. For the Concerning the list of tasks for the leaderboard, I think you can add these two tasks in
Amazing!
Cool! Happy to help if I can be useful! I will put your GitHubs in the meantime 👍
Good catch - 27 datasets then!
@Muennighoff do you need model characteristics like their size, embedding dim and max token length for the leaderboard? We have already gathered them in these two files (CSV or JSON, you choose what suits you most):
Do you also want to run the French split of
I have preliminarily added the French leaderboard: https://huggingface.co/spaces/mteb/leaderboard 🇫🇷🚀 Edit: A few models still missing, updating....
That's so great, thanks! 🚀 Just one small thing: could you please add @schmarion in the credits section? We'll check Edit: yes, some models and a dataset (SummEvalFr) are missing
Thanks a lot! So good to finally see the results there 🤩. Could you please relink my name in the credits? I changed my account name recently and the link is broken... @MathieuCiancone
Hi guys! I just want to congratulate everybody who has worked on this! You made my day!
I looked into the dataset and it is actually quite huge (not even talking about the English subset 😅). Even for text embedding APIs (which tend to be cheap), embedding the whole corpus would lead to significant API costs (including the OpenAI 3 embedding model, Mistral, Voyage and Cohere). I think we will save this for smaller datasets, as a greater variety of datasets is more insightful than bigger datasets... 🤔
So for French, it's 20 queries & 10K docs w/ 9.6K avg characters per doc. I think it's comparable to TREC-COVID which is 171K docs but only ~900 avg characters per doc. TREC-COVID is one of the cheapest English Retrieval datasets. Also for many models that cannot handle the long sequence length of 9.6K characters it'll be cheaper. But ofc up to you, no worries if you don't want to include it 👍
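The comparison above comes down to simple arithmetic on the total number of characters to embed. A rough sketch, using only the approximate figures quoted in the comment (queries are ignored, truncation by models with shorter context windows is not modelled, and `total_chars` is a hypothetical helper introduced for illustration):

```python
def total_chars(num_docs: int, avg_chars_per_doc: int) -> int:
    """Approximate corpus size in characters."""
    return num_docs * avg_chars_per_doc


# Figures as quoted in the comment above (approximate).
french_subset = total_chars(10_000, 9_600)  # French subset under discussion
trec_covid = total_chars(171_000, 900)      # TREC-COVID

ratio = french_subset / trec_covid
print(f"French subset: {french_subset:,} characters")
print(f"TREC-COVID:    {trec_covid:,} characters")
print(f"Ratio:         {ratio:.2f}")
```

Under these figures the French subset comes to roughly 96M characters against roughly 154M for TREC-COVID, which is consistent with calling the two comparable in embedding cost.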
Adding: