rename `validation` split to `test` for MindSmallReranking by NouamaneTazi · Pull Request #36 · embeddings-benchmark/mteb

NouamaneTazi · 2022-08-03T23:23:02Z

Fixes #28 along with https://huggingface.co/datasets/mteb/mind_small/discussions/1
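As a rough illustration of what this rename amounts to on the task side, here is a minimal sketch. It assumes the task's metadata is a plain dictionary with an `eval_splits` list; the key names and the `rename_eval_split` helper are hypothetical and are not taken from the actual diff.

```python
# Hypothetical, simplified stand-in for a task's metadata. The key names
# ("name", "hf_hub_name", "eval_splits") are assumptions for illustration;
# they are not copied from the PR's diff.
def rename_eval_split(description, old, new):
    """Return a copy of `description` with split `old` renamed to `new`."""
    updated = dict(description)
    updated["eval_splits"] = [
        new if split == old else split for split in description["eval_splits"]
    ]
    return updated


before = {
    "name": "MindSmallReranking",
    "hf_hub_name": "mteb/mind_small",
    "eval_splits": ["validation"],
}

after = rename_eval_split(before, "validation", "test")
print(after["eval_splits"])
```

The helper returns a new dictionary instead of mutating its input, so the old and new configurations can be compared side by side. The dataset on the Hub has to expose a split with the same name, which is presumably why the PR is paired with the linked dataset discussion.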

* add Masakhane dataset config * add trigram lang code for dataset who use it * create french script eval * fix French word * add some documentation * add script to process and upload alloprof on HF * build script for HF * adding dataset processing for mteb * add script to process and upload alloprof on HF * build script for HF * adding dataset processing for mteb * refactor few thing * remove whitespaces * 4 pair classification (#10) * add Opusparcus dataset * multilingual usage * use eval_split of config files * change eval_split according to data --------- Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr> * add script to process and upload alloprof on HF * build script for HF * adding dataset processing for mteb * refactor few thing * remove whitespaces * Clustering with HAL S2S dataset (#11) HAL S2S dataset creation and evaluation on clustering task. * adding BSARD dataset * add BSARD to benchmark * adding Hagrid dataset * DiaBLa and Flores Bitext Mining evaluation (#12) * Add DiaBLa dataset for bitext mining * Add DiaBLa dataset for bitext mining * deduplicate bitext task * add Flores * format files * add flores to evaluation script * remove prints * add revision --------- Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr> * add script to process and upload alloprof on HF * build script for HF * adding dataset processing for mteb * refactor few thing * remove whitespaces * adding dataset processing for mteb * adding BSARD dataset * add BSARD to benchmark * adding Hagrid dataset * fix change on langmapping * reset alphabetical order * add revision handling * Clustering: Add AlloProf dataset (#17) AlloProf dataset for clustering task * handling of revision * change split + add revision handling * add script to process and upload alloprof on HF * build script for HF * adding dataset processing for mteb * refactor few thing * remove whitespaces * adding dataset processing for mteb * adding BSARD dataset * add BSARD to benchmark * adding Hagrid 
dataset * add script to process and upload alloprof on HF * adding dataset processing for mteb * refactor few thing * reset alphabetical order * add revision handling * handling of revision * change split + add revision handling * use eval variable * alphabetic order * Add MLSUM dataset for clustering task (#21) * Use Masakhane dataset for clustering task (#23) * 16 add datasets to readmemd (#18) * run task table * run task table * Add MLSUM dataset for clustering task (#21) * Use Masakhane dataset for clustering task (#23) * run task table * refresh readme * refresh readme * run task table * refresh readme --------- Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr> Co-authored-by: Marion Schaeffer <92590517+schmarion@users.noreply.github.com> * load only test split (#25) Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr> * Update mteb/tasks/BitextMining/DiaBLaBitextMining.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/tasks/Clustering/HALClusteringS2S.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * renaming masakhane (#28) Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr> * Syntec dataset addition (#26) * add scrpit to process & load to HF * add script to enable download of data from HF * add syntec dataset files to gitignore * add syntecretrieval * add syntec retrival * build dataloading script * remove datasets * correct typo --------- Co-authored-by: Sequeira Gabriel <gabriel.sequeira@outlook.fr> * 30 add syntec reranking (#31) * change name to secify retrieval * add reranking tasks * create script to upload dataset fo reranking task * create reranking task * add reranking tasks * add model name in description * SummEval translated to french (#32) * 7 sts (#33) * taike into account multilingual tasks * add stsbenchmark multilingual dataset * add STS tasks * taike into account multilingual tasks * add stsbenchmark multilingual dataset * add STS tasks * add coma * Adding sick fr dataset to 
sts tasks (#34) * Adding sick fr dataset to sts tasks * modifying dataset in load function to have the right column names * Fix alloprof dataset (#36) * change revision to use * remove duplicate data * change main metric because dataset is hard (#37) * Fix alloprof dataset (#40) * change revision to use * remove duplicate data * change revision * handle queries train test split * change dataset creation method * change revision * handle queries train test split * change dataset creation method * Fix DiaBLa by inheriting CrossLingual class (#42) * Fix DiaBLa by inheriting CrossLingual class * remove remaining print * Fix DiaBLa integration * Update mteb/tasks/BitextMining/FloresBitextMining.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update README.md Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update README.md Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/tasks/Classification/MasakhaNEWSClassification.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update README.md Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update README.md * Update mteb/tasks/BitextMining/FloresBitextMining.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/evaluation/MTEB.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/abstasks/AbsTaskPairClassification.py Co-authored-by: Imene Kerboua <33312980+imenelydiaker@users.noreply.github.com> * Update README.md * Update scripts/data/syntec/create_data_reranking.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update scripts/data/alloprof/create_data_reranking.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update scripts/run_mteb_french.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update scripts/run_mteb_french.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/evaluation/MTEB.py Co-authored-by: Niklas 
Muennighoff <n.muennighoff@gmail.com> * Update mteb/evaluation/MTEB.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/tasks/Retrieval/HagridRetrieval.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/tasks/Clustering/MLSUMClusteringP2P.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/tasks/Clustering/MLSUMClusteringS2S.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/tasks/Clustering/MasakhaNEWSClusteringP2P.py * Update mteb/tasks/Clustering/MasakhaNEWSClusteringS2S.py * Update mteb/tasks/STS/SickFrSTS.py * Inherit OpusparcusPC init from MultilingualTask * remove unnecessary init * Remove train split from evaluation on MasakhaNEWSClassification (#52) remove train split from evaluation * put script on HF dataset repos (#56) * put script on HF dataset repos * remove scripts * 49 fix dictionnary in syntecretrieval (#54) * add trust remote code arg * leave corpus as dict * remove trust remote code * add Tatoeba & BUCC BitextMining tasks (#57) add bucc and tatoeba bitextmining tasks * 46 add other languages to masakhaneweclusterings2s and p2p (#58) * add other language to clustering tasks * fix main score and S2S task * update run fr becnhmark script * Update run_mteb_french.py * Update AbsTaskClustering.py * remove train and validation splits --------- Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr> Co-authored-by: Marion Schaeffer <92590517+schmarion@users.noreply.github.com> Co-authored-by: mciancone@openstudio.fr <mciancone@openstudio.fr> Co-authored-by: Imene Kerboua <33312980+imenelydiaker@users.noreply.github.com> Co-authored-by: mciancone <73994289+Sunalwing@users.noreply.github.com> Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> Co-authored-by: wissam-sib <36303760+wissam-sib@users.noreply.github.com> Co-authored-by: Wissam Siblini <wissam.siblini92@gmail.com>

* add Masakhane dataset config * add trigram lang code for dataset who use it * create french script eval * fix French word * add some documentation * add script to process and upload alloprof on HF * build script for HF * adding dataset processing for mteb * add script to process and upload alloprof on HF * build script for HF * adding dataset processing for mteb * refactor few thing * remove whitespaces * 4 pair classification (#10) * add Opusparcus dataset * multilingual usage * use eval_split of config files * change eval_split according to data --------- Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr> * add script to process and upload alloprof on HF * build script for HF * adding dataset processing for mteb * refactor few thing * remove whitespaces * Clustering with HAL S2S dataset (#11) HAL S2S dataset creation and evaluation on clustering task. * adding BSARD dataset * add BSARD to benchmark * adding Hagrid dataset * DiaBLa and Flores Bitext Mining evaluation (#12) * Add DiaBLa dataset for bitext mining * Add DiaBLa dataset for bitext mining * deduplicate bitext task * add Flores * format files * add flores to evaluation script * remove prints * add revision --------- Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr> * add script to process and upload alloprof on HF * build script for HF * adding dataset processing for mteb * refactor few thing * remove whitespaces * adding dataset processing for mteb * adding BSARD dataset * add BSARD to benchmark * adding Hagrid dataset * fix change on langmapping * reset alphabetical order * add revision handling * Clustering: Add AlloProf dataset (#17) AlloProf dataset for clustering task * handling of revision * change split + add revision handling * add script to process and upload alloprof on HF * build script for HF * adding dataset processing for mteb * refactor few thing * remove whitespaces * adding dataset processing for mteb * adding BSARD dataset * add BSARD to benchmark * adding Hagrid 
dataset * add script to process and upload alloprof on HF * adding dataset processing for mteb * refactor few thing * reset alphabetical order * add revision handling * handling of revision * change split + add revision handling * use eval variable * alphabetic order * Add MLSUM dataset for clustering task (#21) * Use Masakhane dataset for clustering task (#23) * 16 add datasets to readmemd (#18) * run task table * run task table * Add MLSUM dataset for clustering task (#21) * Use Masakhane dataset for clustering task (#23) * run task table * refresh readme * refresh readme * run task table * refresh readme --------- Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr> Co-authored-by: Marion Schaeffer <92590517+schmarion@users.noreply.github.com> * load only test split (#25) Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr> * Update mteb/tasks/BitextMining/DiaBLaBitextMining.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/tasks/Clustering/HALClusteringS2S.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * renaming masakhane (#28) Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr> * Syntec dataset addition (#26) * add scrpit to process & load to HF * add script to enable download of data from HF * add syntec dataset files to gitignore * add syntecretrieval * add syntec retrival * build dataloading script * remove datasets * correct typo --------- Co-authored-by: Sequeira Gabriel <gabriel.sequeira@outlook.fr> * 30 add syntec reranking (#31) * change name to secify retrieval * add reranking tasks * create script to upload dataset fo reranking task * create reranking task * add reranking tasks * add model name in description * SummEval translated to french (#32) * 7 sts (#33) * taike into account multilingual tasks * add stsbenchmark multilingual dataset * add STS tasks * taike into account multilingual tasks * add stsbenchmark multilingual dataset * add STS tasks * add coma * Adding sick fr dataset to 
sts tasks (#34) * Adding sick fr dataset to sts tasks * modifying dataset in load function to have the right column names * Fix alloprof dataset (#36) * change revision to use * remove duplicate data * change main metric because dataset is hard (#37) * Fix alloprof dataset (#40) * change revision to use * remove duplicate data * change revision * handle queries train test split * change dataset creation method * change revision * handle queries train test split * change dataset creation method * Fix DiaBLa by inheriting CrossLingual class (#42) * Fix DiaBLa by inheriting CrossLingual class * remove remaining print * Fix DiaBLa integration * Update mteb/tasks/BitextMining/FloresBitextMining.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update README.md Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update README.md Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/tasks/Classification/MasakhaNEWSClassification.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update README.md Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update README.md * Update mteb/tasks/BitextMining/FloresBitextMining.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/evaluation/MTEB.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/abstasks/AbsTaskPairClassification.py Co-authored-by: Imene Kerboua <33312980+imenelydiaker@users.noreply.github.com> * Update README.md * Update scripts/data/syntec/create_data_reranking.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update scripts/data/alloprof/create_data_reranking.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update scripts/run_mteb_french.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update scripts/run_mteb_french.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/evaluation/MTEB.py Co-authored-by: Niklas 
Muennighoff <n.muennighoff@gmail.com> * Update mteb/evaluation/MTEB.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/tasks/Retrieval/HagridRetrieval.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/tasks/Clustering/MLSUMClusteringP2P.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/tasks/Clustering/MLSUMClusteringS2S.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/tasks/Clustering/MasakhaNEWSClusteringP2P.py * Update mteb/tasks/Clustering/MasakhaNEWSClusteringS2S.py * Update mteb/tasks/STS/SickFrSTS.py * Inherit OpusparcusPC init from MultilingualTask * remove unnecessary init * Remove train split from evaluation on MasakhaNEWSClassification (#52) remove train split from evaluation * put script on HF dataset repos (#56) * put script on HF dataset repos * remove scripts * 49 fix dictionnary in syntecretrieval (#54) * add trust remote code arg * leave corpus as dict * remove trust remote code * add Tatoeba & BUCC BitextMining tasks (#57) add bucc and tatoeba bitextmining tasks * 46 add other languages to masakhaneweclusterings2s and p2p (#58) * add other language to clustering tasks * fix main score and S2S task * update run fr becnhmark script * Update run_mteb_french.py * Update AbsTaskClustering.py * remove train and validation splits * remove Hagrid (#60) --------- Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr> Co-authored-by: Marion Schaeffer <92590517+schmarion@users.noreply.github.com> Co-authored-by: mciancone@openstudio.fr <mciancone@openstudio.fr> Co-authored-by: Sequeira Gabriel <gabriel.sequeira@outlook.fr> Co-authored-by: Imene Kerboua <33312980+imenelydiaker@users.noreply.github.com> Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> Co-authored-by: wissam-sib <36303760+wissam-sib@users.noreply.github.com> Co-authored-by: Wissam Siblini <wissam.siblini92@gmail.com>

rename validation split to test

9c4d5c6

NouamaneTazi merged commit 6fc710b into main Aug 3, 2022

KennethEnevoldsen deleted the mindsmall-test branch March 20, 2024 17:03

NouamaneTazi merged 1 commit into main from mindsmall-test
