rename validation split to test for MindSmallReranking #36
Merged
Muennighoff added a commit that referenced this pull request on Feb 22, 2024
* add Masakhane dataset config * add trigram lang code for dataset who use it * create french script eval * fix French word * add some documentation * add script to process and upload alloprof on HF * build script for HF * adding dataset processing for mteb * add script to process and upload alloprof on HF * build script for HF * adding dataset processing for mteb * refactor few thing * remove whitespaces * 4 pair classification (#10) * add Opusparcus dataset * multilingual usage * use eval_split of config files * change eval_split according to data --------- Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr> * add script to process and upload alloprof on HF * build script for HF * adding dataset processing for mteb * refactor few thing * remove whitespaces * Clustering with HAL S2S dataset (#11) HAL S2S dataset creation and evaluation on clustering task. * adding BSARD dataset * add BSARD to benchmark * adding Hagrid dataset * DiaBLa and Flores Bitext Mining evaluation (#12) * Add DiaBLa dataset for bitext mining * Add DiaBLa dataset for bitext mining * deduplicate bitext task * add Flores * format files * add flores to evaluation script * remove prints * add revision --------- Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr> * add script to process and upload alloprof on HF * build script for HF * adding dataset processing for mteb * refactor few thing * remove whitespaces * adding dataset processing for mteb * adding BSARD dataset * add BSARD to benchmark * adding Hagrid dataset * fix change on langmapping * reset alphabetical order * add revision handling * Clustering: Add AlloProf dataset (#17) AlloProf dataset for clustering task * handling of revision * change split + add revision handling * add script to process and upload alloprof on HF * build script for HF * adding dataset processing for mteb * refactor few thing * remove whitespaces * adding dataset processing for mteb * adding BSARD dataset * add BSARD to benchmark * adding Hagrid 
dataset * add script to process and upload alloprof on HF * adding dataset processing for mteb * refactor few thing * reset alphabetical order * add revision handling * handling of revision * change split + add revision handling * use eval variable * alphabetic order * Add MLSUM dataset for clustering task (#21) * Use Masakhane dataset for clustering task (#23) * 16 add datasets to readmemd (#18) * run task table * run task table * Add MLSUM dataset for clustering task (#21) * Use Masakhane dataset for clustering task (#23) * run task table * refresh readme * refresh readme * run task table * refresh readme --------- Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr> Co-authored-by: Marion Schaeffer <92590517+schmarion@users.noreply.github.com> * load only test split (#25) Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr> * Update mteb/tasks/BitextMining/DiaBLaBitextMining.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/tasks/Clustering/HALClusteringS2S.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * renaming masakhane (#28) Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr> * Syntec dataset addition (#26) * add scrpit to process & load to HF * add script to enable download of data from HF * add syntec dataset files to gitignore * add syntecretrieval * add syntec retrival * build dataloading script * remove datasets * correct typo --------- Co-authored-by: Sequeira Gabriel <gabriel.sequeira@outlook.fr> * 30 add syntec reranking (#31) * change name to secify retrieval * add reranking tasks * create script to upload dataset fo reranking task * create reranking task * add reranking tasks * add model name in description * SummEval translated to french (#32) * 7 sts (#33) * taike into account multilingual tasks * add stsbenchmark multilingual dataset * add STS tasks * taike into account multilingual tasks * add stsbenchmark multilingual dataset * add STS tasks * add coma * Adding sick fr dataset to 
sts tasks (#34) * Adding sick fr dataset to sts tasks * modifying dataset in load function to have the right column names * Fix alloprof dataset (#36) * change revision to use * remove duplicate data * change main metric because dataset is hard (#37) * Fix alloprof dataset (#40) * change revision to use * remove duplicate data * change revision * handle queries train test split * change dataset creation method * change revision * handle queries train test split * change dataset creation method * Fix DiaBLa by inheriting CrossLingual class (#42) * Fix DiaBLa by inheriting CrossLingual class * remove remaining print * Fix DiaBLa integration * Update mteb/tasks/BitextMining/FloresBitextMining.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update README.md Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update README.md Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/tasks/Classification/MasakhaNEWSClassification.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update README.md Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update README.md * Update mteb/tasks/BitextMining/FloresBitextMining.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/evaluation/MTEB.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/abstasks/AbsTaskPairClassification.py Co-authored-by: Imene Kerboua <33312980+imenelydiaker@users.noreply.github.com> * Update README.md * Update scripts/data/syntec/create_data_reranking.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update scripts/data/alloprof/create_data_reranking.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update scripts/run_mteb_french.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update scripts/run_mteb_french.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/evaluation/MTEB.py Co-authored-by: Niklas 
Muennighoff <n.muennighoff@gmail.com> * Update mteb/evaluation/MTEB.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/tasks/Retrieval/HagridRetrieval.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/tasks/Clustering/MLSUMClusteringP2P.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/tasks/Clustering/MLSUMClusteringS2S.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/tasks/Clustering/MasakhaNEWSClusteringP2P.py * Update mteb/tasks/Clustering/MasakhaNEWSClusteringS2S.py * Update mteb/tasks/STS/SickFrSTS.py * Inherit OpusparcusPC init from MultilingualTask * remove unnecessary init * Remove train split from evaluation on MasakhaNEWSClassification (#52) remove train split from evaluation * put script on HF dataset repos (#56) * put script on HF dataset repos * remove scripts * 49 fix dictionnary in syntecretrieval (#54) * add trust remote code arg * leave corpus as dict * remove trust remote code * add Tatoeba & BUCC BitextMining tasks (#57) add bucc and tatoeba bitextmining tasks * 46 add other languages to masakhaneweclusterings2s and p2p (#58) * add other language to clustering tasks * fix main score and S2S task * update run fr becnhmark script * Update run_mteb_french.py * Update AbsTaskClustering.py * remove train and validation splits --------- Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr> Co-authored-by: Marion Schaeffer <92590517+schmarion@users.noreply.github.com> Co-authored-by: mciancone@openstudio.fr <mciancone@openstudio.fr> Co-authored-by: Imene Kerboua <33312980+imenelydiaker@users.noreply.github.com> Co-authored-by: mciancone <73994289+Sunalwing@users.noreply.github.com> Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> Co-authored-by: wissam-sib <36303760+wissam-sib@users.noreply.github.com> Co-authored-by: Wissam Siblini <wissam.siblini92@gmail.com>
Muennighoff added a commit that referenced this pull request on Feb 27, 2024
* add Masakhane dataset config * add trigram lang code for dataset who use it * create french script eval * fix French word * add some documentation * add script to process and upload alloprof on HF * build script for HF * adding dataset processing for mteb * add script to process and upload alloprof on HF * build script for HF * adding dataset processing for mteb * refactor few thing * remove whitespaces * 4 pair classification (#10) * add Opusparcus dataset * multilingual usage * use eval_split of config files * change eval_split according to data --------- Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr> * add script to process and upload alloprof on HF * build script for HF * adding dataset processing for mteb * refactor few thing * remove whitespaces * Clustering with HAL S2S dataset (#11) HAL S2S dataset creation and evaluation on clustering task. * adding BSARD dataset * add BSARD to benchmark * adding Hagrid dataset * DiaBLa and Flores Bitext Mining evaluation (#12) * Add DiaBLa dataset for bitext mining * Add DiaBLa dataset for bitext mining * deduplicate bitext task * add Flores * format files * add flores to evaluation script * remove prints * add revision --------- Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr> * add script to process and upload alloprof on HF * build script for HF * adding dataset processing for mteb * refactor few thing * remove whitespaces * adding dataset processing for mteb * adding BSARD dataset * add BSARD to benchmark * adding Hagrid dataset * fix change on langmapping * reset alphabetical order * add revision handling * Clustering: Add AlloProf dataset (#17) AlloProf dataset for clustering task * handling of revision * change split + add revision handling * add script to process and upload alloprof on HF * build script for HF * adding dataset processing for mteb * refactor few thing * remove whitespaces * adding dataset processing for mteb * adding BSARD dataset * add BSARD to benchmark * adding Hagrid 
dataset * add script to process and upload alloprof on HF * adding dataset processing for mteb * refactor few thing * reset alphabetical order * add revision handling * handling of revision * change split + add revision handling * use eval variable * alphabetic order * Add MLSUM dataset for clustering task (#21) * Use Masakhane dataset for clustering task (#23) * 16 add datasets to readmemd (#18) * run task table * run task table * Add MLSUM dataset for clustering task (#21) * Use Masakhane dataset for clustering task (#23) * run task table * refresh readme * refresh readme * run task table * refresh readme --------- Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr> Co-authored-by: Marion Schaeffer <92590517+schmarion@users.noreply.github.com> * load only test split (#25) Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr> * Update mteb/tasks/BitextMining/DiaBLaBitextMining.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/tasks/Clustering/HALClusteringS2S.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * renaming masakhane (#28) Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr> * Syntec dataset addition (#26) * add scrpit to process & load to HF * add script to enable download of data from HF * add syntec dataset files to gitignore * add syntecretrieval * add syntec retrival * build dataloading script * remove datasets * correct typo --------- Co-authored-by: Sequeira Gabriel <gabriel.sequeira@outlook.fr> * 30 add syntec reranking (#31) * change name to secify retrieval * add reranking tasks * create script to upload dataset fo reranking task * create reranking task * add reranking tasks * add model name in description * SummEval translated to french (#32) * 7 sts (#33) * taike into account multilingual tasks * add stsbenchmark multilingual dataset * add STS tasks * taike into account multilingual tasks * add stsbenchmark multilingual dataset * add STS tasks * add coma * Adding sick fr dataset to 
sts tasks (#34) * Adding sick fr dataset to sts tasks * modifying dataset in load function to have the right column names * Fix alloprof dataset (#36) * change revision to use * remove duplicate data * change main metric because dataset is hard (#37) * Fix alloprof dataset (#40) * change revision to use * remove duplicate data * change revision * handle queries train test split * change dataset creation method * change revision * handle queries train test split * change dataset creation method * Fix DiaBLa by inheriting CrossLingual class (#42) * Fix DiaBLa by inheriting CrossLingual class * remove remaining print * Fix DiaBLa integration * Update mteb/tasks/BitextMining/FloresBitextMining.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update README.md Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update README.md Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/tasks/Classification/MasakhaNEWSClassification.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update README.md Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update README.md * Update mteb/tasks/BitextMining/FloresBitextMining.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/evaluation/MTEB.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/abstasks/AbsTaskPairClassification.py Co-authored-by: Imene Kerboua <33312980+imenelydiaker@users.noreply.github.com> * Update README.md * Update scripts/data/syntec/create_data_reranking.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update scripts/data/alloprof/create_data_reranking.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update scripts/run_mteb_french.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update scripts/run_mteb_french.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/evaluation/MTEB.py Co-authored-by: Niklas 
Muennighoff <n.muennighoff@gmail.com> * Update mteb/evaluation/MTEB.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/tasks/Retrieval/HagridRetrieval.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/tasks/Clustering/MLSUMClusteringP2P.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/tasks/Clustering/MLSUMClusteringS2S.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/tasks/Clustering/MasakhaNEWSClusteringP2P.py * Update mteb/tasks/Clustering/MasakhaNEWSClusteringS2S.py * Update mteb/tasks/STS/SickFrSTS.py * Inherit OpusparcusPC init from MultilingualTask * remove unnecessary init * Remove train split from evaluation on MasakhaNEWSClassification (#52) remove train split from evaluation * put script on HF dataset repos (#56) * put script on HF dataset repos * remove scripts * 49 fix dictionnary in syntecretrieval (#54) * add trust remote code arg * leave corpus as dict * remove trust remote code * add Tatoeba & BUCC BitextMining tasks (#57) add bucc and tatoeba bitextmining tasks * 46 add other languages to masakhaneweclusterings2s and p2p (#58) * add other language to clustering tasks * fix main score and S2S task * update run fr becnhmark script * Update run_mteb_french.py * Update AbsTaskClustering.py * remove train and validation splits * remove Hagrid (#60) --------- Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr> Co-authored-by: Marion Schaeffer <92590517+schmarion@users.noreply.github.com> Co-authored-by: mciancone@openstudio.fr <mciancone@openstudio.fr> Co-authored-by: Sequeira Gabriel <gabriel.sequeira@outlook.fr> Co-authored-by: Imene Kerboua <33312980+imenelydiaker@users.noreply.github.com> Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> Co-authored-by: wissam-sib <36303760+wissam-sib@users.noreply.github.com> Co-authored-by: Wissam Siblini <wissam.siblini92@gmail.com>
Fixes #28 along with https://huggingface.co/datasets/mteb/mind_small/discussions/1
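As a rough illustration of what the rename amounts to, the sketch below uses a plain Python dict as a stand-in for a task's metadata. The `eval_splits` key, the dict layout, and the `rename_eval_split` helper are assumptions made for illustration; they are not the actual diff of this pull request.

```python
from copy import deepcopy


def rename_eval_split(task_config: dict, old: str, new: str) -> dict:
    """Return a copy of task_config with `old` replaced by `new` in eval_splits."""
    updated = deepcopy(task_config)
    updated["eval_splits"] = [
        new if split == old else split for split in updated["eval_splits"]
    ]
    return updated


# Hypothetical metadata; only the task name is taken from the PR title.
mind_small = {"name": "MindSmallReranking", "eval_splits": ["validation"]}

renamed = rename_eval_split(mind_small, "validation", "test")
print(renamed["eval_splits"])
```

The helper copies the config rather than mutating it, so the original dict still lists `validation` and the two can be compared side by side.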