Extend MTEB with French datasets by GabrielSequeira · Pull Request #218 · embeddings-benchmark/mteb

GabrielSequeira · 2024-01-30T09:36:59Z

Adding:

French evaluation script
Bitext Mining
- Flores
- DiaBLa
Classification
- MasakhaNews
Pair Classification
- Opusparcus
- Add multilingual management
Clustering
- Dataset creation for HAL S2S clustering and upload on Hugging Face
- Script creating data for AlloProf dataset
- AlloProfClusteringP2P
- AlloProfClusteringS2S
- AlloProfClusteringP2P
- MLSUMClusteringS2S
- MasakhaNewsClusteringS2S
- MasakhaNewsClusteringP2P
Retrieval
Reranking
- AlloprofReranking
- SyntecReranking
STS
- STSBenchmarkMultilingualSTS
- SickFrSTS
Summarization
- SummEvalFrSummarization
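
For readers who want to try a few of these tasks, here is a minimal sketch. It assumes the `MTEB(tasks=..., task_langs=...)` constructor and `run(model, output_folder=...)` method as they looked in mteb releases around the time of this PR (newer releases may expose a different entry point), and it assumes `sentence-transformers` is installed. The helper name `run_french_benchmark`, the `FRENCH_TASKS` subset and the `results/` output root are illustrative choices, not part of the PR.

```python
# Illustrative subset; the task names are taken from the leaderboard lists in this thread.
FRENCH_TASKS = ["AlloProfClusteringS2S", "SyntecReranking", "SummEvalFr"]


def run_french_benchmark(model_name, tasks=FRENCH_TASKS, output_root="results"):
    """Evaluate one sentence-transformers model on a few French MTEB tasks.

    Hypothetical helper: imports are done lazily so the module can be loaded
    without mteb installed. Calling it downloads the model and the datasets.
    """
    from mteb import MTEB
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    # task_langs restricts multilingual tasks to their French subset.
    evaluation = MTEB(tasks=list(tasks), task_langs=["fr"])
    # One JSON file per task is written under output_root/model_name.
    return evaluation.run(model, output_folder=f"{output_root}/{model_name}")
```

Calling `run_french_benchmark("some-model-id")` with a real model identifier would start the evaluation; nothing is downloaded until then.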


GabrielSequeira · 2024-02-21T14:38:22Z

Yeah ! 🚀 I made the change, do you need the json results ?

Yes that would be great!

Hey @Muennighoff, do you need the results to be merged in this repository https://huggingface.co/datasets/mteb/results, or can we just give you the JSON files we have?

Yeah feel free to directly open a PR there e.g. like this one: #218 (comment)

You just need to create a folder for each model and place the files in there; You can clone the repo with git clone ... ; instructions are also here https://huggingface.co/datasets/mteb/results/discussions?new_pr=true

I think we are good with the changes, and the results are there : https://huggingface.co/datasets/mteb/results/discussions/28
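
The "one folder per model" layout described above can be sketched as follows. This is only an illustration under assumptions: the helper name `collect_results` and the directory arguments are hypothetical, and the exact path prefix expected inside the results repository should be checked against the instructions linked in the thread.

```python
import shutil
from pathlib import Path


def collect_results(source_dir, repo_dir, model_name):
    """Copy every *.json result file for one model into its own folder.

    Hypothetical helper: repo_dir stands for a local clone of the results
    repository; model_name is used as the folder name as-is.
    """
    target = Path(repo_dir) / model_name
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for json_file in sorted(Path(source_dir).glob("*.json")):
        shutil.copy2(json_file, target / json_file.name)
        copied.append(json_file.name)
    return copied
```

Running it once per evaluated model leaves one folder per model, ready to be committed and proposed as a PR.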

Muennighoff · 2024-02-21T14:52:46Z


Amazing! I think this one would still be nice #218 (comment) if okay with you but other than that we can merge both!


imenelydiaker · 2024-02-22T14:11:51Z

Amazing! I think this one would still be nice #218 (comment) if okay with you but other than that we can merge both!

@Muennighoff we made the changes for #218 (comment). Can you please have a look at the AbsTaskClustering, MasakhaNEWSClusteringP2P and MasakhaNEWSClusteringS2S files?

Muennighoff

LGTM! Let's merge?

GabrielSequeira · 2024-02-22T14:34:12Z

Yes let's merge ! 💯

Muennighoff · 2024-02-22T14:50:20Z

We still need to update the MasakhaNEWS files in https://huggingface.co/datasets/mteb/results/discussions/28 I think, and then I will add a French leaderboard tab if you want! 🙌 Should I just link to your GitHub accounts / this PR for the credits of the new tabs, or are you planning to write a paper on it?

Muennighoff · 2024-02-22T15:15:25Z

I think the leaderboard(s) would consist of the following:

TASK_LIST_CLASSIFICATION_FR = [
    "AmazonReviewsClassification (fr)",
    "MasakhaNEWSClassification (fra)",
    "MassiveIntentClassification (fr)",
    "MassiveScenarioClassification (fr)",
]

TASK_LIST_CLUSTERING_FR = [
    "AlloProfClusteringP2P",
    "AlloProfClusteringS2S",
    "HALClusteringS2S",
    "MLSUMClusteringP2P",
    "MLSUMClusteringS2S",
    "MasakhaNEWSClusteringP2P (fra)",
    "MasakhaNEWSClusteringS2S (fra)",
]

TASK_LIST_PAIR_CLASSIFICATION_FR = [
    "OpusparcusPC (fr)",
    "PawsX (fr)",
]

TASK_LIST_RERANKING_FR = [
    "AlloprofReranking",
    "SyntecReranking",
]

TASK_LIST_RETRIEVAL_FR = [
    "AlloprofRetrieval",
    "BSARDRetrieval",
    "MintakaRetrieval",
    "MultiLongDocRetrieval",
    "SyntecRetrieval",
    "XPQARetrieval",
]

TASK_LIST_STS_FR = [
    "STS22 (fr)",
    "STSBenchmarkMultilingualSTS (fr)",
    "SICKFr",
]

TASK_LIST_SUMMARIZATION_FR = ["SummEvalFr"]

It's a total of 25 so also enough for an overall leaderboard tab imo.
Let me know if I am missing anything!

Also I added a few other French tasks like xPQA etc. not from this PR - would be great if you could eval on them too 🙌
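
The total of 25 can be checked by tallying the lists above. The sketch below copies the task names from this comment; the dictionary name `TASK_LISTS_FR` and the per-category keys are illustrative, not names used by the leaderboard code.

```python
# Task names copied from the lists in this comment; the dict itself is illustrative.
TASK_LISTS_FR = {
    "Classification": [
        "AmazonReviewsClassification (fr)",
        "MasakhaNEWSClassification (fra)",
        "MassiveIntentClassification (fr)",
        "MassiveScenarioClassification (fr)",
    ],
    "Clustering": [
        "AlloProfClusteringP2P",
        "AlloProfClusteringS2S",
        "HALClusteringS2S",
        "MLSUMClusteringP2P",
        "MLSUMClusteringS2S",
        "MasakhaNEWSClusteringP2P (fra)",
        "MasakhaNEWSClusteringS2S (fra)",
    ],
    "PairClassification": ["OpusparcusPC (fr)", "PawsX (fr)"],
    "Reranking": ["AlloprofReranking", "SyntecReranking"],
    "Retrieval": [
        "AlloprofRetrieval",
        "BSARDRetrieval",
        "MintakaRetrieval",
        "MultiLongDocRetrieval",
        "SyntecRetrieval",
        "XPQARetrieval",
    ],
    "STS": ["STS22 (fr)", "STSBenchmarkMultilingualSTS (fr)", "SICKFr"],
    "Summarization": ["SummEvalFr"],
}

# Number of tasks per category and overall.
counts = {category: len(tasks) for category, tasks in TASK_LISTS_FR.items()}
total = sum(counts.values())
print(counts)
print("total:", total)
```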

imenelydiaker · 2024-02-22T15:41:42Z

We will re-run the evaluations on MasakhaNEWS, we will also run evals on the datasets you added. The PR will be updated soon.

For the Credits, we have a paper that is under review for the moment. You can still add our GitHub accounts maybe ?

Concerning the list of tasks for the leaderboard, I think you can add these two tasks in TASK_LIST_CLASSIFICATION: "MTOPDomainClassification", "MTOPIntentClassification".

Muennighoff · 2024-02-22T16:03:14Z

We will re-run the evaluations on MasakhaNEWS, we will also run evals on the datasets you added. The PR will be updated soon.

Amazing!

For the Credits, we have a paper that is under review for the moment. You can still add our GitHub accounts maybe ?

Cool! Happy to help if I can be useful! I will put your GitHubs in the meantime 👍

Concerning the list of tasks for the leaderboard,

Good catch - 27 datasets then!

imenelydiaker · 2024-02-27T15:14:06Z

@Muennighoff do you need model characteristics such as size, embedding dimension and max token length for the leaderboard? We have already gathered them in these two files (CSV or JSON, whichever suits you best):
JSON: https://github.com/Lyon-NLP/mtebscripts/blob/main/script_mteb_french/results_analysis/model_specs.json
CSV: https://github.com/Lyon-NLP/mtebscripts/blob/main/script_mteb_french/results_analysis/models_characteristics.csv

Muennighoff · 2024-02-28T19:40:18Z

Do you also want to run the French split of MultiLongDocRetrieval and include it? I saw it wasn't included in the result files

Muennighoff · 2024-02-28T20:49:48Z

I have preliminarily added the French leaderboard: https://huggingface.co/spaces/mteb/leaderboard 🇫🇷🚀
Let me know if we should change anything / feel free to open a PR! Also would be cool to still include MultiLongDocRetrieval I think :)

Edit: A few models still missing, updating....

imenelydiaker · 2024-02-28T21:08:55Z

I have preliminarily added the French leaderboard: https://huggingface.co/spaces/mteb/leaderboard 🇫🇷🚀 Let me know if we should change anything / feel free to open a PR! Also would be cool to still include MultiLongDocRetrieval I think :)

That's so great, thanks! 🚀 Just one small thing: could you please add @schmarion to the credits section?

We'll check MultiLongDocRetrieval and will open a PR with the results soon.

Edit: yes some models and a dataset (SummEvalFr) are missing

MathieuCiancone · 2024-02-28T21:19:03Z

Thanks a lot! So good to finally see the results there 🤩. Could you please relink my name in the credits? I changed my account name recently and the link is broken... @MathieuCiancone

contrebande-labs · 2024-02-29T05:09:59Z

Hi guys! I just want to congratulate everybody who has worked on this! You made my day!

MathieuCiancone · 2024-02-29T09:07:05Z

I have preliminarily added the French leaderboard: https://huggingface.co/spaces/mteb/leaderboard 🇫🇷🚀 Let me know if we should change anything / feel free to open a PR! Also would be cool to still include MultiLongDocRetrieval I think :)

Edit: A few models still missing, updating....

I looked into the dataset and it is actually quite huge (not even talking about the English subset 😅). Even for text embedding APIs (which tend to be cheap), embedding the whole corpus would lead to significant API costs (including openai 3 embedding model, mistral, voyage and cohere). I think we will save this for smaller datasets, as a greater variety of datasets is more insightful than bigger datasets... 🤔

Muennighoff · 2024-02-29T13:34:50Z

I looked into the dataset and it is actually quite huge (not even talking about the English subset 😅). Even for text embedding APIs (which tend to be cheap), embedding the whole corpus would lead to significant API costs (including openai 3 embedding model, mistral, voyage and cohere). I think we will save this for smaller datasets, as a greater variety of datasets is more insightful than bigger datasets... 🤔

So for French, it's 20 queries & 10K docs w/ 9.6K avg characters per doc. I think it's comparable to TREC-COVID which is 171K docs but only ~900 avg characters per doc. TREC-COVID is one of the cheapest English Retrieval datasets. Also for many models that cannot handle the long sequence length of 9.6K characters it'll be cheaper. But ofc up to you, no worries if you don't want to include it 👍
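
The comparison above can be made concrete with a quick back-of-the-envelope calculation. The sketch uses only the approximate figures quoted in this comment (document counts and average characters per document); the function name `corpus_characters` is illustrative, and characters are used as a rough proxy for embedding cost, which in practice depends on tokenization and on each provider's pricing.

```python
def corpus_characters(num_docs, avg_chars_per_doc):
    """Approximate corpus size in characters."""
    return num_docs * avg_chars_per_doc


# Figures as quoted in the comment above (approximate).
mldr_fr_chars = corpus_characters(10_000, 9_600)
trec_covid_chars = corpus_characters(171_000, 900)

# Relative size of the French MultiLongDocRetrieval corpus vs. TREC-COVID.
ratio = mldr_fr_chars / trec_covid_chars

print(f"MultiLongDocRetrieval (fr): {mldr_fr_chars:,} characters")
print(f"TREC-COVID:                 {trec_covid_chars:,} characters")
print(f"ratio: {ratio:.2f}")
```

Under these figures the French corpus comes to about 96 million characters against roughly 154 million for TREC-COVID, so the two are of the same order of magnitude.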

gsequeiraOS and others added 30 commits November 7, 2023 09:26

add Masakhane dataset config

a799517

add trigram lang code for dataset who use it

0a4e666

create french script eval

7e0e818

fix French word

e83730f

add some documentation

5a06efd

Merge pull request #9 from Lyon-NLP/2-classification

0164f06

Masakhane dataset and french script for classification

add script to process and upload alloprof on HF

45c4e6d

build script for HF

6802c26

adding dataset processing for mteb

965e019

add script to process and upload alloprof on HF

0aaf4cd

build script for HF

6d94f4e

adding dataset processing for mteb

ffc0835

Merge branch '6-retrieval' of github.com:Lyon-NLP/mteb-french into 6-…

1107256

…retrieval

refactor few thing

4ba1e6a

remove whitespaces

03dfa6d

4 pair classification (#10)

6f89003

* add Opusparcus dataset * multilingual usage * use eval_split of config files * change eval_split according to data --------- Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr>

add script to process and upload alloprof on HF

a489743

build script for HF

23c6917

adding dataset processing for mteb

0f2f9b1

refactor few thing

59aee44

remove whitespaces

988c964

Merge branch '6-retrieval' of github.com:Lyon-NLP/mteb-french into 6-…

d012c5d

…retrieval

Clustering with HAL S2S dataset (#11)

400d88a

HAL S2S dataset creation and evaluation on clustering task.

adding BSARD dataset

43168c2

add BSARD to benchmark

a8e7219

adding Hagrid dataset

c95dd53

add script to process and upload alloprof on HF

fd16153

build script for HF

ce134d0

adding dataset processing for mteb

6976f09

wissam-sib and others added 7 commits February 20, 2024 23:05

Inherit OpusparcusPC init from MultilingualTask

30708b4

remove unnecessary init

b823121

Merge pull request #53 from Lyon-NLP/47-inherit-multilingualtask-in-o…

2427490

…pusparcuspc Inherit OpusparcusPC init from MultilingualTask

Remove train split from evaluation on MasakhaNEWSClassification (#52)

abdaac0

remove train split from evaluation

put script on HF dataset repos (#56)

8aa6064

* put script on HF dataset repos * remove scripts

49 fix dictionnary in syntecretrieval (#54)

dd37bcf

* add trust remote code arg * leave corpus as dict * remove trust remote code

add Tatoeba & BUCC BitextMining tasks (#57)

b7573b7

add bucc and tatoeba bitextmining tasks

46 add other languages to masakhaneweclusterings2s and p2p (#58)

742aeb1

* add other language to clustering tasks * fix main score and S2S task * update run fr becnhmark script * Update run_mteb_french.py * Update AbsTaskClustering.py * remove train and validation splits

Muennighoff approved these changes Feb 22, 2024


Muennighoff merged commit 3d8b8ec into embeddings-benchmark:main Feb 22, 2024

Muennighoff mentioned this pull request Apr 4, 2024

Additional Dataset: FLORES200 #53

Closed

Muennighoff mentioned this pull request May 20, 2024

Integrate with MTEB? kaistAI/InstructIR#3

Open

Muennighoff mentioned this pull request May 31, 2024

Integrate with MTEB? gowitheflow-1998/RAR-b#4

Closed

Muennighoff mentioned this pull request Jul 10, 2024

Integrate with MTEB? CoIR-team/coir#4

Closed


9 participants
