Extend MTEB with French datasets #218

Merged
Muennighoff merged 117 commits into embeddings-benchmark:main from Lyon-NLP:main
Feb 22, 2024

@GabrielSequeira

Contributor

Adding:

gsequeiraOS and others added 30 commits November 7, 2023 09:26
Masakhane dataset and french script for classification
* add Opusparcus dataset

* multilingual usage

* use eval_split of config files

* change eval_split according to data

---------

Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr>
HAL S2S dataset creation and evaluation on clustering task.
* Add DiaBLa dataset for bitext mining

* Add DiaBLa dataset for bitext mining

* deduplicate bitext task

* add Flores

* format files

* add flores to evaluation script

* remove prints

* add revision

---------

Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr>
wissam-sib and others added 7 commits February 20, 2024 23:05
…pusparcuspc

Inherit OpusparcusPC init from MultilingualTask
* put script on HF dataset repos

* remove scripts
* add trust remote code arg

* leave corpus as dict

* remove trust remote code
add bucc and tatoeba bitextmining tasks
@GabrielSequeira

Contributor Author

> Yeah! 🚀 I made the change, do you need the json results?

> Yes that would be great!

> Hey @Muennighoff, do you need the results to be merged in this repository https://huggingface.co/datasets/mteb/results, or can we just give you the json files we have?

> Yeah feel free to directly open a PR there e.g. like this one: #218 (comment)
>
> You just need to create a folder for each model and place the files in there; You can clone the repo with git clone ... ; instructions are also here https://huggingface.co/datasets/mteb/results/discussions?new_pr=true

I think we are good with the changes, and the results are there: https://huggingface.co/datasets/mteb/results/discussions/28
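
For anyone following the folder-per-model instruction quoted above, here is a minimal Python sketch of staging local JSON result files into a cloned copy of the results repository. The `results/<model_name>/` layout, the function name `stage_results`, and the file names in the demo are assumptions for illustration only; check the target repository for the structure it actually expects.

```python
import shutil
import tempfile
from pathlib import Path


def stage_results(json_dir: Path, repo_dir: Path, model_name: str) -> list[Path]:
    """Copy every *.json file from json_dir into <repo_dir>/results/<model_name>/.

    The "results/<model_name>" layout is an assumption, not a documented
    requirement of any particular repository.
    """
    target = repo_dir / "results" / model_name
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(json_dir.glob("*.json")):
        dst = target / src.name
        shutil.copy2(src, dst)  # keeps timestamps along with the content
        copied.append(dst)
    return copied


if __name__ == "__main__":
    # Self-contained demo in a temporary directory (hypothetical file names).
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "local_results"
        out.mkdir()
        (out / "SyntecRetrieval.json").write_text("{}")
        staged = stage_results(out, Path(tmp) / "results_repo", "my-model")
        print([p.name for p in staged])
```

After staging, the files would still need to be committed and pushed as a PR to the results repository, as described in the quoted instructions.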

@Muennighoff

Contributor

> Yeah! 🚀 I made the change, do you need the json results?

> Yes that would be great!

> Hey @Muennighoff, do you need the results to be merged in this repository https://huggingface.co/datasets/mteb/results, or can we just give you the json files we have?

> Yeah feel free to directly open a PR there e.g. like this one: #218 (comment)
> You just need to create a folder for each model and place the files in there; You can clone the repo with git clone ... ; instructions are also here https://huggingface.co/datasets/mteb/results/discussions?new_pr=true

> I think we are good with the changes, and the results are there: https://huggingface.co/datasets/mteb/results/discussions/28

Amazing! I think this one would still be nice #218 (comment) if okay with you but other than that we can merge both!

* add other language to clustering tasks

* fix main score and S2S task

* update run fr becnhmark script

* Update run_mteb_french.py

* Update AbsTaskClustering.py

* remove train and validation splits
@imenelydiaker

imenelydiaker commented Feb 22, 2024

Contributor

> Amazing! I think this one would still be nice #218 (comment) if okay with you but other than that we can merge both!

@Muennighoff we made the changes for #218 (comment), can you please have a look at the AbsTaskClustering, MasakhaNEWSClusteringP2P and MasakhaNEWSClusteringS2S files?

@Muennighoff Muennighoff left a comment

Contributor

LGTM! Let's merge?

@GabrielSequeira

Contributor Author

Yes let's merge ! 💯

@Muennighoff Muennighoff merged commit 3d8b8ec into embeddings-benchmark:main Feb 22, 2024
@Muennighoff

Contributor

Still need to update MasakhaNEWS files in https://huggingface.co/datasets/mteb/results/discussions/28 I think & then I will add a French leaderboard tab if you want! 🙌 Should I just link to your GitHub accounts / this PR for the Credits of the new tabs or are you planning to write a paper on it?

@Muennighoff

Contributor

I think the leaderboard(s) would consist of the following:

TASK_LIST_CLASSIFICATION_FR = [
    "AmazonReviewsClassification (fr)",
    "MasakhaNEWSClassification (fra)",
    "MassiveIntentClassification (fr)",
    "MassiveScenarioClassification (fr)",
]

TASK_LIST_CLUSTERING_FR = [
    "AlloProfClusteringP2P",
    "AlloProfClusteringS2S",
    "HALClusteringS2S",
    "MLSUMClusteringP2P",
    "MLSUMClusteringS2S",
    "MasakhaNEWSClusteringP2P (fra)",
    "MasakhaNEWSClusteringS2S (fra)",
]

TASK_LIST_PAIR_CLASSIFICATION_FR = [
    "OpusparcusPC (fr)",
    "PawsX (fr)",
]

TASK_LIST_RERANKING_FR = [
    "AlloprofReranking",
    "SyntecReranking",
]

TASK_LIST_RETRIEVAL_FR = [
    "AlloprofRetrieval",
    "BSARDRetrieval",
    "MintakaRetrieval",
    "MultiLongDocRetrieval",
    "SyntecRetrieval",
    "XPQARetrieval",
]

TASK_LIST_STS_FR = [
    "STS22 (fr)",
    "STSBenchmarkMultilingualSTS (fr)",
    "SICKFr",
]

TASK_LIST_SUMMARIZATION_FR = ["SummEvalFr"]

It's a total of 25 so also enough for an overall leaderboard tab imo.
Let me know if I am missing anything!

Also I added a few other French tasks like xPQA etc. not from this PR - would be great if you could eval on them too 🙌

@imenelydiaker

imenelydiaker commented Feb 22, 2024

Contributor

We will re-run the evaluations on MasakhaNEWS, we will also run evals on the datasets you added. The PR will be updated soon.

For the Credits, we have a paper that is under review for the moment. You can still add our GitHub accounts maybe ?

Concerning the list of tasks for the leaderboard, I think you can add these two tasks in TASK_LIST_CLASSIFICATION: "MTOPDomainClassification", "MTOPIntentClassification".

@Muennighoff

Contributor

> We will re-run the evaluations on MasakhaNEWS, we will also run evals on the datasets you added. The PR will be updated soon.

Amazing!

> For the Credits, we have a paper that is under review for the moment. You can still add our GitHub accounts maybe ?

Cool! Happy to help if I can be useful! I will put your GitHubs in the meantime 👍

> Concerning the list of tasks for the leaderboard,

Good catch - 27 datasets then!
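
As a quick sanity check on the two totals mentioned in this thread, the counts can be tallied in a few lines of Python. This is only a sketch: the dictionary records the number of entries in each list as posted earlier in the conversation, and the variable names are illustrative rather than taken from the leaderboard code.

```python
# Number of entries in each TASK_LIST_*_FR list as posted in the thread.
tasks_per_list = {
    "Classification": 4,
    "Clustering": 7,
    "PairClassification": 2,
    "Reranking": 2,
    "Retrieval": 6,
    "STS": 3,
    "Summarization": 1,
}
total_posted = sum(tasks_per_list.values())

# The two MTOP tasks suggested as additions to the classification list.
suggested = ["MTOPDomainClassification", "MTOPIntentClassification"]
total_with_mtop = total_posted + len(suggested)

print(total_posted, total_with_mtop)  # 25 27
```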

@imenelydiaker

imenelydiaker commented Feb 27, 2024

Contributor

@Muennighoff do you need model characteristics such as size, embedding dimension and max token length for the leaderboard? We have already gathered them in these two files (CSV or JSON, whichever suits you best):
JSON: https://github.com/Lyon-NLP/mtebscripts/blob/main/script_mteb_french/results_analysis/model_specs.json
CSV: https://github.com/Lyon-NLP/mtebscripts/blob/main/script_mteb_french/results_analysis/models_characteristics.csv

@Muennighoff

Contributor

Do you also want to run the French split of MultiLongDocRetrieval and include it? I saw it wasn't included in the result files

@Muennighoff

Muennighoff commented Feb 28, 2024

Contributor

I have preliminarily added the French leaderboard: https://huggingface.co/spaces/mteb/leaderboard 🇫🇷🚀
Let me know if we should change anything / feel free to open a PR! Also would be cool to still include MultiLongDocRetrieval I think :)

Edit: A few models still missing, updating....

@imenelydiaker

imenelydiaker commented Feb 28, 2024

Contributor

> I have preliminarily added the French leaderboard: https://huggingface.co/spaces/mteb/leaderboard 🇫🇷🚀 Let me know if we should change anything / feel free to open a PR! Also would be cool to still include MultiLongDocRetrieval I think :)

That's so great, thanks! 🚀 Just one small thing, could you please add @schmarion in the credits section?

We'll check MultiLongDocRetrieval and will open a PR with the results soon.

Edit: yes some models and a dataset (SummEvalFr) are missing

@MathieuCiancone

MathieuCiancone commented Feb 28, 2024

Contributor

Thanks a lot! So good to finally see the results there 🤩. Could you please relink my name in the credits? I changed my account name recently and the link is broken... @MathieuCiancone

@contrebande-labs


Hi guys! I just want to congratulate everybody who has worked on this! You made my day!

@MathieuCiancone

Contributor

> I have preliminarily added the French leaderboard: https://huggingface.co/spaces/mteb/leaderboard 🇫🇷🚀 Let me know if we should change anything / feel free to open a PR! Also would be cool to still include MultiLongDocRetrieval I think :)
>
> Edit: A few models still missing, updating....

I looked into the dataset and it is actually quite huge (not even talking about the English subset 😅). Even for text embedding APIs (which tend to be cheap), embedding the whole corpus would lead to significant API costs (including OpenAI 3 embedding model, Mistral, Voyage and Cohere). I think we will save this for smaller datasets, as a greater variety of datasets is more insightful than bigger datasets... 🤔

@Muennighoff

Contributor

> I looked into the dataset and it is actually quite huge (not even talking about the English subset 😅). Even for text embedding APIs (which tend to be cheap), embedding the whole corpus would lead to significant API costs (including OpenAI 3 embedding model, Mistral, Voyage and Cohere). I think we will save this for smaller datasets, as a greater variety of datasets is more insightful than bigger datasets... 🤔

So for French, it's 20 queries & 10K docs w/ 9.6K avg characters per doc. I think it's comparable to TREC-COVID which is 171K docs but only ~900 avg characters per doc. TREC-COVID is one of the cheapest English Retrieval datasets. Also for many models that cannot handle the long sequence length of 9.6K characters it'll be cheaper. But ofc up to you, no worries if you don't want to include it 👍
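
To make the size comparison above concrete, here is a back-of-the-envelope sketch in Python using only the document counts and average lengths quoted in the comment. The helper `corpus_chars` and the 2,000-character cap are hypothetical, chosen purely to illustrate the truncation point; multiplying by a capped average gives a rough upper bound, since the true total after per-document truncation can only be smaller. Actual API cost depends on tokenization and pricing, which are not modelled here.

```python
from typing import Optional


def corpus_chars(num_docs: int, avg_chars: int, max_chars: Optional[int] = None) -> int:
    """Approximate total characters to embed for a corpus.

    If max_chars is given, the average length is capped at that value to
    mimic a model that truncates long inputs (a rough upper bound).
    """
    per_doc = avg_chars if max_chars is None else min(avg_chars, max_chars)
    return num_docs * per_doc


# Figures quoted in the comment above.
mldr_fr = corpus_chars(10_000, 9_600)     # French MultiLongDocRetrieval
trec_covid = corpus_chars(171_000, 900)   # TREC-COVID
ratio = mldr_fr / trec_covid

# Hypothetical model that truncates inputs at 2,000 characters.
mldr_fr_truncated = corpus_chars(10_000, 9_600, max_chars=2_000)

print(mldr_fr, trec_covid, round(ratio, 2), mldr_fr_truncated)
# 96000000 153900000 0.62 20000000
```

Under these figures the French split comes to roughly 96M characters against about 154M for TREC-COVID, which is consistent with the claim that the two are comparable in embedding volume.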


9 participants