
feat: Add new benchmark BEIR-NL #1909

Merged
KennethEnevoldsen merged 11 commits into embeddings-benchmark:main from nikolay-banar:beirnl-branch
Feb 4, 2025

Conversation

@nikolay-banar

Contributor

We recently published BEIR-NL, which is a Dutch translated version of BEIR.

Adding datasets checklist

Reason for dataset addition: BEIR-NL, a new benchmark for retrieval in Dutch.

  • I have run the following models on the task (adding the results to the PR). These can be run using the mteb -m {model_name} -t {task_name} command.
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models achieve close to perfect scores) nor random (both models achieve close to random scores).
  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform()
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
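
The "neither trivial nor random" check from the list above can be sketched in plain Python. This is only an illustration: the function name `is_informative` and both thresholds are assumptions, not part of mteb, and the task's real random baseline should replace the default.

```python
def is_informative(scores, random_level=0.05, perfect_level=0.95):
    """Return False if every model is near-perfect or every model is near-random.

    `scores` holds one main-score value per model (e.g. nDCG@10 in [0, 1]).
    Both thresholds are illustrative assumptions, not mteb defaults.
    """
    if not scores:
        raise ValueError("need at least one score")
    all_trivial = all(s >= perfect_level for s in scores)
    all_random = all(s <= random_level for s in scores)
    return not (all_trivial or all_random)
```

Calling it with the main scores of the two models listed above gives a quick pass/fail signal before opening the PR.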

@nikolay-banar nikolay-banar changed the title BEIR-NL Add new benchmark BEIR-NL Jan 30, 2025
@Samoed Samoed changed the title Add new benchmark BEIR-NL feat: Add new benchmark BEIR-NL Jan 30, 2025

@Samoed Samoed left a comment

Member

You also need to add your benchmark to the benchmark file.

@KennethEnevoldsen KennethEnevoldsen left a comment

Contributor

Great to have BEIR-NL added!

I have noted a few additional pointers - @Samoed shouldn't we have a test fail here since the descriptive statistics are missing?

domains=["Written"],
task_subtypes=[],
license="cc-by-sa-4.0",
annotations_creators="LM-generated and reviewed",

Contributor

Isn't this "derived" from the English data?

Contributor Author

Indeed, the "derived" category would fit better. I will change that.

Contributor Author

Done.

license="cc-by-sa-4.0",
annotations_creators="LM-generated and reviewed",
dialect=[],
sample_creation="machine-translated and verified",

Contributor

How were these verified?

@nikolay-banar nikolay-banar Jan 30, 2025

Contributor Author

We manually checked a small subset of translations. If it does not fit into the "verified" category, I can remove that.

Contributor

No that is perfectly fine - will you add a comment like so:

Suggested change
- sample_creation="machine-translated and verified",
+ sample_creation="machine-translated and verified", # manually checked a small subset

Contributor Author

Done.

Comment thread mteb/tasks/Retrieval/nld/CQADupstackAndroidNLRetrieval.py

@KennethEnevoldsen KennethEnevoldsen left a comment

Contributor

Added a few additional pointers on metadata

Since nothing has been run on these tasks the leaderboard will appear empty. You might consider submitting scores at least for a relevant set of models. If you don't have the resources for this I will have to figure out how we handle an empty benchmark (it might give a bug or lead to confusion on the leaderboard)

BEIR_NL = Benchmark(
    name="BEIR-NL",
    tasks=get_tasks(
        tasks=[

Contributor

Great. I can see that many models are trained on the English version of these datasets, and since they have not been annotated as trained on these datasets they will appear as zero-shot on BEIR-NL (despite being trained on e.g. FEVER). To avoid this you would need to update the model annotations (searching for "NQ", "FEVER", etc. should allow you to find the relevant cases).

Contributor Author

I submitted some results to embeddings-benchmark/results#105 and updated the model annotations.

Contributor

merged!

eval_splits=["test"],
eval_langs=["nld-Latn"],
main_score="ndcg_at_10",
date=("2024-10-01", "2024-10-01"),

Contributor

On a second round through the annotations I see that the dates do not quite match the time of the original data (Is it the time of translation?)

I checked the previous translated dataset and there we annotated the data range of the source data.

Contributor

Feel free to give your best guess here but simply annotate # best guess

Contributor Author

Indeed, these dates were the time of translation.

eval_langs=["nld-Latn"],
main_score="ndcg_at_10",
date=("2024-10-01", "2024-10-01"),
domains=["Written"],

Contributor

Seems like the domains are only minimally filled out. It seems like at least "Non-fiction" would apply to many of these.

Comment thread mteb/tasks/Retrieval/nld/ClimateFEVERNLRetrieval.py Outdated
Comment thread mteb/tasks/Retrieval/nld/FEVERNLRetrieval.py Outdated
Comment thread mteb/tasks/Retrieval/nld/MMARCONLRetrieval.py Outdated
nikolay-banar and others added 6 commits January 30, 2025 21:53

@isaac-chung isaac-chung left a comment

Collaborator

Great work! Looks like all comments have been addressed, just need to resolve the merge conflicts. @nikolay-banar would be great to have this merged into the repo soon!

"HotPotQA": ["test"],
"HotPotQAHardNegatives": ["test"],
"HotPotQA-PL": ["test"], # translated from hotpotQA (not trained on)
"HotpotQA-NL": ["test"], # translated from hotpotQA (not trained on)

Collaborator

@KennethEnevoldsen if these are "not trained on", should we still keep these? Personally I find these very confusing.

Contributor

I think translating a dataset and training on it should still lead to a non-zero-shot on the benchmark - these are just to annotate that. We could "link" the tasks and update leaderboard code (but currently that is not how it is done)

Collaborator

Got it, makes sense, thanks. The "not trained on" part was confusing for me. Maybe it could have said something closer to "trained on translation" in the future?
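
The idea of "linking" translated tasks to their sources, as raised in this thread, could look roughly like the sketch below. It is not how the leaderboard code currently works (the thread says as much). The names `TRANSLATED_FROM`, `expand_training_datasets` and `is_zero_shot` are hypothetical, and the mapping only covers the task names quoted in the diff above.

```python
# Hypothetical link table: translated task -> task it was derived from.
# Only the names that appear in the diff excerpt are listed.
TRANSLATED_FROM = {
    "HotPotQA-PL": "HotPotQA",
    "HotpotQA-NL": "HotPotQA",
}


def expand_training_datasets(trained_on):
    """Add every translated task whose source the model was trained on."""
    expanded = set(trained_on)
    for translated, source in TRANSLATED_FROM.items():
        if source in trained_on:
            expanded.add(translated)
    return expanded


def is_zero_shot(trained_on, benchmark_tasks):
    """Zero-shot only if no benchmark task was seen, directly or via its source."""
    return expand_training_datasets(trained_on).isdisjoint(benchmark_tasks)
```

With such a link table, a model annotated only with "HotPotQA" would no longer count as zero-shot on a benchmark containing "HotpotQA-NL", so the translated names would not need to be listed by hand in every model annotation.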

@nikolay-banar

Contributor Author

@isaac-chung Some tests failed, but that doesn't seem to be related to my code.

@isaac-chung

Collaborator

@nikolay-banar thanks! I'm rerunning them now.
@KennethEnevoldsen just wanted to see if you're happy with the updates. I'm happy to merge once CI passes. I think we can handle descriptive stats until v2 is merged.

@nikolay-banar

Contributor Author

@isaac-chung @KennethEnevoldsen @Samoed Thank you for your reviews!


@KennethEnevoldsen

Contributor

everything is good on my end - so will merge this in

@KennethEnevoldsen KennethEnevoldsen merged commit de8f384 into embeddings-benchmark:main Feb 4, 2025
@nikolay-banar

Contributor Author

@KennethEnevoldsen I have noticed a small bug in SCIDOCSNLRetrieval.py with eval_langs (it should be ["nld-Latn"]). Should I open a new issue for that?

@Samoed

Samoed commented Feb 5, 2025

Member

You can create a PR with the fix.
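
A small consistency check could catch this kind of eval_langs slip before merging. The sketch below works on plain dictionaries rather than mteb's metadata objects; the function name `find_wrong_eval_langs`, the task names and the incorrect value are placeholders, since the thread does not say what SCIDOCSNLRetrieval.py actually contained.

```python
EXPECTED_LANGS = ["nld-Latn"]  # the value the thread says all BEIR-NL tasks should use


def find_wrong_eval_langs(task_metadata):
    """Return the sorted names of tasks whose eval_langs differ from EXPECTED_LANGS."""
    return sorted(
        name
        for name, meta in task_metadata.items()
        if meta.get("eval_langs") != EXPECTED_LANGS
    )


# Placeholder metadata: "eng-Latn" stands in for any incorrect value.
example = {
    "TaskA-NL": {"eval_langs": ["nld-Latn"]},
    "TaskB-NL": {"eval_langs": ["eng-Latn"]},
}
print(find_wrong_eval_langs(example))
```

In a real test the dictionary would be built from the task classes under mteb/tasks/Retrieval/nld/ instead of being written by hand.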

@nikolay-banar nikolay-banar deleted the beirnl-branch branch July 2, 2025 14:33
@EwoutH

EwoutH commented Aug 1, 2025


This is awesome work!

In the “Language-specific” section of the sidebar on the leaderboard, there isn’t currently a filter for Dutch. Could one be added?

@EwoutH EwoutH mentioned this pull request Aug 1, 2025
8 tasks
@isaac-chung

Collaborator

BEIR-NL is currently available under miscellaneous:
Screenshot_20250802-093156.png
