feat: Add new benchmark BEIR-NL by nikolay-banar · Pull Request #1909 · embeddings-benchmark/mteb

nikolay-banar · 2025-01-30T17:16:41Z

We recently published BEIR-NL, a Dutch-translated version of BEIR.

Adding datasets checklist

Reason for dataset addition: BEIR-NL, a new benchmark for retrieval in Dutch.

I have run the following models on the task (adding the results to the PR). These can be run using the mteb -m {model_name} -t {task_name} command.
- sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
- intfloat/multilingual-e5-small
I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform()
I have filled out the metadata object in the dataset file (find documentation on it here).
Run tests locally to make sure nothing is broken using make test.
Run the formatter to format the code using make lint.
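The "neither trivial nor random" item in the checklist above can be sketched as a small check. This is only an illustration: the function name `is_informative`, the thresholds, and the example scores are assumptions made for this sketch, not part of mteb and not results from this PR.

```python
def is_informative(scores, floor=0.05, ceiling=0.95):
    """Return True unless every model scores near-random or near-perfect.

    `scores` maps a model name to its main score in [0, 1].
    `floor` and `ceiling` are hypothetical thresholds chosen for illustration.
    """
    values = list(scores.values())
    if not values:
        raise ValueError("at least one score is required")
    trivial = all(v >= ceiling for v in values)   # all models close to perfect
    random_like = all(v <= floor for v in values)  # all models close to random
    return not (trivial or random_like)


# Placeholder values for illustration only; these are not results from this PR.
example_scores = {"model_a": 0.31, "model_b": 0.42}
print(is_informative(example_scores))
```

The thresholds would need to be adapted to the metric in use; what counts as a near-random score differs between tasks.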

Samoed

You also need to add your benchmark to the benchmark file

KennethEnevoldsen

Great to have BEIR-NL added!

I have noted a few additional pointers - @Samoed shouldn't we have a test fail here since the descriptive statistics are missing?

KennethEnevoldsen · 2025-01-30T19:39:43Z

+        domains=["Written"],
+        task_subtypes=[],
+        license="cc-by-sa-4.0",
+        annotations_creators="LM-generated and reviewed",


isn't this "derived" from the english data?

Indeed, the "derived" category would fit better. I will change that.

KennethEnevoldsen · 2025-01-30T19:40:49Z

+        license="cc-by-sa-4.0",
+        annotations_creators="LM-generated and reviewed",
+        dialect=[],
+        sample_creation="machine-translated and verified",


How were these verified?

We manually checked a small subset of translations. If it does not fit into the "verified" category, I can remove that.

No that is perfectly fine - will you add a comment like so:

Suggested change

sample_creation="machine-translated and verified",

sample_creation="machine-translated and verified", # manually checked a small subset

KennethEnevoldsen

Added a few additional pointers on metadata

Since nothing has been run on these tasks, the leaderboard will appear empty. You might consider submitting scores at least for a relevant set of models. If you don't have the resources for this, I will have to figure out how we handle an empty benchmark (it might give a bug or lead to confusion on the leaderboard).

KennethEnevoldsen · 2025-01-30T20:18:14Z

+BEIR_NL = Benchmark(
+    name="BEIR-NL",
+    tasks=get_tasks(
+        tasks=[


Great - I can see that many models are trained on the English version of these datasets, and since they have not been annotated as trained on these datasets, they will appear as zero-shot on BEIR-NL (despite being trained on e.g. FEVER). To avoid this you would need to update the model annotations (searching for "NQ", "FEVER", etc. should allow you to find the relevant cases and update the annotations).

I submitted some results to embeddings-benchmark/results#105 and updated the model annotations.
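To illustrate why the annotations matter, here is a minimal sketch of a zero-shot check driven by a model's training-dataset annotations. The helper `is_zero_shot` is hypothetical and is not the leaderboard's actual code; the annotation format (task name mapped to a list of splits) follows the diff context quoted in this PR.

```python
def is_zero_shot(training_datasets, benchmark_tasks):
    """Return True if none of the benchmark tasks appear in the annotations.

    `training_datasets` maps a task name to a list of splits, or is None when
    the model has no annotations. Returns None in that case, since the
    zero-shot status is then unknown.
    """
    if training_datasets is None:
        return None
    return not any(task in training_datasets for task in benchmark_tasks)


benchmark_tasks = ["HotpotQA-NL"]

# Annotated only with the English task: the model looks zero-shot on the Dutch one.
before = {"HotPotQA": ["test"]}
print(is_zero_shot(before, benchmark_tasks))

# After adding the translated task to the annotations, it no longer does.
after = {"HotPotQA": ["test"], "HotpotQA-NL": ["test"]}
print(is_zero_shot(after, benchmark_tasks))
```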

KennethEnevoldsen · 2025-01-30T20:25:27Z

+        eval_splits=["test"],
+        eval_langs=["nld-Latn"],
+        main_score="ndcg_at_10",
+        date=("2024-10-01", "2024-10-01"),


On a second round through the annotations I see that the dates do not quite match the time of the original data (Is it the time of translation?)

I checked the previous translated dataset and there we annotated the data range of the source data.

Feel free to give your best guess here but simply annotate # best guess

Indeed, these dates were the time of translation.

KennethEnevoldsen · 2025-01-30T20:30:43Z

+        eval_langs=["nld-Latn"],
+        main_score="ndcg_at_10",
+        date=("2024-10-01", "2024-10-01"),
+        domains=["Written"],


Seems like the domains are only minimally filled out. It seems like at least "Non-fiction" would apply to many of these.

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

isaac-chung

Great work! Looks like all comments have been addressed, just need to resolve the merge conflicts. @nikolay-banar would be great to have this merged into the repo soon!

isaac-chung · 2025-02-03T05:04:28Z

        "HotPotQA": ["test"],
        "HotPotQAHardNegatives": ["test"],
        "HotPotQA-PL": ["test"],  # translated from hotpotQA (not trained on)
+        "HotpotQA-NL": ["test"],  # translated from hotpotQA (not trained on)


@KennethEnevoldsen if these are "not trained on", should we still keep these? Personally I find these very confusing.

I think translating a dataset and training on it should still lead to a non-zero-shot on the benchmark - these are just to annotate that. We could "link" the tasks and update leaderboard code (but currently that is not how it is done)

Got it, makes sense, thanks. The "not trained on" part was confusing for me. Maybe it could have said something closer to "trained on translation" in the future?
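The "linking" idea mentioned above could look roughly like the sketch below. It is not how mteb currently works (as noted in the thread); the `TRANSLATED_FROM` mapping and the `expand_with_translations` helper are hypothetical names introduced for illustration.

```python
# Hypothetical mapping from a translated task to its source task.
TRANSLATED_FROM = {
    "HotPotQA-PL": "HotPotQA",
    "HotpotQA-NL": "HotPotQA",
}


def expand_with_translations(training_datasets, translated_from=TRANSLATED_FROM):
    """Return a copy of the annotations that also lists linked translations.

    If a model is annotated with a source task, every task translated from it
    is added with the same splits. Existing entries are left untouched.
    """
    expanded = {name: list(splits) for name, splits in training_datasets.items()}
    for translated, source in translated_from.items():
        if source in training_datasets and translated not in expanded:
            expanded[translated] = list(training_datasets[source])
    return expanded


annotations = {"HotPotQA": ["test"]}
print(expand_with_translations(annotations))
```

With such a link in place, each translated task would not need its own manually added line in every model's annotations.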

nikolay-banar · 2025-02-03T11:56:28Z

@isaac-chung Some tests failed, but that doesn't seem to be related to my code.

isaac-chung · 2025-02-03T12:15:09Z

@nikolay-banar thanks! I'm rerunning them now.
@KennethEnevoldsen just wanted to see if you're happy with the updates. I'm happy to merge once CI passes. I think we can handle descriptive stats until v2 is merged.

nikolay-banar · 2025-02-03T15:23:48Z

@isaac-chung @KennethEnevoldsen @Samoed Thank you for your reviews!

KennethEnevoldsen · 2025-02-04T16:34:36Z

+BEIR_NL = Benchmark(
+    name="BEIR-NL",
+    tasks=get_tasks(
+        tasks=[


KennethEnevoldsen · 2025-02-04T16:36:17Z

everything is good on my end - so will merge this in

nikolay-banar · 2025-02-05T10:27:10Z

@KennethEnevoldsen I have noticed a small bug in SCIDOCSNLRetrieval.py with eval_langs (it should be ["nld-Latn"]). Should I open a new issue for that?

Samoed · 2025-02-05T10:35:20Z

You can create a PR with the fix
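A small consistency check could catch this kind of eval_langs slip before merging. The sketch below works on plain dictionaries rather than mteb task objects; the task name "SCIDOCS-NL" and the incorrect value shown are assumptions for illustration, since the thread does not state what the wrong value was.

```python
EXPECTED_LANGS = ["nld-Latn"]


def find_lang_mismatches(task_langs, expected=EXPECTED_LANGS):
    """Return the sorted names of tasks whose eval_langs differ from `expected`."""
    return sorted(
        name for name, langs in task_langs.items() if list(langs) != list(expected)
    )


# Hypothetical metadata snapshot; the wrong value below is a placeholder.
example = {
    "HotpotQA-NL": ["nld-Latn"],
    "SCIDOCS-NL": ["nld"],
}
print(find_lang_mismatches(example))
```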

EwoutH · 2025-08-01T20:27:56Z

This is awesome work!

In the “Language-specific” section of the sidebar on the leaderboard, there isn’t currently a filter for Dutch. Could one be added?

isaac-chung · 2025-08-02T06:33:05Z

Beir-NL is currently available under miscellaneous:

BEIR-NL datasets

de7061f

nikolay-banar changed the title ~~BEIR-NL~~ Add new benchmark BEIR-NL Jan 30, 2025

Samoed changed the title ~~Add new benchmark BEIR-NL~~ feat: Add new benchmark BEIR-NL Jan 30, 2025

Samoed reviewed Jan 30, 2025


BEIR-NL added to benchmarks

6cac6f8

KennethEnevoldsen reviewed Jan 30, 2025


nikolay-banar added 2 commits January 30, 2025 21:00

BEIR-NL annotations_creators changed to derived

b577b3d

BEIR-NL sample_creation clarified

66a702b

KennethEnevoldsen reviewed Jan 30, 2025


nikolay-banar and others added 6 commits January 30, 2025 21:53

Update mteb/tasks/Retrieval/nld/MMARCONLRetrieval.py

b6dce7c

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

Update mteb/tasks/Retrieval/nld/FEVERNLRetrieval.py

dc4d917

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

Update mteb/tasks/Retrieval/nld/ClimateFEVERNLRetrieval.py

b6c12c9

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

descriptions of models are changed to include BEIR-NL

c5a9a70

dates for BEIR-NL fixed

4e48ead

more metadata annotations for BEIR-NL

2dad2fa

nikolay-banar mentioned this pull request Jan 31, 2025

Add initial BEIR-NL retrieval results embeddings-benchmark/results#105

Merged

2 tasks

isaac-chung reviewed Feb 3, 2025


Merge branch 'main' into beirnl-branch

6fb7b21

isaac-chung approved these changes Feb 3, 2025


KennethEnevoldsen approved these changes Feb 4, 2025


KennethEnevoldsen merged commit de8f384 into embeddings-benchmark:main Feb 4, 2025

nikolay-banar deleted the beirnl-branch branch July 2, 2025 14:33

EwoutH mentioned this pull request Aug 1, 2025

model: Add Qwen3 Embedding model #2769

Merged

8 tasks
