feat: Add new benchmark BEIR-NL #1909
KennethEnevoldsen left a comment
Great to have BEIR-NL added!
I have noted a few additional pointers - @Samoed shouldn't we have a test fail here since the descriptive statistics are missing?
| domains=["Written"], | ||
| task_subtypes=[], | ||
| license="cc-by-sa-4.0", | ||
| annotations_creators="LM-generated and reviewed", |
isn't this "derived" from the english data?
Indeed, the "derived" category would fit better. I will change that.
| license="cc-by-sa-4.0", | ||
| annotations_creators="LM-generated and reviewed", | ||
| dialect=[], | ||
| sample_creation="machine-translated and verified", |
How were these verified?
We manually checked a small subset of translations. If it does not fit into the "verified" category, I can remove that.
No that is perfectly fine - will you add a comment like so:
| sample_creation="machine-translated and verified", | |
| sample_creation="machine-translated and verified", # manually checked a small subset |
KennethEnevoldsen left a comment
Added a few additional pointers on metadata
Since nothing has been run on these tasks, the leaderboard will appear empty. You might consider submitting scores for at least a relevant set of models. If you don't have the resources for this, I will have to figure out how we handle an empty benchmark (it might cause a bug or lead to confusion on the leaderboard).
| BEIR_NL = Benchmark( | ||
| name="BEIR-NL", | ||
| tasks=get_tasks( | ||
| tasks=[ |
Great - I can see that many models are trained on the English version of these datasets. Since they have not been annotated as trained on these datasets, they will appear as zero-shot on BEIR-NL (despite being trained on e.g. FEVER). To avoid this you would need to update the model annotations (searching for "NQ", "FEVER" etc. should allow you to find the relevant cases and update the annotations).
I submitted some results to embeddings-benchmark/results#105 and updated the model annotations.
| eval_splits=["test"], | ||
| eval_langs=["nld-Latn"], | ||
| main_score="ndcg_at_10", | ||
| date=("2024-10-01", "2024-10-01"), |
On a second pass through the annotations I see that the dates do not quite match the time of the original data (is it the time of translation?).
I checked the previously translated dataset, and there we annotated the date range of the source data.
Feel free to give your best guess here, but annotate it with a "# best guess" comment.
Indeed, these dates were the time of translation.
| eval_langs=["nld-Latn"], | ||
| main_score="ndcg_at_10", | ||
| date=("2024-10-01", "2024-10-01"), | ||
| domains=["Written"], |
Seems like the domains are only minimally filled out. It seems like at least "Non-fiction" would apply to many of these.
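Taken together, the metadata comments in this review point to roughly the field values below. This is only a sketch: it uses a plain Python dict rather than the project's real metadata object, the variable name `reviewed_fields` is hypothetical, and the `date` field is left out because the thread does not state the source-data range. "Non-fiction" is shown as an example for a task where it applies, not as a value for every task.

```python
# Illustrative only: a plain dict standing in for the metadata fields
# discussed in this review. Field names come from the diff snippets;
# `reviewed_fields` is a hypothetical name, not part of the codebase.
reviewed_fields = {
    "annotations_creators": "derived",  # derived from the English data
    "sample_creation": "machine-translated and verified",  # manually checked a small subset
    "domains": ["Written", "Non-fiction"],  # "Non-fiction" only where it applies
    "license": "cc-by-sa-4.0",
    "eval_langs": ["nld-Latn"],
    "main_score": "ndcg_at_10",
}

# Print the fields in the same key=value style as the diff snippets.
for field, value in reviewed_fields.items():
    print(f"{field}={value!r}")
```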
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
isaac-chung left a comment
Great work! Looks like all comments have been addressed, just need to resolve the merge conflicts. @nikolay-banar would be great to have this merged into the repo soon!
| "HotPotQA": ["test"], | ||
| "HotPotQAHardNegatives": ["test"], | ||
| "HotPotQA-PL": ["test"], # translated from hotpotQA (not trained on) | ||
| "HotpotQA-NL": ["test"], # translated from hotpotQA (not trained on) |
@KennethEnevoldsen if these are "not trained on", should we still keep these? Personally I find these very confusing.
I think translating a dataset and training on it should still count as non-zero-shot on the benchmark; these entries exist just to annotate that. We could "link" the tasks and update the leaderboard code, but that is not how it is currently done.
Got it, makes sense, thanks. The "not trained on" part was confusing for me. Maybe it could have said something closer to "trained on translation" in the future?
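To make the point of this thread concrete, here is a simplified sketch of why the translated task names are listed in a model's training annotations. It is not the leaderboard's actual code: the function `is_zero_shot` and the lookup by exact task name are assumptions made for illustration, and the dict only mirrors the entries shown in the diff.

```python
# Simplified illustration, not the leaderboard's real implementation.
# The dict mirrors the annotation entries in the diff: task name -> splits.
training_datasets = {
    "HotPotQA": ["test"],
    "HotPotQAHardNegatives": ["test"],
    "HotPotQA-PL": ["test"],  # translation of HotPotQA
    "HotpotQA-NL": ["test"],  # translation of HotPotQA
}


def is_zero_shot(benchmark_tasks, annotated):
    """Return True only if no benchmark task is annotated as trained on.

    Hypothetical helper: matches tasks by exact name, so a translated task
    counts as seen only if it is listed explicitly.
    """
    return not any(task in annotated for task in benchmark_tasks)


# With the translated name listed, the model is not zero-shot on the Dutch task.
print(is_zero_shot(["HotpotQA-NL"], training_datasets))  # False

# Without it, exact-name matching would wrongly report zero-shot.
print(is_zero_shot(["HotpotQA-NL"], {"HotPotQA": ["test"]}))  # True
```

Linking tasks to their source dataset, as suggested in the thread, would remove the need to list every translation by hand.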
@isaac-chung Some tests failed, but that doesn't seem to be related to my code.
@nikolay-banar thanks! I'm rerunning them now.
@isaac-chung @KennethEnevoldsen @Samoed Thank you for your reviews!
Everything is good on my end, so I will merge this in.
@KennethEnevoldsen I have noticed a small bug in SCIDOCSNLRetrieval.py with eval_langs (it should be ["nld-Latn"]). Should I open a new issue for that?
You can create a PR with the fix.
This is awesome work! In the “Language-specific” section of the sidebar on the leaderboard, there isn’t currently a filter for Dutch. Could one be added?

We recently published BEIR-NL, which is a Dutch translated version of BEIR.
Adding datasets checklist
Reason for dataset addition: BEIR-NL, a new benchmark for retrieval in Dutch.
mteb -m {model_name} -t {task_name} command.
sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
intfloat/multilingual-e5-small
self.stratified_subsampling() under dataset_transform()
make test.
make lint.
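As a small convenience, the checklist command can be expanded for both baseline models. The sketch below only builds the command strings and does not run them. It assumes the command form exactly as written in the checklist (the CLI of an installed version may differ), and `HotpotQA-NL` is used as an example task name taken from the annotation diff earlier in this thread; swap in the task under test.

```python
import shlex

# The two baseline models named in the checklist.
models = [
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "intfloat/multilingual-e5-small",
]

# Assumption: example task name from the annotation diff in this thread.
task_name = "HotpotQA-NL"

# Build one command per model, quoting arguments for safe shell use.
commands = [
    f"mteb -m {shlex.quote(model)} -t {shlex.quote(task_name)}"
    for model in models
]

for command in commands:
    print(command)
```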