dataset: Add TREC DL by whybe-choi · Pull Request #3379 · embeddings-benchmark/mteb

whybe-choi · 2025-10-16T04:47:34Z

This pull request adds support for two new retrieval tasks, TRECDL2019 and TRECDL2020, to the English retrieval task suite. It introduces their implementations, descriptive statistics, and integrates them into the task registry, expanding the benchmark coverage for TREC Deep Learning tracks.

New Retrieval Task Support

Implemented TRECDL2019 and TRECDL2020 retrieval tasks in trecdl_retrieval.py, including metadata, dataset references, and evaluation details.
Added descriptive statistics JSON files for both tasks: TRECDL2019.json and TRECDL2020.json, providing sample counts and text/query statistics.

Task Registry Integration

Registered TRECDL2019 and TRECDL2020 in the English retrieval task module (__init__.py), making them available for evaluation and selection.

whybe-choi · 2025-10-17T05:23:45Z

Hello, @orionw and @Samoed !
If you have time, could you check the PR to see if proceeding in this manner is okay?

I uploaded the dataset to my Hugging Face repo for testing, as follows:

Samoed · 2025-10-17T06:27:26Z

The task looks good! Can you target the v2 branch, compute statistics for the task, and rename the file to snake case?

whybe-choi · 2025-10-17T06:34:06Z

Thanks for your feedback! I'll update the pull request to target the v2 branch, compute the task statistics, and rename the file using snake case as requested.

whybe-choi · 2025-10-17T12:27:47Z

I have incorporated all of your feedback. You can check the statistics at the following links:

Samoed · 2025-10-17T13:32:09Z

Yes, that's good, but it's in v1 format. Can you recompute it and add it to the repo?

whybe-choi · 2025-10-17T13:37:47Z

Do you mean I should add the statistics to https://github.com/embeddings-benchmark/mteb/blob/main/docs/tasks.md?

Samoed · 2025-10-17T13:39:05Z

> Do you mean I should add the statistics to https://github.com/embeddings-benchmark/mteb/blob/main/docs/tasks.md?

No, it should appear in the descriptive_stats folder, and you should commit it.

Samoed

Great! Thank you for the addition.

whybe-choi · 2025-10-17T14:52:51Z

I was able to contribute more easily thanks to your kind explanation 🙂

orionw · 2025-10-17T16:58:19Z

Amazing, thanks so much @whybe-choi and @Samoed for the feedback!

Before we merge though, I think there may be too many queries? I think TREC DL 19 and 20 have under 100 queries but I see quite a few more here. Let me get the stats and update this.

whybe-choi · 2025-10-17T17:03:21Z

I think it's because all queries from the train, dev, and test subsets are combined.
Would it be enough to include only the queries from the test set?

orionw · 2025-10-17T17:08:12Z

Yes, my apologies @whybe-choi, this turned out to be more complicated than I thought when I linked the data. That website is where TREC links to, but for some reason they didn't include the judged queries. From the paper:

Participants were provided with an initial set of 200 test queries, then NIST later selected 43 queries during the pooling and judging process, based on budget...

The official test sets of DL19 and DL20 are much smaller (e.g. 43 for DL19 and 45 for DL20), because only a subset of the test queries was annotated. Weirdly, I cannot find these on any TREC website, but maybe I am being dense.

The easiest way I see to get those is to install the Python package ir_datasets separately for processing and then save them as JSONL files: https://ir-datasets.com/msmarco-document.html#msmarco-document/trec-dl-2019/judged (similarly for DL20; use the judged version), with the command ir_datasets export msmarco-document/trec-dl-2019/judged queries --format jsonl. As before, the corpus is the same, but you'd need to convert the queries and qrels to MTEB format.

Again, I am so sorry about misdirecting you and thanks for your already excellent work here!

> I think it's because all queries from the train, dev, and test subsets are combined.

This is also part of it, although not the main reason -- but yes, only include the test ones!
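
The conversion described above could look roughly like the sketch below. It assumes the judged queries and qrels have already been read into memory as (query_id, text) and (query_id, doc_id, relevance) tuples, for example from an ir_datasets export. The helper names and the "_id"/"text" query fields are assumptions for illustration; the qrel field names follow the rows quoted later in this thread.

```python
import json


def queries_to_jsonl(queries):
    """Serialize (query_id, text) pairs, one JSON object per line."""
    return [json.dumps({"_id": str(qid), "text": text}) for qid, text in queries]


def qrels_to_jsonl(qrels):
    """Serialize (query_id, doc_id, relevance) triples, keeping every grade (0 included)."""
    return [
        json.dumps({"query-id": str(qid), "corpus-id": str(did), "score": int(rel)})
        for qid, did, rel in qrels
    ]


# Toy records standing in for the exported judged queries and qrels.
# The query text is a placeholder, not the real TREC DL query.
example_queries = [("23849", "placeholder query text")]
example_qrels = [("23849", "1020327", 2), ("23849", "1120730", 0)]

query_lines = queries_to_jsonl(example_queries)
qrel_lines = qrels_to_jsonl(example_qrels)
print("\n".join(query_lines + qrel_lines))
```

IDs are written as strings and scores as integers so that the output matches the shape of the qrel rows shown in the thread.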

whybe-choi · 2025-10-17T17:12:16Z

That's fine. Thank you for the kind guidance. I will get back to work and ask for a review again!

whybe-choi · 2025-10-19T05:11:24Z

@orionw Whenever you have time, could you please review it again?

orionw · 2025-10-19T15:35:28Z

Queries look perfect! The qrels still seem quite large, though (9k qrels for ~50 queries?). If the file contains all qrels, you could probably filter by the query IDs in the queries.
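
A minimal sketch of the suggested filter, assuming queries and qrels are loaded as lists of dicts. The function name, the "_id" query field, and the query ID "99999" are hypothetical and used only for illustration.

```python
def filter_qrels(qrels, queries):
    """Keep only the qrels whose query-id appears among the queries."""
    keep_ids = {q["_id"] for q in queries}
    return [row for row in qrels if row["query-id"] in keep_ids]


# Toy data: query "99999" is a made-up ID that is not among the judged queries.
queries = [{"_id": "23849", "text": "placeholder query text"}]
qrels = [
    {"query-id": "23849", "corpus-id": "1020327", "score": 2},
    {"query-id": "99999", "corpus-id": "1034183", "score": 1},
]

filtered = filter_qrels(qrels, queries)
print(len(filtered))
```

The filter only looks at query IDs, so judgments with a score of 0 are left in place.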

whybe-choi · 2025-10-19T15:47:17Z

I checked the dataset, and all qrels are indeed for the test query IDs, but the original dataset includes items whose score is 0. Would it be okay to simply delete those?

{"query-id": "23849", "corpus-id": "1020327", "score": 2}
{"query-id": "23849", "corpus-id": "1034183", "score": 3}
{"query-id": "23849", "corpus-id": "1120730", "score": 0}
{"query-id": "23849", "corpus-id": "1139571", "score": 1}
{"query-id": "23849", "corpus-id": "1143724", "score": 0}
{"query-id": "23849", "corpus-id": "1147202", "score": 0}
{"query-id": "23849", "corpus-id": "1150311", "score": 0}
{"query-id": "23849", "corpus-id": "1158886", "score": 2}
...

orionw · 2025-10-19T15:50:06Z

Ah oops, forgot how deeply judged these datasets are. That number looks right according to ir_datasets. Definitely don't delete them!

LGTM!
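
One way to sanity-check the qrel count is to tally judgments per query and per relevance grade. The sketch below uses only the eight rows quoted above for query 23849 (the original listing continues past them). At roughly 9k qrels over about 50 queries, the full files would average on the order of 180 judgments per query, which fits deeply judged topics.

```python
import json
from collections import Counter

# The eight qrel rows quoted in the thread for query 23849.
rows = """\
{"query-id": "23849", "corpus-id": "1020327", "score": 2}
{"query-id": "23849", "corpus-id": "1034183", "score": 3}
{"query-id": "23849", "corpus-id": "1120730", "score": 0}
{"query-id": "23849", "corpus-id": "1139571", "score": 1}
{"query-id": "23849", "corpus-id": "1143724", "score": 0}
{"query-id": "23849", "corpus-id": "1147202", "score": 0}
{"query-id": "23849", "corpus-id": "1150311", "score": 0}
{"query-id": "23849", "corpus-id": "1158886", "score": 2}
""".splitlines()

records = [json.loads(line) for line in rows]

# Number of judged documents per query, and how many fall in each grade.
per_query = Counter(r["query-id"] for r in records)
per_score = Counter(r["score"] for r in records)

print(per_query["23849"])              # 8
print(dict(sorted(per_score.items())))  # {0: 4, 1: 1, 2: 2, 3: 1}
```

Half of the quoted judgments are grade 0, which shows how much of a deeply judged qrels file can consist of explicit non-relevant labels.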

Samoed · 2025-10-19T18:23:35Z

@whybe-choi Can this PR be merged?

whybe-choi · 2025-10-20T01:22:16Z

@Samoed I think it is ready to merge. However, I uploaded the dataset to my own Hugging Face repository. Is that okay?

Samoed · 2025-10-20T05:34:36Z

Yes, that is okay, but you need to update the repository revision if you updated the qrels.

whybe-choi · 2025-10-20T06:04:48Z

This is already the revision that reflects the correct qrels. As far as I know, there shouldn't be any problem!

yjoonjang · 2025-10-20T11:19:32Z

Hi, @whybe-choi @orionw. Thanks for your implementation of the TREC DL 2019 and 2020 datasets.
It was great to find this dataset while I was looking for this data for my research experiment.

I have one question, though. On ir_datasets, it looks like there are 3.2M docs for TREC DL 2019, but I see 8.8M in @whybe-choi's dataset.

Do you have any idea why? I'm not sure which is correct.

whybe-choi · 2025-10-20T11:24:32Z

Hello, @yjoonjang !
The corpus you uploaded is msmarco-document, while the corpus I uploaded is msmarco-passage. Therefore, it seems there is no issue with my dataset.

yjoonjang · 2025-10-20T11:27:31Z

Ahh okay. Thank you for your help !!

feat: create TRECDLRetrieval

4d380e1

feat: add TREC Deep Learning 2019 and 2020 retrieval tasks

273e09f

whybe-choi force-pushed the dataset/trec-dl branch from 9acec73 to 273e09f October 17, 2025 05:26

Samoed marked this pull request as ready for review October 17, 2025 06:21

whybe-choi changed the base branch from main to v2.0.0 October 17, 2025 06:32

Samoed added the "new dataset" label (Issues related to adding a new task or dataset) Oct 17, 2025

feat: rename TRECDLRetrieval.py to trecdl_retrieval.py

b39d078

whybe-choi force-pushed the dataset/trec-dl branch from e6458be to b39d078 October 17, 2025 12:25

Samoed reviewed Oct 17, 2025

Comment thread mteb/tasks/retrieval/eng/trecdl_retrieval.py

whybe-choi added 5 commits October 17, 2025 22:50

Merge remote-tracking branch 'upstream/v2.0.0' into dataset/trec-dl

9308c67

fix: replace relative imports from parent modules with absolute imports

182cc9f

feat: add TRECDL retrieval tasks to English registry

d039024

fix: correct task category

4267e08

feat: add TRECDL2019 and TRECDL2020 descriptive stats

df5b146

Samoed reviewed Oct 17, 2025

Comment thread mteb/descriptive_stats/Retrieval/TREDDL2019.json Outdated

Samoed requested a review from orionw October 17, 2025 14:18

fix: recompute descriptive stats to match v2 format

73c3fb9

Samoed requested a review from KennethEnevoldsen October 17, 2025 14:49

Samoed approved these changes Oct 17, 2025

fix: use official judged queries for TREC-DL 2019/2020 datasets

209bdd4

KennethEnevoldsen approved these changes Oct 19, 2025

orionw approved these changes Oct 19, 2025

Resolve merge conflict in docs/adding_a_benchmark.md

f0303d9

whybe-choi force-pushed the dataset/trec-dl branch from a88e955 to f0303d9 October 19, 2025 16:23

Samoed merged commit ca8d313 into embeddings-benchmark:v2.0.0 Oct 20, 2025
11 checks passed

whybe-choi deleted the dataset/trec-dl branch October 20, 2025 07:24
