dataset: Add TREC DL #3379

Merged
Samoed merged 11 commits into
embeddings-benchmark:v2.0.0 from
whybe-choi:dataset/trec-dl
Oct 20, 2025

Conversation

@whybe-choi

@whybe-choi whybe-choi commented Oct 16, 2025

Contributor

Close #3348

This pull request adds support for two new retrieval tasks, TRECDL2019 and TRECDL2020, to the English retrieval task suite. It introduces their implementations, descriptive statistics, and integrates them into the task registry, expanding the benchmark coverage for TREC Deep Learning tracks.

New Retrieval Task Support

  • Implemented TRECDL2019 and TRECDL2020 retrieval tasks in trecdl_retrieval.py, including metadata, dataset references, and evaluation details.
  • Added descriptive statistics JSON files for both tasks: TRECDL2019.json and TRECDL2020.json, providing sample counts and text/query statistics.

Task Registry Integration

  • Registered TRECDL2019 and TRECDL2020 in the English retrieval task module (__init__.py), making them available for evaluation and selection.

@whybe-choi

whybe-choi commented Oct 17, 2025

Contributor Author

Hello, @orionw and @Samoed !
If you have time, could you check the PR to see if proceeding in this manner is okay?

I uploaded the dataset to my Hugging Face repo as follows for testing:

@Samoed Samoed marked this pull request as ready for review October 17, 2025 06:21
@Samoed

Samoed commented Oct 17, 2025

Member

Task looks good! Can you target the v2 branch, compute statistics for the task, and rename the file to snake case?

@whybe-choi whybe-choi changed the base branch from main to v2.0.0 October 17, 2025 06:32
@whybe-choi

Contributor Author

Thanks for your feedback! I'll update the pull request to target the v2 branch, compute the task statistics, and rename the file using snake case as requested.

@Samoed Samoed added the new dataset Issues related to adding a new task or dataset label Oct 17, 2025
@whybe-choi

whybe-choi commented Oct 17, 2025

Contributor Author

I have incorporated all of your feedback. You can check the statistics at the following links:

@Samoed

Samoed commented Oct 17, 2025

Member

Yes, that's good, but it's in v1 format. Can you recompute it and add it to the repo?

@whybe-choi

Contributor Author

Do you mean I should add the statistics to https://github.com/embeddings-benchmark/mteb/blob/main/docs/tasks.md?

Comment thread mteb/tasks/retrieval/eng/trecdl_retrieval.py
@Samoed

Samoed commented Oct 17, 2025

Member

Do you mean I should add the statistics to https://github.com/embeddings-benchmark/mteb/blob/main/docs/tasks.md?

No, it should appear in the descriptive_stats folder, and you should commit it.

Comment thread mteb/descriptive_stats/Retrieval/TREDDL2019.json Outdated
@Samoed Samoed requested a review from orionw October 17, 2025 14:18

@Samoed Samoed left a comment

Member

Great! Thank you for the addition.

@whybe-choi

Contributor Author

I was able to contribute more easily thanks to your kind explanation 🙂

@orionw

orionw commented Oct 17, 2025

Contributor

Amazing, thanks so much @whybe-choi and @Samoed for the feedback!

Before we merge though, I think there may be too many queries? I think TREC DL 19 and 20 have under 100 queries but I see quite a few more here. Let me get the stats and update this.

@whybe-choi

Contributor Author

I think it's because all queries from the train, dev, and test subsets are combined.
Would it be enough to include only the queries from the test set?

@orionw

orionw commented Oct 17, 2025

Contributor

Yes, my apologies @whybe-choi, this turned out to be more complicated than I thought when I linked the data. That website is where TREC links to, but for some reason they didn't include the judged queries. From the paper:

Participants were provided with an initial set of 200 test queries, then NIST later selected 43 queries during the pooling and judging process, based on budget...

The official test sets of DL19 and DL20 are much much smaller (e.g. 43 for DL19 and 45 for DL20), but that's because they took a subset of the test queries for annotations. Weirdly, I cannot find these on any TREC website but maybe I am being dense.

The easiest way I see to get those is to install the Python package ir_datasets separately for processing and then save the judged subset as JSONL files: https://ir-datasets.com/msmarco-document.html#msmarco-document/trec-dl-2019/judged (similarly for DL20; use the judged version), with the command ir_datasets export msmarco-document/trec-dl-2019/judged queries --format jsonl. As before, the corpus is the same, but you'd need to update the queries and qrels in MTEB format.
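The qrels half of that conversion can be sketched roughly as follows. This is a minimal illustration, not the script used in this PR: it assumes the judgments are available as standard TREC qrels lines (`query-id iteration doc-id relevance`), and the helper names `qrels_to_mteb` and `to_jsonl` are hypothetical. The output keys (`query-id`, `corpus-id`, `score`) follow the qrels layout shown later in this thread.

```python
import json


def qrels_to_mteb(trec_qrels_lines):
    """Convert TREC-style qrels lines to MTEB-style qrels rows.

    Assumes each line is 'query-id iteration doc-id relevance'
    (hypothetical helper, not part of mteb or ir_datasets).
    """
    rows = []
    for line in trec_qrels_lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        query_id, _iteration, doc_id, relevance = parts
        rows.append(
            {"query-id": query_id, "corpus-id": doc_id, "score": int(relevance)}
        )
    return rows


def to_jsonl(rows):
    """Serialize rows as one JSON object per line."""
    return "\n".join(json.dumps(row) for row in rows)


# IDs and scores taken from the qrels excerpt quoted in this thread.
sample = [
    "23849 Q0 1020327 2",
    "23849 Q0 1034183 3",
    "23849 Q0 1120730 0",
]
mteb_qrels = qrels_to_mteb(sample)
print(to_jsonl(mteb_qrels))
```

Zero-relevance rows are kept as-is here, so the converted file has the same number of judgments as the source.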

Again, I am so sorry about misdirecting you and thanks for your already excellent work here!

I think it's because all queries from the train, dev, and test subsets are combined.

This is also part of it, although not the main reason -- but yes only include the test ones!

@whybe-choi

Contributor Author

That's fine. Thank you for the kind guidance. I will get back to work and ask for a review again!

@whybe-choi

Contributor Author

@orionw Whenever you have time, could you please review it again?

@orionw

orionw commented Oct 19, 2025

Contributor

Queries look perfect! The qrels still seem quite large, though (9k qrels for ~50 queries?). If that contains all qrels, you could probably filter by the query IDs in the queries.
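The suggested filter could look something like the sketch below. It assumes queries are stored as dicts with an `_id` key and qrels as dicts with a `query-id` key; the function name `filter_qrels_by_queries`, the query text, and the query ID `99999` are hypothetical and only there for illustration.

```python
def filter_qrels_by_queries(qrels, queries):
    """Keep only qrels whose query-id appears in the query set.

    Hypothetical helper; assumes queries carry their ID under '_id'.
    """
    judged_ids = {query["_id"] for query in queries}
    return [row for row in qrels if row["query-id"] in judged_ids]


# Placeholder query text; only the ID matters for the filter.
queries = [{"_id": "23849", "text": "placeholder query text"}]

qrels = [
    {"query-id": "23849", "corpus-id": "1020327", "score": 2},
    {"query-id": "23849", "corpus-id": "1120730", "score": 0},
    # '99999' is a made-up ID standing in for a query outside the test set.
    {"query-id": "99999", "corpus-id": "1034183", "score": 3},
]

filtered = filter_qrels_by_queries(qrels, queries)
print(len(qrels), "->", len(filtered))
```

If the row count does not change after filtering, the qrels were already restricted to the judged queries and the size comes from somewhere else.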

@whybe-choi

Contributor Author

I checked the dataset, and all qrels were indeed for the test query IDs, but the original dataset includes items where the score is 0. Would it be okay to simply delete those?

{"query-id": "23849", "corpus-id": "1020327", "score": 2}
{"query-id": "23849", "corpus-id": "1034183", "score": 3}
{"query-id": "23849", "corpus-id": "1120730", "score": 0}
{"query-id": "23849", "corpus-id": "1139571", "score": 1}
{"query-id": "23849", "corpus-id": "1143724", "score": 0}
{"query-id": "23849", "corpus-id": "1147202", "score": 0}
{"query-id": "23849", "corpus-id": "1150311", "score": 0}
{"query-id": "23849", "corpus-id": "1158886", "score": 2}
...

@orionw

orionw commented Oct 19, 2025

Contributor

Ah oops, forgot how deeply judged these datasets are. That number looks right according to ir_datasets. Definitely don't delete them!
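To see how much of a qrels file consists of zero-score judgments, a quick tally along these lines may help. It is only a sketch over the eight rows quoted above, not the full dataset; the variable names are illustrative. Rows with score 0 are explicit judgments in the source data, so dropping them would make the qrels count diverge from what ir_datasets reports.

```python
import json
from collections import Counter

# The eight qrels rows quoted earlier in this thread.
qrels_jsonl = """\
{"query-id": "23849", "corpus-id": "1020327", "score": 2}
{"query-id": "23849", "corpus-id": "1034183", "score": 3}
{"query-id": "23849", "corpus-id": "1120730", "score": 0}
{"query-id": "23849", "corpus-id": "1139571", "score": 1}
{"query-id": "23849", "corpus-id": "1143724", "score": 0}
{"query-id": "23849", "corpus-id": "1147202", "score": 0}
{"query-id": "23849", "corpus-id": "1150311", "score": 0}
{"query-id": "23849", "corpus-id": "1158886", "score": 2}
"""

rows = [json.loads(line) for line in qrels_jsonl.splitlines() if line.strip()]

# Count how many judgments fall under each relevance grade.
score_counts = Counter(row["score"] for row in rows)

# Rows that would be lost if zero scores were deleted.
zero_rows = [row for row in rows if row["score"] == 0]

for score in sorted(score_counts):
    print(f"score {score}: {score_counts[score]} rows")
print(f"{len(zero_rows)} of {len(rows)} rows have score 0")
```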

LGTM!

@Samoed

Samoed commented Oct 19, 2025

Member

@whybe-choi Can this PR be merged?

@whybe-choi

whybe-choi commented Oct 20, 2025

Contributor Author

@Samoed I think it is ready to merge. However, I uploaded the dataset to my own Hugging Face repository; is that okay?

@Samoed

Samoed commented Oct 20, 2025

Member

Yes, that is okay, but you need to update the revision of the repository if you updated the qrels.

@whybe-choi

Contributor Author

This is already the revision where the correct qrels are reflected. As far as I know, there shouldn't be any problem!

@Samoed Samoed merged commit ca8d313 into embeddings-benchmark:v2.0.0 Oct 20, 2025
11 checks passed
@whybe-choi whybe-choi deleted the dataset/trec-dl branch October 20, 2025 07:24
@yjoonjang

Contributor

Hi, @whybe-choi @orionw . Thanks for your implementation of TREC DL 2019, 2020 datasets.
It was great to find this dataset while I was seeking this data for my research experiment.

I have one question, though. On ir_datasets, it looks like there are 3.2M docs for TREC DL 2019, but I see 8.8M in @whybe-choi's dataset.
image

Do you have any idea why? I'm not sure which is correct.

@whybe-choi

Contributor Author

Hello, @yjoonjang !
The corpus you uploaded is msmarco-document, while the corpus I uploaded is msmarco-passage. Therefore, it seems there is no issue with my dataset.

image

@yjoonjang

Contributor

Ahh okay. Thank you for your help !!

Development

Successfully merging this pull request may close these issues.

Add TREC deep learning track evals

5 participants