Consolidate Retrieval/Reranking/Instruction Variants by orionw · Pull Request #1359 · embeddings-benchmark/mteb

orionw · 2024-10-29T20:16:05Z

This PR would consolidate much of the logic between reranking/retrieval/instruction variants.

In principal, the only difference between these retrieval and reranking is that reranking is a bunch of "mini" retrieval tasks where the corpus is set by a different list.

The only difference between reranking and instruction reranking is the presence of an instruction, so this PR also merges those together also (we will have to designate instruction tasks via some list, but we need to do that anyways for #1066 and others)

Benefits are:

Enables reranking to use the full set of retrieval metrics, not just MAP and MRR.
Consolidates all the retrieval-style code (reranking, retrieval, instruction-based variants) hopefully making it easier to maintain and reducing duplicate code (e.g. we won't have three different places for retrieval eval code)
Makes it possible to use cross-encoders on reranking tasks (Cross-encoder and MTEB reranking tasks #1213)
Allows for instruction-based reranking tasks

Potential concerns/todos:

Using the same implementation for both retrieval and reranking is less computationally efficient since we don't batch over all queries at once. It should be roughly similar though on the corpus side, since we're still batching. It depends on if we're okay with this though @KennethEnevoldsen @Muennighoff @imenelydiaker @Samoed (and please tag others interested in reranking)? I don't remember reranking being particular compute intensive so perhaps this trade off is worth it? If not, I can revert some of this.
I still need to remove AbsTaskInstructionRetrieval and put the extras into the AbsTaskRetrieval class (+ adding some new datasets), but I thought I would leave that for the next PR.

If people are okay with this approach, I'll go through and verify that all the Reranking tasks still work and reproduce score-wise.

Checklist

[] Run tests locally to make sure nothing is broken using make test.
Run the formatter to format the code using make lint.

KennethEnevoldsen

Love the PR. I think the time concern is a reasonable trade-off for simplicity and maintainability, but @Muennighoff might know more about this given he made the original choice to split the two.

Have we checked that the new rerankers produce similar scores?

Samoed · 2024-10-30T10:11:12Z

Currently, retrieval tasks can't load multilingual datasets. I think this PR is a good opportunity to standardize this and re-upload them, as retrieval datasets have different loaders. For base we can take MIRACLE loader or CodeSearchNetCCRetrieval.

For example, in CMTEB, qrel_revision specifies loading qrels from a different repository. I can assist with re-uploading. Here I tried specified ConfigDict(extra="forbid") to find tasks with old unused metadata, but found a lot weird retrieval loaders (I commended out their irrelevant revisions to make mteb) #1362.

orionw · 2024-10-30T13:11:07Z

Thanks for the comments @KennethEnevoldsen and @Samoed! I'll go ahead and finish validating these / fixing MIRACL and turn this into a non-draft PR.

I think this PR is a good opportunity to standardize this and re-upload them, as retrieval datasets have different loaders.

@Samoed, when you say standardize, what do you mean? I think each task usually defines their way of loading right? Unless you mean have a default loader for this in the main codebase, so each file doesn't have to define their own custom loader?

Samoed · 2024-10-30T13:22:28Z

I think each task usually defines their way of loading right? Unless you mean have a default loader for this in the main codebase, so each file doesn't have to define their own custom loader?

Yes, currently only retrieval tasks don’t support multilingual loading by default and require custom data loaders.

orionw · 2024-11-06T22:12:49Z

An update on this. Everything is done, except for MindSmall, but perhaps I need to make another large change to fit it.

The reranking code has:

    for instance in self.samples:
        num_subqueries = (
            len(instance["query"]) if isinstance(instance["query"], list) else 1
        )
        query_emb = all_query_embs[query_idx : query_idx + num_subqueries]

query_emb: Query embedding of shape `(num_queries, hidden_size)`)
                if `num_queries` > 0: we take the closest document to any of the queries

where these subqueries only applies to the MindSmall reranking task.

The two scores (with or without subqueries) are actually pretty close if you ignore this step (only 0.1% off), but since it is in MTEB-original I wanted to make sure it hit parity.

The problem is that MindSmall has thousands of duplicate entries. So because my change makes it the "mini"-retrieval setting where each query sets up a mini retrieval task, it re-computes all the duplicates. With all the duplicates (1000x), it's simply not feasible to run the task.

I considered a solution where (like we do currently) we compute a unique set of queries and documents, but then share the retrieval evaluation code. However, this will compute the dot product on a lot more query-doc pairs than we need to.

I'm not sure I see a solution, so I will probably re-add the reranking evaluator and try to have it share as much code as possible with the retrieval one. EDIT: or potentially changing the DRESModel search to handle this case...

If anyone has other suggestions please let me know!

KennethEnevoldsen · 2024-11-07T08:59:43Z

I think the best solution is to duplicate the evaluation code over to mindsmall reranking (and then leave it as is). We can then create a v2 of the mindsmall both without duplicates and using the standardized code. We can then update MTEB(eng, beta) to use v2.

This keeps parity, simplifies the code (moving complexity to a single case), and removes duplicates from a dataset.

orionw · 2024-11-09T23:33:50Z

Okay, I think this is finally ready for review. Sorry it ended up so big, it was hard to break this into pieces.

A list of things changed:

Deleted InstructionRetrievalEvaluator, RerankingEvaluator, AbsTaskInstructionRetrieval and severely reduced AbstractTaskReranking
Made all retrieval-based tasks take the same main three configs: corpus, queries, relevant_docs with options for instructions (query-id->str) and top_ranked (which is reranking, query-id -> list of strings)
Removed a lot of older code around instructions, such as the keywords, short_instructions, save_corpus_embeddings, etc.
Made the reranking happen efficiently by changing the search function in DenseRetrievalExactSearch to handle reranking also, added torch compile to the similarity metrics if available (I saw a 2x speedup!), and added GPU-based cosine similarity (and dot product) to speed it up also. This makes MindSmall feasible. Without the GPU and compiling, MindSmall takes too long in the dot products - with it takes 30 minutes.
Allow for specialized metrics like p-MRR and Robustness@10 to be implemented in a way that doesn't affect all the other tasks, in utils.py
Refactored a bit of the evaluation code so that other custom metrics could use it when they do aggregation over the scores.
Added custom loaders for MIRACL and MindSmall so they load nicely until I have time to make them into clean datasets
Updated the tests to add InstructionReranking
Added a new InstructionRetrieval task, InstructIR and verified it matches the paper. Added contriever to the model section (including fixing a bug that didn't allow dot product similarity), so I could verify the InstructIR dataset.
Added a new Reranking task (NevIR) with a custom metric (paired accuracy) to make sure custom metrics work.
Moved FollowIR-based datasets into InstructionReranking which I think is a better fit
Refactored HFDataLoader into its own file
Refactored DRESModel and DenseRetrievalExtactSearch into a separate file.

I ran sentence-transformers/paraphrase-multilingual-mpnet-base-v2 on all of the reranking tasks, the instruction-based tasks, and some retrieval tasks to be sure that the scores match and they all do. There is some minor variation in the 4th decimal place (off by 0.0001 ish) due to the previous reranking code not having query-ids and so the sorting is a bit different on ties and such.

Eventually, I'd like to create new datasets on HF for the 15 ish reranking tasks, but I don't think I'll have time for that in the next couple weeks. It takes a few seconds to convert them on the fly for all tasks except MIRACL, MindSmall, and WebLINXCandidatesReranking which takes 5-10 minutes to convert on the fly.

Samoed

Awesome changes!

isaac-chung · 2024-11-10T15:52:13Z

@orionw wow! Great PR. I'll take a more detailed look shortly. Right now I only have a small thought on the removal of the Evaluators and AbsTask: removing these classes would be considered a breaking change, which should trigger a major version bump. We might want to do that only when we group a few other bigger impact changes together.

So my suggestion here is, instead of fully deleting them here and now, we could only add a warning to each of those classes that they are now deprecated, and will be removed in v2.0.0. That way if anyone had started using these will have a heads up, and we avoid 2 almost back-to-back major version bumps. Those classes will be fully removed at the v2.0.0 commit.

What do you think? also @KennethEnevoldsen @Samoed

KennethEnevoldsen · 2024-11-11T08:59:55Z

What do you think? also @KennethEnevoldsen @Samoed

We could include these in v2.0.0 along with MIEB

(maybe we should start the v2.0.0 branch to collect some changes together?

isaac-chung · 2024-11-11T09:12:56Z

We could include these in v2.0.0 along with MIEB
(maybe we should start the v2.0.0 branch to collect some changes together?

Yeah, good call!

KennethEnevoldsen · 2024-11-11T09:28:59Z

Moved from main to v.2.0.0

orionw · 2024-11-11T12:08:53Z

Thanks @isaac-chung and good point!

So my suggestion here is, instead of fully deleting them here and now, we could only add a warning to each of those classes that they are now deprecated, and will be removed in v2.0.0

I agree with this. Unless v2.0.0 is soon, I think this will make it much easier for people to add instruction-based tasks and we should merge it in sooner than later. @KennethEnevoldsen do you know the timeline on that?

Happy to add back in some of these evaluators even if they are unused, just so we keep compatibility.

isaac-chung · 2024-11-11T12:24:42Z

@orionw @KennethEnevoldsen maybe we can merge this into the 2.0.0 preview branch when ready? That way if users want to try it out before the actual release they still can, and they won't be blocked by our timeline.

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

orionw · 2024-11-11T14:18:52Z

@isaac-chung we could do that too

I was mainly commenting from a "new-task-adder" perspective rather than a user perspective. New PR's like #1425 are adding new instruction-retrieval tasks and the current implementation of the InstructionRetrieval is quite confusing to add new tasks (which is totally my fault, as I added that setup earlier this year and didn't keep it broad enough).

If v2.0.0 is coming out within a month or so then perhaps it's fine to make those PRs also merge in to v2.0.0, but I was trying to avoid making it difficult for others to add because of my poor decisions :)

KennethEnevoldsen · 2024-11-13T12:53:42Z

I believe this is a good fit for v2. Feel free to merge in into v2 as it is now (resolving the merge conflict). Please add a release note in the description over at #1433

…com/embeddings-benchmark/mteb into consolidate_reranking_and_retrieval

orionw added 3 commits October 29, 2024 13:22

update

c54ea11

merged retrieval; working

0fb211d

update tasks; working multilingual

cacd075

orionw requested review from KennethEnevoldsen and isaac-chung October 29, 2024 20:16

orionw commented Oct 29, 2024

View reviewed changes

KennethEnevoldsen reviewed Oct 30, 2024

View reviewed changes

orionw added 3 commits November 4, 2024 20:05

everything working except instructions

4b8d32f

working instructions; just need cleanup

45df270

add metadata for all but MindSmall

4f25da4

orionw added 9 commits November 9, 2024 12:53

faster evaluation; mindsmall can compute in reasonable time

01c63e0

merge in main

acb4dec

fix bad merge of docs

d557aac

lint

b19335b

fix test

636fdfc

qa

47e80ba

updated mindsmall

ad0a3db

lint

88df96d

fix debug

94ba50b

orionw changed the title ~~DRAFT: Consolidate Retrieval/Reranking/Instruction Variants~~ Consolidate Retrieval/Reranking/Instruction Variants Nov 9, 2024

orionw requested a review from Samoed November 9, 2024 23:38