Skip to content

Consolidate Retrieval/Reranking/Instruction Variants#1359

Merged
orionw merged 19 commits into
v2.0.0from
consolidate_reranking_and_retrieval
Nov 13, 2024
Merged

Consolidate Retrieval/Reranking/Instruction Variants#1359
orionw merged 19 commits into
v2.0.0from
consolidate_reranking_and_retrieval

Conversation

@orionw

@orionw orionw commented Oct 29, 2024

Copy link
Copy Markdown
Contributor

This PR would consolidate much of the logic between reranking/retrieval/instruction variants.

In principal, the only difference between these retrieval and reranking is that reranking is a bunch of "mini" retrieval tasks where the corpus is set by a different list.

The only difference between reranking and instruction reranking is the presence of an instruction, so this PR also merges those together also (we will have to designate instruction tasks via some list, but we need to do that anyways for #1066 and others)

Benefits are:

  • Enables reranking to use the full set of retrieval metrics, not just MAP and MRR.
  • Consolidates all the retrieval-style code (reranking, retrieval, instruction-based variants) hopefully making it easier to maintain and reducing duplicate code (e.g. we won't have three different places for retrieval eval code)
  • Makes it possible to use cross-encoders on reranking tasks (Cross-encoder and MTEB reranking tasks #1213)
  • Allows for instruction-based reranking tasks

Potential concerns/todos:

  • Using the same implementation for both retrieval and reranking is less computationally efficient since we don't batch over all queries at once. It should be roughly similar though on the corpus side, since we're still batching. It depends on if we're okay with this though @KennethEnevoldsen @Muennighoff @imenelydiaker @Samoed (and please tag others interested in reranking)? I don't remember reranking being particular compute intensive so perhaps this trade off is worth it? If not, I can revert some of this.
  • I still need to remove AbsTaskInstructionRetrieval and put the extras into the AbsTaskRetrieval class (+ adding some new datasets), but I thought I would leave that for the next PR.

If people are okay with this approach, I'll go through and verify that all the Reranking tasks still work and reproduce score-wise.

Checklist

  • [] Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.

Comment thread tests/test_evaluators/test_InstructionRetrievalEvaluator.py
Comment thread mteb/tasks/Reranking/multilingual/WikipediaRerankingMultilingual.py
Comment thread mteb/overview.py
Comment thread mteb/evaluation/evaluators/__init__.py
Comment thread mteb/abstasks/AbsTaskReranking.py
Comment thread mteb/abstasks/AbsTaskReranking.py

@KennethEnevoldsen KennethEnevoldsen left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love the PR. I think the time concern is a reasonable trade-off for simplicity and maintainability, but @Muennighoff might know more about this given he made the original choice to split the two.

Have we checked that the new rerankers produce similar scores?

Comment thread mteb/abstasks/AbsTaskReranking.py Outdated
Comment thread mteb/abstasks/AbsTaskReranking.py Outdated
Comment thread mteb/abstasks/AbsTaskReranking.py Outdated
Comment thread mteb/tasks/Reranking/multilingual/MIRACLReranking.py Outdated
Comment thread mteb/tasks/Reranking/multilingual/MIRACLReranking.py Outdated
@Samoed

Samoed commented Oct 30, 2024

Copy link
Copy Markdown
Member

Currently, retrieval tasks can't load multilingual datasets. I think this PR is a good opportunity to standardize this and re-upload them, as retrieval datasets have different loaders. For base we can take MIRACLE loader or CodeSearchNetCCRetrieval.

For example, in CMTEB, qrel_revision specifies loading qrels from a different repository. I can assist with re-uploading. Here I tried specified ConfigDict(extra="forbid") to find tasks with old unused metadata, but found a lot weird retrieval loaders (I commended out their irrelevant revisions to make mteb) #1362.

@orionw

orionw commented Oct 30, 2024

Copy link
Copy Markdown
Contributor Author

Thanks for the comments @KennethEnevoldsen and @Samoed! I'll go ahead and finish validating these / fixing MIRACL and turn this into a non-draft PR.

I think this PR is a good opportunity to standardize this and re-upload them, as retrieval datasets have different loaders.

@Samoed, when you say standardize, what do you mean? I think each task usually defines their way of loading right? Unless you mean have a default loader for this in the main codebase, so each file doesn't have to define their own custom loader?

@Samoed

Samoed commented Oct 30, 2024

Copy link
Copy Markdown
Member

I think each task usually defines their way of loading right? Unless you mean have a default loader for this in the main codebase, so each file doesn't have to define their own custom loader?

Yes, currently only retrieval tasks don’t support multilingual loading by default and require custom data loaders.

@orionw

orionw commented Nov 6, 2024

Copy link
Copy Markdown
Contributor Author

An update on this. Everything is done, except for MindSmall, but perhaps I need to make another large change to fit it.

The reranking code has:

    for instance in self.samples:
        num_subqueries = (
            len(instance["query"]) if isinstance(instance["query"], list) else 1
        )
        query_emb = all_query_embs[query_idx : query_idx + num_subqueries]
query_emb: Query embedding of shape `(num_queries, hidden_size)`)
                if `num_queries` > 0: we take the closest document to any of the queries

where these subqueries only applies to the MindSmall reranking task.

The two scores (with or without subqueries) are actually pretty close if you ignore this step (only 0.1% off), but since it is in MTEB-original I wanted to make sure it hit parity.

The problem is that MindSmall has thousands of duplicate entries. So because my change makes it the "mini"-retrieval setting where each query sets up a mini retrieval task, it re-computes all the duplicates. With all the duplicates (1000x), it's simply not feasible to run the task.

I considered a solution where (like we do currently) we compute a unique set of queries and documents, but then share the retrieval evaluation code. However, this will compute the dot product on a lot more query-doc pairs than we need to.

I'm not sure I see a solution, so I will probably re-add the reranking evaluator and try to have it share as much code as possible with the retrieval one. EDIT: or potentially changing the DRESModel search to handle this case...

If anyone has other suggestions please let me know!

@KennethEnevoldsen

KennethEnevoldsen commented Nov 7, 2024

Copy link
Copy Markdown
Contributor

I think the best solution is to duplicate the evaluation code over to mindsmall reranking (and then leave it as is). We can then create a v2 of the mindsmall both without duplicates and using the standardized code. We can then update MTEB(eng, beta) to use v2.

This keeps parity, simplifies the code (moving complexity to a single case), and removes duplicates from a dataset.

@orionw orionw changed the title DRAFT: Consolidate Retrieval/Reranking/Instruction Variants Consolidate Retrieval/Reranking/Instruction Variants Nov 9, 2024
@orionw

orionw commented Nov 9, 2024

Copy link
Copy Markdown
Contributor Author

Okay, I think this is finally ready for review. Sorry it ended up so big, it was hard to break this into pieces.

A list of things changed:

  • Deleted InstructionRetrievalEvaluator, RerankingEvaluator, AbsTaskInstructionRetrieval and severely reduced AbstractTaskReranking
  • Made all retrieval-based tasks take the same main three configs: corpus, queries, relevant_docs with options for instructions (query-id->str) and top_ranked (which is reranking, query-id -> list of strings)
  • Removed a lot of older code around instructions, such as the keywords, short_instructions, save_corpus_embeddings, etc.
  • Made the reranking happen efficiently by changing the search function in DenseRetrievalExactSearch to handle reranking also, added torch compile to the similarity metrics if available (I saw a 2x speedup!), and added GPU-based cosine similarity (and dot product) to speed it up also. This makes MindSmall feasible. Without the GPU and compiling, MindSmall takes too long in the dot products - with it takes 30 minutes.
  • Allow for specialized metrics like p-MRR and Robustness@10 to be implemented in a way that doesn't affect all the other tasks, in utils.py
  • Refactored a bit of the evaluation code so that other custom metrics could use it when they do aggregation over the scores.
  • Added custom loaders for MIRACL and MindSmall so they load nicely until I have time to make them into clean datasets
  • Updated the tests to add InstructionReranking
  • Added a new InstructionRetrieval task, InstructIR and verified it matches the paper. Added contriever to the model section (including fixing a bug that didn't allow dot product similarity), so I could verify the InstructIR dataset.
  • Added a new Reranking task (NevIR) with a custom metric (paired accuracy) to make sure custom metrics work.
  • Moved FollowIR-based datasets into InstructionReranking which I think is a better fit
  • Refactored HFDataLoader into its own file
  • Refactored DRESModel and DenseRetrievalExtactSearch into a separate file.

I ran sentence-transformers/paraphrase-multilingual-mpnet-base-v2 on all of the reranking tasks, the instruction-based tasks, and some retrieval tasks to be sure that the scores match and they all do. There is some minor variation in the 4th decimal place (off by 0.0001 ish) due to the previous reranking code not having query-ids and so the sorting is a bit different on ties and such.

Eventually, I'd like to create new datasets on HF for the 15 ish reranking tasks, but I don't think I'll have time for that in the next couple weeks. It takes a few seconds to convert them on the fly for all tasks except MIRACL, MindSmall, and WebLINXCandidatesReranking which takes 5-10 minutes to convert on the fly.

@orionw orionw requested a review from Samoed November 9, 2024 23:38
Comment thread mteb/abstasks/dataloaders.py Outdated

@Samoed Samoed left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome changes!

Comment thread mteb/tasks/InstructionRetrieval/eng/InstructIR.py
Comment thread mteb/evaluation/evaluators/model_classes.py Outdated
Comment thread mteb/evaluation/evaluators/model_classes.py
Comment thread mteb/evaluation/evaluators/utils.py
Comment thread mteb/evaluation/evaluators/utils.py
Comment thread mteb/evaluation/evaluators/utils.py
@orionw orionw mentioned this pull request Nov 10, 2024
15 tasks
@isaac-chung

Copy link
Copy Markdown
Collaborator

@orionw wow! Great PR. I'll take a more detailed look shortly. Right now I only have a small thought on the removal of the Evaluators and AbsTask: removing these classes would be considered a breaking change, which should trigger a major version bump. We might want to do that only when we group a few other bigger impact changes together.

So my suggestion here is, instead of fully deleting them here and now, we could only add a warning to each of those classes that they are now deprecated, and will be removed in v2.0.0. That way if anyone had started using these will have a heads up, and we avoid 2 almost back-to-back major version bumps. Those classes will be fully removed at the v2.0.0 commit.

What do you think? also @KennethEnevoldsen @Samoed

@KennethEnevoldsen

Copy link
Copy Markdown
Contributor

What do you think? also @KennethEnevoldsen @Samoed

We could include these in v2.0.0 along with MIEB

(maybe we should start the v2.0.0 branch to collect some changes together?

@isaac-chung

Copy link
Copy Markdown
Collaborator

We could include these in v2.0.0 along with MIEB
(maybe we should start the v2.0.0 branch to collect some changes together?

Yeah, good call!

@KennethEnevoldsen KennethEnevoldsen changed the base branch from main to v2.0.0 November 11, 2024 09:28
@KennethEnevoldsen

Copy link
Copy Markdown
Contributor

Moved from main to v.2.0.0

@orionw

orionw commented Nov 11, 2024

Copy link
Copy Markdown
Contributor Author

Thanks @isaac-chung and good point!

So my suggestion here is, instead of fully deleting them here and now, we could only add a warning to each of those classes that they are now deprecated, and will be removed in v2.0.0

I agree with this. Unless v2.0.0 is soon, I think this will make it much easier for people to add instruction-based tasks and we should merge it in sooner than later. @KennethEnevoldsen do you know the timeline on that?

Happy to add back in some of these evaluators even if they are unused, just so we keep compatibility.

@isaac-chung

Copy link
Copy Markdown
Collaborator

@orionw @KennethEnevoldsen maybe we can merge this into the 2.0.0 preview branch when ready? That way if users want to try it out before the actual release they still can, and they won't be blocked by our timeline.

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
@orionw

orionw commented Nov 11, 2024

Copy link
Copy Markdown
Contributor Author

@isaac-chung we could do that too

I was mainly commenting from a "new-task-adder" perspective rather than a user perspective. New PR's like #1425 are adding new instruction-retrieval tasks and the current implementation of the InstructionRetrieval is quite confusing to add new tasks (which is totally my fault, as I added that setup earlier this year and didn't keep it broad enough).

If v2.0.0 is coming out within a month or so then perhaps it's fine to make those PRs also merge in to v2.0.0, but I was trying to avoid making it difficult for others to add because of my poor decisions :)

@KennethEnevoldsen

Copy link
Copy Markdown
Contributor

I believe this is a good fit for v2. Feel free to merge in into v2 as it is now (resolving the merge conflict). Please add a release note in the description over at #1433

@orionw orionw merged commit 9c58518 into v2.0.0 Nov 13, 2024
@orionw orionw deleted the consolidate_reranking_and_retrieval branch November 13, 2024 16:30
@orionw orionw mentioned this pull request Nov 13, 2024
@Samoed Samoed mentioned this pull request Feb 21, 2025
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants