[v2] Two stage rerank by Samoed · Pull Request #3040 · embeddings-benchmark/mteb

Samoed · 2025-08-17T20:33:47Z

I've implemented two-stage reranking. Here is an example of how to run it:

from sentence_transformers import CrossEncoder

import mteb

model = mteb.get_model("minishlab/potion-base-2M")
task = mteb.get_task("NanoArguAnaRetrieval")
prediction_folder = "results_folder"

res = mteb.evaluate(
    model,
    task,
    prediction_folder=prediction_folder,
    # overwrite_strategy="always",
)
task = task.convert_to_reranking(task.predictions_path(prediction_folder), top_k=100)
model = CrossEncoder("cross-encoder/ms-marco-TinyBERT-L-2-v2")
mteb.evaluate(
    model,
    task,
)

Information about the run results is now saved to the result file for better reproducibility.
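Conceptually, the conversion step keeps the top_k highest-scoring documents per query from the first stage and passes them to the reranker as candidates. The following is only a minimal sketch in plain Python, assuming the first-stage predictions are shaped as {query_id: {doc_id: score}}; the helper name top_k_candidates is hypothetical and is not part of mteb:

```python
from __future__ import annotations


def top_k_candidates(
    predictions: dict[str, dict[str, float]], top_k: int
) -> dict[str, list[str]]:
    """Keep the top_k highest-scoring document ids per query (hypothetical helper)."""
    candidates: dict[str, list[str]] = {}
    for query_id, doc_scores in predictions.items():
        # Sort by score, highest first; ties are broken by document id.
        ranked = sorted(doc_scores.items(), key=lambda item: (-item[1], item[0]))
        candidates[query_id] = [doc_id for doc_id, _ in ranked[:top_k]]
    return candidates


# Toy first-stage output: two queries scored against a few documents.
first_stage = {
    "q1": {"d1": 0.2, "d2": 0.9, "d3": 0.5},
    "q2": {"d1": 0.7, "d4": 0.1},
}
second_stage_input = top_k_candidates(first_stage, top_k=2)
```

The reranker then only has to score the short candidate list per query instead of the whole corpus.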

KennethEnevoldsen

Hmm thanks for making the PR @Samoed.

I really don't like the save_retrieval_results=True argument, as it is clearly a task-specific argument.

I do however like that it is possible to create a reranking from a retrieval task. How about

import mteb

model = mteb.get_model("minishlab/potion-base-2M")
retrieval_task = mteb.get_task("NanoArguAnaRetrieval")

model = SaveRetrievaPredictionsWrapper(model)

res = mteb.evaluate(model, retrieval_task)
reranking_task = retrieval_task.convert_to_reranking(model.predictions, top_k=100)

model = mteb.get_model("jinaai/jina-reranker-v2-base-multilingual")
mteb.evaluate(model, reranking_task)

Alternatively, we could add a more general save_predictions flag, that can be used for all tasks and just add a NotImplementedError() for all tasks but retrieval. I could see that as being an option as well. Besides the many not-implemented errors this would cause, it would also lead to a requirement to handle new arguments, such as where to save predictions, etc.

The wrapper isolates the concern and leaves the evaluator focused only on evaluation.
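To illustrate the wrapper idea, here is a minimal sketch, assuming the wrapped model exposes a search-style method that returns {query_id: {doc_id: score}}. Both classes and the method signature are hypothetical stand-ins, not mteb classes:

```python
from __future__ import annotations


class DummySearchModel:
    """Stand-in for a first-stage retriever (hypothetical, not an mteb class)."""

    def search(self, queries: dict[str, str], top_k: int) -> dict[str, dict[str, float]]:
        # Return fixed scores for two documents per query.
        return {query_id: {"d1": 0.9, "d2": 0.4} for query_id in queries}


class SavePredictionsWrapper:
    """Records whatever the wrapped model's search() returns (hypothetical)."""

    def __init__(self, model):
        self.model = model
        self.predictions: dict[str, dict[str, float]] = {}

    def search(self, queries: dict[str, str], top_k: int) -> dict[str, dict[str, float]]:
        results = self.model.search(queries, top_k)
        # Keep a copy for a later reranking stage.
        self.predictions.update(results)
        # The caller sees the same output as without the wrapper.
        return results


wrapped = SavePredictionsWrapper(DummySearchModel())
output = wrapped.search({"q1": "what is mteb?"}, top_k=2)
```

The evaluator calls search() as usual and never needs to know that predictions are being recorded.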

KennethEnevoldsen · 2025-08-19T13:43:47Z

    assert result.returncode == 0, "Command failed"
-
-
-def test_save_predictions():


Isn't this still used by MTEB()?

Yes, this is still used by MTEB, but I don't want to support it, because otherwise we would be required to support a lot of extra parameters in AbsRetrieval, which I think we don't want.

So this will break MTEB()?

I've checked a bit further. It will only break part of the CLI, because we don't support this flag for now, but we need to refactor the CLI a bit to use mteb.evaluate instead of MTEB. We can add support for this flag to keep backward compatibility. But this flag is not supported by MTEB directly; it's just passed through kwargs and is not documented.

mteb/mteb/evaluation/MTEB.py

Lines 254 to 266 in cf01bc8

    def run(
        self,
        model: MTEBModels | CrossEncoder | SentenceTransformer,
        verbosity: int = 1,
        output_folder: str | None = "results",
        eval_splits: list[str] | None = None,
        eval_subsets: list[str] | None = None,
        overwrite_results: bool = False,
        raise_error: bool = True,
        co2_tracker: bool = True,
        encode_kwargs: dict[str, Any] | None = None,
        **kwargs,
    ) -> list[TaskResult]:
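The concern about a flag that is only passed through kwargs can be shown with a toy function. This is a sketch only: run_sketch is not the real MTEB.run, and the flag name save_predictions is used purely for illustration:

```python
def run_sketch(model, output_folder="results", **kwargs):
    """Toy stand-in for a run() with a catch-all **kwargs (not the real MTEB.run)."""
    # Only keys that the code explicitly looks up have any effect.
    save = kwargs.get("save_predictions", False)
    # Everything else is accepted without complaint and then ignored.
    ignored = sorted(key for key in kwargs if key != "save_predictions")
    return {"save_predictions": save, "ignored": ignored}


correct = run_sketch("model", save_predictions=True)
# A misspelled flag raises no error, so the mistake goes unnoticed.
misspelled = run_sketch("model", save_prediction=True)
```

Because the signature never names the flag, it does not show up in the documentation or in type checking, which is why dropping it only affects callers that relied on the undocumented behaviour.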

Yeah indeed!

Let us keep it like this, but remove it from the CLI args/docs.

If we implement an additional method for saving results, I think we can leave it as is.

Samoed · 2025-08-19T14:10:39Z

I really don't like the save_retrieval_results=True argument, as it is clearly a task-specific argument.

I definitely agree, but I don't see another solution to the problem. This is why I didn't want to implement it at first.

model = SaveRetrievaPredictionsWrapper(model)

I don't think that this would solve the problem, because models can save only embeddings, not predictions for the retrieval task. I don't know how to reuse results from the wrapper then.

Alternatively, we could add a more general save_predictions flag, that can be used for all tasks and just add a NotImplementedError() for all tasks but retrieval

I can implement this, because it would solve the problem for the tasks.

KennethEnevoldsen · 2025-08-20T09:16:12Z

I don't think that this would solve the problem, because models can save only embeddings, not predictions for the retrieval task. I don't know how to reuse results from the wrapper then.

Why can't they save predictions? Doesn't the search interface return RetrievalOutputType?

I can implement this, because it would solve the problem for the tasks.

Let us decide on a good solution before spending implementation time.

Samoed · 2025-08-20T09:28:14Z

Why can't they save predictions? Doesn't the search interface return RetrievalOutputType?

Oh, I forgot that we have the search interface. Yes, the wrapper should work then.

orionw

Sorry for the late review. In general this looks fine to me.

As a meta-comment though, I will say I'm nervous we're adding too many wrappers. Do they stack? Can we save predictions and cache embeddings and ...?

Maybe a personal preference, but I think arguments to functions are a much nicer UI from a user's perspective than stacking a bunch of wrappers. However, we have so many that it's probably too late to change all of them at this point. Just worth noting for future extensions.

Samoed · 2025-08-20T15:18:48Z

Can we save predictions and cache embeddings and ...?

I thought about this, too. This will be like RetrievalSaveResultsWrapper(CacheEmbeddingWrapper(model)) and should work
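Stacking of this kind can work when each wrapper forwards everything it does not handle to the object it wraps. A minimal sketch under that assumption follows; all class names are hypothetical stand-ins and do not reflect how the actual mteb wrappers are implemented:

```python
class DummyModel:
    """Toy model with encode() and search() (hypothetical, not an mteb class)."""

    def __init__(self):
        self.encode_calls = 0

    def encode(self, text):
        self.encode_calls += 1
        return float(len(text))

    def search(self, queries):
        return {query_id: {"d1": 1.0} for query_id in queries}


class Delegating:
    """Base wrapper: forwards any attribute it does not define itself."""

    def __init__(self, model):
        self.model = model

    def __getattr__(self, name):
        # Only called when normal lookup fails; guard against recursion on 'model'.
        if name == "model":
            raise AttributeError(name)
        return getattr(self.model, name)


class CachingWrapper(Delegating):
    """Caches encode() results by input text."""

    def __init__(self, model):
        super().__init__(model)
        self.cache = {}

    def encode(self, text):
        if text not in self.cache:
            self.cache[text] = self.model.encode(text)
        return self.cache[text]


class SavingWrapper(Delegating):
    """Records search() results."""

    def __init__(self, model):
        super().__init__(model)
        self.predictions = {}

    def search(self, queries):
        results = self.model.search(queries)
        self.predictions.update(results)
        return results


base = DummyModel()
stacked = SavingWrapper(CachingWrapper(base))
stacked.encode("abc")
stacked.encode("abc")  # served from the inner wrapper's cache
results = stacked.search({"q1": "abc"})  # recorded by the outer wrapper
```

One caveat of this sketch: caching only applies to calls that go through the wrapper, so a model whose search() calls its own encode() internally would bypass the cache.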

Maybe a personal preference, but I think arguments to functions are a much nicer UI from a user's perspective than stacking a bunch of wrappers. However, we have so many that it's probably too late to change all of them at this point. Just worth noting for future extensions.

In the new mteb.evaluate we have just 5 keyword-only parameters besides model and tasks:

mteb/mteb/evaluate.py

Lines 162 to 171 in cf01bc8

    
def evaluate(
    model: ModelMeta | MTEBModels | SentenceTransformer | CrossEncoder,
    tasks: AbsTask | Iterable[AbsTask],
    *,
    co2_tracker: bool | None = None,
    raise_error: bool = True,
    encode_kwargs: dict[str, Any] | None = None,
    cache: ResultCache | None = ResultCache(),
    overwrite_strategy: str | OverwriteStrategy = "only-missing",
) -> ModelResult:

KennethEnevoldsen · 2025-08-22T08:41:27Z

+    def __init__(
+        self,
+        model: SearchProtocol,
+        results_path: str | Path,


So based on @orionw's and @Samoed's comments, it might be better to add the following arguments:

save_retrieval_predictions: bool
prediction_save_path: Path

And then in the future we can raise a DeprecationWarning on save_retrieval_predictions and instead refer to save_predictions.

Let me know what you think.

(Sorry @Samoed, I understand that this is a digression from before.)

We might still want to keep the wrapper internally (without exposing it to the user)
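The rename path described in this comment could look roughly like the sketch below. It is an assumption-laden illustration: evaluate_sketch is a hypothetical toy function, not mteb.evaluate, and only the argument names are taken from the comment above:

```python
from __future__ import annotations

import warnings
from pathlib import Path


def evaluate_sketch(
    *,
    save_predictions: bool = False,
    save_retrieval_predictions: bool | None = None,
    prediction_save_path: Path | None = None,
) -> dict:
    """Toy function showing an argument rename with a deprecation period."""
    if save_retrieval_predictions is not None:
        # The old name still works, but points the caller at the new one.
        warnings.warn(
            "save_retrieval_predictions is deprecated; use save_predictions instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        save_predictions = save_retrieval_predictions
    return {"save_predictions": save_predictions, "path": prediction_save_path}


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_style = evaluate_sketch(save_retrieval_predictions=True)

new_style = evaluate_sketch(save_predictions=True, prediction_save_path=Path("predictions"))
```

Using None as the default for the old argument lets the function tell "not passed" apart from an explicit False.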

And then in the future we can raise a DeprecationWarning on save_retrieval_predictions and instead refer to save_predictions.

Where do you want to add it? To mteb.evaluate?

Yea that would be the idea

Should we add a save_predictions method to all tasks then? Then we could use only prediction_save_path: Path.

Hmm, but that requires a lot of implementation across the different tasks?

Yes, but I don't like that we're passing task-specific args. Also, for some users this can be helpful

Sure, it was simply that we don't have it implemented yet, so I am afraid it would give a poor user experience.

I will implement it for retrieval in this PR and will add it to other tasks later.

Alright let us do it this way (sorry for backtracking)
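The agreed direction (a general save_predictions hook, implemented for retrieval first) can be sketched as a base-class method that raises NotImplementedError until a task type overrides it. The class names below are hypothetical and the JSON layout is an assumption, not the format mteb writes:

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


class TaskSketch:
    """Hypothetical base task: saving predictions is opt-in per task type."""

    def save_predictions(self, predictions: dict, path: Path) -> None:
        raise NotImplementedError(
            f"{type(self).__name__} does not support saving predictions yet."
        )


class RetrievalTaskSketch(TaskSketch):
    """Retrieval is the first task type to implement the hook."""

    def save_predictions(self, predictions: dict, path: Path) -> None:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(predictions))


class ClassificationTaskSketch(TaskSketch):
    """Inherits the default and therefore raises for now."""


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "predictions" / "retrieval.json"
    RetrievalTaskSketch().save_predictions({"q1": {"d1": 0.9}}, out)
    saved = json.loads(out.read_text())

try:
    ClassificationTaskSketch().save_predictions({}, Path("unused.json"))
    unsupported_raised = False
except NotImplementedError:
    unsupported_raised = True
```

With this shape, the evaluation entry point only needs a save path; whether predictions can be saved is decided by the task itself.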

KennethEnevoldsen

Can you redo the example at the top? I think this format looks pretty good though. A few updates on naming things.

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

Samoed · 2025-08-26T08:52:39Z

Updated example in description

KennethEnevoldsen · 2025-08-26T09:24:25Z

Looks great @Samoed - I think this is good to merge - we need to update some docs away from using MTEB() for two-stage retrieval, but I can do that in a separate PR

Samoed and others added 4 commits August 17, 2025 18:27

init two stage

b1cffaa

working 2 stage reranking

3f8f3b9

upd numpy meta

579caf2

fix tests

9c78d6d

Samoed added the v2 label Aug 18, 2025

Samoed requested review from KennethEnevoldsen and orionw August 18, 2025 07:44

Samoed marked this pull request as ready for review August 18, 2025 07:45

Samoed added 7 commits August 18, 2025 14:25

fix python 3.9

97d0da5

format

53190e1

Merge branch 'refs/heads/v2.0.0' into two_stage_rerank

e7ad863

simplify

2e88e2f

fix cross encoder meta

9624259

add meta to sentence transformers wrapper

f3db816

fix model meta

c7077b9

KennethEnevoldsen reviewed Aug 19, 2025

create RetrievalSaveResultsWrapper

6f53499

Samoed requested a review from KennethEnevoldsen August 20, 2025 13:22

rename

573e2d3

orionw approved these changes Aug 20, 2025

Samoed mentioned this pull request Aug 20, 2025

fix: ensure that there are always relevant docs attached to query #3058

Merged

KennethEnevoldsen reviewed Aug 22, 2025

Samoed added 2 commits August 22, 2025 17:24

save only model name and revision in previous_results_model_meta

1f705ed

add results save path

0773402

Samoed requested a review from KennethEnevoldsen August 22, 2025 15:19

KennethEnevoldsen reviewed Aug 26, 2025

Comment thread mteb/evaluate.py Outdated

Comment thread mteb/evaluate.py Outdated

Comment thread mteb/abstasks/AbsTask.py Outdated

Samoed and others added 2 commits August 26, 2025 13:20

Update mteb/evaluate.py

6910563

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

rename results folder to prediction

c19125e

add more info about save path

5ae93f0

Samoed enabled auto-merge (squash) August 26, 2025 09:26

Samoed merged commit 687cb78 into v2.0.0 Aug 26, 2025
13 of 14 checks passed

Samoed deleted the two_stage_rerank branch August 26, 2025 13:28
