[v2] New encoder interface by Samoed · Pull Request #2415 · embeddings-benchmark/mteb

Samoed · 2025-03-22T23:04:35Z

Renamed Wrapper to AbsEncoder
Added similarity and pairwise similarity functions to Encoder interface
Created file for similarity functions and moved PromptType and BatchedInput to types file

All tasks now use the model's similarity implementation cosine by default. This allows for new approaches for similarity, but does mean that scores will change:

Results on bitext and retrieval models that use not cosine similarity. For now there is only facebook/contriever-msmarco, that was added in v2

For now, I don't know what to do with CrossEncoders. Should we create another interface for them?

Code Quality

Code Formatted: Format the code using make lint to maintain consistent style.

Documentation

Updated Documentation: Add or update documentation to reflect the changes introduced in this PR.

Testing

New Tests Added: Write tests to cover new functionality. Validate with make test-with-coverage.
Tests Passed: Run tests locally using make test or make test-with-coverage to ensure no existing functionality is broken.

Adding datasets checklist

Reason for dataset addition: ...

I have run the following models on the task (adding the results to the pr). These can be run using the mteb -m {model_name} -t {task_name} command.
- sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
- intfloat/multilingual-e5-small
I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
If the dataset is too big (e.g. >2048 examples), considering using self.stratified_subsampling() under dataset_transform()
I have filled out the metadata object in the dataset file (find documentation on it here).
Run tests locally to make sure nothing is broken using make test.
Run the formatter to format the code using make lint.

Adding a model checklist

I have filled out the ModelMeta object to the extent possible
I have ensured that my model can be loaded using
- mteb.get_model(model_name, revision) and
- mteb.get_model_meta(model_name, revision)
I have tested the implementation works on a representative set of tasks.

# Conflicts: # mteb/evaluation/evaluators/Image/ClassificationEvaluator.py # mteb/evaluation/evaluators/PairClassificationEvaluator.py # mteb/model_meta.py

Samoed · 2025-03-25T09:14:13Z

@isaac-chung @KennethEnevoldsen Can you review?

KennethEnevoldsen

A lot of good stuff here - sorry the second part of the review came so long after the first comments (the GitHub UI is simply excessively slow when working with large diffs - this is also why some of the comments are a bit terse)

KennethEnevoldsen

Looks good. I think this one is ready to merge

isaac-chung · 2025-04-02T11:49:30Z

There are 18 model groups and some of them are difficult to run and some are too big for me to run. This PR will not change much for the models, I think we can check them later, or can you provide top 5 models for us to check?

Hmm, if you can run every model in this screenshot that is runnable, it's probably a good enough check. Thanks again in advance!
(taken from the MIEB multilingual LB, sort by mean (task type))

CC @gowitheflow-1998 this is a change in v2.

Samoed · 2025-04-02T11:58:40Z

Your're sharing MTEB(Multilingual, v1). For me MIEB(Multilingual) looks like:

isaac-chung · 2025-04-02T11:59:54Z

(taken from the MIEB multilingual LB, sort by mean (task type))

Did you try this?

Samoed · 2025-04-02T12:02:04Z

It's a bit different with the sort, but still you're sharing MTEB, because GritLM, gte, e5-mistral is text only model

isaac-chung · 2025-04-02T12:12:25Z

Ah yes, yours is correct. e5-v is at the top.

Samoed · 2025-04-03T16:42:14Z

@orionw Hope you’ve successfully submitted to COLM! Could you take a look at the PR when you have a moment?

Regarding the EncoderInterface, maybe you have some suggestions for how we could better support cross-encoders?

orionw · 2025-04-04T11:01:00Z

Thanks for the ping @Samoed, sorry for the delay. It seems like this is the same interface that the cross-encoders currently have: namely the __init__, combine_query_and_instruction, and predict.

I have missed a lot of the conversation here so apologies if this has been hashed out: the only thing I think is confusing is we lump cross-encoders into the same class as the bi-encoders (also same with the base DenseRetrievalExactSearch class). I would prefer to have them be separate interfaces as there really isn't any overlap between how they work (except for GritLM).

Otherwise it looks good and complete w.r.t. cross-encoders!

Samoed · 2025-04-04T11:23:52Z

I think is confusing is we lump cross-encoders into the same class as the bi-encoders

Yes, I understand your point, but we haven't implemented the cross-encoder interface anywhere yet, and it would be difficult to check each time if a model is a cross-encoder. I have made a change based on the ModelMeta.is_cross_encoder, so the predict function will only be applied if the model is a cross-encoder. For now, I think it would be easier to support a single interface.

Samoed · 2025-04-04T18:49:15Z

@isaac-chung I've tested models.

Results from this PR:

task_name	google/siglip-base-patch16-256-multilingual	google/siglip-base-patch16-384	google/siglip-base-patch16-512	google/siglip-large-patch16-256	google/siglip-large-patch16-384	google/siglip-so400m-patch14-384	laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K	laion/CLIP-ViT-bigG-14-laion2B-39B-b160k
AROFlickrOrder	35.78	34.36	31.94	37.58	33.2	35	36.18	38.86
BLINKIT2IRetrieval	49.76	50.44	51.92	48.35	49.69	50.33	50.38	52.28
CVBenchCount	11.42	22.72	21.19	13.2	10.91	3.93	26.4	20.05
MNIST	94.61	95.24	95.38	96.05	96.13	95.95	96.03	95.74
MNISTZeroShot	83.1	80.88	82.86	85	85.17	88.56	81.46	77.1
STS12VisualSTS	66.62	64.62	64.97	63.94	66.3	61.9	62.37	62.81

Results from leaderboard:

task_name	google/siglip-base-patch16-256-multilingual	google/siglip-base-patch16-384	google/siglip-base-patch16-512	google/siglip-large-patch16-256	google/siglip-large-patch16-384	google/siglip-so400m-patch14-384	laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K	laion/CLIP-ViT-bigG-14-laion2B-39B-b160k
AROFlickrOrder	36	34.18	32.14	37.76	33.34	34.92	34.56	38.84
BLINKIT2IRetrieval	37.31	35.82	38.31	35.57	33.58	36.82	36.32	39.55
CVBenchCount	34.64	53.43	55.2	8.88	8.76	21.7	43.27	4.19
MNIST	94.98	95.35	95.48	96.05	96.21	96.11	96.16	96.12
MNISTZeroShot	83.09	80.88	82.85	85	85.17	88.56	81.44	77.09
STS12VisualSTS	66.62	64.62	64.97	63.94	66.3	61.9	62.36	62.81

BLINKIT2IRetrieval and CVBenchCount results don't match, and I've run them from main. And I can't reporoduce scores from leaderboard for BLINKIT2IRetrieval. For CVBenchCount I don't understand why results from my PR not matching

task_name	google/siglip-base-patch16-512	google/siglip-large-patch16-256	google/siglip-large-patch16-384	google/siglip-so400m-patch14-384
BLINKIT2IRetrieval	51.88	48.47	49.86	50.32
CVBenchCount	55.2	8.88	8.76	21.7

isaac-chung · 2025-04-04T19:09:36Z

Odd. I was able to get 'main_score': 0.55203 on main for google/siglip-base-patch16-512 and CVBenchCount. Can take a look at the PR's score.

Samoed · 2025-04-04T19:20:47Z

Yes, I'm getting similar score 51.88 on BLINKIT2IRetrieval, but on leaderboard reported 38.31. I think at one time metric of retrieval task was changed, because it has ndcg@10 0.51, but main score 38.31 (maybe previously it had ndcg@1 and because of that main score is different) https://github.com/embeddings-benchmark/results/blob/7efced57a95d46c32df7f2bdb51bf0ea77944fbc/results/google__siglip-base-patch16-512/753a949581523b60257d93e18391e8c27f72eb22/BLINKIT2IRetrieval.json#L162

CC @gowitheflow-1998

isaac-chung · 2025-04-04T19:24:24Z

I got 'main_score': 0.55457 from v2.0.0 for google/siglip-base-patch16-512 and CVBenchCount. So far main, v2.0.0, and the LB agree for this model-task combination.

Samoed · 2025-04-04T19:27:28Z

For CVBench I agree, but what's for BLINKIT2IRetrieval? It's reported score 38.31, but I got from main 0.51

https://github.com/embeddings-benchmark/results/blob/7efced57a95d46c32df7f2bdb51bf0ea77944fbc/results/google__siglip-base-patch16-512/753a949581523b60257d93e18391e8c27f72eb22/BLINKIT2IRetrieval.json#L162

isaac-chung · 2025-04-04T19:27:54Z

We previously overwrote LB scores to use cv_recall_at_1 for BLINKIT2IRetrieval: embeddings-benchmark/results#122 (comment). We should update main_score to cv_recall_at_1 as well.

isaac-chung · 2025-04-04T19:31:14Z

That should be it, thanks so much, @Samoed !
@gowitheflow-1998 I think we anticipated small deviations like 1-2 decimal points. So this looks decent.

Samoed · 2025-04-04T19:32:50Z

Scores after main merge are matching for CVBenchCount too

task_name	google/siglip-base-patch16-512	google/siglip-large-patch16-256	google/siglip-large-patch16-384	google/siglip-so400m-patch14-384
CVBenchCount	55.46	8.76	8.76	21.7

Samoed added 14 commits March 23, 2025 02:04

update mteb

828a8ae

fix docs

28272be

fix docs

5b1b902

fix docs

ed0c9c0

update image tasks

d24d586

update models

0281023

fix imports

05a01d6

fix mteb meta name

a46f740

fix mock model

f2ea9f6

Merge branch 'v2.0.0' into new_encoder_interface

1aaf0f8

# Conflicts: # mteb/evaluation/evaluators/Image/ClassificationEvaluator.py # mteb/evaluation/evaluators/PairClassificationEvaluator.py # mteb/model_meta.py

fix mock model

e548529

fixes

aa92576

fix similarity

70dffec

fix tests

99a4678

Samoed requested review from KennethEnevoldsen and isaac-chung March 23, 2025 21:39

Samoed marked this pull request as ready for review March 23, 2025 21:39

Update docs/adding_a_model.md

da3b46b

Samoed commented Mar 25, 2025

View reviewed changes

Comment thread mteb/encoder_interface.py Outdated

This was referenced Mar 26, 2025

Refactor MIEB Clustering #2434

Closed

Refactor MIEB retrieval #2435

Closed

KennethEnevoldsen reviewed Mar 26, 2025

View reviewed changes

Comment thread tests/test_benchmark/mock_models.py Outdated

Comment thread mteb/model_meta.py Outdated

Samoed added 3 commits March 26, 2025 17:06

change similarity to custom

8dd66d6

add vision similarity

3b9b1c2

refactor scoring function values to use .value attribute

b8b5ac4

Samoed requested a review from KennethEnevoldsen March 26, 2025 15:48

KennethEnevoldsen reviewed Mar 26, 2025

View reviewed changes

Samoed mentioned this pull request Mar 26, 2025

Allow numpy output in Image2Text #2441

Closed

fixes

5e71603

merge v2

f436681

KennethEnevoldsen approved these changes Apr 2, 2025

View reviewed changes

Comment thread docs/adding_a_model.md Outdated

Comment thread mteb/abstasks/Image/AbsTaskVisualSTS.py

Comment thread mteb/models/sentence_transformer_wrapper.py

Samoed added 3 commits April 3, 2025 00:59

rename abs encoder to model

37d7ec7

upd docs

b704084

remove vision similarity

b153676

Samoed added 3 commits April 4, 2025 11:38

all models inherit from AbsEncoder

8180f04

fix openclip loader

055b7f0

make sure that loaders follow new loading format

d37058d

Samoed added 2 commits April 4, 2025 15:50

ouput image instead of tesnsor

e3dfdbe

merge

6434aa3

Samoed mentioned this pull request Apr 4, 2025

CLI Tool for results dataframe on leaderboard #2454

Closed

4 tasks

Samoed merged commit 9460039 into v2.0.0 Apr 4, 2025

Samoed deleted the new_encoder_interface branch April 4, 2025 19:42

Samoed added the v2 label Sep 6, 2025

Uh oh!

Conversation

Samoed commented Mar 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Quality

Documentation

Testing

Adding datasets checklist

Adding a model checklist

Uh oh!

Samoed commented Mar 25, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

KennethEnevoldsen left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

KennethEnevoldsen left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

isaac-chung commented Apr 2, 2025

Uh oh!

Samoed commented Apr 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

isaac-chung commented Apr 2, 2025

Uh oh!

Samoed commented Apr 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

isaac-chung commented Apr 2, 2025

Uh oh!

Samoed commented Apr 3, 2025

Uh oh!

orionw commented Apr 4, 2025

Uh oh!

Samoed commented Apr 4, 2025

Uh oh!

Samoed commented Apr 4, 2025

Uh oh!

isaac-chung commented Apr 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Samoed commented Apr 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

isaac-chung commented Apr 4, 2025

Uh oh!

Samoed commented Apr 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

isaac-chung commented Apr 4, 2025

Uh oh!

isaac-chung commented Apr 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Samoed commented Apr 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

Samoed commented Mar 22, 2025 •

edited

Loading

Samoed commented Apr 2, 2025 •

edited

Loading

Samoed commented Apr 2, 2025 •

edited

Loading

isaac-chung commented Apr 4, 2025 •

edited

Loading

Samoed commented Apr 4, 2025 •

edited

Loading

Samoed commented Apr 4, 2025 •

edited

Loading

isaac-chung commented Apr 4, 2025 •

edited

Loading

Samoed commented Apr 4, 2025 •

edited

Loading