Add jina, uae, stella models by Samoed · Pull Request #1319 · embeddings-benchmark/mteb

Samoed · 2024-10-24T11:57:30Z

Checklist

Run tests locally to make sure nothing is broken using make test.
Run the formatter to format the code using make lint.

Adding a model checklist

I have filled out the ModelMeta object to the extent possible
I have ensured that my model can be loaded using
- mteb.get_model(model_name, revision) and
- mteb.get_model_meta(model_name, revision)
I have tested the implementation works on a representative set of tasks.
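
The two loading calls named in the checklist can be exercised together. This is a minimal sketch, assuming `mteb` is installed; the helper name `load_model_and_meta` and the model name in the usage comment are illustrative and not part of this PR.

```python
def load_model_and_meta(model_name, revision=None):
    """Load both the model and its ModelMeta via the two checklist calls.

    The helper itself is hypothetical; only mteb.get_model and
    mteb.get_model_meta come from the checklist above.
    """
    import mteb  # imported lazily so the sketch can be read without mteb installed

    meta = mteb.get_model_meta(model_name, revision=revision)
    model = mteb.get_model(model_name, revision=revision)
    return model, meta


# Illustrative usage (assumes the model is registered under this name):
# model, meta = load_model_and_meta("jinaai/jina-embeddings-v3")
# print(meta.name, meta.revision)
```

If either call raises for a newly added model, the registration is incomplete.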

Co-authored-by: Wang Bo <bo.wang@jina.ai>

bwanglzu · 2024-10-24T13:05:51Z

The rest looks good to me. I need to run some checks to make sure the different task adapters (especially retrieval), task, and prompt_name are passed correctly and that we can reproduce our reported results. I'm running some small tests now.

Samoed · 2024-10-24T13:07:17Z

I have results and will paste them soon. I'm currently building a table to make the comparison easier; for jina they are the same. @bwanglzu Thank you very much!

Samoed · 2024-10-24T13:35:56Z

Results Summary:

I was able to reproduce the results for jina-embeddings-v3, except for EmotionClassification. Overall, the results seem consistent.
For UAE-Large-V1, the results are close but differ for ToxicConversationsClassification.
I couldn't reproduce the results for stella, and I'm considering removing it from this PR. I've opened an issue on HF regarding this: link. I think this is because of a different embedding dimension, but I'm not sure. (UPD: Reran it as a GritLM model and the results are much better. I forgot that it is an instruct model.)
@bwanglzu

Full results

Classification

model name	AmazonCounterfactualClassification (en)	EmotionClassification	ToxicConversationsClassification
jina-embeddings-v3 (leaderboard)	89.49	73.3	91.29
jina-embeddings-v3	89.34	77.26	91.25
UAE-Large-V1 (leaderboard)	75.55	51.75	71.09
UAE-Large-V1	74.77	51.75	66.93
stella_en_400M_v5 (leaderboard)	92.36	78.77	89.94
stella_en_400M_v5	91.76	81.44	88.11
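
The gaps called out in the summary can be read off the Classification table with a short script. This is a sketch only: the scores are copied from the table above, while the helper name `flag_gaps` and the 1.0-point tolerance are assumptions chosen for illustration, not a threshold used by mteb.

```python
# Scores copied from the Classification table as (leaderboard, reproduced).
SCORES = {
    "jina-embeddings-v3": {
        "AmazonCounterfactualClassification (en)": (89.49, 89.34),
        "EmotionClassification": (73.3, 77.26),
        "ToxicConversationsClassification": (91.29, 91.25),
    },
    "UAE-Large-V1": {
        "AmazonCounterfactualClassification (en)": (75.55, 74.77),
        "EmotionClassification": (51.75, 51.75),
        "ToxicConversationsClassification": (71.09, 66.93),
    },
    "stella_en_400M_v5": {
        "AmazonCounterfactualClassification (en)": (92.36, 91.76),
        "EmotionClassification": (78.77, 81.44),
        "ToxicConversationsClassification": (89.94, 88.11),
    },
}


def flag_gaps(scores, tolerance=1.0):
    """Return {(model, task): reproduced - leaderboard} where |gap| > tolerance."""
    flagged = {}
    for model, tasks in scores.items():
        for task, (leaderboard, reproduced) in tasks.items():
            delta = round(reproduced - leaderboard, 2)
            if abs(delta) > tolerance:  # tolerance is an illustrative assumption
                flagged[(model, task)] = delta
    return flagged


for (model, task), delta in flag_gaps(SCORES).items():
    print(f"{model:20s} {task:34s} {delta:+.2f}")
```

With this tolerance, the flagged entries are EmotionClassification for jina-embeddings-v3 and stella_en_400M_v5, and ToxicConversationsClassification for UAE-Large-V1 and stella_en_400M_v5.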

Clustering

model name	ArxivClusteringS2S	RedditClustering
jina-embeddings-v3 (leaderboard)	39.27	55.4
jina-embeddings-v3	39.24	55.18
UAE-Large-V1 (leaderboard)	43.09	60.52
UAE-Large-V1	43.01	59.77
stella_en_400M_v5 (leaderboard)	49.82	71.19
stella_en_400M_v5	49.67	70.67

PairClassification

model name	SprintDuplicateQuestions	TwitterSemEval2015
jina-embeddings-v3 (leaderboard)	96.99	70.9
jina-embeddings-v3	96.99	70.9
UAE-Large-V1 (leaderboard)	97.24	78.17
UAE-Large-V1	97.23	78.16
stella_en_400M_v5 (leaderboard)	95.59	80.18
stella_en_400M_v5	95.50	80.26

Reranking

model name	SciDocsRR	AskUbuntuDupQuestions
jina-embeddings-v3 (leaderboard)	84.88	65.04
jina-embeddings-v3	84.86	65.31
UAE-Large-V1 (leaderboard)	87.49	64.2
UAE-Large-V1	87.03	63.12
stella_en_400M_v5 (leaderboard)	88.44	66.15
stella_en_400M_v5	88.16	65.55

Retrieval

model name	SCIDOCS	SciFact
jina-embeddings-v3 (leaderboard)	19.81	72.31
jina-embeddings-v3	19.87	72.68
UAE-Large-V1 (leaderboard)	22.98	74.07
UAE-Large-V1	22.98	74.07
stella_en_400M_v5 (leaderboard)	25.04	78.23
stella_en_400M_v5	23.96	77.96

STS

model name	STS16	STSBenchmark
jina-embeddings-v3 (leaderboard)	86.85	89.44
jina-embeddings-v3	86.83	89.44
UAE-Large-V1 (leaderboard)	86.61	89.06
UAE-Large-V1	86.61	89.06
stella_en_400M_v5 (leaderboard)	87.14	87.74
stella_en_400M_v5	87.00	87.56

Summarization

model name	SummEval
jina-embeddings-v3 (leaderboard)	29.71
jina-embeddings-v3	29.71
UAE-Large-V1 (leaderboard)	32.03
UAE-Large-V1	31.60
stella_en_400M_v5 (leaderboard)	31.66
stella_en_400M_v5	30.59

bwanglzu · 2024-10-24T14:49:57Z

Perfect, thanks @Samoed! It seems the score we reported on Emotion is lower than what we actually get (lol).

Do you mind sharing your script with me so that I can run a few more experiments?

BTW, some of our reported scores might come from a smaller context length such as 512. I don't recall which datasets we evaluated with a 512 context length, but I believe it was most of the MTEB tasks except LongEmbed.

Samoed · 2024-10-24T15:00:24Z

Here is my code

Samoed · 2024-10-24T15:02:37Z

BTW, some of our reported scores might come from a smaller context length such as 512

What do you mean? I think mteb, when it uses SentenceTransformer, uses the full context length.

bwanglzu · 2024-10-25T10:12:05Z

What do you mean? I think mteb, when it uses SentenceTransformer, uses the full context length.

I mean that when we submit scores, there is a small chance they were submitted by different people on the team who used slightly different max sequence lengths. Sometimes we use 512 to speed up evaluation, and sometimes we use the full context length, which is 8192.
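
One way to check whether the sequence length explains a score gap is to cap it temporarily and rerun a task. The sketch below assumes the model object exposes a writable `max_seq_length` attribute, as SentenceTransformer models do; the helper name `limited_context` is hypothetical, and this is not the script used for the results above.

```python
from contextlib import contextmanager


@contextmanager
def limited_context(model, max_seq_length):
    """Temporarily cap model.max_seq_length, restoring the old value afterwards."""
    previous = model.max_seq_length
    model.max_seq_length = max_seq_length
    try:
        yield model
    finally:
        model.max_seq_length = previous


# Illustrative usage (requires sentence-transformers; not executed here):
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True)
# with limited_context(model, 512):
#     embeddings = model.encode(["a long document ..."])
```

Running the same task at 512 and at the full length would show how much of a difference the setting makes.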

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

# Conflicts: # mteb/models/overview.py

KennethEnevoldsen

Only a minor thing otherwise all good

Samoed · 2024-10-28T16:52:57Z

@KennethEnevoldsen Is this PR ready for merge?

bwanglzu · 2024-10-29T13:30:15Z

i tested a few more benchmarks and the results are consistent, thanks @Samoed !

KennethEnevoldsen

This looks good! Very happy to have it merged in

Samoed added 4 commits October 22, 2024 10:39

add models

ce5e3dd

fix

0b411c6

fix

f02c4a3

fix prompt

2ca8509

Samoed mentioned this pull request Oct 24, 2024

feat: add jina-v3 into model list #1318

Closed
