Add jina, uae, stella models #1319

Merged
KennethEnevoldsen merged 19 commits into embeddings-benchmark:main from Samoed:add_jina_models on Oct 30, 2024

@Samoed Samoed commented Oct 24, 2024

Member

Checklist

  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.

Adding a model checklist

  • I have filled out the ModelMeta object to the extent possible
  • I have ensured that my model can be loaded using
    • mteb.get_model(model_name, revision) and
    • mteb.get_model_meta(model_name, revision)
  • I have tested the implementation works on a representative set of tasks.

@Samoed Samoed mentioned this pull request Oct 24, 2024
15 tasks
Comment thread mteb/models/jina_models.py
Comment thread mteb/models/jina_models.py Outdated
Comment thread mteb/models/jina_models.py Outdated
Co-authored-by: Wang Bo <bo.wang@jina.ai>
Comment thread mteb/models/jina_models.py Outdated
Co-authored-by: Wang Bo <bo.wang@jina.ai>
Comment thread mteb/models/jina_models.py Outdated
Comment thread mteb/models/jina_models.py Outdated
@bwanglzu


The rest looks good to me. I need to run some checks to make sure that the different task adapters (especially retrieval), task, and prompt_name are passed correctly and that we can reproduce our reported results. I'm running some small tests now.

Samoed commented Oct 24, 2024

Member Author

I have results and will paste them soon (I'm currently creating a table to make them easy to compare; for jina they are the same). @bwanglzu Thank you very much!

Samoed commented Oct 24, 2024

Member Author

Results Summary:

  1. I was able to reproduce the results for jina-embeddings-v3, except for EmotionClassification. Overall, the results seem consistent.
  2. For UAE-Large-V1, the results are close but differ for ToxicConversationsClassification.
  3. I couldn't reproduce the results for stella. I'm considering removing it from this PR. I've opened an issue on HF regarding this: link. I think this is because they have different dimensions, but I'm not sure. (UPD: I reran it as a GritLM model and the results are much better. I forgot that it is an instruct model.)
    @bwanglzu

Full results

Classification

| model name | AmazonCounterfactualClassification (en) | EmotionClassification | ToxicConversationsClassification |
|---|---|---|---|
| jina-embeddings-v3 (leaderboard) | 89.49 | 73.3 | 91.29 |
| jina-embeddings-v3 | 89.34 | 77.26 | 91.25 |
| UAE-Large-V1 (leaderboard) | 75.55 | 51.75 | 71.09 |
| UAE-Large-V1 | 74.77 | 51.75 | 66.93 |
| stella_en_400M_v5 (leaderboard) | 92.36 | 78.77 | 89.94 |
| stella_en_400M_v5 | 91.76 | 81.44 | 88.11 |

Clustering

| model name | ArxivClusteringS2S | RedditClustering |
|---|---|---|
| jina-embeddings-v3 (leaderboard) | 39.27 | 55.4 |
| jina-embeddings-v3 | 39.24 | 55.18 |
| UAE-Large-V1 (leaderboard) | 43.09 | 60.52 |
| UAE-Large-V1 | 43.01 | 59.77 |
| stella_en_400M_v5 (leaderboard) | 49.82 | 71.19 |
| stella_en_400M_v5 | 49.67 | 70.67 |

PairClassification

| model name | SprintDuplicateQuestions | TwitterSemEval2015 |
|---|---|---|
| jina-embeddings-v3 (leaderboard) | 96.99 | 70.9 |
| jina-embeddings-v3 | 96.99 | 70.9 |
| UAE-Large-V1 (leaderboard) | 97.24 | 78.17 |
| UAE-Large-V1 | 97.23 | 78.16 |
| stella_en_400M_v5 (leaderboard) | 95.59 | 80.18 |
| stella_en_400M_v5 | 95.50 | 80.26 |

Reranking

| model name | SciDocsRR | AskUbuntuDupQuestions |
|---|---|---|
| jina-embeddings-v3 (leaderboard) | 84.88 | 65.04 |
| jina-embeddings-v3 | 84.86 | 65.31 |
| UAE-Large-V1 (leaderboard) | 87.49 | 64.2 |
| UAE-Large-V1 | 87.03 | 63.12 |
| stella_en_400M_v5 (leaderboard) | 88.44 | 66.15 |
| stella_en_400M_v5 | 88.16 | 65.55 |

Retrieval

| model name | SCIDOCS | SciFact |
|---|---|---|
| jina-embeddings-v3 (leaderboard) | 19.81 | 72.31 |
| jina-embeddings-v3 | 19.87 | 72.68 |
| UAE-Large-V1 (leaderboard) | 22.98 | 74.07 |
| UAE-Large-V1 | 22.98 | 74.07 |
| stella_en_400M_v5 (leaderboard) | 25.04 | 78.23 |
| stella_en_400M_v5 | 23.96 | 77.96 |

STS

| model name | STS16 | STSBenchmark |
|---|---|---|
| jina-embeddings-v3 (leaderboard) | 86.85 | 89.44 |
| jina-embeddings-v3 | 86.83 | 89.44 |
| UAE-Large-V1 (leaderboard) | 86.61 | 89.06 |
| UAE-Large-V1 | 86.61 | 89.06 |
| stella_en_400M_v5 (leaderboard) | 87.14 | 87.74 |
| stella_en_400M_v5 | 87.00 | 87.56 |

Summarization

| model name | SummEval |
|---|---|
| jina-embeddings-v3 (leaderboard) | 29.71 |
| jina-embeddings-v3 | 29.71 |
| UAE-Large-V1 (leaderboard) | 32.03 |
| UAE-Large-V1 | 31.60 |
| stella_en_400M_v5 (leaderboard) | 31.66 |
| stella_en_400M_v5 | 30.59 |
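
The comparison behind the summary can be sketched roughly as follows. This is only an illustration, not the script used in this PR: the scores are copied from the Classification table, while the `TOLERANCE` value and the `flag_mismatches` helper are hypothetical names introduced here to show how a gap between leaderboard and reproduced scores might be flagged.

```python
# Scores copied from the Classification table: (leaderboard, reproduced).
SCORES = {
    "jina-embeddings-v3": {
        "AmazonCounterfactualClassification (en)": (89.49, 89.34),
        "EmotionClassification": (73.3, 77.26),
        "ToxicConversationsClassification": (91.29, 91.25),
    },
    "UAE-Large-V1": {
        "AmazonCounterfactualClassification (en)": (75.55, 74.77),
        "EmotionClassification": (51.75, 51.75),
        "ToxicConversationsClassification": (71.09, 66.93),
    },
    "stella_en_400M_v5": {
        "AmazonCounterfactualClassification (en)": (92.36, 91.76),
        "EmotionClassification": (78.77, 81.44),
        "ToxicConversationsClassification": (89.94, 88.11),
    },
}

# Assumption: a tolerance of 1.0 score point, chosen only for this example.
# It is not a threshold defined by mteb or by the PR authors.
TOLERANCE = 1.0


def flag_mismatches(scores, tolerance=TOLERANCE):
    """Map each model to the (task, reproduced - leaderboard) pairs whose
    absolute gap exceeds the tolerance."""
    flagged = {}
    for model, tasks in scores.items():
        flagged[model] = [
            (task, round(reproduced - leaderboard, 2))
            for task, (leaderboard, reproduced) in tasks.items()
            if abs(reproduced - leaderboard) > tolerance
        ]
    return flagged


FLAGGED = flag_mismatches(SCORES)
for model, gaps in FLAGGED.items():
    print(model, gaps)
```

With this tolerance, the sketch flags only EmotionClassification for jina-embeddings-v3 and only ToxicConversationsClassification for UAE-Large-V1, which matches the summary; a different tolerance would flag a different set of tasks.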

bwanglzu commented Oct 24, 2024

Perfect, thanks @Samoed! It seems our reported score on Emotion is lower than what we actually get (lol).

Would you mind sharing your script so that I can run a few more experiments?

BTW, some of our reported scores might come from a smaller context length such as 512. I do not recall which datasets we evaluated with a 512 context length, but I believe it was most of the MTEB tasks except LongEmbed.

Samoed commented Oct 24, 2024

Member Author

Here is my code

Samoed commented Oct 24, 2024

Member Author

> BTW, some of our reported scores might come from a smaller context length such as 512

What do you mean? I think that mteb, when using SentenceTransformer, uses the full context length.

@Samoed Samoed marked this pull request as ready for review October 24, 2024 17:57
@bwanglzu


> What do you mean? I think that mteb, when using SentenceTransformer, uses the full context length.

I mean that when we submit scores, there is a small chance that they are submitted by different people on the team who use slightly different max sequence lengths. Sometimes we use 512 to speed up evaluation, and sometimes we use the full context length, which is 8192.

Comment thread mteb/models/e5_instruct.py
Comment thread mteb/models/jina_models.py Outdated
Comment thread mteb/models/jina_models.py Outdated
Comment thread mteb/models/jina_models.py Outdated
Samoed and others added 2 commits October 25, 2024 22:47
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

@KennethEnevoldsen KennethEnevoldsen left a comment

Contributor

Only a minor thing, otherwise all good.

Comment thread pyproject.toml Outdated
Samoed commented Oct 28, 2024

Member Author

@KennethEnevoldsen Is this PR ready for merge?

@bwanglzu


I tested a few more benchmarks and the results are consistent, thanks @Samoed!

@KennethEnevoldsen KennethEnevoldsen left a comment

Contributor

This looks good! Very happy to have it merged in

@KennethEnevoldsen KennethEnevoldsen merged commit 0b846ff into embeddings-benchmark:main Oct 30, 2024
@Samoed Samoed deleted the add_jina_models branch October 20, 2025 15:47