
Qzhou embedding results #250

Merged
KennethEnevoldsen merged 5 commits into embeddings-benchmark:main from PennyYu123:qzhou-embedding-results on Aug 26, 2025

@PennyYu123

Contributor

We have released the HF model publicly and resubmitted the mteb implementation.

Checklist

  • My model has a model sheet, report or similar
  • My model has a reference implementation in mteb/models/; this can be an API. Instructions on how to add a model can be found here
    • No, but there is an existing PR ___
  • The results submitted are obtained using the reference implementation
  • [x] My model is available, either as a publicly accessible API or publicly on e.g., Huggingface
  • [x] I solemnly swear that for all results submitted I have not trained on the evaluation dataset, including training splits. If I have, I have disclosed it clearly.

@github-actions

github-actions Bot commented Aug 3, 2025


Model Results Comparison

Reference models: intfloat/multilingual-e5-large, google/gemini-embedding-001
New models evaluated: Kingsoft-LLM/QZhou-Embedding
Tasks: AFQMC, ATEC, AmazonCounterfactualClassification, ArXivHierarchicalClusteringP2P, ArXivHierarchicalClusteringS2S, ArguAna, AskUbuntuDupQuestions, BIOSSES, BQ, Banking77Classification, BiorxivClusteringP2P.v2, CLSClusteringP2P, CLSClusteringS2S, CMedQAv1-reranking, CMedQAv2-reranking, CQADupstackGamingRetrieval, CQADupstackUnixRetrieval, ClimateFEVERHardNegatives, CmedqaRetrieval, Cmnli, CovidRetrieval, DuRetrieval, EcomRetrieval, FEVERHardNegatives, FiQA2018, HotpotQAHardNegatives, IFlyTek, ImdbClassification, JDReview, LCQMC, MMarcoReranking, MMarcoRetrieval, MTOPDomainClassification, MassiveIntentClassification, MassiveScenarioClassification, MedicalRetrieval, MedrxivClusteringP2P.v2, MedrxivClusteringS2S.v2, MindSmallReranking, MultilingualSentiment, Ocnli, OnlineShopping, PAWSX, QBQTC, SCIDOCS, SICK-R, STS12, STS13, STS14, STS15, STS17, STS22.v2, STSB, STSBenchmark, SprintDuplicateQuestions, StackExchangeClustering.v2, StackExchangeClusteringP2P.v2, SummEvalSummarization.v2, T2Reranking, T2Retrieval, TNews, TRECCOVID, ThuNewsClusteringP2P, ThuNewsClusteringS2S, Touche2020Retrieval.v3, ToxicConversationsClassification, TweetSentimentExtractionClassification, TwentyNewsgroupsClustering.v2, TwitterSemEval2015, TwitterURLCorpus, VideoRetrieval, Waimai

Results for Kingsoft-LLM/QZhou-Embedding

task_name Kingsoft-LLM/QZhou-Embedding google/gemini-embedding-001 intfloat/multilingual-e5-large Max result
AFQMC 0.67 nan 0.33 0.72
ATEC 0.55 nan 0.4 0.65
AmazonCounterfactualClassification 0.93 0.88 0.7 0.97
ArXivHierarchicalClusteringP2P 0.66 0.65 0.56 0.69
ArXivHierarchicalClusteringS2S 0.64 0.64 0.54 0.65
ArguAna 0.84 0.86 0.54 0.90
AskUbuntuDupQuestions 0.69 0.64 0.59 0.70
BIOSSES 0.93 0.89 0.85 0.97
BQ 0.77 nan 0.48 0.81
Banking77Classification 0.85 0.94 0.75 0.94
BiorxivClusteringP2P.v2 0.54 0.54 0.37 0.56
CLSClusteringP2P 0.65 nan nan 0.82
CLSClusteringS2S 0.61 nan nan 0.74
CMedQAv1-reranking 0.94 nan 0.68 0.94
CMedQAv2-reranking 0.94 nan 0.67 0.94
CQADupstackGamingRetrieval 0.76 0.71 0.59 0.79
CQADupstackUnixRetrieval 0.71 0.54 0.4 0.72
ClimateFEVERHardNegatives 0.49 0.31 0.26 0.49
CmedqaRetrieval 0.52 nan 0.29 0.57
Cmnli 0.95 nan nan 0.95
CovidRetrieval 0.93 0.79 0.76 0.96
DuRetrieval 0.92 nan 0.85 0.94
EcomRetrieval 0.77 nan 0.55 0.78
FEVERHardNegatives 0.94 0.89 0.84 0.95
FiQA2018 0.60 0.62 0.44 0.80
HotpotQAHardNegatives 0.81 0.87 0.71 0.87
IFlyTek 0.58 nan 0.42 0.58
ImdbClassification 0.96 0.95 0.89 0.97
JDReview 0.88 nan 0.81 0.92
LCQMC 0.82 nan 0.76 0.82
MMarcoReranking 0.44 nan 0.29 0.47
MMarcoRetrieval 0.83 nan 0.79 0.90
MTOPDomainClassification 0.96 0.98 0.9 1.00
MassiveIntentClassification 0.55 0.82 0.6 0.92
MassiveScenarioClassification 0.74 0.87 0.7 0.99
MedicalRetrieval 0.73 nan 0.51 0.76
MedrxivClusteringP2P.v2 0.50 0.47 0.34 0.52
MedrxivClusteringS2S.v2 0.48 0.45 0.32 0.51
MindSmallReranking 0.34 0.33 0.3 0.34
MultilingualSentiment 0.85 nan 0.71 0.85
Ocnli 0.95 nan nan 0.95
OnlineShopping 0.96 nan 0.9 0.97
PAWSX 0.70 nan 0.15 0.70
QBQTC 0.60 nan nan 0.71
SCIDOCS 0.29 0.25 0.17 0.35
SICK-R 0.88 0.83 0.8 0.95
STS12 0.90 0.82 0.8 0.95
STS13 0.96 0.90 0.82 0.98
STS14 0.93 0.85 0.78 0.98
STS15 0.95 0.90 0.89 0.98
STS17 0.89 0.89 0.82 0.93
STS22.v2 0.77 0.72 0.64 0.77
STSB 0.92 0.85 0.82 0.92
STSBenchmark 0.95 0.89 0.87 0.95
SprintDuplicateQuestions 0.98 0.97 0.93 0.98
StackExchangeClustering.v2 0.76 0.92 0.46 0.92
StackExchangeClusteringP2P.v2 0.55 0.51 0.39 0.55
SummEvalSummarization.v2 0.33 0.38 0.31 0.39
T2Reranking 0.68 0.68 0.66 0.73
T2Retrieval 0.82 nan 0.76 0.89
TNews 0.61 nan 0.49 0.61
TRECCOVID 0.78 0.86 0.71 0.95
ThuNewsClusteringP2P 0.82 nan nan 0.89
ThuNewsClusteringS2S 0.76 nan nan 0.88
Touche2020Retrieval.v3 0.50 0.52 0.5 0.75
ToxicConversationsClassification 0.90 0.89 0.66 0.98
TweetSentimentExtractionClassification 0.77 0.70 0.63 0.88
TwentyNewsgroupsClustering.v2 0.81 0.57 0.39 0.88
TwitterSemEval2015 0.87 0.79 0.75 0.89
TwitterURLCorpus 0.92 0.87 0.86 0.96
VideoRetrieval 0.79 nan 0.58 0.84
Waimai 0.92 nan 0.86 0.92
Average 0.76 0.73 0.61 0.81

@KennethEnevoldsen KennethEnevoldsen added the "waiting for review of implementation" label (this PR is waiting for an implementation review before merging the results) Aug 3, 2025
correct model_meta info
@PennyYu123

Contributor Author

past PR: #249

@PennyYu123 PennyYu123 left a comment

Contributor Author

completed

@KennethEnevoldsen

Contributor

We are still waiting for the model PR to merge :)

@KennethEnevoldsen

Contributor

Thanks! I have looked over the scores, and a few seem suspiciously high:

  • AmazonCounterfactualClassification
  • AskUbuntuDupQuestions
  • BQ
  • Waimai
  • TNews
  • IFlyTek
  • ...

However, it seems like these are not in the annotated training data:

import mteb
meta = mteb.get_model_meta("Kingsoft-LLM/QZhou-Embedding")

# in training data
"AmazonCounterfactualClassification" in meta.training_datasets # True

# not in:
"AskUbuntuDupQuestions" in meta.training_datasets # False
"BQ" in meta.training_datasets # False
"Waimai" in meta.training_datasets # False
"TNews" in meta.training_datasets # False
"IFlyTek" in meta.training_datasets # False

@PennyYu123 can you help me figure out these scores? Could you have missed some annotations or synthetically generated matching training data?
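The membership checks above can be run over the whole list of flagged tasks in one pass. The following is a minimal sketch: the helper name `unannotated_tasks` and the stand-in set `example_annotation` are hypothetical and only mirror the True/False results quoted above. With mteb installed, the annotation would come from `mteb.get_model_meta(...).training_datasets` as in the snippet above.

```python
from typing import Collection, Iterable, List, Optional


def unannotated_tasks(
    tasks: Iterable[str],
    training_datasets: Optional[Collection[str]],
) -> List[str]:
    """Return the tasks that are not listed in the training-data annotation."""
    # Treat a missing annotation (None) as "nothing annotated".
    annotated = training_datasets if training_datasets is not None else ()
    return [task for task in tasks if task not in annotated]


# Task names flagged in the review comment above.
flagged = [
    "AmazonCounterfactualClassification",
    "AskUbuntuDupQuestions",
    "BQ",
    "Waimai",
    "TNews",
    "IFlyTek",
]

# Hypothetical stand-in mirroring the True/False results quoted above.
# With mteb installed, this would instead be:
#   mteb.get_model_meta("Kingsoft-LLM/QZhou-Embedding").training_datasets
example_annotation = {"AmazonCounterfactualClassification"}

for task in unannotated_tasks(flagged, example_annotation):
    print(f"{task}: not annotated as training data")
```

Any task printed here is one where a high score is not explained by the declared training data, which is the case the question above is about.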

@Samoed

Samoed commented Aug 10, 2025

Member

If you allow, I can open another PR to mteb to add the missing training sets.

Yes, it would be great if you'd add them to the training datasets.

@Samoed

Samoed commented Aug 11, 2025

Member

You can add your new scores in a new subfolder for your new revision.
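As a rough illustration of the per-revision subfolder, the sketch below builds a results path for a model and revision. The layout (a `results` root, with `/` in the model name replaced by `__`, followed by the revision) is an assumption about the results repository and not something stated in this thread; the helper `results_dir` and the placeholder revision string are hypothetical.

```python
from pathlib import Path


def results_dir(root: Path, model_name: str, revision: str) -> Path:
    """Build <root>/<org>__<model>/<revision> (assumed layout, see above)."""
    return root / model_name.replace("/", "__") / revision


# "<new-revision>" is a placeholder; the actual revision is not given here.
path = results_dir(Path("results"), "Kingsoft-LLM/QZhou-Embedding", "<new-revision>")
print(path.as_posix())
```

Keeping each revision in its own folder leaves the scores of the earlier weights in place next to the new ones.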

@PennyYu123

Contributor Author

Hello, our new model results have been uploaded, and we have submitted a PR to the mteb repo. We have also replaced the original model parameter file on Hugging Face with our new one. Let's continue the previous process. 😊😊😊

@KennethEnevoldsen

Contributor

Hi @PennyYu123,

I have merged the PR, but it seems like there are still some datasets missing from the list that you provided:

import mteb
meta = mteb.get_model_meta("Kingsoft-LLM/QZhou-Embedding")
"AmazonCounterfactualClassification" in meta.training_datasets # True
"AskUbuntuDupQuestions" in meta.training_datasets # False
"BQ" in meta.training_datasets # False
"Waimai" in meta.training_datasets # True (fixed)
"TNews" in meta.training_datasets # False (fixed)
"IFlyTek" in meta.training_datasets # False
# do also check the remainder of the list

Can I ask you to update the training datasets again?

@KennethEnevoldsen KennethEnevoldsen removed the "waiting for review of implementation" label (this PR is waiting for an implementation review before merging the results) Aug 16, 2025
@KennethEnevoldsen

Contributor

we restructured our training plan and trained a new model last week

Ahh, great, I will rerun the table to see if there are any remaining concerns.

Oh, I have another request. We have an important presentation on Monday, and we might need our results to be on the leaderboard. Can you get it done in the next few days?

I am back from holiday, so that should be possible. Sorry that you had to wait due to the holiday; normally, it takes no more than 1-2 days.

If you can help us complete it, we'd like to contribute more to our open source community. I'm also engaged in retrieval model research, and we may collaborate in the future.

We, of course, always appreciate collaboration and contributions, but let us keep that out of the review process :)

@Samoed

Samoed commented Aug 18, 2025

Member

@KennethEnevoldsen

@KennethEnevoldsen KennethEnevoldsen left a comment

Contributor

Ahh! forgot to press submit on the review...

I have added the updated table below. There are still a few that seem concerning:

  • TwitterSemEval2015
  • SCIDOCS
  • AskUbuntuDupQuestions

Model Results Comparison

Reference models: intfloat/multilingual-e5-large, google/gemini-embedding-001
New models evaluated: Kingsoft-LLM/QZhou-Embedding
Tasks: AFQMC, ATEC, AmazonCounterfactualClassification, ArXivHierarchicalClusteringP2P, ArXivHierarchicalClusteringS2S, ArguAna, AskUbuntuDupQuestions, BIOSSES, BQ, Banking77Classification, BiorxivClusteringP2P.v2, CLSClusteringP2P, CLSClusteringS2S, CMedQAv1-reranking, CMedQAv2-reranking, CQADupstackGamingRetrieval, CQADupstackUnixRetrieval, ClimateFEVERHardNegatives, CmedqaRetrieval, Cmnli, CovidRetrieval, DuRetrieval, EcomRetrieval, FEVERHardNegatives, FiQA2018, HotpotQAHardNegatives, IFlyTek, ImdbClassification, JDReview, LCQMC, MMarcoReranking, MMarcoRetrieval, MTOPDomainClassification, MassiveIntentClassification, MassiveScenarioClassification, MedicalRetrieval, MedrxivClusteringP2P.v2, MedrxivClusteringS2S.v2, MindSmallReranking, MultilingualSentiment, Ocnli, OnlineShopping, PAWSX, QBQTC, SCIDOCS, SICK-R, STS12, STS13, STS14, STS15, STS17, STS22.v2, STSB, STSBenchmark, SprintDuplicateQuestions, StackExchangeClustering.v2, StackExchangeClusteringP2P.v2, SummEvalSummarization.v2, T2Reranking, T2Retrieval, TNews, TRECCOVID, ThuNewsClusteringP2P, ThuNewsClusteringS2S, Touche2020Retrieval.v3, ToxicConversationsClassification, TweetSentimentExtractionClassification, TwentyNewsgroupsClustering.v2, TwitterSemEval2015, TwitterURLCorpus, VideoRetrieval, Waimai

Results for Kingsoft-LLM/QZhou-Embedding

task_name Kingsoft-LLM/QZhou-Embedding google/gemini-embedding-001 intfloat/multilingual-e5-large Max result
AFQMC 0.66 nan 0.33 0.72
ATEC 0.55 nan 0.4 0.65
AmazonCounterfactualClassification 0.93 0.88 0.7 0.97
ArXivHierarchicalClusteringP2P 0.66 0.65 0.56 0.69
ArXivHierarchicalClusteringS2S 0.64 0.64 0.54 0.65
ArguAna 0.84 0.86 0.54 0.90
AskUbuntuDupQuestions 0.75 0.64 0.59 0.75
BIOSSES 0.93 0.89 0.85 0.97
BQ 0.77 nan 0.48 0.81
Banking77Classification 0.85 0.94 0.75 0.94
BiorxivClusteringP2P.v2 0.55 0.54 0.37 0.56
CLSClusteringP2P 0.67 nan nan 0.82
CLSClusteringS2S 0.61 nan nan 0.74
CMedQAv1-reranking 0.94 nan 0.68 0.94
CMedQAv2-reranking 0.93 nan 0.67 0.93
CQADupstackGamingRetrieval 0.77 0.71 0.59 0.79
CQADupstackUnixRetrieval 0.70 0.54 0.4 0.72
ClimateFEVERHardNegatives 0.62 0.31 0.26 0.62
CmedqaRetrieval 0.51 nan 0.29 0.57
Cmnli 0.95 nan nan 0.95
CovidRetrieval 0.93 0.79 0.76 0.96
DuRetrieval 0.92 nan 0.85 0.94
EcomRetrieval 0.77 nan 0.55 0.78
FEVERHardNegatives 0.94 0.89 0.84 0.95
FiQA2018 0.60 0.62 0.44 0.80
HotpotQAHardNegatives 0.80 0.87 0.71 0.87
IFlyTek 0.57 nan 0.42 0.58
ImdbClassification 0.96 0.95 0.89 0.97
JDReview 0.90 nan 0.81 0.92
LCQMC 0.82 nan 0.76 0.82
MMarcoReranking 0.51 nan 0.29 0.51
MMarcoRetrieval 0.83 nan 0.79 0.90
MTOPDomainClassification 0.96 0.98 0.9 1.00
MassiveIntentClassification 0.55 0.82 0.6 0.92
MassiveScenarioClassification 0.73 0.87 0.7 0.99
MedicalRetrieval 0.72 nan 0.51 0.76
MedrxivClusteringP2P.v2 0.51 0.47 0.34 0.52
MedrxivClusteringS2S.v2 0.48 0.45 0.32 0.51
MindSmallReranking 0.36 0.33 0.3 0.36
MultilingualSentiment 0.85 nan 0.71 0.85
Ocnli 0.95 nan nan 0.95
OnlineShopping 0.96 nan 0.9 0.97
PAWSX 0.70 nan 0.15 0.70
QBQTC 0.61 nan nan 0.71
SCIDOCS 0.44 0.25 0.17 0.44
SICK-R 0.88 0.83 0.8 0.95
STS12 0.90 0.82 0.8 0.95
STS13 0.95 0.90 0.82 0.98
STS14 0.93 0.85 0.78 0.98
STS15 0.96 0.90 0.89 0.98
STS17 0.90 0.89 0.82 0.93
STS22.v2 0.78 0.72 0.64 0.78
STSB 0.92 0.85 0.82 0.92
STSBenchmark 0.96 0.89 0.87 0.96
SprintDuplicateQuestions 0.98 0.97 0.93 0.98
StackExchangeClustering.v2 0.76 0.92 0.46 0.92
StackExchangeClusteringP2P.v2 0.55 0.51 0.39 0.55
SummEvalSummarization.v2 0.34 0.38 0.31 0.39
T2Reranking 0.68 0.68 0.66 0.73
T2Retrieval 0.81 nan 0.76 0.89
TNews 0.60 nan 0.49 0.60
TRECCOVID 0.79 0.86 0.71 0.95
ThuNewsClusteringP2P 0.83 nan nan 0.89
ThuNewsClusteringS2S 0.78 nan nan 0.88
Touche2020Retrieval.v3 0.49 0.52 0.5 0.75
ToxicConversationsClassification 0.90 0.89 0.66 0.98
TweetSentimentExtractionClassification 0.77 0.70 0.63 0.88
TwentyNewsgroupsClustering.v2 0.82 0.57 0.39 0.88
TwitterSemEval2015 0.92 0.79 0.75 0.92
TwitterURLCorpus 0.92 0.87 0.86 0.96
VideoRetrieval 0.80 nan 0.58 0.84
Waimai 0.92 nan 0.86 0.92
Average 0.76 0.73 0.61 0.81
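One possible way to shortlist scores like the ones listed as concerning is to flag tasks where the new model both sets the maximum result and beats the best available reference model by a clear margin. This is only a sketch of such a heuristic, not necessarily how the tasks above were picked: the function `shortlist` and the 0.10 margin are assumptions, and the rows are a small subset copied from the table above.

```python
import math
from typing import Dict, List, Tuple

# (new model score, reference model scores, max result)
Row = Tuple[float, List[float], float]


def shortlist(rows: Dict[str, Row], margin: float = 0.10) -> List[str]:
    """Flag tasks where the new model sets the max and leads references by `margin`."""
    picked = []
    for task, (new, refs, best) in rows.items():
        # Ignore reference models that have no score for this task.
        known = [r for r in refs if not math.isnan(r)]
        if not known:
            continue
        if new >= best and new - max(known) >= margin:
            picked.append(task)
    return picked


nan = float("nan")
# Subset of rows from the table above:
# QZhou-Embedding, [gemini-embedding-001, multilingual-e5-large], Max result.
rows = {
    "ArguAna": (0.84, [0.86, 0.54], 0.90),
    "AskUbuntuDupQuestions": (0.75, [0.64, 0.59], 0.75),
    "BQ": (0.77, [nan, 0.48], 0.81),
    "SCIDOCS": (0.44, [0.25, 0.17], 0.44),
    "STSB": (0.92, [0.85, 0.82], 0.92),
    "TwitterSemEval2015": (0.92, [0.79, 0.75], 0.92),
}

print(shortlist(rows))
# ['AskUbuntuDupQuestions', 'SCIDOCS', 'TwitterSemEval2015']
```

A flagged task is only a prompt for a question about training data, not evidence of a problem on its own.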

@KennethEnevoldsen

Contributor

@PennyYu123 can you help me understand the few concerning datasets? Might there be missing dataset annotations?

@PennyYu123

Contributor Author

We have concurrently updated the following components:
1. Hugging Face model parameters - Full refresh of embedding weights
2. MTEB model_meta - Updated release_date and revision
All upgrades are now live and operational.

@KennethEnevoldsen

KennethEnevoldsen commented Aug 25, 2025

Contributor

PR that updates the model revision: embeddings-benchmark/mteb#3069

I will recompute the table

@KennethEnevoldsen

Contributor

Model Results Comparison

Reference models: intfloat/multilingual-e5-large, google/gemini-embedding-001
New models evaluated: Kingsoft-LLM/QZhou-Embedding
Tasks: AFQMC, ATEC, AmazonCounterfactualClassification, ArXivHierarchicalClusteringP2P, ArXivHierarchicalClusteringS2S, ArguAna, AskUbuntuDupQuestions, BIOSSES, BQ, Banking77Classification, BiorxivClusteringP2P.v2, CLSClusteringP2P, CLSClusteringS2S, CMedQAv1-reranking, CMedQAv2-reranking, CQADupstackGamingRetrieval, CQADupstackUnixRetrieval, ClimateFEVERHardNegatives, CmedqaRetrieval, Cmnli, CovidRetrieval, DuRetrieval, EcomRetrieval, FEVERHardNegatives, FiQA2018, HotpotQAHardNegatives, IFlyTek, ImdbClassification, JDReview, LCQMC, MMarcoReranking, MMarcoRetrieval, MTOPDomainClassification, MassiveIntentClassification, MassiveScenarioClassification, MedicalRetrieval, MedrxivClusteringP2P.v2, MedrxivClusteringS2S.v2, MindSmallReranking, MultilingualSentiment, Ocnli, OnlineShopping, PAWSX, QBQTC, SCIDOCS, SICK-R, STS12, STS13, STS14, STS15, STS17, STS22.v2, STSB, STSBenchmark, SprintDuplicateQuestions, StackExchangeClustering.v2, StackExchangeClusteringP2P.v2, SummEvalSummarization.v2, T2Reranking, T2Retrieval, TNews, TRECCOVID, ThuNewsClusteringP2P, ThuNewsClusteringS2S, Touche2020Retrieval.v3, ToxicConversationsClassification, TweetSentimentExtractionClassification, TwentyNewsgroupsClustering.v2, TwitterSemEval2015, TwitterURLCorpus, VideoRetrieval, Waimai

Results for Kingsoft-LLM/QZhou-Embedding

task_name Kingsoft-LLM/QZhou-Embedding google/gemini-embedding-001 intfloat/multilingual-e5-large Max result
AFQMC 0.67 nan 0.33 0.72
ATEC 0.55 nan 0.4 0.65
AmazonCounterfactualClassification 0.93 0.88 0.7 0.97
ArXivHierarchicalClusteringP2P 0.66 0.65 0.56 0.69
ArXivHierarchicalClusteringS2S 0.64 0.64 0.54 0.65
ArguAna 0.84 0.86 0.54 0.90
AskUbuntuDupQuestions 0.69 0.64 0.59 0.70
BIOSSES 0.93 0.89 0.85 0.97
BQ 0.77 nan 0.48 0.81
Banking77Classification 0.85 0.94 0.75 0.94
BiorxivClusteringP2P.v2 0.54 0.54 0.37 0.56
CLSClusteringP2P 0.65 nan nan 0.82
CLSClusteringS2S 0.61 nan nan 0.74
CMedQAv1-reranking 0.94 nan 0.68 0.94
CMedQAv2-reranking 0.94 nan 0.67 0.94
CQADupstackGamingRetrieval 0.76 0.71 0.59 0.79
CQADupstackUnixRetrieval 0.71 0.54 0.4 0.72
ClimateFEVERHardNegatives 0.49 0.31 0.26 0.49
CmedqaRetrieval 0.52 nan 0.29 0.57
Cmnli 0.95 nan nan 0.95
CovidRetrieval 0.93 0.79 0.76 0.96
DuRetrieval 0.92 nan 0.85 0.94
EcomRetrieval 0.77 nan 0.55 0.78
FEVERHardNegatives 0.94 0.89 0.84 0.95
FiQA2018 0.60 0.62 0.44 0.80
HotpotQAHardNegatives 0.81 0.87 0.71 0.87
IFlyTek 0.58 nan 0.42 0.58
ImdbClassification 0.96 0.95 0.89 0.97
JDReview 0.88 nan 0.81 0.92
LCQMC 0.82 nan 0.76 0.82
MMarcoReranking 0.44 nan 0.29 0.47
MMarcoRetrieval 0.83 nan 0.79 0.90
MTOPDomainClassification 0.96 0.98 0.9 1.00
MassiveIntentClassification 0.55 0.82 0.6 0.92
MassiveScenarioClassification 0.74 0.87 0.7 0.99
MedicalRetrieval 0.73 nan 0.51 0.76
MedrxivClusteringP2P.v2 0.50 0.47 0.34 0.52
MedrxivClusteringS2S.v2 0.48 0.45 0.32 0.51
MindSmallReranking 0.34 0.33 0.3 0.34
MultilingualSentiment 0.85 nan 0.71 0.85
Ocnli 0.95 nan nan 0.95
OnlineShopping 0.96 nan 0.9 0.97
PAWSX 0.70 nan 0.15 0.70
QBQTC 0.60 nan nan 0.71
SCIDOCS 0.29 0.25 0.17 0.35
SICK-R 0.88 0.83 0.8 0.95
STS12 0.90 0.82 0.8 0.95
STS13 0.96 0.90 0.82 0.98
STS14 0.93 0.85 0.78 0.98
STS15 0.95 0.90 0.89 0.98
STS17 0.89 0.89 0.82 0.93
STS22.v2 0.77 0.72 0.64 0.77
STSB 0.92 0.85 0.82 0.92
STSBenchmark 0.95 0.89 0.87 0.95
SprintDuplicateQuestions 0.98 0.97 0.93 0.98
StackExchangeClustering.v2 0.76 0.92 0.46 0.92
StackExchangeClusteringP2P.v2 0.55 0.51 0.39 0.55
SummEvalSummarization.v2 0.33 0.38 0.31 0.39
T2Reranking 0.68 0.68 0.66 0.73
T2Retrieval 0.82 nan 0.76 0.89
TNews 0.61 nan 0.49 0.61
TRECCOVID 0.78 0.86 0.71 0.95
ThuNewsClusteringP2P 0.82 nan nan 0.89
ThuNewsClusteringS2S 0.76 nan nan 0.88
Touche2020Retrieval.v3 0.50 0.52 0.5 0.75
ToxicConversationsClassification 0.90 0.89 0.66 0.98
TweetSentimentExtractionClassification 0.77 0.70 0.63 0.88
TwentyNewsgroupsClustering.v2 0.81 0.57 0.39 0.88
TwitterSemEval2015 0.87 0.79 0.75 0.89
TwitterURLCorpus 0.92 0.87 0.86 0.96
VideoRetrieval 0.79 nan 0.58 0.84
Waimai 0.92 nan 0.86 0.92
Average 0.76 0.73 0.61 0.81

@KennethEnevoldsen

Contributor

Alright, I think we finally got there! Congratulations again on the release :)

@KennethEnevoldsen KennethEnevoldsen merged commit 2369024 into embeddings-benchmark:main Aug 26, 2025
3 checks passed
@PennyYu123 PennyYu123 deleted the qzhou-embedding-results branch August 26, 2025 09:58