Update Seed1.5-Embedding revision 4 by namespace-Pt · Pull Request #205 · embeddings-benchmark/results

namespace-Pt · 2025-05-26T16:16:27Z

Checklist

My model has a model sheet, report or similar
My model has a reference implementation in mteb/models/; this can be an API. Instructions on how to add a model can be found here
- No, but there is an existing PR ___
The results submitted are obtained using the reference implementation
My model is available, either as a publicly accessible API or publicly on e.g., Huggingface
I solemnly swear that for all results submitted I have not trained on the dataset, including the training set. If I have, I have disclosed it clearly.

@KennethEnevoldsen created revision 4 in this PR. To show the correct results on the leaderboard, I've copied the results from revision 3 to revision 4.
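The copy step described here could be sketched roughly as follows. This is a minimal sketch, assuming each revision is a sub-folder of the model's results folder with one JSON file per task; the folder layout, the revision folder names, and the `copy_revision` helper are hypothetical and not taken from this PR.

```python
import shutil
from pathlib import Path


def copy_revision(model_dir: Path, src_rev: str, dst_rev: str) -> list[str]:
    """Copy every task result file from one revision folder to another.

    Assumption (hypothetical layout): model_dir/<revision>/<task>.json
    """
    src = model_dir / src_rev
    dst = model_dir / dst_rev
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(src.glob("*.json")):
        # copy2 also preserves file metadata such as modification time
        shutil.copy2(path, dst / path.name)
        copied.append(path.name)
    return copied
```

A call such as `copy_revision(Path("results/some-model"), "3", "4")` would return the names of the copied files, which makes it easy to check that no task was missed.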

Samoed · 2025-05-27T13:04:34Z

@KennethEnevoldsen Do these results look good?

KennethEnevoldsen

formatting looks reasonable

Here is the results table of MTEB(eng, v2):

task_name	ByteDance-Seed/Seed1.5-Embedding	google/gemini-embedding-001	intfloat/e5-large-v2	nvidia/NV-Embed-v2
AmazonCounterfactualClassification	0.92	0.93	0.78	0.79
ArXivHierarchicalClusteringP2P	0.65	0.65	0.58	0.60
ArXivHierarchicalClusteringS2S	0.64	0.64	0.55	0.59
ArguAna	0.78	0.86	0.46	0.70
AskUbuntuDupQuestions	0.69	0.64	0.6	0.67
BIOSSES	0.85	0.89	0.84	0.87
Banking77Classification	0.91	0.94	0.85	0.92
BiorxivClusteringP2P.v2	0.55	0.54	0.4	0.44
CQADupstackGamingRetrieval	0.70	0.71	0.58	0.65
CQADupstackUnixRetrieval	0.57	0.54	0.39	0.52
ClimateFEVERHardNegatives	0.47	0.31	0.23	0.33
FEVERHardNegatives	0.95	0.89	0.83	0.90
FiQA2018	0.66	0.62	0.41	0.66
HotpotQAHardNegatives	0.88	0.87	0.73	0.84
ImdbClassification	0.97	0.95	0.92	0.97
MTOPDomainClassification	0.99	0.99	0.93	0.96
MassiveIntentClassification	0.87	0.88	0.68	0.78
MassiveScenarioClassification	0.93	0.92	0.71	0.81
MedrxivClusteringP2P.v2	0.51	0.47	0.35	0.37
MedrxivClusteringS2S.v2	0.51	0.45	0.34	0.36
MindSmallReranking	0.32	0.33	0.32	0.32
SCIDOCS	0.25	0.25	0.2	0.22
SICK-R	0.84	0.83	0.79	0.82
STS12	0.85	0.82	0.74	0.78
STS13	0.92	0.90	0.81	0.88
STS14	0.90	0.85	0.79	0.84
STS15	0.92	0.90	0.88	0.89
STS17	0.93	0.92	0.9	0.91
STS22.v2	0.71	0.68	0.67	0.66
STSBenchmark	0.92	0.89	0.85	0.88
SprintDuplicateQuestions	0.97	0.97	0.95	0.97
StackExchangeClustering.v2	0.80	0.92	0.52	0.55
StackExchangeClusteringP2P.v2	0.52	0.51	0.4	0.45
SummEvalSummarization.v2	0.35	0.38	0.32	0.35
TRECCOVID	0.88	0.86	0.67	0.89
Touche2020Retrieval.v3	0.64	0.52	0.42	0.57
ToxicConversationsClassification	0.86	0.89	0.63	0.93
TweetSentimentExtractionClassification	0.72	0.70	0.61	0.81
TwentyNewsgroupsClustering.v2	0.63	0.57	0.48	0.45
TwitterSemEval2015	0.77	0.79	0.77	0.81
TwitterURLCorpus	0.87	0.87	0.86	0.88
Average	0.75	0.73	0.63	0.70

And here is the full table for all models:

task_name	ByteDance-Seed/Seed1.5-Embedding	google/gemini-embedding-001	intfloat/e5-large-v2	nvidia/NV-Embed-v2
AFQMC	0.57	nan	nan	nan
ATEC	0.54	nan	nan	nan
AmazonCounterfactualClassification	0.92	0.88	0.68	0.78
AmazonReviewsClassification	0.58	nan	0.35	0.47
ArXivHierarchicalClusteringP2P	0.65	0.65	0.58	0.60
ArXivHierarchicalClusteringS2S	0.64	0.64	0.55	0.59
ArguAna	0.78	0.86	0.46	0.70
AskUbuntuDupQuestions	0.69	0.64	0.6	0.67
BIOSSES	0.85	0.89	0.84	0.87
BQ	0.70	nan	nan	nan
Banking77Classification	0.91	0.94	0.85	0.92
BiorxivClusteringP2P.v2	0.55	0.54	0.4	0.44
BrightRetrieval	0.27	nan	nan	nan
CLSClusteringP2P	0.54	nan	nan	nan
CLSClusteringS2S	0.62	nan	nan	nan
CMedQAv1-reranking	0.82	nan	nan	nan
CMedQAv2-reranking	0.84	nan	0.23	0.76
CQADupstackGamingRetrieval	0.70	0.71	0.58	0.65
CQADupstackUnixRetrieval	0.57	0.54	0.39	0.52
ClimateFEVERHardNegatives	0.47	0.31	0.23	0.33
CmedqaRetrieval	0.52	nan	0.03	0.31
Cmnli	0.91	nan	nan	nan
CovidRetrieval	0.88	0.79	0.2	0.59
DuRetrieval	0.94	nan	nan	nan
EcomRetrieval	0.73	nan	nan	nan
FEVERHardNegatives	0.95	0.89	0.83	0.90
FiQA2018	0.66	0.62	0.41	0.66
HotpotQAHardNegatives	0.88	0.87	0.73	0.84
IFlyTek	0.56	nan	nan	nan
ImdbClassification	0.97	0.95	0.92	0.97
JDReview	0.89	nan	nan	nan
LCQMC	0.81	nan	nan	nan
MMarcoReranking	0.36	nan	nan	nan
MMarcoRetrieval	0.89	nan	nan	nan
MTOPDomainClassification	0.99	0.98	0.66	0.90
MassiveIntentClassification	0.86	0.82	0.33	0.58
MassiveScenarioClassification	0.92	0.87	0.4	0.63
MedicalRetrieval	0.71	nan	nan	nan
MedrxivClusteringP2P.v2	0.51	0.47	0.35	0.37
MedrxivClusteringS2S.v2	0.51	0.45	0.34	0.36
MindSmallReranking	0.32	0.33	0.32	0.32
MultilingualSentiment	0.83	nan	nan	nan
Ocnli	0.84	nan	nan	nan
OnlineShopping	0.96	nan	nan	nan
PAWSX	0.68	nan	nan	nan
QBQTC	0.52	nan	nan	nan
SCIDOCS	0.25	0.25	0.2	0.22
SICK-R	0.84	0.83	0.79	0.82
STS12	0.85	0.82	0.74	0.78
STS13	0.92	0.90	0.81	0.88
STS14	0.90	0.85	0.79	0.84
STS15	0.92	0.90	0.88	0.89
STS17	0.93	0.89	0.48	0.91
STS22.v2	0.72	0.72	0.57	0.61
STSB	0.86	0.85	0.43	0.78
STSBenchmark	0.92	0.89	0.85	0.88
SprintDuplicateQuestions	0.97	0.97	0.95	0.97
StackExchangeClustering.v2	0.80	0.92	0.52	0.55
StackExchangeClusteringP2P.v2	0.52	0.51	0.4	0.45
SummEvalSummarization.v2	0.35	0.38	0.32	0.35
T2Reranking	0.67	0.68	0.6	0.67
T2Retrieval	0.90	nan	nan	nan
TNews	0.57	nan	nan	nan
TRECCOVID	0.88	0.86	0.67	0.89
ThuNewsClusteringP2P	0.83	nan	nan	nan
ThuNewsClusteringS2S	0.85	nan	nan	nan
Touche2020Retrieval.v3	0.64	0.52	0.42	0.57
ToxicConversationsClassification	0.86	0.89	0.63	0.93
TweetSentimentExtractionClassification	0.72	0.70	0.61	0.81
TwentyNewsgroupsClustering.v2	0.63	0.57	0.48	0.45
TwitterSemEval2015	0.77	0.79	0.77	0.81
TwitterURLCorpus	0.87	0.87	0.86	0.88
VideoRetrieval	0.81	nan	nan	nan
Waimai	0.92	nan	nan	nan
Average	0.74	0.73	0.55	0.67
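Because many cells in the full table are `nan`, the Average row depends on how missing scores are treated. The thread does not say how the row was computed; the sketch below assumes missing scores are simply skipped, and the `nan_mean` helper is a hypothetical name used only for illustration.

```python
import math


def nan_mean(scores: list[float]) -> float:
    """Mean of the scores that are not NaN; NaN if none remain."""
    valid = [s for s in scores if not math.isnan(s)]
    return sum(valid) / len(valid) if valid else math.nan


# Three rows for intfloat/e5-large-v2, taken from the table above.
e5_scores = {
    "AFQMC": float("nan"),
    "AmazonReviewsClassification": 0.35,
    "ArguAna": 0.46,
}
# Only the two available scores contribute to the mean.
e5_average = nan_mean(list(e5_scores.values()))
```

Under this assumption, a model evaluated on fewer tasks is averaged over a different task set than a model with full coverage, so averages in the full table are not directly comparable across columns.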

namespace-Pt · 2025-05-27T16:53:48Z

@KennethEnevoldsen If there are no other issues, could you please approve and merge this PR so that the results on the leaderboard are correct? Please let me know if there are any other problems :) Thanks in advance.

KennethEnevoldsen · 2025-05-27T17:00:42Z

Hi @namespace-Pt, sorry, I just wanted to look through the scores. A few look quite high: ClimateFEVER, FEVER, and Touche2020. Can I get confirmation that these results are correct?
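One simple way to surface rows like these is to compare the submitted score with the best score among the other models. This is only an illustrative sketch, not the check the reviewer actually ran; the `flag_high_scores` helper and the 0.03 margin are arbitrary assumptions.

```python
def flag_high_scores(rows: dict[str, list[float]], margin: float = 0.03):
    """Return (task, gap) pairs where the first score in a row leads
    the best of the remaining scores by at least `margin`."""
    flagged = []
    for task, scores in rows.items():
        candidate, baselines = scores[0], scores[1:]
        gap = candidate - max(baselines)
        if gap >= margin:
            flagged.append((task, round(gap, 2)))
    return flagged


# Scores copied from the MTEB(eng, v2) table above, in column order:
# Seed1.5-Embedding, gemini-embedding-001, e5-large-v2, NV-Embed-v2.
rows = {
    "ArguAna": [0.78, 0.86, 0.46, 0.70],
    "ClimateFEVERHardNegatives": [0.47, 0.31, 0.23, 0.33],
    "FEVERHardNegatives": [0.95, 0.89, 0.83, 0.90],
    "Touche2020Retrieval.v3": [0.64, 0.52, 0.42, 0.57],
}
flagged = flag_high_scores(rows)
```

With these four rows, the three tasks named in the comment are flagged and ArguAna is not, since another model scores higher there.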

namespace-Pt · 2025-05-27T17:09:13Z

Hi @KennethEnevoldsen. Yes, the results are correct. We guarantee no contamination during our training process.

namespace-Pt · 2025-05-27T17:09:58Z

BTW, the results of nv-embed-v2 on FEVER, ClimateFEVER, and Touche are currently underestimated, I think because the wrong instructions were used. From my own testing, with the correct instructions (as stated in their paper), the results of nv-embed-v2 should be similar to or even higher than ours (FEVER 0.95, ClimateFEVER 0.45, Touche 0.65).
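To make the point about instructions concrete: instruction-tuned embedding models embed a query together with a task instruction, so a different instruction produces a different query embedding and can shift retrieval scores. The sketch below assumes the "Instruct: ... / Query: ..." template as I understand it from the NV-Embed-v2 model card; the `build_query` helper and the instruction wording are placeholders, and the per-task instructions should be taken from the paper rather than from this example.

```python
def build_query(instruction: str, query: str) -> str:
    """Prefix a query with a task instruction.

    Assumed template: an "Instruct: <instruction>" line followed by a
    "Query: <query>" line. Verify against the model card before use.
    """
    return f"Instruct: {instruction}\nQuery: {query}"


# Placeholder instruction for illustration only; the exact per-task
# wording is not given in this thread.
instruction = "Given a claim, retrieve documents that support or refute the claim"
query = "Example claim text"
prompt = build_query(instruction, query)
```

Only the query side is prefixed in this sketch; whether passages also need a prefix is model-specific and should be checked in the model's documentation.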

KennethEnevoldsen · 2025-05-27T17:11:58Z

Thanks. Ahh, I didn't know we didn't match the instructions, but NV-Embed is also trained specifically on those datasets, so I would expect somewhat inflated performance.

Samoed · 2025-05-27T17:14:47Z

Added a comment about the instructions to embeddings-benchmark/mteb#1600

zhangpeitian and others added 11 commits April 25, 2025 01:47

update Seed-Embedding results

c805b18

update model_meta

0d44f77

update Doubao-1.5-Embedding results

12a1df3

Merge branch 'embeddings-benchmark:main' into main

59954f9

update results and bright evaluation

ba5ba17

restore version1 and update version2

219922c

Merge branch 'embeddings-benchmark:main' into main

1f19a00

update revision 3

8778ef5

rename Doubao-1.5-Embedding to Seed1.5-Embedding

c39a7e7

Merge branch 'embeddings-benchmark:main' into main

2dd6d4e

update revision 4

9945297

namespace-Pt mentioned this pull request May 26, 2025

fix: Update Seed1.5-Embedding API embeddings-benchmark/mteb#2724

Merged

8 tasks

Samoed requested a review from KennethEnevoldsen May 26, 2025 16:54

KennethEnevoldsen reviewed May 27, 2025

KennethEnevoldsen enabled auto-merge (squash) May 27, 2025 17:09

Samoed mentioned this pull request May 27, 2025

Investigate performance discrepancies in gte-Qwen and NV-embed models embeddings-benchmark/mteb#1600

Open

Samoed approved these changes May 27, 2025

KennethEnevoldsen merged commit 0f6fab6 into embeddings-benchmark:main May 27, 2025
2 checks passed

KennethEnevoldsen merged 11 commits into embeddings-benchmark:main from namespace-Pt:main

3 participants
