Update Seed1.5-Embedding revision 4 #205
@KennethEnevoldsen Do these results look good?
KennethEnevoldsen
left a comment
formatting looks reasonable
Here is the results table of MTEB(eng, v2):
| task_name | ByteDance-Seed/Seed1.5-Embedding | google/gemini-embedding-001 | intfloat/e5-large-v2 | nvidia/NV-Embed-v2 |
|---|---|---|---|---|
| AmazonCounterfactualClassification | 0.92 | 0.93 | 0.78 | 0.79 |
| ArXivHierarchicalClusteringP2P | 0.65 | 0.65 | 0.58 | 0.60 |
| ArXivHierarchicalClusteringS2S | 0.64 | 0.64 | 0.55 | 0.59 |
| ArguAna | 0.78 | 0.86 | 0.46 | 0.70 |
| AskUbuntuDupQuestions | 0.69 | 0.64 | 0.60 | 0.67 |
| BIOSSES | 0.85 | 0.89 | 0.84 | 0.87 |
| Banking77Classification | 0.91 | 0.94 | 0.85 | 0.92 |
| BiorxivClusteringP2P.v2 | 0.55 | 0.54 | 0.40 | 0.44 |
| CQADupstackGamingRetrieval | 0.70 | 0.71 | 0.58 | 0.65 |
| CQADupstackUnixRetrieval | 0.57 | 0.54 | 0.39 | 0.52 |
| ClimateFEVERHardNegatives | 0.47 | 0.31 | 0.23 | 0.33 |
| FEVERHardNegatives | 0.95 | 0.89 | 0.83 | 0.90 |
| FiQA2018 | 0.66 | 0.62 | 0.41 | 0.66 |
| HotpotQAHardNegatives | 0.88 | 0.87 | 0.73 | 0.84 |
| ImdbClassification | 0.97 | 0.95 | 0.92 | 0.97 |
| MTOPDomainClassification | 0.99 | 0.99 | 0.93 | 0.96 |
| MassiveIntentClassification | 0.87 | 0.88 | 0.68 | 0.78 |
| MassiveScenarioClassification | 0.93 | 0.92 | 0.71 | 0.81 |
| MedrxivClusteringP2P.v2 | 0.51 | 0.47 | 0.35 | 0.37 |
| MedrxivClusteringS2S.v2 | 0.51 | 0.45 | 0.34 | 0.36 |
| MindSmallReranking | 0.32 | 0.33 | 0.32 | 0.32 |
| SCIDOCS | 0.25 | 0.25 | 0.20 | 0.22 |
| SICK-R | 0.84 | 0.83 | 0.79 | 0.82 |
| STS12 | 0.85 | 0.82 | 0.74 | 0.78 |
| STS13 | 0.92 | 0.90 | 0.81 | 0.88 |
| STS14 | 0.90 | 0.85 | 0.79 | 0.84 |
| STS15 | 0.92 | 0.90 | 0.88 | 0.89 |
| STS17 | 0.93 | 0.92 | 0.90 | 0.91 |
| STS22.v2 | 0.71 | 0.68 | 0.67 | 0.66 |
| STSBenchmark | 0.92 | 0.89 | 0.85 | 0.88 |
| SprintDuplicateQuestions | 0.97 | 0.97 | 0.95 | 0.97 |
| StackExchangeClustering.v2 | 0.80 | 0.92 | 0.52 | 0.55 |
| StackExchangeClusteringP2P.v2 | 0.52 | 0.51 | 0.40 | 0.45 |
| SummEvalSummarization.v2 | 0.35 | 0.38 | 0.32 | 0.35 |
| TRECCOVID | 0.88 | 0.86 | 0.67 | 0.89 |
| Touche2020Retrieval.v3 | 0.64 | 0.52 | 0.42 | 0.57 |
| ToxicConversationsClassification | 0.86 | 0.89 | 0.63 | 0.93 |
| TweetSentimentExtractionClassification | 0.72 | 0.70 | 0.61 | 0.81 |
| TwentyNewsgroupsClustering.v2 | 0.63 | 0.57 | 0.48 | 0.45 |
| TwitterSemEval2015 | 0.77 | 0.79 | 0.77 | 0.81 |
| TwitterURLCorpus | 0.87 | 0.87 | 0.86 | 0.88 |
| Average | 0.75 | 0.73 | 0.63 | 0.70 |
and here is the full table for all models:
| task_name | ByteDance-Seed/Seed1.5-Embedding | google/gemini-embedding-001 | intfloat/e5-large-v2 | nvidia/NV-Embed-v2 |
|---|---|---|---|---|
| AFQMC | 0.57 | nan | nan | nan |
| ATEC | 0.54 | nan | nan | nan |
| AmazonCounterfactualClassification | 0.92 | 0.88 | 0.68 | 0.78 |
| AmazonReviewsClassification | 0.58 | nan | 0.35 | 0.47 |
| ArXivHierarchicalClusteringP2P | 0.65 | 0.65 | 0.58 | 0.60 |
| ArXivHierarchicalClusteringS2S | 0.64 | 0.64 | 0.55 | 0.59 |
| ArguAna | 0.78 | 0.86 | 0.46 | 0.70 |
| AskUbuntuDupQuestions | 0.69 | 0.64 | 0.60 | 0.67 |
| BIOSSES | 0.85 | 0.89 | 0.84 | 0.87 |
| BQ | 0.70 | nan | nan | nan |
| Banking77Classification | 0.91 | 0.94 | 0.85 | 0.92 |
| BiorxivClusteringP2P.v2 | 0.55 | 0.54 | 0.40 | 0.44 |
| BrightRetrieval | 0.27 | nan | nan | nan |
| CLSClusteringP2P | 0.54 | nan | nan | nan |
| CLSClusteringS2S | 0.62 | nan | nan | nan |
| CMedQAv1-reranking | 0.82 | nan | nan | nan |
| CMedQAv2-reranking | 0.84 | nan | 0.23 | 0.76 |
| CQADupstackGamingRetrieval | 0.70 | 0.71 | 0.58 | 0.65 |
| CQADupstackUnixRetrieval | 0.57 | 0.54 | 0.39 | 0.52 |
| ClimateFEVERHardNegatives | 0.47 | 0.31 | 0.23 | 0.33 |
| CmedqaRetrieval | 0.52 | nan | 0.03 | 0.31 |
| Cmnli | 0.91 | nan | nan | nan |
| CovidRetrieval | 0.88 | 0.79 | 0.20 | 0.59 |
| DuRetrieval | 0.94 | nan | nan | nan |
| EcomRetrieval | 0.73 | nan | nan | nan |
| FEVERHardNegatives | 0.95 | 0.89 | 0.83 | 0.90 |
| FiQA2018 | 0.66 | 0.62 | 0.41 | 0.66 |
| HotpotQAHardNegatives | 0.88 | 0.87 | 0.73 | 0.84 |
| IFlyTek | 0.56 | nan | nan | nan |
| ImdbClassification | 0.97 | 0.95 | 0.92 | 0.97 |
| JDReview | 0.89 | nan | nan | nan |
| LCQMC | 0.81 | nan | nan | nan |
| MMarcoReranking | 0.36 | nan | nan | nan |
| MMarcoRetrieval | 0.89 | nan | nan | nan |
| MTOPDomainClassification | 0.99 | 0.98 | 0.66 | 0.90 |
| MassiveIntentClassification | 0.86 | 0.82 | 0.33 | 0.58 |
| MassiveScenarioClassification | 0.92 | 0.87 | 0.40 | 0.63 |
| MedicalRetrieval | 0.71 | nan | nan | nan |
| MedrxivClusteringP2P.v2 | 0.51 | 0.47 | 0.35 | 0.37 |
| MedrxivClusteringS2S.v2 | 0.51 | 0.45 | 0.34 | 0.36 |
| MindSmallReranking | 0.32 | 0.33 | 0.32 | 0.32 |
| MultilingualSentiment | 0.83 | nan | nan | nan |
| Ocnli | 0.84 | nan | nan | nan |
| OnlineShopping | 0.96 | nan | nan | nan |
| PAWSX | 0.68 | nan | nan | nan |
| QBQTC | 0.52 | nan | nan | nan |
| SCIDOCS | 0.25 | 0.25 | 0.20 | 0.22 |
| SICK-R | 0.84 | 0.83 | 0.79 | 0.82 |
| STS12 | 0.85 | 0.82 | 0.74 | 0.78 |
| STS13 | 0.92 | 0.90 | 0.81 | 0.88 |
| STS14 | 0.90 | 0.85 | 0.79 | 0.84 |
| STS15 | 0.92 | 0.90 | 0.88 | 0.89 |
| STS17 | 0.93 | 0.89 | 0.48 | 0.91 |
| STS22.v2 | 0.72 | 0.72 | 0.57 | 0.61 |
| STSB | 0.86 | 0.85 | 0.43 | 0.78 |
| STSBenchmark | 0.92 | 0.89 | 0.85 | 0.88 |
| SprintDuplicateQuestions | 0.97 | 0.97 | 0.95 | 0.97 |
| StackExchangeClustering.v2 | 0.80 | 0.92 | 0.52 | 0.55 |
| StackExchangeClusteringP2P.v2 | 0.52 | 0.51 | 0.40 | 0.45 |
| SummEvalSummarization.v2 | 0.35 | 0.38 | 0.32 | 0.35 |
| T2Reranking | 0.67 | 0.68 | 0.60 | 0.67 |
| T2Retrieval | 0.90 | nan | nan | nan |
| TNews | 0.57 | nan | nan | nan |
| TRECCOVID | 0.88 | 0.86 | 0.67 | 0.89 |
| ThuNewsClusteringP2P | 0.83 | nan | nan | nan |
| ThuNewsClusteringS2S | 0.85 | nan | nan | nan |
| Touche2020Retrieval.v3 | 0.64 | 0.52 | 0.42 | 0.57 |
| ToxicConversationsClassification | 0.86 | 0.89 | 0.63 | 0.93 |
| TweetSentimentExtractionClassification | 0.72 | 0.70 | 0.61 | 0.81 |
| TwentyNewsgroupsClustering.v2 | 0.63 | 0.57 | 0.48 | 0.45 |
| TwitterSemEval2015 | 0.77 | 0.79 | 0.77 | 0.81 |
| TwitterURLCorpus | 0.87 | 0.87 | 0.86 | 0.88 |
| VideoRetrieval | 0.81 | nan | nan | nan |
| Waimai | 0.92 | nan | nan | nan |
| Average | 0.74 | 0.73 | 0.55 | 0.67 |
@KennethEnevoldsen If there are no other issues, could you please approve and merge this PR so that the results on the leaderboard are correct? Please let me know if there are any other problems :) Thanks in advance.
Hi @namespace-Pt, sorry, I just wanted to look through the scores. A few look quite high: ClimateFEVER, FEVER, and Touche2020. Can I get a confirmation that these results are correct?
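For readers following along, the kind of check described here can be sketched in a few lines. This is only an illustration, not the procedure the reviewer actually used: the function name `lead_over_baselines` and the 0.05 threshold are assumptions made for this example, and the scores are copied from a handful of rows of the MTEB(eng, v2) table above.

```python
# Scores copied from the MTEB(eng, v2) table in this thread (subset of rows).
# Column order: Seed1.5-Embedding, gemini-embedding-001, e5-large-v2, NV-Embed-v2.
SCORES = {
    "ClimateFEVERHardNegatives": (0.47, 0.31, 0.23, 0.33),
    "FEVERHardNegatives": (0.95, 0.89, 0.83, 0.90),
    "Touche2020Retrieval.v3": (0.64, 0.52, 0.42, 0.57),
    "ArguAna": (0.78, 0.86, 0.46, 0.70),
    "TRECCOVID": (0.88, 0.86, 0.67, 0.89),
}


def lead_over_baselines(scores, min_lead=0.05):
    """Return {task: lead} for tasks where the first model beats the best
    of the remaining models by at least min_lead (hypothetical threshold)."""
    flagged = {}
    for task, (candidate, *baselines) in scores.items():
        # Round to two decimals, matching the table's precision, so that
        # floating-point noise does not hide a lead of exactly min_lead.
        lead = round(candidate - max(baselines), 2)
        if lead >= min_lead:
            flagged[task] = lead
    return flagged


flagged = lead_over_baselines(SCORES)
# Print the flagged tasks, largest lead first.
for task, lead in sorted(flagged.items(), key=lambda kv: -kv[1]):
    print(f"{task}: +{lead:.2f}")
```

A large lead is only a prompt to double-check a score; by itself it says nothing about whether the result is wrong.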
Hi @KennethEnevoldsen. Yes, the results are correct. We guarantee no contamination during our training process.
BTW, the results of nv-embed-v2 on FEVER, ClimateFEVER, and Touche are currently underestimated, I think due to misuse of the instructions. From my own testing, with the correct instructions (as stated in their paper), the results of nv-embed-v2 should be similar to or even higher than ours (FEVER 0.95, ClimateFEVER 0.45, Touche 0.65).
Thanks. Ahh, I didn't know we didn't match the instructions, but NV-Embed is also trained specifically on those datasets, so I would expect somewhat inflated performance.
Added comment about instructions to embeddings-benchmark/mteb#1600 |
Checklist
mteb/models/ this can be as an API. Instructions on how to add a model can be found here
@KennethEnevoldsen created a revision 4 in this PR. In order to show the correct results on the leaderboard, I've copied the results from revision 3 to revision 4.