Qzhou embedding results #250
Model Results Comparison
Reference models: intfloat/multilingual-e5-large, google/gemini-embedding-001
Results for Kingsoft-LLM/QZhou-Embedding

| task_name | Kingsoft-LLM/QZhou-Embedding | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result |
|---|---|---|---|---|
| AFQMC | 0.67 | nan | 0.33 | 0.72 |
| ATEC | 0.55 | nan | 0.4 | 0.65 |
| AmazonCounterfactualClassification | 0.93 | 0.88 | 0.7 | 0.97 |
| ArXivHierarchicalClusteringP2P | 0.66 | 0.65 | 0.56 | 0.69 |
| ArXivHierarchicalClusteringS2S | 0.64 | 0.64 | 0.54 | 0.65 |
| ArguAna | 0.84 | 0.86 | 0.54 | 0.90 |
| AskUbuntuDupQuestions | 0.69 | 0.64 | 0.59 | 0.70 |
| BIOSSES | 0.93 | 0.89 | 0.85 | 0.97 |
| BQ | 0.77 | nan | 0.48 | 0.81 |
| Banking77Classification | 0.85 | 0.94 | 0.75 | 0.94 |
| BiorxivClusteringP2P.v2 | 0.54 | 0.54 | 0.37 | 0.56 |
| CLSClusteringP2P | 0.65 | nan | nan | 0.82 |
| CLSClusteringS2S | 0.61 | nan | nan | 0.74 |
| CMedQAv1-reranking | 0.94 | nan | 0.68 | 0.94 |
| CMedQAv2-reranking | 0.94 | nan | 0.67 | 0.94 |
| CQADupstackGamingRetrieval | 0.76 | 0.71 | 0.59 | 0.79 |
| CQADupstackUnixRetrieval | 0.71 | 0.54 | 0.4 | 0.72 |
| ClimateFEVERHardNegatives | 0.49 | 0.31 | 0.26 | 0.49 |
| CmedqaRetrieval | 0.52 | nan | 0.29 | 0.57 |
| Cmnli | 0.95 | nan | nan | 0.95 |
| CovidRetrieval | 0.93 | 0.79 | 0.76 | 0.96 |
| DuRetrieval | 0.92 | nan | 0.85 | 0.94 |
| EcomRetrieval | 0.77 | nan | 0.55 | 0.78 |
| FEVERHardNegatives | 0.94 | 0.89 | 0.84 | 0.95 |
| FiQA2018 | 0.60 | 0.62 | 0.44 | 0.80 |
| HotpotQAHardNegatives | 0.81 | 0.87 | 0.71 | 0.87 |
| IFlyTek | 0.58 | nan | 0.42 | 0.58 |
| ImdbClassification | 0.96 | 0.95 | 0.89 | 0.97 |
| JDReview | 0.88 | nan | 0.81 | 0.92 |
| LCQMC | 0.82 | nan | 0.76 | 0.82 |
| MMarcoReranking | 0.44 | nan | 0.29 | 0.47 |
| MMarcoRetrieval | 0.83 | nan | 0.79 | 0.90 |
| MTOPDomainClassification | 0.96 | 0.98 | 0.9 | 1.00 |
| MassiveIntentClassification | 0.55 | 0.82 | 0.6 | 0.92 |
| MassiveScenarioClassification | 0.74 | 0.87 | 0.7 | 0.99 |
| MedicalRetrieval | 0.73 | nan | 0.51 | 0.76 |
| MedrxivClusteringP2P.v2 | 0.50 | 0.47 | 0.34 | 0.52 |
| MedrxivClusteringS2S.v2 | 0.48 | 0.45 | 0.32 | 0.51 |
| MindSmallReranking | 0.34 | 0.33 | 0.3 | 0.34 |
| MultilingualSentiment | 0.85 | nan | 0.71 | 0.85 |
| Ocnli | 0.95 | nan | nan | 0.95 |
| OnlineShopping | 0.96 | nan | 0.9 | 0.97 |
| PAWSX | 0.70 | nan | 0.15 | 0.70 |
| QBQTC | 0.60 | nan | nan | 0.71 |
| SCIDOCS | 0.29 | 0.25 | 0.17 | 0.35 |
| SICK-R | 0.88 | 0.83 | 0.8 | 0.95 |
| STS12 | 0.90 | 0.82 | 0.8 | 0.95 |
| STS13 | 0.96 | 0.90 | 0.82 | 0.98 |
| STS14 | 0.93 | 0.85 | 0.78 | 0.98 |
| STS15 | 0.95 | 0.90 | 0.89 | 0.98 |
| STS17 | 0.89 | 0.89 | 0.82 | 0.93 |
| STS22.v2 | 0.77 | 0.72 | 0.64 | 0.77 |
| STSB | 0.92 | 0.85 | 0.82 | 0.92 |
| STSBenchmark | 0.95 | 0.89 | 0.87 | 0.95 |
| SprintDuplicateQuestions | 0.98 | 0.97 | 0.93 | 0.98 |
| StackExchangeClustering.v2 | 0.76 | 0.92 | 0.46 | 0.92 |
| StackExchangeClusteringP2P.v2 | 0.55 | 0.51 | 0.39 | 0.55 |
| SummEvalSummarization.v2 | 0.33 | 0.38 | 0.31 | 0.39 |
| T2Reranking | 0.68 | 0.68 | 0.66 | 0.73 |
| T2Retrieval | 0.82 | nan | 0.76 | 0.89 |
| TNews | 0.61 | nan | 0.49 | 0.61 |
| TRECCOVID | 0.78 | 0.86 | 0.71 | 0.95 |
| ThuNewsClusteringP2P | 0.82 | nan | nan | 0.89 |
| ThuNewsClusteringS2S | 0.76 | nan | nan | 0.88 |
| Touche2020Retrieval.v3 | 0.50 | 0.52 | 0.5 | 0.75 |
| ToxicConversationsClassification | 0.90 | 0.89 | 0.66 | 0.98 |
| TweetSentimentExtractionClassification | 0.77 | 0.70 | 0.63 | 0.88 |
| TwentyNewsgroupsClustering.v2 | 0.81 | 0.57 | 0.39 | 0.88 |
| TwitterSemEval2015 | 0.87 | 0.79 | 0.75 | 0.89 |
| TwitterURLCorpus | 0.92 | 0.87 | 0.86 | 0.96 |
| VideoRetrieval | 0.79 | nan | 0.58 | 0.84 |
| Waimai | 0.92 | nan | 0.86 | 0.92 |
| Average | 0.76 | 0.73 | 0.61 | 0.81 |
correct model_meta info
|
past PR: #249 |
|
We are still waiting for the model PR to merge :) |
|
Thanks! I have looked over the scores, and a few seem suspiciously high:
However, it seems like these are not in the annotated training data:
import mteb
meta = mteb.get_model_meta("Kingsoft-LLM/QZhou-Embedding")
# in training data
"AmazonCounterfactualClassification" in meta.training_datasets # True
# not in:
"AskUbuntuDupQuestions" in meta.training_datasets # False
"BQ" in meta.training_datasets # False
"Waimai" in meta.training_datasets # False
"TNews" in meta.training_datasets # False
"IFlyTek" in meta.training_datasets # False
@PennyYu123 can you help me figure out these scores? Could you have missed some annotations or synthetically generated matching training data? |
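The membership checks above can be batched into one call. This is a minimal sketch, assuming only that `training_datasets` supports `in` (as the snippet above implies) and may be empty or `None`. `missing_annotations` is a hypothetical helper, not part of mteb, and the stand-in set below simply mirrors the True/False results shown above rather than the model's real metadata.

```python
from typing import Collection, Iterable, List, Optional


def missing_annotations(
    training_datasets: Optional[Collection[str]],
    task_names: Iterable[str],
) -> List[str]:
    """Return the task names that are NOT annotated as training data."""
    annotated = training_datasets or ()  # treat None as "nothing annotated"
    return [name for name in task_names if name not in annotated]


# Stand-in for meta.training_datasets (assumption: mirrors the checks above).
annotated_example = {"AmazonCounterfactualClassification"}

tasks_to_check = [
    "AmazonCounterfactualClassification",
    "AskUbuntuDupQuestions",
    "BQ",
    "Waimai",
    "TNews",
    "IFlyTek",
]

print(missing_annotations(annotated_example, tasks_to_check))
```

With mteb installed, the first argument would be `mteb.get_model_meta("Kingsoft-LLM/QZhou-Embedding").training_datasets`, as in the snippet above.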
Yes, it would be great if you'd add them to the training datasets. |
|
You can add your new scores in a new subfolder for your new revision. |
|
Hello, our new model results have been uploaded. We have already submitted a PR to the mteb repo. We have also replaced the original model parameter file with our new one on Hugging Face. Let's continue the previous process. 😊😊😊 |
|
Hi @PennyYu123, I have merged the PR, but it seems like there are still some datasets missing from the list that you provided:
import mteb
meta = mteb.get_model_meta("Kingsoft-LLM/QZhou-Embedding")
"AmazonCounterfactualClassification" in meta.training_datasets # True
"AskUbuntuDupQuestions" in meta.training_datasets # False
"BQ" in meta.training_datasets # False
"Waimai" in meta.training_datasets # True (fixed)
"TNews" in meta.training_datasets # False (fixed)
"IFlyTek" in meta.training_datasets # False
# do also check the remainder of the list
Can I ask you to update the training datasets again? |
Ahh, great, I will rerun the table to see if there are any remaining concerns.
I am back from holiday, so that should be possible. Sorry that you had to wait due to the holiday; normally, it takes no more than 1-2 days.
We, of course, always appreciate collaboration and contributions, but let us keep that out of the review process :) |
KennethEnevoldsen
left a comment
Ahh! forgot to press submit on the review...
I have added the updated table below. There are still a few that seem concerning:
- TwitterSemEval2015
- SCIDOCS
- AskUbuntuDupQuestions
Model Results Comparison
Reference models: intfloat/multilingual-e5-large, google/gemini-embedding-001
New models evaluated: Kingsoft-LLM/QZhou-Embedding
Tasks: AFQMC, ATEC, AmazonCounterfactualClassification, ArXivHierarchicalClusteringP2P, ArXivHierarchicalClusteringS2S, ArguAna, AskUbuntuDupQuestions, BIOSSES, BQ, Banking77Classification, BiorxivClusteringP2P.v2, CLSClusteringP2P, CLSClusteringS2S, CMedQAv1-reranking, CMedQAv2-reranking, CQADupstackGamingRetrieval, CQADupstackUnixRetrieval, ClimateFEVERHardNegatives, CmedqaRetrieval, Cmnli, CovidRetrieval, DuRetrieval, EcomRetrieval, FEVERHardNegatives, FiQA2018, HotpotQAHardNegatives, IFlyTek, ImdbClassification, JDReview, LCQMC, MMarcoReranking, MMarcoRetrieval, MTOPDomainClassification, MassiveIntentClassification, MassiveScenarioClassification, MedicalRetrieval, MedrxivClusteringP2P.v2, MedrxivClusteringS2S.v2, MindSmallReranking, MultilingualSentiment, Ocnli, OnlineShopping, PAWSX, QBQTC, SCIDOCS, SICK-R, STS12, STS13, STS14, STS15, STS17, STS22.v2, STSB, STSBenchmark, SprintDuplicateQuestions, StackExchangeClustering.v2, StackExchangeClusteringP2P.v2, SummEvalSummarization.v2, T2Reranking, T2Retrieval, TNews, TRECCOVID, ThuNewsClusteringP2P, ThuNewsClusteringS2S, Touche2020Retrieval.v3, ToxicConversationsClassification, TweetSentimentExtractionClassification, TwentyNewsgroupsClustering.v2, TwitterSemEval2015, TwitterURLCorpus, VideoRetrieval, Waimai
Results for Kingsoft-LLM/QZhou-Embedding
| task_name | Kingsoft-LLM/QZhou-Embedding | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result |
|---|---|---|---|---|
| AFQMC | 0.66 | nan | 0.33 | 0.72 |
| ATEC | 0.55 | nan | 0.4 | 0.65 |
| AmazonCounterfactualClassification | 0.93 | 0.88 | 0.7 | 0.97 |
| ArXivHierarchicalClusteringP2P | 0.66 | 0.65 | 0.56 | 0.69 |
| ArXivHierarchicalClusteringS2S | 0.64 | 0.64 | 0.54 | 0.65 |
| ArguAna | 0.84 | 0.86 | 0.54 | 0.90 |
| AskUbuntuDupQuestions | 0.75 | 0.64 | 0.59 | 0.75 |
| BIOSSES | 0.93 | 0.89 | 0.85 | 0.97 |
| BQ | 0.77 | nan | 0.48 | 0.81 |
| Banking77Classification | 0.85 | 0.94 | 0.75 | 0.94 |
| BiorxivClusteringP2P.v2 | 0.55 | 0.54 | 0.37 | 0.56 |
| CLSClusteringP2P | 0.67 | nan | nan | 0.82 |
| CLSClusteringS2S | 0.61 | nan | nan | 0.74 |
| CMedQAv1-reranking | 0.94 | nan | 0.68 | 0.94 |
| CMedQAv2-reranking | 0.93 | nan | 0.67 | 0.93 |
| CQADupstackGamingRetrieval | 0.77 | 0.71 | 0.59 | 0.79 |
| CQADupstackUnixRetrieval | 0.70 | 0.54 | 0.4 | 0.72 |
| ClimateFEVERHardNegatives | 0.62 | 0.31 | 0.26 | 0.62 |
| CmedqaRetrieval | 0.51 | nan | 0.29 | 0.57 |
| Cmnli | 0.95 | nan | nan | 0.95 |
| CovidRetrieval | 0.93 | 0.79 | 0.76 | 0.96 |
| DuRetrieval | 0.92 | nan | 0.85 | 0.94 |
| EcomRetrieval | 0.77 | nan | 0.55 | 0.78 |
| FEVERHardNegatives | 0.94 | 0.89 | 0.84 | 0.95 |
| FiQA2018 | 0.60 | 0.62 | 0.44 | 0.80 |
| HotpotQAHardNegatives | 0.80 | 0.87 | 0.71 | 0.87 |
| IFlyTek | 0.57 | nan | 0.42 | 0.58 |
| ImdbClassification | 0.96 | 0.95 | 0.89 | 0.97 |
| JDReview | 0.90 | nan | 0.81 | 0.92 |
| LCQMC | 0.82 | nan | 0.76 | 0.82 |
| MMarcoReranking | 0.51 | nan | 0.29 | 0.51 |
| MMarcoRetrieval | 0.83 | nan | 0.79 | 0.90 |
| MTOPDomainClassification | 0.96 | 0.98 | 0.9 | 1.00 |
| MassiveIntentClassification | 0.55 | 0.82 | 0.6 | 0.92 |
| MassiveScenarioClassification | 0.73 | 0.87 | 0.7 | 0.99 |
| MedicalRetrieval | 0.72 | nan | 0.51 | 0.76 |
| MedrxivClusteringP2P.v2 | 0.51 | 0.47 | 0.34 | 0.52 |
| MedrxivClusteringS2S.v2 | 0.48 | 0.45 | 0.32 | 0.51 |
| MindSmallReranking | 0.36 | 0.33 | 0.3 | 0.36 |
| MultilingualSentiment | 0.85 | nan | 0.71 | 0.85 |
| Ocnli | 0.95 | nan | nan | 0.95 |
| OnlineShopping | 0.96 | nan | 0.9 | 0.97 |
| PAWSX | 0.70 | nan | 0.15 | 0.70 |
| QBQTC | 0.61 | nan | nan | 0.71 |
| SCIDOCS | 0.44 | 0.25 | 0.17 | 0.44 |
| SICK-R | 0.88 | 0.83 | 0.8 | 0.95 |
| STS12 | 0.90 | 0.82 | 0.8 | 0.95 |
| STS13 | 0.95 | 0.90 | 0.82 | 0.98 |
| STS14 | 0.93 | 0.85 | 0.78 | 0.98 |
| STS15 | 0.96 | 0.90 | 0.89 | 0.98 |
| STS17 | 0.90 | 0.89 | 0.82 | 0.93 |
| STS22.v2 | 0.78 | 0.72 | 0.64 | 0.78 |
| STSB | 0.92 | 0.85 | 0.82 | 0.92 |
| STSBenchmark | 0.96 | 0.89 | 0.87 | 0.96 |
| SprintDuplicateQuestions | 0.98 | 0.97 | 0.93 | 0.98 |
| StackExchangeClustering.v2 | 0.76 | 0.92 | 0.46 | 0.92 |
| StackExchangeClusteringP2P.v2 | 0.55 | 0.51 | 0.39 | 0.55 |
| SummEvalSummarization.v2 | 0.34 | 0.38 | 0.31 | 0.39 |
| T2Reranking | 0.68 | 0.68 | 0.66 | 0.73 |
| T2Retrieval | 0.81 | nan | 0.76 | 0.89 |
| TNews | 0.60 | nan | 0.49 | 0.60 |
| TRECCOVID | 0.79 | 0.86 | 0.71 | 0.95 |
| ThuNewsClusteringP2P | 0.83 | nan | nan | 0.89 |
| ThuNewsClusteringS2S | 0.78 | nan | nan | 0.88 |
| Touche2020Retrieval.v3 | 0.49 | 0.52 | 0.5 | 0.75 |
| ToxicConversationsClassification | 0.90 | 0.89 | 0.66 | 0.98 |
| TweetSentimentExtractionClassification | 0.77 | 0.70 | 0.63 | 0.88 |
| TwentyNewsgroupsClustering.v2 | 0.82 | 0.57 | 0.39 | 0.88 |
| TwitterSemEval2015 | 0.92 | 0.79 | 0.75 | 0.92 |
| TwitterURLCorpus | 0.92 | 0.87 | 0.86 | 0.96 |
| VideoRetrieval | 0.80 | nan | 0.58 | 0.84 |
| Waimai | 0.92 | nan | 0.86 | 0.92 |
| Average | 0.76 | 0.73 | 0.61 | 0.81 |
|
@PennyYu123 can you help me understand the few concerning datasets? Might there be missing dataset annotations? |
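One way to make "concerning" concrete is to flag tasks where the new model both sets the maximum result and clears every available reference score by a margin. This is a rough sketch and not necessarily the criterion used in this review: the 0.10 margin and the helper `flag_suspicious` are assumptions, and the scores are copied from a handful of rows of the table above.

```python
import math
from typing import Dict, List, Tuple

# (QZhou-Embedding, gemini-embedding-001, multilingual-e5-large, max result),
# copied from selected rows of the table above.
ROWS: Dict[str, Tuple[float, float, float, float]] = {
    "ArguAna": (0.84, 0.86, 0.54, 0.90),
    "AskUbuntuDupQuestions": (0.75, 0.64, 0.59, 0.75),
    "BQ": (0.77, math.nan, 0.48, 0.81),
    "SCIDOCS": (0.44, 0.25, 0.17, 0.44),
    "STS22.v2": (0.78, 0.72, 0.64, 0.78),
    "TwitterSemEval2015": (0.92, 0.79, 0.75, 0.92),
}


def flag_suspicious(
    rows: Dict[str, Tuple[float, float, float, float]],
    margin: float = 0.10,  # assumed threshold, chosen for illustration only
) -> List[str]:
    """Flag tasks where the new model holds the max and beats all references by `margin`."""
    flagged = []
    for task, (score, *refs, max_result) in rows.items():
        known_refs = [r for r in refs if not math.isnan(r)]  # skip missing scores
        if not known_refs:
            continue
        if score >= max_result and score - max(known_refs) >= margin:
            flagged.append(task)
    return flagged


print(flag_suspicious(ROWS))
```

Applied to the full table, the same rule would also pick up rows not singled out here, such as ClimateFEVERHardNegatives (0.62 against 0.31 and 0.26), so it is at best a first filter before checking the training annotations.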
|
We have concurrently updated the following components: |
|
PR that updates model revision: embeddings-benchmark/mteb#3069
I will recompute the table
Model Results Comparison
Reference models: intfloat/multilingual-e5-large, google/gemini-embedding-001
Results for Kingsoft-LLM/QZhou-Embedding

| task_name | Kingsoft-LLM/QZhou-Embedding | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result |
|---|---|---|---|---|
| AFQMC | 0.67 | nan | 0.33 | 0.72 |
| ATEC | 0.55 | nan | 0.4 | 0.65 |
| AmazonCounterfactualClassification | 0.93 | 0.88 | 0.7 | 0.97 |
| ArXivHierarchicalClusteringP2P | 0.66 | 0.65 | 0.56 | 0.69 |
| ArXivHierarchicalClusteringS2S | 0.64 | 0.64 | 0.54 | 0.65 |
| ArguAna | 0.84 | 0.86 | 0.54 | 0.90 |
| AskUbuntuDupQuestions | 0.69 | 0.64 | 0.59 | 0.70 |
| BIOSSES | 0.93 | 0.89 | 0.85 | 0.97 |
| BQ | 0.77 | nan | 0.48 | 0.81 |
| Banking77Classification | 0.85 | 0.94 | 0.75 | 0.94 |
| BiorxivClusteringP2P.v2 | 0.54 | 0.54 | 0.37 | 0.56 |
| CLSClusteringP2P | 0.65 | nan | nan | 0.82 |
| CLSClusteringS2S | 0.61 | nan | nan | 0.74 |
| CMedQAv1-reranking | 0.94 | nan | 0.68 | 0.94 |
| CMedQAv2-reranking | 0.94 | nan | 0.67 | 0.94 |
| CQADupstackGamingRetrieval | 0.76 | 0.71 | 0.59 | 0.79 |
| CQADupstackUnixRetrieval | 0.71 | 0.54 | 0.4 | 0.72 |
| ClimateFEVERHardNegatives | 0.49 | 0.31 | 0.26 | 0.49 |
| CmedqaRetrieval | 0.52 | nan | 0.29 | 0.57 |
| Cmnli | 0.95 | nan | nan | 0.95 |
| CovidRetrieval | 0.93 | 0.79 | 0.76 | 0.96 |
| DuRetrieval | 0.92 | nan | 0.85 | 0.94 |
| EcomRetrieval | 0.77 | nan | 0.55 | 0.78 |
| FEVERHardNegatives | 0.94 | 0.89 | 0.84 | 0.95 |
| FiQA2018 | 0.60 | 0.62 | 0.44 | 0.80 |
| HotpotQAHardNegatives | 0.81 | 0.87 | 0.71 | 0.87 |
| IFlyTek | 0.58 | nan | 0.42 | 0.58 |
| ImdbClassification | 0.96 | 0.95 | 0.89 | 0.97 |
| JDReview | 0.88 | nan | 0.81 | 0.92 |
| LCQMC | 0.82 | nan | 0.76 | 0.82 |
| MMarcoReranking | 0.44 | nan | 0.29 | 0.47 |
| MMarcoRetrieval | 0.83 | nan | 0.79 | 0.90 |
| MTOPDomainClassification | 0.96 | 0.98 | 0.9 | 1.00 |
| MassiveIntentClassification | 0.55 | 0.82 | 0.6 | 0.92 |
| MassiveScenarioClassification | 0.74 | 0.87 | 0.7 | 0.99 |
| MedicalRetrieval | 0.73 | nan | 0.51 | 0.76 |
| MedrxivClusteringP2P.v2 | 0.50 | 0.47 | 0.34 | 0.52 |
| MedrxivClusteringS2S.v2 | 0.48 | 0.45 | 0.32 | 0.51 |
| MindSmallReranking | 0.34 | 0.33 | 0.3 | 0.34 |
| MultilingualSentiment | 0.85 | nan | 0.71 | 0.85 |
| Ocnli | 0.95 | nan | nan | 0.95 |
| OnlineShopping | 0.96 | nan | 0.9 | 0.97 |
| PAWSX | 0.70 | nan | 0.15 | 0.70 |
| QBQTC | 0.60 | nan | nan | 0.71 |
| SCIDOCS | 0.29 | 0.25 | 0.17 | 0.35 |
| SICK-R | 0.88 | 0.83 | 0.8 | 0.95 |
| STS12 | 0.90 | 0.82 | 0.8 | 0.95 |
| STS13 | 0.96 | 0.90 | 0.82 | 0.98 |
| STS14 | 0.93 | 0.85 | 0.78 | 0.98 |
| STS15 | 0.95 | 0.90 | 0.89 | 0.98 |
| STS17 | 0.89 | 0.89 | 0.82 | 0.93 |
| STS22.v2 | 0.77 | 0.72 | 0.64 | 0.77 |
| STSB | 0.92 | 0.85 | 0.82 | 0.92 |
| STSBenchmark | 0.95 | 0.89 | 0.87 | 0.95 |
| SprintDuplicateQuestions | 0.98 | 0.97 | 0.93 | 0.98 |
| StackExchangeClustering.v2 | 0.76 | 0.92 | 0.46 | 0.92 |
| StackExchangeClusteringP2P.v2 | 0.55 | 0.51 | 0.39 | 0.55 |
| SummEvalSummarization.v2 | 0.33 | 0.38 | 0.31 | 0.39 |
| T2Reranking | 0.68 | 0.68 | 0.66 | 0.73 |
| T2Retrieval | 0.82 | nan | 0.76 | 0.89 |
| TNews | 0.61 | nan | 0.49 | 0.61 |
| TRECCOVID | 0.78 | 0.86 | 0.71 | 0.95 |
| ThuNewsClusteringP2P | 0.82 | nan | nan | 0.89 |
| ThuNewsClusteringS2S | 0.76 | nan | nan | 0.88 |
| Touche2020Retrieval.v3 | 0.50 | 0.52 | 0.5 | 0.75 |
| ToxicConversationsClassification | 0.90 | 0.89 | 0.66 | 0.98 |
| TweetSentimentExtractionClassification | 0.77 | 0.70 | 0.63 | 0.88 |
| TwentyNewsgroupsClustering.v2 | 0.81 | 0.57 | 0.39 | 0.88 |
| TwitterSemEval2015 | 0.87 | 0.79 | 0.75 | 0.89 |
| TwitterURLCorpus | 0.92 | 0.87 | 0.86 | 0.96 |
| VideoRetrieval | 0.79 | nan | 0.58 | 0.84 |
| Waimai | 0.92 | nan | 0.86 | 0.92 |
| Average | 0.76 | 0.73 | 0.61 | 0.81 |
|
Alright, I think we finally got there! Congratulations again on the release :) |
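To see what the revision update changed, the two preceding tables can be diffed row by row. The sketch below does this for the three tasks flagged earlier, with scores copied from those tables; `score_deltas` is a hypothetical helper, and rounding to two decimals simply matches the precision of the tables.

```python
from typing import Dict

# QZhou-Embedding scores for the flagged tasks, copied from the two tables above.
before_update: Dict[str, float] = {
    "AskUbuntuDupQuestions": 0.75,
    "SCIDOCS": 0.44,
    "TwitterSemEval2015": 0.92,
}
after_update: Dict[str, float] = {
    "AskUbuntuDupQuestions": 0.69,
    "SCIDOCS": 0.29,
    "TwitterSemEval2015": 0.87,
}


def score_deltas(old: Dict[str, float], new: Dict[str, float]) -> Dict[str, float]:
    """Return new - old for every task present in both tables, rounded to 2 decimals."""
    return {task: round(new[task] - old[task], 2) for task in old if task in new}


for task, delta in score_deltas(before_update, after_update).items():
    print(f"{task}: {delta:+.2f}")
```

All three scores drop after the update, and in the recomputed table none of them exceeds the maximum result listed for its task.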
We have released the HF model publicly and resubmitted the mteb implementation.
Checklist
mteb/models/ (this can be as an API). Instructions on how to add a model can be found here