MMTEB results for llama-embed-nemotron-8b #302
Model Results Comparison
Reference models: Results for
| task_name | google/gemini-embedding-001 | nvidia/llama-embed-nemotron-8b | intfloat/multilingual-e5-large | Max result |
|---|---|---|---|---|
| AILAStatutes | 0.4877 | 0.5403 | 0.2084 | 0.8509 |
| AfriSentiClassification | 0.5356 | 0.4939 | 0.455 | 0.5399 |
| AlloProfClusteringS2S.v2 | 0.5636 | 0.5714 | 0.3515 | 0.5965 |
| AlloprofReranking | 0.8177 | 0.8129 | 0.6944 | 0.8513 |
| AmazonCounterfactualClassification | 0.8820 | 0.8394 | 0.7713 | 0.9696 |
| ArXivHierarchicalClusteringP2P | 0.6492 | 0.6284 | 0.5569 | 0.6869 |
| ArXivHierarchicalClusteringS2S | 0.6384 | 0.6389 | 0.5621 | 0.6548 |
| ArguAna | 0.8644 | 0.7567 | 0.5438 | 0.8979 |
| ArmenianParaphrasePC | 0.9689 | 0.9682 | 0.9493 | 0.9689 |
| BUCC.v2 | 0.9899 | 0.9898 | 0.9878 | 0.9902 |
| BelebeleRetrieval | 0.9073 | 0.8604 | 0.7791 | 0.9167 |
| BibleNLPBitextMining | 0.2072 | 0.2149 | 0.1665 | 0.9899 |
| BigPatentClustering.v2 | 0.3806 | 0.3667 | 0.3466 | 0.4553 |
| BiorxivClusteringP2P.v2 | 0.5386 | 0.4710 | 0.3778 | 0.8417 |
| BornholmBitextMining | 0.5169 | 0.6548 | 0.4416 | 0.7633 |
| BrazilianToxicTweetsClassification | 0.2802 | 0.2901 | 0.2123 | 0.2802 |
| BulgarianStoreReviewSentimentClassfication | 0.7813 | 0.7967 | 0.7093 | 0.8044 |
| CEDRClassification | 0.5742 | 0.5325 | 0.4484 | 0.7301 |
| CLSClusteringP2P.v2 | 0.4268 | 0.4428 | 0.4037 | 0.7572 |
| CSFDSKMovieReviewSentimentClassification | 0.4938 | 0.5543 | 0.3664 | 0.6243 |
| CTKFactsNLI | 0.8759 | 0.8735 | 0.8096 | 0.8993 |
| CataloniaTweetClassification | 0.5451 | 0.5313 | 0.504 | 0.5563 |
| Core17InstructionRetrieval | 0.0769 | 0.1461 | -0.0162 | 0.1648 |
| CovidRetrieval | 0.7913 | 0.7953 | 0.7561 | 0.9606 |
| CyrillicTurkicLangClassification | 0.9530 | 0.9252 | 0.4085 | 0.9615 |
| CzechProductReviewSentimentClassification | 0.6816 | 0.6807 | 0.5742 | 0.6988 |
| DBpediaClassification | 0.9476 | 0.9764 | 0.8828 | 0.9926 |
| DalajClassification | 0.5047 | 0.5277 | 0.5001 | 0.5352 |
| DiaBlaBitextMining | 0.8723 | 0.8865 | 0.8483 | 0.8846 |
| EstonianValenceClassification | 0.5352 | 0.6291 | 0.4358 | 0.6820 |
| FaroeseSTS | 0.8612 | 0.8393 | 0.7239 | 0.9739 |
| FilipinoShopeeReviewsClassification | 0.4845 | 0.5094 | 0.3527 | 0.5052 |
| FinParaSTS | 0.2860 | 0.2656 | 0.2666 | 0.3399 |
| FinancialPhrasebankClassification | 0.8864 | 0.9440 | 0.8404 | 0.9515 |
| FloresBitextMining | 0.8371 | 0.8021 | 0.8108 | 0.8596 |
| GermanSTSBenchmark | 0.8809 | 0.8900 | 0.8527 | 0.9541 |
| GreekLegalCodeClassification | 0.4376 | 0.5128 | 0.3713 | 0.5648 |
| GujaratiNewsClassification | 0.9205 | 0.8970 | 0.7674 | 0.9205 |
| HALClusteringS2S.v2 | 0.3200 | 0.3184 | 0.2261 | 0.3237 |
| HagridRetrieval | 0.9931 | 0.9897 | 0.9891 | 0.9931 |
| IN22GenBitextMining | 0.9375 | 0.8775 | 0.7675 | 0.9375 |
| IndicCrosslingualSTS | 0.6287 | 0.5818 | 0.4387 | 0.8477 |
| IndicGenBenchFloresBitextMining | 0.9677 | 0.9655 | 0.8875 | 0.9881 |
| IndicLangClassification | 0.8769 | 0.9554 | 0.2025 | 0.9532 |
| IndonesianIdClickbaitClassification | 0.6700 | 0.7560 | 0.6122 | 0.6700 |
| IsiZuluNewsClassification | 0.4053 | 0.3826 | 0.3241 | 0.4053 |
| ItaCaseholdClassification | 0.7330 | 0.7321 | 0.6679 | 0.9439 |
| JSICK | 0.8499 | 0.8380 | 0.7983 | 0.8938 |
| KorHateSpeechMLClassification | 0.1769 | 0.2297 | 0.1049 | 0.2167 |
| KorSarcasmClassification | 0.6051 | 0.6388 | 0.5679 | 0.6629 |
| KurdishSentimentClassification | 0.8639 | 0.8454 | 0.7708 | 0.8639 |
| LEMBPasskeyRetrieval | 0.3850 | 0.8450 | 0.3825 | 1.0000 |
| LegalBenchCorporateLobbying | 0.9598 | 0.9615 | 0.8972 | 0.9696 |
| MIRACLRetrievalHardNegatives | 0.7042 | 0.7305 | 0.6675 | 0.7058 |
| MLQARetrieval | 0.8416 | 0.8388 | 0.7566 | 0.8416 |
| MacedonianTweetSentimentClassification | 0.7183 | 0.6868 | 0.6192 | 0.7547 |
| MalteseNewsClassification | 0.3738 | 0.3928 | 0.2533 | 0.4741 |
| MasakhaNEWSClassification | 0.8355 | 0.8623 | 0.7754 | 0.8603 |
| MasakhaNEWSClusteringS2S | 0.5745 | 0.6087 | 0.3804 | 0.7182 |
| MassiveIntentClassification | 0.8192 | 0.7635 | 0.6591 | 0.9194 |
| MedrxivClusteringP2P.v2 | 0.4716 | 0.4201 | 0.3515 | 0.7199 |
| MultiEURLEXMultilabelClassification | 0.0528 | 0.0477 | 0.0516 | 0.0550 |
| MultiHateClassification | 0.7247 | 0.8032 | 0.6357 | 0.8262 |
| NTREXBitextMining | 0.9364 | 0.9105 | 0.914 | 0.9368 |
| NepaliNewsClassification | 0.9814 | 0.9753 | 0.8847 | 0.9814 |
| News21InstructionRetrieval | 0.1026 | 0.0676 | -0.0006 | 0.1145 |
| NollySentiBitextMining | 0.6871 | 0.8083 | 0.675 | 0.8071 |
| NordicLangClassification | 0.8597 | 0.8425 | 0.8015 | 0.9199 |
| NorwegianCourtsBitextMining | 0.9342 | 0.9379 | 0.9404 | 0.9447 |
| NusaParagraphEmotionClassification | 0.5638 | 0.5592 | 0.4166 | 0.6538 |
| NusaTranslationBitextMining | 0.7752 | 0.8779 | 0.672 | 0.9222 |
| NusaX-senti | 0.8031 | 0.7762 | 0.7055 | 0.8093 |
| NusaXBitextMining | 0.8252 | 0.8824 | 0.7267 | 0.8790 |
| OdiaNewsClassification | 0.9184 | 0.8689 | 0.8001 | 0.9490 |
| OpusparcusPC | 0.9662 | 0.9662 | 0.948 | 0.9662 |
| PAC | 0.7168 | 0.7471 | 0.7033 | 0.7387 |
| PawsXPairClassification | 0.5999 | 0.6166 | 0.5514 | 0.7524 |
| PlscClusteringP2P.v2 | 0.7431 | 0.7518 | 0.7161 | 0.7524 |
| PoemSentimentClassification | 0.5966 | 0.6702 | 0.5067 | 0.7522 |
| PolEmo2.0-OUT | 0.7753 | 0.8006 | 0.5348 | 0.7881 |
| PpcPC | 0.9550 | 0.9524 | 0.9218 | 0.9550 |
| PunjabiNewsClassification | 0.8261 | 0.8395 | 0.807 | 0.8522 |
| RTE3 | 0.8955 | 0.8997 | 0.8752 | 0.9123 |
| Robust04InstructionRetrieval | -0.0241 | 0.1111 | -0.0748 | 0.1372 |
| RomaniBibleClustering | 0.4322 | 0.4326 | 0.4092 | 0.4514 |
| RuBQReranking | 0.7384 | 0.8049 | 0.756 | 0.8051 |
| SCIDOCS | 0.2515 | 0.2817 | 0.1747 | 0.3453 |
| SIB200ClusteringS2S | 0.4174 | 0.5067 | 0.2366 | 0.4719 |
| SICK-R | 0.8275 | 0.8537 | 0.8023 | 0.9465 |
| STS12 | 0.8155 | 0.8336 | 0.8002 | 0.9546 |
| STS13 | 0.8989 | 0.9196 | 0.8155 | 0.9776 |
| STS14 | 0.8541 | 0.8784 | 0.7772 | 0.9753 |
| STS15 | 0.9044 | 0.9183 | 0.8931 | 0.9811 |
| STS17 | 0.8858 | 0.8996 | 0.8215 | 0.9323 |
| STS22.v2 | 0.7169 | 0.7154 | 0.643 | 0.7718 |
| STSB | 0.8550 | 0.8614 | 0.8236 | 0.9199 |
| STSBenchmark | 0.8908 | 0.9055 | 0.8729 | 0.9504 |
| STSES | 0.8175 | 0.8034 | 0.8021 | 0.8231 |
| ScalaClassification | 0.5185 | 0.6586 | 0.5157 | 0.5743 |
| SemRel24STS | 0.7314 | 0.7028 | 0.6266 | 0.8112 |
| SentimentAnalysisHindi | 0.7606 | 0.7840 | 0.642 | 0.8001 |
| SinhalaNewsClassification | 0.8229 | 0.8248 | 0.6682 | 0.8229 |
| SiswatiNewsClassification | 0.6238 | 0.5825 | 0.535 | 0.7837 |
| SlovakMovieReviewSentimentClassification | 0.9035 | 0.8993 | 0.7441 | 0.9441 |
| SpartQA | 0.1030 | 0.1586 | 0.0565 | 0.3024 |
| SprintDuplicateQuestions | 0.9690 | 0.9607 | 0.9318 | 0.9838 |
| StackExchangeClustering.v2 | 0.9207 | 0.7642 | 0.4643 | 0.9207 |
| StackOverflowQA | 0.9671 | 0.9659 | 0.8889 | 0.9717 |
| StatcanDialogueDatasetRetrieval | 0.5111 | 0.4167 | 0.1063 | 0.5807 |
| SwahiliNewsClassification | 0.6605 | 0.6441 | 0.5969 | 0.6753 |
| SwednClusteringP2P | 0.4584 | 0.5423 | 0.3691 | 0.6213 |
| SwissJudgementClassification | 0.5786 | 0.6097 | 0.5362 | 0.6727 |
| T2Reranking | 0.6795 | 0.6829 | 0.6632 | 0.7315 |
| TERRa | 0.6392 | 0.6721 | 0.5842 | 0.7957 |
| TRECCOVID | 0.8631 | 0.8819 | 0.7133 | 0.9499 |
| Tatoeba | 0.8197 | 0.8155 | 0.7574 | 0.9515 |
| TempReasonL1 | 0.0296 | 0.0805 | 0.0114 | 0.0716 |
| ToxicConversationsClassification | 0.8875 | 0.8697 | 0.7132 | 0.9759 |
| TswanaNewsClassification | 0.5337 | 0.5018 | 0.47 | 0.5337 |
| TweetTopicSingleClassification | 0.7111 | 0.7842 | 0.6532 | 0.8171 |
| TwitterHjerneRetrieval | 0.9802 | 0.8139 | 0.3522 | 0.9802 |
| TwitterURLCorpus | 0.8705 | 0.8767 | 0.8589 | 0.9571 |
| VoyageMMarcoReranking | 0.6673 | 0.7125 | 0.6821 | 0.7126 |
| WebLINXCandidatesReranking | 0.1097 | 0.1366 | 0.0778 | 0.1595 |
| WikiCitiesClustering | 0.9163 | 0.9042 | 0.755 | 0.9381 |
| WikiClusteringP2P.v2 | 0.2823 | 0.3282 | 0.256 | 0.3234 |
| WikipediaRerankingMultilingual | 0.9224 | 0.9168 | 0.897 | 0.9224 |
| WikipediaRetrievalMultilingual | 0.9420 | 0.9301 | 0.9082 | 0.9420 |
| WinoGrande | 0.6052 | 0.5175 | 0.5498 | 0.7561 |
| XNLI | 0.8526 | 0.8340 | 0.7477 | 0.8907 |
| indonli | 0.6069 | 0.6166 | 0.5174 | 0.6683 |
| Average | 0.6837 | 0.6946 | 0.5902 | 0.7595 |
The model has high performance on these tasks: BrazilianToxicTweetsClassification, DiaBlaBitextMining, FilipinoShopeeReviewsClassification, IndicLangClassification, IndonesianIdClickbaitClassification, KorHateSpeechMLClassification, MIRACLRetrievalHardNegatives, MasakhaNEWSClassification, NollySentiBitextMining, NusaXBitextMining, OpusparcusPC, PAC, PolEmo2.0-OUT, SIB200ClusteringS2S, ScalaClassification, SinhalaNewsClassification, TempReasonL1, WikiClusteringP2P.v2
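As a rough illustration of how such a list could be derived from the table, the sketch below compares the `nvidia/llama-embed-nemotron-8b` score with the "Max result" column for a handful of rows copied from the table. It assumes a "greater than or equal" comparison on the rounded scores, which is consistent with OpusparcusPC (0.9662 vs. 0.9662) appearing in the list. The actual logic used to generate the comment is not shown in this thread, and the function name `tasks_at_or_above_max` is hypothetical.

```python
# Illustrative sketch only, not the logic that generated the comment above.
# A few rows copied from the table:
# task name -> (nvidia/llama-embed-nemotron-8b score, previous max result)
rows = {
    "BrazilianToxicTweetsClassification": (0.2901, 0.2802),
    "DiaBlaBitextMining": (0.8865, 0.8846),
    "OpusparcusPC": (0.9662, 0.9662),
    "PlscClusteringP2P.v2": (0.7518, 0.7524),
    "RuBQReranking": (0.8049, 0.8051),
    "TempReasonL1": (0.0805, 0.0716),
}


def tasks_at_or_above_max(rows):
    """Return the names of tasks where the model score meets or exceeds the previous max.

    The >= comparison is an assumption made for this sketch.
    """
    return sorted(
        name for name, (score, prev_max) in rows.items() if score >= prev_max
    )


print(tasks_at_or_above_max(rows))
```

With these six rows, PlscClusteringP2P.v2 and RuBQReranking are excluded because the model falls just short of the previous max, which matches their absence from the list above.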
Tasks with improvements over previous max
@ybabakhin can I ask about the approach you used for selecting prompts? (I see that not all tasks have them.)
We can't update prompts, because the results wouldn't be reproducible.
@KennethEnevoldsen the approach was the following:
Thanks for the clarification @ybabakhin. I will merge this in as it doesn't seem like there are any notable changes in the model implementation. Btw @ybabakhin did you consider running this on RTEB as well? (we can help run the private set if you run the public)
We can introduce a new version of the tasks and rerun the existing set of models.
@KennethEnevoldsen Thanks for merging!
Unfortunately, RTEB was released after we had already finalized our data mix. We don't have any code data in there, while RTEB is pretty code-heavy, so we're not targeting it with this model. But we like the public-private leaderboard setup, and we can revisit RTEB in the future.
Adds MMTEB results for llama-embed-nemotron-8b model
Checklist
mteb/models/ (this can be as an API). Instructions on how to add a model can be found here