MMTEB results for llama-embed-nemotron-8b#302

Merged
KennethEnevoldsen merged 3 commits into
embeddings-benchmark:mainfrom
ybabakhin:llama-embed-nemotron-8b
Oct 19, 2025
@ybabakhin

Contributor

Adds MMTEB results for the llama-embed-nemotron-8b model.

Checklist

  • My model has a model sheet, report, or similar
  • My model has a reference implementation in mteb/models/; this can be an API. Instructions on how to add a model can be found here
  • The results submitted were obtained using the reference implementation
  • My model is available, either as a publicly accessible API or publicly on e.g., Huggingface
  • I solemnly swear that for all results submitted I have not trained on the evaluation dataset, including training splits. If I have, I have disclosed it clearly.

@github-actions

Model Results Comparison

Reference models: intfloat/multilingual-e5-large, google/gemini-embedding-001
New models evaluated: nvidia/llama-embed-nemotron-8b
Tasks: AILAStatutes, AfriSentiClassification, AlloProfClusteringS2S.v2, AlloprofReranking, AmazonCounterfactualClassification, ArXivHierarchicalClusteringP2P, ArXivHierarchicalClusteringS2S, ArguAna, ArmenianParaphrasePC, BUCC.v2, BelebeleRetrieval, BibleNLPBitextMining, BigPatentClustering.v2, BiorxivClusteringP2P.v2, BornholmBitextMining, BrazilianToxicTweetsClassification, BulgarianStoreReviewSentimentClassfication, CEDRClassification, CLSClusteringP2P.v2, CSFDSKMovieReviewSentimentClassification, CTKFactsNLI, CataloniaTweetClassification, Core17InstructionRetrieval, CovidRetrieval, CyrillicTurkicLangClassification, CzechProductReviewSentimentClassification, DBpediaClassification, DalajClassification, DiaBlaBitextMining, EstonianValenceClassification, FaroeseSTS, FilipinoShopeeReviewsClassification, FinParaSTS, FinancialPhrasebankClassification, FloresBitextMining, GermanSTSBenchmark, GreekLegalCodeClassification, GujaratiNewsClassification, HALClusteringS2S.v2, HagridRetrieval, IN22GenBitextMining, IndicCrosslingualSTS, IndicGenBenchFloresBitextMining, IndicLangClassification, IndonesianIdClickbaitClassification, IsiZuluNewsClassification, ItaCaseholdClassification, JSICK, KorHateSpeechMLClassification, KorSarcasmClassification, KurdishSentimentClassification, LEMBPasskeyRetrieval, LegalBenchCorporateLobbying, MIRACLRetrievalHardNegatives, MLQARetrieval, MacedonianTweetSentimentClassification, MalteseNewsClassification, MasakhaNEWSClassification, MasakhaNEWSClusteringS2S, MassiveIntentClassification, MedrxivClusteringP2P.v2, MultiEURLEXMultilabelClassification, MultiHateClassification, NTREXBitextMining, NepaliNewsClassification, News21InstructionRetrieval, NollySentiBitextMining, NordicLangClassification, NorwegianCourtsBitextMining, NusaParagraphEmotionClassification, NusaTranslationBitextMining, NusaX-senti, NusaXBitextMining, OdiaNewsClassification, OpusparcusPC, PAC, PawsXPairClassification, PlscClusteringP2P.v2, PoemSentimentClassification, PolEmo2.0-OUT, 
PpcPC, PunjabiNewsClassification, RTE3, Robust04InstructionRetrieval, RomaniBibleClustering, RuBQReranking, SCIDOCS, SIB200ClusteringS2S, SICK-R, STS12, STS13, STS14, STS15, STS17, STS22.v2, STSB, STSBenchmark, STSES, ScalaClassification, SemRel24STS, SentimentAnalysisHindi, SinhalaNewsClassification, SiswatiNewsClassification, SlovakMovieReviewSentimentClassification, SpartQA, SprintDuplicateQuestions, StackExchangeClustering.v2, StackOverflowQA, StatcanDialogueDatasetRetrieval, SwahiliNewsClassification, SwednClusteringP2P, SwissJudgementClassification, T2Reranking, TERRa, TRECCOVID, Tatoeba, TempReasonL1, ToxicConversationsClassification, TswanaNewsClassification, TweetTopicSingleClassification, TwitterHjerneRetrieval, TwitterURLCorpus, VoyageMMarcoReranking, WebLINXCandidatesReranking, WikiCitiesClustering, WikiClusteringP2P.v2, WikipediaRerankingMultilingual, WikipediaRetrievalMultilingual, WinoGrande, XNLI, indonli

Results for nvidia/llama-embed-nemotron-8b

task_name google/gemini-embedding-001 nvidia/llama-embed-nemotron-8b intfloat/multilingual-e5-large Max result
AILAStatutes 0.4877 0.5403 0.2084 0.8509
AfriSentiClassification 0.5356 0.4939 0.455 0.5399
AlloProfClusteringS2S.v2 0.5636 0.5714 0.3515 0.5965
AlloprofReranking 0.8177 0.8129 0.6944 0.8513
AmazonCounterfactualClassification 0.8820 0.8394 0.7713 0.9696
ArXivHierarchicalClusteringP2P 0.6492 0.6284 0.5569 0.6869
ArXivHierarchicalClusteringS2S 0.6384 0.6389 0.5621 0.6548
ArguAna 0.8644 0.7567 0.5438 0.8979
ArmenianParaphrasePC 0.9689 0.9682 0.9493 0.9689
BUCC.v2 0.9899 0.9898 0.9878 0.9902
BelebeleRetrieval 0.9073 0.8604 0.7791 0.9167
BibleNLPBitextMining 0.2072 0.2149 0.1665 0.9899
BigPatentClustering.v2 0.3806 0.3667 0.3466 0.4553
BiorxivClusteringP2P.v2 0.5386 0.4710 0.3778 0.8417
BornholmBitextMining 0.5169 0.6548 0.4416 0.7633
BrazilianToxicTweetsClassification 0.2802 0.2901 0.2123 0.2802
BulgarianStoreReviewSentimentClassfication 0.7813 0.7967 0.7093 0.8044
CEDRClassification 0.5742 0.5325 0.4484 0.7301
CLSClusteringP2P.v2 0.4268 0.4428 0.4037 0.7572
CSFDSKMovieReviewSentimentClassification 0.4938 0.5543 0.3664 0.6243
CTKFactsNLI 0.8759 0.8735 0.8096 0.8993
CataloniaTweetClassification 0.5451 0.5313 0.504 0.5563
Core17InstructionRetrieval 0.0769 0.1461 -0.0162 0.1648
CovidRetrieval 0.7913 0.7953 0.7561 0.9606
CyrillicTurkicLangClassification 0.9530 0.9252 0.4085 0.9615
CzechProductReviewSentimentClassification 0.6816 0.6807 0.5742 0.6988
DBpediaClassification 0.9476 0.9764 0.8828 0.9926
DalajClassification 0.5047 0.5277 0.5001 0.5352
DiaBlaBitextMining 0.8723 0.8865 0.8483 0.8846
EstonianValenceClassification 0.5352 0.6291 0.4358 0.6820
FaroeseSTS 0.8612 0.8393 0.7239 0.9739
FilipinoShopeeReviewsClassification 0.4845 0.5094 0.3527 0.5052
FinParaSTS 0.2860 0.2656 0.2666 0.3399
FinancialPhrasebankClassification 0.8864 0.9440 0.8404 0.9515
FloresBitextMining 0.8371 0.8021 0.8108 0.8596
GermanSTSBenchmark 0.8809 0.8900 0.8527 0.9541
GreekLegalCodeClassification 0.4376 0.5128 0.3713 0.5648
GujaratiNewsClassification 0.9205 0.8970 0.7674 0.9205
HALClusteringS2S.v2 0.3200 0.3184 0.2261 0.3237
HagridRetrieval 0.9931 0.9897 0.9891 0.9931
IN22GenBitextMining 0.9375 0.8775 0.7675 0.9375
IndicCrosslingualSTS 0.6287 0.5818 0.4387 0.8477
IndicGenBenchFloresBitextMining 0.9677 0.9655 0.8875 0.9881
IndicLangClassification 0.8769 0.9554 0.2025 0.9532
IndonesianIdClickbaitClassification 0.6700 0.7560 0.6122 0.6700
IsiZuluNewsClassification 0.4053 0.3826 0.3241 0.4053
ItaCaseholdClassification 0.7330 0.7321 0.6679 0.9439
JSICK 0.8499 0.8380 0.7983 0.8938
KorHateSpeechMLClassification 0.1769 0.2297 0.1049 0.2167
KorSarcasmClassification 0.6051 0.6388 0.5679 0.6629
KurdishSentimentClassification 0.8639 0.8454 0.7708 0.8639
LEMBPasskeyRetrieval 0.3850 0.8450 0.3825 1.0000
LegalBenchCorporateLobbying 0.9598 0.9615 0.8972 0.9696
MIRACLRetrievalHardNegatives 0.7042 0.7305 0.6675 0.7058
MLQARetrieval 0.8416 0.8388 0.7566 0.8416
MacedonianTweetSentimentClassification 0.7183 0.6868 0.6192 0.7547
MalteseNewsClassification 0.3738 0.3928 0.2533 0.4741
MasakhaNEWSClassification 0.8355 0.8623 0.7754 0.8603
MasakhaNEWSClusteringS2S 0.5745 0.6087 0.3804 0.7182
MassiveIntentClassification 0.8192 0.7635 0.6591 0.9194
MedrxivClusteringP2P.v2 0.4716 0.4201 0.3515 0.7199
MultiEURLEXMultilabelClassification 0.0528 0.0477 0.0516 0.0550
MultiHateClassification 0.7247 0.8032 0.6357 0.8262
NTREXBitextMining 0.9364 0.9105 0.914 0.9368
NepaliNewsClassification 0.9814 0.9753 0.8847 0.9814
News21InstructionRetrieval 0.1026 0.0676 -0.0006 0.1145
NollySentiBitextMining 0.6871 0.8083 0.675 0.8071
NordicLangClassification 0.8597 0.8425 0.8015 0.9199
NorwegianCourtsBitextMining 0.9342 0.9379 0.9404 0.9447
NusaParagraphEmotionClassification 0.5638 0.5592 0.4166 0.6538
NusaTranslationBitextMining 0.7752 0.8779 0.672 0.9222
NusaX-senti 0.8031 0.7762 0.7055 0.8093
NusaXBitextMining 0.8252 0.8824 0.7267 0.8790
OdiaNewsClassification 0.9184 0.8689 0.8001 0.9490
OpusparcusPC 0.9662 0.9662 0.948 0.9662
PAC 0.7168 0.7471 0.7033 0.7387
PawsXPairClassification 0.5999 0.6166 0.5514 0.7524
PlscClusteringP2P.v2 0.7431 0.7518 0.7161 0.7524
PoemSentimentClassification 0.5966 0.6702 0.5067 0.7522
PolEmo2.0-OUT 0.7753 0.8006 0.5348 0.7881
PpcPC 0.9550 0.9524 0.9218 0.9550
PunjabiNewsClassification 0.8261 0.8395 0.807 0.8522
RTE3 0.8955 0.8997 0.8752 0.9123
Robust04InstructionRetrieval -0.0241 0.1111 -0.0748 0.1372
RomaniBibleClustering 0.4322 0.4326 0.4092 0.4514
RuBQReranking 0.7384 0.8049 0.756 0.8051
SCIDOCS 0.2515 0.2817 0.1747 0.3453
SIB200ClusteringS2S 0.4174 0.5067 0.2366 0.4719
SICK-R 0.8275 0.8537 0.8023 0.9465
STS12 0.8155 0.8336 0.8002 0.9546
STS13 0.8989 0.9196 0.8155 0.9776
STS14 0.8541 0.8784 0.7772 0.9753
STS15 0.9044 0.9183 0.8931 0.9811
STS17 0.8858 0.8996 0.8215 0.9323
STS22.v2 0.7169 0.7154 0.643 0.7718
STSB 0.8550 0.8614 0.8236 0.9199
STSBenchmark 0.8908 0.9055 0.8729 0.9504
STSES 0.8175 0.8034 0.8021 0.8231
ScalaClassification 0.5185 0.6586 0.5157 0.5743
SemRel24STS 0.7314 0.7028 0.6266 0.8112
SentimentAnalysisHindi 0.7606 0.7840 0.642 0.8001
SinhalaNewsClassification 0.8229 0.8248 0.6682 0.8229
SiswatiNewsClassification 0.6238 0.5825 0.535 0.7837
SlovakMovieReviewSentimentClassification 0.9035 0.8993 0.7441 0.9441
SpartQA 0.1030 0.1586 0.0565 0.3024
SprintDuplicateQuestions 0.9690 0.9607 0.9318 0.9838
StackExchangeClustering.v2 0.9207 0.7642 0.4643 0.9207
StackOverflowQA 0.9671 0.9659 0.8889 0.9717
StatcanDialogueDatasetRetrieval 0.5111 0.4167 0.1063 0.5807
SwahiliNewsClassification 0.6605 0.6441 0.5969 0.6753
SwednClusteringP2P 0.4584 0.5423 0.3691 0.6213
SwissJudgementClassification 0.5786 0.6097 0.5362 0.6727
T2Reranking 0.6795 0.6829 0.6632 0.7315
TERRa 0.6392 0.6721 0.5842 0.7957
TRECCOVID 0.8631 0.8819 0.7133 0.9499
Tatoeba 0.8197 0.8155 0.7574 0.9515
TempReasonL1 0.0296 0.0805 0.0114 0.0716
ToxicConversationsClassification 0.8875 0.8697 0.7132 0.9759
TswanaNewsClassification 0.5337 0.5018 0.47 0.5337
TweetTopicSingleClassification 0.7111 0.7842 0.6532 0.8171
TwitterHjerneRetrieval 0.9802 0.8139 0.3522 0.9802
TwitterURLCorpus 0.8705 0.8767 0.8589 0.9571
VoyageMMarcoReranking 0.6673 0.7125 0.6821 0.7126
WebLINXCandidatesReranking 0.1097 0.1366 0.0778 0.1595
WikiCitiesClustering 0.9163 0.9042 0.755 0.9381
WikiClusteringP2P.v2 0.2823 0.3282 0.256 0.3234
WikipediaRerankingMultilingual 0.9224 0.9168 0.897 0.9224
WikipediaRetrievalMultilingual 0.9420 0.9301 0.9082 0.9420
WinoGrande 0.6052 0.5175 0.5498 0.7561
XNLI 0.8526 0.8340 0.7477 0.8907
indonli 0.6069 0.6166 0.5174 0.6683
Average 0.6837 0.6946 0.5902 0.7595

The model has high performance on these tasks: BrazilianToxicTweetsClassification, DiaBlaBitextMining, FilipinoShopeeReviewsClassification, IndicLangClassification, IndonesianIdClickbaitClassification, KorHateSpeechMLClassification, MIRACLRetrievalHardNegatives, MasakhaNEWSClassification, NollySentiBitextMining, NusaXBitextMining, OpusparcusPC, PAC, PolEmo2.0-OUT, SIB200ClusteringS2S, ScalaClassification, SinhalaNewsClassification, TempReasonL1, WikiClusteringP2P.v2
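The bot's selection logic is not shown in this thread. As a rough sketch, the list above looks consistent with flagging every task where the new model's score meets or exceeds the "Max result" column (OpusparcusPC, for instance, ties the max and is still listed). The rule and the helper name `tasks_at_or_above_max` are assumptions made for illustration; the scores are copied from a handful of rows in the table.

```python
# Scores copied from a few rows of the comparison table:
# task name -> (nvidia/llama-embed-nemotron-8b score, "Max result" column)
rows = {
    "BrazilianToxicTweetsClassification": (0.2901, 0.2802),
    "OpusparcusPC": (0.9662, 0.9662),
    "ScalaClassification": (0.6586, 0.5743),
    "ArguAna": (0.7567, 0.8979),
    "WinoGrande": (0.5175, 0.7561),
}


def tasks_at_or_above_max(rows):
    """Return task names whose new score is >= the previous max.

    Assumed rule (hypothetical): the thread does not confirm how the
    bot actually builds its "high performance" list.
    """
    return sorted(
        task
        for task, (new_score, prev_max) in rows.items()
        if new_score >= prev_max
    )


flagged = tasks_at_or_above_max(rows)
print(flagged)
```

With these five rows, the first three tasks are flagged and ArguAna and WinoGrande are not, which matches their presence or absence in the bot's list.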


@Samoed

Samoed commented Oct 18, 2025

Member

Tasks with improvements over previous max

  1. MIRACLRetrievalHardNegatives +3
  2. ScalaClassification +7
  3. SIB200ClusteringS2S +3
  4. IndonesianIdClickbaitClassification +7

@KennethEnevoldsen

Contributor

MIRACLRetrievalHardNegatives is in the training set. Scala can be a bit sporadic so wouldn't worry too much about that.

IndonesianIdClickbaitClassification might be a case of a better prompt (the current ones seem quite bad; we should probably update that one). I see that SIB200 also has a custom prompt.

@ybabakhin can I ask about the approach you used for selecting prompts? (I see that not all tasks have them.)

@Samoed

Samoed commented Oct 19, 2025

Member

the current ones seem quite bad; we should probably update that one

We can't update prompts, because results wouldn't be reproducible

@ybabakhin

Contributor Author

@KennethEnevoldsen approach was the following:

@KennethEnevoldsen

KennethEnevoldsen commented Oct 19, 2025

Contributor

Thanks for the clarification @ybabakhin. I will merge this in as it doesn't seem like there are any notable changes in the model implementation.

Btw @ybabakhin did you consider running this on RTEB as well? (we can help run the private set if you run the public)

We can't update prompts, because results wouldn't be reproducible

We can introduce a new version of the tasks and rerun the existing set of models.
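One way to picture this versioning idea is an append-only registry: the old task keeps its original prompt, so published results stay reproducible, and the improved prompt ships under a new versioned name, in the spirit of the `.v2` suffix seen on several tasks in the list above. This is only a minimal sketch. `TaskVersion`, `REGISTRY`, `register`, the task name `ExampleTask`, and both prompt strings are hypothetical and do not reflect the actual mteb implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskVersion:
    name: str    # versioned task name, e.g. "ExampleTask" or "ExampleTask.v2"
    prompt: str  # instruction text used when embedding inputs for this task


# Hypothetical append-only registry: existing entries are never edited.
REGISTRY: dict[str, TaskVersion] = {}


def register(task: TaskVersion) -> None:
    """Add a task version, refusing to overwrite one that already exists."""
    if task.name in REGISTRY:
        raise ValueError(f"{task.name} is already registered; add a new version instead")
    REGISTRY[task.name] = task


# The original task keeps its prompt, so earlier results remain comparable.
register(TaskVersion("ExampleTask", "Classify the text"))
# The improved prompt is published under a new name and models are rerun on it.
register(TaskVersion("ExampleTask.v2", "Classify whether the headline is clickbait"))
```

Scores reported for `ExampleTask` and `ExampleTask.v2` would then be tracked separately, which addresses the reproducibility concern raised earlier in the thread.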

@KennethEnevoldsen KennethEnevoldsen merged commit f02b091 into embeddings-benchmark:main Oct 19, 2025
3 checks passed
@ybabakhin

Contributor Author

@KennethEnevoldsen Thanks for merging!

Btw @ybabakhin did you consider running this on RTEB as well?

Unfortunately, RTEB was released after we had already finalized our data mix. We don't have any code data in there, while RTEB is pretty code-heavy, so we're not targeting it with this model.

But we like the public-private leaderboard setup, and we can revisit RTEB in the future.
