MMTEB results for llama-embed-nemotron-8b by ybabakhin · Pull Request #302 · embeddings-benchmark/results

ybabakhin · 2025-10-17T21:24:09Z

Adds MMTEB results for llama-embed-nemotron-8b model

Checklist

- My model has a model sheet, report or similar
- My model has a reference implementation in mteb/models/; this can be an API. Instructions on how to add a model can be found here
  - No, but there is an existing PR: model: llama-embed-nemotron-8b mteb#3407
- The results submitted were obtained using the reference implementation
- My model is available, either as a publicly accessible API or publicly on e.g., Huggingface
- I solemnly swear that for all results submitted I have not trained on the evaluation dataset, including training splits. If I have, I have disclosed it clearly.

github-actions · 2025-10-17T21:31:16Z

Model Results Comparison

Reference models: intfloat/multilingual-e5-large, google/gemini-embedding-001
New models evaluated: nvidia/llama-embed-nemotron-8b
Tasks: AILAStatutes, AfriSentiClassification, AlloProfClusteringS2S.v2, AlloprofReranking, AmazonCounterfactualClassification, ArXivHierarchicalClusteringP2P, ArXivHierarchicalClusteringS2S, ArguAna, ArmenianParaphrasePC, BUCC.v2, BelebeleRetrieval, BibleNLPBitextMining, BigPatentClustering.v2, BiorxivClusteringP2P.v2, BornholmBitextMining, BrazilianToxicTweetsClassification, BulgarianStoreReviewSentimentClassfication, CEDRClassification, CLSClusteringP2P.v2, CSFDSKMovieReviewSentimentClassification, CTKFactsNLI, CataloniaTweetClassification, Core17InstructionRetrieval, CovidRetrieval, CyrillicTurkicLangClassification, CzechProductReviewSentimentClassification, DBpediaClassification, DalajClassification, DiaBlaBitextMining, EstonianValenceClassification, FaroeseSTS, FilipinoShopeeReviewsClassification, FinParaSTS, FinancialPhrasebankClassification, FloresBitextMining, GermanSTSBenchmark, GreekLegalCodeClassification, GujaratiNewsClassification, HALClusteringS2S.v2, HagridRetrieval, IN22GenBitextMining, IndicCrosslingualSTS, IndicGenBenchFloresBitextMining, IndicLangClassification, IndonesianIdClickbaitClassification, IsiZuluNewsClassification, ItaCaseholdClassification, JSICK, KorHateSpeechMLClassification, KorSarcasmClassification, KurdishSentimentClassification, LEMBPasskeyRetrieval, LegalBenchCorporateLobbying, MIRACLRetrievalHardNegatives, MLQARetrieval, MacedonianTweetSentimentClassification, MalteseNewsClassification, MasakhaNEWSClassification, MasakhaNEWSClusteringS2S, MassiveIntentClassification, MedrxivClusteringP2P.v2, MultiEURLEXMultilabelClassification, MultiHateClassification, NTREXBitextMining, NepaliNewsClassification, News21InstructionRetrieval, NollySentiBitextMining, NordicLangClassification, NorwegianCourtsBitextMining, NusaParagraphEmotionClassification, NusaTranslationBitextMining, NusaX-senti, NusaXBitextMining, OdiaNewsClassification, OpusparcusPC, PAC, PawsXPairClassification, PlscClusteringP2P.v2, PoemSentimentClassification, PolEmo2.0-OUT, 
PpcPC, PunjabiNewsClassification, RTE3, Robust04InstructionRetrieval, RomaniBibleClustering, RuBQReranking, SCIDOCS, SIB200ClusteringS2S, SICK-R, STS12, STS13, STS14, STS15, STS17, STS22.v2, STSB, STSBenchmark, STSES, ScalaClassification, SemRel24STS, SentimentAnalysisHindi, SinhalaNewsClassification, SiswatiNewsClassification, SlovakMovieReviewSentimentClassification, SpartQA, SprintDuplicateQuestions, StackExchangeClustering.v2, StackOverflowQA, StatcanDialogueDatasetRetrieval, SwahiliNewsClassification, SwednClusteringP2P, SwissJudgementClassification, T2Reranking, TERRa, TRECCOVID, Tatoeba, TempReasonL1, ToxicConversationsClassification, TswanaNewsClassification, TweetTopicSingleClassification, TwitterHjerneRetrieval, TwitterURLCorpus, VoyageMMarcoReranking, WebLINXCandidatesReranking, WikiCitiesClustering, WikiClusteringP2P.v2, WikipediaRerankingMultilingual, WikipediaRetrievalMultilingual, WinoGrande, XNLI, indonli

Results for `nvidia/llama-embed-nemotron-8b`

task_name	google/gemini-embedding-001	nvidia/llama-embed-nemotron-8b	intfloat/multilingual-e5-large	Max result
AILAStatutes	0.4877	0.5403	0.2084	0.8509
AfriSentiClassification	0.5356	0.4939	0.455	0.5399
AlloProfClusteringS2S.v2	0.5636	0.5714	0.3515	0.5965
AlloprofReranking	0.8177	0.8129	0.6944	0.8513
AmazonCounterfactualClassification	0.8820	0.8394	0.7713	0.9696
ArXivHierarchicalClusteringP2P	0.6492	0.6284	0.5569	0.6869
ArXivHierarchicalClusteringS2S	0.6384	0.6389	0.5621	0.6548
ArguAna	0.8644	0.7567	0.5438	0.8979
ArmenianParaphrasePC	0.9689	0.9682	0.9493	0.9689
BUCC.v2	0.9899	0.9898	0.9878	0.9902
BelebeleRetrieval	0.9073	0.8604	0.7791	0.9167
BibleNLPBitextMining	0.2072	0.2149	0.1665	0.9899
BigPatentClustering.v2	0.3806	0.3667	0.3466	0.4553
BiorxivClusteringP2P.v2	0.5386	0.4710	0.3778	0.8417
BornholmBitextMining	0.5169	0.6548	0.4416	0.7633
BrazilianToxicTweetsClassification	0.2802	0.2901	0.2123	0.2802
BulgarianStoreReviewSentimentClassfication	0.7813	0.7967	0.7093	0.8044
CEDRClassification	0.5742	0.5325	0.4484	0.7301
CLSClusteringP2P.v2	0.4268	0.4428	0.4037	0.7572
CSFDSKMovieReviewSentimentClassification	0.4938	0.5543	0.3664	0.6243
CTKFactsNLI	0.8759	0.8735	0.8096	0.8993
CataloniaTweetClassification	0.5451	0.5313	0.504	0.5563
Core17InstructionRetrieval	0.0769	0.1461	-0.0162	0.1648
CovidRetrieval	0.7913	0.7953	0.7561	0.9606
CyrillicTurkicLangClassification	0.9530	0.9252	0.4085	0.9615
CzechProductReviewSentimentClassification	0.6816	0.6807	0.5742	0.6988
DBpediaClassification	0.9476	0.9764	0.8828	0.9926
DalajClassification	0.5047	0.5277	0.5001	0.5352
DiaBlaBitextMining	0.8723	0.8865	0.8483	0.8846
EstonianValenceClassification	0.5352	0.6291	0.4358	0.6820
FaroeseSTS	0.8612	0.8393	0.7239	0.9739
FilipinoShopeeReviewsClassification	0.4845	0.5094	0.3527	0.5052
FinParaSTS	0.2860	0.2656	0.2666	0.3399
FinancialPhrasebankClassification	0.8864	0.9440	0.8404	0.9515
FloresBitextMining	0.8371	0.8021	0.8108	0.8596
GermanSTSBenchmark	0.8809	0.8900	0.8527	0.9541
GreekLegalCodeClassification	0.4376	0.5128	0.3713	0.5648
GujaratiNewsClassification	0.9205	0.8970	0.7674	0.9205
HALClusteringS2S.v2	0.3200	0.3184	0.2261	0.3237
HagridRetrieval	0.9931	0.9897	0.9891	0.9931
IN22GenBitextMining	0.9375	0.8775	0.7675	0.9375
IndicCrosslingualSTS	0.6287	0.5818	0.4387	0.8477
IndicGenBenchFloresBitextMining	0.9677	0.9655	0.8875	0.9881
IndicLangClassification	0.8769	0.9554	0.2025	0.9532
IndonesianIdClickbaitClassification	0.6700	0.7560	0.6122	0.6700
IsiZuluNewsClassification	0.4053	0.3826	0.3241	0.4053
ItaCaseholdClassification	0.7330	0.7321	0.6679	0.9439
JSICK	0.8499	0.8380	0.7983	0.8938
KorHateSpeechMLClassification	0.1769	0.2297	0.1049	0.2167
KorSarcasmClassification	0.6051	0.6388	0.5679	0.6629
KurdishSentimentClassification	0.8639	0.8454	0.7708	0.8639
LEMBPasskeyRetrieval	0.3850	0.8450	0.3825	1.0000
LegalBenchCorporateLobbying	0.9598	0.9615	0.8972	0.9696
MIRACLRetrievalHardNegatives	0.7042	0.7305	0.6675	0.7058
MLQARetrieval	0.8416	0.8388	0.7566	0.8416
MacedonianTweetSentimentClassification	0.7183	0.6868	0.6192	0.7547
MalteseNewsClassification	0.3738	0.3928	0.2533	0.4741
MasakhaNEWSClassification	0.8355	0.8623	0.7754	0.8603
MasakhaNEWSClusteringS2S	0.5745	0.6087	0.3804	0.7182
MassiveIntentClassification	0.8192	0.7635	0.6591	0.9194
MedrxivClusteringP2P.v2	0.4716	0.4201	0.3515	0.7199
MultiEURLEXMultilabelClassification	0.0528	0.0477	0.0516	0.0550
MultiHateClassification	0.7247	0.8032	0.6357	0.8262
NTREXBitextMining	0.9364	0.9105	0.914	0.9368
NepaliNewsClassification	0.9814	0.9753	0.8847	0.9814
News21InstructionRetrieval	0.1026	0.0676	-0.0006	0.1145
NollySentiBitextMining	0.6871	0.8083	0.675	0.8071
NordicLangClassification	0.8597	0.8425	0.8015	0.9199
NorwegianCourtsBitextMining	0.9342	0.9379	0.9404	0.9447
NusaParagraphEmotionClassification	0.5638	0.5592	0.4166	0.6538
NusaTranslationBitextMining	0.7752	0.8779	0.672	0.9222
NusaX-senti	0.8031	0.7762	0.7055	0.8093
NusaXBitextMining	0.8252	0.8824	0.7267	0.8790
OdiaNewsClassification	0.9184	0.8689	0.8001	0.9490
OpusparcusPC	0.9662	0.9662	0.948	0.9662
PAC	0.7168	0.7471	0.7033	0.7387
PawsXPairClassification	0.5999	0.6166	0.5514	0.7524
PlscClusteringP2P.v2	0.7431	0.7518	0.7161	0.7524
PoemSentimentClassification	0.5966	0.6702	0.5067	0.7522
PolEmo2.0-OUT	0.7753	0.8006	0.5348	0.7881
PpcPC	0.9550	0.9524	0.9218	0.9550
PunjabiNewsClassification	0.8261	0.8395	0.807	0.8522
RTE3	0.8955	0.8997	0.8752	0.9123
Robust04InstructionRetrieval	-0.0241	0.1111	-0.0748	0.1372
RomaniBibleClustering	0.4322	0.4326	0.4092	0.4514
RuBQReranking	0.7384	0.8049	0.756	0.8051
SCIDOCS	0.2515	0.2817	0.1747	0.3453
SIB200ClusteringS2S	0.4174	0.5067	0.2366	0.4719
SICK-R	0.8275	0.8537	0.8023	0.9465
STS12	0.8155	0.8336	0.8002	0.9546
STS13	0.8989	0.9196	0.8155	0.9776
STS14	0.8541	0.8784	0.7772	0.9753
STS15	0.9044	0.9183	0.8931	0.9811
STS17	0.8858	0.8996	0.8215	0.9323
STS22.v2	0.7169	0.7154	0.643	0.7718
STSB	0.8550	0.8614	0.8236	0.9199
STSBenchmark	0.8908	0.9055	0.8729	0.9504
STSES	0.8175	0.8034	0.8021	0.8231
ScalaClassification	0.5185	0.6586	0.5157	0.5743
SemRel24STS	0.7314	0.7028	0.6266	0.8112
SentimentAnalysisHindi	0.7606	0.7840	0.642	0.8001
SinhalaNewsClassification	0.8229	0.8248	0.6682	0.8229
SiswatiNewsClassification	0.6238	0.5825	0.535	0.7837
SlovakMovieReviewSentimentClassification	0.9035	0.8993	0.7441	0.9441
SpartQA	0.1030	0.1586	0.0565	0.3024
SprintDuplicateQuestions	0.9690	0.9607	0.9318	0.9838
StackExchangeClustering.v2	0.9207	0.7642	0.4643	0.9207
StackOverflowQA	0.9671	0.9659	0.8889	0.9717
StatcanDialogueDatasetRetrieval	0.5111	0.4167	0.1063	0.5807
SwahiliNewsClassification	0.6605	0.6441	0.5969	0.6753
SwednClusteringP2P	0.4584	0.5423	0.3691	0.6213
SwissJudgementClassification	0.5786	0.6097	0.5362	0.6727
T2Reranking	0.6795	0.6829	0.6632	0.7315
TERRa	0.6392	0.6721	0.5842	0.7957
TRECCOVID	0.8631	0.8819	0.7133	0.9499
Tatoeba	0.8197	0.8155	0.7574	0.9515
TempReasonL1	0.0296	0.0805	0.0114	0.0716
ToxicConversationsClassification	0.8875	0.8697	0.7132	0.9759
TswanaNewsClassification	0.5337	0.5018	0.47	0.5337
TweetTopicSingleClassification	0.7111	0.7842	0.6532	0.8171
TwitterHjerneRetrieval	0.9802	0.8139	0.3522	0.9802
TwitterURLCorpus	0.8705	0.8767	0.8589	0.9571
VoyageMMarcoReranking	0.6673	0.7125	0.6821	0.7126
WebLINXCandidatesReranking	0.1097	0.1366	0.0778	0.1595
WikiCitiesClustering	0.9163	0.9042	0.755	0.9381
WikiClusteringP2P.v2	0.2823	0.3282	0.256	0.3234
WikipediaRerankingMultilingual	0.9224	0.9168	0.897	0.9224
WikipediaRetrievalMultilingual	0.9420	0.9301	0.9082	0.9420
WinoGrande	0.6052	0.5175	0.5498	0.7561
XNLI	0.8526	0.8340	0.7477	0.8907
indonli	0.6069	0.6166	0.5174	0.6683
Average	0.6837	0.6946	0.5902	0.7595

The model has high performance on these tasks: BrazilianToxicTweetsClassification, DiaBlaBitextMining, FilipinoShopeeReviewsClassification, IndicLangClassification, IndonesianIdClickbaitClassification, KorHateSpeechMLClassification, MIRACLRetrievalHardNegatives, MasakhaNEWSClassification, NollySentiBitextMining, NusaXBitextMining, OpusparcusPC, PAC, PolEmo2.0-OUT, SIB200ClusteringS2S, ScalaClassification, SinhalaNewsClassification, TempReasonL1, WikiClusteringP2P.v2
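
The report does not state the criterion behind this list. The listed tasks are consistent with flagging every task where the new model's score meets or exceeds the "Max result" column, so the following is a minimal sketch under that assumption. The function name `tasks_at_or_above_max` and the three-row sample are illustrative; the scores are copied from the table above.

```python
from typing import Dict, List, Tuple

# (new model score, previous max) per task; values copied from the table above.
SAMPLE: Dict[str, Tuple[float, float]] = {
    "OpusparcusPC": (0.9662, 0.9662),
    "PAC": (0.7471, 0.7387),
    "RuBQReranking": (0.8049, 0.8051),
}


def tasks_at_or_above_max(scores: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return task names, sorted, whose new score is >= the previous max.

    Assumption: this mirrors the bot's selection rule, which is not documented
    in the report itself.
    """
    return sorted(name for name, (new, prev_max) in scores.items() if new >= prev_max)


print(tasks_at_or_above_max(SAMPLE))
```

Under this rule a tie (OpusparcusPC) is included, while a narrow miss (RuBQReranking) is not.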

Samoed · 2025-10-18T14:52:21Z

Tasks with improvements over previous max

MIRACLRetrievalHardNegatives +3
ScalaClassification +7
SIB200ClusteringS2S +3
IndonesianIdClickbaitClassification +7
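
For reference, the gains can be recomputed from the results table as a difference in percentage points. This is only a sketch of the arithmetic: the helper name `gain_in_points` is hypothetical, and the figures in the comment above look approximate, so subtracting the table values will not reproduce every one of them exactly.

```python
# (new model score, previous max) copied from the results table.
ROWS = {
    "MIRACLRetrievalHardNegatives": (0.7305, 0.7058),
    "ScalaClassification": (0.6586, 0.5743),
    "SIB200ClusteringS2S": (0.5067, 0.4719),
    "IndonesianIdClickbaitClassification": (0.7560, 0.6700),
}


def gain_in_points(new: float, prev_max: float) -> float:
    """Difference between two 0-1 scores, expressed in percentage points."""
    return round((new - prev_max) * 100, 2)


for task, (new, prev_max) in ROWS.items():
    print(f"{task}: +{gain_in_points(new, prev_max)}")
```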

KennethEnevoldsen · 2025-10-19T08:26:13Z

MIRACLRetrievalHardNegatives is in the training set. Scala can be a bit sporadic, so I wouldn't worry too much about that.

IndonesianIdClickbaitClassification might be a case of a better prompt (the current ones seem quite bad; we should probably update that one). I see that SIB200 also has a custom prompt.

@ybabakhin can I ask about the approach that you used for selecting prompts? (I see that not all tasks have them.)

Samoed · 2025-10-19T08:35:26Z

> the current ones seem quite bad; we should probably update that one

We can't update prompts, because results wouldn't be reproducible

ybabakhin · 2025-10-19T11:02:38Z

@KennethEnevoldsen the approach was the following:

- Take prompts from the Qwen3-Embedding models: https://github.com/QwenLM/Qwen3-Embedding/blob/main/evaluation/task_prompts.json
  - For example, it already had "IndonesianIdClickbaitClassification": "Given an Indonesian news headlines, classify its into clickbait or non-clickbait",
- Edit the general prompt for Retrieval datasets: from "Retrieval the relevant passage for the given query" -> "Given a question, retrieve passages that answer the question"
- Remove datasets that already have their own prompt in mteb
- Remove datasets for the STS, BitextMining and PairClassification problem types, as they all have general prompts: https://github.com/embeddings-benchmark/mteb/pull/3407/files#diff-cbe979b0366cc5f6a7bbc0f0cf20f494873d162415e68d2b80e0f2bd9a28ce98R437
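
The two filtering steps above can be sketched as follows. This is an illustration under stated assumptions, not the code from the mteb PR: the function `select_task_prompts`, its arguments, and the task named `HypotheticalTaskWithMtebPrompt` are made up for the example, and the STS prompt text is a placeholder. Only the IndonesianIdClickbaitClassification prompt is copied from the comment. The edit to the general Retrieval prompt is not modelled here.

```python
from typing import Dict, Set

# Task types that fall back to a general prompt, per the steps above.
GENERAL_PROMPT_TYPES = {"STS", "BitextMining", "PairClassification"}


def select_task_prompts(
    source_prompts: Dict[str, str],
    task_types: Dict[str, str],
    tasks_with_mteb_prompt: Set[str],
) -> Dict[str, str]:
    """Keep a source prompt unless the task already has an mteb prompt
    or its task type is covered by a general prompt."""
    selected = {}
    for task, prompt in source_prompts.items():
        if task in tasks_with_mteb_prompt:
            continue  # mteb already defines a prompt for this task
        if task_types.get(task) in GENERAL_PROMPT_TYPES:
            continue  # a general prompt covers this task type
        selected[task] = prompt
    return selected


source = {
    # Prompt text copied verbatim from the comment above.
    "IndonesianIdClickbaitClassification": (
        "Given an Indonesian news headlines, classify its into clickbait or non-clickbait"
    ),
    "STS12": "placeholder STS prompt",  # hypothetical entry
    "HypotheticalTaskWithMtebPrompt": "placeholder prompt",  # hypothetical entry
}
types = {
    "IndonesianIdClickbaitClassification": "Classification",
    "STS12": "STS",
    "HypotheticalTaskWithMtebPrompt": "Clustering",
}
selected = select_task_prompts(source, types, {"HypotheticalTaskWithMtebPrompt"})
print(sorted(selected))
```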

KennethEnevoldsen · 2025-10-19T20:03:08Z

Thanks for the clarification @ybabakhin. I will merge this in, as it doesn't seem like there are any notable changes in the model implementation.

Btw @ybabakhin did you consider running this on RTEB as well? (we can help run the private set if you run the public)

> We can't update prompts, because results wouldn't be reproducible

We can introduce a new version of the tasks and rerun the existing set of models.

ybabakhin · 2025-10-20T07:11:41Z

@KennethEnevoldsen Thanks for merging!

> Btw @ybabakhin did you consider running this on RTEB as well?

Unfortunately, RTEB was released after we had already finalized our data mix. We don't have any code data in there, while RTEB is pretty code-heavy, so we're not targeting it with this model.

But we like the public-private leaderboard setup, and we can revisit RTEB in the future.

add llama-embed-nemotron-8b (4aeb16d)

run make pre-push (70f7bbf)

ybabakhin mentioned this pull request Oct 18, 2025

model: llama-embed-nemotron-8b embeddings-benchmark/mteb#3407

Merged

6 tasks

update model_meta to v2.0.0 (1a3b95f)

Samoed requested a review from KennethEnevoldsen October 19, 2025 07:56

KennethEnevoldsen merged commit f02b091 into embeddings-benchmark:main Oct 19, 2025
3 checks passed

KennethEnevoldsen merged 3 commits into embeddings-benchmark:main from ybabakhin:llama-embed-nemotron-8b
