model: llama-embed-nemotron-8b #3407

Merged
Samoed merged 6 commits into embeddings-benchmark:v2.0.0 from ybabakhin:llama-embed-nemotron-8b
Oct 19, 2025

Conversation

@ybabakhin

Contributor

Adds llama-embed-nemotron-8b model

  • I have filled out the ModelMeta object to the extent possible
  • I have ensured that my model can be loaded using
    • mteb.get_model(model_name, revision) and
    • mteb.get_model_meta(model_name, revision)
  • I have tested the implementation works on a representative set of tasks.
  • The model is public, i.e. it is available either as an API or the weights are publicly available to download

@Samoed

Samoed commented Oct 17, 2025

Member

Do you have plans to integrate your omni embed model? We're releasing v2 on Monday with better support for multimodality

Comment thread mteb/models/nvidia_models.py Outdated
model_name,
revision,
max_seq_length=4096,
batch_size=4,

Member

Suggested change
batch_size=4,

Contributor Author

I added batch_size handling from encode_kwargs, but some of the benchmarks are getting GPU OOM now. Is it a user's responsibility to specify a proper encode_kwargs={"batch_size": 4} argument?

Contributor

Well, it really depends on the system they are on. Currently, the default is 32, but it might be ideal to lower that. I'm unsure whether it is better to hit the OOM and adjust the batch size down to a reasonable level than to ship a default that is too low. I'm leaning toward the OOM being the better outcome

Contributor Author

Right, I think 32 is a reasonable default. Actually, I was getting those OOMs on version 1.39.7, which had a default of 128 for some problem types
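The "hit the OOM, then adjust down" approach discussed in this thread can also be automated on the user side. Below is a minimal sketch, not part of mteb or this PR: `run_with_batch_backoff` and its parameters are hypothetical names, and the helper only assumes that the caller supplies a function taking a batch size.

```python
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def run_with_batch_backoff(
    run: Callable[[int], T],
    start_batch_size: int = 32,  # the default mentioned in this thread
    min_batch_size: int = 1,
    oom_errors: Tuple[Type[BaseException], ...] = (MemoryError,),
) -> Tuple[T, int]:
    """Call run(batch_size), halving batch_size after each out-of-memory error.

    Hypothetical helper: returns the result together with the batch size
    that finally worked.
    """
    batch_size = start_batch_size
    while True:
        try:
            return run(batch_size), batch_size
        except oom_errors:
            if batch_size <= min_batch_size:
                raise  # nothing smaller left to try
            batch_size = max(min_batch_size, batch_size // 2)
```

In a PyTorch setup, `run` could wrap a call such as `mteb.evaluate(model, tasks, encode_kwargs={"batch_size": bs})`, with `oom_errors=(torch.cuda.OutOfMemoryError,)`. Whether GPU memory is fully released between retries depends on the environment, so treat this as a starting point only.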

Comment thread mteb/models/nvidia_models.py Outdated
with torch.inference_mode():
inputs = self.tokenizer(
batch,
max_length=self.max_seq_length,

Member

I think it would be better to specify this in the tokenization config

Contributor Author

Here is the current state of our context length support:

  • Base Llama-3.1-8B supports 128k
  • We've tested our llama-embed-nemotron-8b with context lengths up to 32k, which is what we report in the metadata
  • We ran the evaluation with a 4k context length

So, our config has a theoretical 128k limit, but 4k is here for eval reproducibility
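The three limits above can be kept apart explicitly. The following is an illustrative sketch only, not the code in `nvidia_models.py`: the constant and function names are hypothetical, and "128k" and "32k" are taken here as 128 * 1024 and 32 * 1024 tokens.

```python
from typing import List, Optional

# Hypothetical constants mirroring the three limits listed above.
BASE_MODEL_LIMIT = 128 * 1024  # theoretical limit inherited from the base model
TESTED_LIMIT = 32 * 1024       # longest context the authors report testing
EVAL_LIMIT = 4096              # value used for the reported evaluation runs


def resolve_max_length(requested: Optional[int] = None) -> int:
    """Default to the eval limit; never exceed the base model limit."""
    limit = EVAL_LIMIT if requested is None else requested
    if limit < 1:
        raise ValueError("max length must be positive")
    return min(limit, BASE_MODEL_LIMIT)


def truncate(token_ids: List[int], max_length: int) -> List[int]:
    """Keep only the first max_length token ids."""
    return token_ids[:max_length]
```

Defaulting to the eval limit keeps scores reproducible, while an explicit `requested` value still allows longer inputs up to the theoretical cap.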

@boliu61

boliu61 commented Oct 17, 2025


Do you have plans to integrate your omni embed model? We're releasing v2 on Monday with better support for multimodality

Hi @Samoed, do you mean v2 of M-MTEB? What will happen to the current M-MTEB leaderboard on Monday?

By integrating, you mean combining llama-embed-nemotron-8b and omni-embed-nemotron-3b? We don't have this plan as of now

@Samoed

Samoed commented Oct 18, 2025

Member

do you mean v2 of M-MTEB?

I mean this library

What will happen to the current M-MTEB leaderboard on Monday?

Nothing, it will be unchanged

By integrating, you mean combining llama-embed-nemotron-8b and omni-embed-nemotron-3b

No, add it as a separate model, because omni is multimodal, while nemotron is text-only

@ybabakhin

Contributor Author

I've found the change log here: https://embeddings-benchmark.github.io/mteb/whats_new/, looks nice!

@Samoed Which existing/upcoming Leaderboards would you suggest for the Omni model?

@Samoed

Samoed commented Oct 18, 2025

Member

I created issue #3411 for the omni model, where we can continue the discussion. I will update this PR a bit to align with v2

@ybabakhin

Contributor Author

I will update this PR a bit to align with v2

Thanks! I added a few changes + lint

@Samoed Do you have an ETA for when this model can make it to the MMTEB Leaderboard? Do we have to wait for the 2.0.0 release, or can it be published earlier?

@Samoed

Samoed commented Oct 18, 2025

Member

If you can wait a bit, it’ll be easier to add the model to v2, and it will appear on the leaderboard on Monday with the release of the second version.

@Samoed Samoed changed the base branch from main to v2.0.0 October 18, 2025 09:00
@Samoed

Samoed commented Oct 18, 2025

Member

@ybabakhin I've aligned your model with v2 version and added your name to contacts as part of #3399. Can you try to run this implementation?

@Samoed Samoed added the "new model" label (Questions related to adding a new model to the benchmark) Oct 18, 2025
@ybabakhin

Contributor Author

@Samoed I added a small fix to make it work with v2.0.0. batch_size is still being passed in kwargs, even though DataLoader is provided explicitly now.

New eval code works fine:

import mteb

model_name = "nvidia/llama-embed-nemotron-8b"

model = mteb.get_model(model_name)
tasks = mteb.get_tasks(tasks=["HagridRetrieval"])

mteb.evaluate(
    model,
    tasks,
    encode_kwargs={"batch_size": 4},
)

I'm only getting TOKENIZERS_PARALLELISM warnings:

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
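As the warning itself suggests, it can be silenced by setting the environment variable before any tokenizer is used. A minimal sketch, assuming it is placed at the very top of the evaluation script, ahead of the `import mteb` line:

```python
import os

# Set before tokenizers are first used, so forked DataLoader workers
# do not trigger the parallelism warning. setdefault keeps any value
# the user has already exported in the shell.
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
```

Using `"false"` disables the tokenizer's internal parallelism; whether that has a noticeable effect on throughput here was not measured in this thread.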

I will run it on more tasks to check if there are any discrepancies. Shall I also update a model_meta.json to a new 2.0.0 format in embeddings-benchmark/results#302?

@Samoed

Samoed commented Oct 18, 2025

Member

batch_size is still being passed in kwargs, even though DataLoader is provided explicitly now.

Yes, this is for models that wouldn't use dataloaders directly (e.g. sentence transformers).

Shall I also update a model_meta.json to a new 2.0.0 format

That would be nice, but this is minor
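For a wrapper that consumes the provided DataLoader directly, the extra `batch_size` entry in the keyword arguments can simply be dropped. The sketch below is a simplified stand-in, not the mteb v2 encoder interface: `encode` is a hypothetical function, a plain iterable of batches replaces the DataLoader, and string lengths replace real embeddings.

```python
from typing import Any, Iterable, List


def encode(batches: Iterable[List[str]], **kwargs: Any) -> List[int]:
    """Hypothetical encoder that iterates over pre-batched inputs."""
    # batch_size may still arrive through encode_kwargs; the loader has
    # already fixed the batching, so discard it instead of forwarding it.
    kwargs.pop("batch_size", None)

    outputs: List[int] = []
    for batch in batches:
        # Placeholder for the real forward pass: one value per input text.
        outputs.extend(len(text) for text in batch)
    return outputs
```

Models that do their own batching (the sentence transformers case mentioned above) would instead read `batch_size` from the keyword arguments and pass it on.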

@ybabakhin

Contributor Author

@Samoed some tests are failing, but I don't think it is related to the changes in this PR

@Samoed

Samoed commented Oct 18, 2025

Member

Yes, I see. This is a flaky test that we’re currently working to fix.

Comment thread mteb/models/model_implementations/nvidia_models.py
@ybabakhin

Contributor Author

@Samoed, @KennethEnevoldsen could you please merge this PR now? Also, is the v2.0.0 release still planned for tomorrow?

@Samoed

Samoed commented Oct 19, 2025

Member

Yes, it will be released tomorrow. This PR will be merged, and Kenneth will finish the review of the results

@Samoed Samoed merged commit d1ce6aa into embeddings-benchmark:v2.0.0 Oct 19, 2025
11 checks passed

Labels

new model (Questions related to adding a new model to the benchmark)

4 participants