Add LoTTE Benchmark to MTEB by agu18dec · Pull Request #2009 · embeddings-benchmark/mteb

agu18dec · 2025-02-07T09:52:29Z

Description:
This PR integrates the LoTTE (Long-Tail Topic-stratified Evaluation for IR) benchmark into MTEB. LoTTE consists of domain-specific retrieval tasks derived from StackExchange and GooAQ, evaluating models on natural, information-seeking queries in long-tail topics.

Closes #1836

Changes:

Added LoTTERetrieval task under mteb/tasks/Retrieval/eng/LoTTE_Retrieval.py.
Implemented dataset loading, transformation, and evaluation logic.
Registered LoTTE as a benchmark in benchmarks.py.
Updated metadata to ensure compliance with TaskMetadata.

Testing:
✅ Verified that LoTTERetrieval runs successfully with sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 and intfloat/multilingual-e5-small.
✅ Ensured scores are neither trivial (near 100%) nor random (near 0%).
✅ Passed make test and make lint.

Notes:

Dataset is hosted on Hugging Face, using revision "main".
Benchmark supports "dev" and "test" splits, with "success@5" as the main metric.
Looking forward to feedback!
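
For reviewers unfamiliar with the metric: Success@5 counts a query as solved when at least one relevant passage appears among its top five retrieved results. The sketch below is independent of MTEB's own evaluator; the function name and the input shapes are assumptions made for illustration only.

```python
def success_at_k(
    ranked: dict[str, list[str]],
    relevant: dict[str, set[str]],
    k: int = 5,
) -> float:
    """Fraction of queries with at least one relevant doc in the top k."""
    if not ranked:
        return 0.0
    hits = 0
    for qid, doc_ids in ranked.items():
        # A query counts as a hit if any relevant id is among its first k results.
        if relevant.get(qid, set()) & set(doc_ids[:k]):
            hits += 1
    return hits / len(ranked)


# Toy data: q1 has its relevant doc at rank 2, q2 only at rank 6.
ranked = {
    "q1": ["d9", "d2", "d7"],
    "q2": ["d4", "d5", "d6", "d8", "d3", "d1"],
}
relevant = {"q1": {"d2"}, "q2": {"d1"}}
score = success_at_k(ranked, relevant, k=5)
```

With this toy data only q1 is a hit at k=5, so half of the queries succeed.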

Updates:

Moved LoTTERetrieval.py to mteb/tasks/Retrieval/eng/
Used dataset_transform() instead of load_data()
Ensured eval_splits=["test"]
Fixed eval_langs while keeping domain-specific mappings

KennethEnevoldsen

Thanks for the PR. Added a few suggestions.

KennethEnevoldsen · 2025-02-07T11:13:36Z

 )
+
+
+MTEB_LOTTE = Benchmark(


This will just appear as an empty leaderboard. We would probably want at least some models evaluated on it before adding it to the leaderboard (otherwise it will seem like a bug).

I tried evaluating some models on Colab, but the dataset is huge, so it takes a long time to load and evaluate. I tried chunking it down to check that it's working, but I'm not able to run the entire benchmark.

Is this fine in the current version?

Hmm, should we maybe consider downsampling the dataset then? Not really worth adding a benchmark if people can't run it?
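
One way to downsample a retrieval set without breaking it is to sample queries first, keep every passage those queries are judged relevant to, and pad the corpus with random distractors. A rough sketch under those assumptions; `downsample_retrieval` and its arguments are hypothetical and not part of MTEB.

```python
import random


def downsample_retrieval(queries, corpus, relevant_docs, n_queries, n_distractors, seed=42):
    """Keep n_queries queries, all their relevant docs, and some random distractors."""
    rng = random.Random(seed)
    # Sort before sampling so the result does not depend on dict ordering.
    kept_qids = rng.sample(sorted(queries), min(n_queries, len(queries)))

    small_queries = {qid: queries[qid] for qid in kept_qids}
    small_qrels = {qid: relevant_docs[qid] for qid in kept_qids if qid in relevant_docs}

    # Every judged-relevant doc must stay, otherwise the scores become meaningless.
    needed = {doc_id for docs in small_qrels.values() for doc_id in docs}
    pool = sorted(set(corpus) - needed)
    distractors = rng.sample(pool, min(n_distractors, len(pool)))

    small_corpus = {
        doc_id: corpus[doc_id]
        for doc_id in sorted(needed) + distractors
        if doc_id in corpus
    }
    return small_queries, small_corpus, small_qrels


# Toy data: 10 queries, 100 passages, one relevant passage per query.
queries = {f"q{i}": f"question {i}" for i in range(10)}
corpus = {f"d{i}": f"passage {i}" for i in range(100)}
relevant_docs = {f"q{i}": {f"d{i}": 1} for i in range(10)}

q, c, r = downsample_retrieval(queries, corpus, relevant_docs, n_queries=3, n_distractors=20)
```

Note that scores on a reduced corpus are not comparable with published LoTTE numbers, since a smaller corpus makes retrieval easier.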

This is not a very large dataset. It has 5 splits of approximately 30 MB each, as available at https://huggingface.co/datasets/colbertv2/lotte. However, the data may be different in the tar file, because its size is approximately 2 GB.

The dataset is fine, but sentence-transformers takes too long to encode it, even on Colab. Can you run this on your end and let me know if it works?

@KennethEnevoldsen I've added example results for this task in embeddings-benchmark/results#118, but I'm not sure we want to add this task as a benchmark.

Samoed · 2025-02-07T22:16:50Z

When I try to run your task, I get:

  File "/home/samoed/Desktop/mteb/orig_mteb/mteb/abstasks/AbsTaskRetrieval.py", line 130, in _load_corpus
    corpus_ds = load_dataset(
  File "/home/samoed/Desktop/mteb/orig_mteb/.venv/lib/python3.10/site-packages/datasets/load.py", line 2606, in load_dataset
    builder_instance = load_dataset_builder(
  File "/home/samoed/Desktop/mteb/orig_mteb/.venv/lib/python3.10/site-packages/datasets/load.py", line 2314, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
  File "/home/samoed/Desktop/mteb/orig_mteb/.venv/lib/python3.10/site-packages/datasets/builder.py", line 374, in __init__
    self.config, self.config_id = self._create_builder_config(
  File "/home/samoed/Desktop/mteb/orig_mteb/.venv/lib/python3.10/site-packages/datasets/builder.py", line 601, in _create_builder_config
    raise ValueError(
ValueError: BuilderConfig 'corpus' not found. Available: ['lifestyle', 'pooled', 'recreation', 'science', 'technology', 'writing']

Samoed · 2025-02-08T10:08:25Z

Can you please run a check to make sure everything is working correctly? Because the data is not loading at the moment.

>>> mteb.get_task("LoTTE").load_data()
{'queries': {'test': {}}, 'corpus': {'test': {}}, 'relevant_docs': {'test': {}}}
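
An empty result like the one above is easy to miss because `load_data()` returns without raising. A small check along these lines could be run before requesting a review; `find_empty_parts` is a hypothetical helper, and the `{part: {split: {...}}}` layout is assumed from the output shown.

```python
def find_empty_parts(data: dict) -> list[str]:
    """Return 'part/split' labels whose payload is empty."""
    empty = []
    for part in ("queries", "corpus", "relevant_docs"):
        splits = data.get(part, {})
        if not splits:
            # The whole container is missing or has no splits at all.
            empty.append(part)
            continue
        for split, items in splits.items():
            if not items:
                empty.append(f"{part}/{split}")
    return empty


broken = {"queries": {"test": {}}, "corpus": {"test": {}}, "relevant_docs": {"test": {}}}
problems = find_empty_parts(broken)
```

For the structure printed above, every part of the `test` split is reported as empty.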

Samoed

Also, when I try to run it, I get this error:

TypeError: can only concatenate str (not "dict") to str

Can you try a run to validate that your data is loading correctly? If you are low on RAM, you can do this on Kaggle/Colab.

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

KennethEnevoldsen · 2025-02-10T13:59:37Z

I will unsubscribe from this; it seems to be in good hands with @Samoed. I will happily do a review once @Samoed is happy.

Samoed · 2025-02-11T08:31:19Z

+        merged_queries = {}
+        merged_corpus = {}
+        merged_relevant = {}
+        for domain in self.queries:
+            if split in self.queries[domain]:
+                merged_queries.update(self.queries[domain][split])
+            for key, value in self.queries[domain].items():
+                if key.startswith(split) and key != split:
+                    merged_queries.update(value)
+        for domain in self.corpus:
+            if split in self.corpus[domain]:
+                merged_corpus.update(self.corpus[domain][split])
+        for domain in self.relevant_docs:
+            if split in self.relevant_docs[domain]:
+                merged_relevant.update(self.relevant_docs[domain][split])
+            for key, value in self.relevant_docs[domain].items():
+                if key.startswith(split) and key != split:
+                    merged_relevant.update(value)


I don't think we need to merge all queries, corpus, etc. I think you should leave the corpus and queries per domain.

Okay, fixed in the latest commit.

Samoed · 2025-02-11T21:29:44Z

+                if corpus_file.exists():
+                    with open(corpus_file, encoding="utf-8") as f:
+                        self.corpus[domain][split] = dict(
+                            line.strip().split("\t", 1) for line in f if line.strip()
+                        )
+                elif metadata_file.exists():
+                    corpus = {}
+                    with open(metadata_file, encoding="utf-8") as f:
+                        for line in f:
+                            try:
+                                obj = json.loads(line)
+                                doc_id = obj.get("pid") or obj.get("id")
+                                text = obj.get("text") or obj.get("body")
+                                if doc_id and text:
+                                    corpus[doc_id] = text
+                            except Exception as e:
+                                logger.error(f"Error parsing {metadata_file}: {e}")
+                    self.corpus[domain][split] = corpus
+                else:
+                    logger.warning(f"No corpus file found for {domain} {split}.")


I tried to run the task, but an error occurred.

task.queries["writing"].keys()        # dict_keys(['test', 'test.forum'])
task.relevant_docs["writing"].keys()  # dict_keys(['test', 'test.forum'])
task.corpus["writing"].keys()         # dict_keys(['test'])

MTEB expects all data for a task run to live under one split, but right now the corpus uses a different naming scheme from the queries and relevant docs. I think we should change the domains to writing.search and writing.forum to align with the MTEB approach. What do you think?
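
The mismatch shows up when the split keys of the three containers for a domain are compared. A minimal sketch; `split_mismatches` is a hypothetical helper, and the sample dictionaries only mirror the keys printed above.

```python
def split_mismatches(queries: dict, corpus: dict, relevant_docs: dict) -> dict[str, set[str]]:
    """For each container, list the split names it lacks compared with the others."""
    parts = {"queries": queries, "corpus": corpus, "relevant_docs": relevant_docs}
    all_splits = set().union(*(set(p) for p in parts.values()))
    return {
        name: all_splits - set(part)
        for name, part in parts.items()
        if all_splits - set(part)
    }


# Same key layout as reported for the "writing" domain; payloads are left empty.
writing_queries = {"test": {}, "test.forum": {}}
writing_qrels = {"test": {}, "test.forum": {}}
writing_corpus = {"test": {}}

missing = split_mismatches(writing_queries, writing_corpus, writing_qrels)
```

Here only the corpus is flagged, because it has no `test.forum` entry.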

That's why I had merged them earlier. We now load data per domain without merging the "search" and "forum" items into one key. Instead, for each domain we create separate sub-dictionaries for "search" and "forum" queries (and similarly for qrels).

{
  "corpus": { "writing": { ... }, "recreation": { ... }, ... },
  "queries": {
    "writing": { "search": { ... }, "forum": { ... } },
    "recreation": { "search": { ... }, "forum": { ... } },
    ...
  },

You should create them like this:

corpus:
  "writing_search": {...},
  "writing_forum": {...},
  "recreation_search": { ... },
  "recreation_forum": { ... },

queries:
  "writing_search": {...},
  "writing_forum": {...},
  "recreation_search": { ... },
  "recreation_forum": { ... },

relevant_docs:
  "writing_search": {...},
  "writing_forum": {...},
  "recreation_search": { ... },
  "recreation_forum": { ... },

because mteb can't handle nested dicts
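
The flattening suggested here amounts to turning each (domain, query type) pair into one top-level subset key. A sketch of that transformation, assuming the nested layout quoted earlier in the thread and that both query types of a domain are scored against the same domain corpus; the function name is made up for illustration, and the split level is left out, as in the comment.

```python
def flatten_subsets(corpus: dict, queries: dict, relevant_docs: dict):
    """Turn {domain: {kind: ...}} into {'domain_kind': ...} for all three containers."""
    flat_corpus, flat_queries, flat_qrels = {}, {}, {}
    for domain, by_kind in queries.items():
        for kind, kind_queries in by_kind.items():
            key = f"{domain}_{kind}"
            flat_queries[key] = kind_queries
            flat_qrels[key] = relevant_docs[domain][kind]
            # Assumption: search and forum queries share the domain's passages.
            flat_corpus[key] = corpus[domain]
    return flat_corpus, flat_queries, flat_qrels


corpus = {"writing": {"d1": "passage text"}}
queries = {"writing": {"search": {"q1": "a search query"}, "forum": {"q2": "a forum question"}}}
relevant_docs = {"writing": {"search": {"q1": {"d1": 1}}, "forum": {"q2": {"d1": 1}}}}

flat_corpus, flat_queries, flat_qrels = flatten_subsets(corpus, queries, relevant_docs)
```

After this step all three containers share the same keys (`writing_search`, `writing_forum`), so no container needs a nested lookup.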

Okay, updated that.

Can you check if this works?

Samoed · 2025-02-12T22:46:07Z

I think this is working now. I will upload this dataset to Hugging Face and merge it into the v2 branch, because that branch has utilities for uploading/downloading multilingual datasets. Thank you for your work!

Samoed · 2025-02-14T13:57:09Z

@KennethEnevoldsen Should we keep LoTTE as a benchmark, or should we remove it? This PR will be merged into v2.

Samoed · 2025-02-14T20:36:12Z

I'll merge it as is; in the future we might want to remove the benchmark.

KennethEnevoldsen · 2025-02-17T10:51:21Z

@Samoed I would probably remove it for now (to avoid issues with the leaderboard). It is easy to add later.

Add LoTTE Benchmark to MTEB

dea866b

Samoed reviewed Feb 7, 2025

Comment thread mteb/tasks/Retrieval/lotte/LoTTERetrieval.py Outdated

Samoed reviewed Feb 7, 2025

Comment thread mteb/tasks/Retrieval/lotte/LoTTERetrieval.py Outdated

Samoed reviewed Feb 7, 2025

Comment thread mteb/tasks/Retrieval/lotte/LoTTERetrieval.py Outdated

KennethEnevoldsen approved these changes Feb 7, 2025

incorporated PR feedback

d666301

Samoed reviewed Feb 7, 2025

Comment thread mteb/benchmarks/benchmarks.py Outdated

incorporating pr feedback across the board

71a2c5c

agu18dec requested a review from KennethEnevoldsen February 8, 2025 08:58

loads data correctly

b0ecd06

Samoed requested changes Feb 9, 2025

Comment thread mteb/tasks/Retrieval/eng/LoTTERetrieval.py Outdated

Comment thread mteb/tasks/Retrieval/eng/LoTTERetrieval.py Outdated

agu18dec and others added 3 commits February 9, 2025 01:46

Update mteb/tasks/Retrieval/eng/LoTTERetrieval.py

7c60444

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

Custom load_data for LoTTE: iterate domains then splits; auto-download dataset if missing

9a60678

Merge branch 'add-lotte-task' of https://github.com/agu18dec/mteb into add-lotte-task

45de3f5

Samoed reviewed Feb 10, 2025

Comment thread mteb/tasks/Retrieval/eng/LoTTERetrieval.py Outdated

Merge branch 'main' of https://github.com/agu18dec/mteb into add-lotte-task

2d05af9

KennethEnevoldsen removed their request for review February 10, 2025 14:00

agu18dec added 2 commits February 10, 2025 23:54

bug fixes

ef20316

all make tests pass

af744f4

Samoed reviewed Feb 11, 2025

no merging

e1763fc

Samoed reviewed Feb 11, 2025

agu18dec and others added 4 commits February 11, 2025 16:03

ensuring tasks can be run

bc8c216

fixed structure

77bfb5e

refactor task loading

26e6ebe

add splits everywhere

47064aa

Samoed changed the base branch from main to v2.0.0 February 13, 2025 17:24

Samoed added 5 commits February 13, 2025 20:27

Merge branch 'refs/heads/v2.0.0' into add-lotte-task

c599037

# Conflicts:
#	mteb/tasks/Retrieval/__init__.py
#	mteb/tasks/Retrieval/eng/__init__.py

lint

3f817e5

fix imports

c9bb58b

temporary allow url

3eb082a

upload lotte to mteb

fb8a7b0

Samoed reviewed Feb 13, 2025

Comment thread mteb/abstasks/TaskMetadata.py

fix model2vec

f34348f

Samoed mentioned this pull request Feb 13, 2025

add LoTTe result for potion-base-2M embeddings-benchmark/results#118

Merged

2 tasks

Samoed approved these changes Feb 13, 2025

remove abstract from citation

aa837e0

Samoed added the v2 label Feb 14, 2025

Samoed merged commit ca60b82 into embeddings-benchmark:v2.0.0 Feb 14, 2025

3 participants