Add LoTTE Benchmark to MTEB #2009

Merged
Samoed merged 22 commits into embeddings-benchmark:v2.0.0 from agu18dec:add-lotte-task on Feb 14, 2025

Conversation

agu18dec commented Feb 7, 2025
Contributor

Description:
This PR integrates the LoTTE (Long-Tail Topic-stratified Evaluation for IR) benchmark into MTEB. LoTTE consists of domain-specific retrieval tasks derived from StackExchange and GooAQ, evaluating models on natural, information-seeking queries in long-tail topics.

Closes #1836

Changes:

  • Added LoTTERetrieval task under mteb/tasks/Retrieval/eng/LoTTERetrieval.py.
  • Implemented dataset loading, transformation, and evaluation logic.
  • Registered LoTTE as a benchmark in benchmarks.py.
  • Updated metadata to ensure compliance with TaskMetadata.

Testing:
✅ Verified that LoTTERetrieval runs successfully with sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 and intfloat/multilingual-e5-small.
✅ Ensured scores are neither trivial (near 100%) nor random (near 0%).
✅ Passed make test and make lint.

Notes:

  • Dataset is hosted on Hugging Face, using revision "main".
  • Benchmark supports "dev" and "test" splits, with "success@5" as the main metric.
  • Looking forward to feedback!
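
For readers unfamiliar with the main metric named in the notes above: success@5 is commonly defined as the fraction of queries that have at least one relevant document among the top five retrieved results. The following is a minimal sketch of that definition, not MTEB's implementation; the function name `success_at_k` and the input shapes (ranked document ID lists and sets of relevant IDs) are assumptions made for illustration.

```python
def success_at_k(
    ranked: dict[str, list[str]],
    qrels: dict[str, set[str]],
    k: int = 5,
) -> float:
    """Fraction of queries with at least one relevant doc in the top k."""
    if not qrels:
        return 0.0
    hits = 0
    for query_id, relevant in qrels.items():
        # A query with no ranking at all counts as a miss.
        top_k = ranked.get(query_id, [])[:k]
        if any(doc_id in relevant for doc_id in top_k):
            hits += 1
    return hits / len(qrels)


# Toy data: q1 has a relevant doc at rank 2, q2 has none in its ranking.
ranked = {"q1": ["d3", "d1", "d7"], "q2": ["d4", "d5"]}
qrels = {"q1": {"d1"}, "q2": {"d9"}}
print(success_at_k(ranked, qrels))  # 0.5
```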

Updates:

  • Moved LoTTERetrieval.py to mteb/tasks/Retrieval/eng/
  • Used dataset_transform() instead of load_data()
  • Ensured eval_splits=["test"]
  • Fixed eval_langs while keeping domain-specific mappings

KennethEnevoldsen left a comment
Contributor

Thanks for the PR. Added a few suggestions.

)


MTEB_LOTTE = Benchmark(

Contributor

This will just appear as an empty leaderboard. We would probably want at least some models evaluated on it before adding it to the leaderboard (otherwise it will seem like a bug).

Contributor Author

I tried evaluating some models on Colab, but the dataset is huge, so it takes a long time to load and to evaluate. I tried chunking it down to make sure it's working, but I'm not able to run the entire benchmark.

Contributor Author

Is this fine in the current version?

Contributor

Hmm, should we maybe consider downsampling the dataset then? Not really worth adding a benchmark if people can't run it?
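
One way the downsampling suggested here could work is sketched below. It is an illustration under stated assumptions, not what the PR ended up doing: `downsample` is a hypothetical helper, the inputs are plain dicts (query ID to text, document ID to text, query ID to a set of relevant document IDs), and the sample sizes are arbitrary. The idea is to sample queries, keep every document that is relevant to a kept query, and pad the corpus with random non-relevant documents so retrieval does not become trivial.

```python
import random


def downsample(
    queries: dict[str, str],
    corpus: dict[str, str],
    qrels: dict[str, set[str]],
    n_queries: int,
    n_distractors: int,
    seed: int = 42,
) -> tuple[dict[str, str], dict[str, str], dict[str, set[str]]]:
    """Sample queries, keep their relevant docs, add random distractors."""
    rng = random.Random(seed)
    # Sort first so the sample does not depend on dict insertion order.
    query_ids = sorted(queries)
    kept_queries = rng.sample(query_ids, min(n_queries, len(query_ids)))

    # Every relevant document of a kept query must stay in the corpus.
    relevant_ids: set[str] = set()
    for query_id in kept_queries:
        relevant_ids |= qrels.get(query_id, set())

    # Pad with documents that are not relevant to any kept query.
    other_ids = sorted(set(corpus) - relevant_ids)
    distractors = rng.sample(other_ids, min(n_distractors, len(other_ids)))

    kept_docs = (relevant_ids & set(corpus)) | set(distractors)
    return (
        {q: queries[q] for q in kept_queries},
        {d: corpus[d] for d in sorted(kept_docs)},
        {q: qrels[q] for q in kept_queries if q in qrels},
    )
```

Fixing the seed keeps the subsample reproducible, which matters if scores from different models are to be compared on it.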

Samoed commented Feb 10, 2025
Member

This is not a very large dataset. It has 5 splits of approximately 30 MB each, as available at https://huggingface.co/datasets/colbertv2/lotte. However, the data may be different in the tar file, because its size is approximately 2 GB.

Contributor Author

The dataset is fine, but sentence-transformers takes too long to encode it, even on Colab. Can you run this on your end and let me know if it works?

Member

@KennethEnevoldsen I've added example results for this task in embeddings-benchmark/results#118, but I'm not sure we want to add this task as a benchmark.

Samoed commented Feb 7, 2025
Member

When I try to run your task, I get:

  File "/home/samoed/Desktop/mteb/orig_mteb/mteb/abstasks/AbsTaskRetrieval.py", line 130, in _load_corpus
    corpus_ds = load_dataset(
  File "/home/samoed/Desktop/mteb/orig_mteb/.venv/lib/python3.10/site-packages/datasets/load.py", line 2606, in load_dataset
    builder_instance = load_dataset_builder(
  File "/home/samoed/Desktop/mteb/orig_mteb/.venv/lib/python3.10/site-packages/datasets/load.py", line 2314, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
  File "/home/samoed/Desktop/mteb/orig_mteb/.venv/lib/python3.10/site-packages/datasets/builder.py", line 374, in __init__
    self.config, self.config_id = self._create_builder_config(
  File "/home/samoed/Desktop/mteb/orig_mteb/.venv/lib/python3.10/site-packages/datasets/builder.py", line 601, in _create_builder_config
    raise ValueError(
ValueError: BuilderConfig 'corpus' not found. Available: ['lifestyle', 'pooled', 'recreation', 'science', 'technology', 'writing']

Samoed commented Feb 8, 2025
Member

Can you please run a check to make sure everything is working correctly? Because the data is not loading at the moment.

>>> mteb.get_task("LoTTE").load_data()
{'queries': {'test': {}}, 'corpus': {'test': {}}, 'relevant_docs': {'test': {}}}
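
The output above shows every split loading as an empty dict. A quick check along the following lines could catch that before an evaluation run. It is a sketch only: `find_empty_splits` is a hypothetical helper, and it assumes the `{'queries': {split: {...}}, ...}` shape printed above.

```python
def find_empty_splits(data: dict[str, dict[str, dict]]) -> list[str]:
    """Return 'part/split' labels whose mapping is empty."""
    empty = []
    for part, splits in data.items():
        for split, items in splits.items():
            if not items:
                empty.append(f"{part}/{split}")
    return empty


# The structure printed in the comment above.
loaded = {
    "queries": {"test": {}},
    "corpus": {"test": {}},
    "relevant_docs": {"test": {}},
}
print(find_empty_splits(loaded))
# ['queries/test', 'corpus/test', 'relevant_docs/test']
```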

Samoed left a comment
Member

Also, when I try to run it, I get this error:

TypeError: can only concatenate str (not "dict") to str

Can you try starting a run to validate that your data is loading correctly? If you are low on RAM, you can do this on Kaggle or Colab.

KennethEnevoldsen commented
Contributor

I will unsubscribe from this; it seems to be in good hands with @Samoed. I will happily do a review once @Samoed is happy.

@KennethEnevoldsen KennethEnevoldsen removed their request for review February 10, 2025 14:00
Comment on lines +183 to +200
merged_queries = {}
merged_corpus = {}
merged_relevant = {}
for domain in self.queries:
    if split in self.queries[domain]:
        merged_queries.update(self.queries[domain][split])
    for key, value in self.queries[domain].items():
        if key.startswith(split) and key != split:
            merged_queries.update(value)
for domain in self.corpus:
    if split in self.corpus[domain]:
        merged_corpus.update(self.corpus[domain][split])
for domain in self.relevant_docs:
    if split in self.relevant_docs[domain]:
        merged_relevant.update(self.relevant_docs[domain][split])
    for key, value in self.relevant_docs[domain].items():
        if key.startswith(split) and key != split:
            merged_relevant.update(value)

Member

I don't think we need to merge all queries, corpus, etc. I think you should leave the corpus and queries per domain.

Contributor Author

okay fixed in latest

Comment on lines +145 to +164
if corpus_file.exists():
    with open(corpus_file, encoding="utf-8") as f:
        self.corpus[domain][split] = dict(
            line.strip().split("\t", 1) for line in f if line.strip()
        )
elif metadata_file.exists():
    corpus = {}
    with open(metadata_file, encoding="utf-8") as f:
        for line in f:
            try:
                obj = json.loads(line)
                doc_id = obj.get("pid") or obj.get("id")
                text = obj.get("text") or obj.get("body")
                if doc_id and text:
                    corpus[doc_id] = text
            except Exception as e:
                logger.error(f"Error parsing {metadata_file}: {e}")
    self.corpus[domain][split] = corpus
else:
    logger.warning(f"No corpus file found for {domain} {split}.")
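
A standalone variant of the loading logic quoted above might look like the sketch below. It is not the code that was merged. `load_corpus` is a hypothetical function name, and converting IDs to strings is an assumption about what the caller wants. Two behaviours differ from the snippet on purpose: a TSV line without a tab is skipped with a warning (in the snippet, `dict(...)` raises `ValueError` on such a line), and the ID is compared against `None` (in the snippet, `if doc_id and text` drops a document whose ID is the integer 0).

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def load_corpus(corpus_file: Path, metadata_file: Path) -> dict[str, str]:
    """Load {doc_id: text} from a TSV file, else from a JSONL fallback."""
    corpus: dict[str, str] = {}
    if corpus_file.exists():
        with open(corpus_file, encoding="utf-8") as f:
            for line_no, line in enumerate(f, start=1):
                line = line.strip()
                if not line:
                    continue
                doc_id, sep, text = line.partition("\t")
                if not sep:
                    # No tab on this line: skip it instead of failing the load.
                    logger.warning("%s:%d has no tab", corpus_file, line_no)
                    continue
                corpus[doc_id] = text
    elif metadata_file.exists():
        with open(metadata_file, encoding="utf-8") as f:
            for line_no, line in enumerate(f, start=1):
                if not line.strip():
                    continue
                try:
                    obj = json.loads(line)
                except json.JSONDecodeError as e:
                    logger.error("%s:%d: %s", metadata_file, line_no, e)
                    continue
                doc_id = obj.get("pid", obj.get("id"))
                text = obj.get("text") or obj.get("body")
                # Compare against None so a numeric ID of 0 is kept.
                if doc_id is not None and text:
                    corpus[str(doc_id)] = text
    else:
        logger.warning("No corpus file: %s / %s", corpus_file, metadata_file)
    return corpus
```

Returning the dict instead of assigning to `self.corpus[domain][split]` keeps the function testable on its own; the caller decides where to store the result.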

Member

I tried to run the task, but an error occurred.

task.queries["writing"].keys() # dict_keys(['test', 'test.forum'])
task.relevant_docs["writing"].keys() # dict_keys(['test', 'test.forum'])
task.corpus["writing"].keys() # dict_keys(['test'])

MTEB expects all data for a task run to live under the same split key, but for now the corpus uses a different naming scheme than the queries and relevant docs. I think we should change the domains to writing.search and writing.forum to align with the MTEB approach. What do you think?
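
The mismatch described here can be detected mechanically. The sketch below is an illustration, not part of the PR: `splits_missing_from_corpus` is a hypothetical helper that lists, per domain, the split keys used by queries or relevant docs that have no counterpart in the corpus.

```python
def splits_missing_from_corpus(
    queries: dict[str, dict],
    relevant_docs: dict[str, dict],
    corpus: dict[str, dict],
) -> dict[str, list[str]]:
    """Per domain, list split keys used by queries/qrels but absent in corpus."""
    missing = {}
    for domain in queries:
        needed = set(queries[domain]) | set(relevant_docs.get(domain, {}))
        absent = sorted(needed - set(corpus.get(domain, {})))
        if absent:
            missing[domain] = absent
    return missing


# Key layout reported in the comment above (contents elided as empty dicts).
queries = {"writing": {"test": {}, "test.forum": {}}}
relevant_docs = {"writing": {"test": {}, "test.forum": {}}}
corpus = {"writing": {"test": {}}}
print(splits_missing_from_corpus(queries, relevant_docs, corpus))
# {'writing': ['test.forum']}
```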

Contributor Author

That's why I had merged them earlier. We now load data per domain without merging the "search" and "forum" items into one key. Instead, for each domain we create separate sub-dictionaries for "search" and "forum" queries (and similarly for qrels).

{
  "corpus": { "writing": { ... }, "recreation": { ... }, ... },
  "queries": {
    "writing": { "search": { ... }, "forum": { ... } },
    "recreation": { "search": { ... }, "forum": { ... } },
    ...
  },

Samoed commented Feb 12, 2025
Member

You should create them like this:

  • corpus: "writing_search": {...}, "writing_forum": {...}, "recreation_search": { ... }, "recreation_forum": { ... },
  • queries: "writing_search": {...}, "writing_forum": {...}, "recreation_search": { ... }, "recreation_forum": { ... },
  • relevant_docs: "writing_search": {...}, "writing_forum": {...}, "recreation_search": { ... }, "recreation_forum": { ... },
    because mteb can't handle nested dicts
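
Getting from the nested per-domain structure to the flat layout requested above could be done roughly as follows. This is a sketch under assumptions, not the merged implementation: `flatten_by_subset` is a hypothetical helper, the input shapes follow the nested structure described earlier in this thread, and it assumes the search and forum queries of a domain are evaluated against the same per-domain corpus.

```python
def flatten_by_subset(
    corpus: dict[str, dict],
    queries: dict[str, dict[str, dict]],
    relevant_docs: dict[str, dict[str, dict]],
) -> tuple[dict[str, dict], dict[str, dict], dict[str, dict]]:
    """Turn {domain: {kind: data}} into {"<domain>_<kind>": data}."""
    flat_corpus, flat_queries, flat_qrels = {}, {}, {}
    for domain, kinds in queries.items():
        for kind, domain_queries in kinds.items():
            subset = f"{domain}_{kind}"
            flat_queries[subset] = domain_queries
            flat_qrels[subset] = relevant_docs[domain][kind]
            # Assumption: both query kinds of a domain share one corpus,
            # so each subset points at the same per-domain corpus dict.
            flat_corpus[subset] = corpus[domain]
    return flat_corpus, flat_queries, flat_qrels


corpus = {"writing": {"d1": "some passage"}}
queries = {"writing": {"search": {"q1": "a"}, "forum": {"q2": "b"}}}
relevant_docs = {
    "writing": {"search": {"q1": {"d1": 1}}, "forum": {"q2": {"d1": 1}}}
}
flat_corpus, flat_queries, flat_qrels = flatten_by_subset(
    corpus, queries, relevant_docs
)
print(sorted(flat_queries))  # ['writing_forum', 'writing_search']
```

After flattening, all three mappings share exactly the same subset keys, which is the property the review comment asks for.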

Contributor Author

okay updated that

Contributor Author

Can you check if this works?

Samoed commented Feb 12, 2025
Member

I think this is working now. I will upload this dataset to Hugging Face and merge into the v2 branch, because it has utilities for uploading/downloading multilingual datasets. Thank you for your work!

@Samoed Samoed changed the base branch from main to v2.0.0 February 13, 2025 17:24
@Samoed Samoed added the v2 label Feb 14, 2025
Samoed commented Feb 14, 2025
Member

@KennethEnevoldsen Should we keep LoTTE as a benchmark, or should we remove it? This PR will be merged into v2.

Samoed commented Feb 14, 2025
Member

I'll merge it as is; in the future we might want to remove the benchmark.

@Samoed Samoed merged commit ca60b82 into embeddings-benchmark:v2.0.0 Feb 14, 2025
KennethEnevoldsen commented
Contributor

@Samoed I would probably remove it for now (to avoid issues with the leaderboard). It is easy to add back later.
