Add LoTTE Benchmark to MTEB #2009
KennethEnevoldsen
left a comment
Thanks for the PR. Added a few suggestions.
)


MTEB_LOTTE = Benchmark(
This will just appear as an empty leaderboard. We would probably want at least some models evaluated on it before adding it to the leaderboard (otherwise it will seem like a bug).
I tried evaluating some models on Colab, but the dataset is huge, so it takes a long time to load and to evaluate. I tried chunking it down to make sure it works, but I'm not able to run the entire benchmark.
Is this fine in the current version?
Hmm, should we maybe consider downsampling the dataset then? Not really worth adding a benchmark if people can't run it?
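One way the downsampling idea could look is sketched below. This is not code from the PR: the function name `downsample_retrieval`, its parameters, and the assumption that `relevant_docs` maps each query id to a `{doc_id: score}` dictionary are all illustrative. The sketch keeps a random subset of queries, every document those queries are judged against, and a random sample of other documents as distractors.

```python
import random


def downsample_retrieval(queries, corpus, relevant_docs,
                         n_queries, n_distractors, seed=42):
    """Hypothetical helper: shrink a retrieval dataset without losing judged documents.

    Assumes queries and corpus are {id: text} dicts and relevant_docs is
    {query_id: {doc_id: score}}.
    """
    rng = random.Random(seed)

    # Sort ids before sampling so the result depends only on the seed.
    query_ids = sorted(qid for qid in queries if qid in relevant_docs)
    kept_qids = rng.sample(query_ids, min(n_queries, len(query_ids)))

    # Keep the relevance judgements of the sampled queries.
    small_qrels = {qid: dict(relevant_docs[qid]) for qid in kept_qids}
    needed = {doc_id for docs in small_qrels.values() for doc_id in docs}

    # Add random non-relevant documents so retrieval stays non-trivial.
    others = sorted(doc_id for doc_id in corpus if doc_id not in needed)
    distractors = rng.sample(others, min(n_distractors, len(others)))

    small_queries = {qid: queries[qid] for qid in kept_qids}
    small_corpus = {
        doc_id: corpus[doc_id]
        for doc_id in sorted(needed) + distractors
        if doc_id in corpus
    }
    return small_queries, small_corpus, small_qrels
```

Scores on a downsampled corpus are not comparable to scores on the full dataset, so any subset used for a benchmark would need to be fixed and published rather than sampled on each run.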
This is not a very large dataset. It has 5 splits of approximately 30 MB each, as available at https://huggingface.co/datasets/colbertv2/lotte. However, the data may be different in the tar file, because its size is approximately 2 GB.
The dataset is fine, but sentence-transformers takes too long to encode it, even on Colab. Can you run this on your end and let me know if it works?
@KennethEnevoldsen I've added example results for this task in embeddings-benchmark/results#118, but I'm not sure if we want to add this task as a benchmark.
When I try to run your task, I receive empty splits:
>>> mteb.get_task("LoTTE").load_data()
{'queries': {'test': {}}, 'corpus': {'test': {}}, 'relevant_docs': {'test': {}}}
Can you please run a check to make sure everything is working correctly? The data is not loading at the moment.
Samoed
left a comment
Also, when I try to run it, I receive an error:
TypeError: can only concatenate str (not "dict") to str
Can you try a run to validate that your data is loading correctly? If you are low on RAM, you can do this on Kaggle/Colab.
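A quick check of this kind could be as small as the sketch below. The helper name `empty_splits` is hypothetical, and the input shape simply mirrors the empty result reported earlier in the thread (`queries`, `corpus`, and `relevant_docs`, each keyed by split); it is not part of the MTEB API.

```python
def empty_splits(data):
    """Hypothetical helper: return (field, split) pairs whose dictionary is empty."""
    return [
        (field, split)
        for field, splits in data.items()
        for split, items in splits.items()
        if not items
    ]


# Shape taken from the empty output reported in this thread.
loaded = {
    "queries": {"test": {}},
    "corpus": {"test": {}},
    "relevant_docs": {"test": {}},
}
problems = empty_splits(loaded)
print(problems)  # every (field, split) pair is listed because nothing was loaded
```

An empty list from `empty_splits` would be a minimal sign that loading produced data; it says nothing about whether the data is correct.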
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
…d dataset if missing
…o add-lotte-task
merged_queries = {}
merged_corpus = {}
merged_relevant = {}
for domain in self.queries:
    if split in self.queries[domain]:
        merged_queries.update(self.queries[domain][split])
    for key, value in self.queries[domain].items():
        if key.startswith(split) and key != split:
            merged_queries.update(value)
for domain in self.corpus:
    if split in self.corpus[domain]:
        merged_corpus.update(self.corpus[domain][split])
for domain in self.relevant_docs:
    if split in self.relevant_docs[domain]:
        merged_relevant.update(self.relevant_docs[domain][split])
    for key, value in self.relevant_docs[domain].items():
        if key.startswith(split) and key != split:
            merged_relevant.update(value)
I don't think we need to merge all queries, corpus, etc. I think you should leave the corpus and queries per domain.
Okay, fixed in the latest commit.
if corpus_file.exists():
    with open(corpus_file, encoding="utf-8") as f:
        self.corpus[domain][split] = dict(
            line.strip().split("\t", 1) for line in f if line.strip()
        )
elif metadata_file.exists():
    corpus = {}
    with open(metadata_file, encoding="utf-8") as f:
        for line in f:
            try:
                obj = json.loads(line)
                doc_id = obj.get("pid") or obj.get("id")
                text = obj.get("text") or obj.get("body")
                if doc_id and text:
                    corpus[doc_id] = text
            except Exception as e:
                logger.error(f"Error parsing {metadata_file}: {e}")
    self.corpus[domain][split] = corpus
else:
    logger.warning(f"No corpus file found for {domain} {split}.")
I tried to run the task, but an error occurred.
task.queries["writing"].keys() # dict_keys(['test', 'test.forum'])
task.relevant_docs["writing"].keys() # dict_keys(['test', 'test.forum'])
task.corpus["writing"].keys() # dict_keys(['test'])
MTEB expects all data for a task run to be in one split, but for now the corpus has a different naming scheme. I think we should change the domains to writing.search and writing.forum to align with the mteb approach. What do you think?
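The mismatch described here can be detected mechanically. The sketch below is an illustration only: `misaligned_subsets` is a hypothetical helper, and it assumes `queries`, `corpus`, and `relevant_docs` are dictionaries of the form `{subset: {split: ...}}`, as the `.keys()` output above suggests.

```python
def misaligned_subsets(queries, corpus, relevant_docs):
    """Hypothetical helper: for each subset, list split names not shared by all three fields."""
    report = {}
    for subset in sorted(set(queries) | set(corpus) | set(relevant_docs)):
        # Split names present for this subset in each of the three fields.
        key_sets = [
            set(field.get(subset, {}))
            for field in (queries, corpus, relevant_docs)
        ]
        common = set.intersection(*key_sets)
        extra = set.union(*key_sets) - common
        if extra:
            report[subset] = sorted(extra)
    return report


# Key layout copied from the output quoted in this comment.
queries = {"writing": {"test": {}, "test.forum": {}}}
relevant_docs = {"writing": {"test": {}, "test.forum": {}}}
corpus = {"writing": {"test": {}}}
mismatches = misaligned_subsets(queries, corpus, relevant_docs)
print(mismatches)  # the "writing" subset is flagged because of "test.forum"
```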
That's why I had merged them earlier. We now load data per domain without merging the "search" and "forum" items into one key. Instead, for each domain we create separate sub-dictionaries for "search" and "forum" queries (and similarly for qrels).
{
  "corpus": { "writing": { ... }, "recreation": { ... }, ... },
  "queries": {
    "writing": { "search": { ... }, "forum": { ... } },
    "recreation": { "search": { ... }, "forum": { ... } },
    ...
  },
You should create them like this:
- corpus: "writing_search": {...}, "writing_forum": {...}, "recreation_search": { ... }, "recreation_forum": { ... },
- queries: "writing_search": {...}, "writing_forum": {...}, "recreation_search": { ... }, "recreation_forum": { ... },
- relevant_docs: "writing_search": {...}, "writing_forum": {...}, "recreation_search": { ... }, "recreation_forum": { ... },
because mteb can't handle nested dicts
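The flat layout suggested above could be produced from the nested one roughly as follows. This is a sketch under assumptions, not the code that was merged: the helper names `flatten_queries` and `expand_corpus` are hypothetical, and the nested input is assumed to be `{domain: {query_type: {...}}}` for queries and qrels and `{domain: {...}}` for the corpus.

```python
def flatten_queries(nested):
    """Hypothetical helper: turn {domain: {query_type: items}} into {"domain_querytype": items}."""
    return {
        f"{domain}_{query_type}": items
        for domain, by_type in nested.items()
        for query_type, items in by_type.items()
    }


def expand_corpus(corpus, query_types=("search", "forum")):
    """Hypothetical helper: expose each domain's corpus under every flattened subset name.

    The same dictionary object is reused for both names, so no text is copied.
    """
    return {
        f"{domain}_{query_type}": docs
        for domain, docs in corpus.items()
        for query_type in query_types
    }


nested_queries = {"writing": {"search": {"q1": "a"}, "forum": {"q2": "b"}}}
flat_queries = flatten_queries(nested_queries)
flat_corpus = expand_corpus({"writing": {"d1": "some text"}})
print(sorted(flat_queries))  # subset names such as "writing_forum", "writing_search"
```

The same `flatten_queries` call would apply to the qrels, since they share the nested shape of the queries.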
Can you check if this works?
I think this is working now. I will upload this dataset to Hugging Face and will merge to
# Conflicts:
#	mteb/tasks/Retrieval/__init__.py
#	mteb/tasks/Retrieval/eng/__init__.py
@KennethEnevoldsen Should we keep LoTTE as a benchmark, or should we remove it? This PR will be merged to
I'll merge it as is; in the future we might want to remove the benchmark.
@Samoed I would probably remove it for now (to avoid issues with the leaderboard). It is easy to add back in later.
Description:
This PR integrates the LoTTE (Long-Tail Topic-stratified Evaluation for IR) benchmark into MTEB. LoTTE consists of domain-specific retrieval tasks derived from StackExchange and GooAQ, evaluating models on natural, information-seeking queries in long-tail topics.
Closes #1836
Changes:
Testing:
✅ Verified that LoTTERetrieval runs successfully with sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 and intfloat/multilingual-e5-small.
✅ Ensured scores are neither trivial (near 100%) nor random (near 0%).
✅ Passed make test and make lint.
Notes:
Updates: