Add new benchmark MAIR by sunnweiwei · Pull Request #1425 · embeddings-benchmark/mteb

sunnweiwei · 2024-11-10T07:01:01Z

Added MAIR (https://arxiv.org/abs/2410.10127, EMNLP 2024), a diverse benchmark for instructed IR.
The data class is defined in mteb/tasks/MAIR/eng/MAIR.py, generating 126 data classes for the 126 tasks in MAIR on the fly.
In benchmarks/benchmarks.py, the benchmark configuration has been added.
Tested several models, and the results are consistent with those of the original repo: https://github.com/sunnweiwei/mair.
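As a rough illustration of what "generating data classes on the fly" can look like, here is a minimal sketch. It does not reproduce the code in mteb/tasks/MAIR/eng/MAIR.py: `BaseRetrievalTask`, `make_task_class`, and the two-entry `TASK_NAMES` list are hypothetical stand-ins, and the real task classes would subclass an MTEB retrieval task rather than this placeholder.

```python
# Illustrative subset only; MAIR itself defines 126 tasks.
TASK_NAMES = ["IFEval", "SWE-Bench"]


class BaseRetrievalTask:
    """Placeholder base class (an assumption, not the real MTEB class)."""

    task_name = None

    def describe(self):
        return f"retrieval task: {self.task_name}"


def make_task_class(task_name):
    """Create one task class per task name at import time."""
    # Class names must be valid identifiers, so drop characters such as "-".
    class_name = "MAIR_" + "".join(ch for ch in task_name if ch.isalnum())
    return type(class_name, (BaseRetrievalTask,), {"task_name": task_name})


# One generated class per task, keyed by the original task name.
TASK_CLASSES = {name: make_task_class(name) for name in TASK_NAMES}
```

Generating the classes from a single list keeps the per-task boilerplate in one place, at the cost of classes that are harder to find with a text search.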

Checklist

Run tests locally to make sure nothing is broken using make test.
Run the formatter to format the code using make lint.

Adding datasets checklist

Reason for dataset addition: ...

The added data is introduced in https://arxiv.org/abs/2410.10127, which presents a benchmark for instructable information retrieval. It contains 126 real-world retrieval tasks across 6 domains, with manually annotated instructions. The data has been sampled to reduce evaluation costs.

I have run the following models on the task (adding the results to the PR). These can be run using the mteb -m {model_name} -t {task_name} command.
- sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
- intfloat/multilingual-e5-small
I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform().
I have filled out the metadata object in the dataset file (find documentation on it here).
Run tests locally to make sure nothing is broken using make test.
Run the formatter to format the code using make lint.

Adding a model checklist

I have filled out the ModelMeta object to the extent possible
I have ensured that my model can be loaded using
- mteb.get_model(model_name, revision) and
- mteb.get_model_meta(model_name, revision)
I have tested the implementation works on a representative set of tasks.

mangopy · 2024-11-10T11:53:19Z

Following the above process, I am currently opening a pull request in https://github.com/embeddings-benchmark/results to submit our evaluation results for the newly added MAIR benchmark.
However, I am unclear about the format of the result file.

Samoed · 2024-11-10T12:15:09Z

When you run your tasks, MTEB will generate a folder with the results from your runs, and you can submit that folder.

Samoed · 2024-11-10T12:16:25Z

+        return
+    self.corpus, self.queries, self.relevant_docs = {}, {}, {}
+    queries_path = self.metadata_dict["dataset"]["path"]
+    docs_path = self.metadata_dict["dataset"]["path"].replace("-Queries", "-Docs")


Can you place the queries and docs in the same repo?

Thanks for the feedback. To keep Q/D in one repo, I could create a separate repo for each task.

But is that necessary? I think having two repos, one for queries and one for documents, would be easier to manage than having over a hundred repos, one per task.

You can create different splits for queries and documents in the same repo.

Thanks, I see.

One issue is that the data in MAIR has a two-level structure: task → subtasks, as some tasks contain multiple subtasks (e.g., IFEval, SWE-Bench). It’s tricky to maintain this structure without flattening it into a single repository.

Also, I think people may not want to download all the data if they’re only interested in evaluating a few specific tasks. So, if we still need to put Q and D in a single repo, the best way might be to create 126 separate repos, one per task.

Or do you have any other suggestions?
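One possible way to reconcile the two positions above, sketched under assumptions: a single Hugging Face dataset repo can hold several configs (subsets), each with its own splits, so each task or subtask could become one config with a `queries` split and a `docs` split. The helper names, the `Task_Subtask` naming scheme, and the repo and subtask names below are all hypothetical; the thread does not say which layout was finally used.

```python
def config_name(task, subtask=None):
    """Flatten the task -> subtask hierarchy into a single config name."""
    return task if subtask is None else f"{task}_{subtask}"


def load_spec(repo, task, subtask=None):
    """Describe where the queries and documents of one (sub)task would live.

    Each entry holds the arguments one would pass to a dataset loader:
    repo path, config name, and split name.
    """
    name = config_name(task, subtask)
    return {
        "queries": {"path": repo, "name": name, "split": "queries"},
        "docs": {"path": repo, "name": name, "split": "docs"},
    }


# Hypothetical repo and subtask names, for illustration only.
spec = load_spec("some-org/some-benchmark", "IFEval", "subtask_a")
print(spec["docs"])
```

With a layout like this, a loader that requests a single config typically only needs the files of that config, which would address the concern about downloading all tasks at once.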

mangopy · 2024-11-10T12:16:53Z

Another question: our benchmark has two settings, evaluating the model with and without instructions. Should I store the results with and without instructions in two separate files?
I would appreciate any feedback!

Samoed · 2024-11-10T12:23:27Z

If you have results for both instruct and non-instruct, it might be better to create separate tasks, though @orionw might have a clearer perspective on this.

orionw · 2024-11-10T13:57:57Z

+1 to adding a duplicate task if you have a specific instruction you want them to use for each. Otherwise, models can define their own instructions, and in that case you could just submit results to the same task but with a different prompt in the meta info.

If you’re adding an instruction variant (and once #1359 is in), you’d just need to add a version of those tasks with all the same attributes plus a config/attribute called “self.instruction”, in (query-id -> instruction_text) format.
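To make the (query-id -> instruction_text) format concrete, here is a minimal sketch. Only the shape of the mapping comes from the comment above; the helper names, the sample queries, and the choice to prepend the instruction to the query text are assumptions, since how a model consumes the instruction depends on the model and on what #1359 ends up defining.

```python
# Hypothetical queries, keyed by query id.
queries = {
    "q1": "How do I reverse a list in Python?",
    "q2": "What causes a segmentation fault?",
}


def build_instruction_map(queries, instruction):
    """Assign the same instruction text to every query id."""
    return {query_id: instruction for query_id in queries}


def apply_instructions(queries, instructions):
    """Prepend each query's instruction (assumed combination strategy)."""
    return {
        query_id: f"{instructions[query_id]} {text}"
        for query_id, text in queries.items()
    }


instruction = build_instruction_map(queries, "Retrieve a relevant answer.")
instructed_queries = apply_instructions(queries, instruction)
```

A per-query mapping also allows different instructions for different queries of the same task, which a single task-level prompt cannot express.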

sunnweiwei · 2024-11-10T18:52:46Z

Hi. If we have duplicate tasks with different instructions, will they appear in separate tables on the leaderboard? For example, would there be one called (XXX with instruction) and another called (XXX without instruction)?

orionw · 2024-11-10T19:32:12Z

@sunnweiwei they would appear as different datasets, yes. So you could have one leaderboard with them and one without, if desired. Or push it all together into one benchmark.

Does that answer the question? Or do you mean more than one instruction per dataset?

sunnweiwei · 2024-11-10T19:41:00Z

Thanks for the answer! I was thinking of putting them into one table for benchmarking purposes, maybe adding a column to indicate whether instructions were used. Then people could compare models with and without instructions in the same table. Good to know we can do this.

Muennighoff · 2025-01-09T05:44:41Z

Would be great to get this in @sunnweiwei in case you're still working on it; I think it'll be very useful to the community!

sunnweiwei · 2025-02-02T03:14:13Z

@Muennighoff Thanks! I also hope that this can be merged. I have placed the queries and documents under one repo: MAIR-Bench/MAIR-QD. I’m not sure if the new version of MTEB includes any changes, so please let me know if any update is needed.

KennethEnevoldsen · 2025-02-14T14:26:06Z

Hi @sunnweiwei, it seems like we have dropped the ball on this one. Do you still have the time to finalize this PR? I sadly can't update your branch, so you would have to resolve the conflict by merging main.

sunnweiwei · 2025-03-13T02:44:29Z

@KennethEnevoldsen Thanks! I can now work on it. I updated some files, and it now seems to pass the check. Let me know if anything needs to be updated.

Samoed

Can you update the benchmarks file to have fewer changes? Also, can you run task.calculate_metadata_metrics()?

Samoed · 2025-03-13T18:53:36Z

I don't think the benchmarks file should be changed that much.

Samoed · 2025-03-13T18:54:44Z

+        type="Retrieval",
+        category="s2p",
+        modalities=["text"],
+        eval_splits=TASK2SPLIT.get(task_name, []),


Suggested change

-        eval_splits=TASK2SPLIT.get(task_name, []),
+        eval_splits=TASK2SPLIT[task_name],
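The reviewer does not state the reason for this suggestion, but a likely motivation is failing fast: with `.get(task_name, [])`, a task missing from the mapping silently gets an empty list of evaluation splits, whereas direct indexing raises `KeyError` as soon as the task is defined. The sketch below shows the difference with a hypothetical one-entry `TASK2SPLIT`; the real mapping in the PR is not shown in this thread.

```python
# Hypothetical mapping; the actual TASK2SPLIT contents are not shown in the PR thread.
TASK2SPLIT = {"IFEval": ["test"]}


def eval_splits_lenient(task_name):
    """Original behavior: unknown tasks silently get no evaluation splits."""
    return TASK2SPLIT.get(task_name, [])


def eval_splits_strict(task_name):
    """Suggested behavior: unknown tasks raise KeyError immediately."""
    return TASK2SPLIT[task_name]


print(eval_splits_lenient("IFEvall"))  # misspelled name goes unnoticed

try:
    eval_splits_strict("IFEvall")
except KeyError as err:
    print(f"missing split entry for task {err}")
```

An empty `eval_splits` would most likely surface much later, as a task that produces no scores, which is harder to trace back to a missing or misspelled key.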

sunnweiwei mentioned this pull request Nov 10, 2024

Add new benchmark MAIR #1426

Open

Samoed reviewed Nov 10, 2024


orionw mentioned this pull request Nov 11, 2024

Consolidate Retrieval/Reranking/Instruction Variants #1359

Merged


KennethEnevoldsen requested a review from Samoed March 13, 2025 18:12

Samoed requested changes Mar 13, 2025


sunnweiwei closed this Mar 14, 2025



6 participants
