Add new benchmark MAIR #1425

Closed
sunnweiwei wants to merge 0 commits into embeddings-benchmark:main from sunnweiwei:main

Conversation

@sunnweiwei

sunnweiwei commented Nov 10, 2024

Fixes #1426

  • Added MAIR (https://arxiv.org/abs/2410.10127, EMNLP 2024), a diverse benchmark for instructed IR.
  • The data class is defined in mteb/tasks/MAIR/eng/MAIR.py, generating 126 data classes for the 126 tasks in MAIR on the fly.
  • In benchmarks/benchmarks.py, the benchmark configuration has been added.
  • Tested several models, and the results are consistent with those of the original repo: https://github.com/sunnweiwei/mair.
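For readers unfamiliar with generating task classes on the fly, here is a minimal sketch of the general pattern using Python's built-in `type()`. It is not the code from this PR: `TaskMeta`, `BaseRetrievalTask`, `make_task_class`, the `MAIR_` class-name prefix, and the split names are hypothetical stand-ins, and only two of the 126 tasks are listed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskMeta:
    """Hypothetical stand-in for a task metadata object."""
    name: str
    eval_splits: tuple


class BaseRetrievalTask:
    """Hypothetical stand-in for the retrieval task base class."""
    metadata = None


# Illustrative subset of the task -> splits table; split names are made up.
TASK2SPLIT = {
    "IFEval": ("subtask_a", "subtask_b"),
    "SWE-Bench": ("subtask_a",),
}


def make_task_class(task_name, splits):
    """Build one task class whose metadata is bound at creation time."""
    class_name = "MAIR_" + task_name.replace("-", "_")
    meta = TaskMeta(name=task_name, eval_splits=tuple(splits))
    return type(class_name, (BaseRetrievalTask,), {"metadata": meta})


# One class per entry in the table, created in a single pass.
TASK_CLASSES = {
    name: make_task_class(name, splits) for name, splits in TASK2SPLIT.items()
}
```

With this pattern, adding a task only means adding a row to the table; the trade-off is that the classes do not appear as named definitions in the source file.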

Checklist

  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.

Adding datasets checklist

Reason for dataset addition: ...

The added data was introduced in https://arxiv.org/abs/2410.10127, which presents a benchmark for instructable information retrieval. It contains 126 real-world retrieval tasks across 6 domains, with manually annotated instructions. The data has been subsampled to reduce evaluation cost.

  • I have run the following models on the task (adding the results to the pr). These can be run using the mteb -m {model_name} -t {task_name} command.
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform()
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.

Adding a model checklist

  • I have filled out the ModelMeta object to the extent possible
  • I have ensured that my model can be loaded using
    • mteb.get_model(model_name, revision) and
    • mteb.get_model_meta(model_name, revision)
  • I have tested the implementation works on a representative set of tasks.

@mangopy

mangopy commented Nov 10, 2024

Following the above process, I am currently opening a pull request in https://github.com/embeddings-benchmark/results to submit our evaluation results for the newly added MAIR benchmark.
However, I am not sure about the format of the result file.

@Samoed

Samoed commented Nov 10, 2024

Member

When you run your tasks, MTEB will generate a folder with the results from your runs, and you can submit that folder.

Comment thread mteb/tasks/MAIR/eng/MAIR.py Outdated
return
self.corpus, self.queries, self.relevant_docs = {}, {}, {}
queries_path = self.metadata_dict["dataset"]["path"]
docs_path = self.metadata_dict["dataset"]["path"].replace("-Queries", "-Docs")

Member

Can you place queries and docs in the same repo?

@sunnweiwei sunnweiwei Nov 10, 2024

Author

Thanks for the feedback. To keep Q/D in one repo, I could create a separate repo for each task.

But is that necessary? I think having two repos, one for Q and one for D, would be easier to manage than having over a hundred repos, one per task.

Member

You can create different splits for queries and documents in the same repo
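To make the two layouts under discussion concrete, here is a rough sketch in plain Python. The first function mirrors the `.replace("-Queries", "-Docs")` line from the diff above; the second shows one possible single-repo naming scheme. The repo names, the `single_repo_layout` helper, and the `<task>-<kind>` subset naming are all assumptions for illustration, not the layout the PR ended up with.

```python
def two_repo_layout(queries_path: str) -> dict:
    """Derive the docs repo from the queries repo, as in the diff above.

    Note: str.replace returns the input unchanged when "-Queries" is
    absent, so a misnamed path would silently point docs at the queries.
    """
    return {
        "queries": queries_path,
        "docs": queries_path.replace("-Queries", "-Docs"),
    }


def single_repo_layout(repo: str, task_name: str) -> dict:
    """Hypothetical scheme: one repo, one named subset per (task, kind)."""
    return {
        kind: (repo, f"{task_name}-{kind}")
        for kind in ("queries", "docs")
    }


# "example-org/..." is a placeholder, not a real dataset repo.
print(two_repo_layout("example-org/MAIR-Queries"))
print(single_repo_layout("example-org/MAIR", "IFEval"))
```

Each `(repo, subset)` pair could then be handed to whatever loader the task uses, so a user evaluating one task would only fetch that task's two subsets.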

Author

Thanks, I see.

One issue is that the data in MAIR has a two-level structure: task → subtasks, as some tasks contain multiple subtasks (e.g., IFEval, SWE-Bench). It's tricky to keep this structure in a single repository without flattening it.

I also think people may not want to download all the data if they're only interested in evaluating a few specific tasks. So, if we still need to put Q and D in a single repo, the best way might be to create 126 separate repos (one per task).

Or do you have any other suggestions?

@mangopy

mangopy commented Nov 10, 2024

Another question: our benchmark has two settings, i.e., evaluating the model with and without instructions. Should I store the results with and without instructions in two separate files?
I would appreciate any feedback!

@Samoed

Samoed commented Nov 10, 2024

Member

If you have results for both instruct and non-instruct, it might be better to create separate tasks, though @orionw might have a clearer perspective on this

@orionw

orionw commented Nov 10, 2024

Contributor

+1 to adding a duplicate task if you have a specific instruction you want them to use for each. Otherwise models can define their own instructions and in that case you could just submit results to the same task but with a different prompt in the meta info.

If you’re adding an instruction variant (and once #1359 is in), you’d just need to add a version of those tasks with all the same attributes plus a config/attribute called “self.instruction”, in (query-id -> instruction_text) format.
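As an illustration of the (query-id -> instruction_text) format mentioned here, a minimal sketch follows. The query and instruction texts are invented, `with_instructions` is a hypothetical helper, and prepending the instruction to the query is only one assumed way to combine them; how MTEB actually consumes the mapping depends on #1359.

```python
# Made-up example data: query-id -> query text.
queries = {
    "q1": "How do I reverse a list in Python?",
    "q2": "What causes ocean tides?",
}

# Same keys as `queries`: query-id -> instruction text.
instructions = {
    "q1": "Retrieve code snippets that answer the programming question.",
    "q2": "Retrieve encyclopedic passages that answer the question.",
}


def with_instructions(queries: dict, instructions: dict) -> dict:
    """Prepend each query's instruction; queries without one pass through."""
    return {
        qid: f"{instructions[qid]} {text}" if qid in instructions else text
        for qid, text in queries.items()
    }


instructed_queries = with_instructions(queries, instructions)
```

The no-instruction setting would then use `queries` as-is, and the instructed setting would use `instructed_queries`, which matches the idea of two otherwise identical tasks.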

@sunnweiwei

Author

Hi. If we have duplicate tasks with different instructions, will they appear in separate tables on the leaderboard? For example, would there be one called (XXX with instruction) and another called (XXX without instruction)?

@orionw

orionw commented Nov 10, 2024

Contributor

@sunnweiwei they would appear as different datasets yes. So you could have one leaderboard with them and one without, if desired. Or push it all together into one benchmark.

Does that answer the question? Or do you mean more than one instruction per dataset?

@sunnweiwei

Author

Thanks for the answer! I was thinking of putting them into one table for benchmarking purposes, maybe adding a column to indicate whether instructions were used. Then people could compare models with and without instructions in the same table. Good to know we can do this.

@Muennighoff

Contributor

Would be great to get this in @sunnweiwei in case you're still working on it; I think it'll be very useful to the community!

@sunnweiwei

Author

@Muennighoff Thanks! I also hope that this can be merged. I have placed the queries and documents under one repo: MAIR-Bench/MAIR-QD. I’m not sure whether the new version of MTEB includes any relevant changes, so please let me know if any updates are needed.

@KennethEnevoldsen

Contributor

Hi @sunnweiwei seems like we have dropped the ball on this one. Do you still have the time to finalize this PR? I sadly can't update your branch so you would have to resolve the conflict by merging main

@sunnweiwei

Author

@KennethEnevoldsen Thanks! I can now work on it. I updated some files, and it now seems to pass the check. Let me know if anything needs to be updated.

@Samoed Samoed left a comment

Member

Can you update the benchmarks file so that it has fewer changes? Also, can you run task.calculate_metadata_metrics()?

Member

I don't think that benchmarks file should be changed that much

Comment thread mteb/tasks/MAIR/eng/MAIR.py Outdated
type="Retrieval",
category="s2p",
modalities=["text"],
eval_splits=TASK2SPLIT.get(task_name, []),

Member

Suggested change
- eval_splits=TASK2SPLIT.get(task_name, []),
+ eval_splits=TASK2SPLIT[task_name],
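The reviewer gives no rationale for this suggestion, but the usual difference between the two lookups is shown in the sketch below: `.get(..., [])` hides a missing or misspelled task name behind an empty split list, while indexing raises `KeyError` immediately. The table contents and the helper names `splits_lenient` and `splits_strict` are hypothetical.

```python
# Illustrative table; the split name is made up.
TASK2SPLIT = {"IFEval": ["subtask_a"]}


def splits_lenient(task_name: str) -> list:
    """Original form: unknown names silently yield no eval splits."""
    return TASK2SPLIT.get(task_name, [])


def splits_strict(task_name: str) -> list:
    """Suggested form: unknown names fail loudly with KeyError."""
    return TASK2SPLIT[task_name]


# A typo in the task name goes unnoticed with the lenient lookup...
assert splits_lenient("IFEvla") == []

# ...but surfaces at once with the strict lookup.
try:
    splits_strict("IFEvla")
except KeyError as err:
    print(f"unknown task: {err}")
```

A task with an empty `eval_splits` would likely be evaluated on nothing, so failing at class-creation time is arguably the safer behavior.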

@sunnweiwei sunnweiwei closed this Mar 14, 2025