Integrate BIRCO Benchmark Datasets into MTEB with Graded Evaluation Support by AdnanElAssadi56 · Pull Request #1947 · embeddings-benchmark/mteb

AdnanElAssadi56 · 2025-02-04T05:17:43Z

Integrate BIRCO Benchmark Datasets into MTEB with Graded Evaluation Support

Description

This pull request integrates the BIRCO (Benchmark of Information Retrieval Tasks with Complex Objectives) datasets into the MTEB framework. The changes include:

New Base Class (BIRCOBase)
A new base class is added under mteb/tasks/Reranking/eng/BIRCO/BIRCOBase.py to load dataset JSON files from the Hugging Face hub.
Assumption:
The datasets from user bpHigh are hosted on Hugging Face, and each dataset’s JSON file is expected to be located at {metadata.dataset["path"]}/dataset.json. Please verify that these files exist and are accessible.
Dataset-Specific Task Classes
Five new dataset-specific classes have been implemented (each inheriting from BIRCOBase) in the directory mteb/tasks/Reranking/eng/BIRCO/:
- BIRCODorisMaeReranking.py
- BIRCOArguAnaReranking.py
- BIRCOClinicalTrialReranking.py
- BIRCORELICReranking.py
- BIRCOWhatsThatBookReranking.py
Each class contains a detailed metadata object (with fields such as name, description, reference, dataset path, revision, evaluation splits, etc.), and implements:
- get_query: Returns the query prepended with a task-specific instruction.
- get_positive_docs and get_negative_docs: Return the respective lists from the sample.
Placeholders:
- The revision field (e.g., "YOUR_REVISION_PLACEHOLDER") must be updated with the actual dataset revision hash from Hugging Face.
- Date values (e.g., ("2024-04-03", "2024-04-03")) should be updated as necessary.
Evaluator Adjustments
Updates in mteb/evaluation/evaluators/RerankingEvaluator.py ensure that graded relevance samples are correctly evaluated using pytrec_eval, computing metrics such as nDCG@10 and Recall@10.
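For reviewers unfamiliar with graded relevance, here is a minimal pure-Python sketch of the two metrics named above. It is not the evaluator code in this PR and does not call pytrec_eval. It assumes linear gain with a log2 rank discount, and the function names (`dcg_at_k`, `ndcg_at_k`, `recall_at_k`) are hypothetical. Tie-breaking between equal scores may differ from pytrec_eval.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains (log2 rank discount)."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains[:k]))


def ndcg_at_k(qrels, run, k=10):
    """nDCG@k for one query.

    qrels: {doc_id: graded relevance label}, run: {doc_id: model score}.
    """
    ranked = sorted(run, key=run.get, reverse=True)
    gains = [qrels.get(doc_id, 0) for doc_id in ranked]
    ideal_dcg = dcg_at_k(sorted(qrels.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def recall_at_k(qrels, run, k=10):
    """Fraction of relevant documents (label > 0) found in the top k."""
    relevant = {doc_id for doc_id, label in qrels.items() if label > 0}
    if not relevant:
        return 0.0
    top_k = sorted(run, key=run.get, reverse=True)[:k]
    return len(relevant.intersection(top_k)) / len(relevant)


# Illustrative data only: three documents with graded labels 2, 1, 0.
qrels = {"d1": 2, "d2": 1, "d3": 0}
run = {"d1": 0.2, "d2": 0.9, "d3": 0.5}
print(ndcg_at_k(qrels, run), recall_at_k(qrels, run))
```

With binary labels the same functions reduce to the usual binary nDCG and recall, which is why a single code path can in principle serve both kinds of sample.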
Test Coverage
A new test file tests/test_birco.py has been added to validate the integration. This test ensures:
- Each BIRCO task loads its metadata correctly.
- The get_query method returns a string that starts with “Instruction:”.
This suite can be expanded as needed.
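The checks described above can be sketched roughly as follows. This is an illustration, not the contents of tests/test_birco.py: `FakeBIRCOTask` is a hypothetical stand-in so the snippet runs without mteb installed, and the instruction text and sample fields are assumptions.

```python
class FakeBIRCOTask:
    """Hypothetical stand-in for a BIRCO task class (not the real mteb class)."""

    # Assumed instruction text, for illustration only.
    instruction = "Instruction: Retrieve the passage that best matches the query."

    def get_query(self, sample):
        # Prepend the task-specific instruction to the raw query.
        return f"{self.instruction} {sample['query']}"

    def get_positive_docs(self, sample):
        return sample["positive"]

    def get_negative_docs(self, sample):
        return sample["negative"]


def check_task(task, sample):
    """Run the assertions described in the Test Coverage section."""
    query = task.get_query(sample)
    assert isinstance(query, str)
    assert query.startswith("Instruction:")
    assert task.get_positive_docs(sample) == sample["positive"]
    assert task.get_negative_docs(sample) == sample["negative"]
    return query


sample = {"query": "a book about a lighthouse", "positive": ["doc A"], "negative": ["doc B"]}
print(check_task(FakeBIRCOTask(), sample))
```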

How to Test

Run the Test Suite:
From the repository root, execute:
```
pytest tests/test_birco.py
```

This confirms that all BIRCO task classes are loaded correctly and their methods produce the expected outputs.

Local Evaluation Example
You can run a local evaluation using one of the new tasks, for example:

```
python -m mteb.run --task BIRCODorisMaeReranking --model <your-model>
```

Replace <your-model> with your chosen model to verify that evaluation metrics are computed properly.

Outstanding Items

Metadata Revisions & Dates:
- Update all placeholder values (e.g., "YOUR_REVISION_PLACEHOLDER", date fields) with the correct dataset revision hashes and date values from Hugging Face.
- These datasets are assumed to be hosted by user bpHigh—please verify that the corresponding dataset.json files exist at the specified paths.
Dataset Availability Check:
- Ensure that each dataset’s JSON file exists at {metadata.dataset["path"]}/dataset.json to avoid runtime errors.

Next Steps

Review & Feedback:
- I kindly request that reviewers verify:
  1. The integration aligns with MTEB’s overall structure.
  2. Metadata accurately reflects the BIRCO paper details.
  3. The dataset paths and revision placeholders are updated once the actual values are available.
Merge Process:
- Once the placeholders are updated and the integration is validated through testing, this PR can be merged.

Thank you for reviewing this integration. I look forward to your feedback and suggestions for further improvements.

Samoed · 2025-02-04T06:57:21Z

I think this relates to issue #818?

Samoed · 2025-02-04T06:58:20Z

Remove this

Samoed · 2025-02-04T06:58:51Z

Remove changes here

Samoed · 2025-02-04T07:05:33Z

            if len(sample["positive"]) > 0 and len(sample["negative"]) > 0
        ]

+    def _detect_graded_relevance(self):


What does graded_relevance do?

Samoed · 2025-02-04T07:07:14Z

+
+        return qrels, run
+
+    def _calculate_graded_metrics(self, qrels, run):


Use the methods already used in this evaluator to compute the metrics.

Samoed · 2025-02-04T07:07:52Z

+        avg_character_length={"test": 800}
+    )
+
+    def get_query(self, sample):


For the instruction, you should specify a prompt in TaskMetadata.

I can't find any usage of get_query in the code.

Samoed · 2025-02-04T07:09:02Z

Integrate the loader inside each task. You can move all the tasks into one file and share a single loading function there.

Samoed · 2025-02-04T07:11:05Z

+
+        # Load dataset.json from the HF hub directory specified in metadata.
+        # NOTE: Ensure the dataset is available at the path provided in metadata.
+        dataset_path = Path(self.metadata.dataset["path"]) / "dataset.json"


You need to re-upload the datasets so that they can be loaded with the standard datasets methods.
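As a rough illustration of what a re-upload could involve: a minimal sketch that flattens samples into one row per query-document pair and serializes them as JSON Lines, a format the generic JSON loader in Hugging Face `datasets` can read. The sample shape (`query`, `positive`, `negative`) is assumed from the diff in this PR, and the row fields and function names (`flatten_samples`, `to_jsonl`) are hypothetical, not an agreed schema.

```python
import json


def flatten_samples(samples):
    """Turn nested samples into flat rows, one per (query, document) pair.

    Assumes each sample looks like
    {"query": str, "positive": [str, ...], "negative": [str, ...]}.
    """
    rows = []
    for query_id, sample in enumerate(samples):
        for relevance, key in ((1, "positive"), (0, "negative")):
            for text in sample.get(key, []):
                rows.append(
                    {
                        "query_id": str(query_id),
                        "query": sample["query"],
                        "document": text,
                        "relevance": relevance,
                    }
                )
    return rows


def to_jsonl(rows):
    """Serialize rows as JSON Lines (one JSON object per line)."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)


# Illustrative data only.
samples = [{"query": "q0", "positive": ["relevant doc"], "negative": ["other doc"]}]
print(to_jsonl(flatten_samples(samples)))
```

Graded labels, if the source data carries them, would replace the binary `relevance` value used here.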

Samoed · 2025-02-04T07:12:17Z

 import logging
 from typing import Any

+import pytrec_eval


Use our standard methods for evaluation.

Samoed · 2025-02-04T07:15:05Z

+        qrels, run = self._prepare_graded_evaluation(model)
+        return self._calculate_graded_metrics(qrels, run)
+
+    def _compute_legacy_metrics(self, model):


These are not legacy metrics.

Samoed · 2025-02-04T07:16:49Z

+        ),
+        reference="https://github.com/BIRCO-benchmark/BIRCO",
+        dataset={
+            "path": "bpHigh/BIRCO_ArguAna",


Where is this stored? I can't find this on HF

Samoed · 2025-02-04T07:19:13Z

@orionw You might be interested in this as it is related to the reranking tasks. In general, I think we should consider integrating this into the v2 branch, where the Reranking tasks have been refactored.

orionw · 2025-02-04T13:09:36Z

Hi @AdnanElAssadi56 and thanks for the PR! Would really love to have Birco added!

Sorry about the confusion w.r.t. the v2 branch. Given that we are moving to it soon (in the coming couple weeks) I would suggest you merge into there.

I think that should also make a lot of this easier -- there reranking already uses pytrec_eval and can work for graded relevance, has an easier loading style, etc.

Again sorry about the confusion!

Samoed · 2025-02-10T22:31:59Z

This benchmark is being developed in #2022.

AdnanElAssadi56 added 5 commits February 3, 2025 17:54

Added Preliminary Datasets

62ab522

Adjusted ReRank Computations + Edited Birco Datasets

7a58b6a

Minor Adjustments and Commenting + Completed Integration with Testing

1022839

Commit message including new files

1f762ec

Moved the birco test file

c747cce

Samoed requested changes Feb 4, 2025

Samoed requested a review from orionw February 4, 2025 07:32

Samoed mentioned this pull request Feb 8, 2025

Integrate Birco V2 #2022

Merged

Samoed closed this Feb 10, 2025

