Integrate BIRCO Benchmark Datasets into MTEB with Graded Evaluation Support #1947

Closed
AdnanElAssadi56 wants to merge 5 commits into embeddings-benchmark:main from AdnanElAssadi56:integrate_birco

@AdnanElAssadi56

Contributor

Integrate BIRCO Benchmark Datasets into MTEB with Graded Evaluation Support

Description

This pull request integrates the BIRCO (Benchmark of Information Retrieval Tasks with Complex Objectives) datasets into the MTEB framework. The changes include:

  • New Base Class (BIRCOBase)
    A new base class is added under mteb/tasks/Reranking/eng/BIRCO/BIRCOBase.py to load dataset JSON files from the Hugging Face hub.
    Assumption:
    The datasets from user bpHigh are hosted on Hugging Face, and each dataset’s JSON file is expected to be located at {metadata.dataset["path"]}/dataset.json. Please verify that these files exist and are accessible.

  • Dataset-Specific Task Classes
    Five new dataset-specific classes have been implemented (each inheriting from BIRCOBase) in the directory mteb/tasks/Reranking/eng/BIRCO/:

    • BIRCODorisMaeReranking.py
    • BIRCOArguAnaReranking.py
    • BIRCOClinicalTrialReranking.py
    • BIRCORELICReranking.py
    • BIRCOWhatsThatBookReranking.py

    Each class contains a detailed metadata object (with fields such as name, description, reference, dataset path, revision, evaluation splits, etc.), and implements:

    • get_query: Returns the query prepended with a task-specific instruction.
    • get_positive_docs and get_negative_docs: Return the respective lists from the sample.

    Placeholders:

    • The revision field (e.g., "YOUR_REVISION_PLACEHOLDER") must be updated with the actual dataset revision hash from Hugging Face.
    • Date values (e.g., ("2024-04-03", "2024-04-03")) should be updated as necessary.
  • Evaluator Adjustments
    Updates in mteb/evaluation/evaluators/RerankingEvaluator.py ensure that graded relevance samples are correctly evaluated using pytrec_eval, computing metrics such as nDCG@10 and Recall@10.

  • Test Coverage
    A new test file tests/test_birco.py has been added to validate the integration. This test ensures:

    • Each BIRCO task loads its metadata correctly.
    • The get_query method returns a string that starts with “Instruction:”.

    This suite can be expanded as needed.
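
As a rough illustration of the graded metrics named under Evaluator Adjustments, the sketch below computes nDCG@10 and Recall@10 in plain Python from `qrels` and `run` dictionaries shaped like the ones pytrec_eval consumes. It is not the code in this PR and not pytrec_eval itself; the function names, the linear-gain choice, and the toy data are assumptions made for the example.

```python
import math


def _top_k(doc_scores, k):
    """Return the k document ids with the highest model scores."""
    return sorted(doc_scores, key=doc_scores.get, reverse=True)[:k]


def ndcg_at_k(qrels, run, k=10):
    """Mean nDCG@k over the queries in `run`, using linear gain (assumption)."""
    per_query = []
    for qid, doc_scores in run.items():
        grades = qrels.get(qid, {})
        ranked = _top_k(doc_scores, k)
        # Discounted gain of the ranking produced by the model.
        dcg = sum(grades.get(d, 0) / math.log2(i + 2) for i, d in enumerate(ranked))
        # Discounted gain of the best possible ordering of the judged documents.
        ideal = sorted(grades.values(), reverse=True)[:k]
        idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
        per_query.append(dcg / idcg if idcg > 0 else 0.0)
    return sum(per_query) / len(per_query) if per_query else 0.0


def recall_at_k(qrels, run, k=10):
    """Mean fraction of relevant documents (grade > 0) found in the top k."""
    per_query = []
    for qid, doc_scores in run.items():
        relevant = {d for d, g in qrels.get(qid, {}).items() if g > 0}
        hits = relevant.intersection(_top_k(doc_scores, k))
        per_query.append(len(hits) / len(relevant) if relevant else 0.0)
    return sum(per_query) / len(per_query) if per_query else 0.0


# Toy data (hypothetical): one query, three documents with graded labels.
qrels = {"q1": {"d1": 2, "d2": 1, "d3": 0}}
run = {"q1": {"d1": 0.9, "d3": 0.8, "d2": 0.1}}
print(ndcg_at_k(qrels, run, k=10), recall_at_k(qrels, run, k=2))
```

With graded labels, a document with grade 2 contributes more to nDCG than one with grade 1, which is the distinction a binary positive/negative split cannot express.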

How to Test

  1. Run the Test Suite:
    From the repository root, execute:
    pytest tests/test_birco.py

    This confirms that all BIRCO task classes are loaded correctly and their methods produce the expected outputs.

  2. Local Evaluation Example
    You can run a local evaluation using one of the new tasks, for example:
    python -m mteb.run --task BIRCODorisMaeReranking --model <your-model>

    Replace <your-model> with your chosen model to verify that evaluation metrics are computed properly.
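
For orientation, the kind of check described for tests/test_birco.py could look roughly like the sketch below. `ToyBIRCOTask` is a hypothetical stand-in rather than one of the classes in this PR, and the `query` key and the instruction wording are assumptions; only the method names and the `positive`/`negative` keys come from the description and diff.

```python
class ToyBIRCOTask:
    """Hypothetical stand-in for a BIRCO task class (not the PR's code)."""

    # Assumed wording; each real task would carry its own instruction.
    instruction = "Instruction: Retrieve the passage that best matches the request."

    def get_query(self, sample):
        # Prepend the task-specific instruction to the raw query text.
        return f"{self.instruction} {sample['query']}"

    def get_positive_docs(self, sample):
        return sample["positive"]

    def get_negative_docs(self, sample):
        return sample["negative"]


def test_toy_task_interface():
    task = ToyBIRCOTask()
    sample = {
        "query": "a novel about a lighthouse keeper",
        "positive": ["doc about lighthouses"],
        "negative": ["doc about submarines"],
    }
    assert task.get_query(sample).startswith("Instruction:")
    assert task.get_positive_docs(sample) == ["doc about lighthouses"]
    assert task.get_negative_docs(sample) == ["doc about submarines"]


test_toy_task_interface()
```

Under pytest, the final direct call would be dropped and the function collected automatically.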


Outstanding Items

  • Metadata Revisions & Dates:

    • Update all placeholder values (e.g., "YOUR_REVISION_PLACEHOLDER", date fields) with the correct dataset revision hashes and date values from Hugging Face.
    • These datasets are assumed to be hosted by user bpHigh—please verify that the corresponding dataset.json files exist at the specified paths.
  • Dataset Availability Check:

    • Ensure that each dataset’s JSON file exists at {metadata.dataset["path"]}/dataset.json to avoid runtime errors.
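
One way to run the availability check before evaluation is sketched below. It assumes the `huggingface_hub` package is installed and uses its `file_exists` helper; `find_missing_files` and the injectable `exists` argument are hypothetical names introduced here so the logic can be exercised without network access.

```python
def _hub_file_exists(repo_id, filename):
    """Ask the Hugging Face hub whether `filename` exists in a dataset repo."""
    # Imported lazily so the helper below can be tested offline.
    from huggingface_hub import file_exists

    return file_exists(repo_id, filename, repo_type="dataset")


def find_missing_files(repo_ids, filename="dataset.json", exists=_hub_file_exists):
    """Return the repo ids for which `filename` could not be found."""
    return [repo_id for repo_id in repo_ids if not exists(repo_id, filename)]


# Offline demonstration with a fake lookup; the second repo id is made up.
known = {"bpHigh/BIRCO_ArguAna"}
missing = find_missing_files(
    ["bpHigh/BIRCO_ArguAna", "example-user/not-a-real-dataset"],
    exists=lambda repo_id, filename: repo_id in known,
)
print(missing)
```

Calling `find_missing_files` with the default `exists` would query the hub for real, so an empty result there means every listed path resolves.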

Next Steps

  • Review & Feedback:

    • I kindly request that reviewers verify:
      1. The integration aligns with MTEB’s overall structure.
      2. Metadata accurately reflects the BIRCO paper details.
      3. The dataset paths and revision placeholders are updated once the actual values are available.
  • Merge Process:

    • Once the placeholders are updated and the integration is validated through testing, this PR can be merged.

Thank you for reviewing this integration. I look forward to your feedback and suggestions for further improvements.

@Samoed

Samoed commented Feb 4, 2025

Member

I think this relates to issue #818?

Member

Remove this

Comment thread mteb/__init__.py

Member

Remove the changes here.

if len(sample["positive"]) > 0 and len(sample["negative"]) > 0
]

def _detect_graded_relevance(self):

Member

What does graded_relevance do?


return qrels, run

def _calculate_graded_metrics(self, qrels, run):

Member

Use the methods already used in this evaluator to compute the metrics.

avg_character_length={"test": 800}
)

def get_query(self, sample):

Member

For instructions, you should specify the prompt in TaskMetadata.

Member

I can't find any usage of get_query in the code.

Member

Integrate the loader into each task. You can move all the tasks into one file and use a single shared function there.


# Load dataset.json from the HF hub directory specified in metadata.
# NOTE: Ensure the dataset is available at the path provided in metadata.
dataset_path = Path(self.metadata.dataset["path"]) / "dataset.json"

Member

You need to re-upload the datasets so they can be loaded using the standard datasets methods.

import logging
from typing import Any

import pytrec_eval

Member

Use our standard evaluation methods.

qrels, run = self._prepare_graded_evaluation(model)
return self._calculate_graded_metrics(qrels, run)

def _compute_legacy_metrics(self, model):

Member

These are not legacy metrics.

),
reference="https://github.com/BIRCO-benchmark/BIRCO",
dataset={
"path": "bpHigh/BIRCO_ArguAna",

@Samoed Samoed Feb 4, 2025

Member

Where is this stored? I can't find this on HF

@Samoed

Samoed commented Feb 4, 2025

Member

@orionw You might be interested in this as it is related to the reranking tasks. In general, I think we should consider integrating this into the v2 branch, where the Reranking tasks have been refactored.

@Samoed Samoed requested a review from orionw February 4, 2025 07:32
@orionw

orionw commented Feb 4, 2025

Contributor

Hi @AdnanElAssadi56 and thanks for the PR! Would really love to have Birco added!

Sorry about the confusion w.r.t. the v2 branch. Given that we are moving to it soon (in the coming couple weeks) I would suggest you merge into there.

I think that should also make a lot of this easier -- there reranking already uses pytrec_eval and can work for graded relevance, has an easier loading style, etc.

Again sorry about the confusion!

@Samoed Samoed mentioned this pull request Feb 8, 2025
@Samoed

Samoed commented Feb 10, 2025

Member

This benchmark is being developed in #2022.

@Samoed Samoed closed this Feb 10, 2025