Integrate BIRCO Benchmark Datasets into MTEB with Graded Evaluation Support #1947
AdnanElAssadi56 wants to merge 5 commits into
I think this is about issue #818?
    if len(sample["positive"]) > 0 and len(sample["negative"]) > 0
    ]

    def _detect_graded_relevance(self):

    return qrels, run

    def _calculate_graded_metrics(self, qrels, run):
Use the methods already used in this evaluator to compute the metrics.
    avg_character_length={"test": 800}
    )

    def get_query(self, sample):
For instructions, you should specify the prompt in TaskMetadata.
I can't find any usage of get_query in the code.
Integrate the loader inside each task. You can move all the tasks into one file and use a single function there.
    # Load dataset.json from the HF hub directory specified in metadata.
    # NOTE: Ensure the dataset is available at the path provided in metadata.
    dataset_path = Path(self.metadata.dataset["path"]) / "dataset.json"
You need to re-upload the datasets so that they can be loaded using the standard datasets methods.
    import logging
    from typing import Any

    import pytrec_eval
Use our standard methods for evaluation.
    qrels, run = self._prepare_graded_evaluation(model)
    return self._calculate_graded_metrics(qrels, run)

    def _compute_legacy_metrics(self, model):

    ),
    reference="https://github.com/BIRCO-benchmark/BIRCO",
    dataset={
    "path": "bpHigh/BIRCO_ArguAna",
Where is this stored? I can't find it on HF.
@orionw You might be interested in this as it is related to the reranking tasks. In general, I think we should consider integrating this into the
Hi @AdnanElAssadi56, and thanks for the PR! I would really love to have BIRCO added! Sorry about the confusion regarding the v2 branch. Given that we are moving to it soon (in the next couple of weeks), I would suggest you merge into there. I think that should also make a lot of this easier: reranking there already uses pytrec_eval and can work for graded relevance, has an easier loading style, etc. Again, sorry about the confusion!
This benchmark is being developed in #2022.
Integrate BIRCO Benchmark Datasets into MTEB with Graded Evaluation Support
Description
This pull request integrates the BIRCO (Benchmark of Information Retrieval Tasks with Complex Objectives) datasets into the MTEB framework. The changes include:
New Base Class (BIRCOBase)
A new base class is added under mteb/tasks/Reranking/eng/BIRCO/BIRCOBase.py to load dataset JSON files from the Hugging Face hub.
Assumption: The datasets from user bpHigh are hosted on Hugging Face, and each dataset’s JSON file is expected to be located at {metadata.dataset["path"]}/dataset.json. Please verify that these files exist and are accessible.
Dataset-Specific Task Classes
Five new dataset-specific classes have been implemented (each inheriting from BIRCOBase) in the directory mteb/tasks/Reranking/eng/BIRCO/:
- BIRCODorisMaeReranking.py
- BIRCOArguAnaReranking.py
- BIRCOClinicalTrialReranking.py
- BIRCORELICReranking.py
- BIRCOWhatsThatBookReranking.py
Each class contains a detailed metadata object (with fields such as name, description, reference, dataset path, revision, evaluation splits, etc.), and implements:
- get_query: Returns the query prepended with a task-specific instruction.
- get_positive_docs and get_negative_docs: Return the respective lists from the sample.
Placeholders:
- The revision field (e.g., "YOUR_REVISION_PLACEHOLDER") must be updated with the actual dataset revision hash from Hugging Face.
- Date fields (e.g., ("2024-04-03", "2024-04-03")) should be updated as necessary.
Evaluator Adjustments
Updates in mteb/evaluation/evaluators/RerankingEvaluator.py ensure that graded relevance samples are correctly evaluated using pytrec_eval, computing metrics such as nDCG@10 and Recall@10.
Test Coverage
A new test file tests/test_birco.py has been added to validate the integration. This test ensures:
- The get_query method returns a string that starts with “Instruction:”.
This suite can be expanded as needed.
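To make the graded evaluation described under Evaluator Adjustments more concrete, here is a minimal, dependency-free sketch of nDCG@10 and Recall@10 for a single query. It is not the PR's evaluator code and does not call pytrec_eval. The function names (is_graded, dcg_at_k, ndcg_at_k, recall_at_k) and the example judgments are hypothetical, and the sketch assumes linear gain, so its values may differ from implementations that use an exponential gain.

```python
import math


def is_graded(qrels):
    """Return True if any judgment is something other than 0 or 1."""
    return any(rel not in (0, 1) for rel in qrels.values())


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains (linear gain assumed)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(qrels, run, k=10):
    """nDCG@k for one query.

    qrels maps doc_id -> graded relevance; run maps doc_id -> model score.
    """
    ranked = sorted(run, key=lambda doc_id: run[doc_id], reverse=True)
    gains = [qrels.get(doc_id, 0) for doc_id in ranked]
    ideal_dcg = dcg_at_k(sorted(qrels.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def recall_at_k(qrels, run, k=10):
    """Fraction of relevant documents (relevance > 0) found in the top k."""
    relevant = {doc_id for doc_id, rel in qrels.items() if rel > 0}
    if not relevant:
        return 0.0
    top_k = sorted(run, key=lambda doc_id: run[doc_id], reverse=True)[:k]
    return len(relevant.intersection(top_k)) / len(relevant)


# Hypothetical judgments and scores for a single query.
example_qrels = {"d1": 2, "d2": 1, "d3": 0}
good_run = {"d1": 0.9, "d2": 0.8, "d3": 0.1}
bad_run = {"d1": 0.1, "d2": 0.5, "d3": 0.9}
```

With these example inputs, good_run ranks the documents in ideal order and bad_run ranks them in reverse, so ndcg_at_k is lower for bad_run. Binary qrels pass through the same functions unchanged, which is one reason a single code path can serve both graded and non-graded tasks.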
How to Test
From the repository root, execute:
This confirms that all BIRCO task classes are loaded correctly and their methods produce the expected outputs.
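A check of this kind might look like the sketch below. The class HypotheticalBIRCOTask, its instruction text, the "query" key, and the check_task helper are stand-ins invented for illustration; only the "positive" and "negative" sample keys and the “Instruction:” prefix come from this PR.

```python
class HypotheticalBIRCOTask:
    """Stand-in for a BIRCO task class; not the PR's actual implementation."""

    # Assumed wording; the real task-specific instructions are not shown here.
    instruction = "Retrieve passages that counter the given argument."

    def get_query(self, sample):
        # Prepend the task-specific instruction to the raw query text.
        return f"Instruction: {self.instruction} Query: {sample['query']}"

    def get_positive_docs(self, sample):
        return sample["positive"]

    def get_negative_docs(self, sample):
        return sample["negative"]


def check_task(task, sample):
    """Raise AssertionError if the task's accessors misbehave on the sample."""
    query = task.get_query(sample)
    assert isinstance(query, str)
    assert query.startswith("Instruction:")
    assert task.get_positive_docs(sample) == sample["positive"]
    assert task.get_negative_docs(sample) == sample["negative"]
    return True


# Hypothetical sample; the "query" key is an assumption.
example_sample = {
    "query": "Nuclear power is too risky.",
    "positive": ["A passage arguing that modern reactors are safe."],
    "negative": ["A passage about an unrelated topic."],
}
```

Calling check_task(HypotheticalBIRCOTask(), example_sample) exercises all three accessors in one place, so the same helper could be reused across the five task classes.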
You can run a local evaluation using one of the new tasks, for example:
Replace <your-model> with your chosen model to verify that evaluation metrics are computed properly.
Outstanding Items
Metadata Revisions & Dates:
- Replace the placeholders ("YOUR_REVISION_PLACEHOLDER", date fields) with the correct dataset revision hashes and date values from Hugging Face.
- The datasets are assumed to be hosted by user bpHigh; please verify that the corresponding dataset.json files exist at the specified paths.
Dataset Availability Check:
- Confirm that each dataset's JSON file exists at {metadata.dataset["path"]}/dataset.json to avoid runtime errors.
Next Steps
Review & Feedback:
Merge Process:
Thank you for reviewing this integration. I look forward to your feedback and suggestions for further improvements.