feat: Forbid task metadata and add upload functions #1362

Merged
Samoed merged 21 commits into embeddings-benchmark:v2.0.0 from Samoed:forbid_task_metadata
Dec 4, 2024

Conversation

Samoed commented Oct 30, 2024

Member

Checklist

  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.

Some retrieval tasks need to be reuploaded because they are loaded from different repositories. I’ve created upload functions to convert these datasets into our current format. I tested each reuploaded dataset, and the scores matched, except for mFollowIR (rus). In the main branch, the main_score is -0.039465099069488106, whereas the reuploaded dataset gives -0.031187925634321677. However, this run was only for testing purposes.
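
The score comparison described above can be sketched as a tolerance check. This is only an illustration, not the script used in this PR: the function name `scores_match` and the tolerance are assumptions, and the two values are the mFollowIR (rus) scores quoted above.

```python
import math

# main_score values quoted in this comment for mFollowIR (rus)
MAIN_SCORE = -0.039465099069488106
REUPLOADED_SCORE = -0.031187925634321677


def scores_match(reference: float, candidate: float, abs_tol: float = 1e-6) -> bool:
    """Return True when two main_score values agree within abs_tol.

    Hypothetical helper: the tolerance is an assumption, not a project setting.
    """
    return math.isclose(reference, candidate, rel_tol=0.0, abs_tol=abs_tol)


if __name__ == "__main__":
    # The two runs differ by roughly 0.008, so this check reports a mismatch.
    print(scores_match(MAIN_SCORE, REUPLOADED_SCORE))
```

An absolute tolerance is used here because the scores are close to zero, where a relative tolerance would be overly strict.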

Initially, I tried adding this script to the mteb folder, but it gave an error: AttributeError: module 'logging' has no attribute 'getLogger'. So, I moved it to the scripts folder.
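
One common cause of this kind of error is a local file named `logging.py` sitting next to the script and shadowing the standard-library module. Whether that is what happened here is an assumption; the thread does not confirm it. A minimal reproduction, with hypothetical file names, might look like this:

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_with_shadowed_logging() -> str:
    """Run a tiny script next to a local logging.py and return its stderr."""
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp)
        # Hypothetical local module that shadows the standard-library one.
        (folder / "logging.py").write_text("# empty local module\n")
        script = folder / "upload_script.py"
        script.write_text("import logging\nlogging.getLogger(__name__)\n")
        # The script's own folder is searched first, so `import logging`
        # picks up the empty local file instead of the stdlib module.
        # -S skips site hooks so nothing pre-imports the real logging module.
        result = subprocess.run(
            [sys.executable, "-S", str(script)],
            capture_output=True,
            text=True,
        )
        return result.stderr


if __name__ == "__main__":
    print(run_with_shadowed_logging())
```

Moving the script to a folder without a conflicting module name, as was done here with the scripts folder, avoids the clash.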

Additionally, some tasks may not be imported successfully. For example, I tried to load IndicXnliPairClassification, but it resulted in an error.

uploaded.zip
mteb main.zip

@KennethEnevoldsen

Contributor

@Samoed is this PR stale?

Samoed commented Nov 11, 2024

Member Author

Yes, some datasets (mostly from CMTEB) load data from multiple repositories on HF, so we need to convert them first to complete this PR.
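
As a rough sketch of what such a conversion involves, the function below merges rows that would come from separate sources into one structure. The field names (`id`, `text`, `query_id`, `doc_id`, `score`), the output layout, and the function name are assumptions for illustration; they are not taken from the upload functions in this PR.

```python
from collections import defaultdict


def merge_retrieval_sources(corpus_rows, query_rows, qrel_rows):
    """Combine rows from three separate sources into a single dictionary.

    Hypothetical helper: all field names and the output layout are assumed.
    """
    corpus = {row["id"]: {"text": row["text"]} for row in corpus_rows}
    queries = {row["id"]: row["text"] for row in query_rows}
    relevant_docs = defaultdict(dict)
    for row in qrel_rows:
        # Map each query id to its judged documents and relevance scores.
        relevant_docs[row["query_id"]][row["doc_id"]] = int(row["score"])
    return {
        "corpus": corpus,
        "queries": queries,
        "relevant_docs": dict(relevant_docs),
    }


if __name__ == "__main__":
    merged = merge_retrieval_sources(
        corpus_rows=[{"id": "d1", "text": "first document"}],
        query_rows=[{"id": "q1", "text": "example query"}],
        qrel_rows=[{"query_id": "q1", "doc_id": "d1", "score": 1}],
    )
    print(merged["relevant_docs"])
```

Once the pieces live in one structure, they can be uploaded to a single repository so the task no longer depends on several sources.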

@KennethEnevoldsen KennethEnevoldsen changed the base branch from main to v2.0.0 November 11, 2024 10:02
@Samoed Samoed closed this Nov 26, 2024
@Samoed Samoed reopened this Nov 26, 2024
# Conflicts:
#	mteb/abstasks/TaskMetadata.py
@Samoed Samoed changed the title feat: Forbid task metadata feat: Forbid task metadata and add upload functions Dec 1, 2024
@Samoed Samoed marked this pull request as ready for review December 1, 2024 10:51

orionw commented Dec 1, 2024

Contributor

I’m traveling and won’t be at a computer til the end of the week, but this looks good.

Are there any datasets that are still not converted?

And is the mFollowIR Russian still an issue? FWIW the v2 branch fixed a small bug that the current one doesn’t have, so the numbers from main and v2 will be different. The number looks reasonable and I wouldn’t worry about it.

Samoed commented Dec 1, 2024

Member Author

I tested from the v2.0.0 branch, but this is mostly not an issue, because I was only testing how the uploader works with multilingual tasks. I didn't change the mFollowIR dataset version.

Samoed commented Dec 3, 2024

Member Author

@KennethEnevoldsen Can you review, please?

@KennethEnevoldsen KennethEnevoldsen left a comment

Contributor

A few minor things, plus a suggestion to move the upload utility to the class object (assuming we want to maintain it).

Generally though this looks great!

Comment thread mteb/tasks/Retrieval/multilingual/WikipediaRetrievalMultilingual.py Outdated
Comment thread tests/test_TaskMetadata.py
Comment thread scripts/upload_utils.py Outdated
Comment thread scripts/upload_utils.py Outdated
Comment thread scripts/upload_utils.py Outdated
Samoed and others added 4 commits December 4, 2024 23:44
…al.py

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
* fix FilipinoHateSpeechClassification

* update tests

@KennethEnevoldsen KennethEnevoldsen left a comment

Contributor

Loving it <3!

@Samoed Samoed merged commit dec5d6a into embeddings-benchmark:v2.0.0 Dec 4, 2024
@Samoed Samoed deleted the forbid_task_metadata branch October 20, 2025 15:47

3 participants