add MIEB results and rename model to pass tests #122

Merged
gowitheflow-1998 merged 5 commits into main from add-mieb-results on Feb 23, 2025

Conversation

@isaac-chung

@isaac-chung isaac-chung commented Feb 16, 2025

Contributor

Fixes embeddings-benchmark/mteb#1823

Add MIEB results. The following models have been renamed to add org name (based on local test failures):

Related MTEB issue: embeddings-benchmark/mteb#2074

Checklist

  • Run tests locally to make sure nothing is broken using make test.
  • Run the results files checker make pre-push.

Adding a model checklist

  • I have added model implementation to mteb/models/ directory. Instruction to add a model can be found here in the following PR ____

@isaac-chung

Contributor Author

When pointing embeddings-benchmark/mteb#2035 to this branch, it seems like MIEB results cannot be displayed due to "Number of parameters".

@isaac-chung

isaac-chung commented Feb 16, 2025

Contributor Author

@gowitheflow-1998 @KennethEnevoldsen here's a screenshot of the leaderboard (LB), hacked to point to this branch. The eng and lite versions were able to render as well. The cache needed to be wiped.

Screenshot 2025-02-16 at 21 55 38

@gowitheflow-1998

Member

There are a few tasks where the main metric was wrong when we implemented them and doesn't match the paper. Let me double-check all tasks and get back. It might be a good idea to replace the scores in the main metric with the actual main metrics before we merge, I think.

isaac-chung marked this pull request as draft February 17, 2025 03:25

@KennethEnevoldsen

Contributor

Also, it seems like the performance vs. model size plot needs some model references. You can add these in:

mteb.leaderboard.figures.models_to_annotate which is currently:

models_to_annotate = [
    "all-MiniLM-L6-v2",
    "GritLM-7B",
    "LaBSE",
    "multilingual-e5-large-instruct",
]

@isaac-chung

isaac-chung commented Feb 18, 2025

Contributor Author

Also, it seems like the performance vs. model size plot needs some model references. You can add these in:

mteb.leaderboard.figures.models_to_annotate which is currently:

models_to_annotate = [
    "all-MiniLM-L6-v2",
    "GritLM-7B",
    "LaBSE",
    "multilingual-e5-large-instruct",
]

What does "some model reference" mean? How do we select the models for this list?

Figured it out 👍

[update] Added a few models that ranked first from a few task types:

  • "EVA02-CLIP-bigE-14-plus"
  • "voyage-multimodal-3"
  • "e5-v"
  • "VLM2Vec-Full"
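As a rough sketch of what this update could look like, the snippet below extends the list quoted earlier in the thread with the four models named above. The variable name `mieb_models_to_annotate` is hypothetical, and the actual change made in `mteb.leaderboard.figures` may differ.

```python
# Reference models quoted earlier in this thread.
models_to_annotate = [
    "all-MiniLM-L6-v2",
    "GritLM-7B",
    "LaBSE",
    "multilingual-e5-large-instruct",
]

# Models named in the update above (hypothetical variable name).
mieb_models_to_annotate = [
    "EVA02-CLIP-bigE-14-plus",
    "voyage-multimodal-3",
    "e5-v",
    "VLM2Vec-Full",
]

# Append each new model once, preserving the existing order.
for name in mieb_models_to_annotate:
    if name not in models_to_annotate:
        models_to_annotate.append(name)
```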

@isaac-chung

Contributor Author

The performance per task type plot isn't showing though 🤔 It says it only contains one task type when there are 8.

@KennethEnevoldsen

Contributor

hmm not sure why this is happening - @x-tabdeveloping do you have an idea?

@x-tabdeveloping

Contributor

I'll have a look at it tomorrow

@x-tabdeveloping

Contributor

@isaac-chung My guess would be it's because of mteb.leaderboard.figures.task_types:

task_types = [
    "BitextMining",
    "Classification",
    "MultilabelClassification",
    "Clustering",
    "PairClassification",
    "Reranking",
    "Retrieval",
    "STS",
    "Summarization",
    # "InstructionRetrieval",
    # Not displayed, because the scores are negative,
    # doesn't work well with the radar chart.
    "Speed",
]

The reason I made this list was because instruction retrieval shows scores in the negatives and that doesn't really work with the radar chart.
We could either extend this list or just make a list with the exceptions and infer the list of task types from somewhere else.
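The second option (an exception list plus inferred task types) might look roughly like the sketch below. The names `EXCLUDED_TASK_TYPES` and `infer_task_types` are hypothetical and not part of mteb, and the input is assumed to be a plain sequence of task-type strings collected from whatever tasks the benchmark contains.

```python
# Task types that should not appear on the radar chart (hypothetical name).
# InstructionRetrieval is excluded because its scores can be negative.
EXCLUDED_TASK_TYPES = {"InstructionRetrieval"}


def infer_task_types(task_type_per_task, excluded=EXCLUDED_TASK_TYPES):
    """Return distinct task types in first-seen order, minus the exclusions."""
    seen = []
    for task_type in task_type_per_task:
        if task_type in excluded or task_type in seen:
            continue
        seen.append(task_type)
    return seen


# Illustrative input: one task-type string per task in a benchmark.
example = [
    "Retrieval",
    "Clustering",
    "InstructionRetrieval",
    "Retrieval",
    "Any2AnyRetrieval",
]
print(infer_task_types(example))
```

With this approach a new benchmark's task types would show up without editing a hard-coded list, and only the types known to break the chart need to be maintained by hand.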

@isaac-chung

Contributor Author

@isaac-chung My guess would be it's because of mteb.leaderboard.figures.task_types:

task_types = [
    "BitextMining",
    "Classification",
    "MultilabelClassification",
    "Clustering",
    "PairClassification",
    "Reranking",
    "Retrieval",
    "STS",
    "Summarization",
    # "InstructionRetrieval",
    # Not displayed, because the scores are negative,
    # doesn't work well with the radar chart.
    "Speed",
]

The reason I made this list was because instruction retrieval shows scores in the negatives and that doesn't really work with the radar chart. We could either extend this list or just make a list with the exceptions and infer the list of task types from somewhere else.

That's it. Thanks! It's working now.

@gowitheflow-1998

Member

Fixed the main metric issue by overwriting the main scores with the actual main metric scores; deleted the previous incomplete Jina runs from an old version that only had a few task results.

Overwritten scores include:

task_metric_mapping = {
    "BLINKIT2IRetrieval.json": "cv_recall_at_1",
    "BLINKIT2TRetrieval.json": "cv_recall_at_1",
    "ImageCoDeT2IRetrieval.json": "cv_recall_at_3",
    "ROxfordEasyI2IMultiChoice.json": "map_at_5",
    "ROxfordMediumI2IMultiChoice.json": "map_at_5",
    "ROxfordHardI2IMultiChoice.json": "map_at_5",
    "RParisEasyI2IMultiChoice.json": "map_at_5",
    "RParisMediumI2IMultiChoice.json": "map_at_5",
    "RParisHardI2IMultiChoice.json": "map_at_5",
    "TinyImageNetClustering.json": "nmi",
    "CIFAR10Clustering.json": "nmi",
    "CIFAR100Clustering.json": "nmi",
    "ImageNet10Clustering.json": "nmi",
    "ImageNetDog15Clustering.json": "nmi",
}
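The script used for the overwrite is not shown in this thread. A minimal sketch of the idea, assuming each result file parses to a dict shaped like `{"scores": {split: [entry, ...]}}` where every entry holds the individual metrics alongside `main_score`, could be written as follows. The helper name `overwrite_main_score` and the numbers in the example are made up for illustration.

```python
import copy


def overwrite_main_score(result, metric):
    """Return a copy of `result` with each entry's main_score set to `metric`.

    Assumes the layout {"scores": {split: [entry, ...]}}; entries that do not
    contain `metric` are left unchanged.
    """
    updated = copy.deepcopy(result)
    for split_entries in updated.get("scores", {}).values():
        for entry in split_entries:
            if metric in entry:
                entry["main_score"] = entry[metric]
    return updated


# Made-up example: a clustering result whose main_score held v_measure.
example = {
    "scores": {
        "test": [{"nmi": 0.61, "v_measure": 0.58, "main_score": 0.58}],
    }
}
fixed = overwrite_main_score(example, "nmi")
```

In practice this would be applied per file, looking up the metric by file name in a mapping such as the one above, then writing the updated dict back as JSON.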

@isaac-chung

Contributor Author

@gowitheflow-1998 good stuff! Are we ready to merge?

gowitheflow-1998 marked this pull request as ready for review February 23, 2025 13:59
gowitheflow-1998 merged commit de7d977 into main Feb 23, 2025

@gowitheflow-1998

Member

@gowitheflow-1998 good stuff! Are we ready to merge?

yeah, merged! adding @Muennighoff as co-author for running most of the results here!

@Muennighoff

Contributor

Does this have everything from https://github.com/embeddings-benchmark/tmp i.e. we can safely delete that repo?

@gowitheflow-1998

Member

Does this have everything from https://github.com/embeddings-benchmark/tmp i.e. we can safely delete that repo?

yeah! all results are here

Development

Successfully merging this pull request may close these issues.

[MIEB] migrate results from tmp repo to results repo

6 participants