[v2] Merge main 30 08 #3102

Merged
Samoed merged 120 commits into v2.0.0 from merge_main_30_08 on Sep 1, 2025

@Samoed Samoed commented Aug 30, 2025

Member

Also aligned the Regression task with classification

makram93 and others added 30 commits July 11, 2025 22:06
* feat: unify text and image embeddings for all tasks

* fix: uniform batch size

* fix: update error message

* fix: update code task

* fix: update max length

* fix: apply review suggestions
* feat: add KaLM_Embedding_X_0605 in kalm_models

* Update kalm_models.py for lint format

* kalm-emb-v2

* kalm-emb-v2

* kalm-emb-v2

* kalm-emb-v2

* kalm-emb-v2

---------

Co-authored-by: xinshuohu <xinshuohu@tencent.com>
Co-authored-by: Xinshuo Hu <yanshek.woo@gmail.com>
* Adding Classification Evaluator test

* Modifications due to the comments

* Update tests/test_evaluators/test_ClassificationEvaluator.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Update tests/test_evaluators/test_ClassificationEvaluator.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Modifications due to the comments

* Modifications due to the comments

---------

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* adding vidore benchmarks

* fix typo

* clean vidore names + per lang eval

* lint

* vidore names

* bibtex fix

* fix revision

* vidore v2 citation

* update citation format and fix per-language mappings

* lint: citations

* typo citations

* fix revisions

* lint

* fix colnomic3b revision

* fix colqwen2.5 revision + latest repo version

* fix query augmentation tokens

* colsmol revision
Automatically generated by python-semantic-release
* Adding Classification Evaluator test

* Modifications due to the comments

* Update tests/test_evaluators/test_ClassificationEvaluator.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Update tests/test_evaluators/test_ClassificationEvaluator.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Modifications due to the comments

* Modifications due to the comments

* Adding STSEvaluator and SummarizationEvaluator tests

* Correcting due to the comments

* Correcting due to the comments

---------

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Classification dataset cleaning

* Update pull request number

* Fix metadata test

* fix formatting

* add script for cleaning
Add JapaneseSentimentClassification
* change document to passage

* fix prompt names

* fix kwargs check

* fix default prompt
Automatically generated by python-semantic-release
add opensearch inf-free models

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* Add BareExamQA retrieval task

* ran linter

* updated details

* updated details

* fixed subtype name

* fixed changes

* ran linter again
specify revision for opensearch
Automatically generated by python-semantic-release
… been checked (#2940)

* fix: Only import SparseEncoder once sentence-transformer version has been checked

fixes #2936

* Update mteb/models/opensearch_neural_sparse_models.py

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>

---------

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
…2939)

The leaderboard had (silent) errors where `get_benchmark` raised a KeyError because "selector_state" was passed as a default value. Setting `DEFAULT_BENCMARK_NAME` as the default value solves this issue.
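
A minimal sketch of the failure mode described above, assuming a dictionary-backed benchmark registry. The registry contents, the benchmark names, and the two `load_selected_*` helpers are hypothetical and are not taken from the mteb code base; only `get_benchmark`, `DEFAULT_BENCMARK_NAME` (spelled as in the commit message), and the "selector_state" placeholder come from the text.

```python
# Hypothetical registry; the real benchmark names and objects will differ.
BENCHMARK_REGISTRY = {
    "Example(Multilingual)": {"tasks": ["TaskA", "TaskB"]},
    "Example(English)": {"tasks": ["TaskA"]},
}

# Spelled as in the commit message; the value here is illustrative only.
DEFAULT_BENCMARK_NAME = "Example(Multilingual)"


def get_benchmark(name: str) -> dict:
    """Look up a benchmark by name; unknown names raise KeyError."""
    return BENCHMARK_REGISTRY[name]


def load_selected_before(benchmark_name: str = "selector_state") -> dict:
    """Before the fix: the default is a placeholder, not a benchmark name."""
    return get_benchmark(benchmark_name)


def load_selected_after(benchmark_name: str = DEFAULT_BENCMARK_NAME) -> dict:
    """After the fix: the default is a name that exists in the registry."""
    return get_benchmark(benchmark_name)


if __name__ == "__main__":
    try:
        load_selected_before()
    except KeyError as err:
        print(f"KeyError for placeholder default: {err}")
    print(load_selected_after())
```

If the caller swallows exceptions, as a UI callback might, the `KeyError` in the first helper would go unnoticed, which matches the "silent" errors mentioned in the commit message.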
* docs: Update adding_a_dataset.md

* Update docs/adding_a_dataset.md
Automatically generated by python-semantic-release
* BSARD loader fixed

* BSARDv2 metadata fixed

* Update mteb/tasks/Retrieval/fra/BSARDRetrieval.py

---------

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Added govreport task

* Updated description
* Added BillSum datasets

* fixed billsumca

* Updated BillSumCA description

* Updated BillSumUS description

* Update mteb/tasks/Retrieval/eng/BillSumCA.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* Update mteb/tasks/Retrieval/eng/BillSumUS.py

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>

* lint

* lint

---------

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
…2716)

* Add RuSciBench

* fix bitext mining lang

* Add regression task

* fix init

* add missing files

* Improve description

* Add superseded_by

* fix lint

* Update regression task to match with v2

* Add stratified_subsampling for regression task

* Add bootstrap for regression task

* Rename task class, add model as evaluator argument

* fix import

* fix import 2

* fixes

* fix

* Rename regression model protocol
semantic-release and others added 8 commits August 28, 2025 14:41
Automatically generated by python-semantic-release
* Comment out bibtex formatting

* Remove `-n auto`

* get back bibtex

* try limiting versions

* revert coverage

* revert coverage

---------

Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* feat - Combine Plots and Tables into a Single Tab #3009

* feat - Resize the plot to make it more readable

* feat - Remove the (radar chart)

* feat - Add a comment stating that it only shows the Top 5 models in the table.

* feat - adjust layout

* Update mteb/leaderboard/app.py

* format

---------

Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
# Conflicts:
#	Makefile
#	docs/adding_a_dataset.md
#	mteb/abstasks/AbsTaskRetrieval.py
#	mteb/abstasks/TaskMetadata.py
#	mteb/abstasks/__init__.py
#	mteb/benchmarks/get_benchmark.py
#	mteb/encoder_interface.py
#	mteb/evaluation/evaluators/Image/Any2AnyRetrievalEvaluator.py
#	mteb/evaluation/evaluators/RerankingEvaluator.py
#	mteb/evaluation/evaluators/RetrievalEvaluator.py
#	mteb/evaluation/evaluators/__init__.py
#	mteb/leaderboard/app.py
#	mteb/leaderboard/benchmark_selector.py
#	mteb/models/cohere_v.py
#	mteb/models/model_implementations/bedrock_models.py
#	mteb/models/model_implementations/cohere_models.py
#	mteb/models/model_implementations/colbert_models.py
#	mteb/models/model_implementations/colpali_models.py
#	mteb/models/model_implementations/colqwen_models.py
#	mteb/models/model_implementations/colsmol_models.py
#	mteb/models/model_implementations/gme_v_models.py
#	mteb/models/model_implementations/google_models.py
#	mteb/models/model_implementations/jasper_models.py
#	mteb/models/model_implementations/jina_models.py
#	mteb/models/model_implementations/kalm_models.py
#	mteb/models/model_implementations/llm2vec_models.py
#	mteb/models/model_implementations/nomic_models.py
#	mteb/models/model_implementations/ru_sentence_models.py
#	mteb/models/model_implementations/voyage_models.py
#	mteb/models/model_implementations/voyage_v.py
#	mteb/models/overview.py
#	mteb/models/sentence_transformer_wrapper.py
#	mteb/models/vlm2vec_models.py
#	mteb/models/wrapper.py
#	mteb/tasks/BitextMining/__init__.py
#	mteb/tasks/Classification/__init__.py
#	mteb/tasks/Classification/ces/CSFDCZMovieReviewSentimentClassification.py
#	mteb/tasks/Classification/ces/CzechProductReviewSentimentClassification.py
#	mteb/tasks/Classification/ces/CzechSoMeSentimentClassification.py
#	mteb/tasks/Classification/dan/AngryTweetsClassification.py
#	mteb/tasks/Classification/dan/DKHateClassification.py
#	mteb/tasks/Classification/dan/DanishPoliticalCommentsClassification.py
#	mteb/tasks/Classification/dan/DdiscoCohesionClassification.py
#	mteb/tasks/Classification/deu/GermanPoliticiansTwitterSentimentClassification.py
#	mteb/tasks/Classification/deu/TenKGnadClassification.py
#	mteb/tasks/Classification/eng/AmazonPolarityClassification.py
#	mteb/tasks/Classification/eng/ArxivClassification.py
#	mteb/tasks/Classification/eng/Banking77Classification.py
#	mteb/tasks/Classification/eng/DBpediaClassification.py
#	mteb/tasks/Classification/eng/EmotionClassification.py
#	mteb/tasks/Classification/eng/FinancialPhrasebankClassification.py
#	mteb/tasks/Classification/eng/FrenkEnClassification.py
#	mteb/tasks/Classification/eng/ImdbClassification.py
#	mteb/tasks/Classification/eng/LegalBenchClassification.py
#	mteb/tasks/Classification/eng/NewsClassification.py
#	mteb/tasks/Classification/eng/PatentClassification.py
#	mteb/tasks/Classification/eng/PoemSentimentClassification.py
#	mteb/tasks/Classification/eng/SDSEyeProtectionClassification.py
#	mteb/tasks/Classification/eng/SDSGlovesClassification.py
#	mteb/tasks/Classification/eng/ToxicChatClassification.py
#	mteb/tasks/Classification/eng/ToxicConversationsClassification.py
#	mteb/tasks/Classification/eng/TweetSentimentExtractionClassification.py
#	mteb/tasks/Classification/eng/TweetTopicSingleClassification.py
#	mteb/tasks/Classification/eng/WikipediaBioMetChemClassification.py
#	mteb/tasks/Classification/eng/WikipediaChemFieldsClassification.py
#	mteb/tasks/Classification/eng/WikipediaCompChemSpectroscopyClassification.py
#	mteb/tasks/Classification/eng/WikipediaCrystallographyAnalyticalClassification.py
#	mteb/tasks/Classification/eng/WikipediaTheoreticalAppliedClassification.py
#	mteb/tasks/Classification/eng/YahooAnswersTopicsClassification.py
#	mteb/tasks/Classification/eng/YelpReviewFullClassification.py
#	mteb/tasks/Classification/est/estonian_valence.py
#	mteb/tasks/Classification/fas/FaMTEBClassification.py
#	mteb/tasks/Classification/fil/FilipinoHateSpeechClassification.py
#	mteb/tasks/Classification/fin/FinToxicityClassification.py
#	mteb/tasks/Classification/fra/FrenchBookReviews.py
#	mteb/tasks/Classification/fra/MovieReviewSentimentClassification.py
#	mteb/tasks/Classification/guj/GujaratiNewsClassification.py
#	mteb/tasks/Classification/heb/HebrewSentimentAnalysis.py
#	mteb/tasks/Classification/hin/HindiDiscourseClassification.py
#	mteb/tasks/Classification/hin/SentimentAnalysisHindi.py
#	mteb/tasks/Classification/hrv/FrenkHrClassification.py
#	mteb/tasks/Classification/ind/IndonesianIdClickbaitClassification.py
#	mteb/tasks/Classification/ind/IndonesianMongabayConservationClassification.py
#	mteb/tasks/Classification/ita/ItalianLinguistAcceptabilityClassification.py
#	mteb/tasks/Classification/jav/JavaneseIMDBClassification.py
#	mteb/tasks/Classification/jpn/WRIMEClassification.py
#	mteb/tasks/Classification/kan/KannadaNewsClassification.py
#	mteb/tasks/Classification/kor/KlueTC.py
#	mteb/tasks/Classification/kor/KorHateClassification.py
#	mteb/tasks/Classification/kor/KorSarcasmClassification.py
#	mteb/tasks/Classification/kur/KurdishSentimentClassification.py
#	mteb/tasks/Classification/mal/MalayalamNewsClassification.py
#	mteb/tasks/Classification/mar/MarathiNewsClassification.py
#	mteb/tasks/Classification/mkd/MacedonianTweetSentimentClassification.py
#	mteb/tasks/Classification/mya/MyanmarNews.py
#	mteb/tasks/Classification/nep/NepaliNewsClassification.py
#	mteb/tasks/Classification/nld/DutchBookReviewSentimentClassification.py
#	mteb/tasks/Classification/nob/NoRecClassification.py
#	mteb/tasks/Classification/nob/NorwegianParliamentClassification.py
#	mteb/tasks/Classification/ory/OdiaNewsClassification.py
#	mteb/tasks/Classification/pol/PolishClassification.py
#	mteb/tasks/Classification/ron/Moroco.py
#	mteb/tasks/Classification/ron/RomanianReviewsSentiment.py
#	mteb/tasks/Classification/ron/RomanianSentimentClassification.py
#	mteb/tasks/Classification/rus/GeoreviewClassification.py
#	mteb/tasks/Classification/rus/HeadlineClassification.py
#	mteb/tasks/Classification/rus/InappropriatenessClassification.py
#	mteb/tasks/Classification/rus/RuReviewsClassification.py
#	mteb/tasks/Classification/rus/RuSciBenchGRNTIClassification.py
#	mteb/tasks/Classification/rus/RuSciBenchOECDClassification.py
#	mteb/tasks/Classification/rus/ru_toixic_classification_okmlcup.py
#	mteb/tasks/Classification/rus/senti_ru_eval.py
#	mteb/tasks/Classification/sin/SinhalaNewsClassification.py
#	mteb/tasks/Classification/sin/SinhalaNewsSourceClassification.py
#	mteb/tasks/Classification/slk/CSFDSKMovieReviewSentimentClassification.py
#	mteb/tasks/Classification/slk/SlovakHateSpeechClassification.py
#	mteb/tasks/Classification/slv/FrenkSlClassification.py
#	mteb/tasks/Classification/spa/SpanishNewsClassification.py
#	mteb/tasks/Classification/spa/SpanishSentimentClassification.py
#	mteb/tasks/Classification/ssw/SiswatiNewsClassification.py
#	mteb/tasks/Classification/svk/SlovakMovieReviewSentimentClassification.py
#	mteb/tasks/Classification/swa/SwahiliNewsClassification.py
#	mteb/tasks/Classification/swe/DalajClassification.py
#	mteb/tasks/Classification/swe/SweRecClassification.py
#	mteb/tasks/Classification/swe/SwedishSentimentClassification.py
#	mteb/tasks/Classification/tam/TamilNewsClassification.py
#	mteb/tasks/Classification/tel/TeluguAndhraJyotiNewsClassification.py
#	mteb/tasks/Classification/tha/WisesightSentimentClassification.py
#	mteb/tasks/Classification/tsn/TswanaNewsClassification.py
#	mteb/tasks/Classification/tur/TurkishMovieSentimentClassification.py
#	mteb/tasks/Classification/tur/TurkishProductSentimentClassification.py
#	mteb/tasks/Classification/ukr/UkrFormalityClassification.py
#	mteb/tasks/Classification/urd/UrduRomanSentimentClassification.py
#	mteb/tasks/Classification/vie/VieStudentFeedbackClassification.py
#	mteb/tasks/Classification/zho/CMTEBClassification.py
#	mteb/tasks/Classification/zho/YueOpenriceReviewClassification.py
#	mteb/tasks/Classification/zul/IsiZuluNewsClassification.py
#	mteb/tasks/Clustering/__init__.py
#	mteb/tasks/Image/Any2AnyRetrieval/__init__.py
#	mteb/tasks/PairClassification/__init__.py
#	mteb/tasks/Reranking/__init__.py
#	mteb/tasks/Retrieval/__init__.py
#	mteb/tasks/STS/__init__.py
#	pyproject.toml
#	tests/test_benchmark/mock_models.py
#	tests/test_benchmark/test_benchmark.py
#	tests/test_models/test_model_meta.py
#	tests/test_reproducible_workflow.py
@Samoed Samoed added the v2 label Aug 30, 2025
@Samoed Samoed changed the base branch from main to v2.0.0 August 30, 2025 18:56
@isaac-chung isaac-chung mentioned this pull request Aug 30, 2025
5 tasks
Samoed commented Aug 30, 2025

Member Author

Tests are now failing because some tasks are missing metadata. I'll calculate it later.

@KennethEnevoldsen KennethEnevoldsen left a comment

Contributor

Minor question but generally looks good - thanks for doing the merge

Contributor

What do you mean when you say that you aligned it with classification?

@Samoed Samoed Sep 1, 2025

Member Author

Mostly everything is in e78d04c: I changed `LinearRegressionEvaluator.__init__` and `__call__` to work with datasets, and in `AbsTaskTextRegression` I aligned `_evaluate_subset` with v2.

Contributor

Cool. It made me think that it might be easy to merge it with classification (but maybe not). Regardless, that is for another PR, so feel free to merge.

@Samoed Samoed merged commit 94edcc6 into v2.0.0 Sep 1, 2025
9 checks passed
@Samoed Samoed deleted the merge_main_30_08 branch September 1, 2025 10:15