Add Summarization task#11

Merged
NouamaneTazi merged 3 commits into main from summarization on Jun 24, 2022

@NouamaneTazi

Member

No description provided.

@NouamaneTazi NouamaneTazi merged commit bdb2691 into main Jun 24, 2022
@NouamaneTazi NouamaneTazi deleted the summarization branch June 24, 2022 13:27
Muennighoff added a commit that referenced this pull request Feb 19, 2024
* feat: add xmarket es dataset

* refactor: use multilingual dataset

* fix: update revision id

* refactor: add constant for language

* feat: add two clustering datasets

Signed-off-by: jupyterjazz <saba.sturua@jina.ai>

* feat: import classes

Signed-off-by: jupyterjazz <saba.sturua@jina.ai>

* refactor: flores dataset

Signed-off-by: jupyterjazz <saba.sturua@jina.ai>

* feat: add miracl reranking task for spanish

* feat: use hf repo with all reranking langs

* feat: update revision hash

* refactor: use description for language

* feat: add stses task

* fix: get scores from label column

* refactor: add revision to data loading

* Added spanish passage retrieval

* feat: mintaka and xpqa retrieval tasks

Signed-off-by: jupyterjazz <saba.sturua@jina.ai>

* feat: import classes

Signed-off-by: jupyterjazz <saba.sturua@jina.ai>

* fix: typo in data loading

* fix: id

Signed-off-by: jupyterjazz <saba.sturua@jina.ai>

* refactor: try out multilingual task

Signed-off-by: jupyterjazz <saba.sturua@jina.ai>

* refactor: multilingual task import

Signed-off-by: jupyterjazz <saba.sturua@jina.ai>

* refactor: cmon man

Signed-off-by: jupyterjazz <saba.sturua@jina.ai>

* refactor: go back to monolingual tasks

Signed-off-by: jupyterjazz <saba.sturua@jina.ai>

* refactor: remove unused import

Signed-off-by: jupyterjazz <saba.sturua@jina.ai>

* refactor: loading logic

Signed-off-by: jupyterjazz <saba.sturua@jina.ai>

* feat: add miracl as retrieval task

* fix: nested corpus

* refactor: get lang from description

* Update mteb/tasks/Retrieval/MIRACLRetrieval.py

Co-authored-by: Michael Günther <michael.guenther@jina.ai>

* feat: allow multilingual reranking tasks

* feat: make miraclreranking multilingual

* refactor: rename miraclretrieval

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* style: add missing eof empty line

* feat: make xmarket retrieval task multilingual

* refactor: rename xmarket

* refactor: turn spanish tasks multilingual (#11)

* refactor: make xpqa retrieval multilingual

* fix: formatting of xpqa dataset

* refactor: make mintaka into multilingual task

* refactor: make miracl retrieval multilingual

* feat: add revision ids for hf datasets

* refactor: remove patool

* Update mteb/tasks/Reranking/__init__.py

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* Update mteb/tasks/STS/__init__.py

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

---------

Signed-off-by: jupyterjazz <saba.sturua@jina.ai>
Co-authored-by: guenthermi <guenthermi50@gmail.com>
Co-authored-by: jupyterjazz <saba.sturua@jina.ai>
Co-authored-by: Markus Krimmel <markus.krimmel@jina.ai>
Co-authored-by: Michael Günther <michael.guenther@jina.ai>
Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>
Muennighoff added a commit that referenced this pull request Feb 22, 2024
* add Masakhane dataset config

* add trigram lang code for dataset who use it

* create french script eval

* fix French word

* add some documentation

* add script to process and upload alloprof on HF

* build script for HF

* adding dataset processing for mteb

* add script to process and upload alloprof on HF

* build script for HF

* adding dataset processing for mteb

* refactor few thing

* remove whitespaces

* 4 pair classification (#10)

* add Opusparcus dataset

* multilingual usage

* use eval_split of config files

* change eval_split according to data

---------

Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr>

* add script to process and upload alloprof on HF

* build script for HF

* adding dataset processing for mteb

* refactor few thing

* remove whitespaces

* Clustering with HAL S2S dataset (#11)

HAL S2S dataset creation and evaluation on clustering task.

* adding BSARD dataset

* add BSARD to benchmark

* adding Hagrid dataset

* DiaBLa and Flores Bitext Mining evaluation (#12)

* Add DiaBLa dataset for bitext mining

* Add DiaBLa dataset for bitext mining

* deduplicate bitext task

* add Flores

* format files

* add flores to evaluation script

* remove prints

* add revision

---------

Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr>

* add script to process and upload alloprof on HF

* build script for HF

* adding dataset processing for mteb

* refactor few thing

* remove whitespaces

* adding dataset processing for mteb

* adding BSARD dataset

* add BSARD to benchmark

* adding Hagrid dataset

* fix change on langmapping

* reset alphabetical order

* add revision handling

* Clustering: Add AlloProf dataset  (#17)

AlloProf dataset for clustering task

* handling of revision

* change split + add revision handling

* add script to process and upload alloprof on HF

* build script for HF

* adding dataset processing for mteb

* refactor few thing

* remove whitespaces

* adding dataset processing for mteb

* adding BSARD dataset

* add BSARD to benchmark

* adding Hagrid dataset

* add script to process and upload alloprof on HF

* adding dataset processing for mteb

* refactor few thing

* reset alphabetical order

* add revision handling

* handling of revision

* change split + add revision handling

* use eval variable

* alphabetic order

* Add MLSUM dataset for clustering task (#21)

* Use Masakhane dataset for clustering task (#23)

* 16 add datasets to readmemd (#18)

* run task table

* run task table

* Add MLSUM dataset for clustering task (#21)

* Use Masakhane dataset for clustering task (#23)

* run task table

* refresh readme

* refresh readme

* run task table

* refresh readme

---------

Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr>
Co-authored-by: Marion Schaeffer <92590517+schmarion@users.noreply.github.com>

* load only test split (#25)

Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr>

* Update mteb/tasks/BitextMining/DiaBLaBitextMining.py

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* Update mteb/tasks/Clustering/HALClusteringS2S.py

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* renaming masakhane (#28)

Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr>

* Syntec dataset addition (#26)

* add script to process & load to HF

* add script to enable download of data from HF

* add syntec dataset files to gitignore

* add syntecretrieval

* add syntec retrieval

* build dataloading script

* remove datasets

* correct typo

---------

Co-authored-by: Sequeira Gabriel <gabriel.sequeira@outlook.fr>

* 30 add syntec reranking (#31)

* change name to specify retrieval

* add reranking tasks

* create script to upload dataset for reranking task

* create reranking task

* add reranking tasks

* add model name in description

* SummEval translated to french (#32)

* 7 sts (#33)

* take into account multilingual tasks

* add stsbenchmark multilingual dataset

* add STS tasks

* take into account multilingual tasks

* add stsbenchmark multilingual dataset

* add STS tasks

* add comma

* Adding sick fr dataset to sts tasks (#34)

* Adding sick fr dataset to sts tasks
* modifying dataset in load function to have the right column names

* Fix alloprof dataset (#36)

* change revision to use

* remove duplicate data

* change main metric because dataset is hard (#37)

* Fix alloprof dataset (#40)

* change revision to use

* remove duplicate data

* change revision

* handle queries train test split

* change dataset creation method

* change revision

* handle queries train test split

* change dataset creation method

* Fix DiaBLa by inheriting CrossLingual class (#42)

* Fix DiaBLa by inheriting CrossLingual class

* remove remaining print

* Fix DiaBLa integration

* Update mteb/tasks/BitextMining/FloresBitextMining.py

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* Update README.md

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* Update README.md

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* Update mteb/tasks/Classification/MasakhaNEWSClassification.py

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* Update README.md

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* Update README.md

* Update mteb/tasks/BitextMining/FloresBitextMining.py

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* Update mteb/evaluation/MTEB.py

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* Update mteb/abstasks/AbsTaskPairClassification.py

Co-authored-by: Imene Kerboua <33312980+imenelydiaker@users.noreply.github.com>

* Update README.md

* Update scripts/data/syntec/create_data_reranking.py

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* Update scripts/data/alloprof/create_data_reranking.py

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* Update scripts/run_mteb_french.py

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* Update scripts/run_mteb_french.py

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* Update mteb/evaluation/MTEB.py

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* Update mteb/evaluation/MTEB.py

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* Update mteb/tasks/Retrieval/HagridRetrieval.py

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* Update mteb/tasks/Clustering/MLSUMClusteringP2P.py

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* Update mteb/tasks/Clustering/MLSUMClusteringS2S.py

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* Update mteb/tasks/Clustering/MasakhaNEWSClusteringP2P.py

* Update mteb/tasks/Clustering/MasakhaNEWSClusteringS2S.py

* Update mteb/tasks/STS/SickFrSTS.py

* Inherit OpusparcusPC init from MultilingualTask

* remove unnecessary init

* Remove train split from evaluation on MasakhaNEWSClassification (#52)

remove train split from evaluation

* put script on HF dataset repos (#56)

* put script on HF dataset repos

* remove scripts

* 49 fix dictionary in syntecretrieval (#54)

* add trust remote code arg

* leave corpus as dict

* remove trust remote code

* add Tatoeba & BUCC BitextMining tasks (#57)

add bucc and tatoeba bitextmining tasks

* 46 add other languages to masakhaneweclusterings2s and p2p (#58)

* add other language to clustering tasks

* fix main score and S2S task

* update run fr benchmark script

* Update run_mteb_french.py

* Update AbsTaskClustering.py

* remove train and validation splits

---------

Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr>
Co-authored-by: Marion Schaeffer <92590517+schmarion@users.noreply.github.com>
Co-authored-by: mciancone@openstudio.fr <mciancone@openstudio.fr>
Co-authored-by: Imene Kerboua <33312980+imenelydiaker@users.noreply.github.com>
Co-authored-by: mciancone <73994289+Sunalwing@users.noreply.github.com>
Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>
Co-authored-by: wissam-sib <36303760+wissam-sib@users.noreply.github.com>
Co-authored-by: Wissam Siblini <wissam.siblini92@gmail.com>
Muennighoff added a commit that referenced this pull request Feb 27, 2024
* add Masakhane dataset config

* add trigram lang code for dataset who use it

* create french script eval

* fix French word

* add some documentation

* add script to process and upload alloprof on HF

* build script for HF

* adding dataset processing for mteb

* add script to process and upload alloprof on HF

* build script for HF

* adding dataset processing for mteb

* refactor few thing

* remove whitespaces

* 4 pair classification (#10)

* add Opusparcus dataset

* multilingual usage

* use eval_split of config files

* change eval_split according to data

---------

Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr>

* add script to process and upload alloprof on HF

* build script for HF

* adding dataset processing for mteb

* refactor few thing

* remove whitespaces

* Clustering with HAL S2S dataset (#11)

HAL S2S dataset creation and evaluation on clustering task.

* adding BSARD dataset

* add BSARD to benchmark

* adding Hagrid dataset

* DiaBLa and Flores Bitext Mining evaluation (#12)

* Add DiaBLa dataset for bitext mining

* Add DiaBLa dataset for bitext mining

* deduplicate bitext task

* add Flores

* format files

* add flores to evaluation script

* remove prints

* add revision

---------

Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr>

* add script to process and upload alloprof on HF

* build script for HF

* adding dataset processing for mteb

* refactor few thing

* remove whitespaces

* adding dataset processing for mteb

* adding BSARD dataset

* add BSARD to benchmark

* adding Hagrid dataset

* fix change on langmapping

* reset alphabetical order

* add revision handling

* Clustering: Add AlloProf dataset  (#17)

AlloProf dataset for clustering task

* handling of revision

* change split + add revision handling

* add script to process and upload alloprof on HF

* build script for HF

* adding dataset processing for mteb

* refactor few thing

* remove whitespaces

* adding dataset processing for mteb

* adding BSARD dataset

* add BSARD to benchmark

* adding Hagrid dataset

* add script to process and upload alloprof on HF

* adding dataset processing for mteb

* refactor few thing

* reset alphabetical order

* add revision handling

* handling of revision

* change split + add revision handling

* use eval variable

* alphabetic order

* Add MLSUM dataset for clustering task (#21)

* Use Masakhane dataset for clustering task (#23)

* 16 add datasets to readmemd (#18)

* run task table

* run task table

* Add MLSUM dataset for clustering task (#21)

* Use Masakhane dataset for clustering task (#23)

* run task table

* refresh readme

* refresh readme

* run task table

* refresh readme

---------

Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr>
Co-authored-by: Marion Schaeffer <92590517+schmarion@users.noreply.github.com>

* load only test split (#25)

Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr>

* Update mteb/tasks/BitextMining/DiaBLaBitextMining.py

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* Update mteb/tasks/Clustering/HALClusteringS2S.py

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* renaming masakhane (#28)

Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr>

* Syntec dataset addition (#26)

* add script to process & load to HF

* add script to enable download of data from HF

* add syntec dataset files to gitignore

* add syntecretrieval

* add syntec retrieval

* build dataloading script

* remove datasets

* correct typo

---------

Co-authored-by: Sequeira Gabriel <gabriel.sequeira@outlook.fr>

* 30 add syntec reranking (#31)

* change name to specify retrieval

* add reranking tasks

* create script to upload dataset for reranking task

* create reranking task

* add reranking tasks

* add model name in description

* SummEval translated to french (#32)

* 7 sts (#33)

* take into account multilingual tasks

* add stsbenchmark multilingual dataset

* add STS tasks

* take into account multilingual tasks

* add stsbenchmark multilingual dataset

* add STS tasks

* add comma

* Adding sick fr dataset to sts tasks (#34)

* Adding sick fr dataset to sts tasks
* modifying dataset in load function to have the right column names

* Fix alloprof dataset (#36)

* change revision to use

* remove duplicate data

* change main metric because dataset is hard (#37)

* Fix alloprof dataset (#40)

* change revision to use

* remove duplicate data

* change revision

* handle queries train test split

* change dataset creation method

* change revision

* handle queries train test split

* change dataset creation method

* Fix DiaBLa by inheriting CrossLingual class (#42)

* Fix DiaBLa by inheriting CrossLingual class

* remove remaining print

* Fix DiaBLa integration

* Update mteb/tasks/BitextMining/FloresBitextMining.py

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* Update README.md

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* Update README.md

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* Update mteb/tasks/Classification/MasakhaNEWSClassification.py

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* Update README.md

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* Update README.md

* Update mteb/tasks/BitextMining/FloresBitextMining.py

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* Update mteb/evaluation/MTEB.py

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* Update mteb/abstasks/AbsTaskPairClassification.py

Co-authored-by: Imene Kerboua <33312980+imenelydiaker@users.noreply.github.com>

* Update README.md

* Update scripts/data/syntec/create_data_reranking.py

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* Update scripts/data/alloprof/create_data_reranking.py

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* Update scripts/run_mteb_french.py

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* Update scripts/run_mteb_french.py

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* Update mteb/evaluation/MTEB.py

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* Update mteb/evaluation/MTEB.py

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* Update mteb/tasks/Retrieval/HagridRetrieval.py

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* Update mteb/tasks/Clustering/MLSUMClusteringP2P.py

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* Update mteb/tasks/Clustering/MLSUMClusteringS2S.py

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* Update mteb/tasks/Clustering/MasakhaNEWSClusteringP2P.py

* Update mteb/tasks/Clustering/MasakhaNEWSClusteringS2S.py

* Update mteb/tasks/STS/SickFrSTS.py

* Inherit OpusparcusPC init from MultilingualTask

* remove unnecessary init

* Remove train split from evaluation on MasakhaNEWSClassification (#52)

remove train split from evaluation

* put script on HF dataset repos (#56)

* put script on HF dataset repos

* remove scripts

* 49 fix dictionary in syntecretrieval (#54)

* add trust remote code arg

* leave corpus as dict

* remove trust remote code

* add Tatoeba & BUCC BitextMining tasks (#57)

add bucc and tatoeba bitextmining tasks

* 46 add other languages to masakhaneweclusterings2s and p2p (#58)

* add other language to clustering tasks

* fix main score and S2S task

* update run fr benchmark script

* Update run_mteb_french.py

* Update AbsTaskClustering.py

* remove train and validation splits

* remove Hagrid (#60)

---------

Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr>
Co-authored-by: Marion Schaeffer <92590517+schmarion@users.noreply.github.com>
Co-authored-by: mciancone@openstudio.fr <mciancone@openstudio.fr>
Co-authored-by: Sequeira Gabriel <gabriel.sequeira@outlook.fr>
Co-authored-by: Imene Kerboua <33312980+imenelydiaker@users.noreply.github.com>
Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>
Co-authored-by: wissam-sib <36303760+wissam-sib@users.noreply.github.com>
Co-authored-by: Wissam Siblini <wissam.siblini92@gmail.com>
Samoed added a commit that referenced this pull request Jun 4, 2026
The mean-task-type summary builder (MIEB, ViDoRe-style) writes BOTH
"Rank" (the primary, assigned in sort-by-Mean(Task) order) and
"Rank (Borda)" (kept for back-compat). The adapter was falling back
to "Rank (Borda)" first, so MIEB rows ended up labelled with Borda
ranks even though the rows themselves were mean-sorted — the actual
top model (jina-embeddings-v5-omni-small at mean 0.65) showed up
as rank #11 on the home leaderboard.

Switching the lookup order to ("Rank", "Rank (Mean Task)", "Rank
(Borda)") so the explicit primary rank wins when present. Standard
builders only emit "Rank (Borda)" so they still pick that up via
fallback.
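
The lookup order described above can be sketched in a few lines. This is only an illustration, not the adapter's actual code: `pick_rank` and the sample rows are hypothetical, and rows are modelled as plain dicts rather than polars frames.

```python
from typing import Mapping, Optional, Sequence

# Preferred column order, as listed in the commit message above.
RANK_COLUMNS: Sequence[str] = ("Rank", "Rank (Mean Task)", "Rank (Borda)")


def pick_rank(row: Mapping[str, object]) -> Optional[object]:
    """Return the first non-missing rank value, honouring RANK_COLUMNS order."""
    for column in RANK_COLUMNS:
        value = row.get(column)
        if value is not None:
            return value
    return None


# Hypothetical row from a mean-task-type builder: both columns exist, "Rank" wins.
mieb_row = {"Rank": 1, "Rank (Borda)": 11}
# Hypothetical row from a standard builder: only the Borda rank exists.
standard_row = {"Rank (Borda)": 4}
```

With the previous order, the first row would have been labelled 11 even though it sorts first by mean.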

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Samoed added a commit that referenced this pull request Jun 21, 2026
* change to polars leaderboard

* WIP changes

* speedup

* upds

* fixes

* rewrite mostly to polars

* fixes

* more speedup

* lint a bit

* fix init

* fix tests

* fix tests

* fix typing

* refactor to private

* Add mteb/api FastAPI service and HF Space Dockerfile

New mteb/api subpackage exposes the leaderboard data as a FastAPI
service backed by ResultCache + the existing polars summary builders.
Routes mirror the SvelteKit frontend's data needs: benchmark menu,
benchmark detail, and prerendered summary tables. CORS origins,
preload, and cache locations come from settings.

Dockerfile clones mteb@api, installs .[api], and serves uvicorn on
:7860 as UID 1000 — drop-in for a Hugging Face Space.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix typing

* Annotate cors_origins with NoDecode so env strings parse

pydantic-settings' EnvSettingsSource tries to json.loads any field it
considers complex *before* invoking field_validators, which made the
documented comma-separated MTEB_API_CORS_ORIGINS format crash with
JSONDecodeError at app startup inside the HF Space. NoDecode skips
that pre-parse step and lets the existing field_validator split on
commas as advertised.
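
The failure mode can be reproduced with the standard library alone. The sketch below does not use pydantic-settings; it only shows why eager JSON decoding rejects a comma-separated value while a comma split accepts it. `split_origins` and the sample value are hypothetical.

```python
import json


def split_origins(raw: str) -> list[str]:
    """Split a comma-separated env value into trimmed, non-empty origins."""
    return [part.strip() for part in raw.split(",") if part.strip()]


raw_value = "https://a.example, https://b.example"  # hypothetical env value

# Eager JSON decoding of the raw string rejects the comma-separated form.
try:
    json.loads(raw_value)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

# Splitting on commas, as the validator is described to do, succeeds.
origins = split_origins(raw_value)
```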

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Bust Docker layer cache when the api branch advances

`RUN git clone` always produces the same layer hash because the command
string never changes, so HF Spaces was rebuilding the image on top of a
stale checkout — the cors_origins NoDecode fix never made it into the
running container. Pull the latest commit SHA from GitHub via ADD just
before the clone; ADD invalidates the layer whenever the response body
changes, which forces a fresh clone per push.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Move leaderboard_parquet_path onto ResultCache

The api module needed only this one-line helper from
mteb.leaderboard.app, but importing it pulled in gradio, pandas, and
cachetools — none of which belong in the [api] extra. Promoting it to
a property on ResultCache lets every consumer (api, leaderboard,
bench script) reach the path without dragging the Gradio stack into
the API container.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Pre-fetch mteb/results HF dataset at build time

Drops the cold-start cost of cloning the GitHub results repo on first
request by pulling the same data from huggingface.co/datasets/mteb/results
during image build. Goes into the default huggingface_hub cache under
HF_HOME so callers reach it via the standard hub APIs. The download is
guarded with `|| true` so it stays non-fatal while the dataset is still
being populated upstream — the API just falls back to the GitHub clone
on first request.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Switch leaderboard cache loader to per-config HF dataset layout

The results-repo sync now pushes one HF dataset config per benchmark
(plus a ``default`` config holding every result, deduped). Rewires the
API consumer to match:

* ``_load_from_hub`` enumerates configs and ``load_dataset(name=cfg,
  split='train')`` each. A failure on one config no longer poisons the
  whole load.
* ``_load_per_benchmark_frames`` collapses to two paths — hub or cold
  rebuild — and returns a ``(per_benchmark, all_results)`` tuple
  instead of the ``_LoadedFrames`` dataclass. The two named wrappers
  (``get_all_benchmark_frames`` / ``get_all_results_df``) go away;
  callers destructure inline.
* Hub-supplied ``default`` config short-circuits the per-benchmark
  concat for the unified view.

Other follow-ups:

* ``BenchmarkResults`` gains ``load_leaderboard_frame`` and
  ``split_leaderboard_frame`` so loading the raw combined frame can be
  decoupled from splitting it. The new
  ``_split_by_benchmark_tasks`` filters via an inner join on
  ``(task_name, split, subset)`` tuples — off-spec subsets/splits no
  longer leak through to ``_create_summary_table``'s
  ``group_by(model_name, task_name).mean()``.
* ``MTEB_API_CACHE_REPO`` moves to ``Settings`` alongside
  ``cors_origins`` / ``preload``; consumers go through
  ``settings.cache_repo()``.
* /robots.txt added to silence Space probes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix import

* fix repo name

* fix results

* benchmarks: add MVEB benchmark suite + leaderboard Video menu

Adds the MVEB (Massive Video Embedding Benchmark) benchmark objects to
main so the leaderboard and get_benchmark() can resolve them. The
underlying tasks are already on main; this adds only the curated
benchmark groupings and their registration.

- benchmarks.py: MVEB (23 tasks), MVEB(text, video) (19), MVEB(video)
  (9), MVEB(beta, extended) (184, alias MVEB(extended)).
- benchmarks/__init__.py: import + __all__ registration.
- _leaderboard_menu.py: new "Video" group under General Purpose.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* add more parameters to schemas

* add train on to cache results

* Lifespan warmup, /scores route, per-task num_models

Replace the deprecated `@on_event("startup")` hook with a FastAPI
`lifespan` context manager. `warmup_blocking` is dispatched via
`asyncio.to_thread` so its sync polars work runs without blocking the
event loop, and uvicorn holds the listener until it returns — the
first request lands on a fully warm cache. Heavy summary preloading
stays gated behind `MTEB_API_PRELOAD=1` in a daemon thread.
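
A minimal sketch of the lifespan pattern, using only the standard library so it runs without FastAPI. The body of `warmup_blocking` and the `cache` dict are stand-ins, not the service's real warmup.

```python
import asyncio
import time
from contextlib import asynccontextmanager

cache: dict[str, str] = {}


def warmup_blocking() -> None:
    """Stand-in for synchronous warmup work (here: a short sleep)."""
    time.sleep(0.01)
    cache["summary"] = "ready"


@asynccontextmanager
async def lifespan(app):
    # Run the blocking warmup in a worker thread so the event loop stays free.
    await asyncio.to_thread(warmup_blocking)
    yield  # the server would start handling requests at this point
    cache.clear()  # teardown on shutdown


async def main() -> bool:
    # Drive the context manager by hand; `app` is unused in this sketch.
    async with lifespan(app=None):
        return cache.get("summary") == "ready"


warm_on_first_request = asyncio.run(main())
```

With FastAPI, a function of this shape is passed as `FastAPI(lifespan=lifespan)`.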

Add `num_models` to `TaskMetaSchema`, derived from the unified
results frame by `routes._task_num_models_map()` and overlaid on
every `/tasks` and `/tasks/{name}` response. Drives the new "Models
evaluated" stat + sort on the frontend `/tasks` cards.

Rename `/benchmarks/{name}/summary` to `/scores` as the canonical
path (keep `/summary` as a hidden alias) and refresh the README.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* benchmarks: mark MVEB suite beta; drop extended from leaderboard menu

Addresses review:
- Rename to beta variants: MVEB(beta), MVEB(video, beta),
  MVEB(text, video, beta) (consistent with MAEB/RTEB). Old names kept
  as aliases so get_benchmark("MVEB") etc. still resolve.
- Leaderboard "Video" menu no longer displays MVEB(beta, extended).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* api: expose Benchmark.language_view through the schema

Surfaces the per-language column list the frontend's "Performance
per language" tab needs. Mirrors the existing pattern for
benchmark.languages — codes are resolved via language_label() so
the frontend renders "German" rather than "deu-Latn". Empty default
collapses to None so the frontend treats "no language view" as a
single missing-value check (and hides the tab entirely on benchmarks
that didn't opt in).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* api: pick "Rank" over "Rank (Borda)" when both columns exist

The mean-task-type summary builder (MIEB, ViDoRe-style) writes BOTH
"Rank" (the primary, assigned in sort-by-Mean(Task) order) and
"Rank (Borda)" (kept for back-compat). The adapter was falling back
to "Rank (Borda)" first, so MIEB rows ended up labelled with Borda
ranks even though the rows themselves were mean-sorted — the actual
top model (jina-embeddings-v5-omni-small at mean 0.65) showed up
as rank #11 on the home leaderboard.

Switching the lookup order to ("Rank", "Rank (Mean Task)", "Rank
(Borda)") so the explicit primary rank wins when present. Standard
builders only emit "Rank (Borda)" so they still pick that up via
fallback.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* benchmarks: customisable sort column + relabel MIEB's mean to TaskType

Add `Benchmark.summary_sort_column: ClassVar[str | None] = None` — a
new opt-in hook on every benchmark for choosing which polars column
the summary frame is sorted by (and which populates the displayed
`Rank`). `None` keeps the historical builder default; subclasses set
it explicitly when they want a different sort.

`_create_summary_table_mean_task_type` gains a `sort_by` arg that
overrides the default (sort by `mean_column_name`).

MIEB updated to declare its actual semantics:
  - aggregations = (MEAN_TASK_TYPE, TASK_TYPES) — the column previously
    labelled "Mean (Task)" is computed as mean-of-per-type-means, which
    is mathematically a Mean (TaskType). The misleading label was
    causing the frontend to recompute and disagree with the canonical
    value (jina on MIEB(eng) showed 58 instead of 65).
  - mean_column_name = "Mean (TaskType)" so the leaderboard column
    matches the actual aggregation.
  - summary_sort_column = "Mean (TaskType)" — sort by the renamed
    column (was sorting by it anyway, just plumbed through the new
    hook).
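
The difference between the two aggregations is easiest to see on a small example. The task types and scores below are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-task scores for one model: (task_type, score).
scores = [
    ("Retrieval", 0.40),
    ("Retrieval", 0.50),
    ("Retrieval", 0.60),
    ("Classification", 0.90),
]

# Mean (Task): every task counts equally.
mean_task = mean(score for _, score in scores)

# Mean (TaskType): average within each type first, then across types.
by_type = defaultdict(list)
for task_type, score in scores:
    by_type[task_type].append(score)
mean_task_type = mean(mean(values) for values in by_type.values())
```

Here the per-task mean is about 0.60 while the mean of per-type means is about 0.70, so a frontend that recomputes the first under a label meant for the second will disagree with the backend.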

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix

* api: /benchmarks/{name}/leaders for slim home-page tiles

Adds a new endpoint that returns one row per size bucket — the
highest-scoring model in each [min, max) megaparameter range.
Lets the leaderboard home page render its featured mini-tables
without pulling the full /scores payload for every primary tile
(several MB on multilingual benchmarks → a few hundred bytes).

  GET /benchmarks/{name}/leaders?buckets=[[0,500],[500,1000],[1000,5000],[5000,null]]

`buckets` is a JSON-encoded array of `[min, max]` tuples in
millions of parameters (`null` or omitted second element = open-
ended top bucket). Backend converts each bucket's millions to
billions internally to compare against `total_params_b`.

Score selection prefers `mean_task` but falls back to
`mean_task_type` for benchmark builders that only populate the
latter (MIEB / ViDoRe-style) — without the fallback MIEB returned
`leader: null` because its rows have null `mean_task`.
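
The bucket selection and score fallback described above might look roughly like this. It is a sketch on plain dicts, not the endpoint's implementation: the function names and sample rows are hypothetical, and only the field names `total_params_b`, `mean_task` and `mean_task_type` come from the message.

```python
import json
from typing import Optional


def score_of(row: dict) -> Optional[float]:
    """Prefer mean_task; fall back to mean_task_type when it is missing."""
    if row.get("mean_task") is not None:
        return row["mean_task"]
    return row.get("mean_task_type")


def leaders(rows: list[dict], buckets_json: str) -> list[Optional[dict]]:
    """Return the top-scoring row for each [min, max) bucket (in millions)."""
    result = []
    for bucket in json.loads(buckets_json):
        low_b = bucket[0] / 1000  # millions -> billions
        high = bucket[1] if len(bucket) > 1 else None
        high_b = None if high is None else high / 1000
        candidates = [
            r for r in rows
            if r.get("total_params_b") is not None
            and score_of(r) is not None
            and r["total_params_b"] >= low_b
            and (high_b is None or r["total_params_b"] < high_b)
        ]
        # An empty bucket yields None instead of raising.
        result.append(max(candidates, key=score_of, default=None))
    return result


rows = [  # hypothetical models
    {"name": "org/small", "total_params_b": 0.3, "mean_task": 0.55, "mean_task_type": None},
    {"name": "org/tiny", "total_params_b": 0.1, "mean_task": 0.50, "mean_task_type": None},
    {"name": "org/large", "total_params_b": 7.0, "mean_task": None, "mean_task_type": 0.65},
]
top = leaders(rows, "[[0,500],[500,1000],[5000,null]]")
```

The last row has no `mean_task`, so without the fallback its bucket would come back empty.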

Response is a slim `LeaderRowSchema` with `name / displayName /
org / modelType / rank / meanTask / totalParamsB` — just enough for
the frontend to render an `<org>/<name>` link + score.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* remove misc

* fix docs

* API: /v1 prefix, per-language endpoint, scoped task languages, home menu

Routes / app layout
- All app routes mounted under `/v1` (benchmarks, tasks, models, icon).
  Infra paths (`/health`, `/metrics`, robots, favicon) stay at root via
  `infra_router`.

New endpoint
- `GET /v1/benchmarks/{name}/per-language` returns one row per model
  with mean main_score keyed by human language label (matches the
  labels emitted on `Benchmark.languages` / `TaskMeta.languages`).
  Built from the long results frame: explode the `language` list →
  group by `(model_name, language)` → mean(score) → `language_label`.
  Same in-memory cache pattern as summary; warmed alongside summaries
  in `preload_summaries_in_background`.
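
The explode, group and mean pipeline can be illustrated without polars. This is a pure-Python sketch under assumed row shapes; the `language_label` here is a hypothetical stand-in with a two-entry label table, not the library function.

```python
from collections import defaultdict
from statistics import mean

LABELS = {"deu-Latn": "German", "fra-Latn": "French"}  # hypothetical label table


def language_label(code: str) -> str:
    """Hypothetical stand-in: map a language code to a display label."""
    return LABELS.get(code, code)


def per_language(results: list[dict]) -> dict[str, dict[str, float]]:
    """Explode `language`, group by (model_name, language), average the score."""
    grouped = defaultdict(list)
    for row in results:
        for code in row["language"]:  # explode the list column
            grouped[(row["model_name"], code)].append(row["score"])
    table: dict[str, dict[str, float]] = defaultdict(dict)
    for (model, code), values in grouped.items():
        table[model][language_label(code)] = mean(values)
    return dict(table)


results = [  # hypothetical long-format rows
    {"model_name": "m1", "language": ["deu-Latn"], "score": 0.5},
    {"model_name": "m1", "language": ["deu-Latn", "fra-Latn"], "score": 1.0},
]
table = per_language(results)
```

A result that covers several languages contributes its score to each of them, which is what exploding the list column implies.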

Scoped task languages
- `scoped_task_meta_schema(task)` returns a TaskMetaSchema with
  `languages` derived from the instance's `task.languages` (post
  `filter_languages` / hf_subsets) instead of the metadata union.
  Benchmarks like MTEB(Scandinavian, v1) that pin tasks to a subset
  no longer surface the task's full unscoped language list (Burmese,
  etc.) in the per-benchmark filter sidebar.

Schema changes
- `TaskMetaSchema.main_score: str | None` — populated from
  `TaskMetadata.main_score` (e.g. `ndcg_at_10`, `cosine_spearman`).
  Surfaced on /tasks cards, /tasks/[name], and PerTaskTab tooltip.
- `LeaderModelSchema` drops `display_name` + `org`; carries just
  `name` + `model_type`. Clients split on `/` as needed.
- New `BenchmarkPerLanguageRowSchema` + `BenchmarkPerLanguageSchema`.

Home menu
- `HOME_BENCHMARK_ENTRIES` — flat 4-section layout (Language /
  Modality / Retrieval / Domain) reusing the curated benchmarks from
  `GP_BENCHMARK_ENTRIES + R_BENCHMARK_ENTRIES`. `/v1/benchmarks/menu`
  serves this for the leaderboardv2 home page. Gradio leaderboard
  keeps using `GP + R` unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* add lock

* API: language-scoped /scores, lenient means, facet metadata

- /benchmarks/{name}/scores accepts ?languages=...; long_df is pre-filtered
  to subsets whose language list intersects the picks (matched by raw ISO
  code OR display label) before bench._create_summary_table runs.
- When languages is non-empty, per-row mean_task / mean_task_type are
  recomputed with lenient skipna (mean over present per-task scores) so
  partial-coverage models surface a score instead of '-' once the visible
  task slice shrinks. Unfiltered case keeps the canonical strict means so
  full-coverage models can't be outranked by partial peers.
- Per-(name, sorted-languages) summary cache keyed on the tuple of
  selected languages, so repeating the same selection on a refresh hits
  the LRU cache.
- ModelMetaSchema gains a `languages` field (display labels via
  language_label). Drives the /models language facet on the frontend.
- BenchmarkSchema gains `simplified_task_types`. Drives the "Task group"
  facet on /benchmarks (5 buckets: retrieval / classification /
  pair-classification / clustering / semantic-similarity).
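
The difference between the strict and lenient means described in this commit can be shown with a small sketch. The `mean_scores` helper is hypothetical, and `None` stands in for a task the model was not evaluated on.

```python
def mean_scores(task_scores, lenient):
    """Mean over per-task scores; None marks a task the model did not run.

    strict (lenient=False): any missing task nulls the aggregate.
    lenient (lenient=True): average over whatever scores are present.
    """
    present = [s for s in task_scores if s is not None]
    if not present:
        return None
    if not lenient and len(present) != len(task_scores):
        return None
    return sum(present) / len(present)
```

With scores `[0.5, None, 1.0]`, the strict mean is `None` while the lenient mean is `0.75`, which is why a partial-coverage model only surfaces a number once a language filter is active.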

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* API: generate + serve per-entity OG hero cards

Add scripts/generate_og_images.py (Playwright) and a matplotlib-only
alternate at scripts/generate_og_images_mpl.py. Both walk the mteb
registry, render one PNG per benchmark / task / model using a shared
template under scripts/og-template/, and write hash sidecars so re-runs
skip unchanged entities. FastAPI mounts the rendered dir at /og with a
long Cache-Control. The Dockerfile picks up the work in a builder stage
that ships Chromium + Playwright; the runtime stage only inherits the
pre-rendered PNG files, keeping the deployed image lean. Torch installs
from the CPU index to avoid the CUDA wheels.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* API: OG slug matches JS encodeURIComponent + nested model paths; open CORS to *

Two interlocking 404s on the OG mount, plus a CORS gap that broke
client-side share-card validators:

  - _slug used quote(safe=""), which over-encodes ()!*' relative to
    JavaScript's encodeURIComponent. Benchmark names like MTEB(eng, v2)
    wrote files at MTEB%28eng...%29.png while ShareMeta requested
    MTEB(eng...).png. Add safe="!*'()" to match.
  - Model names like microsoft/harrier-... were stored as flat
    org%2Fname.png. Starlette decodes %2F to / before file lookup, so
    every model card 404'd. Split on / and write to nested directories
    (org subdir per model); ensure the parent dir exists at plan time.
  - cors_origins defaulted to a four-host allowlist that blocked
    opengraph.xyz-style client-side previewers (no security gain —
    everything we serve is public). Default to ["*"]; env override
    replaces rather than merges.
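
A minimal sketch of the two path fixes, using only `urllib.parse.quote`. The function names and the `<kind>/<org>/<name>.png` layout are assumptions for illustration; only the `safe="!*'()"` argument and the split on `/` come from the description above.

```python
from pathlib import PurePosixPath
from urllib.parse import quote

# Characters that encodeURIComponent leaves unescaped in addition to the
# ones quote() always keeps (letters, digits and "_.-~").
_JS_SAFE = "!*'()"


def og_slug(segment: str) -> str:
    """Percent-encode one path segment the way encodeURIComponent does."""
    return quote(segment, safe=_JS_SAFE)


def og_relative_path(kind: str, name: str) -> PurePosixPath:
    """Hypothetical layout: split the entity name on '/' into nested dirs."""
    segments = [og_slug(part) for part in name.split("/")]
    segments[-1] += ".png"
    return PurePosixPath(kind, *segments)
```

With `safe=""` the parentheses in `MTEB(eng, v2)` would become `%28` and `%29`; with the extra safe characters only the comma and the space are escaped, so the file name matches what the browser requests.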

Drop the redundant MTEB_API_CORS_ORIGINS line from the Dockerfile so
the image inherits the open default.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* tmp

* OG cards: use simplified task group instead of raw type

Task cards previously showed the raw type (BitextMining, STS,
PairClassification, …) on the secondary badge + the "Type" stat.
Switch to the simplified group (bitext-mining, semantic-similarity,
pair-classification, …) so the OG hero matches the task-group chip
on TaskCard and the type-badge on /tasks/[name] — same wording
across every surface. Fall back to the raw type when simplified
isn't populated.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* API: overlay num_models on the benchmark detail endpoint

`/v1/benchmarks/{name}` was returning the raw schema with the default
`num_models=0`, while `/v1/benchmarks` and the menu both apply
`_with_num_models`. The discrepancy bit the home-page primary tiles
once any of them were removed from HOME_BENCHMARK_ENTRIES — the
frontend's fallback fetches the detail endpoint directly and was
getting "0 models" even for benchmarks with hundreds of evaluated
models (e.g. RTEB(beta) with numModels=267 on /benchmarks but 0 on
/benchmarks/RTEB(beta)).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* api: pick leaders by rank instead of by score

`_pick_leader` was excluding rows whose `mean_task`/`mean_task_type`
were both null, which on a partial-coverage benchmark (e.g. RTEB
locally without SWEbenchCodeRetrieval) is every row — strict-skipna
nulls the headline mean as soon as one task is missing. Result: the
home-page bucket leaders went empty and tiles read "No size-bucketed
data yet" even though /scores rendered 267 ranked rows for the same
benchmark.

Switch the picker to `row.rank` (lowest = best). The summary builder
fills Borda rank in from per-task scores regardless of whether the
strict aggregate survives, so every row in the bucket has a
deterministic rank and the leader is always defined. The response still
echoes back `mean_task` (or `mean_task_type`) so the tile's "Leader:
… · {score}" line keeps printing a number when there is one.
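
A rough sketch of rank-based selection, assuming a simplified row type. The `Row` dataclass and the helper names are illustrative and are not the actual `_pick_leader` signature.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Row:
    name: str
    rank: int  # Borda rank, 1 = best
    mean_task: Optional[float] = None
    mean_task_type: Optional[float] = None


def pick_leader(rows):
    """Lowest rank wins; null means no longer exclude a row."""
    return min(rows, key=lambda r: r.rank) if rows else None


def display_score(row):
    """Echo mean_task when present, otherwise fall back to mean_task_type."""
    return row.mean_task if row.mean_task is not None else row.mean_task_type
```

Because selection depends only on `rank`, a bucket where every strict mean is null still yields a leader; the score shown next to it may simply be absent.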

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* upd menu

* API: pre-serialize + pre-gzip bytes cache, drop ETag middleware

Caches Serialized(body, body_gzip, etag) per endpoint so warm requests
skip pydantic-core JSON dumping, gzip compression, and SHA-1 hashing.
_cached_json handles encoding negotiation, 304 revalidation, and a
4-hour Cache-Control inline; ETagMiddleware is gone (it buffered every
response body just to compute an etag we now precompute once).
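
The idea of caching ready-to-send bytes can be sketched without any web framework. Everything below is an assumption-laden illustration: the `serialize` and `respond` names, the size threshold for skipping gzip, the `public` directive and the header casing are not taken from the codebase; only the `Serialized(body, body_gzip, etag)` shape, SHA-1 ETags and the 4-hour max-age come from the description.

```python
import gzip
import hashlib
import json
from typing import NamedTuple, Optional


class Serialized(NamedTuple):
    body: bytes
    body_gzip: Optional[bytes]  # None when compression is skipped
    etag: str


def serialize(payload, min_gzip_size=1024):
    """Do the JSON dump, gzip and hash once, at cache-fill time."""
    body = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    body_gzip = gzip.compress(body) if len(body) >= min_gzip_size else None
    etag = '"' + hashlib.sha1(body).hexdigest() + '"'
    return Serialized(body, body_gzip, etag)


def respond(entry, accept_encoding="", if_none_match=None):
    """Return (status, headers, body) for a warm request."""
    headers = {
        "ETag": entry.etag,
        "Vary": "accept-encoding",
        "Cache-Control": "public, max-age=14400",  # 4 hours
    }
    if if_none_match == entry.etag:
        return 304, headers, b""  # revalidation: no body needed
    if entry.body_gzip is not None and "gzip" in accept_encoding.lower():
        headers["Content-Encoding"] = "gzip"
        return 200, headers, entry.body_gzip
    return 200, headers, entry.body
```

A warm request then only compares a header and picks one of two byte strings, which is consistent with the sub-millisecond medians reported for the smaller payloads.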

In-process bench warm-request medians vs HEAD:

  per_language(MTEB Multilingual, v2)  301ms -> 6.3ms  (48x, 6.4MB)
  tasks_list                            30ms -> 1.3ms  (23x, 2.1MB)
  scores(MTEB Multilingual, v2)         32ms -> 1.5ms  (21x, 1.9MB)
  models_list                           11ms -> 0.65ms (17x, 922KB)
  scores(BEIR)                         9.1ms -> 0.55ms (17x, 588KB)

Other changes folded in:
- Per-name asyncio.Lock coalesces cold builds; OrderedDict LRU bounds
  the language-scoped summary cache (was unbounded).
- Parallelised warmup_blocking (parquet load + training-dataset prewarm
  + schema prewarm run on a 3-thread pool); prewarm_schema_caches now
  threads task/benchmark/model schema construction across 16 workers.
- Filter caches no longer key by name_query — substring searches reuse
  the warm full list and post-filter.
- model_construct on hot per-row schema builds skips pydantic
  validation on values produced internally.
- Aggressive trim of module/function docstrings; DRY'd _with_num_models
  helpers; flattened the language_view ternary; dropped a few dead
  branches (code != "Unknown", discriminator=None wrapper, etc.).
- mteb/benchmarks/benchmark.py: fix _get_benchmarks_on_leaderboard tuple
  wrap that broke startup against HOME_BENCHMARK_ENTRIES.
- scripts/bench_api_inproc.py: in-process httpx + ASGITransport bench
  with a --diff mode for before/after comparison.
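
A bounded `OrderedDict` LRU of the kind mentioned in the list above might look like the following sketch. The class name, the `summary_key` helper and the eviction policy details are assumptions; the source only states that an `OrderedDict` LRU bounds the language-scoped summary cache and that entries are keyed per (name, sorted languages).

```python
from collections import OrderedDict


class LRUCache:
    """Keep at most max_entries items, evicting the least recently used."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # drop the oldest entry

    def __len__(self):
        return len(self._data)


def summary_key(name, languages):
    """Sorting makes the key independent of the order languages were picked."""
    return (name, tuple(sorted(languages)))
```

Sorting the languages before building the key means `?languages=swe,dan` and `?languages=dan,swe` share one cache entry.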

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* API: drop redundant schema caches, off-thread serialize, single-flight bytes

Follow-up to the bytes-cache commit. The three schema caches
(_per_language_schemas, _task_score_schemas, _model_score_schemas) and
the language-scoped _summary_lang_schemas were write-only once the bytes
cache landed — only get_*_bytes ever read them. Drop them; serialise
straight into the bytes cache.

Correctness fixes:
- gzip.compress now runs inside asyncio.to_thread; was blocking the
  event loop ~50-100ms per multi-MB cold build.
- Each get_*_bytes acquires a per-key asyncio.Lock so two concurrent
  cold requests share the one build + serialise (the schema-build lock
  was protecting that half; the serialise half was racy).
- body_gzip is now bytes | None instead of "same object as body" —
  proper signal that there's no gzip variant for tiny payloads.
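
The single-flight pattern with off-thread work can be sketched as below. The `SingleFlightCache` class is hypothetical; it only illustrates a per-key `asyncio.Lock` with a re-check after acquisition, and `asyncio.to_thread` (Python 3.9+) for the blocking build.

```python
import asyncio


class SingleFlightCache:
    """Concurrent cold requests for one key share a single build."""

    def __init__(self):
        self._store = {}
        self._locks = {}

    async def get(self, key, builder):
        cached = self._store.get(key)
        if cached is not None:
            return cached  # warm path: no lock needed
        lock = self._locks.setdefault(key, asyncio.Lock())
        async with lock:
            # Re-check: another waiter may have filled the cache meanwhile.
            cached = self._store.get(key)
            if cached is None:
                # Run the blocking build (serialise, gzip) off the event loop.
                cached = await asyncio.to_thread(builder)
                self._store[key] = cached
            return cached
```

Without the re-check inside the lock, every waiter would rebuild after acquiring it, which is the race the commit describes for the serialise half.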

Other:
- Unify the four get_*_bytes paths via _cached_bytes(store, locks, key,
  builder); cuts ~40 lines.
- _prewarm_list_schemas runs its four builders on a thread pool.
- Drop _row_index_cache in aggregators (id()-keyed; latent recycling
  hazard if eviction is ever added). The replacement linear scan over
  ~500 rows x 50 benchmarks runs in microseconds, no bench regression.
- benchmark_scores reuses _as_tuple instead of inlining comma-split.
- Stale "ETagMiddleware" reference in routes.py module docstring updated.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* API: 304 includes Vary, leaders metric before build, typed bytes locks

Three small fixes flagged in the post-commit review:

- 304 responses now include `Vary: accept-encoding` so intermediate
  caches key correctly across encoding variants (RFC 7232).
- benchmark_leaders increments its Prometheus counter before the build,
  consistent with every other route handler; failed builds no longer
  silently drop the metric.
- Replace the stringly-keyed `_bytes_locks` dict-of-dicts with five
  named module-level lock dicts. A typo in a key now becomes a NameError
  instead of silently allocating a fresh empty dict (which would have
  broken single-flight without any signal).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix vidore aggregation

* fix benchmarks visibility on model cards

* upds

* tmp

* update lock

* upd test

* fix typecheck

* start refactor

* start refactor

* simplify schemas

* simplify schemas

* simplify imports

* api: speed up table generation + disk-cache the per-benchmark split

Builders:
- SummaryTable wrapper carries rank/primary/public/private column
  pointers; aggregator reads via pointers instead of column-name guessing.
- Long-form Borda (rank().over partition) replaces the N-column horizontal
  expression tree.
- Inner-join metadata attach replaces map_batches + per-row Python resolver.
- mean_horizontal(ignore_nulls=False), lazy fusion of select+with_columns,
  sort after attach (skip filtered rows), shared per-task pivot between
  summary and per-task builders.
- Canonical column names: Model holds org/name, type columns stay
  CamelCase; leaderboard styler humanises + wraps in markdown at display.
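
To make the long-form Borda idea concrete, here is a plain-Python sketch over `(model, task, score)` rows. The point scheme (one point per competitor strictly beaten on a task) and the name-based tie-break are assumptions; the actual builder uses a polars `rank().over` expression whose tie handling is not described here.

```python
from collections import defaultdict


def borda_rank(long_rows):
    """long_rows: iterable of (model, task, score). Returns {model: rank}."""
    by_task = defaultdict(list)
    for model, task, score in long_rows:
        by_task[task].append((model, score))

    points = defaultdict(int)
    for entries in by_task.values():
        for model, score in entries:
            # One point per competitor this model strictly beats on the task.
            points[model] += sum(1 for _, other in entries if other < score)

    # Highest total first; ties broken by name to keep the result stable.
    ordered = sorted(points.items(), key=lambda kv: (-kv[1], kv[0]))
    return {model: i + 1 for i, (model, _) in enumerate(ordered)}
```

Working on long rows means adding a task adds rows rather than another column to a horizontal expression, which is the structural change the commit describes.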

Split:
- Parallel join in _split_by_benchmark_tasks (~12s -> ~9s).

Disk cache:
- frames.py persists per-benchmark + unified frames to
  ~/.cache/mteb/leaderboard/ after first build; invalidated by HF dataset
  commit SHA. Warm restart 40s -> 5s.
- Dockerfile bakes the cache into the base layer so runtime first request
  reads from disk instead of downloading + splitting.

Observability:
- INFO logs around each warmup phase with durations.
- create_app wires logging.basicConfig so warmup logs surface alongside
  uvicorn output. New settings: LOG_LEVEL, DISK_CACHE, PRELOAD_CONCURRENCY.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* api: fix correctness bugs + consolidate table builders

Correctness:
- frames.py: _rebuild_from_full_repository() was called with no args,
  triggering TypeError on the no-disk-cache fallback path.
- aggregators.py: _pick_leader had a duplicated `r.total_params_b is None`.
- benchmark.py: RtebBenchmark renamed Retrieval -> Mean (Task) then
  immediately dropped Mean (Task) — collapsed to one drop+rename chain.
- schemas.py: BenchmarkSchema.language_view lost dedupe; restored.
- routes.py: robots.txt docstring sync'd to the new Allow body.
- aggregators.py: build_benchmark_per_language now offloads polars work
  to asyncio.to_thread so cold misses don't pin the event loop.
- aggregators.py: row[col] -> row.get(col) for optional mean cols.

Cache + concurrency:
- routes.py: _leader_bytes bounded with LRU eviction (was unbounded).
- frames.py: atomic disk-cache write — .tmp + Path.replace per shard,
  manifest atomic-swapped, stale sweep happens AFTER swap.
- warmup.py + app.py: preload runs as asyncio.create_task on the
  serving loop instead of a daemon thread with its own asyncio.run().
- icons.py: cache_clear() no longer wipes _fetch_locks.
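
The atomic disk-cache write (temporary file, `Path.replace`, manifest swapped last, stale sweep afterwards) can be sketched as follows. The function names and the manifest fields are hypothetical, and the atomicity of `Path.replace` assumes the temporary file and the target sit on the same filesystem.

```python
import json
from pathlib import Path


def atomic_write_bytes(target: Path, data: bytes) -> None:
    """Write to <name>.tmp, then rename over the target in one step."""
    target.parent.mkdir(parents=True, exist_ok=True)
    tmp = target.with_name(target.name + ".tmp")
    tmp.write_bytes(data)
    tmp.replace(target)


def write_cache(root: Path, shards: dict, commit_sha: str) -> None:
    """Write every shard, swap the manifest, then sweep stale files."""
    for name, data in shards.items():
        atomic_write_bytes(root / name, data)

    manifest = {"commit_sha": commit_sha, "shards": sorted(shards)}
    atomic_write_bytes(root / "manifest.json", json.dumps(manifest).encode())

    # Sweep only after the manifest points at the new shards, so a crash
    # mid-write never leaves a manifest that references deleted files.
    keep = set(shards) | {"manifest.json"}
    for path in root.iterdir():
        if path.is_file() and path.name not in keep:
            path.unlink()
```

A reader that opens the manifest at any moment sees either the old complete set or the new complete set, never a half-written shard.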

Table builders:
- new _STANDARD_META_COLS + _order_summary_cols replace 5 copies of
  the final column-ordering boilerplate.
- _build_joint_with_type_means_and_borda shared by mean_task +
  mean_task_type builders.
- _PublicPrivateBuild dataclass + _build_public_private_joint shared
  by mean_public_private and Vidore; Vidore's wrapper inlined.
- per-task table + mean_subset migrated to _borda_rank_from_long;
  only Benchmark.to_dataframe still uses the wide-form _get_borda_rank.
- leaderboard/table.py: deleted dead pandas Borda helpers.

API helpers:
- aggregators.py: _per_task_rows_and_cols, _filter_long_df_by_languages,
  _read_row_metrics slice the 140-line build_benchmark_summary.
- aggregators.py: _extract_trained_on_map is one polars groupby instead
  of a per-row setdefault loop; build_task_scores derives all_subsets
  while filling seen.
- cache.py: _cache_or_build generic single-flight helper; _cached_bytes
  and get_summary are thin wrappers. summary-schema cache now emits
  hit/miss metrics.
- routes.py: _serialize_schemas + _safe_load_frames fold the per-list
  and per-map boilerplate; _require_task / _require_model helpers +
  dropped dead try/except KeyError in model_scores.
- routes.py: deleted deprecated /benchmarks/{name}/summary alias.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* make zeroshot summary optional

* fix column

* api: consolidate cache layer + tighten aggregator helpers

Bundle each single-flight cache (store + locks + LRU cap + metric label)
into a CacheLayer dataclass so the 11 module globals collapse to 6
instances and helper signatures lose 3 args. Share the ResultCache root
for the leaderboard disk cache so MTEB_CACHE overrides apply uniformly,
and promote the JSON Cache-Control max-age to a settings knob
(HTTP_MAX_AGE) so dev hard refreshes can opt out of browser caching.
Aggregators get smaller too: inlined _read_row_metrics, dropped
redundant float() coercions, and renamed lenient_means to
language_filtered to match its definition.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* upd lock

* fix

* Apply suggestions from code review

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* fix old lb

* ci: build API image from local source + consolidate workflows

Replace the Dockerfile's git-clone-then-install path with a COPY from
the build context (filtered by a new .dockerignore) so CI tests the
checkout under review instead of whatever's already on the upstream
branch. Add api_docker.yml — builds the image, polls /health for up to
2 min, and publishes ghcr.io/<repo>/api:{sha,latest} on main. Drop the
two old docker-test workflows (leaderboard_docker.yml,
hf_space_docker.yml) and strip leaderboard_refresh.yaml down to the
HF Space rebuild curl (publishing now lives in api_docker.yml).
Healthcheck points at the real backend /health endpoint.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* add new version to multilingual

* simplify metrics

* use aggregations directly

* add descriptive stats

* add proper typehints for router

* expose split axis on /tasks/{name}/scores

The unified results frame previously collapsed (model, task, subset,
split) → (model, task, subset) via `max(score)` before serialising, so
the leaderboard could never show per-split scores even though tasks
like MassiveIntentClassification evaluate on multiple splits.

Plumbs `split` through:
- `_UNIFIED_SCHEMA` carries `split`; `_dedupe_unified` groups by
  `(model, task, split, subset)`, deduping only across rerun rows.
- New `_CACHE_SCHEMA_VERSION = 2`, written into and validated against
  `manifest.json`. Stale disk caches from before the bump are rebuilt
  on next boot.
- `TaskScoreRowSchema.subset_scores` becomes
  `dict[str, dict[str, float]]` (outer subset, inner split) so clients
  can pivot either axis off one payload.
- `TaskScoresSchema` adds a top-level `splits: list[str]` listing
  every split observed across models for the task.
- `build_task_scores` walks the deduped unified frame directly and
  populates the nested map. The per-row `score` rollup keeps the
  prior semantics — per-subset value is the max across splits the
  model ran, then mean across subsets when the model covers every
  subset — so existing leaderboard ranks don't shift just from
  surfacing the extra axis.
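
The rollup over the nested `subset -> split -> score` map can be illustrated with a short sketch. The `rollup_score` helper is hypothetical, and returning `None` for a model that misses a subset is an assumption; the commit only states what happens when every subset is covered.

```python
def rollup_score(subset_scores, all_subsets):
    """subset_scores: {subset: {split: score}} for one model on one task.

    Per-subset value is the max across the splits the model ran; the row
    score is the mean of those values when every subset is covered.
    """
    if not all_subsets:
        return None
    if any(not subset_scores.get(subset) for subset in all_subsets):
        return None  # assumed behaviour for partial subset coverage
    per_subset = [max(subset_scores[s].values()) for s in all_subsets]
    return sum(per_subset) / len(per_subset)
```

For `{"en": {"test": 0.5, "validation": 0.75}, "da": {"test": 0.25}}` the per-subset values are `0.75` and `0.25`, so the row score is `0.5`, matching what the earlier max-over-splits collapse would have produced.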

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* update stats

* rename

* update

* skip old dataset version

* fix docs error

* show only fully evaluated models #4826

* remove cells with missing scores

* rename `superseded_by`

* fix vidorev1&v2

* build docker before refresh

* simplify docker

* fix triggering rules

* simplify comments

* simplify a bit

* add missing file

* Update mteb/api/README.md

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

---------

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: AdnanElAssadi <aassadi22@ku.edu.tr>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>