Add Summarization task by NouamaneTazi · Pull Request #11 · embeddings-benchmark/mteb

NouamaneTazi · 2022-06-22T11:56:59Z

No description provided.

* feat: add xmarket es dataset * refactor: use multilingual dataset * fix: update revision id * refactor: add constant for language * feat: add two clustering datasets Signed-off-by: jupyterjazz <saba.sturua@jina.ai> * feat: import classes Signed-off-by: jupyterjazz <saba.sturua@jina.ai> * refactor: flores dataset Signed-off-by: jupyterjazz <saba.sturua@jina.ai> * feat: add miracl reranking task for spanish * feat: use hf repo with all reranking langs * feat: update revision hash * refactor: use description for language * feat: add stses task * fix: get scores from label column * refactor: add revision to data loading * Added spanish passage retrieval * feat: mintaka and xpqa retrieval tasks Signed-off-by: jupyterjazz <saba.sturua@jina.ai> * feat: import classes Signed-off-by: jupyterjazz <saba.sturua@jina.ai> * fix: typo in data loading * fix: id Signed-off-by: jupyterjazz <saba.sturua@jina.ai> * refactor: try out multilingual task Signed-off-by: jupyterjazz <saba.sturua@jina.ai> * refactor: multilingual task import Signed-off-by: jupyterjazz <saba.sturua@jina.ai> * refactor: cmon man Signed-off-by: jupyterjazz <saba.sturua@jina.ai> * refactor: go back to monolingual tasks Signed-off-by: jupyterjazz <saba.sturua@jina.ai> * refactor: remove unused import Signed-off-by: jupyterjazz <saba.sturua@jina.ai> * refactor: loading logic Signed-off-by: jupyterjazz <saba.sturua@jina.ai> * feat: add miracl as retrieval task * fix: nested corpus * refactor: get lang from description * Update mteb/tasks/Retrieval/MIRACLRetrieval.py Co-authored-by: Michael Günther <michael.guenther@jina.ai> * feat: allow multlingual reranking tasks * feat: make miraclreranking multilingual * refactor: rename miraclretrieval Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * style: add missing eof empty line * feat: make xmarket retrieval task multilingual * refactor: rename xmarket * refactor: turn spanish tasks multilingual (#11) * refactor: make xpqa retrieval multilingual * fix: 
formatting of xpqa dataset * refactor: make mintaka into multilingual task * refactor: make miracl retrieval multilingual * feat: add revision ids for hf datasets * refactor: remove patool * Update mteb/tasks/Reranking/__init__.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/tasks/STS/__init__.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> --------- Signed-off-by: jupyterjazz <saba.sturua@jina.ai> Co-authored-by: guenthermi <guenthermi50@gmail.com> Co-authored-by: jupyterjazz <saba.sturua@jina.ai> Co-authored-by: Markus Krimmel <markus.krimmel@jina.ai> Co-authored-by: Michael Günther <michael.guenther@jina.ai> Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>

* add Masakhane dataset config * add trigram lang code for dataset who use it * create french script eval * fix French word * add some documentation * add script to process and upload alloprof on HF * build script for HF * adding dataset processing for mteb * add script to process and upload alloprof on HF * build script for HF * adding dataset processing for mteb * refactor few thing * remove whitespaces * 4 pair classification (#10) * add Opusparcus dataset * multilingual usage * use eval_split of config files * change eval_split according to data --------- Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr> * add script to process and upload alloprof on HF * build script for HF * adding dataset processing for mteb * refactor few thing * remove whitespaces * Clustering with HAL S2S dataset (#11) HAL S2S dataset creation and evaluation on clustering task. * adding BSARD dataset * add BSARD to benchmark * adding Hagrid dataset * DiaBLa and Flores Bitext Mining evaluation (#12) * Add DiaBLa dataset for bitext mining * Add DiaBLa dataset for bitext mining * deduplicate bitext task * add Flores * format files * add flores to evaluation script * remove prints * add revision --------- Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr> * add script to process and upload alloprof on HF * build script for HF * adding dataset processing for mteb * refactor few thing * remove whitespaces * adding dataset processing for mteb * adding BSARD dataset * add BSARD to benchmark * adding Hagrid dataset * fix change on langmapping * reset alphabetical order * add revision handling * Clustering: Add AlloProf dataset (#17) AlloProf dataset for clustering task * handling of revision * change split + add revision handling * add script to process and upload alloprof on HF * build script for HF * adding dataset processing for mteb * refactor few thing * remove whitespaces * adding dataset processing for mteb * adding BSARD dataset * add BSARD to benchmark * adding Hagrid 
dataset * add script to process and upload alloprof on HF * adding dataset processing for mteb * refactor few thing * reset alphabetical order * add revision handling * handling of revision * change split + add revision handling * use eval variable * alphabetic order * Add MLSUM dataset for clustering task (#21) * Use Masakhane dataset for clustering task (#23) * 16 add datasets to readmemd (#18) * run task table * run task table * Add MLSUM dataset for clustering task (#21) * Use Masakhane dataset for clustering task (#23) * run task table * refresh readme * refresh readme * run task table * refresh readme --------- Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr> Co-authored-by: Marion Schaeffer <92590517+schmarion@users.noreply.github.com> * load only test split (#25) Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr> * Update mteb/tasks/BitextMining/DiaBLaBitextMining.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/tasks/Clustering/HALClusteringS2S.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * renaming masakhane (#28) Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr> * Syntec dataset addition (#26) * add scrpit to process & load to HF * add script to enable download of data from HF * add syntec dataset files to gitignore * add syntecretrieval * add syntec retrival * build dataloading script * remove datasets * correct typo --------- Co-authored-by: Sequeira Gabriel <gabriel.sequeira@outlook.fr> * 30 add syntec reranking (#31) * change name to secify retrieval * add reranking tasks * create script to upload dataset fo reranking task * create reranking task * add reranking tasks * add model name in description * SummEval translated to french (#32) * 7 sts (#33) * taike into account multilingual tasks * add stsbenchmark multilingual dataset * add STS tasks * taike into account multilingual tasks * add stsbenchmark multilingual dataset * add STS tasks * add coma * Adding sick fr dataset to 
sts tasks (#34) * Adding sick fr dataset to sts tasks * modifying dataset in load function to have the right column names * Fix alloprof dataset (#36) * change revision to use * remove duplicate data * change main metric because dataset is hard (#37) * Fix alloprof dataset (#40) * change revision to use * remove duplicate data * change revision * handle queries train test split * change dataset creation method * change revision * handle queries train test split * change dataset creation method * Fix DiaBLa by inheriting CrossLingual class (#42) * Fix DiaBLa by inheriting CrossLingual class * remove remaining print * Fix DiaBLa integration * Update mteb/tasks/BitextMining/FloresBitextMining.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update README.md Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update README.md Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/tasks/Classification/MasakhaNEWSClassification.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update README.md Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update README.md * Update mteb/tasks/BitextMining/FloresBitextMining.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/evaluation/MTEB.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/abstasks/AbsTaskPairClassification.py Co-authored-by: Imene Kerboua <33312980+imenelydiaker@users.noreply.github.com> * Update README.md * Update scripts/data/syntec/create_data_reranking.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update scripts/data/alloprof/create_data_reranking.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update scripts/run_mteb_french.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update scripts/run_mteb_french.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/evaluation/MTEB.py Co-authored-by: Niklas 
Muennighoff <n.muennighoff@gmail.com> * Update mteb/evaluation/MTEB.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/tasks/Retrieval/HagridRetrieval.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/tasks/Clustering/MLSUMClusteringP2P.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/tasks/Clustering/MLSUMClusteringS2S.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/tasks/Clustering/MasakhaNEWSClusteringP2P.py * Update mteb/tasks/Clustering/MasakhaNEWSClusteringS2S.py * Update mteb/tasks/STS/SickFrSTS.py * Inherit OpusparcusPC init from MultilingualTask * remove unnecessary init * Remove train split from evaluation on MasakhaNEWSClassification (#52) remove train split from evaluation * put script on HF dataset repos (#56) * put script on HF dataset repos * remove scripts * 49 fix dictionnary in syntecretrieval (#54) * add trust remote code arg * leave corpus as dict * remove trust remote code * add Tatoeba & BUCC BitextMining tasks (#57) add bucc and tatoeba bitextmining tasks * 46 add other languages to masakhaneweclusterings2s and p2p (#58) * add other language to clustering tasks * fix main score and S2S task * update run fr becnhmark script * Update run_mteb_french.py * Update AbsTaskClustering.py * remove train and validation splits --------- Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr> Co-authored-by: Marion Schaeffer <92590517+schmarion@users.noreply.github.com> Co-authored-by: mciancone@openstudio.fr <mciancone@openstudio.fr> Co-authored-by: Imene Kerboua <33312980+imenelydiaker@users.noreply.github.com> Co-authored-by: mciancone <73994289+Sunalwing@users.noreply.github.com> Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> Co-authored-by: wissam-sib <36303760+wissam-sib@users.noreply.github.com> Co-authored-by: Wissam Siblini <wissam.siblini92@gmail.com>

* add Masakhane dataset config * add trigram lang code for dataset who use it * create french script eval * fix French word * add some documentation * add script to process and upload alloprof on HF * build script for HF * adding dataset processing for mteb * add script to process and upload alloprof on HF * build script for HF * adding dataset processing for mteb * refactor few thing * remove whitespaces * 4 pair classification (#10) * add Opusparcus dataset * multilingual usage * use eval_split of config files * change eval_split according to data --------- Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr> * add script to process and upload alloprof on HF * build script for HF * adding dataset processing for mteb * refactor few thing * remove whitespaces * Clustering with HAL S2S dataset (#11) HAL S2S dataset creation and evaluation on clustering task. * adding BSARD dataset * add BSARD to benchmark * adding Hagrid dataset * DiaBLa and Flores Bitext Mining evaluation (#12) * Add DiaBLa dataset for bitext mining * Add DiaBLa dataset for bitext mining * deduplicate bitext task * add Flores * format files * add flores to evaluation script * remove prints * add revision --------- Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr> * add script to process and upload alloprof on HF * build script for HF * adding dataset processing for mteb * refactor few thing * remove whitespaces * adding dataset processing for mteb * adding BSARD dataset * add BSARD to benchmark * adding Hagrid dataset * fix change on langmapping * reset alphabetical order * add revision handling * Clustering: Add AlloProf dataset (#17) AlloProf dataset for clustering task * handling of revision * change split + add revision handling * add script to process and upload alloprof on HF * build script for HF * adding dataset processing for mteb * refactor few thing * remove whitespaces * adding dataset processing for mteb * adding BSARD dataset * add BSARD to benchmark * adding Hagrid 
dataset * add script to process and upload alloprof on HF * adding dataset processing for mteb * refactor few thing * reset alphabetical order * add revision handling * handling of revision * change split + add revision handling * use eval variable * alphabetic order * Add MLSUM dataset for clustering task (#21) * Use Masakhane dataset for clustering task (#23) * 16 add datasets to readmemd (#18) * run task table * run task table * Add MLSUM dataset for clustering task (#21) * Use Masakhane dataset for clustering task (#23) * run task table * refresh readme * refresh readme * run task table * refresh readme --------- Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr> Co-authored-by: Marion Schaeffer <92590517+schmarion@users.noreply.github.com> * load only test split (#25) Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr> * Update mteb/tasks/BitextMining/DiaBLaBitextMining.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/tasks/Clustering/HALClusteringS2S.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * renaming masakhane (#28) Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr> * Syntec dataset addition (#26) * add scrpit to process & load to HF * add script to enable download of data from HF * add syntec dataset files to gitignore * add syntecretrieval * add syntec retrival * build dataloading script * remove datasets * correct typo --------- Co-authored-by: Sequeira Gabriel <gabriel.sequeira@outlook.fr> * 30 add syntec reranking (#31) * change name to secify retrieval * add reranking tasks * create script to upload dataset fo reranking task * create reranking task * add reranking tasks * add model name in description * SummEval translated to french (#32) * 7 sts (#33) * taike into account multilingual tasks * add stsbenchmark multilingual dataset * add STS tasks * taike into account multilingual tasks * add stsbenchmark multilingual dataset * add STS tasks * add coma * Adding sick fr dataset to 
sts tasks (#34) * Adding sick fr dataset to sts tasks * modifying dataset in load function to have the right column names * Fix alloprof dataset (#36) * change revision to use * remove duplicate data * change main metric because dataset is hard (#37) * Fix alloprof dataset (#40) * change revision to use * remove duplicate data * change revision * handle queries train test split * change dataset creation method * change revision * handle queries train test split * change dataset creation method * Fix DiaBLa by inheriting CrossLingual class (#42) * Fix DiaBLa by inheriting CrossLingual class * remove remaining print * Fix DiaBLa integration * Update mteb/tasks/BitextMining/FloresBitextMining.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update README.md Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update README.md Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/tasks/Classification/MasakhaNEWSClassification.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update README.md Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update README.md * Update mteb/tasks/BitextMining/FloresBitextMining.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/evaluation/MTEB.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/abstasks/AbsTaskPairClassification.py Co-authored-by: Imene Kerboua <33312980+imenelydiaker@users.noreply.github.com> * Update README.md * Update scripts/data/syntec/create_data_reranking.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update scripts/data/alloprof/create_data_reranking.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update scripts/run_mteb_french.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update scripts/run_mteb_french.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/evaluation/MTEB.py Co-authored-by: Niklas 
Muennighoff <n.muennighoff@gmail.com> * Update mteb/evaluation/MTEB.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/tasks/Retrieval/HagridRetrieval.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/tasks/Clustering/MLSUMClusteringP2P.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/tasks/Clustering/MLSUMClusteringS2S.py Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> * Update mteb/tasks/Clustering/MasakhaNEWSClusteringP2P.py * Update mteb/tasks/Clustering/MasakhaNEWSClusteringS2S.py * Update mteb/tasks/STS/SickFrSTS.py * Inherit OpusparcusPC init from MultilingualTask * remove unnecessary init * Remove train split from evaluation on MasakhaNEWSClassification (#52) remove train split from evaluation * put script on HF dataset repos (#56) * put script on HF dataset repos * remove scripts * 49 fix dictionnary in syntecretrieval (#54) * add trust remote code arg * leave corpus as dict * remove trust remote code * add Tatoeba & BUCC BitextMining tasks (#57) add bucc and tatoeba bitextmining tasks * 46 add other languages to masakhaneweclusterings2s and p2p (#58) * add other language to clustering tasks * fix main score and S2S task * update run fr becnhmark script * Update run_mteb_french.py * Update AbsTaskClustering.py * remove train and validation splits * remove Hagrid (#60) --------- Co-authored-by: Gabriel Sequeira <gsequeira@openstudio.fr> Co-authored-by: Marion Schaeffer <92590517+schmarion@users.noreply.github.com> Co-authored-by: mciancone@openstudio.fr <mciancone@openstudio.fr> Co-authored-by: Sequeira Gabriel <gabriel.sequeira@outlook.fr> Co-authored-by: Imene Kerboua <33312980+imenelydiaker@users.noreply.github.com> Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com> Co-authored-by: wissam-sib <36303760+wissam-sib@users.noreply.github.com> Co-authored-by: Wissam Siblini <wissam.siblini92@gmail.com>

The mean-task-type summary builder (MIEB, ViDoRe-style) writes both "Rank" (the primary rank, assigned in sort-by-Mean(Task) order) and "Rank (Borda)" (kept for backward compatibility). The adapter looked up "Rank (Borda)" first, so MIEB rows were labelled with Borda ranks even though the rows themselves were mean-sorted: the actual top model (jina-embeddings-v5-omni-small at mean 0.65) showed up as rank #11 on the home leaderboard. This change switches the lookup order to ("Rank", "Rank (Mean Task)", "Rank (Borda)") so that the explicit primary rank wins when present. Standard builders emit only "Rank (Borda)", so they still pick it up via the fallback.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
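The lookup order described above can be sketched as follows. This is a minimal illustration, not the adapter's actual code: `pick_rank`, `RANK_COLUMNS`, and the sample rows are hypothetical names and data, and rows are assumed to be plain dicts keyed by column name.

```python
from typing import Any, Mapping, Optional

# Lookup order from the commit message: explicit primary rank first, Borda last.
RANK_COLUMNS = ("Rank", "Rank (Mean Task)", "Rank (Borda)")


def pick_rank(row: Mapping[str, Any]) -> Optional[int]:
    """Return the first non-null rank in RANK_COLUMNS order (hypothetical helper)."""
    for column in RANK_COLUMNS:
        value = row.get(column)
        if value is not None:
            return int(value)
    return None


# A mean-task-type row carries both columns; the primary "Rank" wins.
mieb_row = {"Model": "model-a", "Rank": 1, "Rank (Borda)": 11}
# A standard builder row only has the Borda rank, so the fallback applies.
standard_row = {"Model": "model-b", "Rank (Borda)": 3}

print(pick_rank(mieb_row))      # 1
print(pick_rank(standard_row))  # 3
```

With the previous order (Borda first), `mieb_row` would have been labelled 11 even though it sorts first by mean.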

* change to polars leaderboard * WIP changes * speedup * upds * fixes * rewrite mostly to polars * fixes * more speedup * lint a bit * fix init * fix tests * fix tests * fix typing * refactor to private * Add mteb/api FastAPI service and HF Space Dockerfile New mteb/api subpackage exposes the leaderboard data as a FastAPI service backed by ResultCache + the existing polars summary builders. Routes mirror the SvelteKit frontend's data needs: benchmark menu, benchmark detail, and prerendered summary tables. CORS origins, preload, and cache locations come from settings. Dockerfile clones mteb@api, installs .[api], and serves uvicorn on :7860 as UID 1000 — drop-in for a Hugging Face Space. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix typing * Annotate cors_origins with NoDecode so env strings parse pydantic-settings' EnvSettingsSource tries to json.loads any field it considers complex *before* invoking field_validators, which made the documented comma-separated MTEB_API_CORS_ORIGINS format crash with JSONDecodeError at app startup inside the HF Space. NoDecode skips that pre-parse step and lets the existing field_validator split on commas as advertised. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Bust Docker layer cache when the api branch advances `RUN git clone` always produces the same layer hash because the command string never changes, so HF Spaces was rebuilding the image on top of a stale checkout — the cors_origins NoDecode fix never made it into the running container. Pull the latest commit SHA from GitHub via ADD just before the clone; ADD invalidates the layer whenever the response body changes, which forces a fresh clone per push. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Move leaderboard_parquet_path onto ResultCache The api module needed only this one-line helper from mteb.leaderboard.app, but importing it pulled in gradio, pandas, and cachetools — none of which belong in the [api] extra. 
Promoting it to a property on ResultCache lets every consumer (api, leaderboard, bench script) reach the path without dragging the Gradio stack into the API container. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Pre-fetch mteb/results HF dataset at build time Drops the cold-start cost of cloning the GitHub results repo on first request by pulling the same data from huggingface.co/datasets/mteb/results during image build. Goes into the default huggingface_hub cache under HF_HOME so callers reach it via the standard hub APIs. The download is guarded with `|| true` so it stays non-fatal while the dataset is still being populated upstream — the API just falls back to the GitHub clone on first request. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Switch leaderboard cache loader to per-config HF dataset layout The results-repo sync now pushes one HF dataset config per benchmark (plus a ``default`` config holding every result, deduped). Rewires the API consumer to match: * ``_load_from_hub`` enumerates configs and ``load_dataset(name=cfg, split='train')`` each. A failure on one config no longer poisons the whole load. * ``_load_per_benchmark_frames`` collapses to two paths — hub or cold rebuild — and returns a ``(per_benchmark, all_results)`` tuple instead of the ``_LoadedFrames`` dataclass. The two named wrappers (``get_all_benchmark_frames`` / ``get_all_results_df``) go away; callers destructure inline. * Hub-supplied ``default`` config short-circuits the per-benchmark concat for the unified view. Other follow-ups: * ``BenchmarkResults`` gains ``load_leaderboard_frame`` and ``split_leaderboard_frame`` so loading the raw combined frame can be decoupled from splitting it. The new ``_split_by_benchmark_tasks`` filters via an inner join on ``(task_name, split, subset)`` tuples — off-spec subsets/splits no longer leak through to ``_create_summary_table``'s ``group_by(model_name, task_name).mean()``. 
* ``MTEB_API_CACHE_REPO`` moves to ``Settings`` alongside ``cors_origins`` / ``preload``; consumers go through ``settings.cache_repo()``. * /robots.txt added to silence Space probes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix import * fix repo name * fix results * benchmarks: add MVEB benchmark suite + leaderboard Video menu Adds the MVEB (Massive Video Embedding Benchmark) benchmark objects to main so the leaderboard and get_benchmark() can resolve them. The underlying tasks are already on main; this adds only the curated benchmark groupings and their registration. - benchmarks.py: MVEB (23 tasks), MVEB(text, video) (19), MVEB(video) (9), MVEB(beta, extended) (184, alias MVEB(extended)). - benchmarks/__init__.py: import + __all__ registration. - _leaderboard_menu.py: new "Video" group under General Purpose. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * add more parameters to schemas * add train on to cache results * Lifespan warmup, /scores route, per-task num_models Replace the deprecated `@on_event("startup")` hook with a FastAPI `lifespan` context manager. `warmup_blocking` is dispatched via `asyncio.to_thread` so its sync polars work runs without blocking the event loop, and uvicorn holds the listener until it returns — the first request lands on a fully warm cache. Heavy summary preloading stays gated behind `MTEB_API_PRELOAD=1` in a daemon thread. Add `num_models` to `TaskMetaSchema`, derived from the unified results frame by `routes._task_num_models_map()` and overlaid on every `/tasks` and `/tasks/{name}` response. Drives the new "Models evaluated" stat + sort on the frontend `/tasks` cards. Rename `/benchmarks/{name}/summary` to `/scores` as the canonical path (keep `/summary` as a hidden alias) and refresh the README. 
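The lifespan warmup described in the commit above follows a common pattern, sketched here with the standard library only. Only `warmup_blocking` and the use of `asyncio.to_thread` come from the commit message; the body of `warmup_blocking`, the `CACHE` dict, and `demo` are hypothetical stand-ins.

```python
import asyncio
import time
from contextlib import asynccontextmanager

CACHE: dict[str, str] = {}  # hypothetical stand-in for the real cache


def warmup_blocking() -> None:
    """Stand-in for the synchronous warmup; the real body is not shown in the source."""
    time.sleep(0.01)  # pretend to do blocking work
    CACHE["benchmarks"] = "ready"


@asynccontextmanager
async def lifespan(app):
    # Run the sync warmup in a worker thread so the event loop stays free,
    # and only yield (start serving) once it has finished.
    await asyncio.to_thread(warmup_blocking)
    yield
    CACHE.clear()  # shutdown side of the context manager


async def demo() -> str:
    # Drive the context manager by hand, as a server would around its run loop.
    async with lifespan(app=None):
        return CACHE["benchmarks"]


print(asyncio.run(demo()))  # ready
```

In a FastAPI app the same function would be passed as `FastAPI(lifespan=lifespan)`, which is the supported replacement for the deprecated startup event hook.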
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * benchmarks: mark MVEB suite beta; drop extended from leaderboard menu Addresses review: - Rename to beta variants: MVEB(beta), MVEB(video, beta), MVEB(text, video, beta) (consistent with MAEB/RTEB). Old names kept as aliases so get_benchmark("MVEB") etc. still resolve. - Leaderboard "Video" menu no longer displays MVEB(beta, extended). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * api: expose Benchmark.language_view through the schema Surfaces the per-language column list the frontend's "Performance per language" tab needs. Mirrors the existing pattern for benchmark.languages — codes are resolved via language_label() so the frontend renders "German" rather than "deu-Latn". Empty default collapses to None so the frontend treats "no language view" as a single missing-value check (and hides the tab entirely on benchmarks that didn't opt in). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * api: pick "Rank" over "Rank (Borda)" when both columns exist The mean-task-type summary builder (MIEB, ViDoRe-style) writes BOTH "Rank" (the primary, assigned in sort-by-Mean(Task) order) and "Rank (Borda)" (kept for back-compat). The adapter was falling back to "Rank (Borda)" first, so MIEB rows ended up labelled with Borda ranks even though the rows themselves were mean-sorted — the actual top model (jina-embeddings-v5-omni-small at mean 0.65) showed up as rank #11 on the home leaderboard. Switching the lookup order to ("Rank", "Rank (Mean Task)", "Rank (Borda)") so the explicit primary rank wins when present. Standard builders only emit "Rank (Borda)" so they still pick that up via fallback. 
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * benchmarks: customisable sort column + relabel MIEB's mean to TaskType Add `Benchmark.summary_sort_column: ClassVar[str | None] = None` — a new opt-in hook on every benchmark for choosing which polars column the summary frame is sorted by (and which populates the displayed `Rank`). `None` keeps the historical builder default; subclasses set it explicitly when they want a different sort. `_create_summary_table_mean_task_type` gains a `sort_by` arg that overrides the default (sort by `mean_column_name`). MIEB updated to declare its actual semantics: - aggregations = (MEAN_TASK_TYPE, TASK_TYPES) — the column previously labelled "Mean (Task)" is computed as mean-of-per-type-means, which is mathematically a Mean (TaskType). The misleading label was causing the frontend to recompute and disagree with the canonical value (jina on MIEB(eng) showed 58 instead of 65). - mean_column_name = "Mean (TaskType)" so the leaderboard column matches the actual aggregation. - summary_sort_column = "Mean (TaskType)" — sort by the renamed column (was sorting by it anyway, just plumbed through the new hook). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix * api: /benchmarks/{name}/leaders for slim home-page tiles Adds a new endpoint that returns one row per size bucket — the highest-scoring model in each [min, max) megaparameter range. Lets the leaderboard home page render its featured mini-tables without pulling the full /scores payload for every primary tile (several MB on multilingual benchmarks → a few hundred bytes). GET /benchmarks/{name}/leaders?buckets=[[0,500],[500,1000],[1000,5000],[5000,null]] `buckets` is a JSON-encoded array of `[min, max]` tuples in millions of parameters (`null` or omitted second element = open- ended top bucket). Backend converts each bucket's millions to billions internally to compare against `total_params_b`. 
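The bucket handling described above could look roughly like this. It is a sketch under stated assumptions: `parse_buckets`, `leader_per_bucket`, the `score` field, and the sample rows are hypothetical, and only the `buckets` format, the [min, max) semantics, the millions-to-billions conversion, and `total_params_b` come from the commit message.

```python
import json
from typing import Optional


def parse_buckets(raw: str) -> list[tuple[float, Optional[float]]]:
    """Parse the `buckets` query value into (min, max) pairs in billions."""
    buckets = []
    for entry in json.loads(raw):
        lo = entry[0]
        hi = entry[1] if len(entry) > 1 else None  # omitted max = open-ended
        # The query uses millions of parameters; rows store billions.
        buckets.append((lo / 1000, None if hi is None else hi / 1000))
    return buckets


def leader_per_bucket(rows, buckets):
    """Return the highest-scoring row in each [min, max) bucket, or None."""
    leaders = []
    for lo, hi in buckets:
        in_bucket = [
            r for r in rows
            if r["total_params_b"] >= lo and (hi is None or r["total_params_b"] < hi)
        ]
        leaders.append(max(in_bucket, key=lambda r: r["score"], default=None))
    return leaders


rows = [  # illustrative data only
    {"name": "a", "total_params_b": 0.3, "score": 0.6},
    {"name": "b", "total_params_b": 0.4, "score": 0.7},
    {"name": "c", "total_params_b": 7.0, "score": 0.8},
]
buckets = parse_buckets("[[0,500],[500,1000],[1000,5000],[5000,null]]")
leaders = leader_per_bucket(rows, buckets)
print([l["name"] if l else None for l in leaders])  # ['b', None, None, 'c']
```

Empty buckets yield `None` here, so a client can render a placeholder for size ranges with no evaluated model.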
Score selection prefers `mean_task` but falls back to `mean_task_type` for benchmark builders that only populate the latter (MIEB / ViDoRe-style) — without the fallback MIEB returned `leader: null` because its rows have null `mean_task`. Response is a slim `LeaderRowSchema` with `name / displayName / org / modelType / rank / meanTask / totalParamsB` — just enough for the frontend to render an `<org>/<name>` link + score. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * remove misc * fix docs * API: /v1 prefix, per-language endpoint, scoped task languages, home menu Routes / app layout - All app routes mounted under `/v1` (benchmarks, tasks, models, icon). Infra paths (`/health`, `/metrics`, robots, favicon) stay at root via `infra_router`. New endpoint - `GET /v1/benchmarks/{name}/per-language` returns one row per model with mean main_score keyed by human language label (matches the labels emitted on `Benchmark.languages` / `TaskMeta.languages`). Built from the long results frame: explode the `language` list → group by `(model_name, language)` → mean(score) → `language_label`. Same in-memory cache pattern as summary; warmed alongside summaries in `preload_summaries_in_background`. Scoped task languages - `scoped_task_meta_schema(task)` returns a TaskMetaSchema with `languages` derived from the instance's `task.languages` (post `filter_languages` / hf_subsets) instead of the metadata union. Benchmarks like MTEB(Scandinavian, v1) that pin tasks to a subset no longer surface the task's full unscoped language list (Burmese, etc.) in the per-benchmark filter sidebar. Schema changes - `TaskMetaSchema.main_score: str | None` — populated from `TaskMetadata.main_score` (e.g. `ndcg_at_10`, `cosine_spearman`). Surfaced on /tasks cards, /tasks/[name], and PerTaskTab tooltip. - `LeaderModelSchema` drops `display_name` + `org`; carries just `name` + `model_type`. Clients split on `/` as needed. - New `BenchmarkPerLanguageRowSchema` + `BenchmarkPerLanguageSchema`. 
Home menu - `HOME_BENCHMARK_ENTRIES` — flat 4-section layout (Language / Modality / Retrieval / Domain) reusing the curated benchmarks from `GP_BENCHMARK_ENTRIES + R_BENCHMARK_ENTRIES`. `/v1/benchmarks/menu` serves this for the leaderboardv2 home page. Gradio leaderboard keeps using `GP + R` unchanged. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * add lock * API: language-scoped /scores, lenient means, facet metadata - /benchmarks/{name}/scores accepts ?languages=...; long_df is pre-filtered to subsets whose language list intersects the picks (matched by raw ISO code OR display label) before bench._create_summary_table runs. - When languages is non-empty, per-row mean_task / mean_task_type are recomputed with lenient skipna (mean over present per-task scores) so partial-coverage models surface a score instead of '-' once the visible task slice shrinks. Unfiltered case keeps the canonical strict means so full-coverage models can't be outranked by partial peers. - Per-(name, sorted-languages) summary cache keyed on the picks tuple so the same selection on a refresh hits LRU. - ModelMetaSchema gains a `languages` field (display labels via language_label). Drives the /models language facet on the frontend. - BenchmarkSchema gains `simplified_task_types`. Drives the "Task group" facet on /benchmarks (5 buckets: retrieval / classification / pair-classification / clustering / semantic-similarity). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * API: generate + serve per-entity OG hero cards Add scripts/generate_og_images.py (Playwright) and a matplotlib-only alternate at scripts/generate_og_images_mpl.py. Both walk the mteb registry, render one PNG per benchmark / task / model using a shared template under scripts/og-template/, and write hash sidecars so re-runs skip unchanged entities. FastAPI mounts the rendered dir at /og with a long Cache-Control. 
The Dockerfile picks up the work in a builder stage that ships Chromium + Playwright; the runtime stage only inherits the pre-rendered PNG files, keeping the deployed image lean. Torch installs from the CPU index to avoid the CUDA wheels. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * API: OG slug matches JS encodeURIComponent + nested model paths; open CORS to * Two interlocking 404s on the OG mount, plus a CORS gap that broke client-side share-card validators: - _slug used quote(safe=""), which over-encodes ()!*' relative to JavaScript's encodeURIComponent. Benchmark names like MTEB(eng, v2) wrote files at MTEB%28eng...%29.png while ShareMeta requested MTEB(eng...).png. Add safe="!*'()" to match. - Model names like microsoft/harrier-... were stored as flat org%2Fname.png. Starlette decodes %2F to / before file lookup, so every model card 404'd. Split on / and write to nested directories (org subdir per model); ensure the parent dir exists at plan time. - cors_origins defaulted to a four-host allowlist that blocked opengraph.xyz-style client-side previewers (no security gain — everything we serve is public). Default to ["*"]; env override replaces rather than merges. Drop the redundant MTEB_API_CORS_ORIGINS line from the Dockerfile so the image inherits the open default. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * tmp * OG cards: use simplified task group instead of raw type Task cards previously showed the raw type (BitextMining, STS, PairClassification, …) on the secondary badge + the "Type" stat. Switch to the simplified group (bitext-mining, semantic-similarity, pair-classification, …) so the OG hero matches the task-group chip on TaskCard and the type-badge on /tasks/[name] — same wording across every surface. Fall back to the raw type when simplified isn't populated. 
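The slug fix described in the OG commit above can be illustrated with a minimal sketch. `og_slug` and `og_relative_path` are hypothetical names (the commit only mentions `_slug`), and the `.png` directory layout is assumed from the description. The only library call used is the standard `urllib.parse.quote`.

```python
from urllib.parse import quote

# Characters that JavaScript's encodeURIComponent leaves unescaped but that
# quote(..., safe="") would percent-encode. quote() never escapes "-_.~".
_JS_SAFE = "!*'()"


def og_slug(segment: str) -> str:
    """Percent-encode one path segment the way encodeURIComponent would."""
    return quote(segment, safe=_JS_SAFE)


def og_relative_path(name: str) -> str:
    """Hypothetical helper: map an entity name to a nested PNG path.

    Splitting on "/" keeps the org as a real directory, so a server that
    decodes %2F before the file lookup still finds the file.
    """
    parts = [og_slug(part) for part in name.split("/")]
    return "/".join(parts) + ".png"


if __name__ == "__main__":
    print(og_relative_path("MTEB(eng, v2)"))
    print(og_relative_path("some-org/some-model"))
```

With `safe=""` the parentheses would be written as `%28` and `%29`, which is the mismatch the commit describes.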
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * API: overlay num_models on the benchmark detail endpoint `/v1/benchmarks/{name}` was returning the raw schema with the default `num_models=0`, while `/v1/benchmarks` and the menu both apply `_with_num_models`. The discrepancy bit the home-page primary tiles once any of them were removed from HOME_BENCHMARK_ENTRIES — the frontend's fallback fetches the detail endpoint directly and was getting "0 models" even for benchmarks with hundreds of evaluated models (e.g. RTEB(beta) with numModels=267 on /benchmarks but 0 on /benchmarks/RTEB(beta)). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * api: pick leaders by rank instead of by score `_pick_leader` was excluding rows whose `mean_task`/`mean_task_type` were both null, which on a partial-coverage benchmark (e.g. RTEB locally without SWEbenchCodeRetrieval) is every row — strict-skipna nulls the headline mean as soon as one task is missing. Result: the home-page bucket leaders went empty and tiles read "No size-bucketed data yet" even though /scores rendered 267 ranked rows for the same benchmark. Switch the picker to `row.rank` (lowest = best). The summary builder fills Borda rank in from per-task scores regardless of whether the strict aggregate survives, so every row in the bucket has a deter- ministic rank and the leader is always defined. The response still echoes back `mean_task` (or `mean_task_type`) so the tile's "Leader: … · {score}" line keeps printing a number when there is one. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * upd menu * API: pre-serialize + pre-gzip bytes cache, drop ETag middleware Caches Serialized(body, body_gzip, etag) per endpoint so warm requests skip pydantic-core JSON dumping, gzip compression, and SHA-1 hashing. _cached_json handles encoding negotiation, 304 revalidation, and a 4-hour Cache-Control inline; ETagMiddleware is gone (it buffered every response body just to compute an etag we now precompute once). 
In-process bench warm-request medians vs HEAD:

  endpoint                              before -> after    (speedup, payload)
  per_language(MTEB Multilingual, v2)   301ms  -> 6.3ms    (48x, 6.4MB)
  tasks_list                            30ms   -> 1.3ms    (23x, 2.1MB)
  scores(MTEB Multilingual, v2)         32ms   -> 1.5ms    (21x, 1.9MB)
  models_list                           11ms   -> 0.65ms   (17x, 922KB)
  scores(BEIR)                          9.1ms  -> 0.55ms   (17x, 588KB)

Other changes folded in:
- Per-name asyncio.Lock coalesces cold builds; OrderedDict LRU bounds the language-scoped summary cache (was unbounded).
- Parallelised warmup_blocking (parquet load + training-dataset prewarm + schema prewarm run on a 3-thread pool); prewarm_schema_caches now threads task/benchmark/model schema construction across 16 workers.
- Filter caches no longer key by name_query; substring searches reuse the warm full list and post-filter.
- model_construct on hot per-row schema builds skips pydantic validation on values produced internally.
- Aggressive trim of module/function docstrings; DRY'd _with_num_models helpers; flattened the language_view ternary; dropped a few dead branches (code != "Unknown", discriminator=None wrapper, etc.).
- mteb/benchmarks/benchmark.py: fix _get_benchmarks_on_leaderboard tuple wrap that broke startup against HOME_BENCHMARK_ENTRIES.
- scripts/bench_api_inproc.py: in-process httpx + ASGITransport bench with a --diff mode for before/after comparison.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* API: drop redundant schema caches, off-thread serialize, single-flight bytes

Follow-up to the bytes-cache commit. The three schema caches (_per_language_schemas, _task_score_schemas, _model_score_schemas) and the language-scoped _summary_lang_schemas were write-only once the bytes cache landed; only get_*_bytes ever read them. Drop them; serialise straight into the bytes cache.

Correctness fixes:
- gzip.compress now runs inside asyncio.to_thread; was blocking the event loop ~50-100ms per multi-MB cold build.
- Each get_*_bytes acquires a per-key asyncio.Lock so two concurrent cold requests share the one build + serialise (the schema-build lock was protecting that half; the serialise half was racy). - body_gzip is now bytes | None instead of "same object as body" — proper signal that there's no gzip variant for tiny payloads. Other: - Unify the four get_*_bytes paths via _cached_bytes(store, locks, key, builder); cuts ~40 lines. - _prewarm_list_schemas runs its four builders on a thread pool. - Drop _row_index_cache in aggregators (id()-keyed; latent recycling hazard if eviction is ever added). The replacement linear scan over ~500 rows x 50 benchmarks runs in microseconds, no bench regression. - benchmark_scores reuses _as_tuple instead of inlining comma-split. - Stale "ETagMiddleware" reference in routes.py module docstring updated. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * API: 304 includes Vary, leaders metric before build, typed bytes locks Three small fixes flagged in the post-commit review: - 304 responses now include ``Vary: accept-encoding`` so intermediate caches key correctly across encoding variants (RFC 7232). - benchmark_leaders increments its Prometheus counter before the build, consistent with every other route handler; failed builds no longer silently drop the metric. - Replace the stringly-keyed ``_bytes_locks`` dict-of-dicts with five named module-level lock dicts. A typo in a key now becomes a NameError instead of silently allocating a fresh empty dict (which would have broken single-flight without any signal). 
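The pre-serialized response path and the 304 fix above can be sketched roughly as follows. This is an illustration under assumptions, not the code from the commit. The `Serialized(body, body_gzip, etag)` shape, SHA-1 etag, 4-hour max-age and `Vary: accept-encoding` header come from the commit text. The `serialize` and `respond` names, the `min_gzip_size` threshold and the simplistic `Accept-Encoding` check are hypothetical.

```python
import gzip
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Serialized:
    body: bytes
    body_gzip: Optional[bytes]  # None signals "no gzip variant" (tiny payloads)
    etag: str


def serialize(body: bytes, min_gzip_size: int = 1024) -> Serialized:
    """Compute everything once so warm requests only pick a variant."""
    etag = '"' + hashlib.sha1(body).hexdigest() + '"'
    # Assumption: payloads below the threshold are not worth compressing.
    gz = gzip.compress(body) if len(body) >= min_gzip_size else None
    return Serialized(body=body, body_gzip=gz, etag=etag)


def respond(
    s: Serialized, accept_encoding: str = "", if_none_match: Optional[str] = None
) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, body) for one request."""
    headers = {
        "ETag": s.etag,
        "Vary": "accept-encoding",  # sent on 200 and on 304
        "Cache-Control": "max-age=14400",  # 4 hours
    }
    if if_none_match == s.etag:
        return 304, headers, b""
    # Simplified negotiation: ignores q-values such as "gzip;q=0".
    if s.body_gzip is not None and "gzip" in accept_encoding.lower():
        headers["Content-Encoding"] = "gzip"
        return 200, headers, s.body_gzip
    return 200, headers, s.body
```

A caller would build one `Serialized` per cache key on the cold path and then call `respond` on every request, so hashing and compression never run on the warm path.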
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix vidore aggregation * fix benchmarks visibility on model cards * upds * tmp * upadate lock * upd test * fix typecheck * start refactor * start refactor * simplify schemas * simplify schemas * simplify imports * api: speed up table generation + disk-cache the per-benchmark split Builders: - SummaryTable wrapper carries rank/primary/public/private column pointers; aggregator reads via pointers instead of column-name guessing. - Long-form Borda (rank().over partition) replaces the N-column horizontal expression tree. - Inner-join metadata attach replaces map_batches + per-row Python resolver. - mean_horizontal(ignore_nulls=False), lazy fusion of select+with_columns, sort after attach (skip filtered rows), shared per-task pivot between summary and per-task builders. - Canonical column names: Model holds org/name, type columns stay CamelCase; leaderboard styler humanises + wraps in markdown at display. Split: - Parallel join in _split_by_benchmark_tasks (~12s -> ~9s). Disk cache: - frames.py persists per-benchmark + unified frames to ~/.cache/mteb/leaderboard/ after first build; invalidated by HF dataset commit SHA. Warm restart 40s -> 5s. - Dockerfile bakes the cache into the base layer so runtime first request reads from disk instead of downloading + splitting. Observability: - INFO logs around each warmup phase with durations. - create_app wires logging.basicConfig so warmup logs surface alongside uvicorn output. New settings: LOG_LEVEL, DISK_CACHE, PRELOAD_CONCURRENCY. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * api: fix correctness bugs + consolidate table builders Correctness: - frames.py: _rebuild_from_full_repository() was called with no args, triggering TypeError on the no-disk-cache fallback path. - aggregators.py: _pick_leader had a duplicated `r.total_params_b is None`. 
- benchmark.py: RtebBenchmark renamed Retrieval -> Mean (Task) then immediately dropped Mean (Task) — collapsed to one drop+rename chain. - schemas.py: BenchmarkSchema.language_view lost dedupe; restored. - routes.py: robots.txt docstring sync'd to the new Allow body. - aggregators.py: build_benchmark_per_language now offloads polars work to asyncio.to_thread so cold misses don't pin the event loop. - aggregators.py: row[col] -> row.get(col) for optional mean cols. Cache + concurrency: - routes.py: _leader_bytes bounded with LRU eviction (was unbounded). - frames.py: atomic disk-cache write — .tmp + Path.replace per shard, manifest atomic-swapped, stale sweep happens AFTER swap. - warmup.py + app.py: preload runs as asyncio.create_task on the serving loop instead of a daemon thread with its own asyncio.run(). - icons.py: cache_clear() no longer wipes _fetch_locks. Table builders: - new _STANDARD_META_COLS + _order_summary_cols replace 5 copies of the final column-ordering boilerplate. - _build_joint_with_type_means_and_borda shared by mean_task + mean_task_type builders. - _PublicPrivateBuild dataclass + _build_public_private_joint shared by mean_public_private and Vidore; Vidore's wrapper inlined. - per-task table + mean_subset migrated to _borda_rank_from_long; only Benchmark.to_dataframe still uses the wide-form _get_borda_rank. - leaderboard/table.py: deleted dead pandas Borda helpers. API helpers: - aggregators.py: _per_task_rows_and_cols, _filter_long_df_by_languages, _read_row_metrics slice the 140-line build_benchmark_summary. - aggregators.py: _extract_trained_on_map is one polars groupby instead of a per-row setdefault loop; build_task_scores derives all_subsets while filling seen. - cache.py: _cache_or_build generic single-flight helper; _cached_bytes and get_summary are thin wrappers. summary-schema cache now emits hit/miss metrics. 
- routes.py: _serialize_schemas + _safe_load_frames fold the per-list and per-map boilerplate; _require_task / _require_model helpers + dropped dead try/except KeyError in model_scores. - routes.py: deleted deprecated /benchmarks/{name}/summary alias. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * make zeroshot summary optional * fix column * api: consolidate cache layer + tighten aggregator helpers Bundle each single-flight cache (store + locks + LRU cap + metric label) into a CacheLayer dataclass so the 11 module globals collapse to 6 instances and helper signatures lose 3 args. Share the ResultCache root for the leaderboard disk cache so MTEB_CACHE overrides apply uniformly, and promote the JSON Cache-Control max-age to a settings knob (HTTP_MAX_AGE) so dev hard refreshes can opt out of browser caching. Aggregators get smaller too: inlined _read_row_metrics, dropped redundant float() coercions, and renamed lenient_means to language_filtered to match its definition. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * upd lock * fix * Apply suggestions from code review Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> * fix old lb * ci: build API image from local source + consolidate workflows Replace the Dockerfile's git-clone-then-install path with a COPY from the build context (filtered by a new .dockerignore) so CI tests the checkout under review instead of whatever's already on the upstream branch. Add api_docker.yml — builds the image, polls /health for up to 2 min, and publishes ghcr.io/<repo>/api:{sha,latest} on main. Drop the two old docker-test workflows (leaderboard_docker.yml, hf_space_docker.yml) and strip leaderboard_refresh.yaml down to the HF Space rebuild curl (publishing now lives in api_docker.yml). Healthcheck points at the real backend /health endpoint. 
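The `CacheLayer` idea from the consolidation commit above (store + locks + LRU cap + metric label in one dataclass) might look something like the sketch below. Only the class name and the four bundled pieces come from the commit. The field names, the `get_or_build` method and the eviction details are assumptions made for illustration.

```python
import asyncio
from collections import OrderedDict
from dataclasses import dataclass, field
from typing import Awaitable, Callable


@dataclass
class CacheLayer:
    max_entries: int  # LRU cap
    label: str = "cache"  # metric label (unused in this sketch)
    store: "OrderedDict[str, bytes]" = field(default_factory=OrderedDict)
    locks: "dict[str, asyncio.Lock]" = field(default_factory=dict)

    async def get_or_build(
        self, key: str, builder: Callable[[], Awaitable[bytes]]
    ) -> bytes:
        """Single-flight lookup: concurrent cold requests share one build."""
        if key in self.store:
            self.store.move_to_end(key)  # mark as most recently used
            return self.store[key]
        # Per-key lock; locks are never evicted in this sketch.
        lock = self.locks.setdefault(key, asyncio.Lock())
        async with lock:
            if key in self.store:  # built while we were waiting
                self.store.move_to_end(key)
                return self.store[key]
            value = await builder()
            self.store[key] = value
            while len(self.store) > self.max_entries:
                self.store.popitem(last=False)  # drop least recently used
            return value
```

Checking the store a second time after acquiring the lock is what makes the second waiter reuse the first build instead of repeating it.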
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * add new version to multilingual * simplify metrics * use aggregations directly * add descriptive stats * add proper typehints for router * expose split axis on /tasks/{name}/scores The unified results frame previously collapsed (model, task, subset, split) → (model, task, subset) via `max(score)` before serialising, so the leaderboard could never show per-split scores even though tasks like MassiveIntentClassification evaluate on multiple splits. Plumbs `split` through: - `_UNIFIED_SCHEMA` carries `split`; `_dedupe_unified` groups by `(model, task, split, subset)`, deduping only across rerun rows. - New `_CACHE_SCHEMA_VERSION = 2`, written into and validated against `manifest.json`. Stale disk caches from before the bump are rebuilt on next boot. - `TaskScoreRowSchema.subset_scores` becomes `dict[str, dict[str, float]]` (outer subset, inner split) so clients can pivot either axis off one payload. - `TaskScoresSchema` adds a top-level `splits: list[str]` listing every split observed across models for the task. - `build_task_scores` walks the deduped unified frame directly and populates the nested map. The per-row `score` rollup keeps the prior semantics — per-subset value is the max across splits the model ran, then mean across subsets when the model covers every subset — so existing leaderboard ranks don't shift just from surfacing the extra axis. 
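The per-row `score` rollup described above (max across splits per subset, then mean across subsets when every subset is covered) can be sketched on the nested `subset_scores` map. The function name `rollup_score` is hypothetical, and returning `None` for partial coverage is an assumption, since the commit does not say what happens in that case.

```python
from typing import Dict, List, Optional


def rollup_score(
    subset_scores: Dict[str, Dict[str, float]], all_subsets: List[str]
) -> Optional[float]:
    """Collapse {subset: {split: score}} into one number.

    Per-subset value is the max across the splits the model ran; the row
    score is the mean of those values when every subset is covered.
    """
    per_subset = []
    for subset in all_subsets:
        splits = subset_scores.get(subset)
        if not splits:
            return None  # assumption: partial coverage yields no score
        per_subset.append(max(splits.values()))
    if not per_subset:
        return None
    return sum(per_subset) / len(per_subset)


if __name__ == "__main__":
    scores = {"en": {"test": 0.75, "validation": 0.5}, "de": {"test": 0.25}}
    print(rollup_score(scores, ["en", "de"]))
```

Because the outer key is the subset and the inner key is the split, a client can pivot on either axis from the same payload while the headline number stays unchanged.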
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* update stats
* rename
* update
* skip old dataset version
* fix docs error
* show only fully evaluated models #4826
* remove cells with missing scores
* rename `superseded_by`
* fix vidorev1&v2
* build docker before refresh
* simplify docker
* fix triggering rules
* simplify comments
* simplify a bit
* add missing file
* Update mteb/api/README.md

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>

---------

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: AdnanElAssadi <aassadi22@ku.edu.tr>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

NouamaneTazi added 3 commits June 24, 2022 15:26

add Summarization abstract task

f2b0e53

add SummEval task

3ba3e65

add more scores to summarization evaluator

12ae05f

NouamaneTazi force-pushed the summarization branch from 7aca7b8 to 12ae05f Compare June 24, 2022 13:27

NouamaneTazi merged commit bdb2691 into main Jun 24, 2022

NouamaneTazi deleted the summarization branch June 24, 2022 13:27

Add Summarization task #11

NouamaneTazi merged 3 commits into main from summarization

NouamaneTazi commented Jun 22, 2022

1 participant
