Releases: embeddings-benchmark/mteb

2.16.1

23 Jun 20:11

2.16.1 (2026-06-23)

Fix

  • fix: use text task prefixes for gemini-embedding-2; improve doc prefix (#4851) (b243378)

Unknown

  • Add vultr/VultronRetrieverCore-Qwen3.5-4.5B (#4850)

Add vultr/VultronRetrieverCore-Qwen3.5-4.5B ModelMeta (04d5dd6)

  • Change leaderboard to refresh (#4848) (64576f4)

  • Add vultr/VultronRetrieverFlash-Qwen3.5-0.8B (#4845)

Add vultr/VultronRetrieverFlash-Qwen3.5-0.8B ModelMeta

Signed-off-by: Athrael Soju <athrael.soju@gmail.com> (fe5f0d0)

  • fix types in api schemas (#4843)

  • fix types in api schemas

  • fix typing

  • retrigger ci (d7b54b0)

2.16.0

21 Jun 12:12

2.16.0 (2026-06-21)

Feature

  • feat: Add mteb/api FastAPI service for new leaderboard (#4760)

  • change to polars leaderboard

  • WIP changes

  • speedup

  • upds

  • fixes

  • rewrite mostly to polars

  • fixes

  • more speedup

  • lint a bit

  • fix init

  • fix tests

  • fix tests

  • fix typing

  • refactor to private

  • Add mteb/api FastAPI service and HF Space Dockerfile

New mteb/api subpackage exposes the leaderboard data as a FastAPI
service backed by ResultCache + the existing polars summary builders.
Routes mirror the SvelteKit frontend's data needs: benchmark menu,
benchmark detail, and prerendered summary tables. CORS origins,
preload, and cache locations come from settings.

Dockerfile clones mteb@api, installs .[api], and serves uvicorn on
:7860 as UID 1000 — drop-in for a Hugging Face Space.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

  • fix typing

  • Annotate cors_origins with NoDecode so env strings parse

pydantic-settings' EnvSettingsSource tries to json.loads any field it
considers complex before invoking field_validators, which made the
documented comma-separated MTEB_API_CORS_ORIGINS format crash with
JSONDecodeError at app startup inside the HF Space. NoDecode skips
that pre-parse step and lets the existing field_validator split on
commas as advertised.
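
The failure mode described above can be reproduced with the standard library alone: a comma-separated string is not valid JSON, so any JSON pre-parse raises before a validator gets a chance to split it. The sketch below is illustrative only; `RAW`, `split_origins`, and `json_preparse_fails` are hypothetical names, not the mteb or pydantic-settings code.

```python
import json

# Hypothetical value of MTEB_API_CORS_ORIGINS in the documented comma-separated format.
RAW = "https://a.example,https://b.example"


def split_origins(raw: str) -> list:
    """Hypothetical stand-in for the comma-splitting field validator."""
    return [part.strip() for part in raw.split(",") if part.strip()]


def json_preparse_fails(raw: str) -> bool:
    """Return True if decoding the raw string as JSON raises."""
    try:
        json.loads(raw)
    except json.JSONDecodeError:
        return True
    return False


pre_parse_failed = json_preparse_fails(RAW)  # the crash the commit describes
origins = split_origins(RAW)                 # what the validator yields once the pre-parse is skipped
```

Skipping the JSON step and handing the raw string to the validator is the behaviour the commit attributes to the NoDecode annotation.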

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

  • Bust Docker layer cache when the api branch advances

RUN git clone always produces the same layer hash because the command
string never changes, so HF Spaces was rebuilding the image on top of a
stale checkout — the cors_origins NoDecode fix never made it into the
running container. Pull the latest commit SHA from GitHub via ADD just
before the clone; ADD invalidates the layer whenever the response body
changes, which forces a fresh clone per push.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

  • Move leaderboard_parquet_path onto ResultCache

The api module needed only this one-line helper from
mteb.leaderboard.app, but importing it pulled in gradio, pandas, and
cachetools — none of which belong in the [api] extra. Promoting it to
a property on ResultCache lets every consumer (api, leaderboard,
bench script) reach the path without dragging the Gradio stack into
the API container.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

  • Pre-fetch mteb/results HF dataset at build time

Drops the cold-start cost of cloning the GitHub results repo on first
request by pulling the same data from huggingface.co/datasets/mteb/results
during image build. Goes into the default huggingface_hub cache under
HF_HOME so callers reach it via the standard hub APIs. The download is
guarded with || true so it stays non-fatal while the dataset is still
being populated upstream — the API just falls back to the GitHub clone
on first request.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

  • Switch leaderboard cache loader to per-config HF dataset layout

The results-repo sync now pushes one HF dataset config per benchmark
(plus a default config holding every result, deduped). Rewires the
API consumer to match:

  • _load_from_hub enumerates configs and calls load_dataset(name=cfg, split='train') on each. A failure on one config no longer poisons the
    whole load.
  • _load_per_benchmark_frames collapses to two paths — hub or cold
    rebuild — and returns a (per_benchmark, all_results) tuple
    instead of the _LoadedFrames dataclass. The two named wrappers
    (get_all_benchmark_frames / get_all_results_df) go away;
    callers destructure inline.
  • Hub-supplied default config short-circuits the per-benchmark
    concat for the unified view.
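
The per-config isolation described above can be sketched as a loop that catches failures per config instead of letting one exception abort the whole load. This is a minimal illustration, not the actual _load_from_hub; `load_configs`, `fake_loader`, and the config names are hypothetical, and the loader is injected so the example runs without network access.

```python
from typing import Callable


def load_configs(config_names: list, loader: Callable[[str], list]) -> tuple:
    """Load every config independently; record failures instead of raising."""
    loaded = {}
    failed = {}
    for name in config_names:
        try:
            loaded[name] = loader(name)
        except Exception as exc:  # one bad config must not abort the others
            failed[name] = repr(exc)
    return loaded, failed


def fake_loader(name: str) -> list:
    """Hypothetical loader standing in for a per-config dataset download."""
    if name == "broken":
        raise RuntimeError("download failed")
    return [{"benchmark": name, "score": 0.5}]


loaded, failed = load_configs(["default", "broken", "bench_a"], fake_loader)
```

Returning the failures alongside the loaded frames is one possible design; the commit does not say how the real loader reports them.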

Other follow-ups:

  • BenchmarkResults gains load_leaderboard_frame and
    split_leaderboard_frame so loading the raw combined frame can be
    decoupled from splitting it. The new
    _split_by_benchmark_tasks filters via an inner join on
    (task_name, split, subset) tuples — off-spec subsets/splits no
    longer leak through to _create_summary_table's
    group_by(model_name, task_name).mean().
  • MTEB_API_CACHE_REPO moves to Settings alongside
    cors_origins / preload; consumers go through
    settings.cache_repo().
  • /robots.txt added to silence Space probes.
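
The (task_name, split, subset) filter mentioned above can be illustrated in plain Python, used here in place of a polars inner join: a row is kept only if its key tuple appears in the benchmark's specification. The rows and the `allowed` set are made-up data, and `filter_to_spec` is a hypothetical name.

```python
# Hypothetical result rows; the last split/subset of each task is off-spec.
rows = [
    {"model_name": "m1", "task_name": "T1", "split": "test", "subset": "default", "score": 0.8},
    {"model_name": "m1", "task_name": "T1", "split": "dev", "subset": "default", "score": 0.2},
    {"model_name": "m1", "task_name": "T2", "split": "test", "subset": "en", "score": 0.6},
    {"model_name": "m1", "task_name": "T2", "split": "test", "subset": "fr", "score": 0.1},
]

# Key tuples the benchmark actually specifies.
allowed = {("T1", "test", "default"), ("T2", "test", "en")}


def filter_to_spec(rows: list, allowed: set) -> list:
    """Keep rows whose (task_name, split, subset) key is in the specification."""
    return [r for r in rows if (r["task_name"], r["split"], r["subset"]) in allowed]


kept = filter_to_spec(rows, allowed)
kept_scores = [r["score"] for r in kept]
```

Without the filter, the off-spec rows would be averaged into the per-task mean, which is the leak the commit describes.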

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

  • fix import

  • fix repo name

  • fix results

  • benchmarks: add MVEB benchmark suite + leaderboard Video menu

Adds the MVEB (Massive Video Embedding Benchmark) benchmark objects to
main so the leaderboard and get_benchmark() can resolve them. The
underlying tasks are already on main; this adds only the curated
benchmark groupings and their registration.

  • benchmarks.py: MVEB (23 tasks), MVEB(text, video) (19), MVEB(video)
    (9), MVEB(beta, extended) (184, alias MVEB(extended)).
  • benchmarks/__init__.py: import + __all__ registration.
  • _leaderboard_menu.py: new "Video" group under General Purpose.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

  • add more parameters to schemas

  • add train on to cache results

  • Lifespan warmup, /scores route, per-task num_models

Replace the deprecated @on_event("startup") hook with a FastAPI
lifespan context manager. warmup_blocking is dispatched via
asyncio.to_thread so its sync polars work runs without blocking the
event loop, and uvicorn holds the listener until it returns — the
first request lands on a fully warm cache. Heavy summary preloading
stays gated behind MTEB_API_PRELOAD=1 in a daemon thread.
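
The lifespan pattern described above can be sketched with the standard library only. FastAPI accepts an async context manager through its `lifespan` argument; to stay self-contained, the sketch drives the context manager directly with asyncio rather than starting a server. The body of `warmup_blocking` and the `cache` dict are hypothetical placeholders for the real warmup.

```python
import asyncio
from contextlib import asynccontextmanager

cache = {}  # hypothetical stand-in for the warmed leaderboard cache


def warmup_blocking() -> None:
    """Hypothetical synchronous warmup (the real one does polars work)."""
    cache["ready"] = True


@asynccontextmanager
async def lifespan(app):
    # Run the sync warmup in a worker thread so the event loop is not blocked.
    await asyncio.to_thread(warmup_blocking)
    yield  # the application would serve requests while suspended here
    cache.clear()  # teardown on shutdown


async def main() -> bool:
    # Stand-in for the server: enter the lifespan, then check the cache is warm.
    async with lifespan(app=None):
        return cache.get("ready", False)


warm_during_serving = asyncio.run(main())
```

Because the warmup is awaited before the `yield`, nothing inside the `async with` body runs until the cache is populated, which mirrors the "first request lands on a warm cache" behaviour.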

Add num_models to TaskMetaSchema, derived from the unified
results frame by routes._task_num_models_map() and overlaid on
every /tasks and /tasks/{name} response. Drives the new "Models
evaluated" stat + sort on the frontend /tasks cards.

Rename /benchmarks/{name}/summary to /scores as the canonical
path (keep /summary as a hidden alias) and refresh the README.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

  • benchmarks: mark MVEB suite beta; drop extended from leaderboard menu

Addresses review:

  • Rename to beta variants: MVEB(beta), MVEB(video, beta),
    MVEB(text, video, beta) (consistent with MAEB/RTEB). Old names kept
    as aliases so get_benchmark("MVEB") etc. still resolve.
  • Leaderboard "Video" menu no longer displays MVEB(beta, extended).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

  • api: expose Benchmark.language_view through the schema

Surfaces the per-language column list the frontend's "Performance
per language" tab needs. Mirrors the existing pattern for
benchmark.languages — codes are resolved via language_label() so
the frontend renders "German" rather than "deu-Latn". Empty default
collapses to None so the frontend treats "no language view" as a
single missing-value check (and hides the tab entirely on benchmarks
that didn't opt in).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

  • api: pick "Rank" over "Rank (Borda)" when both columns exist

The mean-task-type summary builder (MIEB, ViDoRe-style) writes BOTH
"Rank" (the primary, assigned in sort-by-Mean(Task) order) and
"Rank (Borda)" (kept for back-compat). The adapter was falling back
to "Rank (Borda)" first, so MIEB rows ended up labelled with Borda
ranks even though the rows themselves were mean-sorted — the actual
top model (jina-embeddings-v5-omni-small at mean 0.65) showed up
as rank #11 on the home leaderboard.

Switching the lookup order to ("Rank", "Rank (Mean Task)", "Rank
(Borda)") so the explicit primary rank wins when present. Standard
builders only emit "Rank (Borda)" so they still pick that up via
fallback.
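
The lookup order above amounts to a first-match search over a preference tuple. A minimal sketch, assuming the adapter sees the summary table's column names as a list; `pick_rank_column` is a hypothetical name.

```python
# Preference order quoted in the commit message: primary rank first, Borda last.
RANK_COLUMNS = ("Rank", "Rank (Mean Task)", "Rank (Borda)")


def pick_rank_column(columns: list):
    """Return the first preferred rank column present, or None if none exists."""
    for name in RANK_COLUMNS:
        if name in columns:
            return name
    return None


# Mean-task-type builders emit both columns; the explicit "Rank" wins.
mieb_style = pick_rank_column(["Model", "Rank", "Rank (Borda)", "Mean (TaskType)"])
# Standard builders emit only "Rank (Borda)"; it is reached via fallback.
standard = pick_rank_column(["Model", "Rank (Borda)", "Mean (Task)"])
```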

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

  • benchmarks: customisable sort column + relabel MIEB's mean to TaskType

Add Benchmark.summary_sort_column: ClassVar[str | None] = None — a
new opt-in hook on every benchmark for choosing which polars column
the summary frame is sorted by (and which populates the displayed
Rank). None keeps the historical builder default; subclasses set
it explicitly when they want a different sort.

_create_summary_table_mean_task_type gains a sort_by arg that
overrides the default (sort by mean_column_name).

MIEB updated to declare its actual semantics:

  • aggregations = (MEAN_TASK_TYPE, TASK_TYPES) — the column previously
    labelled "Mean (Task)" is computed as mean-of-per-type-means, which
    is mathematically a Mean (TaskType). The misleading label was
    causing the frontend to recompute and disagree with the canonical
    value (jina on MIEB(eng) showed 58 instead of 65).
  • mean_column_name = "Mean (TaskType)" so the leaderboard column
    matches the actual aggregation.
  • summary_sort_column = "Mean (TaskType)" — sort by the renamed
    column (was sorting by it anyway, just plumbed through the new
    hook).
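
The difference between the two aggregations is easiest to see on a small example. The scores below are invented and chosen so the arithmetic is exact; they are not MIEB results.

```python
from statistics import mean

# Hypothetical per-task scores, grouped by task type.
scores_by_type = {
    "Retrieval": [0.5, 0.75, 1.0],
    "Clustering": [0.25],
}

# Mean (Task): every task counts equally.
all_scores = [s for task_scores in scores_by_type.values() for s in task_scores]
mean_task = mean(all_scores)  # (0.5 + 0.75 + 1.0 + 0.25) / 4

# Mean (TaskType): average each type first, then average the type means.
type_means = [mean(task_scores) for task_scores in scores_by_type.values()]
mean_task_type = mean(type_means)  # (0.75 + 0.25) / 2
```

Here the two values are 0.625 and 0.5. A frontend that recomputes a per-task mean from a column that actually holds a per-type mean will disagree with the backend in this way whenever task types have unequal task counts.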

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

  • fix

  • api: /benchmarks/{name}/leaders for slim home-page tiles

Adds a new endpoint that returns one row per size bucket — the
highest-scoring model in each [min, max) megaparameter range.
Lets the leaderboard home page render its featured mini-tables
without pulling the full /scores payload for every primary tile
(several MB on multilingual benchmarks → a few hundred bytes).

GET /benchmarks/{name}/leaders?buckets=[[0,500],[500,1000],[1000,5000],[5000,null]]

buckets is a JSON-encoded array of [min, max] tuples in
millions of parameters (null or omitted second element = open-
ended top bucket). Backend converts each bucket's millions to
billions internally to compare against total_params_b.
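
The bucket handling described above can be sketched as follows. This is an illustration under assumptions, not the endpoint's code: `parse_buckets`, `leaders`, the `score` field, and the model rows are hypothetical, and returning None for an empty bucket is an assumption the source does not confirm.

```python
import json
from typing import Optional


def parse_buckets(raw: str) -> list:
    """Decode the JSON buckets and convert millions of parameters to billions."""
    buckets = []
    for item in json.loads(raw):
        low = item[0]
        high: Optional[float] = item[1] if len(item) > 1 else None
        buckets.append((low / 1000, None if high is None else high / 1000))
    return buckets


def leaders(models: list, buckets: list) -> list:
    """Pick the highest-scoring model in each half-open [low, high) bucket."""
    result = []
    for low, high in buckets:
        in_bucket = [
            m for m in models
            if low <= m["total_params_b"] and (high is None or m["total_params_b"] < high)
        ]
        # Assumption: an empty bucket yields None.
        result.append(max(in_bucket, key=lambda m: m["score"]) if in_bucket else None)
    return result


# Hypothetical models; "score" stands in for whichever mean column is selected.
models = [
    {"name": "a", "total_params_b": 0.3, "score": 0.60},
    {"name": "b", "total_params_b": 0.5, "score": 0.70},
    {"name": "c", "total_params_b": 7.0, "score": 0.80},
    {"name": "d", "total_params_b": 0.1, "score": 0.65},
]
buckets = parse_buckets("[[0,500],[500,1000],[1000,5000],[5000,null]]")
top = [m["name"] if m else None for m in leaders(models, buckets)]
```

Model "b" sits exactly on the 500M boundary and therefore falls into the second bucket, since the ranges are closed on the left and open on the right.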

Score selection prefers mean_task but falls back to
mean_task_type for benchmark builders that only populate the
latter (MIEB / ViDoRe-style) — wit...

2.15.6

20 Jun 17:03

2.15.6 (2026-06-20)

Fix

  • fix: results mteb version parse (#4840)

fix version parse (23c96a3)

2.15.5

19 Jun 21:00

2.15.5 (2026-06-19)

Ci

  • ci: revert leaderboard refresh trusted publishing (#4831)

revert leaderboard refresh (14cf6d6)

  • ci: Setup hf trusted publishing (#4804)

  • setup hf trusted publishing

  • remove comment (a1f0a62)

Fix

  • fix: Get modalities of models from config (#4789)

infer modalities (2c17992)

Unknown

  • model: update VultronRetrieverPrime-Qwen3.5-8B repo path to the vultr org (#4836)

model: Update VultronRetrieverPrime-Qwen3.5-8B metadata to reflect new repository path (e091c0d)

  • Add VultronRetrieverPrime-Qwen3.5-8B ModelMeta (#4833)

Late-interaction (ColBERT MaxSim) visual document retriever: ColQwen3.5, dim 320,
8.4B params, Apache-2.0, 6 languages. Reuses the existing ColQwen3_5Wrapper.
Official ViDoRe scores V1 0.9208 / V2 0.6818 / V3 0.6472 (results PR to follow).

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> (99a830c)

  • Update Querit model implementation: Supplementing the citation information of Querit (#4824)

Update Querit model implementation: Supplementing the citation information of the Querit

Co-authored-by: zhongyunfei <zhongyunfei@baidu.com> (cfce4f2)

  • model: Add VIRTUE multimodal embedding models (Sony VIRTUE-2B/7B-SCaR) (#4822)

  • Add VIRTUE multimodal embedding models (Sony VIRTUE-2B/7B-SCaR)

  • address review feedback (41f6c7e)

  • Fix Querit model implementation: Supplementing the base model information of the Querit/Querit-4B (#4819)

  • Fix Querit model implementation: Supplementing the base model information of the Querit/Querit-4B


Co-authored-by: zhongyunfei <zhongyunfei@baidu.com> (240da5b)

  • mveb: fix and unify domain tags across all 50 source datasets (#4738)

  • mveb: fix and unify domain tags across all 50 source datasets

The MVEB+ video task set had inconsistent and partially-wrong domains
tags. Issues fixed:

  • MSR-VTT had no domain tags at all (empty list). Now tagged ["Web"].
  • AVMeme-Exam was tagged with "Music" (it's internet memes, not music
    content). Now ["Entertainment", "Web"].
  • AudioCaps_AV was tagged "Encyclopaedic" (it's audio captioning). Now
    ["AudioScene", "Web"].
  • VGGSound was tagged just ["Web"] despite being audio-visual events.
    Now ["AudioScene", "Web"]. Same fix for VGGSound_AV_RETRIEVAL.
  • AV-SpeakerBench was tagged ("Web") on the base task and ("Spoken")
    on the PC variant --- same source data, inconsistent tags. Unified
    to ("Spoken").
  • WorldSense_1min was over-tagged with Entertainment+Music in some
    files and just ["Web"] in others. Unified to ["AudioScene", "Scene",
    "Web"].
  • Several datasets tagged "Spoken" without speech-driven content
    (DiDeMo, MSVD, ActivityNetCaptions, VATEX, panda-70m, TUNA-Bench).
    Removed the Spoken tag from those.
  • AVE-Dataset clustering tasks tagged with ["Music", "Scene", "Spoken"]
    (clearly wrong). Now aligned with the rest of AVE-Dataset:
    ["AudioScene", "Web"].
  • MELD was tagged just ["Entertainment"] across base and clustering
    variants; MELD is the Friends sitcom, so dialogue is central.
    Added "Spoken" -> ["Entertainment", "Spoken"].
  • UCF101 missing "Sport" tag. UCF101 has substantial Sport content.
    Now ["Scene", "Sport", "Web"].
  • Human-Animal-Cartoon missing "Entertainment" tag despite the cartoon
    domain. Now ["Entertainment", "Scene", "Web"].
  • PerceptionTest missing "Scene" tag despite being a scene-perception
    benchmark. Now ["Scene", "Web"].
  • Video-MME missing "Spoken" tag despite the narration-heavy content.
    Now ["Spoken", "Web"].
  • HMDB51 missing "Web" tag (sourced largely from web video). Now
    ["Scene", "Web"].
  • VideoCon, Vinoground (zachz/*) missing "Web" tag. Added.
  • RAVDESS tag list kept at ["Spoken"] (speech-emotion primary).
  • AVQA tag list extended with "AudioScene" (it's an audio-visual QA
    benchmark).

All 50 unique source datasets across 184 video tasks now have
consistent, non-empty domain tags. Verified by re-importing every
task: 184 tasks load cleanly.

Tags use only the existing TaskDomain Literal vocabulary in
task_metadata.py; no new domains added.

  • mveb: enrich domain taxonomy + fix mislabeled action/egocentric/meme datasets

Adds 5 video content domains to TaskDomain (Activity, Instructional,
Egocentric, Nature, Animation) and re-tags datasets that were mislabeled
or under-characterized, so the domain set actually reflects benchmark
content:

  • Action recognition (Kinetics-400/600/700, HMDB51, UCF101, SSv2,
    ActivityNet, VATEX, NExT-QA, Vinoground, VideoCon) -> Activity
    (was the catch-all "Scene", which means visual place/setting).
  • Breakfast, YouCook2 -> Instructional (cooking / how-to).
  • Diving48 -> Activity + Sport.
  • EgoSchema -> Egocentric (was bare "Web").
  • Human-Animal-Cartoon -> Activity + Animation + Nature.
  • AVMeme-Exam -> + Social (internet memes).
  • PerceptionTest -> drop misapplied "Scene".

Scene is now reserved for genuine visual-scene content (WorldSense).
All 184 video tasks load; every domain validates against TaskDomain.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>


Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> (343df1a)

  • Update Querit model implementation: 4B version of Querit-Reranker newly open-sourced (#4808)

  • Update Querit model implementation: 4B version of Querit-Reranker newly open-sourced.

  • Apply suggestions from code review

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>


Co-authored-by: zhongyunfei <zhongyunfei@baidu.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> (4976113)

  • Rename number_texts_intersect_with_train to samples_in_train (#4809) (e76f291)

  • polish MVEB leaderboard names + icons (#4803)

benchmarks: polish MVEB leaderboard names + icons

  • display_name: align scope names with MAEB/MIEB convention —
    "Video" -> "MVEB Video-Only" (cf. "MAEB Audio-Only"),
    "Text+Video" -> "MVEB Video-Text" (cf. MIEB "Image-Text").
    MVEB(beta) stays "MVEB" as the headline benchmark.
  • icon: switch the three displayed MVEB benchmarks from the monochrome,
    unpinned master/svg/libre-gui-activity icon to the colored,
    commit-pinned svg-color/libre-gui-video icon, matching the convention
    used by the other benchmarks.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> (5039c18)

2.15.4

12 Jun 11:07

2.15.4 (2026-06-12)

Ci

  • ci: Update healthcheck for new leaderboard (#4802)

update healthcheck (225b98e)

Fix

  • fix: update revision for NightOwl-CodeEmbedding (#4799)

fix: update revision for nightowl code embedding model (8e00d7e)

Unknown

  • model: add Verm1ion/ColTurk-VDR-Qwen3VL-4B-v1.0 (#4796)

  • Add ColTurk-VDR-Qwen3VL-4B-v1.0 (late-interaction visual document retriever)

  • Apply review suggestions: n_embedding_parameters + adapted_from (7ea3685)

  • fix filter by model size (#4794) (dc31aec)

2.15.3

10 Jun 11:31

2.15.3 (2026-06-10)

Fix

  • fix: Support scikit-learn 1.9 in ZeroShotClassification (#4790)

scikit-learn 1.9 raises "ValueError: Mix of label input types" when
classification metrics receive string y_true with numeric y_pred.
Zeroshot predictions are always integer indices into the candidate
labels, so string dataset labels are now mapped to their candidate
index before scoring. Unmappable string labels raise a clear error
instead of silently scoring 0.0, which is what scikit-learn < 1.9 did.
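
The label mapping described above can be sketched as follows. This is a minimal illustration of the idea, not the code in ZeroShotClassification; `labels_to_indices` is a hypothetical name.

```python
def labels_to_indices(y_true: list, candidate_labels: list) -> list:
    """Map string labels to their index in candidate_labels; pass integers through."""
    index_of = {label: i for i, label in enumerate(candidate_labels)}
    mapped = []
    for label in y_true:
        if isinstance(label, str):
            if label not in index_of:
                # Fail loudly rather than letting the sample silently score 0.0.
                raise ValueError(f"Label {label!r} is not among the candidate labels")
            mapped.append(index_of[label])
        else:
            mapped.append(int(label))
    return mapped


candidates = ["dog", "cat"]
y_true = labels_to_indices(["cat", "dog", "cat"], candidates)
```

With both y_true and the predictions expressed as integer indices, the metric functions receive a single label type and the mixed-type error no longer applies.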

Removes the <1.9.0 pin introduced as a stopgap in #4783.

Fixes #4784

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com> (5e8f0e1)

2.15.2

10 Jun 10:51

2.15.2 (2026-06-10)

Fix

  • fix: normalize benchmark definitions (#4792)

  • fix: remove *extended and MAEB+

These are mostly for reproducibility, so we have moved them to their respective script repos.

embeddings-benchmark/maeb-paper#3
embeddings-benchmark/mveb-paper#1

  • remove script to maeb

  • fix imports

  • fix: normalize benchmark definitions

Normalize to:

  • definition of what is measured
  • rationale/notable characteristics
  • version updates (7071da1)

Unknown

  • model: NightOwl-CodeEmbedding (#4791)

  • model: add NightOwl CodeEmbedding metadata

  • fix: remove programming language codes from model metadata

  • fix: update memory usage for NightOwl CodeEmbedding model

  • fix: update NightOwl CodeEmbedding model metadata

  • fix: update revision for NightOwl CodeEmbedding model (b691769)

2.15.1

09 Jun 16:48

2.15.1 (2026-06-09)

Fix

  • fix: remove *extended and MAEB+ (#4786)

  • fix: remove *extended and MAEB+

These are mostly for reproducibility, so we have moved them to their respective script repos.

embeddings-benchmark/maeb-paper#3
embeddings-benchmark/mveb-paper#1

  • remove script to maeb

  • fix imports (f795558)

Unknown

  • Merge multimodal sentence transformers (#4785)

  • merge multimodal

  • add warning & replace usages (d5430cc)

  • model: Add video support to qwen3-vl embedding (#4699)

  • add video to qwen3

  • upd implementation

  • upd revision

  • fix MultimodalInstructSentenceTransformerModel

  • disable double sampling

  • move imports inside

  • simplify wrapper

  • fix typing (beee210)

2.15.0

09 Jun 16:06

2.15.0 (2026-06-09)

Ci

  • ci: Update actions (#4780)

  • update github

  • remove remove

  • upd

  • remove free space

  • rename step

  • fix conda warning

  • add concurrency to lb

  • remove concurrences (52a6f73)

Feature

  • feat: Add evaluation runtime for indexing and retrieval (#4639)

  • Add evaluation runtime for indexing and retrieval

  • change timer from maintaining state to passed as an attribute

  • add timer argument to load_data

  • add timer argument to evaluate

  • removed timer from kwargs

  • update plots for split/subsets

  • fix tests

  • fix typing errors

  • typing errors

  • change typing

  • correct typing

  • apply changes from review

  • update to handle overwritten load_data

  • changes from review

  • added evaluation phases merging logic and fix typecheck

  • added * separator at all places in load_data

  • changes from review

  • added split/subset at all places

  • change split/subset in other functions as well

  • small typecheck update

  • changes from review

  • reordering

  • implement for clustering task

  • implement for classification task

  • remove logger statement from clustering

  • implement for pair classification task

  • remove logger statement for pair classification task

  • implement for bitext mining task

  • implement for STS task

  • implement for summarization task

  • implement for sklearn evaluator

  • fix linter

  • Delete .specstory/history/2026-04-23_09-55Z-testing-mechanism-for-new-datasets.md

  • minor changes from review

  • fix typecheck

  • update stacklevel=2

  • add TimingStack as default argument

  • fix evaluators tests

  • update deprecate_evaluator

  • minor fix

  • modified implementation to handle override task

  • added utility function in TaskResult to plot timings

  • Added docs

  • changes from review

  • added comments

  • simplify plot calling and update docs with example

  • changed phases naming format in plots

  • add tests and minor changes

  • update to handle indexing and searching phase

  • rollback changes in deprecated_evaluator

  • move import to top level and minor fix in tests

  • fix import

  • update tests

  • change Scoring to aggregate level in Classification task

  • remove unwanted file

  • fix linter and typecheck errors after merge

  • revert changes in other classification task

  • changes from copilot review and add new test

  • changes from review

  • update condition

  • make lint

  • updated docs


Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com> (6a3e816)

2.14.9

08 Jun 22:08

2.14.9 (2026-06-08)

Fix

  • fix: Update lock and remove python limit for pylate and colbert_engine (#4783)

  • update lock

  • fix typing

  • fix typing

  • fix tests

  • remove torchcodec from missing imports

  • fix zero shot

  • pin sklearn (d93a2c0)