2.16.1 (2026-06-23)

Fix

fix: use text task prefixes for gemini-embedding-2; improve doc prefix (#4851) (b243378)

Unknown

Add vultr/VultronRetrieverCore-Qwen3.5-4.5B (#4850)

Add vultr/VultronRetrieverCore-Qwen3.5-4.5B ModelMeta (04d5dd6)

Change leaderboard to refresh (#4848) (64576f4)
Add vultr/VultronRetrieverFlash-Qwen3.5-0.8B (#4845)

Add vultr/VultronRetrieverFlash-Qwen3.5-0.8B ModelMeta

Signed-off-by: Athrael Soju <athrael.soju@gmail.com> (fe5f0d0)

fix types in api schemas (#4843)
fix types in api schemas
fix typing
retrigger ci (d7b54b0)

2.16.0 (2026-06-21)

Feature

feat: Add mteb/api FastAPI service for new leaderboard (#4760)
change to polars leaderboard
WIP changes
speedup
upds
fixes
rewrite mostly to polars
fixes
more speedup
lint a bit
fix init
fix tests
fix tests
fix typing
refactor to private
Add mteb/api FastAPI service and HF Space Dockerfile

New mteb/api subpackage exposes the leaderboard data as a FastAPI
service backed by ResultCache + the existing polars summary builders.
Routes mirror the SvelteKit frontend's data needs: benchmark menu,
benchmark detail, and prerendered summary tables. CORS origins,
preload, and cache locations come from settings.

Dockerfile clones mteb@api, installs .[api], and serves uvicorn on
:7860 as UID 1000 — drop-in for a Hugging Face Space.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

fix typing
Annotate cors_origins with NoDecode so env strings parse

pydantic-settings' EnvSettingsSource tries to json.loads any field it
considers complex before invoking field_validators, which made the
documented comma-separated MTEB_API_CORS_ORIGINS format crash with
JSONDecodeError at app startup inside the HF Space. NoDecode skips
that pre-parse step and lets the existing field_validator split on
commas as advertised.
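
The failure mode described above can be reproduced with the standard library alone. This is a minimal sketch, not the mteb code: `parse_cors_origins` is a hypothetical stand-in for the field validator, the example origins are made up, and the `json.loads` call only imitates the JSON pre-parse that is applied to complex fields.

```python
import json


def parse_cors_origins(raw: str) -> list[str]:
    """Hypothetical validator: split a comma-separated env value into origins."""
    return [origin.strip() for origin in raw.split(",") if origin.strip()]


raw_value = "https://a.example,https://b.example"  # illustrative env value

# Imitates the JSON pre-parse step: a comma-separated string is not valid JSON.
try:
    json.loads(raw_value)
    json_parse_failed = False
except json.JSONDecodeError:
    json_parse_failed = True

# Skipping the pre-parse lets the validator see the raw string and split it.
origins = parse_cors_origins(raw_value)
```

With the pre-parse skipped, the validator receives the raw string and the comma-separated format works as documented.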

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Bust Docker layer cache when the api branch advances

RUN git clone always produces the same layer hash because the command
string never changes, so HF Spaces was rebuilding the image on top of a
stale checkout — the cors_origins NoDecode fix never made it into the
running container. Pull the latest commit SHA from GitHub via ADD just
before the clone; ADD invalidates the layer whenever the response body
changes, which forces a fresh clone per push.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Move leaderboard_parquet_path onto ResultCache

The api module needed only this one-line helper from
mteb.leaderboard.app, but importing it pulled in gradio, pandas, and
cachetools — none of which belong in the [api] extra. Promoting it to
a property on ResultCache lets every consumer (api, leaderboard,
bench script) reach the path without dragging the Gradio stack into
the API container.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Pre-fetch mteb/results HF dataset at build time

Drops the cold-start cost of cloning the GitHub results repo on first
request by pulling the same data from huggingface.co/datasets/mteb/results
during image build. Goes into the default huggingface_hub cache under
HF_HOME so callers reach it via the standard hub APIs. The download is
guarded with || true so it stays non-fatal while the dataset is still
being populated upstream — the API just falls back to the GitHub clone
on first request.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Switch leaderboard cache loader to per-config HF dataset layout

The results-repo sync now pushes one HF dataset config per benchmark
(plus a default config holding every result, deduped). Rewires the
API consumer to match:

_load_from_hub enumerates configs and load_dataset(name=cfg, split='train') each. A failure on one config no longer poisons the
whole load.
_load_per_benchmark_frames collapses to two paths — hub or cold
rebuild — and returns a (per_benchmark, all_results) tuple
instead of the _LoadedFrames dataclass. The two named wrappers
(get_all_benchmark_frames / get_all_results_df) go away;
callers destructure inline.
Hub-supplied default config short-circuits the per-benchmark
concat for the unified view.
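
The "one failing config no longer poisons the whole load" behaviour can be sketched as a per-config try/except. This is an illustration under stated assumptions: `load_config` and `load_all` are hypothetical names, and the loader is a stub rather than a real hub call.

```python
import logging

logger = logging.getLogger(__name__)


def load_config(name: str) -> list[dict]:
    """Hypothetical loader; stands in for a per-config dataset load."""
    if name == "broken":
        raise RuntimeError("simulated download failure")
    return [{"benchmark": name, "score": 1.0}]


def load_all(config_names: list[str]) -> dict[str, list[dict]]:
    """Load each config independently and skip the ones that fail."""
    frames = {}
    for name in config_names:
        try:
            frames[name] = load_config(name)
        except Exception as exc:  # one bad config must not abort the rest
            logger.warning("Skipping config %s: %s", name, exc)
    return frames


loaded = load_all(["default", "broken", "MTEB(eng, v2)"])
```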

Other follow-ups:

BenchmarkResults gains load_leaderboard_frame and
split_leaderboard_frame so loading the raw combined frame can be
decoupled from splitting it. The new
_split_by_benchmark_tasks filters via an inner join on
(task_name, split, subset) tuples — off-spec subsets/splits no
longer leak through to _create_summary_table's
group_by(model_name, task_name).mean().
MTEB_API_CACHE_REPO moves to Settings alongside
cors_origins / preload; consumers go through
settings.cache_repo().
/robots.txt added to silence Space probes.
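
The tuple-based filter mentioned above can be illustrated without polars. The sketch below uses plain Python as a stand-in for the inner join and the group-by; the rows and scores are made up, and only the column names come from the notes.

```python
from collections import defaultdict
from statistics import mean

# Illustrative rows; the field names follow the text, the values are made up.
rows = [
    {"model_name": "m1", "task_name": "T1", "split": "test", "subset": "en", "score": 0.8},
    {"model_name": "m1", "task_name": "T1", "split": "dev", "subset": "en", "score": 0.2},
    {"model_name": "m1", "task_name": "T1", "split": "test", "subset": "de", "score": 0.6},
]

# The benchmark specifies exactly which (task_name, split, subset) tuples count.
benchmark_keys = {("T1", "test", "en"), ("T1", "test", "de")}

# Equivalent of an inner join on the key tuple: rows without a match are dropped.
kept = [r for r in rows if (r["task_name"], r["split"], r["subset"]) in benchmark_keys]

# Equivalent of group_by(model_name, task_name).mean() on the filtered rows.
groups = defaultdict(list)
for r in kept:
    groups[(r["model_name"], r["task_name"])].append(r["score"])
summary = {key: mean(scores) for key, scores in groups.items()}
```

Without the filter, the off-spec "dev" row would be averaged into the task score and pull it down.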

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

fix import
fix repo name
fix results
benchmarks: add MVEB benchmark suite + leaderboard Video menu

Adds the MVEB (Massive Video Embedding Benchmark) benchmark objects to
main so the leaderboard and get_benchmark() can resolve them. The
underlying tasks are already on main; this adds only the curated
benchmark groupings and their registration.

benchmarks.py: MVEB (23 tasks), MVEB(text, video) (19), MVEB(video)
(9), MVEB(beta, extended) (184, alias MVEB(extended)).
benchmarks/__init__.py: import + __all__ registration.
_leaderboard_menu.py: new "Video" group under General Purpose.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

add more parameters to schemas
add train on to cache results
Lifespan warmup, /scores route, per-task num_models

Replace the deprecated @on_event("startup") hook with a FastAPI
lifespan context manager. warmup_blocking is dispatched via
asyncio.to_thread so its sync polars work runs without blocking the
event loop, and uvicorn holds the listener until it returns — the
first request lands on a fully warm cache. Heavy summary preloading
stays gated behind MTEB_API_PRELOAD=1 in a daemon thread.
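
The warm-up pattern can be sketched with the standard library only. FastAPI accepts an async context manager through its `lifespan` parameter; the example below leaves FastAPI out and drives the context manager directly, and `cache` plus the body of `warmup_blocking` are placeholders rather than the actual mteb implementation.

```python
import asyncio
import time
from contextlib import asynccontextmanager

cache: dict[str, str] = {}


def warmup_blocking() -> None:
    """Stand-in for synchronous warm-up work (e.g. building frames)."""
    time.sleep(0.05)
    cache["ready"] = "yes"


@asynccontextmanager
async def lifespan(app: object):
    # Run the blocking warm-up in a worker thread so the event loop stays free.
    await asyncio.to_thread(warmup_blocking)
    yield  # a server would start accepting requests only from this point on


async def main() -> bool:
    async with lifespan(app=None):
        # Inside the context the cache is already warm.
        return "ready" in cache


warm_at_first_request = asyncio.run(main())
```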

Add num_models to TaskMetaSchema, derived from the unified
results frame by routes._task_num_models_map() and overlaid on
every /tasks and /tasks/{name} response. Drives the new "Models
evaluated" stat + sort on the frontend /tasks cards.

Rename /benchmarks/{name}/summary to /scores as the canonical
path (keep /summary as a hidden alias) and refresh the README.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

benchmarks: mark MVEB suite beta; drop extended from leaderboard menu

Addresses review:

Rename to beta variants: MVEB(beta), MVEB(video, beta),
MVEB(text, video, beta) (consistent with MAEB/RTEB). Old names kept
as aliases so get_benchmark("MVEB") etc. still resolve.
Leaderboard "Video" menu no longer displays MVEB(beta, extended).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

api: expose Benchmark.language_view through the schema

Surfaces the per-language column list the frontend's "Performance
per language" tab needs. Mirrors the existing pattern for
benchmark.languages — codes are resolved via language_label() so
the frontend renders "German" rather than "deu-Latn". Empty default
collapses to None so the frontend treats "no language view" as a
single missing-value check (and hides the tab entirely on benchmarks
that didn't opt in).
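
A rough sketch of the serialization rule described above, assuming a hypothetical `serialize_language_view` helper and a made-up label table (the real `language_label()` is not shown in these notes):

```python
from typing import Optional

# Hypothetical lookup table; the real language_label() is not shown in the notes.
_LABELS = {"deu-Latn": "German", "fra-Latn": "French"}


def language_label(code: str) -> str:
    """Illustrative stand-in: map a language code to a display name."""
    return _LABELS.get(code, code)


def serialize_language_view(codes: list[str]) -> Optional[list[str]]:
    """Return display labels, or None when the benchmark has no language view."""
    if not codes:
        return None
    return [language_label(code) for code in codes]
```

Returning `None` for the empty case gives the consumer a single value to test before deciding whether to show the tab.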

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

api: pick "Rank" over "Rank (Borda)" when both columns exist

The mean-task-type summary builder (MIEB, ViDoRe-style) writes BOTH
"Rank" (the primary, assigned in sort-by-Mean(Task) order) and
"Rank (Borda)" (kept for back-compat). The adapter was falling back
to "Rank (Borda)" first, so MIEB rows ended up labelled with Borda
ranks even though the rows themselves were mean-sorted — the actual
top model (jina-embeddings-v5-omni-small at mean 0.65) showed up
as rank #11 on the home leaderboard.

Switching the lookup order to ("Rank", "Rank (Mean Task)", "Rank
(Borda)") so the explicit primary rank wins when present. Standard
builders only emit "Rank (Borda)" so they still pick that up via
fallback.
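
The lookup order can be sketched as a first-match search over the preferred column names. `pick_rank_column` and the example column lists are illustrative, not the adapter's actual code.

```python
from typing import Optional

# Preference order taken from the description above.
RANK_COLUMNS = ("Rank", "Rank (Mean Task)", "Rank (Borda)")


def pick_rank_column(columns: list[str]) -> Optional[str]:
    """Return the first preferred rank column that is present, else None."""
    for name in RANK_COLUMNS:
        if name in columns:
            return name
    return None


# A mean-task-type builder emits both columns; a standard builder only Borda.
mean_task_type_columns = ["Model", "Rank", "Rank (Borda)", "Mean (Task)"]
standard_columns = ["Model", "Rank (Borda)", "Mean (Task)"]
```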

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

benchmarks: customisable sort column + relabel MIEB's mean to TaskType

Add Benchmark.summary_sort_column: ClassVar[str | None] = None — a
new opt-in hook on every benchmark for choosing which polars column
the summary frame is sorted by (and which populates the displayed
Rank). None keeps the historical builder default; subclasses set
it explicitly when they want a different sort.

_create_summary_table_mean_task_type gains a sort_by arg that
overrides the default (sort by mean_column_name).
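
A minimal sketch of how such an opt-in hook could behave. The `Benchmark` class here is a bare stand-in, `TypeMeanBenchmark` and `sort_summary` are hypothetical, and the default column name is an assumption made for the example.

```python
from typing import ClassVar, Optional


class Benchmark:
    # None keeps the builder's default sort column.
    summary_sort_column: ClassVar[Optional[str]] = None


class TypeMeanBenchmark(Benchmark):  # hypothetical subclass
    summary_sort_column = "Mean (TaskType)"


def sort_summary(rows, benchmark, default_column="Mean (Task)"):
    """Sort rows descending by the benchmark's column and assign Rank."""
    column = benchmark.summary_sort_column or default_column
    ordered = sorted(rows, key=lambda row: row[column], reverse=True)
    return [{**row, "Rank": i} for i, row in enumerate(ordered, start=1)]


rows = [
    {"Model": "a", "Mean (Task)": 0.70, "Mean (TaskType)": 0.60},
    {"Model": "b", "Mean (Task)": 0.65, "Mean (TaskType)": 0.68},
]
```

With these made-up rows, the default benchmark ranks model "a" first, while the subclass that opts in ranks model "b" first.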

MIEB updated to declare its actual semantics:

aggregations = (MEAN_TASK_TYPE, TASK_TYPES) — the column previously
labelled "Mean (Task)" is computed as mean-of-per-type-means, which
is mathematically a Mean (TaskType). The misleading label was
causing the frontend to recompute and disagree with the canonical
value (jina on MIEB(eng) showed 58 instead of 65).
mean_column_name = "Mean (TaskType)" so the leaderboard column
matches the actual aggregation.
summary_sort_column = "Mean (TaskType)" — sort by the renamed
column (was sorting by it anyway, just plumbed through the new
hook).
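
Why the two labels are not interchangeable is easiest to see with numbers. The scores below are invented for the illustration; only the two aggregation definitions come from the text.

```python
from statistics import mean

# Made-up scores: three tasks of one type, one task of another.
scores_by_type = {
    "Retrieval": [0.50, 0.50, 0.50],
    "Classification": [0.90],
}

all_scores = [s for scores in scores_by_type.values() for s in scores]

# Mean (Task): every task weighted equally.
mean_task = mean(all_scores)

# Mean (TaskType): average the per-type means, so every type weighs equally.
mean_task_type = mean([mean(scores) for scores in scores_by_type.values()])
```

Here the per-task mean is 0.6 and the mean of per-type means is 0.7, so a consumer that recomputes one under the other's label will disagree with the stored value.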

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

fix
api: /benchmarks/{name}/leaders for slim home-page tiles

Adds a new endpoint that returns one row per size bucket — the
highest-scoring model in each [min, max) megaparameter range.
Lets the leaderboard home page render its featured mini-tables
without pulling the full /scores payload for every primary tile
(several MB on multilingual benchmarks → a few hundred bytes).

GET /benchmarks/{name}/leaders?buckets=[[0,500],[500,1000],[1000,5000],[5000,null]]

buckets is a JSON-encoded array of [min, max] tuples in
millions of parameters (null or omitted second element = open-
ended top bucket). Backend converts each bucket's millions to
billions internally to compare against total_params_b.
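
The bucket semantics can be sketched as follows. This is an assumption-laden illustration, not the endpoint's code: `leaders`, the model rows, and the `score` field are hypothetical, while the `buckets` value, the [min, max) ranges, and the millions-to-billions conversion follow the description above.

```python
import json

buckets_param = "[[0,500],[500,1000],[1000,5000],[5000,null]]"

# Illustrative models; total_params_b is in billions as in the text.
models = [
    {"name": "tiny", "total_params_b": 0.3, "score": 0.55},
    {"name": "small", "total_params_b": 0.4, "score": 0.60},
    {"name": "mid", "total_params_b": 0.5, "score": 0.62},
    {"name": "large", "total_params_b": 8.0, "score": 0.70},
]


def leaders(models, buckets_json):
    """Return the best-scoring model name per [min, max) bucket (None if empty)."""
    result = []
    for bucket in json.loads(buckets_json):
        min_m = bucket[0]
        max_m = bucket[1] if len(bucket) > 1 else None  # omitted or null = open-ended
        min_b = min_m / 1000  # millions -> billions
        max_b = None if max_m is None else max_m / 1000
        in_range = [
            m for m in models
            if m["total_params_b"] >= min_b
            and (max_b is None or m["total_params_b"] < max_b)
        ]
        best = max(in_range, key=lambda m: m["score"], default=None)
        result.append(best["name"] if best else None)
    return result


top_per_bucket = leaders(models, buckets_param)
```

Because the upper bound is exclusive, a 500M-parameter model falls into the second bucket rather than the first.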

Score selection prefers mean_task but falls back to
mean_task_type for benchmark builders that only populate the
latter (MIEB / ViDoRe-style) — wit...

2.15.6 (2026-06-20)

Fix

fix: results mteb version parse (#4840)

fix version parse (23c96a3)

2.15.5 (2026-06-19)

Ci

ci: revert leaderboard refresh trusted publishing (#4831)

revert leaderboard refresh (14cf6d6)

ci: Setup hf trusted publishing (#4804)
setup hf trusted publishing
remove comment (a1f0a62)

Fix

fix: Get modalities of models from config (#4789)

infer modalities (2c17992)

Unknown

model: update VultronRetrieverPrime-Qwen3.5-8B repo path to the vultr org (#4836)

model: Update VultronRetrieverPrime-Qwen3.5-8B metadata to reflect new repository path (e091c0d)

Add VultronRetrieverPrime-Qwen3.5-8B ModelMeta (#4833)

Late-interaction (ColBERT MaxSim) visual document retriever: ColQwen3.5, dim 320,
8.4B params, Apache-2.0, 6 languages. Reuses the existing ColQwen3_5Wrapper.
Official ViDoRe scores V1 0.9208 / V2 0.6818 / V3 0.6472 (results PR to follow).

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> (99a830c)

Update Querit model implementation: Supplementing the citation information of Querit (#4824)

Update Querit model implementation: Supplementing the citation information of the Querit

Co-authored-by: zhongyunfei <zhongyunfei@baidu.com> (cfce4f2)

model: Add VIRTUE multimodal embedding models (Sony VIRTUE-2B/7B-SCaR) (#4822)
Add VIRTUE multimodal embedding models (Sony VIRTUE-2B/7B-SCaR)
address review feedback (41f6c7e)
Fix Querit model implementation: Supplementing the base model information of the Querit/Querit-4B (#4819)
Fix Querit model implementation: Supplementing the base model information of the Querit/Querit-4B
Fix Querit model implementation: Supplementing the base model information of the Querit/Querit-4B

Co-authored-by: zhongyunfei <zhongyunfei@baidu.com> (240da5b)

mveb: fix and unify domain tags across all 50 source datasets (#4738)
mveb: fix and unify domain tags across all 50 source datasets

The MVEB+ video task set had inconsistent and partially-wrong domains
tags. Issues fixed:

MSR-VTT had no domain tags at all (empty list). Now tagged ["Web"].
AVMeme-Exam was tagged with "Music" (it's internet memes, not music
content). Now ["Entertainment", "Web"].
AudioCaps_AV was tagged "Encyclopaedic" (it's audio captioning). Now
["AudioScene", "Web"].
VGGSound was tagged just ["Web"] despite being audio-visual events.
Now ["AudioScene", "Web"]. Same fix for VGGSound_AV_RETRIEVAL.
AV-SpeakerBench was tagged ("Web") on the base task and ("Spoken")
on the PC variant --- same source data, inconsistent tags. Unified
to ("Spoken").
WorldSense_1min was over-tagged with Entertainment+Music in some
files and just ["Web"] in others. Unified to ["AudioScene", "Scene",
"Web"].
Several datasets tagged "Spoken" without speech-driven content
(DiDeMo, MSVD, ActivityNetCaptions, VATEX, panda-70m, TUNA-Bench).
Removed the Spoken tag from those.
AVE-Dataset clustering tasks tagged with ["Music", "Scene", "Spoken"]
(clearly wrong). Now aligned with the rest of AVE-Dataset:
["AudioScene", "Web"].
MELD was tagged just ["Entertainment"] across base and clustering
variants; MELD is the Friends sitcom, so dialogue is central.
Added "Spoken" -> ["Entertainment", "Spoken"].
UCF101 missing "Sport" tag. UCF101 has substantial Sport content.
Now ["Scene", "Sport", "Web"].
Human-Animal-Cartoon missing "Entertainment" tag despite the cartoon
domain. Now ["Entertainment", "Scene", "Web"].
PerceptionTest missing "Scene" tag despite being a scene-perception
benchmark. Now ["Scene", "Web"].
Video-MME missing "Spoken" tag despite the narration-heavy content.
Now ["Spoken", "Web"].
HMDB51 missing "Web" tag (sourced largely from web video). Now
["Scene", "Web"].
VideoCon, Vinoground (zachz/*) missing "Web" tag. Added.
RAVDESS tag list kept at ["Spoken"] (speech-emotion primary).
AVQA tag list extended with "AudioScene" (it's an audio-visual QA
benchmark).

All 50 unique source datasets across 184 video tasks now have
consistent, non-empty domain tags. Verified by re-importing every
task: 184 tasks load cleanly.

Tags use only the existing TaskDomain Literal vocabulary in
task_metadata.py; no new domains added.

mveb: enrich domain taxonomy + fix mislabeled action/egocentric/meme datasets

Adds 5 video content domains to TaskDomain (Activity, Instructional,
Egocentric, Nature, Animation) and re-tags datasets that were mislabeled
or under-characterized, so the domain set actually reflects benchmark
content:

Action recognition (Kinetics-400/600/700, HMDB51, UCF101, SSv2,
ActivityNet, VATEX, NExT-QA, Vinoground, VideoCon) -> Activity
(was the catch-all "Scene", which means visual place/setting).
Breakfast, YouCook2 -> Instructional (cooking / how-to).
Diving48 -> Activity + Sport.
EgoSchema -> Egocentric (was bare "Web").
Human-Animal-Cartoon -> Activity + Animation + Nature.
AVMeme-Exam -> + Social (internet memes).
PerceptionTest -> drop misapplied "Scene".

Scene is now reserved for genuine visual-scene content (WorldSense).
All 184 video tasks load; every domain validates against TaskDomain.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> (343df1a)

Update Querit model implementation: 4B version of Querit-Reranker newly open-sourced (#4808)
Update Querit model implementation: 4B version of Querit-Reranker newly open-sourced.
Update Querit model implementation: 4B version of Querit-Reranker newly open-sourced.
Update Querit model implementation: 4B version of Querit-Reranker newly open-sourced.
Apply suggestions from code review

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

Apply suggestions from code review

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>

Co-authored-by: zhongyunfei <zhongyunfei@baidu.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> (4976113)

Rename number_texts_intersect_with_train to samples_in_train (#4809) (e76f291)
polish MVEB leaderboard names + icons (#4803)

benchmarks: polish MVEB leaderboard names + icons

display_name: align scope names with MAEB/MIEB convention —
"Video" -> "MVEB Video-Only" (cf. "MAEB Audio-Only"),
"Text+Video" -> "MVEB Video-Text" (cf. MIEB "Image-Text").
MVEB(beta) stays "MVEB" as the headline benchmark.
icon: switch the three displayed MVEB benchmarks from the monochrome,
unpinned master/svg/libre-gui-activity icon to the colored,
commit-pinned svg-color/libre-gui-video icon, matching the convention
used by the other benchmarks.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> (5039c18)

2.15.4 (2026-06-12)

Ci

ci: Update healthcheck for new leaderboard (#4802)

update healthcheck (225b98e)

Fix

fix: update revision for NightOwl-CodeEmbedding (#4799)

fix: update revision for nightowl code embedding model (8e00d7e)

Unknown

model: add Verm1ion/ColTurk-VDR-Qwen3VL-4B-v1.0 (#4796)
Add ColTurk-VDR-Qwen3VL-4B-v1.0 (late-interaction visual document retriever)
Apply review suggestions: n_embedding_parameters + adapted_from (7ea3685)
fix filter by model size (#4794) (dc31aec)

2.15.3 (2026-06-10)

Fix

fix: Support scikit-learn 1.9 in ZeroShotClassification (#4790)

scikit-learn 1.9 raises "ValueError: Mix of label input types" when
classification metrics receive string y_true with numeric y_pred.
Zeroshot predictions are always integer indices into the candidate
labels, so string dataset labels are now mapped to their candidate
index before scoring. Unmappable string labels raise a clear error
instead of silently scoring 0.0, which is what scikit-learn < 1.9 did.
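
The mapping step can be sketched like this. `labels_to_indices` is a hypothetical helper written for illustration; it mirrors the behaviour described above (string labels become candidate indices, unmappable labels raise) without claiming to match the actual mteb code.

```python
def labels_to_indices(y_true, candidate_labels):
    """Map string labels to their index in candidate_labels; pass ints through."""
    index_of = {label: i for i, label in enumerate(candidate_labels)}
    mapped = []
    for label in y_true:
        if isinstance(label, str):
            if label not in index_of:
                # Fail loudly instead of letting the sample silently score 0.0.
                raise ValueError(f"Label {label!r} is not among the candidate labels")
            mapped.append(index_of[label])
        else:
            mapped.append(int(label))
    return mapped
```

After this step both `y_true` and `y_pred` are integer indices, so the metric functions no longer receive mixed label types.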

Removes the <1.9.0 pin introduced as a stopgap in #4783.

Fixes #4784

Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com> (5e8f0e1)

2.15.2 (2026-06-10)

Fix

fix: normalize benchmark definitions (#4792)
fix: remove *extended and MAEB+

These are mostly for reproducibility, so we have moved them to their respective script repos.

embeddings-benchmark/maeb-paper#3
embeddings-benchmark/mveb-paper#1

remove script to maeb
fix imports
fix: normalize benchmark definitions

Normalize to:

definition of what is measured
rationale/notable characteristics
version updates (7071da1)

Unknown

model: NightOwl-CodeEmbedding (#4791)
model: add NightOwl CodeEmbedding metadata
fix: remove programming language codes from model metadata
fix: update memory usage for NightOwl CodeEmbedding model
fix: update NightOwl CodeEmbedding model metadata
fix: update revision for NightOwl CodeEmbedding model (b691769)

2.15.1 (2026-06-09)

Fix

fix: remove *extended and MAEB+ (#4786)
fix: remove *extended and MAEB+

These are mostly for reproducibility, so we have moved them to their respective script repos.

embeddings-benchmark/maeb-paper#3
embeddings-benchmark/mveb-paper#1

remove script to maeb
fix imports (f795558)

Unknown

Merge multimodal sentence transformers (#4785)
merge multimodal
add warning & replace usages (d5430cc)
model: Add video support to qwen3-vl embedding (#4699)
add video to qwen3
upd implementation
upd revision
fix MultimodalInstructSentenceTransformerModel
disable double sampling
move imports inside
simplify wrapper
fix typing (beee210)

2.15.0 (2026-06-09)

Ci

ci: Update actions (#4780)
update github
remove remove
upd
remove free space
rename step
fix conda warning
add concurrency to lb
remove concurrences (52a6f73)

Feature

feat: Add evaluation runtime for indexing and retrieval (#4639)
Add evaluation runtime for indexing and retrieval
change timer from maintaining state to passed as an attribute
add timer argument to load_data
add timer argument to evaluate
removed timer from kwargs
update plots for split/subsets
fix tests
fix typing errors
typing errors
change typing
correct typing
apply changes from review
update to handle overwritten load_data
changes from review
added evaluation phases merging logic and fix typecheck
added * separator at all places in load_data
changes from review
added split/subset at all places
change split/subset in other functions as well
small typecheck update
changes from review
reordering
implement for clustering task
implement for classification task
remove logger statement from clustering
implement for pair classification task
remove logger statement for pair classification task
implement for bitext mining task
implement for STS task
implement for summarization task
implement for sklearn evaluator
fix linter
Delete .specstory/history/2026-04-23_09-55Z-testing-mechanism-for-new-datasets.md
minor changes from review
fix typecheck
update stacklevel=2
add TimingStack as default argument
fix evaluators tests
update deprecate_evaluator
minor fix
modified implementation to handle override task
added utility function in TaskResult to plot timings
Added docs
changes from review
added comments
simplify plot calling and update docs with example
changed phases naming format in plots
add tests and minor changes
update to handle indexing and searching phase
rollback changes in deprecated_evaluator
move import to top level and minor fix in tests
fix import
update tests
change Scoring to aggregate level in Classification task
remove unwanted file
fix linter and typecheck errors after merge
revert changes in other classification task
changes from copilot review and add new test
changes from review
update condition
make lint
updated docs

Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com> (6a3e816)

2.14.9 (2026-06-08)

Fix

fix: Update lock and remove python limit for pylate and colbert_engine (#4783)
update lock
fix typing
fix typing
fix tests
remove torchcodec from missing imports
fix zero shot
pin sklearn (d93a2c0)
