Releases · huggingface/sentence-transformers

@tomaarsen

This minor version is a correctness- and robustness-focused release. It fixes a silent scoring bug for causal-LM rerankers, corrects several hard-negative mining and GIST loss edge cases, restores TSDAE on transformers v5, and adds Apple Silicon (MPS) support for the cached losses.

The headline fix affects chat-template models that read the final token position, i.e. causal-LM rerankers (like Qwen3-Reranker) and last-token-pooling embedders: when an over-long input was truncated, the chat template's trailing suffix (e.g. the assistant prefill the model scores from) was silently dropped, producing wrong scores with no error. There's also a forward-looking deprecation: loading local custom code without trust_remote_code=True now warns, and will require it from v6.0.

Install this version with

# Training + Inference
pip install sentence-transformers[train]==5.6.0

# Inference only, use one of:
pip install sentence-transformers==5.6.0
pip install sentence-transformers[onnx-gpu]==5.6.0
pip install sentence-transformers[onnx]==5.6.0
pip install sentence-transformers[openvino]==5.6.0

# Multimodal dependencies (optional):
pip install sentence-transformers[image]==5.6.0
pip install sentence-transformers[audio]==5.6.0
pip install sentence-transformers[video]==5.6.0

# Or combine as needed:
pip install sentence-transformers[train,onnx,image]==5.6.0

Fixed silently wrong scores when truncation drops chat-template suffixes (#3787)

Chat-template models render the full conversation to a flat string before tokenizing, so when the rendered input is longer than the tokenizer's model_max_length, the tokenizer truncates it from the right and drops the template's trailing suffix: the fixed tokens a template appends after the content, e.g. a prompt, instruction, [/INST], or a trailing EOS. For models that read the final token position, this silently corrupted the result:

causal-LM rerankers (e.g. Qwen/Qwen3-Reranker-0.6B) score a pair from the last token's yes/no logits, and
last-token-pooling embedders read the final hidden state.

When the suffix was truncated away, that final position landed mid-document instead of after the prefill, so the score or embedding came from the wrong place.

Transformer.preprocess now detects when truncation drops the suffix and splices it back onto the tail of each truncated row. Because the fix lives in the shared base Transformer, it applies across SentenceTransformer, CrossEncoder, and SparseEncoder. It's enabled by default and saved to the model configuration. Pass processing_kwargs={"chat_template": {"restore_suffix": False}} to opt back into raw truncation.
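As a rough illustration of the idea (not the library's actual implementation), the sketch below works on plain lists of token IDs. The helper name `restore_suffix` and the toy IDs are hypothetical; it assumes the suffix tokens are known up front and that the truncated row must still fit in `max_length`.

```python
def restore_suffix(token_ids, suffix_ids, max_length):
    """Truncate to max_length while keeping the template suffix at the tail.

    Hypothetical helper for illustration only.
    """
    if len(token_ids) <= max_length:
        return list(token_ids)  # nothing would be truncated
    if len(suffix_ids) >= max_length:
        raise ValueError("max_length is too small to hold the suffix")
    keep = max_length - len(suffix_ids)
    # Keep the head of the row, then splice the suffix back onto the tail.
    return list(token_ids[:keep]) + list(suffix_ids)


# Toy IDs: 1-8 are document tokens, 90 and 91 stand in for the template suffix.
suffix = [90, 91]
full = [1, 2, 3, 4, 5, 6, 7, 8] + suffix

raw = full[:6]  # plain right truncation: the suffix is gone
fixed = restore_suffix(full, suffix, max_length=6)

print(raw)    # [1, 2, 3, 4, 5, 6]
print(fixed)  # [1, 2, 3, 4, 90, 91]
```

With raw truncation the final position is document token 6. With the suffix restored, the final position is again the last suffix token, which is where a last-token model expects to read from.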

Hard-negative mining and GIST loss correctness (#3821, #3817, #3816)

A trio of correctness and scalability fixes for hard-negative mining and the GIST losses:

Sign-independent relative margin: mine_hard_negatives(relative_margin=...) and the margin_strategy="relative" branch of GISTEmbedLoss / CachedGISTEmbedLoss used a multiplicative threshold (positive * (1 - margin)) that only behaves correctly when the positive-pair similarity is positive. When that similarity was negative, the threshold moved the wrong way and let through false negatives: candidates more similar to the anchor than the true positive. The threshold is now positive - |positive| * margin, identical to before for positive scores but correct for negative ones.
Distributed positive masking in the GIST losses: with gather_across_devices=True and a non-zero margin, the false-negative suppression mask protected the wrong columns on ranks beyond the first (it ignored the per-rank offset into the gathered batch), which set the true positive's logit to -inf and produced a +inf loss. The mask now accounts for the cross-rank offset, so multi-GPU GIST training stays finite.
Memory-bounded mining without FAISS: mine_hard_negatives(use_faiss=False) (the default) materialized the full (queries × corpus) similarity matrix at once, which could OOM on large corpora. It now batches over the query axis (controlled by faiss_batch_size, default 16384), bounding peak memory while producing identical results.
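The relative-margin change in the first item can be checked with a few lines of arithmetic. This is a standalone sketch: the function names are hypothetical, and it assumes a candidate is kept as a negative when its similarity is at or below the threshold.

```python
def old_threshold(positive, margin):
    # Multiplicative form: only moves downward when `positive` is positive.
    return positive * (1 - margin)


def new_threshold(positive, margin):
    # Sign-independent form: always moves the threshold below `positive`.
    return positive - abs(positive) * margin


margin = 0.1

# Positive similarity: both forms agree (up to floating-point rounding).
print(old_threshold(0.8, margin), new_threshold(0.8, margin))

# Negative similarity: the old threshold lands ABOVE the positive score.
positive = -0.5
candidate = -0.47  # more similar to the anchor than the true positive
print(candidate <= old_threshold(positive, margin))  # True: false negative kept
print(candidate <= new_threshold(positive, margin))  # False: correctly rejected
```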

TSDAE weight tying restored on `transformers` v5 (#3781)

transformers v5 removed the private PreTrainedModel._tie_encoder_decoder_weights helper that DenoisingAutoEncoderLoss (TSDAE) used to tie its separate encoder and decoder. As a stopgap, v5.5 raised a RuntimeError for the default tie_encoder_decoder=True on transformers >= 5.0.0, effectively breaking TSDAE there unless you pinned an older transformers or disabled tying. TSDAE now ships its own tying routine that shares storage between encoder and decoder, so it works on both transformers <5 and >=5 with the default settings.

Deprecation: loading local custom code without `trust_remote_code` (#3807)

Sentence Transformers has historically treated any local model directory as implicitly trusted: local custom code (e.g. modeling_*.py) loaded even with trust_remote_code=False, unlike transformers. Because this discrepancy can be unexpected, loading local custom code this way now emits a FutureWarning, and from v6.0 it will require trust_remote_code=True, as transformers does.

Apple Silicon (MPS) support (#3812, #3818)

Two fixes for training on Apple Silicon:

Cached losses on MPS: CachedMultipleNegativesRankingLoss and CachedGISTEmbedLoss crashed at construction on MPS because their RandContext used a CUDA-only RNG path. They now run on MPS with deterministic replay preserved.
Legacy fit path and SparseEncoder sparsity on MPS: the legacy model.fit(..., use_amp=True) path hard-coded CUDA's AMP GradScaler / autocast, and SparseEncoder sparsity statistics called to_sparse_csr(), which is unimplemented on MPS. Both now work on Apple Silicon.

Bug Fixes

Cast learning-to-rank loss logits to float32 in #3800: the listwise learning-to-rank losses scatter the model's logits into a float32 matrix, which crashed with a dtype mismatch when the model itself was in bf16/fp16 and trained without bf16=True/fp16=True (with those enabled, autocast outputs float32 logits, so the common path was unaffected). Logits are now upcast to float32 in the loss.
Don't override device_map placement with the device argument in #3823: loading with model_kwargs={"device_map": ...} previously placed the backbone via accelerate, then immediately moved it to the default device, defeating device_map. It now keeps the backbone in place (moving the other modules onto its device), and warns if both device and device_map are passed.
Collapse single-key multimodal dicts to the bare modality in #3779: a single-key input like {"image": img} was classified as a combined modality and rejected by models that support only that one modality (e.g. BGE-VL), with a self-contradicting error. It is now treated as the bare "image" input, which unblocks vision-retrieval benchmarks like MTEB that pass {"image": ...}.
Clarify unsupported-modality error messages in #3792: mixed or combined multimodal inputs on models that can't fuse modalities produced confusing errors (e.g. Modality 'message' is not supported). The errors are now scenario-specific and suggest what to do, such as encoding each modality separately.
Guard distributed APIs in get_device_name in #3798: on PyTorch builds where torch.distributed is present but unavailable (some ROCm and CPU-only builds), get_device_name() crashed with AttributeError: module 'torch.distributed' has no attribute 'is_initialized'. It now checks is_available() first, across all distributed call sites.
Fix OpenVINO static quantization for optimum-intel 2.0 / OpenVINO 2026 in #3814: export_static_quantized_openvino_model defaulted its calibration dataset to the bare id "glue", which the stricter Hub repo-id validation now rejects. The default is now the namespaced "nyu-mll/glue".

Examples, Documentation, and Notebooks

A batch of example and documentation modernization, mostly migrating example scripts off deprecated datasets script-loaders and bare ids onto maintained Hugging Face datasets so they run on datasets 4.x:

Fix example datasets that crash on datasets 4.x in #3782 (e.g. quora, nq_open, yahoo_answers_topics).
Migrate the MS MARCO examples to datasets in #3783.
Migrate the multilingual parallel-sentences data-prep scripts to datasets in #3784.
Migrate the Quora semantic-search and clustering examples to datasets in #3785.
Modernize the AugSBERT data-augmentation STSb scripts in #3806, also fixing the QQP cross-domain script to use the quora-duplicates labels it previously ignored.
Refresh the CLIP / image-search notebooks in #3780, and modernize the CLIP training notebook in #3805.
Fix a reStructuredText table in the ContrastiveTensionLoss docstring in #3788, and add a low-VRAM hardware note to the efficiency docs in #3802.
Fix package build warnings and errors in #3809, and a batch of Sphinx doc-build problems in #3811.

All Changes

[chore] Increment dev version by @tomaarsen in #3775
[fix] Collapse single-key multimodal dicts to bare modality by @tomaarsen in #3779
chore: enable Dependabot weekly GitHub Actions bumps by @hf-dependantbot-rollout[bot] in #3786
Bump the actions group with 3 updates by @dependabot[bot] in #3790
Fix example datasets that crash on datasets 4.x by @omkar-334 in #3782
CLIP Notebooks Refresh by @lbourdois in #3780
fix TSDAE ...

@tomaarsen

This patch release fixes a small quirk with multimodal inference when using single-key multimodal inputs like model.encode({"image": ...}).

Install this version with

# Training + Inference
pip install sentence-transformers[train]==5.5.1

# Inference only, use one of:
pip install sentence-transformers==5.5.1
pip install sentence-transformers[onnx-gpu]==5.5.1
pip install sentence-transformers[onnx]==5.5.1
pip install sentence-transformers[openvino]==5.5.1

# Multimodal dependencies (optional):
pip install sentence-transformers[image]==5.5.1
pip install sentence-transformers[audio]==5.5.1
pip install sentence-transformers[video]==5.5.1

# Or combine as needed:
pip install sentence-transformers[train,onnx,image]==5.5.1

Bug fixed

Previously, inference like model.encode({"image": ...}) or model.encode([{"image": ...}, ...]) would be inferred as the ("image",) modality, which differed from the inferred modality of "image" for just model.encode(my_image) or model.encode([my_image, my_image_2, ...]).

This resulted in confusing errors if the model didn't have a modality_config mapping for ("image",) in addition to "image", so a single-key multimodal dict is now collapsed to the bare modality (just "image" in this example).

For example, the following code was affected:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/BGE-VL-base', trust_remote_code=True)
embedding = model.encode({"image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/ettin-reranker/mteb_ndcg10_all-MiniLM-L6-v2.png"})
print(embedding.shape)

This previously failed because the model only implements a path for "text", "image", and ("image", "text").
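The collapsing rule can be pictured with a small standalone sketch. The helper `modality_of_dict` and the `supported` set are hypothetical stand-ins, not the library's internals; the sketch assumes combined modalities are represented as sorted tuples of keys, matching the ("image", "text") form mentioned above.

```python
def modality_of_dict(item):
    """Infer the modality key for a dict input (illustrative only)."""
    keys = tuple(sorted(item))
    if len(keys) == 1:
        return keys[0]  # collapse ("image",) to the bare "image"
    return keys         # genuinely combined input, e.g. ("image", "text")


# Paths a BGE-VL-like model implements, per the description above.
supported = {"text", "image", ("image", "text")}

single = modality_of_dict({"image": "cat.jpg"})
combined = modality_of_dict({"text": "a cat", "image": "cat.jpg"})

print(single, single in supported)      # image True
print(combined, combined in supported)  # ('image', 'text') True
print(("image",) in supported)          # False: the pre-fix lookup would miss
```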

All Changes

[fix] Collapse single-key multimodal dicts to bare modality by @tomaarsen in #3779

Full Changelog: v5.5.0...v5.5.1

@tomaarsen

This release ships the train-sentence-transformers Agent Skill, adds two new training losses, and brings a long list of robustness and correctness fixes.

The new train-sentence-transformers Agent Skill lets AI coding agents (Claude Code, Codex, Cursor, Gemini CLI, ...) drive end-to-end training and fine-tuning across all three model types. EmbedDistillLoss is a new embedding-level knowledge distillation loss for SentenceTransformer: it aligns a student model's embeddings with pre-computed teacher embeddings, an alternative to the score-based distillation provided by MarginMSELoss and DistillKLDivLoss. ADRMSELoss is a new listwise learning-to-rank loss for CrossEncoder from the Rank-DistiLLM paper. encode() and predict() also gain a per-call processing_kwargs override, and more.

Install this version with

# Training + Inference
pip install sentence-transformers[train]==5.5.0

# Inference only, use one of:
pip install sentence-transformers==5.5.0
pip install sentence-transformers[onnx-gpu]==5.5.0
pip install sentence-transformers[onnx]==5.5.0
pip install sentence-transformers[openvino]==5.5.0

# Multimodal dependencies (optional):
pip install sentence-transformers[image]==5.5.0
pip install sentence-transformers[audio]==5.5.0
pip install sentence-transformers[video]==5.5.0

# Or combine as needed:
pip install sentence-transformers[train,onnx,image]==5.5.0

The `train-sentence-transformers` Agent Skill (#3752)

If you use an AI coding agent (Claude Code, Codex, Cursor, Gemini CLI, OpenCode, ...), you can now install the train-sentence-transformers Agent Skill and ask your agent to fine-tune a model on your data:

hf skills add train-sentence-transformers              # installs under ./.agents/skills/
hf skills add train-sentence-transformers --global     # installs under ~/.agents/skills/
hf skills add train-sentence-transformers --claude     # also symlinks into .claude/skills/

The skill gives the agent curated, version-aware guidance for training SentenceTransformer (bi-encoder), CrossEncoder (reranker), and SparseEncoder/SPLADE models, covering base model selection, loss and evaluator choice, hard-negative mining, distillation, LoRA, Matryoshka, multilingual training, static embeddings, plus a set of production-ready training template scripts. Then you can prompt your agent with things like:

"Train a multilingual sentence-transformer on Dutch legal pairs."

"Fine-tune a cross-encoder reranker on (question, answer) pairs from my dataset, mine hard negatives, and push to my Hub repo."

"Train a German sparse embedding model with high sparsity."

"Can you train a static embedding model on 100k code triplets?"

The skill lives in the repository under skills/train-sentence-transformers/ and is mirrored to the huggingface/skills marketplace on each release.

New loss: EmbedDistillLoss (#3665)

Introduces EmbedDistillLoss (Kim et al., 2023), an embedding-level knowledge distillation loss for SentenceTransformer. Rather than distilling teacher scores (MarginMSELoss, DistillKLDivLoss), it directly aligns the student's sentence_embedding with a pre-computed teacher embedding passed via the dataset's label column. The comparison uses a configurable distance_metric, one of "cosine" (the default), "l2", or "mse". When the student and teacher dimensions differ, pass projection_dim=<teacher_dim> to add a learnable projection from the student's embedding space into the teacher's. That projection lives on the loss rather than on the saved model, so use loss.save_projection(...) / loss.load_projection(...) to reuse it across stages (e.g. as done in Arkam et al. for Jina v5). As part of this change, MSELoss is now a thin subclass of EmbedDistillLoss with distance_metric="mse", and also gains the optional projection_dim argument.

from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.sentence_transformer.losses import EmbedDistillLoss

student_model = SentenceTransformer("microsoft/mpnet-base")
teacher_model = SentenceTransformer("all-mpnet-base-v2")

train_dataset = Dataset.from_dict({
    "sentence": ["It's nice weather outside today.", "He drove to work."],
})

# Pre-compute teacher embeddings once and store them as the `label` column
def add_teacher_embeddings(batch):
    return {"label": teacher_model.encode(batch["sentence"]).tolist()}

train_dataset = train_dataset.map(add_teacher_embeddings, batched=True)

loss = EmbedDistillLoss(student_model, distance_metric="cosine")
# If the student and teacher dimensions differ, add a learnable projection:
# loss = EmbedDistillLoss(student_model, distance_metric="cosine", projection_dim=768)

trainer = SentenceTransformerTrainer(
    model=student_model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()

See the updated model distillation examples and the loss overview for more.

New loss: ADRMSELoss for Cross Encoders (#3690)

Introduces ADRMSELoss (Approx Discounted Rank Mean Squared Error), a listwise learning-to-rank loss for CrossEncoder from the Rank-DistiLLM paper (Schlatt et al., ECIR 2025). It computes a differentiable approximation of each document's rank via pairwise sigmoids and minimizes the nDCG-discounted squared error against the true ranks derived from the labels. It expects listwise inputs: a (query, [doc1, ..., docN]) pair plus a [score1, ..., scoreN] label list per sample (binary or continuous labels, variable document counts allowed). It's designed for LLM-distillation reranking, where the per-document scores come from a strong LLM's ordering.

from datasets import Dataset
from sentence_transformers import CrossEncoder, CrossEncoderTrainer
from sentence_transformers.cross_encoder.losses import ADRMSELoss

model = CrossEncoder("microsoft/mpnet-base")
train_dataset = Dataset.from_dict({
    "query": ["What are pandas?", "What is the capital of France?"],
    "docs": [
        ["Pandas are a kind of bear.", "Pandas are kind of like fish."],
        ["The capital of France is Paris.", "Paris is the capital of France.", "Paris is quite large."],
    ],
    "scores": [[0.95, 0.1], [0.98, 0.92, 0.2]],
})
loss = ADRMSELoss(model)

trainer = CrossEncoderTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()

There's a full MS MARCO example at training_ms_marco_adrmse.py. Note that LambdaLoss generally remains the strongest loss in the listwise family. See the Cross Encoder loss overview for guidance on picking a loss.
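To make the ADRMSELoss description above more concrete, here is a dependency-free sketch of the computation for a single query. It is a reading of the prose, not the library's code: the helper names, the `temperature` parameter, the tie-breaking of equal labels, and the mean reduction are all assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def approx_ranks(scores, temperature=1.0):
    """Differentiable rank estimate: 1 + sum of pairwise sigmoids."""
    ranks = []
    for i, s_i in enumerate(scores):
        others = (s_j for j, s_j in enumerate(scores) if j != i)
        ranks.append(1.0 + sum(sigmoid((s_j - s_i) / temperature) for s_j in others))
    return ranks


def true_ranks(labels):
    """Rank 1 goes to the highest label; ties keep their input order."""
    order = sorted(range(len(labels)), key=lambda i: labels[i], reverse=True)
    ranks = [0] * len(labels)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def adr_mse(scores, labels, temperature=1.0):
    """nDCG-discounted squared error between true and approximate ranks."""
    total = 0.0
    for predicted, target in zip(approx_ranks(scores, temperature), true_ranks(labels)):
        discount = 1.0 / math.log2(target + 1)  # errors at top ranks weigh more
        total += discount * (target - predicted) ** 2
    return total / len(scores)


labels = [0.9, 0.5, 0.1]
good = adr_mse([10.0, 0.0, -10.0], labels)  # model agrees with the labels
bad = adr_mse([-10.0, 0.0, 10.0], labels)   # model reverses the order
print(good < bad)  # True
```

Because the discount shrinks with the target rank, misplacing the document that should be first costs more than misplacing the one that should be last.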

Per-call `processing_kwargs` override (#3753)

SentenceTransformer.encode() / encode_query() / encode_document(), SparseEncoder.encode(), CrossEncoder.predict(), and model.preprocess() now accept a processing_kwargs argument that overrides the processor/tokenizer kwargs configured at construction time, for a single call. It has the same nested structure as the processing_kwargs constructor argument (top-level keys text, audio, image, video, common, chat_template) and is shallow-merged on top of the instance-level settings, so you can override just one setting (e.g. max_length) and leave the rest intact.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Override processor kwargs (e.g. max_length, truncation) for this call only:
embeddings = model.encode(
    ["a short text", "a much longer text that you want truncated more aggressively ..."],
    processing_kwargs={"text": {"max_length": 256, "truncation": True}},
)

This is especially handy for vision-language models, where you can change the image resolution per call, e.g. model.encode(images, processing_kwargs={"image": {"max_pixels": 256 * 256}}).

Smaller Features

Allow CrossEncoder module stacks that don't start with a Transformer, and recognize a trailing Dense(module_output_name="scores") as the scoring head, by @tomaarsen in #3742: num_labels now reads that head's out_features, and model.config / model.model return None when there's no underlying transformers model.
Infer that a model is an IR model on its generated model card when an InformationRetrievalEvaluator / NanoBEIREvaluator (or their sparse variants) was used during training, by @tomaarsen in #3741: the usage snippet then shows encode_query / encode_document, even without IR prompt names or a Router architecture.
Warn at model-load time when the installed transformers version is too old to honor use_bidirectional_attention / is_causal flags in a model's config (e.g. for google/embeddinggemma-300m), rather than silently ignoring them, by @tomaarsen in #3726.

Bug Fixes

Use the first non-pad token for CLS pooling with left-padding tokenizers by @tomaarsen in #3767: pooling_mode="cls" previously returned the embedding at position 0, which is a [PAD] token for left-padded inputs (common with decoder-only models), silently producing incorrect sentence embeddings. It now ...

@tomaarsen

This patch release allows encode() and predict() to accept 1D numpy string arrays as inputs.

Install this version with

# Training + Inference
pip install sentence-transformers[train]==5.4.1

# Inference only, use one of:
pip install sentence-transformers==5.4.1
pip install sentence-transformers[onnx-gpu]==5.4.1
pip install sentence-transformers[onnx]==5.4.1
pip install sentence-transformers[openvino]==5.4.1

# Multimodal dependencies (optional):
pip install sentence-transformers[image]==5.4.1
pip install sentence-transformers[audio]==5.4.1
pip install sentence-transformers[video]==5.4.1

# Or combine as needed:
pip install sentence-transformers[train,onnx,image]==5.4.1

Numpy string/object arrays as batches (#3720)

encode() and predict() now correctly recognize 1D numpy string/object arrays as batches rather than singular inputs. Previously, something like model.encode(df["text"].to_numpy()) was silently treated as a single input and produced incorrect output. 1D numpy arrays with dtype.kind in ("U", "O") are now unpacked like lists, and 2D+ arrays are treated as batches of pairs (for CrossEncoder).

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Previously treated as one input; now correctly encoded as 3 separate texts
embeddings = model.encode(np.array(["first", "second", "third"]))
print(embeddings.shape)
# (3, 384)

For CrossEncoder, a 1D numpy string array is still treated as a single [query, document] pair to match the existing list behavior, while a 2D array of shape (N, 2) is a batch of N pairs.

Safer activation function loading in `Dense` (#3714)

The Dense module stores its activation function as a dotted import path in its saved config (e.g. "torch.nn.modules.activation.Tanh"), which was then resolved via import_from_string whenever the module was loaded. Because any importable Python callable could be referenced, a maliciously crafted config.json on the Hub could trigger arbitrary imports at model load time.

The loader now only resolves activation functions whose import path starts with torch.. Anything else is skipped with a warning and replaced by the default activation (Tanh). To load a model with a custom (non-torch) activation function, opt in explicitly with trust_remote_code=True:

from sentence_transformers import SentenceTransformer

# Torch-provided activations load as before
model = SentenceTransformer("some/model-with-torch-activation")

# Non-torch activations now require explicit opt-in
model = SentenceTransformer("some/model-with-custom-activation", trust_remote_code=True)

This mirrors the opt-in trust model already used by transformers for custom code, and ensures untrusted model repositories cannot smuggle arbitrary imports through the Dense activation config.
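The allow-list rule itself is simple enough to sketch without importing torch. The function `select_activation_path` below is hypothetical and only decides which dotted path would be resolved; the real loader's warning text and internals may differ.

```python
import warnings

DEFAULT_ACTIVATION = "torch.nn.modules.activation.Tanh"


def select_activation_path(path, trust_remote_code=False):
    """Return the dotted path that may be imported (illustrative only)."""
    if path.startswith("torch.") or trust_remote_code:
        return path
    warnings.warn(
        f"Skipping activation function {path!r}: only 'torch.' paths load "
        "without trust_remote_code=True. Falling back to the default."
    )
    return DEFAULT_ACTIVATION


# "my_package.activations.Swish" is a made-up non-torch path.
print(select_activation_path("torch.nn.modules.activation.GELU"))
print(select_activation_path("my_package.activations.Swish"))
print(select_activation_path("my_package.activations.Swish", trust_remote_code=True))
```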

What's Changed

[tests] Fix test_trainer_prompts for SE and ST after prompt handling moved into Transformer.preprocess by @tomaarsen in #3710
[chore] Increment dev version after v5.4 release by @tomaarsen in #3711
[docs] No revision needed anymore for nvidia nemotron by @tomaarsen in #3712
[chore] Replace evaluation_strategy with eval_strategy in a few more places by @tomaarsen in #3713
[security] Only load activation functions starting with 'torch' in the Dense module by @tomaarsen in #3714
[fix] Treat numpy string/object arrays as batches in encode/predict by @tomaarsen in #3720

Full Changelog: v5.4.0...v5.4.1

This large release introduces first-class multimodal support for both SentenceTransformer and CrossEncoder, making it easy to compute embeddings and rerank across text, images, audio, and video. The CrossEncoder class has been fully modularized, allowing for generative rerankers (CausalLM-based models) via a new LogitScore module. Flash Attention 2 now automatically skips padding for text-only inputs, providing significant speedups & memory reductions, especially when input lengths vary.

Blog post: Multimodal Embedding & Reranker Models with Sentence Transformers: a walkthrough of the new multimodal capabilities with some practical examples.

Migration guide: Migrating from v5.x to v5.4+: covers updated import paths, renamed parameters, and other softly breaking changes with deprecation warnings. Note that there are no hard deprecations: all existing code should continue to work, with warnings at worst.

Install this version with

# Training + Inference
pip install sentence-transformers[train]==5.4.0

# Inference only, use one of:
pip install sentence-transformers==5.4.0
pip install sentence-transformers[onnx-gpu]==5.4.0
pip install sentence-transformers[onnx]==5.4.0
pip install sentence-transformers[openvino]==5.4.0

# Multimodal dependencies (optional):
pip install sentence-transformers[image]==5.4.0
pip install sentence-transformers[audio]==5.4.0
pip install sentence-transformers[video]==5.4.0

# Or combine as needed:
pip install sentence-transformers[train,onnx,image]==5.4.0

Multimodal Embeddings with SentenceTransformer (#3554)

SentenceTransformer now natively supports vision-language models (VLMs) and other multimodal architectures. You can encode and compare across text, images, audio, videos, or combinations of these, with automatic modality detection and preprocessing. Models advertise which modalities they support via the new model.modalities property and model.supports() method.

Using a pretrained multimodal embedding model

from PIL import Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "Qwen/Qwen3-VL-Embedding-2B",
    model_kwargs={"attn_implementation": "flash_attention_2", "torch_dtype": "bfloat16"},
    processor_kwargs={"min_pixels": 28 * 28, "max_pixels": 600 * 600},
    revision="refs/pr/23",
)

# Check supported modalities
print(model.modalities)
# ['text', 'image', 'video', 'message']
print(model.supports("image"))
# True

# Encode text
text_embeddings = model.encode(["A photo of a cat", "A pollinator on a flower"])

# Encode images (PIL images, file paths, or URLs all work)
image_embeddings = model.encode([
    Image.open("cat.jpg"),
    "https://example.com/flower.jpg",
])

# Encode mixed text+image inputs
multimodal_embeddings = model.encode([
    {"text": "Describe this image", "image": Image.open("cat.jpg")},
])

# Compute cross-modal similarity
similarity = model.similarity(text_embeddings, image_embeddings)

Building multimodal models with Router

You can also compose separate encoders for different modalities using the new Router module. Unlike the single-backbone VLM approach, Router lets you combine any existing text and image encoders and route inputs based on detected modality:

from sentence_transformers import SentenceTransformer
from sentence_transformers.sentence_transformer.modules import Dense, Pooling, Router, Transformer

# Text encoder: MiniLM with mean pooling, projected to 768 dims to match image encoder
text_encoder = Transformer("sentence-transformers/all-MiniLM-L6-v2")
text_pooling = Pooling(text_encoder.get_embedding_dimension(), pooling_mode="mean")
text_projection = Dense(text_encoder.get_embedding_dimension(), 768)

# Image encoder: SigLIP outputs pooled embeddings directly
image_encoder = Transformer("google/siglip2-base-patch16-224")

# Route inputs to the appropriate encoder based on detected modality
router = Router(
    sub_modules={
        "text": [text_encoder, text_pooling, text_projection],
        "image": [image_encoder],
    },
)

model = SentenceTransformer(modules=[router])

# Text and image inputs are automatically routed to the correct encoder
text_embeddings = model.encode(["A photo of a cat"])
image_embeddings = model.encode(["https://example.com/cat.jpg"])
similarity = model.similarity(text_embeddings, image_embeddings)

Multimodal Reranking with CrossEncoder (#3554)

CrossEncoder now supports multimodal inputs for reranking, enabling cross-modal scoring of query-document pairs where either side can be text, images, audio, video, or mixed-modality content. This works with both generative rerankers (CausalLM-based, via the new LogitScore module) and encoder-based models. See the pretrained multimodal rerankers for models you can use right away.

from sentence_transformers import CrossEncoder

# Load a multimodal reranker
model = CrossEncoder("Qwen/Qwen3-VL-Reranker-2B", revision="refs/pr/11")

# Rank text documents against an image query (or vice versa)
results = model.rank(
    query="https://example.com/product.jpg",
    documents=["A red sneaker", "A blue dress", "A leather bag"],
)

Two training approaches are provided in the multimodal training examples:

Any-to-Any + LogitScore: Uses the full causal LM to generate a single token, scoring via log-odds of "1" vs "0".
Feature Extraction + Pooling + Dense: More memory-efficient alternative that skips the LM head.

Modular CrossEncoder Architecture (#3554)

CrossEncoder has been fully modularized, inheriting from BaseModel (which is a torch.nn.Sequential). You can now inspect, customize, and compose module chains, just like SentenceTransformer. See the custom models guide for full details.

from sentence_transformers import CrossEncoder

model = CrossEncoder("Qwen/Qwen3-Reranker-0.6B", revision="refs/pr/11")
print(model)
"""
CrossEncoder(
  (0): Transformer({'transformer_task': 'text-generation', ...})
  (1): LogitScore({'true_token_id': 9693, 'false_token_id': 2152, ...})
)
"""

Generative reranker support

Thanks to the modular architecture, generative rerankers like mixedbread-ai/mxbai-rerank-base-v2 now work out of the box. These models ship with a modules.json that configures the Transformer + LogitScore chain automatically:

from sentence_transformers import CrossEncoder

model = CrossEncoder("mixedbread-ai/mxbai-rerank-base-v2")
scores = model.predict([
    ("How many people live in Berlin?", "Berlin had a population of 3,520,031 in 2022."),
    ("How many people live in Berlin?", "Berlin is well known for its museums."),
])
# array([ 9. , -0.5], dtype=float32)

Module chain patterns

The Transformer module now supports multiple task types that determine how the underlying model is loaded and what outputs it produces:

"sequence-classification": Loads via AutoModelForSequenceClassification, returns classification logits directly.
"text-generation": Loads via AutoModelForCausalLM, returns raw logits from the language model head.
"any-to-any": Loads via AutoModelForMultimodalLM (transformers v5+), for multimodal causal LMs that accept interleaved image/text inputs.
"feature-extraction": Loads via AutoModel (no task-specific head), returns hidden states.

Various module chains are now possible; here are some common ones:

Encoder-based (Sequence Classification): A single Transformer module with transformer_task="sequence-classification", the traditional BERT/RoBERTa approach. This was previously the only option for CrossEncoder models.
CausalLM-based (Text Generation + LogitScore): For generative rerankers (Qwen, Llama, mxbai-rerank-v2, etc.), a Transformer with transformer_task="text-generation" followed by a LogitScore module that computes logit["yes"] - logit["no"] at the last token position. For multimodal rerankers, transformer_task="any-to-any" is used instead.
Feature Extraction + Pooling + Dense: A memory-efficient alternative that uses the base model without LM head, pools the last token, and projects to a single score via a Dense layer.
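The scoring step of the CausalLM-based chain can be sketched on plain Python lists. This is a toy illustration with a hypothetical `logit_score` helper and made-up token IDs, assuming the last row of `logits` belongs to the final real token (no right padding); it is not the LogitScore module itself.

```python
import math


def logit_score(logits, true_token_id, false_token_id):
    """logits: [seq_len][vocab_size] for one sequence; read the last position."""
    last = logits[-1]
    return last[true_token_id] - last[false_token_id]


def to_probability(score):
    # A softmax over just the two logits equals a sigmoid of their difference.
    return 1.0 / (1.0 + math.exp(-score))


# Toy vocabulary of 4 tokens; IDs 2 and 3 stand in for "yes" and "no".
logits = [
    [0.1, 0.2, 0.3, 0.4],  # earlier position, ignored by the score
    [0.0, 0.0, 4.0, 1.5],  # last position
]
score = logit_score(logits, true_token_id=2, false_token_id=3)
print(score)  # 2.5
print(to_probability(score) > 0.5)  # True
```

The sketch also shows why the last position matters: if truncation or padding shifts which row is last, the score is read from logits that were never meant to answer the yes/no question.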

When loading a model without a modules.json, CrossEncoder automatically selects the right chain: if the architecture ends with ForCausalLM, it uses text-generation + LogitScore (with "yes"/"no" tokens); otherwise it uses sequence-classification. You can also construct custom module chains explicitly:

from sentence_transformers import CrossEncoder
from sentence_transformers.cross_encoder.modules import Transformer, LogitScore

transformer = Transformer("Qwen/Qwen3-Reranker-0.6B", transformer_task="text-generation", revision="refs/pr/11")
true_id = transformer.tokenizer.convert_tokens_to_ids("yes")...

@tomaarsen

This minor version brings several improvements to contrastive learning: MultipleNegativesRankingLoss now supports alternative InfoNCE formulations (symmetric, GTE-style) and optional hardness weighting for harder negatives. Two new losses are introduced, GlobalOrthogonalRegularizationLoss for embedding space regularization and CachedSpladeLoss for memory-efficient SPLADE training. The release also adds a faster hashed batch sampler, fixes GroupByLabelBatchSampler for triplet losses, and ensures full compatibility with the latest Transformers v5 versions.

Install this version with

# Training + Inference
pip install sentence-transformers[train]==5.3.0

# Inference only, use one of:
pip install sentence-transformers==5.3.0
pip install sentence-transformers[onnx-gpu]==5.3.0
pip install sentence-transformers[onnx]==5.3.0
pip install sentence-transformers[openvino]==5.3.0

Updated MultipleNegativesRankingLoss (a.k.a. InfoNCE)

MultipleNegativesRankingLoss received two major upgrades: support for alternative InfoNCE formulations from the literature, and optional hardness weighting to up-weight harder negatives.

Support other InfoNCE variants (#3607)

MultipleNegativesRankingLoss now supports several well-known contrastive loss variants from the literature through new directions and partition_mode parameters. Previously, this loss only supported the standard forward direction (query → doc). You can now configure which similarity interactions are included in the loss:

"query_to_doc" (default): For each query, its matched document should score higher than all other documents.
"doc_to_query": The symmetric reverse — for each document, its matched query should score higher than all other queries.
"query_to_query": For each query, all other queries should score lower than its matched document.
"doc_to_doc": For each document, all other documents should score lower than its matched query.

The partition_mode controls how scores are normalized: "joint" computes a single softmax over all directions, while "per_direction" computes a separate softmax per direction and averages the losses.

These combine to reproduce several loss formulations from the literature:

Standard InfoNCE (default, unchanged behavior):

loss = MultipleNegativesRankingLoss(model)
# equivalent to directions=("query_to_doc",), partition_mode="joint"

Symmetric InfoNCE (Günther et al. 2024) — adds the reverse direction so both queries and documents are trained to find their match:

loss = MultipleNegativesRankingLoss(
    model,
    directions=("query_to_doc", "doc_to_query"),
    partition_mode="per_direction",
)

GTE improved contrastive loss (Li et al. 2023) — adds same-type negatives (query <-> query, doc <-> doc) for a stronger training signal, especially useful with pairs-only data:

loss = MultipleNegativesRankingLoss(
    model,
    directions=("query_to_doc", "query_to_query", "doc_to_query", "doc_to_doc"),
    partition_mode="joint",
)
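
To make the direction and partition semantics concrete, here is a minimal pure-Python sketch of the "per_direction" case with the two cross-type directions. It assumes a precomputed matrix of scaled similarities, and the helper names (`cross_entropy`, `direction_loss`, `per_direction_loss`) are illustrative, not part of the library. The same-type directions and the "joint" partition are left out because their exact masking is not described here.

```python
import math


def cross_entropy(logits: list[float], target: int) -> float:
    """Negative log-softmax of the target entry, computed stably."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[target]


def direction_loss(scores: list[list[float]], direction: str) -> float:
    """Mean loss for one direction; scores[i][j] = scale * sim(query_i, doc_j)."""
    n = len(scores)
    if direction == "query_to_doc":
        rows = scores  # each query ranks all documents
    elif direction == "doc_to_query":
        rows = [[scores[i][j] for i in range(n)] for j in range(n)]  # each document ranks all queries
    else:
        raise ValueError(f"direction not covered by this sketch: {direction}")
    # The matched item for row i sits at index i.
    return sum(cross_entropy(row, i) for i, row in enumerate(rows)) / n


def per_direction_loss(scores: list[list[float]], directions: tuple[str, ...]) -> float:
    """Separate softmax per direction, then the average of the per-direction losses."""
    return sum(direction_loss(scores, d) for d in directions) / len(directions)


scale = 20.0
cos_sims = [[0.9, 0.2, 0.1], [0.3, 0.8, 0.4], [0.0, 0.5, 0.7]]
scores = [[scale * s for s in row] for row in cos_sims]

standard = per_direction_loss(scores, ("query_to_doc",))
symmetric = per_direction_loss(scores, ("query_to_doc", "doc_to_query"))
print(standard, symmetric)
```

With a single direction, the sketch reduces to the standard query-to-document InfoNCE, which is why the default behavior is unchanged.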

Hardness-weighted contrastive learning (#3667)

Adds optional hardness weighting to MultipleNegativesRankingLoss and CachedMultipleNegativesRankingLoss, inspired by Lan et al. 2025 (LLaVE). This up-weights harder negatives in the softmax by adding hardness_strength * stop_grad(cos_sim) to selected negative logits. The feature is off by default (hardness_mode=None), so existing behavior is unchanged.

The hardness_mode parameter controls which negatives receive the penalty:

"in_batch_negatives": Penalizes in-batch negatives only (positives and hard negatives from other samples). Works with all data formats including pairs-only.
"hard_negatives": Penalizes explicit hard negatives only (columns beyond the first two). Only active when hard negatives are provided.
"all_negatives": Penalizes both in-batch and hard negatives, leaving only the positive unpenalized.

from sentence_transformers.losses import MultipleNegativesRankingLoss

loss = MultipleNegativesRankingLoss(
    model,
    hardness_mode="in_batch_negatives",
    hardness_strength=9.0,
)
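
The effect of the hardness term can be sketched in plain Python for the "in_batch_negatives" mode. This is a forward-pass illustration only: it assumes the extra term is added to the already scaled logits, and the function name `infonce_with_hardness` is hypothetical. In an autograd framework the added term would be detached (the stop_grad above), which plain Python cannot express.

```python
import math


def infonce_with_hardness(cos_sims: list[list[float]], scale: float = 20.0, hardness_strength: float = 0.0) -> float:
    """Mean query-to-doc InfoNCE where in-batch negatives get an extra hardness term."""
    n = len(cos_sims)
    total = 0.0
    for i in range(n):
        logits = []
        for j in range(n):
            logit = scale * cos_sims[i][j]
            if j != i:
                # In-batch negative: add hardness_strength * cos_sim (treated as a constant).
                logit += hardness_strength * cos_sims[i][j]
            logits.append(logit)
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[i]  # cross-entropy with the positive at index i
    return total / n


cos_sims = [[0.9, 0.2, 0.1], [0.3, 0.8, 0.4], [0.0, 0.5, 0.7]]
plain = infonce_with_hardness(cos_sims)
weighted = infonce_with_hardness(cos_sims, hardness_strength=9.0)
print(plain < weighted)  # True: similar negatives now take a larger share of the softmax
```

Because the bonus grows with the negative's similarity, the negatives that are closest to the query gain the most weight in the softmax.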

New loss: GlobalOrthogonalRegularizationLoss (#3654)

Introduces GlobalOrthogonalRegularizationLoss (Zhang et al. 2017), a regularization loss that encourages embeddings to be well-distributed in the embedding space. It penalizes two things: (1) high mean pairwise similarity across unrelated embeddings, and (2) high second moment of similarities (which indicates clustering). This loss is meant to be combined with a primary contrastive loss like MultipleNegativesRankingLoss. By wrapping both losses in a single module, you can share embeddings and only require one forward pass:

import torch
from datasets import Dataset
from torch import Tensor
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import GlobalOrthogonalRegularizationLoss, MultipleNegativesRankingLoss
from sentence_transformers.util import cos_sim

model = SentenceTransformer("microsoft/mpnet-base")
train_dataset = Dataset.from_dict({
    "anchor": ["It's nice weather outside today.", "He drove to work."],
    "positive": ["It's so sunny.", "He took the car to the office."],
})

class InfoNCEGORLoss(torch.nn.Module):
    def __init__(self, model: SentenceTransformer, similarity_fct=cos_sim, scale=20.0) -> None:
        super().__init__()
        self.model = model
        self.info_nce_loss = MultipleNegativesRankingLoss(model, similarity_fct=similarity_fct, scale=scale)
        self.gor_loss = GlobalOrthogonalRegularizationLoss(model, similarity_fct=similarity_fct)

    def forward(self, sentence_features: list[dict[str, Tensor]], labels: Tensor | None = None) -> dict[str, Tensor]:
        embeddings = [self.model(sentence_feature)["sentence_embedding"] for sentence_feature in sentence_features]
        info_nce_loss: dict[str, Tensor] = {
            "info_nce": self.info_nce_loss.compute_loss_from_embeddings(embeddings, labels)
        }
        gor_loss: dict[str, Tensor] = self.gor_loss.compute_loss_from_embeddings(embeddings, labels)
        return {**info_nce_loss, **gor_loss}

loss = InfoNCEGORLoss(model)
trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
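
For intuition about the two penalized quantities, the following sketch computes the regularizer as formulated in Zhang et al. 2017: the squared first moment plus the excess of the second moment over 1/d, taken over non-matching pairs. It is a plain-Python illustration under that assumption; the library's implementation may weight or aggregate the terms differently, and `gor_penalty` is a hypothetical name.

```python
import math


def normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def gor_penalty(anchors: list[list[float]], positives: list[list[float]]) -> float:
    """M1^2 + max(0, M2 - 1/d) over non-matching (anchor_i, positive_j) pairs, i != j."""
    anchors = [normalize(v) for v in anchors]
    positives = [normalize(v) for v in positives]
    dim = len(anchors[0])
    sims = [
        sum(a * p for a, p in zip(anchors[i], positives[j]))
        for i in range(len(anchors))
        for j in range(len(positives))
        if i != j
    ]
    first_moment = sum(sims) / len(sims)                   # mean similarity of unrelated pairs
    second_moment = sum(s * s for s in sims) / len(sims)   # mean squared similarity
    return first_moment**2 + max(0.0, second_moment - 1.0 / dim)


spread_out = [[1.0, 0.0], [0.0, 1.0]]
collapsed = [[1.0, 0.0], [1.0, 0.0]]
print(gor_penalty(spread_out, spread_out))  # 0.0
print(gor_penalty(collapsed, collapsed))    # 1.5
```

Orthogonal embeddings incur no penalty, while embeddings that collapse onto one direction are penalized through both terms.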

New loss: CachedSpladeLoss for memory-efficient SPLADE training (#3670)

Introduces CachedSpladeLoss, a gradient-cached version of SpladeLoss that enables training SPLADE models with larger batch sizes without additional GPU memory. It applies the GradCache technique at the SpladeLoss wrapper level, so both the base loss and regularizers receive pre-computed embeddings — no changes to existing base losses or regularizers are needed.

from datasets import Dataset
from sentence_transformers.sparse_encoder import SparseEncoder, SparseEncoderTrainer
from sentence_transformers.sparse_encoder.losses import CachedSpladeLoss, SparseMultipleNegativesRankingLoss

model = SparseEncoder("distilbert/distilbert-base-uncased")
train_dataset = Dataset.from_dict({
    "anchor": ["It's nice weather outside today.", "He drove to work."],
    "positive": ["It's so sunny.", "He took the car to the office."],
})
loss = CachedSpladeLoss(
    model=model,
    loss=SparseMultipleNegativesRankingLoss(model),
    document_regularizer_weight=3e-5,
    query_regularizer_weight=5e-5,
    mini_batch_size=32,
)

trainer = SparseEncoderTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()

Faster NoDuplicatesBatchSampler with hashing (#3611)

Adds a NO_DUPLICATES_HASHED batch sampler option, which uses the existing NoDuplicatesBatchSampler with precompute_hashes=True. This pre-computes xxhash 64-bit values for each sample, providing significant speedups for large batch sizes at a small memory cost. Requires the xxhash library.

from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    batch_sampler="NO_DUPLICATES_HASHED"  # Pre-computes hashes for faster duplicate checking
)
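
The idea behind pre-computed hashes can be sketched as follows. This is not the sampler's actual code: `hashlib.blake2b` with an 8-byte digest stands in for xxhash so the example runs with the standard library alone, the function names are hypothetical, and the real sampler also handles shuffling and incomplete batches, which are omitted here.

```python
import hashlib


def hash64(text: str) -> int:
    """64-bit hash of a string; blake2b is a stand-in for xxhash in this sketch."""
    return int.from_bytes(hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest(), "big")


def no_duplicate_batches(samples: list[tuple[str, ...]], batch_size: int) -> list[list[int]]:
    """Greedily build batches of sample indices in which no text occurs twice."""
    # Hash every text once up front, so duplicate checks only compare integers.
    hashes = [{hash64(text) for text in sample} for sample in samples]
    remaining = list(range(len(samples)))
    batches = []
    while remaining:
        batch, seen, deferred = [], set(), []
        for index in remaining:
            if len(batch) < batch_size and seen.isdisjoint(hashes[index]):
                batch.append(index)
                seen |= hashes[index]
            else:
                deferred.append(index)  # retry this sample in a later batch
        batches.append(batch)
        remaining = deferred
    return batches


samples = [("a", "b"), ("a", "c"), ("d", "e"), ("f", "b"), ("g", "h")]
print(no_duplicate_batches(samples, batch_size=2))  # [[0, 2], [1, 3], [4]]
```

The memory cost mentioned above corresponds to the `hashes` list, which holds one small set of integers per sample.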

GroupByLabelBatchSampler improvements for triplet losses (#3668)

Fixes a critical issue where GroupByLabelBatchSampler produced ~99% single-class batches, causing zero gradients with triplet losses. The sampler now uses round-robin interleaving where each label emits 2 samples per round, with the label visit order reshuffled every round. This guarantees every batch contains multiple distinct labels, each with at least 2 samples.
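
A simplified sketch of the round-robin idea is shown below. It is an assumption-laden illustration rather than the sampler's implementation: the function name is hypothetical, batches are formed inside a round from `batch_size // 2` labels, and labels that do not fit into a full batch in a given round simply wait for a later round.

```python
import random
from collections import defaultdict


def round_robin_label_batches(labels: list[int], batch_size: int, seed: int = 0) -> list[list[int]]:
    """Build batches where every label present contributes exactly 2 samples."""
    if batch_size < 4 or batch_size % 2:
        raise ValueError("batch_size must be an even number >= 4")
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    for indices in by_label.values():
        rng.shuffle(indices)

    labels_per_batch = batch_size // 2
    batches = []
    while True:
        # Only labels that can still emit a pair take part in this round.
        active = [label for label, indices in by_label.items() if len(indices) >= 2]
        if len(active) < labels_per_batch:
            break
        rng.shuffle(active)  # new label visit order every round
        for start in range(0, len(active) - labels_per_batch + 1, labels_per_batch):
            batch = []
            for label in active[start:start + labels_per_batch]:
                batch.extend([by_label[label].pop(), by_label[label].pop()])
            batches.append(batch)
    return batches


example_labels = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
example_batches = round_robin_label_batches(example_labels, batch_size=4)
for batch in example_batches:
    print(sorted(example_labels[i] for i in batch))  # two labels, two samples each
```

Since every label in a batch appears twice and at least two labels are present, each anchor has both a positive and a negative available for a triplet loss.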

Transformers v5 compatibility

This release includes full compatibility updates for Transformers v5:

Compatibility with transformers 5.0.0rc01 and later versions (#3597, #3615)
Support for T5Gemma and T5Gemma2 models (#3644)
Transformers v5.2 compatibility for the trainer's _nested_gather method (#3664)
Support for both warmup_steps and warmup_ratio until Transformers v4 support is dropped (#3645)
Updated CI to test against full Transformers v5 (#3615)

Minor Features

Add triplets/n-tuple support to AnglE by @tomaarsen in #3609
Replace requests dependency with optional httpx dependency by @tomaarsen in #3618
Specify numpy manually in dependencies by @tomaarsen in #3608
Support excluding prompt tokens with pooling with left-padding tokenizer by @tomaarsen in #3598

Bug Fixes

Fix InformationRetrievalEvaluator prediction export when output_path does not exist by @ignasgr in #3659
Add padding for odd embedding dimensions in tensors (sparse encoders) by @jadermcs in #3623
Fix IndexError in CrossEnco...

@tomaarsen

This patch release introduces compatibility with Transformers v5.2.

Install this version with

# Training + Inference
pip install sentence-transformers[train]==5.2.3

# Inference only, use one of:
pip install sentence-transformers==5.2.3
pip install sentence-transformers[onnx-gpu]==5.2.3
pip install sentence-transformers[onnx]==5.2.3
pip install sentence-transformers[openvino]==5.2.3

Transformers v5.2 Support

Transformers v5.2 has just been released, and it updated its Trainer in such a way that training with Sentence Transformers would start failing on the logging step. Pull request #3664 resolves this issue.

If you're not training with Sentence Transformers, then older versions of Sentence Transformers are also compatible with Transformers v5.2.

All Changes

[compat] Introduce Transformers v5.2 compatibility: trainer _nested_gather moved by @tomaarsen (#3664)

Full Changelog: v5.2.2...v5.2.3

@tomaarsen

This patch release replaces the mandatory requests dependency with an optional httpx dependency.

Install this version with

# Training + Inference
pip install sentence-transformers[train]==5.2.2

# Inference only, use one of:
pip install sentence-transformers==5.2.2
pip install sentence-transformers[onnx-gpu]==5.2.2
pip install sentence-transformers[onnx]==5.2.2
pip install sentence-transformers[openvino]==5.2.2

Transformers v5 Support

Transformers v5.0 and its required huggingface_hub versions have dropped support for requests in favor of httpx. requests was also used in sentence-transformers, but it was not listed explicitly as a dependency. This patch replaces the use of requests with httpx, which is now optional and not automatically imported. This should also save some import time.

Importing Sentence Transformers should no longer crash if requests is not installed.
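
A common way to treat a dependency as optional is to import it only when it is needed and to fall back gracefully when it is missing. The helper below is a generic sketch of that pattern using the standard library; `optional_import` is a hypothetical name and this is not the code used in sentence-transformers.

```python
import importlib
import importlib.util


def optional_import(name: str):
    """Return the named top-level module if it is installed, otherwise None."""
    if importlib.util.find_spec(name) is None:
        return None
    return importlib.import_module(name)


httpx = optional_import("httpx")
if httpx is None:
    print("httpx is not installed; HTTP-based features are unavailable")
else:
    print("httpx is available")
```

Deferring the import this way also keeps the cost of loading the package out of the default import path.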

All Changes

[deps] Replace requests dependency with optional httpx dependency by @tomaarsen (#3618)

Full Changelog: v5.2.1...v5.2.2

@tomaarsen

This patch release adds support for the full Transformers v5 release.

Install this version with

# Training + Inference
pip install sentence-transformers[train]==5.2.1

# Inference only, use one of:
pip install sentence-transformers==5.2.1
pip install sentence-transformers[onnx-gpu]==5.2.1
pip install sentence-transformers[onnx]==5.2.1
pip install sentence-transformers[openvino]==5.2.1

Transformers v5 Support

Sentence Transformers v5.2.0 already introduced support for the Transformers v5.0 release candidates; this release adds support for the full release. The intention is to maintain backward compatibility with v4.x. For now, the library runs CI against both versions, so users can upgrade to the newest Transformers features when ready. Future versions of Sentence Transformers may start requiring Transformers v5.0 or higher.

All Changes

Introduce compatibility with transformers 5.0.0rc01 by @tomaarsen (#3597)
Specify numpy manually in dependencies, as it's directly used/imported by @tomaarsen (#3608)
Expand test suite to full transformers v5 by @tomaarsen (#3615)

Full Changelog: v5.2.0...v5.2.1

@tomaarsen

This minor release introduces multi-processing for CrossEncoder (rerankers), multilingual NanoBEIR evaluators, similarity score outputs in mine_hard_negatives, Transformers v5 support, Python 3.9 deprecations, and more.

Install this version with

# Training + Inference
pip install sentence-transformers[train]==5.2.0

# Inference only, use one of:
pip install sentence-transformers==5.2.0
pip install sentence-transformers[onnx-gpu]==5.2.0
pip install sentence-transformers[onnx]==5.2.0
pip install sentence-transformers[openvino]==5.2.0

CrossEncoder Multi-processing

The CrossEncoder class now supports multiprocessing for faster inference on CPU and multi-GPU setups. This brings CrossEncoder functionality in line with the existing multiprocessing capabilities of SentenceTransformer models, allowing you to use multiple CPU cores or GPUs to speed up both the predict and rank methods when processing large batches of sentence pairs.

The implementation introduces these new methods, mirroring the SentenceTransformer approach:

start_multi_process_pool() - Initialize a pool of worker processes
stop_multi_process_pool() - Clean up the worker pool

Usage is straightforward with the new pool parameter:

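from sentence_transformers.cross_encoder import CrossEncoder

def main():
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2")

    # Example inputs (placeholders; substitute your own data)
    query = "How many people live in Berlin?"
    documents = ["Berlin is well known for its museums.", "Berlin is the capital of Germany."]
    sentence_pairs = [(query, document) for document in documents]

    # Start a pool of workers
    pool = model.start_multi_process_pool()

    # Use the pool for faster inference
    scores = model.predict(sentence_pairs, pool=pool)
    rankings = model.rank(query, documents, pool=pool)

    # Clean up when done
    model.stop_multi_process_pool(pool)

if __name__ == "__main__":
    main()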

Alternatively, pass a list of devices as the device argument, and predict and rank will automatically create a pool behind the scenes:

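from sentence_transformers.cross_encoder import CrossEncoder

def main():
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", device="cpu")

    # Example inputs (placeholders; substitute your own data)
    query = "How many people live in Berlin?"
    documents = ["Berlin is well known for its museums.", "Berlin is the capital of Germany."]
    sentence_pairs = [(query, document) for document in documents]

    # Use 4 processes
    scores = model.predict(sentence_pairs, device=["cpu"] * 4)
    rankings = model.rank(query, documents, device=["cpu"] * 4)

if __name__ == "__main__":
    main()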

This enhancement is particularly beneficial for CPU-based deployments and enables multi-GPU reranking in the mine_hard_negatives function, making hard negative mining faster for large datasets.

Multilingual NanoBEIR Support

The NanoBEIR evaluators now support custom dataset IDs, allowing for evaluation on non-English NanoBEIR collections. All three NanoBEIR evaluators (dense, sparse, and cross-encoder) support this functionality with a simple dataset_id parameter.

For example:

import logging
from pprint import pprint

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import NanoBEIREvaluator

logging.basicConfig(format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO)

# Load a model to evaluate
model = SentenceTransformer("google/embeddinggemma-300m")
# Use a Serbian translation of NanoBEIR
evaluator = NanoBEIREvaluator(
    ["msmarco", "nq"],
    dataset_id="Serbian-AI-Society/NanoBEIR-sr"
)
results = evaluator(model)
print(results[evaluator.primary_metric])
pprint({key: value for key, value in results.items() if "ndcg@10" in key})
"""
{'NanoBEIR_mean_cosine_ndcg@10': 0.44754032737278326,
 'NanoMSMARCO_cosine_ndcg@10': 0.4424192627754922,
 'NanoNQ_cosine_ndcg@10': 0.45266139197007427}
"""

There are already supported translations for French, Arabic, German, Spanish, Italian, Portuguese, Norwegian, Swedish, Serbian, Korean, Japanese, and 22 Bharat languages in the NanoBEIR collection. Contact me (@tomaarsen) if you have found or created another translation and would like to get it added to the collection!

Similarity Scores in Hard Negatives Mining

The mine_hard_negatives function now includes an output_scores parameter that allows you to export similarity scores alongside the mined negatives. When output_scores=False (default), each output_format produces the following columns:

"triplet": (anchor, positive, negative)
"n-tuple": (anchor, positive, negative_1, ..., negative_n)
"labeled-pair": (anchor, passage, label)
"labeled-list": (anchor, [passages], [labels])

And when output_scores=True, the format becomes:

"triplet": (anchor, positive, negative, [scores])
"n-tuple": (anchor, positive, negative_1, ..., negative_n, [scores])
"labeled-pair": (anchor, passage, score)
"labeled-list": (anchor, [passages], [scores])

For context, labels are binary options denoting whether the relevant pair was labeled as a positive or not, whereas scores are similarity scores from the SentenceTransformer or CrossEncoder model.
Additionally:

The deprecated n-tuple-scores format has been replaced with the cleaner output_format="n-tuple" combined with output_scores=True.
Several issues with datasets supporting multiple positives have been resolved.
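
The row shapes listed above can be sketched with a small helper that formats the results for a single anchor. This is an illustration of the shapes only, not how mine_hard_negatives is implemented: `format_rows` is a hypothetical function, and the assumption that a triplet's score list holds the positive score followed by that negative's score is mine.

```python
def format_rows(anchor, positive, negatives, scores, output_format, output_scores=False):
    """Build rows for one anchor; scores[0] belongs to the positive, scores[1:] to the negatives."""
    passages = [positive, *negatives]
    if output_format == "triplet":
        rows = [(anchor, positive, negative) for negative in negatives]
        if output_scores:
            # Assumed layout: [positive score, this negative's score]
            rows = [row + ([scores[0], scores[i + 1]],) for i, row in enumerate(rows)]
        return rows
    if output_format == "n-tuple":
        row = (anchor, positive, *negatives)
        return [row + (list(scores),)] if output_scores else [row]
    labels = [1] + [0] * len(negatives)  # binary labels: 1 for the positive, 0 for negatives
    values = list(scores) if output_scores else labels
    if output_format == "labeled-pair":
        return [(anchor, passage, value) for passage, value in zip(passages, values)]
    if output_format == "labeled-list":
        return [(anchor, passages, values)]
    raise ValueError(f"unknown output_format: {output_format}")


example = ("q", "p", ["n1", "n2"], [0.9, 0.4, 0.3])
print(format_rows(*example, "labeled-list"))
print(format_rows(*example, "labeled-list", output_scores=True))
```

In the labeled formats, enabling scores swaps the binary labels for similarity scores, whereas the triplet and n-tuple formats gain an extra column.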

For example:

from sentence_transformers.util import mine_hard_negatives
from sentence_transformers import SentenceTransformer
from datasets import load_dataset

# Load a Sentence Transformer model
model = SentenceTransformer("sentence-transformers/static-retrieval-mrl-en-v1")

# Load a dataset to mine hard negatives from
dataset = load_dataset("sentence-transformers/natural-questions", split="train").select(range(10000))
print(dataset)
"""
Dataset({
    features: ['query', 'answer'],
    num_rows: 10000
})
"""

# Mine hard negatives into 3 columns: 'query', 'answer', 'scores',
# where 'answer' is a list of passages (the positive and the mined negatives)
# and 'scores' is a list of similarity scores, one per query-passage pair.
dataset = mine_hard_negatives(
    dataset=dataset,
    model=model,
    num_negatives=5,
    sampling_strategy="top",
    relative_margin=0.05,
    batch_size=128,
    use_faiss=True,
    output_format="labeled-list",
    output_scores=True,
)
"""
Negative candidates mined, preparing dataset...
Metric       Positive       Negative     Difference
Count          10,000         49,241
Mean           0.5884         0.3909         0.2033
Median         0.6005         0.3766         0.1837
Std            0.1467         0.1050         0.1337
Min            0.0272         0.1595         0.0088
25%            0.4918         0.3127         0.0903
50%            0.6005         0.3766         0.1837
75%            0.6974         0.4558         0.2924
Max            0.9679         0.8505         0.7281
Skipped 25,451 potential negatives (4.89%) due to the relative_margin of 0.05.
Could not find enough negatives for 148 samples (1.48%). Consider adjusting the range_max and relative_margin parameters if you'd like to find more valid negatives.
"""
print(dataset)
"""
Dataset({
    features: ['query', 'answer', 'scores'],
    num_rows: 9852
})
"""
print(dataset[0])
{
    "query": "when did richmond last play in a preliminary final",
    "answer": [
        "Richmond Football Club Richmond began 2017 with 5 straight wins, a feat it had not achieved since 1995. A series of close losses hampered the Tigers throughout the middle of the season, including a 5-point loss to the Western Bulldogs, 2-point loss to Fremantle, and a 3-point loss to the Giants. Richmond ended the season strongly with convincing victories over Fremantle and St Kilda in the final two rounds, elevating the club to 3rd on the ladder. Richmond's first final of the season against the Cats at the MCG attracted a record qualifying final crowd of 95,028; the Tigers won by 51 points. Having advanced to the first preliminary finals for the first time since 2001, Richmond defeated Greater Western Sydney by 36 points in front of a crowd of 94,258 to progress to the Grand Final against Adelaide, their first Grand Final appearance since 1982. The attendance was 100,021, the largest crowd to a grand final since 1986. The Crows led at quarter time and led by as many as 13, but the Tigers took over the game as it progressed and scored seven straight goals at one point. They eventually would win by 48 points – 16.12 (108) to Adelaide's 8.12 (60) – to end their 37-year flag drought.[22] Dustin Martin also became the first player to win a Premiership medal, the Brownlow Medal and the Norm Smith Medal in the same season, while Damien Hardwick was named AFL Coaches Association Coach of the Year. Richmond's jump from 13th to premiers also marked the biggest jump from one AFL season to the next.",
        "2017 AFL Grand Final The 2017 AFL Grand Final was an Australian rules football game contested between the Adelaide Crows and the Richmond Tigers, held at the Melbourne Cricket Ground on 30 September 2017. It was the 121st annual grand final of the Australian Football League (formerly the Victorian Football League), staged to determine the premiers for the 2017 AFL season.[1]. Richmond defeated Adelaide by 48 points, marking the club's eleventh premiership and first since 1980. Richmond's Dustin Martin won the Norm Smith Medal as the best player on the ground. The match was attended by 100,021 people, the largest crowd since the 1986 Grand Final.",
        "Raid of Richmond The Richmond Campaign was a group of British military actions against the capital of Virginia, Richmond, and the surrounding area, during the American Revolutionary War. Led by American turncoat Benedict Arnold, the Richmond Campaign is considered one of his greatest successes while serving under the British Army, and one of the most notorious actions that Arnold ever performed.",
        "2001 AFL Grand Final The 2001 AFL Grand Final was an Australian rules football game contested between the Essendon Football Club and the Brisbane...
