Fix causal LM reranker scoring when max_length truncates chat-template suffix by hotchpotch · Pull Request #3787 · huggingface/sentence-transformers

hotchpotch · 2026-05-27T07:51:05Z

Hello!

Pull Request overview

This PR fixes an issue where, for causal LM rerankers, setting max_length can truncate the tail of the chat template, remove tokens required for scoring, and make reranking scores suddenly unreliable.
Causal LM rerankers such as Qwen3 Reranker compute scores from the logits at the final input token position, rather than from a CLS token or mean pooling. This makes it important to preserve the assistant prefill / scoring suffix at the end of the chat template.
For text-only causal LM inputs, this PR preserves important final tokens when truncating chat-template inputs.
It also adds a focused unit test for the text-generation chat-template truncation path.

Details

Causal LM rerankers such as Qwen/Qwen3-Reranker-0.6B do not score query-document pairs like typical sequence-classification rerankers that use a CLS token or mean pooling. Instead, they compute the score from the logits at the final input token position.

For Qwen3 Reranker, the chat template ends with an assistant prefill, and the model compares the next-token logits for "yes" and "no" immediately after that. As a result, the formatted input must still end with the assistant prefill.

However, when max_length is set and the query/document pair is long, apply_chat_template(..., truncation=True, max_length=...) may truncate the tail. The resulting input can still look like a valid truncated prompt, but the assistant/scoring context immediately before the last-token logits is gone.

As a result, even a highly relevant query-document pair can suddenly receive a very low reranking score once the formatted input crosses max_length, making the score difficult to use for ranking. I reproduced this locally with a highly relevant pair of roughly 200 tokens and max_length=128: without truncation, the score was high, but with truncation the assistant prefill was removed and the score dropped sharply.

This PR adds a small fallback for the tokenizer-only chat-template path:

only for text-generation / any-to-any
only for text-only messages
only when truncation and max_length are active
first tokenize the full chat template without truncation
if the sequence exceeds max_length, keep the head while reserving a fixed number of final tokens from the tail
then pad with the regular processor padding path

The default number of reserved final tokens is 16. Qwen3 Reranker currently needs 9 tokens for the assistant prefill, but that value is not obvious from the Sentence Transformers interface, so this PR keeps a small buffer. Users can override it with processing_kwargs["chat_template"]["preserve_final_tokens"].

This PR uses processing_kwargs["chat_template"] because the existing Sentence Transformers implementation already routes kwargs for apply_chat_template through that path. That said, this may not be the best user-facing API for this setting.

I kept the change intentionally narrow. It does not touch the multimodal ProcessorMixin path, and it does not change truncation behavior for encoder or sequence-classification models.

Example

The issue this PR is intended to prevent looks like this:

import torch
from sentence_transformers import CrossEncoder

query = (
    "Find a passage that explains why the Qwen3 Reranker needs the final "
    "assistant prefill before checking the yes/no next-token logits."
)
doc = (
    "Qwen3 Reranker is implemented as a causal language model reranker. "
    "After the query and document are formatted with the chat template, the "
    "model does not use a CLS token or mean pooling for scoring. Instead, it "
    "reads the logits at the final input position and compares the logits for "
    "the yes and no tokens. This means the end of the chat template, including "
    "the assistant prefill, must still be present after truncation. If max_length "
    "cuts off that suffix, the reranking score is computed from the wrong final "
    "position and can become very low even for a clearly relevant query-document "
    "pair. In production systems the max_length may often be 1024 or 2048, but "
    "the same issue appears whenever long query/document inputs cross that limit. "
    "The document is directly relevant because it explains the scoring mechanism, "
    "the role of the assistant prefill, and why tail truncation breaks the final "
    "logit comparison for a causal language model reranker."
)

model_without_max_length = CrossEncoder(
    "Qwen/Qwen3-Reranker-0.6B",
    activation_fn=torch.nn.Identity(),
    model_kwargs={"torch_dtype": torch.float32},
)
score_without_max_length = model_without_max_length.predict([[query, doc]])[0]

model_with_max_length = CrossEncoder(
    "Qwen/Qwen3-Reranker-0.6B",
    max_length=128,
    activation_fn=torch.nn.Identity(),
    model_kwargs={"torch_dtype": torch.float32},
)
score_with_max_length = model_with_max_length.predict([[query, doc]])[0]

# In my local check with this PR:
# - no max_length: raw score = 8.789059, sigmoid(score) = 0.999848
# - max_length=128: raw score = 5.351816, sigmoid(score) = 0.995283
#
# Before this PR, once the formatted input exceeded max_length, the tail of the
# chat template could be removed. For last-token causal LM rerankers, that meant
# the score was computed at the wrong final token. In my earlier reproduction,
# a relevant pair with max_length=128 dropped to raw score = -3.046875
# (sigmoid(score) = 0.045410).
print(score_without_max_length, score_with_max_length)

This example uses max_length=128 to make the issue easy to reproduce. In production, 1024 or 2048 are also commonly used, and the same issue can occur when long query/document inputs cross those limits.

In my local check before this change, the decoded input ended in the middle of the document and the assistant prefill was missing. With this PR, the chat-template suffix is still preserved with max_length=128, and the score no longer suddenly breaks.

Configuration

The default number of final reserved tokens is 16. It can be overridden per call:

score = model.predict(
    [[query, doc]],
    processing_kwargs={
        "chat_template": {"preserve_final_tokens": 9},
    },
)

This uses processing_kwargs["chat_template"] so the setting goes through the same path as kwargs passed to Transformers' apply_chat_template.

It can also be stored in a saved model config. In the current Sentence Transformers save format, this belongs to the Transformer module config (sentence_bert_config.json), not the root config_sentence_transformers.json:

{
  "transformer_task": "text-generation",
  "processing_kwargs": {
    "chat_template": {
      "preserve_final_tokens": 9
    }
  }
}

However, preserve_final_tokens is not actually a standard Transformers apply_chat_template argument; it controls a Sentence Transformers fallback. Because of that, putting it under processing_kwargs["chat_template"] may be confusing. For example, it might be better as a CrossEncoder initialization argument for causal LM reranker truncation/suffix behavior.

Testing

pytest tests/base/modules/test_transformer.py::TestProcessChatMessages -q
ruff check sentence_transformers/base/modules/transformer.py tests/base/modules/test_transformer.py

I also manually checked Qwen3 Reranker with max_length=128. After this change, the formatted input keeps the assistant prefill at the end, and the score no longer drops sharply when the formatted input exceeds max_length.

This PR is a fairly narrow fix for the causal LM reranker + max_length scoring issue.

There may be a better Sentence Transformers abstraction for this, such as detecting last-token scoring models, reading the required suffix token count explicitly from the model/config, exposing this as a CrossEncoder initialization argument, or aligning the behavior with a Transformers-side API. I would be happy to adjust this PR or move toward a different implementation if that fits the project better.

Please let me know what you think, or if you have suggestions for a better direction.

tomaarsen · 2026-06-10T07:33:04Z

Hello!

The issue that you describe (i.e. truncating cutting off chat template suffixes) was a recurring issue that I tried to tackle in a few ways when introducing the new CrossEncoder implementation. I tried a lot of different solutions, but nothing worked elegantly. I ended up asking vLLM how they tackle it, and they mention that they let truncation happen if it happens. Your proposal here is also roughly one of the ideas I had (detect tail, tokenize tail, lower max-length by tail length, preprocess inputs, add tail tokens), but it was very messy.

I tried a lot of approaches on the tokenizer side as well, e.g. make the tail part of an automatic suffix so the normal self.processor.apply_chat_template works out of the box, but tokenizers didn't support that either. I asked the transformers team as well, and they don't really have an elegant solution for this either, but they don't necessarily need one as the processing/tokenizing is the users' responsibility (unlike in ST).

I think it's important to revisit this, but I'm not sure what the best fix is yet.

Tom Aarsen

It only works for text-only inputs I think, but it should work nicely. It mostly affects causal rerankers, embedding models seem barely affected.

tomaarsen · 2026-06-10T11:35:15Z

Okay, I pushed some follow-up changes in https://github.com/huggingface/sentence-transformers/tree/pr-3787 (4693ae2 in particular), please have a look, I'm curious what you think.

In short, I'm revisiting the approach, but now by separately tokenizing the tail (with caching) and replacing the last non-padding tokens of every batch item with the tail. It seems to work pretty well, although I'm still not the biggest fan.

Tom Aarsen

hotchpotch · 2026-06-10T22:51:13Z

Hello!

Thank you for taking a look and for pushing the follow-up implementation.

I think your version is much more general than my rather ad-hoc initial approach. In particular, automatically deriving the chat template suffix instead of relying on a fixed number of final tokens should make this work better across other causal LM rerankers and related last-token use cases.

As you mentioned, a really clean solution probably requires support at a lower level, such as in tokenizers / transformers, so that the content can be truncated while preserving the chat template suffix explicitly. But given the current APIs, I think your post-processing approach is a very reasonable way for Sentence Transformers to absorb this issue.

From a user perspective, I think this is quite valuable. Without this kind of handling, reranking scores can silently become much worse once the formatted input crosses max_length, and that failure mode is not obvious. I only noticed it when benchmarking with max_length=2048 the scores looked unexpectedly bad for some longer inputs, and it took some digging to realize that the assistant/scoring suffix had been truncated away.

So even if the solution is not perfectly elegant, I think it addresses a real and hard-to-notice problem for Sentence Transformers users of causal LM rerankers.

Thanks again for the thoughtful explanation and implementation. I really appreciate the care you put into this.

tomaarsen · 2026-06-11T06:55:01Z

I think we're on the same page here. The truncation is a bigger issue than I was giving it credit for. It's also very possible that my https://huggingface.co/datasets/cross-encoder/ettin-reranker-v1-data data is wrong for the larger inputs (if anything exceeded 8k?) due to this. I'll try and continue working on https://github.com/huggingface/sentence-transformers/tree/pr-3787 to get it shipped in a minor release (I think I can justify that, i.e. not having to wait until a major release).

Tom Aarsen

hotchpotch · 2026-06-11T07:14:48Z

Thank you, that sounds great.

I agree that this is important enough to be worth handling in a minor release if you think it fits the release policy. The failure mode is quite subtle from the user side, so having Sentence Transformers handle it directly would be very helpful.

Thanks as always for looking into this so carefully and for moving it forward.

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

tomaarsen · 2026-06-11T10:53:39Z

If it's okay with you, I'll bring my changes into this PR as it was already built on your proposals.

Pull Request overview

Preserve the trailing chat-template suffix (e.g. the assistant prefill) when max_length truncation would otherwise cut it off.
Lives in the base Transformer, so it applies across SentenceTransformer, CrossEncoder and SparseEncoder.

Details

Causal LM rerankers like Qwen/Qwen3-Reranker-0.6B don't score a query-document pair from a CLS token or mean pooling; they read the logits at the final token position and compare the yes/no next-token logits. Last-token-pooling embedders are the same, reading the final hidden state. For either to work, the formatted input has to keep the tail of the chat template - the assistant prefill, a trailing EOS, etc.

The problem is that when max_length is set and the pair is long, apply_chat_template(..., truncation=True, max_length=...) truncates from the right and removes that tail. The input still looks like a valid (truncated) prompt, but the final position is now somewhere in the middle of the document, so the score or embedding is read from the wrong place and can collapse - a clearly relevant pair suddenly scores very low. With Qwen3-Reranker-0.6B, max_length=128 and a relevant ~200 token pair, the raw score dropped to around -0.8; with this PR it returns to the +6 to +9 range, in line with the untruncated score.

from sentence_transformers import CrossEncoder
import torch

# query + a relevant long document, formatted length > max_length
model = CrossEncoder("Qwen/Qwen3-Reranker-0.6B", max_length=128, activation_fn=torch.nn.Identity())
model.predict([[query, long_document]])[0]
# before: ~ -0.8   (final position landed mid-document)
# after:  ~ +7     (assistant prefill restored)

@hotchpotch originally tackled this with a head/tail token-reservation approach; I've reworked it to derive the suffix automatically, run on both the tokenizer and ProcessorMixin paths, and handle last-token-pooling embedders, not just causal-LM rerankers. I tried a few ways to handle it during tokenization, but apply_chat_template renders to a flat string that the tokenizer then truncates blind to the template structure, and neither tokenizers nor transformers exposes a way to trim only the content while keeping the suffix. So this runs post-hoc: truncation happens as usual, and afterwards I splice the chat-template suffix back onto any row whose tail was cut. Rows that fit already end with the suffix and are left untouched, as is truncation=False.

The suffix is derived automatically rather than hard-coded per model: I render the conversation's role layout twice with two different fillers and take the longest common token suffix, which is exactly the run of tokens the template appends after the content. It's cached (LRU-bounded), works for both the tokenizer and ProcessorMixin paths, locates each row's real tail from the attention mask (so it's agnostic to the padding side and pad_to_multiple_of), and is applied per row so a batch mixing different layouts is handled correctly. system messages (i.e. the prompt/instruction) are kept fixed during derivation, so an instruction that a template renders in the tail is restored too.

This only applies to text-only content. Image/audio/video placeholder tokens can't be truncated without desyncing them from their pixel/feature values (the model would error), so those rows are left as-is; text-only inputs to a multimodal model still qualify.

I also added restore_suffix in the chat template kwargs, defaulting to False. To keep the raw behaviour for a call (i.e. just standard truncation without restoration), pass processing_kwargs={"chat_template": {"restore_suffix": False}}.

Tom Aarsen

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

hotchpotch · 2026-06-11T11:02:43Z

Of course, please feel free to bring your changes into this PR. Your implementation is much more robust and general than my initial version, so I’d be very happy to have this PR updated with that approach.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

hotchpotch · 2026-06-17T00:39:46Z

Thank you for merging the PR!

hotchpotch added 2 commits May 27, 2026 16:27

Preserve chat template tail when truncating

5aeda31

Define default chat template tail reserve

954e1d8

Auto-compute chat template tail length

4693ae2

It only works for text-only inputs I think, but it should work nicely. It mostly affects causal rerankers, embedding models seem barely affected.

Simplify _restore_chat_template_suffix a bit, more tests

c5f144c

Restore prompts rendered in the chat template tail

9ca633e

tomaarsen requested a review from Copilot June 11, 2026 10:31

Copilot started reviewing on behalf of tomaarsen June 11, 2026 10:31 View session

Copilot AI reviewed Jun 11, 2026

Allow disabling the chat template suffix restore

9a41bdf

tomaarsen requested a review from Copilot June 11, 2026 10:53

Copilot started reviewing on behalf of tomaarsen June 11, 2026 10:54 View session

Copilot AI reviewed Jun 11, 2026

View reviewed changes

Comment thread sentence_transformers/base/modules/transformer.py Outdated

Comment thread tests/base/modules/test_transformer.py Outdated

tomaarsen added 2 commits June 11, 2026 13:05

Fix test comment

0b02454

Simplify cache limit logic

1d81ecb

tomaarsen requested a review from Copilot June 11, 2026 11:38

Copilot started reviewing on behalf of tomaarsen June 11, 2026 11:38 View session

Copilot AI reviewed Jun 11, 2026

View reviewed changes

Comment thread sentence_transformers/base/modules/transformer.py Outdated

Comment thread sentence_transformers/base/modules/transformer.py Outdated

Comment thread tests/base/modules/test_transformer.py

tomaarsen added 3 commits June 12, 2026 09:14

Skip the chat template suffix restore when nothing was truncated

6cf7df8

Shrink comments

ad31bb0

Harden and simplify the chat template suffix restore

dfaf35d

tomaarsen requested a review from Copilot June 12, 2026 10:57

Copilot started reviewing on behalf of tomaarsen June 12, 2026 10:57 View session

Copilot AI reviewed Jun 12, 2026

View reviewed changes

Comment thread sentence_transformers/base/modules/transformer.py Outdated

Comment thread sentence_transformers/base/modules/transformer.py Outdated

Comment thread tests/base/modules/test_transformer.py

Comment thread sentence_transformers/base/modules/transformer.py

Fix truncation/max_length passed via the chat_template kwargs

a5785f9

tomaarsen requested a review from Copilot June 12, 2026 11:52

Copilot started reviewing on behalf of tomaarsen June 12, 2026 11:52 View session

Copilot AI reviewed Jun 12, 2026

View reviewed changes

Comment thread sentence_transformers/base/modules/transformer.py Outdated

Comment thread sentence_transformers/base/modules/transformer.py

Comment thread tests/base/modules/test_transformer.py

Apply Copilot comments

87e1e11

tomaarsen requested a review from Copilot June 12, 2026 12:04

Copilot started reviewing on behalf of tomaarsen June 12, 2026 12:04 View session

Copilot AI reviewed Jun 12, 2026

View reviewed changes

Comment thread sentence_transformers/base/modules/transformer.py

tomaarsen merged commit a38a6bf into huggingface:main Jun 16, 2026
18 checks passed

Uh oh!

Conversation

hotchpotch commented May 27, 2026

Pull Request overview

Details

Example

Configuration

Testing

Uh oh!

tomaarsen commented Jun 10, 2026

Uh oh!

tomaarsen commented Jun 10, 2026

Uh oh!

hotchpotch commented Jun 10, 2026

Uh oh!

tomaarsen commented Jun 11, 2026

Uh oh!

hotchpotch commented Jun 11, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

tomaarsen commented Jun 11, 2026

Pull Request overview

Details

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

hotchpotch commented Jun 11, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

hotchpotch commented Jun 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants