Skip to content

Fix causal LM reranker scoring when max_length truncates chat-template suffix#3787

Merged
tomaarsen merged 13 commits into
huggingface:mainfrom
hotchpotch:last_token
Jun 16, 2026
Merged

Fix causal LM reranker scoring when max_length truncates chat-template suffix#3787
tomaarsen merged 13 commits into
huggingface:mainfrom
hotchpotch:last_token

Conversation

@hotchpotch

Copy link
Copy Markdown
Contributor

Hello!

Pull Request overview

  • This PR fixes an issue where, for causal LM rerankers, setting max_length can truncate the tail of the chat template, remove tokens required for scoring, and make reranking scores suddenly unreliable.
  • Causal LM rerankers such as Qwen3 Reranker compute scores from the logits at the final input token position, rather than from a CLS token or mean pooling. This makes it important to preserve the assistant prefill / scoring suffix at the end of the chat template.
  • For text-only causal LM inputs, this PR preserves important final tokens when truncating chat-template inputs.
  • It also adds a focused unit test for the text-generation chat-template truncation path.

Details

Causal LM rerankers such as Qwen/Qwen3-Reranker-0.6B do not score query-document pairs like typical sequence-classification rerankers that use a CLS token or mean pooling. Instead, they compute the score from the logits at the final input token position.

For Qwen3 Reranker, the chat template ends with an assistant prefill, and the model compares the next-token logits for "yes" and "no" immediately after that. As a result, the formatted input must still end with the assistant prefill.

However, when max_length is set and the query/document pair is long, apply_chat_template(..., truncation=True, max_length=...) may truncate the tail. The resulting input can still look like a valid truncated prompt, but the assistant/scoring context immediately before the last-token logits is gone.

As a result, even a highly relevant query-document pair can suddenly receive a very low reranking score once the formatted input crosses max_length, making the score difficult to use for ranking. I reproduced this locally with a highly relevant pair of roughly 200 tokens and max_length=128: without truncation, the score was high, but with truncation the assistant prefill was removed and the score dropped sharply.

This PR adds a small fallback for the tokenizer-only chat-template path:

  • only for text-generation / any-to-any
  • only for text-only messages
  • only when truncation and max_length are active
  • first tokenize the full chat template without truncation
  • if the sequence exceeds max_length, keep the head while reserving a fixed number of final tokens from the tail
  • then pad with the regular processor padding path

The default number of reserved final tokens is 16. Qwen3 Reranker currently needs 9 tokens for the assistant prefill, but that value is not obvious from the Sentence Transformers interface, so this PR keeps a small buffer. Users can override it with processing_kwargs["chat_template"]["preserve_final_tokens"].

This PR uses processing_kwargs["chat_template"] because the existing Sentence Transformers implementation already routes kwargs for apply_chat_template through that path. That said, this may not be the best user-facing API for this setting.

I kept the change intentionally narrow. It does not touch the multimodal ProcessorMixin path, and it does not change truncation behavior for encoder or sequence-classification models.

Example

The issue this PR is intended to prevent looks like this:

import torch
from sentence_transformers import CrossEncoder

query = (
    "Find a passage that explains why the Qwen3 Reranker needs the final "
    "assistant prefill before checking the yes/no next-token logits."
)
doc = (
    "Qwen3 Reranker is implemented as a causal language model reranker. "
    "After the query and document are formatted with the chat template, the "
    "model does not use a CLS token or mean pooling for scoring. Instead, it "
    "reads the logits at the final input position and compares the logits for "
    "the yes and no tokens. This means the end of the chat template, including "
    "the assistant prefill, must still be present after truncation. If max_length "
    "cuts off that suffix, the reranking score is computed from the wrong final "
    "position and can become very low even for a clearly relevant query-document "
    "pair. In production systems the max_length may often be 1024 or 2048, but "
    "the same issue appears whenever long query/document inputs cross that limit. "
    "The document is directly relevant because it explains the scoring mechanism, "
    "the role of the assistant prefill, and why tail truncation breaks the final "
    "logit comparison for a causal language model reranker."
)

model_without_max_length = CrossEncoder(
    "Qwen/Qwen3-Reranker-0.6B",
    activation_fn=torch.nn.Identity(),
    model_kwargs={"torch_dtype": torch.float32},
)
score_without_max_length = model_without_max_length.predict([[query, doc]])[0]

model_with_max_length = CrossEncoder(
    "Qwen/Qwen3-Reranker-0.6B",
    max_length=128,
    activation_fn=torch.nn.Identity(),
    model_kwargs={"torch_dtype": torch.float32},
)
score_with_max_length = model_with_max_length.predict([[query, doc]])[0]

# In my local check with this PR:
# - no max_length: raw score = 8.789059, sigmoid(score) = 0.999848
# - max_length=128: raw score = 5.351816, sigmoid(score) = 0.995283
#
# Before this PR, once the formatted input exceeded max_length, the tail of the
# chat template could be removed. For last-token causal LM rerankers, that meant
# the score was computed at the wrong final token. In my earlier reproduction,
# a relevant pair with max_length=128 dropped to raw score = -3.046875
# (sigmoid(score) = 0.045410).
print(score_without_max_length, score_with_max_length)

This example uses max_length=128 to make the issue easy to reproduce. In production, 1024 or 2048 are also commonly used, and the same issue can occur when long query/document inputs cross those limits.

In my local check before this change, the decoded input ended in the middle of the document and the assistant prefill was missing. With this PR, the chat-template suffix is still preserved with max_length=128, and the score no longer suddenly breaks.

Configuration

The default number of final reserved tokens is 16. It can be overridden per call:

score = model.predict(
    [[query, doc]],
    processing_kwargs={
        "chat_template": {"preserve_final_tokens": 9},
    },
)

This uses processing_kwargs["chat_template"] so the setting goes through the same path as kwargs passed to Transformers' apply_chat_template.

It can also be stored in a saved model config. In the current Sentence Transformers save format, this belongs to the Transformer module config (sentence_bert_config.json), not the root config_sentence_transformers.json:

{
  "transformer_task": "text-generation",
  "processing_kwargs": {
    "chat_template": {
      "preserve_final_tokens": 9
    }
  }
}

However, preserve_final_tokens is not actually a standard Transformers apply_chat_template argument; it controls a Sentence Transformers fallback. Because of that, putting it under processing_kwargs["chat_template"] may be confusing. For example, it might be better as a CrossEncoder initialization argument for causal LM reranker truncation/suffix behavior.

Testing

pytest tests/base/modules/test_transformer.py::TestProcessChatMessages -q
ruff check sentence_transformers/base/modules/transformer.py tests/base/modules/test_transformer.py

I also manually checked Qwen3 Reranker with max_length=128. After this change, the formatted input keeps the assistant prefill at the end, and the score no longer drops sharply when the formatted input exceeds max_length.


This PR is a fairly narrow fix for the causal LM reranker + max_length scoring issue.

There may be a better Sentence Transformers abstraction for this, such as detecting last-token scoring models, reading the required suffix token count explicitly from the model/config, exposing this as a CrossEncoder initialization argument, or aligning the behavior with a Transformers-side API. I would be happy to adjust this PR or move toward a different implementation if that fits the project better.

Please let me know what you think, or if you have suggestions for a better direction.

@tomaarsen

Copy link
Copy Markdown
Member

Hello!

The issue that you describe (i.e. truncating cutting off chat template suffixes) was a recurring issue that I tried to tackle in a few ways when introducing the new CrossEncoder implementation. I tried a lot of different solutions, but nothing worked elegantly. I ended up asking vLLM how they tackle it, and they mention that they let truncation happen if it happens. Your proposal here is also roughly one of the ideas I had (detect tail, tokenize tail, lower max-length by tail length, preprocess inputs, add tail tokens), but it was very messy.

I tried a lot of approaches on the tokenizer side as well, e.g. make the tail part of an automatic suffix so the normal self.processor.apply_chat_template works out of the box, but tokenizers didn't support that either. I asked the transformers team as well, and they don't really have an elegant solution for this either, but they don't necessarily need one as the processing/tokenizing is the users' responsibility (unlike in ST).

I think it's important to revisit this, but I'm not sure what the best fix is yet.

  • Tom Aarsen

It only works for text-only inputs I think, but it should work nicely. It mostly affects causal rerankers, embedding models seem barely affected.
@tomaarsen

Copy link
Copy Markdown
Member

Okay, I pushed some follow-up changes in https://github.com/huggingface/sentence-transformers/tree/pr-3787 (4693ae2 in particular), please have a look, I'm curious what you think.

In short, I'm revisiting the approach, but now by separately tokenizing the tail (with caching) and replacing the last non-padding tokens of every batch item with the tail. It seems to work pretty well, although I'm still not the biggest fan.

  • Tom Aarsen

@hotchpotch

Copy link
Copy Markdown
Contributor Author

Hello!

Thank you for taking a look and for pushing the follow-up implementation.

I think your version is much more general than my rather ad-hoc initial approach. In particular, automatically deriving the chat template suffix instead of relying on a fixed number of final tokens should make this work better across other causal LM rerankers and related last-token use cases.

As you mentioned, a really clean solution probably requires support at a lower level, such as in tokenizers / transformers, so that the content can be truncated while preserving the chat template suffix explicitly. But given the current APIs, I think your post-processing approach is a very reasonable way for Sentence Transformers to absorb this issue.

From a user perspective, I think this is quite valuable. Without this kind of handling, reranking scores can silently become much worse once the formatted input crosses max_length, and that failure mode is not obvious. I only noticed it when benchmarking with max_length=2048 the scores looked unexpectedly bad for some longer inputs, and it took some digging to realize that the assistant/scoring suffix had been truncated away.

So even if the solution is not perfectly elegant, I think it addresses a real and hard-to-notice problem for Sentence Transformers users of causal LM rerankers.

Thanks again for the thoughtful explanation and implementation. I really appreciate the care you put into this.

@tomaarsen

Copy link
Copy Markdown
Member

I think we're on the same page here. The truncation is a bigger issue than I was giving it credit for. It's also very possible that my https://huggingface.co/datasets/cross-encoder/ettin-reranker-v1-data data is wrong for the larger inputs (if anything exceeded 8k?) due to this. I'll try and continue working on https://github.com/huggingface/sentence-transformers/tree/pr-3787 to get it shipped in a minor release (I think I can justify that, i.e. not having to wait until a major release).

  • Tom Aarsen

@hotchpotch

Copy link
Copy Markdown
Contributor Author

Thank you, that sounds great.

I agree that this is important enough to be worth handling in a minor release if you think it fits the release policy. The failure mode is quite subtle from the user side, so having Sentence Transformers handle it directly would be very helpful.

Thanks as always for looking into this so carefully and for moving it forward.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@tomaarsen

Copy link
Copy Markdown
Member

If it's okay with you, I'll bring my changes into this PR as it was already built on your proposals.

Pull Request overview

  • Preserve the trailing chat-template suffix (e.g. the assistant prefill) when max_length truncation would otherwise cut it off.
  • Lives in the base Transformer, so it applies across SentenceTransformer, CrossEncoder and SparseEncoder.

Details

Causal LM rerankers like Qwen/Qwen3-Reranker-0.6B don't score a query-document pair from a CLS token or mean pooling; they read the logits at the final token position and compare the yes/no next-token logits. Last-token-pooling embedders are the same, reading the final hidden state. For either to work, the formatted input has to keep the tail of the chat template - the assistant prefill, a trailing EOS, etc.

The problem is that when max_length is set and the pair is long, apply_chat_template(..., truncation=True, max_length=...) truncates from the right and removes that tail. The input still looks like a valid (truncated) prompt, but the final position is now somewhere in the middle of the document, so the score or embedding is read from the wrong place and can collapse - a clearly relevant pair suddenly scores very low. With Qwen3-Reranker-0.6B, max_length=128 and a relevant ~200 token pair, the raw score dropped to around -0.8; with this PR it returns to the +6 to +9 range, in line with the untruncated score.

from sentence_transformers import CrossEncoder
import torch

# query + a relevant long document, formatted length > max_length
model = CrossEncoder("Qwen/Qwen3-Reranker-0.6B", max_length=128, activation_fn=torch.nn.Identity())
model.predict([[query, long_document]])[0]
# before: ~ -0.8   (final position landed mid-document)
# after:  ~ +7     (assistant prefill restored)

@hotchpotch originally tackled this with a head/tail token-reservation approach; I've reworked it to derive the suffix automatically, run on both the tokenizer and ProcessorMixin paths, and handle last-token-pooling embedders, not just causal-LM rerankers. I tried a few ways to handle it during tokenization, but apply_chat_template renders to a flat string that the tokenizer then truncates blind to the template structure, and neither tokenizers nor transformers exposes a way to trim only the content while keeping the suffix. So this runs post-hoc: truncation happens as usual, and afterwards I splice the chat-template suffix back onto any row whose tail was cut. Rows that fit already end with the suffix and are left untouched, as is truncation=False.

The suffix is derived automatically rather than hard-coded per model: I render the conversation's role layout twice with two different fillers and take the longest common token suffix, which is exactly the run of tokens the template appends after the content. It's cached (LRU-bounded), works for both the tokenizer and ProcessorMixin paths, locates each row's real tail from the attention mask (so it's agnostic to the padding side and pad_to_multiple_of), and is applied per row so a batch mixing different layouts is handled correctly. system messages (i.e. the prompt/instruction) are kept fixed during derivation, so an instruction that a template renders in the tail is restored too.

This only applies to text-only content. Image/audio/video placeholder tokens can't be truncated without desyncing them from their pixel/feature values (the model would error), so those rows are left as-is; text-only inputs to a multimodal model still qualify.

I also added restore_suffix in the chat template kwargs, defaulting to False. To keep the raw behaviour for a call (i.e. just standard truncation without restoration), pass processing_kwargs={"chat_template": {"restore_suffix": False}}.

  • Tom Aarsen

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread sentence_transformers/base/modules/transformer.py Outdated
Comment thread tests/base/modules/test_transformer.py Outdated
@hotchpotch

Copy link
Copy Markdown
Contributor Author

Of course, please feel free to bring your changes into this PR. Your implementation is much more robust and general than my initial version, so I’d be very happy to have this PR updated with that approach.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comment thread sentence_transformers/base/modules/transformer.py Outdated
Comment thread sentence_transformers/base/modules/transformer.py Outdated
Comment thread tests/base/modules/test_transformer.py

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

Comment thread sentence_transformers/base/modules/transformer.py Outdated
Comment thread sentence_transformers/base/modules/transformer.py Outdated
Comment thread tests/base/modules/test_transformer.py
Comment thread sentence_transformers/base/modules/transformer.py

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comment thread sentence_transformers/base/modules/transformer.py Outdated
Comment thread sentence_transformers/base/modules/transformer.py
Comment thread tests/base/modules/test_transformer.py

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comment thread sentence_transformers/base/modules/transformer.py
@tomaarsen tomaarsen merged commit a38a6bf into huggingface:main Jun 16, 2026
18 checks passed
@hotchpotch

Copy link
Copy Markdown
Contributor Author

Thank you for merging the PR!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants