Fix causal LM reranker scoring when max_length truncates chat-template suffix#3787
Conversation
|
Hello! The issue that you describe (i.e. truncating cutting off chat template suffixes) was a recurring issue that I tried to tackle in a few ways when introducing the new CrossEncoder implementation. I tried a lot of different solutions, but nothing worked elegantly. I ended up asking vLLM how they tackle it, and they mention that they let truncation happen if it happens. Your proposal here is also roughly one of the ideas I had (detect tail, tokenize tail, lower max-length by tail length, preprocess inputs, add tail tokens), but it was very messy. I tried a lot of approaches on the tokenizer side as well, e.g. make the tail part of an automatic suffix so the normal I think it's important to revisit this, but I'm not sure what the best fix is yet.
|
It only works for text-only inputs I think, but it should work nicely. It mostly affects causal rerankers, embedding models seem barely affected.
|
Okay, I pushed some follow-up changes in https://github.com/huggingface/sentence-transformers/tree/pr-3787 (4693ae2 in particular), please have a look, I'm curious what you think. In short, I'm revisiting the approach, but now by separately tokenizing the tail (with caching) and replacing the last non-padding tokens of every batch item with the tail. It seems to work pretty well, although I'm still not the biggest fan.
|
|
Hello! Thank you for taking a look and for pushing the follow-up implementation. I think your version is much more general than my rather ad-hoc initial approach. In particular, automatically deriving the chat template suffix instead of relying on a fixed number of final tokens should make this work better across other causal LM rerankers and related last-token use cases. As you mentioned, a really clean solution probably requires support at a lower level, such as in From a user perspective, I think this is quite valuable. Without this kind of handling, reranking scores can silently become much worse once the formatted input crosses So even if the solution is not perfectly elegant, I think it addresses a real and hard-to-notice problem for Sentence Transformers users of causal LM rerankers. Thanks again for the thoughtful explanation and implementation. I really appreciate the care you put into this. |
|
I think we're on the same page here. The truncation is a bigger issue than I was giving it credit for. It's also very possible that my https://huggingface.co/datasets/cross-encoder/ettin-reranker-v1-data data is wrong for the larger inputs (if anything exceeded 8k?) due to this. I'll try and continue working on https://github.com/huggingface/sentence-transformers/tree/pr-3787 to get it shipped in a minor release (I think I can justify that, i.e. not having to wait until a major release).
|
|
Thank you, that sounds great. I agree that this is important enough to be worth handling in a minor release if you think it fits the release policy. The failure mode is quite subtle from the user side, so having Sentence Transformers handle it directly would be very helpful. Thanks as always for looking into this so carefully and for moving it forward. |
|
If it's okay with you, I'll bring my changes into this PR as it was already built on your proposals. Pull Request overview
DetailsCausal LM rerankers like The problem is that when from sentence_transformers import CrossEncoder
import torch
# query + a relevant long document, formatted length > max_length
model = CrossEncoder("Qwen/Qwen3-Reranker-0.6B", max_length=128, activation_fn=torch.nn.Identity())
model.predict([[query, long_document]])[0]
# before: ~ -0.8 (final position landed mid-document)
# after: ~ +7 (assistant prefill restored)@hotchpotch originally tackled this with a head/tail token-reservation approach; I've reworked it to derive the suffix automatically, run on both the tokenizer and ProcessorMixin paths, and handle last-token-pooling embedders, not just causal-LM rerankers. I tried a few ways to handle it during tokenization, but The suffix is derived automatically rather than hard-coded per model: I render the conversation's role layout twice with two different fillers and take the longest common token suffix, which is exactly the run of tokens the template appends after the content. It's cached (LRU-bounded), works for both the tokenizer and This only applies to text-only content. Image/audio/video placeholder tokens can't be truncated without desyncing them from their pixel/feature values (the model would error), so those rows are left as-is; text-only inputs to a multimodal model still qualify. I also added
|
|
Of course, please feel free to bring your changes into this PR. Your implementation is much more robust and general than my initial version, so I’d be very happy to have this PR updated with that approach. |
|
Thank you for merging the PR! |
Hello!
Pull Request overview
max_lengthcan truncate the tail of the chat template, remove tokens required for scoring, and make reranking scores suddenly unreliable.Details
Causal LM rerankers such as
Qwen/Qwen3-Reranker-0.6Bdo not score query-document pairs like typical sequence-classification rerankers that use a CLS token or mean pooling. Instead, they compute the score from the logits at the final input token position.For Qwen3 Reranker, the chat template ends with an assistant prefill, and the model compares the next-token logits for
"yes"and"no"immediately after that. As a result, the formatted input must still end with the assistant prefill.However, when
max_lengthis set and the query/document pair is long,apply_chat_template(..., truncation=True, max_length=...)may truncate the tail. The resulting input can still look like a valid truncated prompt, but the assistant/scoring context immediately before the last-token logits is gone.As a result, even a highly relevant query-document pair can suddenly receive a very low reranking score once the formatted input crosses
max_length, making the score difficult to use for ranking. I reproduced this locally with a highly relevant pair of roughly 200 tokens andmax_length=128: without truncation, the score was high, but with truncation the assistant prefill was removed and the score dropped sharply.This PR adds a small fallback for the tokenizer-only chat-template path:
text-generation/any-to-anymax_lengthare activemax_length, keep the head while reserving a fixed number of final tokens from the tailThe default number of reserved final tokens is
16. Qwen3 Reranker currently needs 9 tokens for the assistant prefill, but that value is not obvious from the Sentence Transformers interface, so this PR keeps a small buffer. Users can override it withprocessing_kwargs["chat_template"]["preserve_final_tokens"].This PR uses
processing_kwargs["chat_template"]because the existing Sentence Transformers implementation already routes kwargs forapply_chat_templatethrough that path. That said, this may not be the best user-facing API for this setting.I kept the change intentionally narrow. It does not touch the multimodal
ProcessorMixinpath, and it does not change truncation behavior for encoder or sequence-classification models.Example
The issue this PR is intended to prevent looks like this:
This example uses
max_length=128to make the issue easy to reproduce. In production,1024or2048are also commonly used, and the same issue can occur when long query/document inputs cross those limits.In my local check before this change, the decoded input ended in the middle of the document and the assistant prefill was missing. With this PR, the chat-template suffix is still preserved with
max_length=128, and the score no longer suddenly breaks.Configuration
The default number of final reserved tokens is 16. It can be overridden per call:
This uses
processing_kwargs["chat_template"]so the setting goes through the same path as kwargs passed to Transformers'apply_chat_template.It can also be stored in a saved model config. In the current Sentence Transformers save format, this belongs to the Transformer module config (
sentence_bert_config.json), not the rootconfig_sentence_transformers.json:{ "transformer_task": "text-generation", "processing_kwargs": { "chat_template": { "preserve_final_tokens": 9 } } }However,
preserve_final_tokensis not actually a standard Transformersapply_chat_templateargument; it controls a Sentence Transformers fallback. Because of that, putting it underprocessing_kwargs["chat_template"]may be confusing. For example, it might be better as aCrossEncoderinitialization argument for causal LM reranker truncation/suffix behavior.Testing
I also manually checked Qwen3 Reranker with
max_length=128. After this change, the formatted input keeps the assistant prefill at the end, and the score no longer drops sharply when the formatted input exceedsmax_length.This PR is a fairly narrow fix for the causal LM reranker +
max_lengthscoring issue.There may be a better Sentence Transformers abstraction for this, such as detecting last-token scoring models, reading the required suffix token count explicitly from the model/config, exposing this as a
CrossEncoderinitialization argument, or aligning the behavior with a Transformers-side API. I would be happy to adjust this PR or move toward a different implementation if that fits the project better.Please let me know what you think, or if you have suggestions for a better direction.