[Frontend] Support using chat template as custom score template for reranking models#30550
Conversation
|
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger a full CI run by default. Instead, only a small and essential subset of CI tests runs to quickly catch errors. You can ask your reviewers to trigger select CI tests on top of that. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀 |
|
Documentation preview: https://vllm--30550.org.readthedocs.build/en/30550/ |
Code Review
This pull request introduces a --score-template CLI argument, allowing users to provide a custom Jinja2 template for score/rerank models. This is a valuable feature for decoupling prompt formatting from model-specific code. The implementation is mostly solid, with new CLI arguments, documentation, and tests. However, I've identified a high-severity issue related to code reuse that impacts maintainability and user experience. Specifically, chat-template-specific utilities are being reused for score templates, which can lead to confusing error messages. I've suggested a refactoring to create more generic template-handling functions.
|
Hi @jzakrzew, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch. For future commits, the installed hooks will run automatically.
|
|
Force-pushed from 1f64fa9 to 9258e17
|
Hello @Samoed, how does MTEB handle score templates? We are looking to align our implementation with MTEB. More links related to score templates:
Let's find a way to resolve it once and for all. |
|
Hi! We have a separate class for handling instruction-based models that process instructions, with an example for Qwen3. However, this approach is a bit naive, since there's no standard way of doing this yet. Maybe @tomaarsen has some thoughts on standardizing prompt templates for cross-encoders. For me, it has always been unclear why there are no models that define their prompts in Jinja templates that could be used more automatically. |
|
Hello @tomaarsen, please take a look at this thread. |
|
Thanks for pinging me @noooop & @Samoed.
This modern format is becoming a lot more prevalent. For my codebase, there were always two main concerns:
I think working on this support is so important that I'm working on a major refactor of the Sentence Transformers codebase, notably around the CrossEncoder, to help modularize it. This allows me to very easily support models that don't rely on

For concern 2, a very simple solution is to rely on the chat template:

```python
messages = [
    {
        "role": "query",
        "content": "What is the capital of France?",
    },
    {
        "role": "document",
        "content": "Paris is the capital of France.",
    },
]
tokenized = tokenizer.apply_chat_template(messages, ...)
```

(This one matches the format required for https://huggingface.co/Qwen/Qwen3-Reranker-0.6B, I believe.)

An additional benefit here is that we can take advantage of a "system prompt" of sorts as an instruction/prompt for the reranker. In the above chat template, I hardcoded

Some of the obvious advantages are that the

But my primary hesitation at the current stage is that

This tempts me to write a more "manual" templating implementation, where I can apply the truncation on the second input (often the 'document' in a query-document setting).

Recurring issues that I've found with my initial attempts are that you can't fully separately tokenize the template from the actual texts, as many template tokens will want to "merge" with actual text tokens (e.g. having

Those are my thoughts for now. @noooop, where do you stand regarding:
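As a concrete illustration of the custom-role idea, here is a minimal sketch that renders such messages with a hand-written Jinja template. The `<Query>`/`<Document>` markers and the template itself are illustrative assumptions, not any model's actual chat template:

```python
from jinja2 import Template

# Hand-written score template over custom "query"/"document" roles.
# Markers and layout are illustrative, not a real model's template.
SCORE_TEMPLATE = (
    "{% for m in messages %}"
    "{% if m.role == 'query' %}<Query>: {{ m.content }}\n"
    "{% elif m.role == 'document' %}<Document>: {{ m.content }}\n"
    "{% endif %}"
    "{% endfor %}"
)

messages = [
    {"role": "query", "content": "What is the capital of France?"},
    {"role": "document", "content": "Paris is the capital of France."},
]

prompt = Template(SCORE_TEMPLATE).render(messages=messages)
print(prompt)
```

In a real setup, this template string would live in `tokenizer_config.json` (or be passed to the server) and be rendered by the tokenizer's chat-template machinery rather than by jinja2 directly.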
|
Generally I think it should be like this, but for now there are no such models. Even Qwen just inherits its template from the original LLM.
Generally I think it's a good approach, but I'm afraid some libraries won't allow custom role names. Probably you can use |
|
What are your thoughts?
This is also my concern, which is why I'd like to seek your advice. After all, Sentence Transformers and MTEB are upstream of vLLM, and vLLM only supports a very limited number of CrossEncoder models. Just to mention:
Currently, vLLM does not perform truncation by default, following the OpenAI API behavior for /v1/embeddings. |
This makes sense. If the HF Hub repo has an incorrect chat template, you can override it in vLLM by passing
As @noooop said, since we don't allow truncation by default, it should not be a problem. |
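The "truncate only the document" idea discussed above can be sketched with a toy whitespace tokenizer. All names and the overhead constant are illustrative; a real implementation would use the model's tokenizer and an HF-style `only_second` truncation strategy:

```python
# Toy sketch: keep the full query, trim the document to the remaining
# token budget. A whitespace split stands in for a real tokenizer, and
# template_overhead is a made-up count of template-only tokens.
def encode(text):
    return text.split()

def build_pair(query, document, max_len, template_overhead=4):
    q = encode(query)
    d = encode(document)
    budget = max_len - len(q) - template_overhead
    if budget < 0:
        raise ValueError("query alone exceeds max_len")
    return q + d[:budget]

tokens = build_pair(
    "capital of France ?",
    "Paris is the capital city of France .",
    max_len=10,
)
print(len(tokens))  # 4 query tokens + 2 document tokens -> 6
```

The hard part tomaarsen mentions remains: with a real tokenizer, template tokens can merge with text tokens at the boundaries, so the budget cannot be computed exactly by tokenizing the pieces separately.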
|
OK, I'll modify the PR so that it uses |
|
I think that's the right move. I'll also move to
Does vLLM support prompts/instructions? Edit: As mentioned by @Samoed, the above approaches are not very robust for listwise rerankers, which have multiple documents.
|
|
Also, support based on chunk content can be added, like:

```python
{
    "role": "user",
    "content": [
        {
            "type": "query",
            "text": {  # query/text
                "value": "How does AI work? Explain it in simple terms.",
                "annotations": []
            }
        },
        {
            "type": "document",
            "text": {  # document/text
                "value": "AI works like ..."
            }
        }
    ],
}
```

But I'm not sure whether it's possible to handle this from Jinja, or whether it would work with other libraries.
By the way, this won't work for |
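For what it's worth, a Jinja loop can walk such typed content parts. A minimal sketch, assuming the illustrative field names from the snippet above (`type`, `text`, `value`):

```python
from jinja2 import Template

# Render typed content parts into a flat scoring prompt.
# Field names follow the illustrative structure above, not a standard.
tmpl = Template(
    "{% for part in message.content %}"
    "{{ part.type | capitalize }}: {{ part.text.value }}\n"
    "{% endfor %}"
)

message = {
    "role": "user",
    "content": [
        {"type": "query", "text": {"value": "How does AI work?"}},
        {"type": "document", "text": {"value": "AI works like ..."}},
    ],
}

out = tmpl.render(message=message)
print(out)
```

So plain Jinja can handle the nesting; the open question is whether other libraries' chat-template plumbing would accept non-standard part types at all.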
|
I think we need to ask someone from the tokenizers/chat-template maintainers for a better way to handle this. |
|
cc @hmellor |
Force-pushed from 9258e17 to 40808e9
|
Hi @jzakrzew, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch. For future commits, the installed hooks will run automatically.
|
|
DarkLight1337 left a comment
LGTM now, thanks for the detailed discussion!
Head branch was pushed to by a user without write access
|
@noooop Sorry, just wanted to clarify one comment; I did not notice you had enabled auto-merge. |
…eranking models (vllm-project#30550) Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com> Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: wang.yuqi <noooop@126.com> Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: Ubuntu <mjtaheri68@gmail.com>
…eranking models (vllm-project#30550) Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com> Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: wang.yuqi <noooop@126.com> Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io> (cherry picked from commit 23daef5)
## Summary

Cherry-pick upstream bug fixes for RHAIIS 3.3.1 onto `rhai/0.13.0`. All fixes are from upstream vLLM `main` and address critical bugs affecting RHAIIS 3.3.0. Other releases (3.2.2, EAx) will be done separately.

**Jira Epic:** [INFERENG-4743](https://issues.redhat.com/browse/INFERENG-4743)

## Cherry-picked commits (chronological order)

| # | Upstream PR | Jira | Summary |
|---|-------------|------|---------|
| 1 | [vllm-project#30550](vllm-project#30550) | [INFERENG-5106](https://issues.redhat.com/browse/INFERENG-5106) | Support using chat template as custom score template for reranking models |
| 2 | [vllm-project#31406](vllm-project#31406) | [INFERENG-4800](https://issues.redhat.com/browse/INFERENG-4800) | Add encoder-only/cross attention support to Triton Attention backend |
| 3 | [vllm-project#34243](vllm-project#34243) | [INFERENG-4746](https://issues.redhat.com/browse/INFERENG-4746) | Fix Llama-4 attn quantization by correctly permuting scales for rope (int8, fp8) |
| 4 | [vllm-project#34454](vllm-project#34454) | [INFERENG-5032](https://issues.redhat.com/browse/INFERENG-5032) | Fix structured output in multi-turn GPT-OSS (content:null with json_object) |
| 5 | [vllm-project#34507](vllm-project#34507) | [INFERENG-5038](https://issues.redhat.com/browse/INFERENG-5038) | Fix fused MoE int32 overflow in stride*offset for large models |
| 6 | [vllm-project#35085](vllm-project#35085) | [INFERENG-5028](https://issues.redhat.com/browse/INFERENG-5028) | Gracefully disable AllReduceFusionPass on GPUs without multicast support |
| 7 | [vllm-project#35456](vllm-project#35456) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Replace assert with ValueError for response_format validation (completions) |
| 8 | [vllm-project#35510](vllm-project#35510) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Add response_format validation to chat completions endpoint |

## Conflict resolutions

<details>
<summary><b>#1 — llama-nemotron-embed / score-template support (vllm-project#30550)</b>: Clean cherry-pick, no conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.
</details>

<details>
<summary><b>#2 — Triton Attention (vllm-project#31406)</b>: Clean cherry-pick, no conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.
</details>

<details>
<summary><b>#3 — Llama-4 attn quant (vllm-project#34243)</b>: Clean cherry-pick, no conflicts</summary>

Applied cleanly. 4 intermediate upstream commits touch `llama4.py`, but the fix targets a self-contained block.
</details>

<details>
<summary><b>#4 — GPT-OSS multi-turn (vllm-project#34454)</b>: Clean cherry-pick, no conflicts</summary>

Applied cleanly despite 3 intermediate upstream commits that refactored imports in `gptoss_reasoning_parser.py`. The fix logic (adding an `eom_token_id` early-exit check in `is_reasoning_end`) was independent of the import changes.
</details>

<details>
<summary><b>#5 — Fused MoE int32 overflow (vllm-project#34507)</b>: Conflicts in 2 files</summary>

**`vllm/model_executor/layers/fused_moe/fused_moe.py`**: ~30 intermediate upstream commits refactored `fused_moe_kernel` with conditional `naive_block_assignment` logic that doesn't exist in `rhai/0.13.0`. Resolved by keeping our simpler code and applying only the int64 cast fix:

- `fused_moe_kernel_gptq_awq`: added `.to(tl.int64)` to the `tl.load()` result
- `fused_moe_kernel`: added `offs_token = offs_token.to(tl.int64)` before `token_mask`

**`tests/kernels/moe/test_moe.py`**: Upstream test changes depend on `make_dummy_moe_config()` from intermediate refactors. Resolved by keeping our existing test code (no test changes).
</details>

<details>
<summary><b>#6 — AllReduceFusionPass multicast (vllm-project#35085)</b>: Conflict due to file rename + API change</summary>

Upstream moved `collective_fusion.py` → `compilation/passes/fusion/allreduce_rms_fusion.py` and changed the API from `trtllm_create_ipc_workspace_for_all_reduce_fusion()` to `create_allreduce_fusion_workspace()`. Resolved by applying the try/except wrapper around our existing `trtllm_create_ipc_workspace_for_all_reduce_fusion()` call in `collective_fusion.py`. The error-handling logic (catching RuntimeError with "multicast" in the message, logging a warning, returning early) is identical to upstream.
</details>

<details>
<summary><b>#7 — response_format validation for completions (vllm-project#35456)</b>: Conflict due to file restructuring</summary>

Upstream split `protocol.py` into `completion/protocol.py` and `chat_completion/protocol.py`. Our branch still has the monolithic `protocol.py`. Resolved by:

- Removing the non-existent `vllm/entrypoints/openai/completion/protocol.py`
- Manually adding the `validate_response_format` model_validator to `CompletionRequest` in our `protocol.py`
- Using `ValueError` instead of upstream's `VLLMValidationError` (which doesn't exist in our branch; `ValueError` is already handled as 400 Bad Request in `serving_engine.py`)
- Applying the upstream test additions cleanly to `test_completion_error.py`
</details>

<details>
<summary><b>#8 — response_format validation for chat completions (vllm-project#35510)</b>: Conflict due to file restructuring</summary>

Same file restructuring issue as #7. Resolved by:

- Removing the non-existent `vllm/entrypoints/openai/chat_completion/protocol.py`
- Manually adding the `validate_response_format` model_validator to `ChatCompletionRequest` in our `protocol.py`
- Only accepting the `test_json_schema_response_format_missing_schema` test from the conflict (discarding ~140 lines of intermediate upstream tests that reference non-existent paths in our branch)
</details>

## Test plan

- [ ] Verify `llama-nemotron-embed-1b-v2` works correctly with the backported score-template / bidirectional model support
- [ ] Verify the Llama-4 quantized model loads correctly with int8/fp8 attention quantization
- [ ] Verify GPT-OSS multi-turn chat with `json_object` response_format returns valid content
- [ ] Verify large MoE models (e.g. Qwen3.5-397B) don't crash with int32 overflow
- [ ] Verify MoE model loading on H200 GPUs (without multicast) gracefully falls back
- [ ] Verify `response_format: {type: "json_schema"}` without a `json_schema` field returns 400 (not 500) for both `/v1/completions` and `/v1/chat/completions`
- [ ] Verify encoder models (e.g. Whisper) work with the Triton attention backend on ROCm
TLDR
Purpose
This PR allows users to specify a custom prompt template for score/rerank models by providing the `--chat-template` CLI argument or setting `chat_template` in `tokenizer_config.json`.

Motivation: The current mechanism for setting custom score templates (`SupportsScoreTemplate`) is architecture-specific; it requires modifying the model class itself. This change decouples the prompt template from the model class, enabling support for any model requiring a custom score template without model-specific code changes.

Immediate use case: The nvidia/llama-nemotron-rerank-1b-v2 model, which uses the Llama architecture but a custom score template, can now be made to run correctly on vLLM with minor config.json modifications.
Running nvidia/llama-nemotron-rerank-1b-v2 with examples provided in the model's README, using FP32 precision:
Running without the custom template:
Running with a custom template:
Test Plan
tests/entrypoints/pooling/score/test_utils.py
tests/models/language/pooling_mteb_test/test_nemotron.py
Test Result
pass
TODO
Since vLLM doesn't allow truncation by default, it should not be a problem.