[Frontend] Support using chat template as custom score template for reranking models#30550
Conversation
|
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger a full CI run by default. Instead, only a small and essential subset of CI tests runs to quickly catch errors. You can ask your reviewers to trigger select CI tests on top of that. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀 |
|
Documentation preview: https://vllm--30550.org.readthedocs.build/en/30550/ |
Code Review
This pull request introduces a --score-template CLI argument, allowing users to provide a custom Jinja2 template for score/rerank models. This is a valuable feature for decoupling prompt formatting from model-specific code. The implementation is mostly solid, with new CLI arguments, documentation, and tests. However, I've identified a high-severity issue related to code reuse that impacts maintainability and user experience. Specifically, chat-template-specific utilities are being reused for score templates, which can lead to confusing error messages. I've suggested a refactoring to create more generic template-handling functions.
|
Hi @jzakrzew, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch. For future commits, the installed hooks will run automatically.
|
|
Force-pushed from 1f64fa9 to 9258e17
|
Hello @Samoed, how does MTEB handle score templates? We are looking to align our implementation with MTEB. More links related to score templates:
Let's find a way to resolve it once and for all. |
|
Hi! We have a separate class for handling instruction-based models that process instructions, with an example for Qwen3. However, this approach is a bit naive, since there's no standard way of doing this yet. Maybe @tomaarsen has some thoughts on standardizing prompt templates for cross-encoders. For me, it has always been unclear why there are no models that define their prompts in Jinja templates that could be used more automatically. |
|
Hello @tomaarsen, please take a look at this thread. |
|
Thanks for pinging me @noooop & @Samoed.
This modern format is becoming a lot more prevalent. For my codebase, there were always two main concerns:
I think working on this support is so important that I'm working on a major refactor of the Sentence Transformers codebase, notably around the CrossEncoder, to help modularize it. This allows me to very easily support models that don't rely on

For concern 2, a very simple solution is to rely on the chat template:

```python
messages = [
    {
        "role": "query",
        "content": "What is the capital of France?",
    },
    {
        "role": "document",
        "content": "Paris is the capital of France.",
    },
]
tokenized = tokenizer.apply_chat_template(messages, ...)
```

(This one matches the format required for https://huggingface.co/Qwen/Qwen3-Reranker-0.6B, I believe.)

An additional benefit here is that we can take advantage of a "system prompt" of sorts as an instruction/prompt for the reranker. In the above chat template, I hardcoded

Some of the obvious advantages are that the

But my primary hesitation at the current stage is that

This tempts me to write a more "manual" templating implementation, where I can apply the truncation on the second input (often the 'document' in a query-document setting).

Recurring issues that I've found with my initial attempts are that you can't fully separately tokenize the template from the actual texts, as many template tokens will want to "merge" with actual text tokens (e.g. having

Those are my thoughts for now. @noooop, where do you stand regarding:
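As a concrete illustration of the custom-role idea, here is a minimal sketch that renders such messages with a hand-written Jinja template. The `<Query>`/`<Document>` markers and the template itself are illustrative assumptions, not any model's actual chat template:

```python
from jinja2 import Template

# Hand-written score template over custom "query"/"document" roles.
# Markers and layout are illustrative, not a real model's template.
SCORE_TEMPLATE = (
    "{% for m in messages %}"
    "{% if m.role == 'query' %}<Query>: {{ m.content }}\n"
    "{% elif m.role == 'document' %}<Document>: {{ m.content }}\n"
    "{% endif %}"
    "{% endfor %}"
)

messages = [
    {"role": "query", "content": "What is the capital of France?"},
    {"role": "document", "content": "Paris is the capital of France."},
]

prompt = Template(SCORE_TEMPLATE).render(messages=messages)
print(prompt)
```

In a real setup, this template string would live in `tokenizer_config.json` (or be passed to the server) and be rendered by the tokenizer's chat-template machinery rather than by jinja2 directly.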
|
Generally I think it should be like this, but for now there are no such models. Even Qwen just inherits its template from the original LLM.
Generally I think it's a good approach, but I'm afraid some libraries won't allow custom role names. Probably you can use |
|
What are your thoughts?
This is also my concern, which is why I'd like to seek your advice. After all, Sentence Transformers and MTEB are upstream of vLLM, and vLLM only supports a very limited number of CrossEncoder models. Just to mention:
Currently, vLLM does not perform truncation by default, following the OpenAI API behavior for /v1/embeddings. |
This makes sense. If the HF Hub repo has an incorrect chat template, you can override it in vLLM by passing
As @noooop said, since we don't allow truncation by default, it should not be a problem. |
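The "truncate only the document" idea discussed above can be sketched with a toy whitespace tokenizer. All names and the overhead constant are illustrative; a real implementation would use the model's tokenizer and an HF-style `only_second` truncation strategy:

```python
# Toy sketch: keep the full query, trim the document to the remaining
# token budget. A whitespace split stands in for a real tokenizer, and
# template_overhead is a made-up count of template-only tokens.
def encode(text):
    return text.split()

def build_pair(query, document, max_len, template_overhead=4):
    q = encode(query)
    d = encode(document)
    budget = max_len - len(q) - template_overhead
    if budget < 0:
        raise ValueError("query alone exceeds max_len")
    return q + d[:budget]

tokens = build_pair(
    "capital of France ?",
    "Paris is the capital city of France .",
    max_len=10,
)
print(len(tokens))  # 4 query tokens + 2 document tokens -> 6
```

The hard part tomaarsen mentions remains: with a real tokenizer, template tokens can merge with text tokens at the boundaries, so the budget cannot be computed exactly by tokenizing the pieces separately.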
|
OK, I'll modify the PR so that it uses |
|
I think that's the right move. I'll also move to
Does vLLM support prompts/instructions? Edit: As mentioned by @Samoed, the above approaches are not very robust for listwise rerankers, which have multiple documents.
|
|
Also, support based on chunk content can be added, like:

```python
{
    "role": "user",
    "content": [
        {
            "type": "query",
            "text": {  # query/text
                "value": "How does AI work? Explain it in simple terms.",
                "annotations": []
            }
        },
        {
            "type": "document",
            "text": {  # document/text
                "value": "AI works like ..."
            }
        }
    ],
}
```

But I'm not sure whether it's possible to handle this from Jinja, or whether it would work with other libraries.
By the way, this won't work for |
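For what it's worth, a Jinja loop can walk such typed content parts. A minimal sketch, assuming the illustrative field names from the snippet above (`type`, `text`, `value`):

```python
from jinja2 import Template

# Render typed content parts into a flat scoring prompt.
# Field names follow the illustrative structure above, not a standard.
tmpl = Template(
    "{% for part in message.content %}"
    "{{ part.type | capitalize }}: {{ part.text.value }}\n"
    "{% endfor %}"
)

message = {
    "role": "user",
    "content": [
        {"type": "query", "text": {"value": "How does AI work?"}},
        {"type": "document", "text": {"value": "AI works like ..."}},
    ],
}

out = tmpl.render(message=message)
print(out)
```

So plain Jinja can handle the nesting; the open question is whether other libraries' chat-template plumbing would accept non-standard part types at all.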
|
I think we need to ask someone from the tokenizers/chat-template maintainers for a better way to handle this. |
|
cc @hmellor |
Force-pushed from 9258e17 to 40808e9
|
Hi @jzakrzew, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch. For future commits, the installed hooks will run automatically.
|
|
DarkLight1337 left a comment
LGTM now, thanks for the detailed discussion!
Head branch was pushed to by a user without write access
|
@noooop Sorry, just wanted to clarify one comment; I did not notice you had enabled auto-merge. |
…eranking models (vllm-project#30550) Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com> Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: wang.yuqi <noooop@126.com> Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: Ubuntu <mjtaheri68@gmail.com>
…eranking models (vllm-project#30550) Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com> Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: wang.yuqi <noooop@126.com> Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io> (cherry picked from commit 23daef5)
## Summary

Cherry-pick upstream bug fixes for RHAIIS 3.3.1 onto `rhai/0.13.0`. All fixes are from upstream vLLM `main` and address critical bugs affecting RHAIIS 3.3.0. Other releases (3.2.2, EAx) will be done separately.

**Jira Epic:** [INFERENG-4743](https://issues.redhat.com/browse/INFERENG-4743)

## Cherry-picked commits (chronological order)

| # | Upstream PR | Jira | Summary |
|---|-------------|------|---------|
| 1 | [vllm-project#30550](vllm-project#30550) | [INFERENG-5106](https://issues.redhat.com/browse/INFERENG-5106) | Support using chat template as custom score template for reranking models |
| 2 | [vllm-project#31406](vllm-project#31406) | [INFERENG-4800](https://issues.redhat.com/browse/INFERENG-4800) | Add encoder-only/cross attention support to Triton Attention backend |
| 3 | [vllm-project#34243](vllm-project#34243) | [INFERENG-4746](https://issues.redhat.com/browse/INFERENG-4746) | Fix Llama-4 attn quantization by correctly permuting scales for rope (int8, fp8) |
| 4 | [vllm-project#34454](vllm-project#34454) | [INFERENG-5032](https://issues.redhat.com/browse/INFERENG-5032) | Fix structured output in multi-turn GPT-OSS (content:null with json_object) |
| 5 | [vllm-project#34507](vllm-project#34507) | [INFERENG-5038](https://issues.redhat.com/browse/INFERENG-5038) | Fix fused MoE int32 overflow in stride*offset for large models |
| 6 | [vllm-project#35085](vllm-project#35085) | [INFERENG-5028](https://issues.redhat.com/browse/INFERENG-5028) | Gracefully disable AllReduceFusionPass on GPUs without multicast support |
| 7 | [vllm-project#35456](vllm-project#35456) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Replace assert with ValueError for response_format validation (completions) |
| 8 | [vllm-project#35510](vllm-project#35510) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Add response_format validation to chat completions endpoint |

## Conflict resolutions

<details>
<summary><b>#1 — llama-nemotron-embed / score-template support (vllm-project#30550)</b>: Clean cherry-pick, no conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.
</details>

<details>
<summary><b>#2 — Triton Attention (vllm-project#31406)</b>: Clean cherry-pick, no conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.
</details>

<details>
<summary><b>#3 — Llama-4 attn quant (vllm-project#34243)</b>: Clean cherry-pick, no conflicts</summary>

Applied cleanly. 4 intermediate upstream commits touch `llama4.py`, but the fix targets a self-contained block.
</details>

<details>
<summary><b>#4 — GPT-OSS multi-turn (vllm-project#34454)</b>: Clean cherry-pick, no conflicts</summary>

Applied cleanly despite 3 intermediate upstream commits that refactored imports in `gptoss_reasoning_parser.py`. The fix logic (adding an `eom_token_id` early-exit check in `is_reasoning_end`) was independent of the import changes.
</details>

<details>
<summary><b>#5 — Fused MoE int32 overflow (vllm-project#34507)</b>: Conflicts in 2 files</summary>

**`vllm/model_executor/layers/fused_moe/fused_moe.py`**: ~30 intermediate upstream commits refactored `fused_moe_kernel` with conditional `naive_block_assignment` logic that doesn't exist in `rhai/0.13.0`. Resolved by keeping our simpler code and applying only the int64 cast fix:

- `fused_moe_kernel_gptq_awq`: added `.to(tl.int64)` to the `tl.load()` result
- `fused_moe_kernel`: added `offs_token = offs_token.to(tl.int64)` before `token_mask`

**`tests/kernels/moe/test_moe.py`**: Upstream test changes depend on `make_dummy_moe_config()` from intermediate refactors. Resolved by keeping our existing test code (no test changes).
</details>

<details>
<summary><b>#6 — AllReduceFusionPass multicast (vllm-project#35085)</b>: Conflict due to file rename + API change</summary>

Upstream moved `collective_fusion.py` → `compilation/passes/fusion/allreduce_rms_fusion.py` and changed the API from `trtllm_create_ipc_workspace_for_all_reduce_fusion()` to `create_allreduce_fusion_workspace()`. Resolved by applying the try/except wrapper around our existing `trtllm_create_ipc_workspace_for_all_reduce_fusion()` call in `collective_fusion.py`. The error-handling logic (catching RuntimeError with "multicast" in the message, logging a warning, returning early) is identical to upstream.
</details>

<details>
<summary><b>#7 — response_format validation for completions (vllm-project#35456)</b>: Conflict due to file restructuring</summary>

Upstream split `protocol.py` into `completion/protocol.py` and `chat_completion/protocol.py`. Our branch still has the monolithic `protocol.py`. Resolved by:

- Removing the non-existent `vllm/entrypoints/openai/completion/protocol.py`
- Manually adding the `validate_response_format` model_validator to `CompletionRequest` in our `protocol.py`
- Using `ValueError` instead of upstream's `VLLMValidationError` (which doesn't exist in our branch; `ValueError` is already handled as 400 Bad Request in `serving_engine.py`)
- Applying the upstream test additions cleanly to `test_completion_error.py`
</details>

<details>
<summary><b>#8 — response_format validation for chat completions (vllm-project#35510)</b>: Conflict due to file restructuring</summary>

Same file restructuring issue as #7. Resolved by:

- Removing the non-existent `vllm/entrypoints/openai/chat_completion/protocol.py`
- Manually adding the `validate_response_format` model_validator to `ChatCompletionRequest` in our `protocol.py`
- Only accepting the `test_json_schema_response_format_missing_schema` test from the conflict (discarding ~140 lines of intermediate upstream tests that reference non-existent paths in our branch)
</details>

## Test plan

- [ ] Verify `llama-nemotron-embed-1b-v2` works correctly with the backported score-template / bidirectional model support
- [ ] Verify the Llama-4 quantized model loads correctly with int8/fp8 attention quantization
- [ ] Verify GPT-OSS multi-turn chat with `json_object` response_format returns valid content
- [ ] Verify large MoE models (e.g. Qwen3.5-397B) don't crash with int32 overflow
- [ ] Verify MoE model loading on H200 GPUs (without multicast) gracefully falls back
- [ ] Verify `response_format: {type: "json_schema"}` without a `json_schema` field returns 400 (not 500) for both `/v1/completions` and `/v1/chat/completions`
- [ ] Verify encoder models (e.g. Whisper) work with the Triton attention backend on ROCm
TLDR
Purpose
This PR allows users to specify a custom prompt template for score/rerank models by providing the `--chat-template` CLI argument or setting `chat_template` in `tokenizer_config.json`.

Motivation: The current mechanism for setting custom score templates (`SupportsScoreTemplate`) is architecture-specific; it requires modifying the model class itself. This change decouples the prompt template from the model class, enabling support for any model requiring a custom score template without model-specific code changes.

Immediate use case: The nvidia/llama-nemotron-rerank-1b-v2 model, which uses the Llama architecture but a custom score template, can now be made to run correctly on vLLM with minor config.json modifications.
Running nvidia/llama-nemotron-rerank-1b-v2 with examples provided in the model's README, using FP32 precision:
Running without the custom template:
Running with a custom template:
Test Plan
tests/entrypoints/pooling/score/test_utils.py
tests/models/language/pooling_mteb_test/test_nemotron.py
Test Result
pass
TODO
Since vLLM doesn't allow truncation by default, it should not be a problem.