[Bugfix] [Frontend] Responses API, fix merging of messages by yzong-rh · Pull Request #42189 · vllm-project/vllm

yzong-rh · 2026-05-10T00:16:56Z

Co-Authored-By: bfroemel

Based on #37294
With help from: @bfroemel, @weiguangli-io and @aayushbaluni

Purpose

Addresses #37167 (responses API, combining of message and tool call) by merging contiguous assistant-side Responses items into a single Chat Completions assistant message on the non-Harmony path.

Merging with the previous message only happens when the previously constructed message is an assistant message; otherwise a new assistant message is started.

Item type	This PR	`llama.cpp` PR #19773	`vLLM` PR #37294
`output_text`	Populate assistant `content` if no previous `content`.	Populate or append to assistant `content`.	Populate assistant `content` if it has `reasoning` only.
`function_call`	Populate or append to assistant `tool_calls`.	Populate or append to assistant `tool_calls`.	Populate or append to assistant `tool_calls`.
`reasoning`	Populate assistant `reasoning` if no previous `reasoning`.	Populate or override assistant `reasoning_content`.	Start a new assistant message with `reasoning`.

Note:
Converting Responses items into Chat Completion for non-Harmony models is inherently lossy. Partial-completion may apply to multiple merged Responses items. These are necessary tradeoffs in order to preserve single assistant turn for template rendering.

Existing code also does not handle refusal or multi-part content. This PR doesn't remedy this.

Test Plan

Unit test:

pytest tests/entrypoints/openai/responses/test_responses_utils.py

Token-level comparisons:

# Gemma 4 disables reasoning by default, so test wo reasoning
vllm serve google/gemma-4-31B-it \
  --port 8000 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --enable-log-requests \
  --enable-log-outputs

# Qwen3.6 enables reasoning by default
vllm serve Qwen/Qwen3.6-27B \
  --port 8000 \
  --max-model-len 262144 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --enable-log-requests \
  --enable-log-outputs

Send a Responses API request with send_response.py using tools.json and gemma4.step1.response.json or qwen36.step1.response.json. Note that they are real conversations generated by the model.

python send_response.py \
  --base-url http://127.0.0.1:8000/v1 \
  --model google/gemma-4-31B-it \
  --input-file gemma4.step1.response.json \
  --tools-file tools.json \
  --output-file gemma4.step2.json

python send_response.py \
  --base-url http://127.0.0.1:8000/v1 \
  --model Qwen/Qwen3.6-27B \
  --input-file qwen36.step1.response.json \
  --tools-file tools.json \
  --output-file qwen36.step2.json

And compare the decoded prompt token ids.

BFCL evals

BFCL_MODEL="google/gemma-4-31B-it" \
BFCL_MAX_MODEL_LEN="16384" \
BFCL_API_TYPE="responses" \
BFCL_TOOL_CALL_PARSER="gemma4" \
BFCL_REASONING_PARSER="" \
BFCL_TEST_CATEGORY="multi_turn" \
BFCL_EXTRA_ARGS="--data-parallel-size 2 --chat-template examples/tool_chat_template_gemma4.jinja" \
bash .buildkite/scripts/tool_call/run-bfcl-eval.sh

BFCL_MODEL="Qwen/Qwen3.6-27B" \
BFCL_MAX_MODEL_LEN="16384" \
BFCL_API_TYPE="responses" \
BFCL_TOOL_CALL_PARSER="qwen3_coder" \
BFCL_REASONING_PARSER="qwen3" \
BFCL_TEST_CATEGORY="multi_turn" \
BFCL_EXTRA_ARGS="--data-parallel-size 2"
bash .buildkite/scripts/tool_call/run-bfcl-eval.sh

Test Result

Unit test pass.

Prompt token comparison:

Qwen3.6:
- Before: qwen3_before_prompt.txt
- After: qwen3_after_prompt.txt
Gemma 4
- Before: crash due to tool_call_id mismatch, see [Bugfix] Gemma 4 chat template crash with missing tool name and tool id #42188
- After: gemma4_after_prompt.txt

BFCL results:

Qwen3.6:

Run	`multi_turn_miss_func`	`multi_turn_miss_param`	`multi_turn_long_context`	`multi_turn_base`
Before	61.00%	47.50%	36.50%	62.50%
After	62.50%	48.50%	46.50%	74.50%

Gemma 4:

Run	`multi_turn_miss_func`	`multi_turn_miss_param`	`multi_turn_long_context`	`multi_turn_base`
Before	17.00%	17.00%	15.00%	28.50%
After	42.50%	50.00%	52.50%	79.50%

cc @qandrew @chaunceyjiang

Made with Cursor

Signed-off-by: Yifan Zong <yzong@redhat.com>

gemini-code-assist

Code Review

This pull request refactors the chat message construction logic to merge consecutive assistant-side response items, such as reasoning, tool calls, and output text, into single assistant messages. It also updates the test suite with comprehensive merging policy checks. Feedback suggests extending the merging logic to account for existing message history to prevent template issues and replacing assertions with robust type checks when modifying tool call lists to ensure compatibility with immutable types.

Signed-off-by: Yifan Zong <yzong@redhat.com>

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

bfroemel · 2026-05-10T07:51:36Z

@yzong-rh thanks, but I am not a coauthor ;) (no code seems to be directly taken from my abandoned PR + coauthorship potentially could complicate DCO)

Your PR looks good, two minor remarks:

cognitively I preferred the explicit merging function; now it might be (slightly harder) to recognize that _construct_message_from_response_item() also does merging
not sure about merging reasoning into a previous assistant message that also has reasoning, and merging content into a previous assistant message that also has content; both imo indicate that there likely is a problem, because (according to schema and how reasoning is handled in practice) chat completion models cannot generate an assistant chat completion message containing multiple reasoning or content fields. Usually, models are quite sensitive whether their own text generations and the rendered prompt is within their expectation/in distribution; ofc there could be benefits of your merging strategy, if you switched mid-session from a model like gpt-oss to a "native" chat completion model, but on the other hand, if a client supports such a model switch, it might handle that conversation/item merging internally (or ask the harmony model first for a concise summary and hand that over to the non-harmony model).

Signed-off-by: Yifan Zong <yzong@redhat.com>

yzong-rh · 2026-05-10T19:32:47Z

I am not a coauthor ;) (no code seems to be directly taken from my abandoned PR + coauthorship potentially could complicate DCO)

I pulled from you PR to work on top of it. Enough changes may have been made that not much is left of the original code. Happy to amend DCO and give you proper attribution

cognitively I preferred the explicit merging function; now it might be (slightly harder) to recognize that _construct_message_from_response_item() also does merging

I removed the explicit merge function to avoid code duplication in the if / else chain, although you are right that _construct_message_from_response_item() can now unexpectedly do merge. I've update the fn signature and comment to highligh this better.

not sure about merging reasoning into a previous assistant message that also has reasoning, and merging content into a previous assistant message that also has content; both imo indicate that there likely is a problem, because (according to schema and how reasoning is handled in practice) chat completion models cannot generate an assistant chat completion message containing multiple reasoning or content fields.

Agree with you that both would indicate a problem given the current models. Other than switching from a gpt-oss to a native model mid-session you mentioned, another case is that the user turn somehow got lost in the conversation, in which case having two consecutive assistant turns might be preferable over a single assistant turn with concatenated content and reasoning.

I originally aimed to replicate the llama.cpp PR but a less greedy approach might be less surprising here. Plus, the current content concatenation path isn't that robust, and the code will be much simpler. Updated the logic and BFCL as sanity check (no perf regression, as expected)

Signed-off-by: Yifan <yzong@redhat.com>

chaunceyjiang

LGTM

…ect#42189) Signed-off-by: Yifan Zong <yzong@redhat.com> Signed-off-by: Yifan <yzong@redhat.com>

…ect#42189) Signed-off-by: Yifan Zong <yzong@redhat.com> Signed-off-by: Yifan <yzong@redhat.com> (cherry picked from commit 6ff7405)

…ect#42189) Signed-off-by: Yifan Zong <yzong@redhat.com> Signed-off-by: Yifan <yzong@redhat.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>

…ect#42189) Signed-off-by: Yifan Zong <yzong@redhat.com> Signed-off-by: Yifan <yzong@redhat.com>

Fix

bcdcce8

Signed-off-by: Yifan Zong <yzong@redhat.com>

mergify Bot added frontend bug Something isn't working labels May 10, 2026

yzong-rh mentioned this pull request May 10, 2026

[Bugfix] Gemma 4 chat template crash with missing tool name and tool id #42188

Merged

gemini-code-assist Bot reviewed May 10, 2026

View reviewed changes

Comment thread vllm/entrypoints/openai/responses/utils.py Outdated

Comment thread vllm/entrypoints/openai/responses/utils.py Outdated

Addr comment

89c1446

Signed-off-by: Yifan Zong <yzong@redhat.com>

yzong-rh marked this pull request as ready for review May 10, 2026 00:46

yzong-rh requested review from DarkLight1337, NickLucche, aarnphm, chaunceyjiang, robertgshaw2-redhat and russellb as code owners May 10, 2026 00:46

claude Bot reviewed May 10, 2026

View reviewed changes

Addr comments

bf3a252

Signed-off-by: Yifan Zong <yzong@redhat.com>

chaunceyjiang reviewed May 11, 2026

View reviewed changes

Comment thread vllm/entrypoints/openai/responses/utils.py Outdated

chaunceyjiang reviewed May 11, 2026

View reviewed changes

Comment thread vllm/entrypoints/openai/responses/utils.py Outdated

Remove extra warnings

05f2f7a

Signed-off-by: Yifan <yzong@redhat.com>

yzong-rh requested a review from chaunceyjiang May 11, 2026 18:25

chaunceyjiang approved these changes May 12, 2026

View reviewed changes

chaunceyjiang added the ready ONLY add when PR is ready to merge/full CI is needed label May 12, 2026

chaunceyjiang enabled auto-merge (squash) May 12, 2026 12:04

chaunceyjiang disabled auto-merge May 12, 2026 12:42

chaunceyjiang enabled auto-merge (squash) May 12, 2026 12:43

Merge branch 'main' into yzong-rh/responses-combine-msg

cba166b

chaunceyjiang merged commit 6ff7405 into vllm-project:main May 12, 2026
42 checks passed

mfylcek pushed a commit to mfylcek/vllm that referenced this pull request May 19, 2026

[Bugfix] [Frontend] Responses API, fix merging of messages (vllm-proj…

e4c1d6d

…ect#42189) Signed-off-by: Yifan Zong <yzong@redhat.com> Signed-off-by: Yifan <yzong@redhat.com>

jhu960213 pushed a commit to jhu960213/vllm that referenced this pull request May 20, 2026

[Bugfix] [Frontend] Responses API, fix merging of messages (vllm-proj…

daae2cf

…ect#42189) Signed-off-by: Yifan Zong <yzong@redhat.com> Signed-off-by: Yifan <yzong@redhat.com>

This was referenced May 20, 2026

[Bug]: previous_response_id drops function_call/function_call_output from stored context in Responses API #43244

Open

Fix previous_response_id dropping tool calls from stored context #43247

Open

h1t35h pushed a commit to h1t35h/vllm that referenced this pull request May 21, 2026

[Bugfix] [Frontend] Responses API, fix merging of messages (vllm-proj…

60ef6df

…ect#42189) Signed-off-by: Yifan Zong <yzong@redhat.com> Signed-off-by: Yifan <yzong@redhat.com>

knight0528 pushed a commit to knight0528/vllm that referenced this pull request Jun 8, 2026

[Bugfix] [Frontend] Responses API, fix merging of messages (vllm-proj…

47500fb

…ect#42189) Signed-off-by: Yifan Zong <yzong@redhat.com> Signed-off-by: Yifan <yzong@redhat.com>

yzong-rh mentioned this pull request Jun 11, 2026

[Bugfix] Responses API assistant EasyInputMessageParam input #44361

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Bugfix] [Frontend] Responses API, fix merging of messages#42189

[Bugfix] [Frontend] Responses API, fix merging of messages#42189
chaunceyjiang merged 5 commits into
vllm-project:mainfrom
yzong-rh:yzong-rh/responses-combine-msg

yzong-rh commented May 10, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

claude Bot left a comment

Uh oh!

bfroemel commented May 10, 2026

Uh oh!

yzong-rh commented May 10, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

chaunceyjiang left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

yzong-rh commented May 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Test Plan

Test Result

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Claude Code Review

Uh oh!

bfroemel commented May 10, 2026

Uh oh!

yzong-rh commented May 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

chaunceyjiang left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

yzong-rh commented May 10, 2026 •

edited

Loading

yzong-rh commented May 10, 2026 •

edited

Loading