Skip to content

[Bugfix] [Frontend] Responses API, fix merging of messages#42189

Merged
chaunceyjiang merged 5 commits into
vllm-project:mainfrom
yzong-rh:yzong-rh/responses-combine-msg
May 12, 2026
Merged

[Bugfix] [Frontend] Responses API, fix merging of messages#42189
chaunceyjiang merged 5 commits into
vllm-project:mainfrom
yzong-rh:yzong-rh/responses-combine-msg

Conversation

@yzong-rh

@yzong-rh yzong-rh commented May 10, 2026

Copy link
Copy Markdown
Contributor

Co-Authored-By: bfroemel

Based on #37294
With help from: @bfroemel, @weiguangli-io and @aayushbaluni

Purpose

Addresses #37167 (responses API, combining of message and tool call) by merging contiguous assistant-side Responses items into a single Chat Completions assistant message on the non-Harmony path.

Merging with the previous message only happens when the previously constructed message is an assistant message; otherwise a new assistant message is started.

Item type This PR llama.cpp PR #19773 vLLM PR #37294
output_text Populate assistant content if no previous content. Populate or append to assistant content. Populate assistant content if it has reasoning only.
function_call Populate or append to assistant tool_calls. Populate or append to assistant tool_calls. Populate or append to assistant tool_calls.
reasoning Populate assistant reasoning if no previous reasoning. Populate or override assistant reasoning_content. Start a new assistant message with reasoning.

Note:
Converting Responses items into Chat Completion for non-Harmony models is inherently lossy. Partial-completion may apply to multiple merged Responses items. These are necessary tradeoffs in order to preserve single assistant turn for template rendering.

Existing code also does not handle refusal or multi-part content. This PR doesn't remedy this.

Test Plan

Unit test:

pytest tests/entrypoints/openai/responses/test_responses_utils.py
Token-level comparisons:
# Gemma 4 disables reasoning by default, so test wo reasoning
vllm serve google/gemma-4-31B-it \
  --port 8000 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --enable-log-requests \
  --enable-log-outputs

# Qwen3.6 enables reasoning by default
vllm serve Qwen/Qwen3.6-27B \
  --port 8000 \
  --max-model-len 262144 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --enable-log-requests \
  --enable-log-outputs

Send a Responses API request with send_response.py using tools.json and gemma4.step1.response.json or qwen36.step1.response.json. Note that they are real conversations generated by the model.

python send_response.py \
  --base-url http://127.0.0.1:8000/v1 \
  --model google/gemma-4-31B-it \
  --input-file gemma4.step1.response.json \
  --tools-file tools.json \
  --output-file gemma4.step2.json

python send_response.py \
  --base-url http://127.0.0.1:8000/v1 \
  --model Qwen/Qwen3.6-27B \
  --input-file qwen36.step1.response.json \
  --tools-file tools.json \
  --output-file qwen36.step2.json

And compare the decoded prompt token ids.

BFCL evals
BFCL_MODEL="google/gemma-4-31B-it" \
BFCL_MAX_MODEL_LEN="16384" \
BFCL_API_TYPE="responses" \
BFCL_TOOL_CALL_PARSER="gemma4" \
BFCL_REASONING_PARSER="" \
BFCL_TEST_CATEGORY="multi_turn" \
BFCL_EXTRA_ARGS="--data-parallel-size 2 --chat-template examples/tool_chat_template_gemma4.jinja" \
bash .buildkite/scripts/tool_call/run-bfcl-eval.sh

BFCL_MODEL="Qwen/Qwen3.6-27B" \
BFCL_MAX_MODEL_LEN="16384" \
BFCL_API_TYPE="responses" \
BFCL_TOOL_CALL_PARSER="qwen3_coder" \
BFCL_REASONING_PARSER="qwen3" \
BFCL_TEST_CATEGORY="multi_turn" \
BFCL_EXTRA_ARGS="--data-parallel-size 2"
bash .buildkite/scripts/tool_call/run-bfcl-eval.sh

Test Result

Unit test pass.

Prompt token comparison:

BFCL results:

Qwen3.6:

Run multi_turn_miss_func multi_turn_miss_param multi_turn_long_context multi_turn_base
Before 61.00% 47.50% 36.50% 62.50%
After 62.50% 48.50% 46.50% 74.50%

Gemma 4:

Run multi_turn_miss_func multi_turn_miss_param multi_turn_long_context multi_turn_base
Before 17.00% 17.00% 15.00% 28.50%
After 42.50% 50.00% 52.50% 79.50%

cc @qandrew @chaunceyjiang

Made with Cursor

Signed-off-by: Yifan Zong <yzong@redhat.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the chat message construction logic to merge consecutive assistant-side response items, such as reasoning, tool calls, and output text, into single assistant messages. It also updates the test suite with comprehensive merging policy checks. Feedback suggests extending the merging logic to account for existing message history to prevent template issues and replacing assertions with robust type checks when modifying tool call lists to ensure compatibility with immutable types.

Comment thread vllm/entrypoints/openai/responses/utils.py Outdated
Comment thread vllm/entrypoints/openai/responses/utils.py Outdated
Signed-off-by: Yifan Zong <yzong@redhat.com>
@yzong-rh yzong-rh marked this pull request as ready for review May 10, 2026 00:46

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@bfroemel

Copy link
Copy Markdown

@yzong-rh thanks, but I am not a coauthor ;) (no code seems to be directly taken from my abandoned PR + coauthorship potentially could complicate DCO)

Your PR looks good, two minor remarks:

  • cognitively I preferred the explicit merging function; now it might be (slightly harder) to recognize that _construct_message_from_response_item() also does merging
  • not sure about merging reasoning into a previous assistant message that also has reasoning, and merging content into a previous assistant message that also has content; both imo indicate that there likely is a problem, because (according to schema and how reasoning is handled in practice) chat completion models cannot generate an assistant chat completion message containing multiple reasoning or content fields. Usually, models are quite sensitive whether their own text generations and the rendered prompt is within their expectation/in distribution; ofc there could be benefits of your merging strategy, if you switched mid-session from a model like gpt-oss to a "native" chat completion model, but on the other hand, if a client supports such a model switch, it might handle that conversation/item merging internally (or ask the harmony model first for a concise summary and hand that over to the non-harmony model).

Signed-off-by: Yifan Zong <yzong@redhat.com>
@yzong-rh

yzong-rh commented May 10, 2026

Copy link
Copy Markdown
Contributor Author

I am not a coauthor ;) (no code seems to be directly taken from my abandoned PR + coauthorship potentially could complicate DCO)

I pulled from you PR to work on top of it. Enough changes may have been made that not much is left of the original code. Happy to amend DCO and give you proper attribution

cognitively I preferred the explicit merging function; now it might be (slightly harder) to recognize that _construct_message_from_response_item() also does merging

I removed the explicit merge function to avoid code duplication in the if / else chain, although you are right that _construct_message_from_response_item() can now unexpectedly do merge. I've update the fn signature and comment to highligh this better.

not sure about merging reasoning into a previous assistant message that also has reasoning, and merging content into a previous assistant message that also has content; both imo indicate that there likely is a problem, because (according to schema and how reasoning is handled in practice) chat completion models cannot generate an assistant chat completion message containing multiple reasoning or content fields.

Agree with you that both would indicate a problem given the current models. Other than switching from a gpt-oss to a native model mid-session you mentioned, another case is that the user turn somehow got lost in the conversation, in which case having two consecutive assistant turns might be preferable over a single assistant turn with concatenated content and reasoning.

I originally aimed to replicate the llama.cpp PR but a less greedy approach might be less surprising here. Plus, the current content concatenation path isn't that robust, and the code will be much simpler. Updated the logic and BFCL as sanity check (no perf regression, as expected)

Comment thread vllm/entrypoints/openai/responses/utils.py Outdated
Comment thread vllm/entrypoints/openai/responses/utils.py Outdated
Signed-off-by: Yifan <yzong@redhat.com>
@yzong-rh yzong-rh requested a review from chaunceyjiang May 11, 2026 18:25

@chaunceyjiang chaunceyjiang left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@chaunceyjiang chaunceyjiang added the ready ONLY add when PR is ready to merge/full CI is needed label May 12, 2026
@chaunceyjiang chaunceyjiang enabled auto-merge (squash) May 12, 2026 12:04
@chaunceyjiang chaunceyjiang disabled auto-merge May 12, 2026 12:42
@chaunceyjiang chaunceyjiang enabled auto-merge (squash) May 12, 2026 12:43
@chaunceyjiang chaunceyjiang merged commit 6ff7405 into vllm-project:main May 12, 2026
42 checks passed
mfylcek pushed a commit to mfylcek/vllm that referenced this pull request May 19, 2026
…ect#42189)

Signed-off-by: Yifan Zong <yzong@redhat.com>
Signed-off-by: Yifan <yzong@redhat.com>
jhu960213 pushed a commit to jhu960213/vllm that referenced this pull request May 20, 2026
…ect#42189)

Signed-off-by: Yifan Zong <yzong@redhat.com>
Signed-off-by: Yifan <yzong@redhat.com>
h1t35h pushed a commit to h1t35h/vllm that referenced this pull request May 21, 2026
…ect#42189)

Signed-off-by: Yifan Zong <yzong@redhat.com>
Signed-off-by: Yifan <yzong@redhat.com>
kainwinterheart pushed a commit to kainwinterheart/vllm that referenced this pull request May 30, 2026
…ect#42189)

Signed-off-by: Yifan Zong <yzong@redhat.com>
Signed-off-by: Yifan <yzong@redhat.com>
(cherry picked from commit 6ff7405)
mvanhorn pushed a commit to mvanhorn/vllm that referenced this pull request Jun 4, 2026
…ect#42189)

Signed-off-by: Yifan Zong <yzong@redhat.com>
Signed-off-by: Yifan <yzong@redhat.com>
Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
knight0528 pushed a commit to knight0528/vllm that referenced this pull request Jun 8, 2026
…ect#42189)

Signed-off-by: Yifan Zong <yzong@redhat.com>
Signed-off-by: Yifan <yzong@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working frontend ready ONLY add when PR is ready to merge/full CI is needed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants