[Bugfix] [Frontend] Responses API, fix merging of message and tool call#37294

Closed
bfroemel wants to merge 9 commits into
vllm-project:mainfrom
bfroemel-ai:pr-fix-respapi-prevmsgcombining

@bfroemel bfroemel commented Mar 17, 2026


Purpose

Fixes #37167

Overall, I am aiming for consistency between the chat completions API and the responses API, so that existing responses API clients (I am only aware of OpenAI's codex right now) work with most vllm-hosted models that are not openai-harmony models.

Test Plan

Tests have been updated + extended.

Test Result

Details
 # ./pytest tests/entrypoints/openai/test_serving_responses.py -v -x 
============================================================================================================================================================================ test session starts ============================================================================================================================================================================
platform linux -- Python 3.12.10, pytest-9.0.2, pluggy-1.6.0 -- /home/b/work/vllm/vllm/venv/bin/python3.12
cachedir: .pytest_cache
rootdir: /home/b/work/vllm/vllm
configfile: pyproject.toml
plugins: anyio-4.10.0, asyncio-1.3.0
asyncio: mode=Mode.STRICT, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 18 items                                                                                                                                                                                                                                                                                                                                                          

tests/entrypoints/openai/test_serving_responses.py::test_extract_tool_types PASSED                                                                                                                                                                                                                                                                                    [  5%]
tests/entrypoints/openai/test_serving_responses.py::TestInitializeToolSessions::test_initialize_tool_sessions PASSED                                                                                                                                                                                                                                                  [ 11%]
tests/entrypoints/openai/test_serving_responses.py::TestInitializeToolSessions::test_validate_create_responses_input PASSED                                                                                                                                                                                                                                           [ 16%]
tests/entrypoints/openai/test_serving_responses.py::TestValidateGeneratorInput::test_validate_generator_input PASSED                                                                                                                                                                                                                                                  [ 22%]
tests/entrypoints/openai/test_serving_responses.py::test_reasoning_tokens_counted_for_text_reasoning_model PASSED                                                                                                                                                                                                                                                     [ 27%]
tests/entrypoints/openai/test_serving_responses.py::TestExtractAllowedToolsFromMcpRequests::test_extract_allowed_tools_basic_formats PASSED                                                                                                                                                                                                                           [ 33%]
tests/entrypoints/openai/test_serving_responses.py::TestExtractAllowedToolsFromMcpRequests::test_extract_allowed_tools_star_normalization PASSED                                                                                                                                                                                                                      [ 38%]
tests/entrypoints/openai/test_serving_responses.py::TestExtractAllowedToolsFromMcpRequests::test_extract_allowed_tools_filters_non_mcp PASSED                                                                                                                                                                                                                         [ 44%]
tests/entrypoints/openai/test_serving_responses.py::TestHarmonyPreambleStreaming::test_preamble_delta_emits_text_events PASSED                                                                                                                                                                                                                                        [ 50%]
tests/entrypoints/openai/test_serving_responses.py::TestHarmonyPreambleStreaming::test_preamble_delta_second_token_no_added PASSED                                                                                                                                                                                                                                    [ 55%]
tests/entrypoints/openai/test_serving_responses.py::TestHarmonyPreambleStreaming::test_commentary_with_function_recipient_not_preamble PASSED                                                                                                                                                                                                                         [ 61%]
tests/entrypoints/openai/test_serving_responses.py::TestHarmonyPreambleStreaming::test_preamble_done_emits_text_done_events PASSED                                                                                                                                                                                                                                    [ 66%]
tests/entrypoints/openai/test_serving_responses.py::TestHarmonyPreambleStreaming::test_commentary_with_recipient_no_preamble_done PASSED                                                                                                                                                                                                                              [ 72%]
tests/entrypoints/openai/test_serving_responses.py::TestStreamingReasoningToContentTransition::test_mixed_delta_reasoning_and_content_emits_reasoning_delta PASSED                                                                                                                                                                                                    [ 77%]
tests/entrypoints/openai/test_serving_responses.py::TestStreamingReasoningToContentTransition::test_transition_without_mixed_delta_no_extra_reasoning_event PASSED                                                                                                                                                                                                    [ 83%]
tests/entrypoints/openai/test_serving_responses.py::TestStreamingReasoningToContentTransition::test_reasoning_only_stream_no_content PASSED                                                                                                                                                                                                                           [ 88%]
tests/entrypoints/openai/test_serving_responses.py::TestContentBeforeToolCall::test_content_before_tool_call_done_event_has_content PASSED                                                                                                                                                                                                                            [ 94%]
tests/entrypoints/openai/test_serving_responses.py::TestContentBeforeToolCall::test_no_content_before_tool_call_empty_done PASSED                                                                                                                                                                                                                                     [100%]

============================================================================================================================================================================= warnings summary ==============================================================================================================================================================================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
====================================================================================================================================================================== 18 passed, 2 warnings in 5.93s =======================================================================================================================================================================

# ./pytest tests/entrypoints/test_responses_utils.py -v -x 
============================================================================================================================================================================ test session starts ============================================================================================================================================================================
platform linux -- Python 3.12.10, pytest-9.0.2, pluggy-1.6.0 -- /home/b/work/vllm/vllm/venv/bin/python3.12
cachedir: .pytest_cache
rootdir: /home/b/work/vllm/vllm
configfile: pyproject.toml
plugins: anyio-4.10.0, asyncio-1.3.0
asyncio: mode=Mode.STRICT, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 33 items                                                                                                                                                                                                                                                                                                                                                          

tests/entrypoints/test_responses_utils.py::TestResponsesUtils::test_convert_tool_responses_to_completions_format PASSED                                                                                                                                                                                                                                               [  3%]
tests/entrypoints/test_responses_utils.py::TestResponsesUtils::test_construct_chat_messages_with_tool_call PASSED                                                                                                                                                                                                                                                     [  6%]
tests/entrypoints/test_responses_utils.py::TestResponsesUtils::test_construct_single_message_from_response_item PASSED                                                                                                                                                                                                                                                [  9%]
tests/entrypoints/test_responses_utils.py::TestReasoningItemContentPriority::test_content_preferred_over_summary PASSED                                                                                                                                                                                                                                               [ 12%]
tests/entrypoints/test_responses_utils.py::TestReasoningItemContentPriority::test_content_only PASSED                                                                                                                                                                                                                                                                 [ 15%]
tests/entrypoints/test_responses_utils.py::TestReasoningItemContentPriority::test_summary_fallback_when_no_content PASSED                                                                                                                                                                                                                                             [ 18%]
tests/entrypoints/test_responses_utils.py::TestReasoningItemContentPriority::test_summary_fallback_when_content_empty PASSED                                                                                                                                                                                                                                          [ 21%]
tests/entrypoints/test_responses_utils.py::TestReasoningItemContentPriority::test_neither_content_nor_summary PASSED                                                                                                                                                                                                                                                  [ 24%]
tests/entrypoints/test_responses_utils.py::TestReasoningItemContentPriority::test_encrypted_content_raises PASSED                                                                                                                                                                                                                                                     [ 27%]
tests/entrypoints/test_responses_utils.py::TestReasoningItemContentPriority::test_summary_with_multiple_entries_uses_first PASSED                                                                                                                                                                                                                                     [ 30%]
tests/entrypoints/test_responses_utils.py::TestReasoningItemContentPriority::test_no_warning_when_content_used PASSED                                                                                                                                                                                                                                                 [ 33%]
tests/entrypoints/test_responses_utils.py::TestShouldContinueFinalMessage::test_string_input_returns_false PASSED                                                                                                                                                                                                                                                     [ 36%]
tests/entrypoints/test_responses_utils.py::TestShouldContinueFinalMessage::test_empty_list_returns_false PASSED                                                                                                                                                                                                                                                       [ 39%]
tests/entrypoints/test_responses_utils.py::TestShouldContinueFinalMessage::test_completed_message_returns_false PASSED                                                                                                                                                                                                                                                [ 42%]
tests/entrypoints/test_responses_utils.py::TestShouldContinueFinalMessage::test_in_progress_message_returns_true PASSED                                                                                                                                                                                                                                               [ 45%]
tests/entrypoints/test_responses_utils.py::TestShouldContinueFinalMessage::test_incomplete_message_returns_true PASSED                                                                                                                                                                                                                                                [ 48%]
tests/entrypoints/test_responses_utils.py::TestShouldContinueFinalMessage::test_in_progress_reasoning_returns_true PASSED                                                                                                                                                                                                                                             [ 51%]
tests/entrypoints/test_responses_utils.py::TestShouldContinueFinalMessage::test_incomplete_reasoning_returns_true PASSED                                                                                                                                                                                                                                              [ 54%]
tests/entrypoints/test_responses_utils.py::TestShouldContinueFinalMessage::test_completed_reasoning_returns_false PASSED                                                                                                                                                                                                                                              [ 57%]
tests/entrypoints/test_responses_utils.py::TestShouldContinueFinalMessage::test_reasoning_with_none_status_returns_false PASSED                                                                                                                                                                                                                                       [ 60%]
tests/entrypoints/test_responses_utils.py::TestShouldContinueFinalMessage::test_only_last_item_matters PASSED                                                                                                                                                                                                                                                         [ 63%]
tests/entrypoints/test_responses_utils.py::TestShouldContinueFinalMessage::test_tool_call_returns_false PASSED                                                                                                                                                                                                                                                        [ 66%]
tests/entrypoints/test_responses_utils.py::TestShouldContinueFinalMessage::test_dict_in_progress_message_returns_true PASSED                                                                                                                                                                                                                                          [ 69%]
tests/entrypoints/test_responses_utils.py::TestShouldContinueFinalMessage::test_dict_incomplete_message_returns_true PASSED                                                                                                                                                                                                                                           [ 72%]
tests/entrypoints/test_responses_utils.py::TestShouldContinueFinalMessage::test_dict_completed_message_returns_false PASSED                                                                                                                                                                                                                                           [ 75%]
tests/entrypoints/test_responses_utils.py::TestShouldContinueFinalMessage::test_dict_reasoning_in_progress_returns_true PASSED                                                                                                                                                                                                                                        [ 78%]
tests/entrypoints/test_responses_utils.py::TestShouldContinueFinalMessage::test_dict_without_status_returns_false PASSED                                                                                                                                                                                                                                              [ 81%]
tests/entrypoints/test_responses_utils.py::TestShouldContinueFinalMessage::test_dict_with_none_status_returns_false PASSED                                                                                                                                                                                                                                            [ 84%]
tests/entrypoints/test_responses_utils.py::TestMaybeCombinePrevmsgAndToolCall::test_combines_reasoning_and_tool_call PASSED                                                                                                                                                                                                                                           [ 87%]
tests/entrypoints/test_responses_utils.py::TestMaybeCombinePrevmsgAndToolCall::test_returns_none_for_non_function_tool_call_type PASSED                                                                                                                                                                                                                               [ 90%]
tests/entrypoints/test_responses_utils.py::TestMaybeCombinePrevmsgAndToolCall::test_combines_content_and_tool_call PASSED                                                                                                                                                                                                                                             [ 93%]
tests/entrypoints/test_responses_utils.py::TestMaybeCombinePrevmsgAndToolCall::test_appends_multiple_tool_calls PASSED                                                                                                                                                                                                                                                [ 96%]
tests/entrypoints/test_responses_utils.py::TestMaybeCombinePrevmsgAndToolCall::test_combines_three_tool_calls PASSED                                                                                                                                                                                                                                                  [100%]

============================================================================================================================================================================= warnings summary ==============================================================================================================================================================================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
====================================================================================================================================================================== 33 passed, 2 warnings in 10.45s ======================================================================================================================================================================

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify Bot added the frontend label Mar 17, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request effectively addresses two key issues: it fixes the merging of messages with tool calls and ensures the correct generation of response.output_text.done events for streaming responses that include tool calls. The changes in vllm/entrypoints/openai/responses/utils.py generalize the logic to combine a tool call with any preceding assistant message, whether it contains content, reasoning, or other tool calls, and correctly appends multiple tool calls to a single message. In vllm/entrypoints/openai/responses/serving.py, the streaming logic is improved to properly accumulate content that appears before a tool call and emit the appropriate 'done' events, which fixes a bug in streaming responses. The new and updated tests in tests/entrypoints/openai/test_serving_responses.py and tests/entrypoints/test_responses_utils.py thoroughly validate these fixes and enhancements. The code is well-structured and the changes significantly improve the correctness and robustness of the Responses API.
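
A minimal sketch of the merging behavior described in this review, assuming chat messages are plain dicts in chat completions format. The helper name `combine_with_previous` and the dict layout are hypothetical and are not taken from vllm/entrypoints/openai/responses/utils.py.

```python
from typing import Any

Message = dict[str, Any]


def combine_with_previous(messages: list[Message], call: dict[str, str]) -> None:
    """Attach a function call to the preceding assistant message, if there is one.

    Hypothetical helper for illustration only; it mutates `messages` in place.
    """
    tool_call = {
        "id": call["call_id"],
        "type": "function",
        "function": {"name": call["name"], "arguments": call["arguments"]},
    }
    prev = messages[-1] if messages else None
    if prev is not None and prev.get("role") == "assistant":
        # Content, reasoning, and earlier tool calls stay on the same message.
        prev.setdefault("tool_calls", []).append(tool_call)
    else:
        # No assistant message to merge into: start a new one.
        messages.append(
            {"role": "assistant", "content": None, "tool_calls": [tool_call]}
        )


messages: list[Message] = [
    {"role": "user", "content": "List the files."},
    {"role": "assistant", "content": "Let me check."},
]
combine_with_previous(messages, {"call_id": "c1", "name": "ls", "arguments": "{}"})
combine_with_previous(messages, {"call_id": "c2", "name": "pwd", "arguments": "{}"})
```

After both calls, the conversation still has two messages, and the assistant message carries its original content plus both tool calls.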

@bfroemel

Author

There was an additional streaming issue: the accumulated model output text was not properly included in response.output_text.done events. I verified the fix with Qwen3.5-27B (non-thinking/instruct mode only, without a reasoning parser), and the rendered prompts look good now.
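
A rough sketch of the intended streaming behavior, assuming deltas arrive as simple dicts with either a `content` or a `tool_call_name` key. The delta layout, the simplified event payloads, and the `stream_events` generator are hypothetical; this is not the code in vllm/entrypoints/openai/responses/serving.py.

```python
from collections.abc import Iterable, Iterator
from typing import Any


def stream_events(deltas: Iterable[dict[str, Any]]) -> Iterator[dict[str, Any]]:
    """Yield simplified events; text seen before a tool call is kept for the done event."""
    buffered: list[str] = []
    for delta in deltas:
        if delta.get("content"):
            buffered.append(delta["content"])
            yield {"type": "response.output_text.delta", "delta": delta["content"]}
        elif delta.get("tool_call_name"):
            if buffered:
                # Close the text part with everything accumulated so far.
                yield {"type": "response.output_text.done", "text": "".join(buffered)}
                buffered.clear()
            yield {
                "type": "response.output_item.added",
                "item": {"type": "function_call", "name": delta["tool_call_name"]},
            }
    if buffered:
        # Stream ended without a tool call: still close the text part.
        yield {"type": "response.output_text.done", "text": "".join(buffered)}


events = list(
    stream_events(
        [
            {"content": "Let me "},
            {"content": "check."},
            {"tool_call_name": "exec_command"},
        ]
    )
)
```

In this sketch, the done event emitted before the function call item carries the full text "Let me check." rather than an empty string.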

I will try to verify the thinking mode later, but there might be additional issues that are not caused by my changes: strangely, normal output text before tool calls seems to end up in reasoning output items when the qwen3 reasoning parser is configured. I need to investigate further.

Kindly requesting reviews :)
cc: @qandrew @chaunceyjiang

Signed-off-by: Bernhard Froemel <bf@ctsw.at>
…e when streaming responses that include tool calls

Signed-off-by: Bernhard Froemel <bf@ctsw.at>
Signed-off-by: Bernhard Froemel <bf@ctsw.at>
@bfroemel bfroemel force-pushed the pr-fix-respapi-prevmsgcombining branch from 4dc87c6 to 3ea9cfd Compare March 17, 2026 12:46
@bfroemel bfroemel changed the title Responses API, fix merging of message and tool call + generate correct response.output_text.done for streaming responses with tool calls [Bugfix] [Frontend] Responses API, fix merging of message and tool call + generate correct response.output_text.done for streaming responses with tool calls Mar 17, 2026
@mergify mergify Bot added the bug Something isn't working label Mar 17, 2026

mergify Bot commented Mar 17, 2026

Contributor

Hi @bfroemel, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

Signed-off-by: Bernhard Froemel <bf@ctsw.at>
):
"""Append tool call to previous message if applicable.

Many models treat tool calls, content, and reasoning as a single message.

Contributor


If the model doesn't treat them as a single message, are we going to rely on tokenizer.apply_chat_template() to split the now-combined [tool call, content, reasoning] message into separate messages?

Author


Good point! So, I agree - I think it is reasonable to assume that a model can ensure separation of messages with its chat template.

elif delta_message.tool_calls[0].function.name:
# send done with current content part
# and add new function call item
# Collect accumulated content from previous delta messages

Contributor


Thanks for this fix too! Is it related to the original issue #37167? If not, could we separate this into another PR?

Author


Right, will separate! I think I discovered another streaming-related issue and will put it together with d0f8298 in a separate PR. (The streaming-related fixes in that upcoming PR are important for manual end-to-end tests of #37167 with an actual responses API coding agent (codex).)

bfroemel commented Mar 18, 2026

Author

status update

I will try to verify the thinking mode later, but there might be additional issues that are not caused by my changes: strangely, normal output text before tool calls seems to end up in reasoning output items when the qwen3 reasoning parser is configured. I need to investigate further.

fyi, I was a bit confused regarding the enable_thinking feature of Qwen3.5 models which worked as expected in the chat completions API, but not in the responses API. In the responses API, we currently neither pass enable_thinking from the CLI arguments (--default-chat-template-kwargs '{"enable_thinking": false}') to the reasoning parser, nor process a custom enable_thinking entry (in the extra_body dict of a responses request); the same applies to the jinja template. Hence, for example, the qwen3 reasoning parser always defaults to true, while the chat templates of different models default to either true (e.g., Qwen/Qwen3.5-27B) or false (e.g., Qwen/Qwen3.5-2B), which leads to weird behavior.

Anyway, chat-template-kwargs should, imo, be handled as consistently with the chat completions API as possible, but that is out of scope here. I can open another issue about that.


Manual end-to-end tests of this PR (just with a slightly modified codex to deal with model quirks) under forced non-thinking conditions are looking good.

Manual e2e with reasoning parts is still showing issues:

  • Normal content messages and reasoning items end up rendered in separate <|im_start|><|im_end|> blocks. Imo we should merge these as well; otherwise there is the same inconsistency between the chat completions and responses APIs as with tool calls. The corresponding changes and the overall handling of message/item merges in llama.cpp ( https://github.com/ggml-org/llama.cpp/pull/19773/changes ) point in the same direction. How do you see this, @qandrew ?
  • I also sometimes observe the effects of apparently unmerged messages in the rendered prompts, which calls for more debugging:
.
(APIServer pid=1318029) <|im_start|>assistant
(APIServer pid=1318029) <think>
(APIServer pid=1318029) Let me look at the Prompt struct.
(APIServer pid=1318029) </think>
(APIServer pid=1318029) 
(APIServer pid=1318029) <|im_end|>
(APIServer pid=1318029) <|im_start|>assistant
(APIServer pid=1318029) <think>
(APIServer pid=1318029) 
(APIServer pid=1318029) </think>
(APIServer pid=1318029) 
(APIServer pid=1318029) <tool_call>
(APIServer pid=1318029) <function=exec_command>
(APIServer pid=1318029) <parameter=cmd>
.
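
When debugging rendered prompts like the excerpt above, a small check for back-to-back assistant blocks may help. This sketch assumes the <|im_start|> / <|im_end|> markers shown in the log; `find_adjacent_assistant_blocks` is a hypothetical helper, and the sample prompt (including the user turn) is made up for illustration.

```python
import re

# One block: <|im_start|>ROLE, a newline, the body, then <|im_end|>.
BLOCK_RE = re.compile(r"<\|im_start\|>(\w+)\n(.*?)<\|im_end\|>", re.DOTALL)


def find_adjacent_assistant_blocks(prompt: str) -> list[int]:
    """Return indices of assistant blocks that directly follow another assistant block."""
    roles = [match.group(1) for match in BLOCK_RE.finditer(prompt)]
    return [
        i
        for i in range(1, len(roles))
        if roles[i] == "assistant" and roles[i - 1] == "assistant"
    ]


prompt = (
    "<|im_start|>user\nShow me the Prompt struct.<|im_end|>\n"
    "<|im_start|>assistant\n<think>\nLet me look at the Prompt struct.\n</think>\n\n<|im_end|>\n"
    "<|im_start|>assistant\n<think>\n\n</think>\n\n<tool_call>...</tool_call><|im_end|>\n"
)
adjacent = find_adjacent_assistant_blocks(prompt)
```

A non-empty result means the reasoning and the tool call were rendered as two assistant turns instead of one merged message.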

@bfroemel bfroemel changed the title [Bugfix] [Frontend] Responses API, fix merging of message and tool call + generate correct response.output_text.done for streaming responses with tool calls [Bugfix] [Frontend] Responses API, fix merging of message and tool call Mar 18, 2026
…item.done when streaming responses that include tool calls"

This reverts commit d0f8298.

mergify Bot commented Mar 20, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @bfroemel.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@chaunceyjiang

Collaborator

fyi, I was a bit confused regarding the enable_thinking feature of Qwen3.5 models which worked as expected in the chat completions API, but not in the responses API.

Because the Responses API hasn’t implemented this feature yet.

The reason it hasn’t been implemented is that we previously wanted to wait and see whether OpenAI would release a similar field.

…gcombining

Signed-off-by: Bernhard Froemel <bf@ctsw.at>
@mergify mergify Bot removed the needs-rebase label Mar 26, 2026
…age + updated test cases

Signed-off-by: Bernhard Froemel <bf@ctsw.at>

bfroemel commented Mar 26, 2026

Author

Hi @qandrew - I am currently happy with the state of this PR! Is there anything I could still do to make a merge happen? I'll keep testing and using this PR, and I will try to merge with main every couple of days.

(Together with #38227 everything I care about works very very well now :) )

…gcombining

Signed-off-by: Bernhard Froemel <bf@ctsw.at>
…gcombining

Signed-off-by: Bernhard Froemel <bf@ctsw.at>

bfroemel commented Apr 9, 2026

Author

@qandrew any comments/requests or concerns? can this move forward? ;) Thanks!

@Kimahriman

Contributor

This fixes Gemma 4 with responses API and parallel tool calls as well

@bfroemel bfroemel closed this Apr 16, 2026
@bfroemel bfroemel deleted the pr-fix-respapi-prevmsgcombining branch April 16, 2026 08:52
@Kimahriman

Contributor

:-/ Why'd you close this? Just because nobody was looking at it? It'd be great to fix the responses API so it works with more models

@bfroemel

Author

:-/ Why'd you close this?

I appreciate your interest and intent, but my resources are limited and I had to concede :/ Feel free to take anything useful from this PR and submit your own.

I didn't feel that this was moving forward, and I need a (halfway) stable and future-proof local environment soon. My only need for the responses API was OpenAI's codex harness, and either using another harness or adding the chat completions API (back) to codex solves all my problems.

Labels

bug Something isn't working frontend

Development

Successfully merging this pull request may close these issues.

[Bug]: responses API, combining of message and tool call

4 participants