[Bugfix] [Frontend] Responses API, fix merging of message and tool call#37294
[Bugfix] [Frontend] Responses API, fix merging of message and tool call#37294bfroemel wants to merge 9 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request effectively addresses two key issues: it fixes the merging of messages with tool calls and ensures the correct generation of response.output_text.done events for streaming responses that include tool calls. The changes in vllm/entrypoints/openai/responses/utils.py generalize the logic to combine a tool call with any preceding assistant message, whether it contains content, reasoning, or other tool calls, and correctly appends multiple tool calls to a single message. In vllm/entrypoints/openai/responses/serving.py, the streaming logic is improved to properly accumulate content that appears before a tool call and emit the appropriate 'done' events, which fixes a bug in streaming responses. The new and updated tests in tests/entrypoints/openai/test_serving_responses.py and tests/entrypoints/test_responses_utils.py thoroughly validate these fixes and enhancements. The code is well-structured and the changes significantly improve the correctness and robustness of the Responses API.
|
There was an additional issue regarding streaming and properly including accumulated model output text in response.output_text.done events. I verified with Qwen3.5-27B (non-thinking/instruct mode only and without use of a reasoning parser) and the actually rendered prompts are looking good now. I try to also verify the thinking mode later, but there might be additional issues (not caused by my changes; strangely normal output text before tool calls seem to end up as reasoning output items if the qwen3 reasoning parser is configured? but need to investigate further). Kindly requesting reviews :) |
Signed-off-by: Bernhard Froemel <bf@ctsw.at>
…e when streaming responses that include tool calls Signed-off-by: Bernhard Froemel <bf@ctsw.at>
Signed-off-by: Bernhard Froemel <bf@ctsw.at>
4dc87c6 to
3ea9cfd
Compare
|
Hi @bfroemel, the pre-commit checks have failed. Please run: uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
Signed-off-by: Bernhard Froemel <bf@ctsw.at>
| ): | ||
| """Append tool call to previous message if applicable. | ||
|
|
||
| Many models treat tool calls, content, and reasoning as a single message. |
There was a problem hiding this comment.
if the model doesn't treat them as a single message, are we going to rely on tokenizer.apply_chat_template() to split the now [tool call, content, reasoning] into separate messages?
There was a problem hiding this comment.
Good point! So, I agree - I think it is reasonable to assume that a model can ensure separation of messages with its chat template.
| elif delta_message.tool_calls[0].function.name: | ||
| # send done with current content part | ||
| # and add new function call item | ||
| # Collect accumulated content from previous delta messages |
There was a problem hiding this comment.
thanks for this fix too! is it related to the original issue #37167? if not could we separate this into another PR?
There was a problem hiding this comment.
|
status update
fyi, I was a bit confused regarding the Anyway, chat-template-kwargs should be imo dealt with as similar/consistent with the chat completions API as possible, but out of scope here. I can open another issue about that. Manual end-to-end tests of this PR (just with a slightly modified codex to deal with model quirks) under forced non-thinking conditions are looking good. Manual e2e with reasoning parts is still showing issues:
|
…item.done when streaming responses that include tool calls" This reverts commit d0f8298.
|
This pull request has merge conflicts that must be resolved before it can be |
Because the Responses API hasn’t implemented this feature yet. The reason it hasn’t been implemented is that we previously wanted to wait and see whether OpenAI would release a similar field. |
…gcombining Signed-off-by: Bernhard Froemel <bf@ctsw.at>
…age + updated test cases Signed-off-by: Bernhard Froemel <bf@ctsw.at>
…gcombining Signed-off-by: Bernhard Froemel <bf@ctsw.at>
…gcombining Signed-off-by: Bernhard Froemel <bf@ctsw.at>
|
@qandrew any comments/requests or concerns? can this move forward? ;) Thanks! |
|
This fixes Gemma 4 with responses API and parallel tool calls as well |
|
:-/ Why'd you close this? Just because nobody was looking at it? It'd be great to fix the responses API so it works with more models |
Appreciating your interest and intent, but my resources are limited and I had to concede :/ Feel free to take anything useful from this PR and submit your own. I didn't feel that this was moving forward + need to have a (half-way) stable and future-proof local environment soon. Currently my only need for the responses API was OpenAI's codex harness and either using another harness, or adding the chat completions API (back) to codex solves all my problems. |
Purpose
Fixes #37167
Overall, I am aiming for chat completions API and responses API consistency to enable existing responses API clients (only aware of openai's
codexright now) to work with most vllm-hosted models that are not openai-harmony models.Test Plan
Tests have been updated + extended.
Test Result
Details
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.