server : merge contiguous Responses input items into a single assistant message#19773
Conversation
Tested with codex & responses API:
For example: only a minor discrepancy remains with respect to OpenRouter; the two generated newlines of the tool-call preambles are not part of the […]

Thanks for the diligence/pushing back!! Too soon to tell, but model quality might have improved / feels a bit better than the non-merging fix where we just didn't trim assistant content (i.e. preserved the generated newlines).

btw: just noticed that parallel tool calls also seem to work (#19765); very nice! :)
@bfroemel did you check whether reasoning content is working correctly and being retained for non-gpt-oss models with interleaved thinking?
@Mushoz no; I only tested what I currently use. I've kind of settled on codex for now and haven't looked much at other agents in the past couple of months. Do we have other non-gpt-oss models that work (well) with a Responses API client like codex? Do we even have other mature Responses API coding agents by now? :)
Overall, I believe the responsibility remains with the client to include reasoning content in the request if the model requires it; otherwise the client can leave it out. With this PR, llama.cpp no longer removes it, regardless of whether the reasoning content is followed by tool calls. And just to be clear: this PR introduces no changes to the chat completions API; everything there stays the same.
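To make "the client includes reasoning content" concrete, here is a minimal sketch of a follow-up Responses request in which the client echoes the assistant's prior reasoning and tool call back in the `input` array. The item shapes loosely follow the Responses API; the model name, call id, and tool are illustrative, not taken from this PR.

```python
# Hypothetical follow-up request body for a Responses API server.
# The client passes the prior reasoning item back alongside the tool
# call and its output, so the server can re-render them for the template.
request = {
    "model": "gpt-oss-20b",  # illustrative model name
    "input": [
        {"type": "message", "role": "user", "content": "What time is it?"},
        {"type": "reasoning",
         "content": [{"type": "reasoning_text", "text": "Need the clock tool."}]},
        {"type": "function_call", "call_id": "call_1",
         "name": "get_time", "arguments": "{}"},
        {"type": "function_call_output", "call_id": "call_1", "output": "12:00"},
    ],
}
```

With this PR, the reasoning item above is retained and ends up in the same assistant message as the `function_call` when the chat template is applied, instead of being dropped or split.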
GPT-OSS will render the same because its template format puts every element into a separate "message". It is a 1-to-1 mapping to Responses input items, which makes sense given that OpenAI designed it that way. The difference from OpenRouter should now be cosmetic, depending on whether the client trims the content or not. I appreciate you taking the time to test this out! Glad we got to the root of the issue.
server : merge contiguous Responses input items into a single assistant message (ggml-org#19773)
* server : merge contiguous input items into a single assistant message
* cont : simplify tool call msg
* cont : reduce and combine content
* cont : fix merging content items
The Responses API endpoint constructs a separate chat completion message for each input item, except for reasoning items. This causes problems with many chat templates that expect content, reasoning, and tool calls to appear in a single assistant message.
This PR merges contiguous assistant inputs into a single message before passing them to templates.
This also preserves reasoning content that isn't coupled with a tool call. A few models, such as Ministral 3, support interleaved reasoning within regular messages. Models that don't support it typically prune the reasoning in their own chat templates.
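The merging step described above can be sketched roughly as follows. This is an illustrative Python model of the behavior, not llama.cpp's actual C++ implementation; the item and message field names are assumptions loosely based on the Responses and chat-completions shapes.

```python
def merge_assistant_items(items):
    """Collapse runs of contiguous assistant-side Responses input items
    (reasoning, assistant message, function_call) into single messages
    before they are handed to the chat template."""
    out = []
    for item in items:
        is_assistant = (
            item["type"] in ("reasoning", "function_call")
            or (item["type"] == "message" and item.get("role") == "assistant")
        )
        if not is_assistant:
            # Non-assistant items (user messages, tool outputs, ...) pass through.
            out.append({"role": item.get("role", "tool"),
                        "content": item.get("content", "")})
            continue
        if out and out[-1].get("role") == "assistant":
            msg = out[-1]  # extend the current assistant run
        else:
            msg = {"role": "assistant", "content": "",
                   "reasoning_content": "", "tool_calls": []}
            out.append(msg)
        if item["type"] == "reasoning":
            msg["reasoning_content"] += item.get("text", "")
        elif item["type"] == "function_call":
            msg["tool_calls"].append({"name": item["name"],
                                      "arguments": item["arguments"]})
        else:
            msg["content"] += item.get("content", "")
    return out
```

For example, a `reasoning` item followed by an assistant `message` and a `function_call` yields one assistant message carrying all three fields, which is what templates like gpt-oss or Ministral expect to render.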
ref: #19765 (comment)
fixes #19513
cc @bfroemel