
server : merge contiguous Responses input items into a single assistant message#19773

Merged
pwilkin merged 4 commits into ggml-org:master from aldehir:merge-response-items-to-chatcmpl on Feb 22, 2026
Conversation

@aldehir
Collaborator

@aldehir aldehir commented Feb 21, 2026

The Responses API endpoint constructs separate chat completion messages for each input item, except for reasoning. This causes problems with many chat templates that expect content, reasoning, and tool calls to appear in a single assistant message.

This PR merges contiguous assistant inputs into a single message before passing them to templates.

This also preserves reasoning content that isn't coupled with a tool call. A few models, such as Ministral 3, support interleaved reasoning within regular messages. Models that don't support it typically handle pruning the reasoning in their own templates.
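The merging described above can be sketched roughly as follows. This is an illustrative Python sketch, not the actual llama.cpp C++ implementation; field names follow the Responses API item shapes (`reasoning`, `message`, `function_call`), and the exact structure of reasoning content parts is an assumption.

```python
def merge_input_items(items):
    """Fold contiguous assistant-side Responses input items (reasoning,
    message, function_call) into single assistant chat messages."""
    messages = []
    for item in items:
        is_assistant = (
            item.get("role") == "assistant"
            or item["type"] in ("reasoning", "function_call")
        )
        if not is_assistant:
            # Any non-assistant item (user message, function_call_output, ...)
            # breaks the contiguous run and is passed through unchanged.
            messages.append(item)
            continue
        # Start a new merged assistant message unless the previous message
        # is already one of ours.
        if (not messages or messages[-1].get("role") != "assistant"
                or "tool_calls" not in messages[-1]):
            messages.append({"role": "assistant", "content": "",
                             "reasoning_content": "", "tool_calls": []})
        msg = messages[-1]
        if item["type"] == "reasoning":
            msg["reasoning_content"] += "".join(
                part["text"] for part in item["content"])
        elif item["type"] == "message":
            msg["content"] += "".join(
                part["text"] for part in item["content"])
        elif item["type"] == "function_call":
            msg["tool_calls"].append({
                "id": item["call_id"],
                "type": "function",
                "function": {"name": item["name"],
                             "arguments": item["arguments"]},
            })
    return messages
```

With this, a run of reasoning, an assistant message, and a tool call all land in one assistant message, which is what most chat templates expect.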

ref: #19765 (comment)
fixes #19513

cc @bfroemel

@aldehir aldehir changed the title server : merge contiguous input items into a single assistant message server : merge contiguous Responses input items into a single assistant message Feb 21, 2026
visorcraft pushed a commit to visorcraft/llama.cpp that referenced this pull request Feb 21, 2026
@bfroemel

bfroemel commented Feb 21, 2026

Tested with codex & responses API:

For example:

```
<|im_start|>assistant
Now let me look at how the `needs_follow_up` is determined:

<tool_call>
<function=exec_command>
<parameter=cmd>
grep -n "needs_follow_up" /home/b/work/codex-new/codex-rs/core/src/codex.rs | head -30
</parameter>
<parameter=justification>
Find needs_follow_up handling
</parameter>
</function>
</tool_call><|im_end|>
```

Only a minor discrepancy remains with respect to OpenRouter: the two newlines the model generates before a tool call are not part of the output_text content element in llama.cpp's response. On OpenRouter, for example, you really get back something like this:

```json
"output": [
  {
    "role": "assistant",
    "type": "message",
    "status": "incomplete",
    "content": [
      {
        "type": "output_text",
        "text": "Now let me look at `handle_output_item_done` which processes output items and determines if follow-up is needed:\n\n",
        "annotations": [],
        "logprobs": []
      }
    ],
    "id": "msg_tmp_vg3tku6apmm"
  },
  {
    "type": "function_call",
    "call_id": "call_f16c662f06654160a2f0af2f",
    "name": "exec_command",
    "arguments": "{\"cmd\": \"grep -n \\\"handle_output_item_done\\\" /home/b/work/codex-new/codex-rs/core/src/codex.rs\", \"justification\": \"Find handle_output_item_done\"}",
    "id": "call_f16c662f06654160a2f0af2f",
    "status": "completed"
  }
]
```
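The remaining difference is only trailing whitespace. A minimal illustration, using abridged stand-in strings rather than actual API output:

```python
# Hypothetical, abridged strings: OpenRouter keeps the two newlines the model
# generates before a tool call in output_text, while llama.cpp's response
# omits them. A client that trims trailing whitespace sees identical content.
openrouter_text = "Now let me look at handle_output_item_done:\n\n"
llamacpp_text = "Now let me look at handle_output_item_done:"
```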

Thanks for the diligence and for pushing back! Too soon to tell, but model quality might have improved; it feels a bit better than the non-merging fix, where we simply did not trim assistant content (i.e., preserved the generated newlines).

btw: just noticed that parallel tool calls also seem to work (#19765 ); very nice! :)

```sh
curl http://<llama.cpp host>/v1/responses -d '{
    "model": "current",
    "instructions": "Issue multiple tool calls in a single turn if possible to reduce latency and improve user experience!",
    "input": [
      {
        "type": "message",
        "role": "user",
        "content": [{"type": "input_text",
                     "text": "what'\''s the weather in Paris and London?"}]
      }
    ],
    "tools": [
      {
        "type": "function",
        "name": "get_weather",
        "description": "Returns the weather of location.",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {"type": "string"}
          },
          "required": ["location"],
          "additionalProperties": false
        }
      }
    ],
    "stream": false,
    "parallel_tool_calls": true
  }'
```
```json
{
  "completed_at": 1771668166,
  "created_at": 1771668166,
  "id": "resp_HJZ7SNdrqK12w6i4TUJIqHkWDoe0BlgO",
  "model": "qwen3-coder-next",
  "object": "response",
  "output": [
    {"type": "function_call", "status": "completed", "arguments": "{\"location\":\"Paris\"}", "call_id": "fc_6ZZb88GAiND0rfZflEv52G8yqAHESV48", "name": "get_weather"},
    {"type": "function_call", "status": "completed", "arguments": "{\"location\":\"London\"}", "call_id": "fc_uzAq3J1ikJcxlu1YG21AcELop1Z4JNGd", "name": "get_weather"}
  ],
  "status": "completed",
  "usage": {"input_tokens": 291, "output_tokens": 44, "total_tokens": 335}
}
```
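A client consumes such a parallel tool-call response by collecting the `function_call` items and decoding their `arguments` strings. A minimal sketch (the response dict is abridged to the fields used, and the `fc_1`/`fc_2` call ids are made up):

```python
import json

# Abridged stand-in for the Responses reply above: two parallel tool calls.
response = {
    "output": [
        {"type": "function_call", "status": "completed",
         "arguments": "{\"location\":\"Paris\"}",
         "call_id": "fc_1", "name": "get_weather"},
        {"type": "function_call", "status": "completed",
         "arguments": "{\"location\":\"London\"}",
         "call_id": "fc_2", "name": "get_weather"},
    ],
}

# Each function_call item carries its arguments as a JSON-encoded string.
calls = [item for item in response["output"]
         if item["type"] == "function_call"]
locations = [json.loads(call["arguments"])["location"] for call in calls]
```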

@Mushoz

Mushoz commented Feb 21, 2026

@bfroemel did you try to see if reasoning content is correctly working and being retained for non-gpt-oss models with interleaved thinking?

@bfroemel

bfroemel commented Feb 21, 2026

@Mushoz no; only tested what I currently use. I kind of settled with codex for now and didn't look a lot at other agents in the past couple of months.

Do we have other non-gpt-oss models that work (well) with a responses API client like codex? Do we even have other mature responses API coding agents by now? :)

Looking at the code, this PR might fix another subtle issue ( https://github.com/ggml-org/llama.cpp/pull/19773/changes#diff-562aee47dd99ff76cbd0c2a3e9b98d30cdf4d0111c0f09f345130e1f096d7ef4L1299 ), since it no longer removes any reasoning content. Regarding gpt-oss models: the template discards reasoning content only if a final message (one without tool calls) is generated in an assistant turn; otherwise the interleaved reasoning is kept. Edit: removing that reasoning content wasn't relevant for gpt-oss anyway, because the prompt ignores messages that have only reasoning content without tool calls.

Overall, I believe the responsibility remains with the client to include reasoning content in the request if the model requires it; otherwise the client can leave it out. With this PR, llama.cpp no longer removes it, regardless of whether the reasoning content is followed by tool calls.
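A follow-up request from such a client might look like this. This is a hypothetical sketch with made-up ids and texts, with field names modeled on the Responses API: the client echoes the model's reasoning item back alongside the tool call and its output, and llama.cpp now keeps the reasoning instead of stripping it.

```python
import json

# Hypothetical follow-up turn: reasoning, tool call, and tool output are all
# replayed as input items so the chat template sees the full assistant turn.
follow_up = {
    "model": "current",
    "input": [
        {"type": "message", "role": "user",
         "content": [{"type": "input_text",
                      "text": "what's the weather in Paris?"}]},
        {"type": "reasoning",
         "content": [{"type": "reasoning_text",
                      "text": "User wants Paris weather; call get_weather."}]},
        {"type": "function_call", "call_id": "fc_123",
         "name": "get_weather", "arguments": "{\"location\": \"Paris\"}"},
        {"type": "function_call_output", "call_id": "fc_123",
         "output": "15 C, cloudy"},
    ],
}

payload = json.dumps(follow_up)  # body for POST /v1/responses
```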

And just to make sure: this PR introduces no changes to the chat completions API; everything there stays the same.

@aldehir
Collaborator Author

aldehir commented Feb 21, 2026

GPT-OSS will render the same because its template format puts every element into a separate "message". It is a 1-to-1 mapping to Responses input items, which is presumably why OpenAI designed it that way.

The difference from OpenRouter should now be cosmetic, depending on whether the client trims the content.

I appreciate you taking the time to test this out! Glad we got to the root of the issue.

@pwilkin pwilkin merged commit 34ec1c3 into ggml-org:master Feb 22, 2026
77 of 78 checks passed
liparetejas pushed a commit to liparetejas/llama.cpp that referenced this pull request Feb 23, 2026
…nt message (ggml-org#19773)

* server : merge contiguous input items into a single assistant message

* cont : simplify tool call msg

* cont : reduce and combine content

* cont : fix merging content items


Successfully merging this pull request may close these issues.

Eval bug: Qwen3-Coder-Next generates prematurely EOS instead of tool call(/continued response)
