
server : merge contiguous Responses input items into a single assistant message#19773

Merged
pwilkin merged 4 commits into ggml-org:master from aldehir:merge-response-items-to-chatcmpl on Feb 22, 2026
Conversation

@aldehir
Collaborator

@aldehir aldehir commented Feb 21, 2026

The Responses API endpoint constructs separate chat completion messages for each input item, except for reasoning. This causes problems with many chat templates that expect content, reasoning, and tool calls to appear in a single assistant message.

This PR merges contiguous assistant inputs into a single message before passing them to templates.

This also preserves reasoning content that isn't coupled with a tool call. A few models, such as Ministral 3, support interleaved reasoning within regular messages. Models that don't support it typically handle pruning the reasoning in their own templates.
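The merging described above can be sketched roughly as follows. This is an illustrative Python sketch, not the actual llama.cpp C++ implementation; field names follow the Responses API item shapes (`reasoning`, `message`, `function_call`), and the exact structure of reasoning content parts is an assumption.

```python
def merge_input_items(items):
    """Fold contiguous assistant-side Responses input items (reasoning,
    message, function_call) into single assistant chat messages."""
    messages = []
    for item in items:
        is_assistant = (
            item.get("role") == "assistant"
            or item["type"] in ("reasoning", "function_call")
        )
        if not is_assistant:
            # Any non-assistant item (user message, function_call_output, ...)
            # breaks the contiguous run and is passed through unchanged.
            messages.append(item)
            continue
        # Start a new merged assistant message unless the previous message
        # is already one of ours.
        if (not messages or messages[-1].get("role") != "assistant"
                or "tool_calls" not in messages[-1]):
            messages.append({"role": "assistant", "content": "",
                             "reasoning_content": "", "tool_calls": []})
        msg = messages[-1]
        if item["type"] == "reasoning":
            msg["reasoning_content"] += "".join(
                part["text"] for part in item["content"])
        elif item["type"] == "message":
            msg["content"] += "".join(
                part["text"] for part in item["content"])
        elif item["type"] == "function_call":
            msg["tool_calls"].append({
                "id": item["call_id"],
                "type": "function",
                "function": {"name": item["name"],
                             "arguments": item["arguments"]},
            })
    return messages
```

With this, a run of reasoning, an assistant message, and a tool call all land in one assistant message, which is what most chat templates expect.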

ref: #19765 (comment)
fixes #19513

cc @bfroemel

@aldehir aldehir changed the title server : merge contiguous input items into a single assistant message server : merge contiguous Responses input items into a single assistant message Feb 21, 2026
visorcraft pushed a commit to visorcraft/llama.cpp that referenced this pull request Feb 21, 2026
@bfroemel

bfroemel commented Feb 21, 2026

Tested with codex & responses API:

For example:

```
<|im_start|>assistant
Now let me look at how the `needs_follow_up` is determined:

<tool_call>
<function=exec_command>
<parameter=cmd>
grep -n "needs_follow_up" /home/b/work/codex-new/codex-rs/core/src/codex.rs | head -30
</parameter>
<parameter=justification>
Find needs_follow_up handling
</parameter>
</function>
</tool_call><|im_end|>
```

Only a minor discrepancy remains with respect to OpenRouter: the two newlines the model generates before a tool call are not part of the output_text content element in llama.cpp's response. On OpenRouter, for example, you really get back something like this:

```json
"output": [
  {
    "role": "assistant",
    "type": "message",
    "status": "incomplete",
    "content": [
      {
        "type": "output_text",
        "text": "Now let me look at `handle_output_item_done` which processes output items and determines if follow-up is needed:\n\n",
        "annotations": [],
        "logprobs": []
      }
    ],
    "id": "msg_tmp_vg3tku6apmm"
  },
  {
    "type": "function_call",
    "call_id": "call_f16c662f06654160a2f0af2f",
    "name": "exec_command",
    "arguments": "{\"cmd\": \"grep -n \\\"handle_output_item_done\\\" /home/b/work/codex-new/codex-rs/core/src/codex.rs\", \"justification\": \"Find handle_output_item_done\"}",
    "id": "call_f16c662f06654160a2f0af2f",
    "status": "completed"
  }
]
```
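The remaining difference is only trailing whitespace. A minimal illustration, using abridged stand-in strings rather than actual API output:

```python
# Hypothetical, abridged strings: OpenRouter keeps the two newlines the model
# generates before a tool call in output_text, while llama.cpp's response
# omits them. A client that trims trailing whitespace sees identical content.
openrouter_text = "Now let me look at handle_output_item_done:\n\n"
llamacpp_text = "Now let me look at handle_output_item_done:"
```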

Thanks for the diligence and for pushing back! Too soon to tell, but model quality might have improved; it feels a bit better than the non-merging fix, where we simply did not trim assistant content (i.e., preserved the generated newlines).

btw: just noticed that parallel tool calls also seem to work (#19765 ); very nice! :)

```sh
curl http://<llama.cpp host>/v1/responses -d '{
    "model": "current",
    "instructions": "Issue multiple tool calls in a single turn if possible to reduce latency and improve user experience!",
    "input": [
      {
        "type": "message",
        "role": "user",
        "content": [{"type": "input_text",
                     "text": "what'\''s the weather in Paris and London?"}]
      }
    ],
    "tools": [
      {
        "type": "function",
        "name": "get_weather",
        "description": "Returns the weather of location.",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {"type": "string"}
          },
          "required": ["location"],
          "additionalProperties": false
        }
      }
    ],
    "stream": false,
    "parallel_tool_calls": true
  }'
```
```json
{
  "completed_at": 1771668166,
  "created_at": 1771668166,
  "id": "resp_HJZ7SNdrqK12w6i4TUJIqHkWDoe0BlgO",
  "model": "qwen3-coder-next",
  "object": "response",
  "output": [
    {"type": "function_call", "status": "completed", "arguments": "{\"location\":\"Paris\"}", "call_id": "fc_6ZZb88GAiND0rfZflEv52G8yqAHESV48", "name": "get_weather"},
    {"type": "function_call", "status": "completed", "arguments": "{\"location\":\"London\"}", "call_id": "fc_uzAq3J1ikJcxlu1YG21AcELop1Z4JNGd", "name": "get_weather"}
  ],
  "status": "completed",
  "usage": {"input_tokens": 291, "output_tokens": 44, "total_tokens": 335}
}
```
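A client consumes such a parallel tool-call response by collecting the `function_call` items and decoding their `arguments` strings. A minimal sketch (the response dict is abridged to the fields used, and the `fc_1`/`fc_2` call ids are made up):

```python
import json

# Abridged stand-in for the Responses reply above: two parallel tool calls.
response = {
    "output": [
        {"type": "function_call", "status": "completed",
         "arguments": "{\"location\":\"Paris\"}",
         "call_id": "fc_1", "name": "get_weather"},
        {"type": "function_call", "status": "completed",
         "arguments": "{\"location\":\"London\"}",
         "call_id": "fc_2", "name": "get_weather"},
    ],
}

# Each function_call item carries its arguments as a JSON-encoded string.
calls = [item for item in response["output"]
         if item["type"] == "function_call"]
locations = [json.loads(call["arguments"])["location"] for call in calls]
```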

@Mushoz

Mushoz commented Feb 21, 2026

@bfroemel did you try to see if reasoning content is correctly working and being retained for non-gpt-oss models with interleaved thinking?

@bfroemel

bfroemel commented Feb 21, 2026

@Mushoz no; only tested what I currently use. I kind of settled with codex for now and didn't look a lot at other agents in the past couple of months.

Do we have other non-gpt-oss models that work (well) with a responses API client like codex? Do we even have other mature responses API coding agents by now? :)

Looking at the code, this PR might fix another subtle issue ( https://github.com/ggml-org/llama.cpp/pull/19773/changes#diff-562aee47dd99ff76cbd0c2a3e9b98d30cdf4d0111c0f09f345130e1f096d7ef4L1299 ), since it no longer removes any reasoning content. Regarding gpt-oss models: the template discards reasoning content only if a final message (one without tool calls) is generated in an assistant turn; otherwise the interleaved reasoning is kept. Edit: removing that reasoning content wasn't relevant for gpt-oss anyway, because the prompt ignores messages that have only reasoning content without tool calls.

Overall, I believe the responsibility remains with the client to include reasoning content in the request if the model requires it; otherwise the client can leave it out. With this PR, llama.cpp no longer removes it, regardless of whether the reasoning content is followed by tool calls.
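A follow-up request from such a client might look like this. This is a hypothetical sketch with made-up ids and texts, with field names modeled on the Responses API: the client echoes the model's reasoning item back alongside the tool call and its output, and llama.cpp now keeps the reasoning instead of stripping it.

```python
import json

# Hypothetical follow-up turn: reasoning, tool call, and tool output are all
# replayed as input items so the chat template sees the full assistant turn.
follow_up = {
    "model": "current",
    "input": [
        {"type": "message", "role": "user",
         "content": [{"type": "input_text",
                      "text": "what's the weather in Paris?"}]},
        {"type": "reasoning",
         "content": [{"type": "reasoning_text",
                      "text": "User wants Paris weather; call get_weather."}]},
        {"type": "function_call", "call_id": "fc_123",
         "name": "get_weather", "arguments": "{\"location\": \"Paris\"}"},
        {"type": "function_call_output", "call_id": "fc_123",
         "output": "15 C, cloudy"},
    ],
}

payload = json.dumps(follow_up)  # body for POST /v1/responses
```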

And just to make sure: this PR introduces no changes to the chat completions API; everything there stays the same.

@aldehir
Collaborator Author

aldehir commented Feb 21, 2026

GPT-OSS will render the same because its template format puts every element into a separate "message". It is a 1-to-1 mapping to Responses input items, which is presumably why OpenAI designed it that way.

The difference from OpenRouter should now be cosmetic, depending on whether the client trims the content.

I appreciate you taking the time to test this out! Glad we got to the root of the issue.

@pwilkin pwilkin merged commit 34ec1c3 into ggml-org:master Feb 22, 2026
77 of 78 checks passed
liparetejas pushed a commit to liparetejas/llama.cpp that referenced this pull request Feb 23, 2026
…nt message (ggml-org#19773)

* server : merge contiguous input items into a single assistant message

* cont : simplify tool call msg

* cont : reduce and combine content

* cont : fix merging content items


Successfully merging this pull request may close these issues.

Eval bug: Qwen3-Coder-Next generates prematurely EOS instead of tool call(/continued response)
