
server: /v1/responses (partial)#18486

Merged
ngxson merged 25 commits into ggml-org:master from openingnow:v1_responses
Jan 21, 2026

@openingnow
Contributor

@openingnow openingnow commented Dec 30, 2025

previous PR: #18227

Conversations need to be resolved:

This PR implements:

  1. Converting Responses requests to Chat Completions requests, while preserving reasoning_content.
  2. Emitting Responses SSEs

Current caveats:

  • If there are consecutive function calls, the response.output_item.added event will not be emitted for the latter ones.
  • All response.output_item.done events are generated at the end (i.e., when there is both reasoning and a function call, response.function_call_arguments.delta is created before response.output_item.done(of reasoning))

These issues arise from server_task_result_cmpl_partial::update and do not seem to be a problem for codex-cli, since it does not check the order of events.


Here's a walkthrough of how reasoning text is handled. Let's start with the codex prompt 'Explain this repo in one sentence'.

# config.toml
# match model name github.com/openai/codex/blob/0a568a/codex-rs/core/src/models_manager/model_info.rs#L116
model = "gpt-oss_local_gguf"
model_provider = "llama_cpp"

[model_providers.llama_cpp]
name = "llama_cpp API"
base_url = "http://127.0.0.1:8080/v1"
wire_api = "responses"
stream_idle_timeout_ms = 10000000

First request (Responses format):

{"model":"gpt-oss_local_gguf","instructions":"You are a ...","input":[
    {"type":"message","role":"developer","content":[{"type":"input_text","text":"<permissions instructions>..."}]},
    {"type":"message","role":"user","content":[{"type":"input_text","text":"# AGENTS.md instructions for ..."}]},
    {"type":"message","role":"user","content":[{"type":"input_text","text":"<environment_context>..."}]},
    {"type":"message","role":"user","content":[{"type":"input_text","text":"Explain this repo in one sentence"}]}
],"tools":["..."],"tool_choice":"auto","parallel_tool_calls":false,"reasoning":null,"store":false,"stream":true,"include":[],"prompt_cache_key":"019bd4e7-1366-7571-93a6-f1b960ee1c59"}

convert_responses_to_chatcmpl converts this into the Chat Completions format:

{"model":"gpt-oss_local_gguf","tool_choice":"auto","parallel_tool_calls":false,"reasoning":null,"store":false,"stream":true,"include":[],"prompt_cache_key":"019bd4e7-1366-7571-93a6-f1b960ee1c59","messages":[
    {"role":"system","content":"You are a ..."},
    {"role":"developer","content":[{"text":"<permissions instructions>...","type":"text"}]},
    {"role":"user","content":[{"text":"# AGENTS.md instructions for ...","type":"text"}]},
    {"role":"user","content":[{"text":"<environment_context>...","type":"text"}]},
    {"role":"user","content":[{"text":"Explain this repo in one sentence","type":"text"}]}
],"tools":["..."]}

generated prompt:

<|start|>system<|message|>You are ChatGPT, ...<|end|>
<|start|>developer<|message|># Instructions\n\nYou are a ...\n\n# Tools...<|end|>
<|start|>user<|message|># AGENTS.md instructions for ...<|end|>
<|start|>user<|message|><environment_context>...<|end|>
<|start|>user<|message|>Explain this repo in one sentence<|end|>
<|start|>assistant

From this prompt, the model produces a reasoning text ("We need to explain repo ...") and a function call (ls -R).

    {"type":"response.created","response":{"id":"resp_WWZZqZyHtSnMteAaYg1oCkKtxeMk1mPO","object":"response","status":"in_progress"}}
    {"type":"response.in_progress","response":{"id":"resp_WWZZqZyHtSnMteAaYg1oCkKtxeMk1mPO","object":"response","status":"in_progress"}}
    {"type":"response.output_item.added","item":{"id":"rs_HL7QCzI9EEpldiojvIeiYPb6AOe3EnUi","summary":[],"type":"reasoning","content":[],"encrypted_content":"","status":"in_progress"}}
    {"type":"response.reasoning_text.delta","delta":"We","item_id":"rs_HL7QCzI9EEpldiojvIeiYPb6AOe3EnUi"}
    deltas ...
    {"type":"response.output_item.added","item":{"arguments":"","call_id":"fc_wxzvZd6LrQJetz7V9ZxjjmAFObzRzzg0","name":"shell","type":"function_call","status":"in_progress"}}
    {"type":"response.function_call_arguments.delta","delta":"{\"","item_id":"fc_wxzvZd6LrQJetz7V9ZxjjmAFObzRzzg0"}
    deltas ...
--> {"type":"response.output_item.done","item":{"id":"rs_HL7QCzI9EEpldiojvIeiYPb6AOe3EnUi","summary":[],"type":"reasoning","content":[{"text":"We need to explain repo in one sentence. Let's inspect repo.","type":"reasoning_text"}],"encrypted_content":""}}
    {"type":"response.output_item.done","item":{"type":"function_call","status":"completed","arguments":"{\"command\":[\"bash\",\"-lc\",\"ls -R\"],\"workdir\":\"./foobar\"}","call_id":"fc_wxzvZd6LrQJetz7V9ZxjjmAFObzRzzg0","name":"shell"}}
    {"type": "response.completed", ...}
(The arrow shows delayed `response.output_item.done` of reasoning)
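
Since `response.output_item.done` can arrive later than expected, a client that does not depend on event order is unaffected. The following is a minimal sketch of such a client, not codex-cli's actual logic: `collect_items` is a hypothetical helper, and keying items by `id` or `call_id` is an assumption taken from the dump above.

```python
import json
from collections import defaultdict


def collect_items(sse_data_lines):
    """Group stream events by item, ignoring where the `done` events arrive."""
    deltas = defaultdict(str)  # item id -> concatenated delta text
    done = {}                  # item id -> final item from output_item.done
    for line in sse_data_lines:
        event = json.loads(line)
        kind = event["type"]
        if kind.endswith(".delta"):
            deltas[event["item_id"]] += event["delta"]
        elif kind == "response.output_item.done":
            item = event["item"]
            # Assumption from the dump: reasoning items carry "id",
            # function calls carry "call_id".
            done[item.get("id") or item.get("call_id")] = item
    return dict(deltas), done


# Shortened, made-up events in the same order as the dump above.
events = [json.dumps(e) for e in [
    {"type": "response.output_item.added", "item": {"id": "rs_1", "type": "reasoning"}},
    {"type": "response.reasoning_text.delta", "delta": "We ", "item_id": "rs_1"},
    {"type": "response.reasoning_text.delta", "delta": "inspect.", "item_id": "rs_1"},
    {"type": "response.output_item.added", "item": {"call_id": "fc_1", "type": "function_call"}},
    {"type": "response.function_call_arguments.delta", "delta": "{\"a\":", "item_id": "fc_1"},
    {"type": "response.function_call_arguments.delta", "delta": "1}", "item_id": "fc_1"},
    {"type": "response.output_item.done", "item": {"id": "rs_1", "type": "reasoning"}},
    {"type": "response.output_item.done", "item": {"call_id": "fc_1", "type": "function_call"}},
]]

deltas, done = collect_items(events)
print(deltas["rs_1"])
print(json.loads(deltas["fc_1"]))
```

Because the deltas are keyed by `item_id`, the late `done` event for the reasoning item does not change the accumulated result.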

Codex sends a new request after executing ls -R.

Second request (Responses format):

 {"model":"gpt-oss_local_gguf","instructions":"You are a ...","input":[
     {"type":"message","role":"developer","content":[{"type":"input_text","text":"<permissions instructions>..."}]},
     {"type":"message","role":"user","content":[{"type":"input_text","text":"# AGENTS.md instructions for ..."}]},
     {"type":"message","role":"user","content":[{"type":"input_text","text":"<environment_context>..."}]},
     {"type":"message","role":"user","content":[{"type":"input_text","text":"Explain this repo in one sentence"}]},
+    {"type":"reasoning","summary":[],"content":[{"type":"reasoning_text","text":"We need to explain repo in one sentence. Let's inspect repo."}],"encrypted_content":""},
+    {"type":"function_call","name":"shell","arguments":"{\"command\":[\"bash\",\"-lc\",\"ls -R\"],\"workdir\":\"./foobar\"}","call_id":"fc_wxzvZd6LrQJetz7V9ZxjjmAFObzRzzg0"},
+    {"type":"function_call_output","call_id":"fc_wxzvZd6LrQJetz7V9ZxjjmAFObzRzzg0","output":"{\"output\":\".:\\nfoo.cpp\\n\",\"metadata\":{\"exit_code\":0,\"duration_seconds\":0.0}}"}
 ],"tools":["..."],"tool_choice":"auto","parallel_tool_calls":false,"reasoning":null,"store":false,"stream":true,"include":[],"prompt_cache_key":"019bd4e7-1366-7571-93a6-f1b960ee1c59"}

Converted request (Chat Completions format):

 {"model":"gpt-oss_local_gguf","tool_choice":"auto","parallel_tool_calls":false,"reasoning":null,"store":false,"stream":true,"include":[],"prompt_cache_key":"019bd4e7-1366-7571-93a6-f1b960ee1c59","messages":[
     {"role":"system","content":"You are a ..."},
     {"role":"developer","content":[{"text":"<permissions instructions>...","type":"text"}]},
     {"role":"user","content":[{"text":"# AGENTS.md instructions for ...","type":"text"}]},
     {"role":"user","content":[{"text":"<environment_context>...","type":"text"}]},
     {"role":"user","content":[{"text":"Explain this repo in one sentence","type":"text"}]},
+    {"role":"assistant","tool_calls":[{"function":{"arguments":"{\"command\":[\"bash\",\"-lc\",\"ls -R\"],\"workdir\":\"./foobar\"}","name":"shell"},"id":"fc_wxzvZd6LrQJetz7V9ZxjjmAFObzRzzg0","type":"function"}],"reasoning_content":"We need to explain repo in one sentence. Let's inspect repo."},
+    {"role":"tool","content":"{\"output\":\".:\\nfoo.cpp\\n\",\"metadata\":{\"exit_code\":0,\"duration_seconds\":0.0}}","tool_call_id":"fc_wxzvZd6LrQJetz7V9ZxjjmAFObzRzzg0"}
 ],"tools":["..."]}
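
The mapping shown in the two dumps can be sketched roughly as follows. This is an illustrative Python analogue written from the examples above, not the server's C++ implementation: the function name is hypothetical, and attaching a reasoning item to the function call that follows it is an assumption inferred from the converted request.

```python
def convert_responses_to_chatcmpl_sketch(body):
    """Illustrative only: map a Responses request body to Chat Completions."""
    # Everything except "instructions" and "input" is passed through unchanged.
    out = {k: v for k, v in body.items() if k not in ("instructions", "input")}
    messages = []
    if body.get("instructions"):
        # Instructions become the first (system) message.
        messages.append({"role": "system", "content": body["instructions"]})

    pending_reasoning = None  # reasoning text waiting for the next function_call
    for item in body["input"]:
        kind = item["type"]
        if kind == "message":
            content = [{"type": "text", "text": part["text"]}
                       for part in item["content"]
                       if part["type"] in ("input_text", "output_text")]
            messages.append({"role": item["role"], "content": content})
        elif kind == "reasoning":
            pending_reasoning = "".join(part["text"] for part in item["content"]
                                        if part["type"] == "reasoning_text")
        elif kind == "function_call":
            msg = {"role": "assistant", "tool_calls": [{
                "type": "function",
                "id": item["call_id"],
                "function": {"name": item["name"], "arguments": item["arguments"]},
            }]}
            if pending_reasoning:
                msg["reasoning_content"] = pending_reasoning
                pending_reasoning = None
            messages.append(msg)
        elif kind == "function_call_output":
            messages.append({"role": "tool", "content": item["output"],
                             "tool_call_id": item["call_id"]})
        else:
            raise ValueError(f"unsupported input item type: {kind}")
    out["messages"] = messages
    return out


request = {"model": "gpt-oss_local_gguf", "instructions": "You are a ...", "stream": True, "input": [
    {"type": "message", "role": "user",
     "content": [{"type": "input_text", "text": "Explain this repo in one sentence"}]},
    {"type": "reasoning", "summary": [], "encrypted_content": "",
     "content": [{"type": "reasoning_text", "text": "Let's inspect repo."}]},
    {"type": "function_call", "name": "shell", "arguments": "{}", "call_id": "fc_1"},
    {"type": "function_call_output", "call_id": "fc_1", "output": "ok"},
]}
converted = convert_responses_to_chatcmpl_sketch(request)
for message in converted["messages"]:
    print(message["role"], sorted(message))
```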

generated prompt:

 <|start|>system<|message|>You are ChatGPT, ...<|end|>
 <|start|>developer<|message|># Instructions\n\nYou are a ...\n\n# Tools\n\n...<|end|>
 <|start|>user<|message|># AGENTS.md instructions for ...<|end|>
 <|start|>user<|message|><environment_context>...<|end|>
 <|start|>user<|message|>Explain this repo in one sentence<|end|>
-<|start|>assistant
+<|start|>assistant<|channel|>analysis<|message|>We need to explain repo in one sentence. Let's inspect repo.<|end|>
+<|start|>assistant to=functions.shell<|channel|>commentary json<|message|>"{\"command\":[\"bash\",\"-lc\",\"ls -R\"],\"workdir\":\"./foobar\"}"<|call|>
+<|start|>functions.shell to=assistant<|channel|>commentary<|message|>"{\"output\":\".:\\nfoo.cpp\\n\",\"metadata\":{\"exit_code\":0,\"duration_seconds\":0.0}}"<|end|>
+<|start|>assistant

This generates another function call (sed -n '1,200p' foo.cpp), and the next (third) request is handled in a similar way.

 <|start|>system<|message|>You are ChatGPT, ...<|end|>
 <|start|>developer<|message|># Instructions\n\nYou are a ...\n\n# Tools\n\n...<|end|>
 <|start|>user<|message|># AGENTS.md instructions for ...<|end|>
 <|start|>user<|message|><environment_context>...<|end|>
 <|start|>user<|message|>Explain this repo in one sentence<|end|>
 <|start|>assistant<|channel|>analysis<|message|>We need to explain repo in one sentence. Let's inspect repo.<|end|>
 <|start|>assistant to=functions.shell<|channel|>commentary json<|message|>"{\"command\":[\"bash\",\"-lc\",\"ls -R\"],\"workdir\":\"./foobar\"}"<|call|>
 <|start|>functions.shell to=assistant<|channel|>commentary<|message|>"{\"output\":\".:\\nfoo.cpp\\n\",\"metadata\":{\"exit_code\":0,\"duration_seconds\":0.0}}"<|end|>
-<|start|>assistant
+<|start|>assistant<|channel|>analysis<|message|>Only one file foo.cpp. Open it.<|end|>
+<|start|>assistant to=functions.shell<|channel|>commentary json<|message|>"{\"command\":[\"bash\",\"-lc\",\"sed -n '1,200p' foo.cpp\"],\"workdir\":\"./foobar\"}"<|call|>
+<|start|>functions.shell to=assistant<|channel|>commentary<|message|>"{\"output\":\"#include <iostream>\\n\\nint main() {\\n    std::cout << \\\"Hello!\\\" << std::endl;\\n}\\n\",\"metadata\":{\"exit_code\":0,\"duration_seconds\":0.0}}"<|end|>
+<|start|>assistant

This generates:

  • reasoning_text: "Repo contains single C++ hello world program. Provide one sentence."
  • output_text: "A single C++ file that prints “Hello!” to the console."

Next, the user types "Compile and run in one line" and the fourth request is sent. Reasoning contents from the previous turn are excluded (by the chat template).

 <|start|>system<|message|>You are ChatGPT, ...<|end|>
 <|start|>developer<|message|># Instructions\n\nYou are a ...\n\n# Tools\n\n...<|end|>
 <|start|>user<|message|># AGENTS.md instructions for ...<|end|>
 <|start|>user<|message|><environment_context>...<|end|>
 <|start|>user<|message|>Explain this repo in one sentence<|end|>
-<|start|>assistant<|channel|>analysis<|message|>We need to explain repo in one sentence. Let's inspect repo.<|end|>
 <|start|>assistant to=functions.shell<|channel|>commentary json<|message|>"{\"command\":[\"bash\",\"-lc\",\"ls -R\"],\"workdir\":\"./foobar\"}"<|call|>
 <|start|>functions.shell to=assistant<|channel|>commentary<|message|>"{\"output\":\".:\\nfoo.cpp\\n\",\"metadata\":{\"exit_code\":0,\"duration_seconds\":0.0}}"<|end|>
-<|start|>assistant<|channel|>analysis<|message|>Only one file foo.cpp. Open it.<|end|>
 <|start|>assistant to=functions.shell<|channel|>commentary json<|message|>"{\"command\":[\"bash\",\"-lc\",\"sed -n '1,200p' foo.cpp\"],\"workdir\":\"./foobar\"}"<|call|>
 <|start|>functions.shell to=assistant<|channel|>commentary<|message|>"{\"output\":\"#include <iostream>\\n\\nint main() {\\n    std::cout << \\\"Hello!\\\" << std::endl;\\n}\\n\",\"metadata\":{\"exit_code\":0,\"duration_seconds\":0.0}}"<|end|>
-<|start|>assistant
+<|start|>assistant<|channel|>final<|message|>A single C++ file that prints “Hello!” to the console.<|end|>
+<|start|>user<|message|>Compile and run in one line<|end|>
+<|start|>assistant
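
The effect visible in the prompt above can be approximated in a few lines. This is only a sketch of the observed behavior, not the chat template itself: `strip_old_reasoning` is a hypothetical helper, and the rule "keep reasoning only after the last user message" is an assumption inferred from the third and fourth prompts.

```python
def strip_old_reasoning(messages):
    """Drop reasoning_content from messages that precede the last user message."""
    last_user = max((i for i, m in enumerate(messages) if m["role"] == "user"),
                    default=-1)
    result = []
    for i, msg in enumerate(messages):
        if i < last_user and "reasoning_content" in msg:
            # Copy the message without its reasoning; the input is not mutated.
            msg = {k: v for k, v in msg.items() if k != "reasoning_content"}
        result.append(msg)
    return result


history = [
    {"role": "user", "content": "Explain this repo in one sentence"},
    {"role": "assistant", "tool_calls": ["..."], "reasoning_content": "Let's inspect repo."},
    {"role": "tool", "content": "foo.cpp"},
    {"role": "assistant", "content": "A single C++ file that prints Hello!"},
    {"role": "user", "content": "Compile and run in one line"},
]
stripped = strip_old_reasoning(history)
print("reasoning_content" in stripped[1])
```

Within a single turn (no newer user message yet), the reasoning stays in place, which matches the second and third prompts.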

And so on.

@openingnow

This comment was marked as duplicate.

@aldehir
Collaborator

aldehir commented Jan 2, 2026

Now converting tools.

The gpt-oss models require feeding the reasoning from prior assistant tool calls. In the common library, this is handled via the reasoning_content field in the message. Is this something that can be handled by the stateless responses API?

@ngxson
Collaborator

ngxson commented Jan 2, 2026

Is this something that can be handled by the stateless responses API?

Just note that we do support state tracking for streamed API response, documented in server devs docs

@openingnow
Contributor Author

OpenAI models do not provide raw reasoning text. (model spec)

Hidden chain-of-thought message: some of OpenAI’s models can generate a hidden chain-of-thought message to reason through a problem before generating a final answer. This chain of thought is used to guide the model’s behavior, but is not exposed to the user or developer except potentially in summarized form.

As Aldehir mentioned, if an LLM request includes tool call output, reasoning contents should also be included. (docs/function-calling)

for reasoning models like GPT-5 or o4-mini, any reasoning items returned in model responses with tool calls must also be passed back with tool call outputs.

I see 2 problems:

  1. While there is a field for providing previous reasoning contents, it is meant for the summary, not the raw reasoning text.
  2. Chat Completions API does not have a field for providing reasoning text. This could be a difference between /chat/completions and /responses.

I hope inspecting codex-cli will help.

@coder543

coder543 commented Jan 4, 2026

@openingnow I think the question was about gpt-oss, not the closed-source OpenAI models. That said, the closed-source OpenAI models do provide full reasoning traces that you hand back to the server with each request so that the reasoning can persist between tool calls; they are just encrypted so that no one outside of OpenAI can read them. They are not summaries, although unencrypted summaries are also available for user-facing usage.

@aldehir
Collaborator

aldehir commented Jan 4, 2026

@openingnow

See https://cookbook.openai.com/articles/gpt-oss/handle-raw-cot. Indeed, I am only concerned with gpt-oss.

llama.cpp sends reasoning traces to the client via reasoning_content and accepts them back in the same field, adjusting them as needed for the template:

llama.cpp/common/chat.cpp

Lines 1944 to 1957 in cef1d23

// Copy reasoning to the "thinking" field as expected by the gpt-oss template
auto adjusted_messages = json::array();
for (const auto & msg : inputs.messages) {
    auto has_reasoning_content = msg.contains("reasoning_content") && msg.at("reasoning_content").is_string();
    auto has_tool_calls = msg.contains("tool_calls") && msg.at("tool_calls").is_array();
    if (has_reasoning_content && has_tool_calls) {
        auto adjusted_message = msg;
        adjusted_message["thinking"] = msg.at("reasoning_content");
        adjusted_messages.push_back(adjusted_message);
    } else {
        adjusted_messages.push_back(msg);
    }
}

This deviates from the recommended approach, which uses reasoning for the Chat Completions API, but it's what llama.cpp and vLLM have settled on.

My hope is to align with OpenAI's recommendation for the Responses API.

This doesn't have to be tackled in this PR, which is basic support. I only bring it up for awareness, as it is a highly desired feature.

@openingnow
Contributor Author

openingnow commented Jan 5, 2026

As this API is for OpenAI compatibility and aims to be a drop-in replacement, shouldn't the behavior match that of the closed-source models?

@coder543

coder543 commented Jan 5, 2026

Encryption of the reasoning traces is not a compatibility concern, so… I don’t see any reason to encrypt them, if that’s what you’re asking? Otherwise, I’m not sure what you’re asking.

It sounds like the desire is to have compatibility, which I agree with entirely.

Given how buggy codex is with the Chat API provided by llama-server today, I would definitely like it to have proper Responses API support soon, which would include passing back reasoning traces.

@openingnow
Contributor Author

My concern is which field codex-cli uses to deliver the reasoning contents.

For reasoning input and output, we have summary, encrypted_content, and content. The problem is that proprietary models will use the first two fields, while oss models will only use the last one. Therefore, applications built around the proprietary models will look at summary or encrypted_content and append it to subsequent requests. They will not look at the plain content field, since proprietary models do not fill it.

Supporting codex-cli with reasoning + tool calling would be a great test for this PR, and here is my plan.

  1. Figure out which field codex-cli uses
  2. On output, append reasoning to the field
  3. On input, read the field and move that into message["thinking"] (or other appropriate place)

@coder543 Does this make sense? And can you provide an edge case or dump where codex-cli with llama-server fails?

@coder543

coder543 commented Jan 5, 2026

What you wrote makes sense, but I think it's not something we should have to worry so much about.

For the Responses API path, Codex CLI just replays the ResponseItem items it received; it doesn’t pick any specific fields. The schema explicitly supports summary, content (raw reasoning text), and encrypted_content in the same item. See ResponseItem::Reasoning and ReasoningItemContent.

On the Responses request path, the prompt.input array (which includes prior ResponseItems) is passed through unchanged into the request body:

Codex CLI explicitly asks for encrypted reasoning only when the model is known to support reasoning summaries: ["reasoning.encrypted_content"]

Codex CLI seems to fully support GPT-OSS, which makes sense because OpenAI defined the spec for it. We don't have to fake the encrypted_content field.

If we wanted to offer a CLI option to move the reasoning text into the encrypted_content field as a hack to support a poorly written client, we could do that, but I don't think it makes sense as the default.

Or we could pass the reasoning text back in the encrypted_content field when the client asks for encrypted_content? Asking is required for a client to get encrypted_content from OpenAI's main API, and Codex CLI will not do so for GPT-OSS, at least not by default, since it recognizes that GPT-OSS implementations don't usually support encrypted_content.

This is my best understanding of the situation from poking around the code.

@openingnow openingnow force-pushed the v1_responses branch 2 times, most recently from fda1d43 to fdb26fb Compare January 19, 2026 10:00
@openingnow
Contributor Author

I edited the main text, as the explanation is too long to fit in a comment.
@ngxson Is it acceptable to make events from server_task_result_cmpl_partial::update()?

@ngxson
Collaborator

ngxson commented Jan 19, 2026

I don't get what you mean. We're doing functional programming here, and it's unclear from your question which is the state and which is the derived state.

@openingnow
Contributor Author

The primary state would be the variables related to state.update_chat_msg(content, true, oaicompat_msg_diffs);, namely oaicompat_msg_diffs and task_result_state & state (excluding task_result_state::openai_responses_item_ids).
The derived state includes openai_responses_item_ids (which is modified in update()) and openai_responses_current_events (the SSEs generated from the current diff chunk).
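
To make the primary/derived split concrete, here is a rough sketch of deriving events from a diff chunk while tracking which items have already been announced. It is a simplified model, not the llama.cpp code: the class, the shape of `diff`, and the attribute names are all hypothetical.

```python
class ResponsesStreamState:
    """Hypothetical stand-in for per-task state; names do not mirror llama.cpp."""

    # Delta event type emitted for each kind of output item.
    DELTA_TYPES = {
        "reasoning": "response.reasoning_text.delta",
        "function_call": "response.function_call_arguments.delta",
        "message": "response.output_text.delta",
    }

    def __init__(self):
        self.announced_ids = set()  # derived state: items already announced

    def update(self, diff):
        """Turn one parsed diff chunk into the SSE payloads it implies."""
        events = []
        item_id, kind, delta = diff["item_id"], diff["kind"], diff["delta"]
        if item_id not in self.announced_ids:
            # First chunk of a new item: announce it before any delta.
            self.announced_ids.add(item_id)
            events.append({"type": "response.output_item.added",
                           "item": {"id": item_id, "type": kind,
                                    "status": "in_progress"}})
        events.append({"type": self.DELTA_TYPES[kind],
                       "delta": delta, "item_id": item_id})
        return events


state = ResponsesStreamState()
first = state.update({"item_id": "rs_1", "kind": "reasoning", "delta": "We"})
second = state.update({"item_id": "rs_1", "kind": "reasoning", "delta": " need"})
print([e["type"] for e in first])
print([e["type"] for e in second])
```

In this model the set of announced ids is the only thing `update()` mutates; the events are recomputed from each chunk.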

@openingnow openingnow force-pushed the v1_responses branch 2 times, most recently from 34c54c2 to e8061a2 Compare January 20, 2026 07:25
@openingnow
Contributor Author

Rebased to resolve conflict around task_result_state(const common_chat_parser_params & chat_parser_params)

{"file_data", input_item.at("file_data")},
{"filename", input_item.at("filename")},
}},
{"type", "file"},
Collaborator

I don't think we support this type yet. It should probably be converted into a text chunk (please verify), or maybe we just reject this type for now

Contributor Author

I think it should be rejected unless the file type is supported by Chat Completions.

{"type", "function_call"},
{"status", "completed"},
{"arguments", tool_call.arguments},
{"call_id", "fc_" + tool_call.id},
Collaborator

do we expect to use oai_resp_fc_id here?

Contributor Author

No. oai_resp_fc_id only keeps the function call's id while the arguments are being generated, so it exists in task_result_state and server_task_result_cmpl_partial but not in server_task_result_cmpl_final.

{"data", json {
{"type", "response.function_call_arguments.delta"},
{"delta", diff.tool_call_delta.arguments},
{"item_id", "fc_" + oai_resp_fc_id},
Collaborator

it's unclear to me, does oai_resp_fc_id value already include fc_ prefix inside it?

Contributor Author

No, it does not have the "fc_" prefix. It is copied from diff.tool_call_delta.id without any prefix.
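
One way to avoid doubling the prefix when an id may or may not already carry it is an idempotent helper. This is a small sketch only: `with_fc_prefix` is a hypothetical name, not a function in the PR.

```python
def with_fc_prefix(tool_call_id: str) -> str:
    """Return the id with exactly one leading "fc_" prefix."""
    if tool_call_id.startswith("fc_"):
        return tool_call_id
    return "fc_" + tool_call_id


print(with_fc_prefix("wxzvZd6L"))
print(with_fc_prefix("fc_wxzvZd6L"))
```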

@ngxson ngxson merged commit fbbf3ad into ggml-org:master Jan 21, 2026
85 checks passed
@openingnow openingnow deleted the v1_responses branch January 22, 2026 10:31
shaofeiqi pushed a commit to qualcomm/llama.cpp that referenced this pull request Feb 6, 2026
* from previous PR

* Make instruction(system) as first message

* Convert [input_message] (text/image/file)

* Rename convert_responses_to_chatcmpl(body) -> response_body

* Initial tool call support

* Erase instructions field from chatcmpl body

* Feed reasoning texts to chat template

* Use std::vector instead of opaque json array

* Make output_item.added events consistent

* Move `server_task_result_cmpl_partial::update` from header to source

* Match ID of output_item.added and .done events

* Add function_call only if there is no "fc_" prefix

* Add function call output at non-streaming API

* Test if ID is persistent

* Add doc

* Fix style - use trailing comma

* Rewrite state management

* catch up with upstream/master

* Fix style - "type" is the first item of SSE data

* Explicitly check "instructions" from response_body

* Make lambdas static

* Check if reasoning content exists

* Add `oai_resp_id` to task_result_state(also initialized at ctor), server_task_result_cmpl_partial, and server_task_result_cmpl_final

* Reject `input_file` since it is not supported by chatcmpl

* Add "fc_" prefix to non-streaming function call id as coderabbit pointed out

---------

Co-authored-by: openingnow <>

Labels

examples python python script changes server


4 participants