server: /v1/responses (partial) #18486
The gpt-oss models require feeding the reasoning from prior assistant tool calls. In the common library, this is handled via the |
Note that we do support state tracking for streamed API responses, as documented in the server dev docs.
|
OpenAI models do not provide raw reasoning text. (model spec)
As Aldehir mentioned, if an LLM request includes tool call output, reasoning contents should also be included. (docs/function-calling)
I see 2 problems:
I hope inspecting codex-cli will help.
|
@openingnow I think the question was about gpt-oss, not the closed-source OpenAI models. That said, the closed-source OpenAI models do provide full reasoning traces that you hand back to the server with each request so that the reasoning can persist between tool calls; they are just encrypted so that no one outside of OpenAI can read them. They are not summaries, although unencrypted summaries are also available for user-facing usage.
|
See https://cookbook.openai.com/articles/gpt-oss/handle-raw-cot; indeed, I am only concerned about gpt-oss. llama.cpp sends reasoning traces to the client via Lines 1944 to 1957 in cef1d23. This deviates from the recommended approach, which uses reasoning for the Chat Completions API, but it's what llama.cpp and vLLM have settled on.
My hope is to align with OpenAI's recommendation for the Responses API. This doesn't have to be tackled in this PR, which is basic support. I only bring it up for awareness, as it is a highly desired feature.
|
As this API is for OpenAI compatibility and aims to be a drop-in replacement, shouldn't the behavior match that of the closed-source models?
|
Encryption of the reasoning traces is not a compatibility concern, so… I don’t see any reason to encrypt them, if that’s what you’re asking? Otherwise, I’m not sure what you’re asking. It sounds like the desire is to have compatibility, which I agree with entirely. Given how buggy |
|
My concern is which field codex-cli uses to deliver the reasoning contents. For reasoning input and output, we have Supporting codex-cli with reasoning + tool calling would be a great test for this PR, and here is my plan.
@coder543 Does this make sense? And can you provide an edge case or dump where codex-cli with llama-server fails?
|
What you wrote makes sense, but I think it's not something we should have to worry so much about. For the Responses API path, Codex CLI just replays the On the Responses request path, the
Codex CLI explicitly asks for encrypted reasoning only when the model is known to support reasoning summaries: Codex CLI seems to fully support GPT-OSS, which makes sense because OpenAI defined the spec for it. We don't have to fake the If we wanted to offer a CLI option to move the reasoning text into the Or we could pass the reasoning text back in the This is my best understanding of the situation from poking around the code. |
|
I edited the main text, as the explanation is too long to fit in a comment.
|
I don't get what you mean. We're doing functional programming here, and it's unclear from your question which is the state and which is the derived state.
|
The primary state would be variables related to |
|
Rebased to resolve conflict around |
tools/server/server-common.cpp
```cpp
        {"file_data", input_item.at("file_data")},
        {"filename", input_item.at("filename")},
    }},
    {"type", "file"},
```
I don't think we support this type yet. It should probably be converted into a text chunk (please verify), or maybe we just reject this type for now.
I think it should be rejected unless `file` is supported by Chat Completions.
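To illustrate the "reject for now" option discussed here, the following is a minimal sketch of mapping Responses content-part types to Chat Completions types. The function name `map_content_type` is hypothetical, and the `input_image` to `image_url` mapping is an assumption; only the `input_text` to `text` mapping and the rejection of `input_file` are taken from this PR.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper (not the PR's code): maps a Responses content-part type
// to its Chat Completions counterpart, rejecting unsupported types.
static std::string map_content_type(const std::string & responses_type) {
    if (responses_type == "input_text") {
        return "text";
    }
    if (responses_type == "input_image") {
        return "image_url"; // assumption: the usual Chat Completions image part type
    }
    if (responses_type == "input_file") {
        // rejected because Chat Completions has no supported equivalent here
        throw std::invalid_argument("'input_file' is not supported");
    }
    throw std::invalid_argument("unknown content type: " + responses_type);
}

// Usage sketch: supported types map, unsupported types throw.
static void example_usage() {
    assert(map_content_type("input_text") == "text");
    bool rejected = false;
    try {
        map_content_type("input_file");
    } catch (const std::invalid_argument &) {
        rejected = true;
    }
    assert(rejected);
}
```

Throwing early keeps an unsupported part from being silently dropped or mistranslated before it reaches the chat template.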
```cpp
    {"type", "function_call"},
    {"status", "completed"},
    {"arguments", tool_call.arguments},
    {"call_id", "fc_" + tool_call.id},
```
Do we expect to use `oai_resp_fc_id` here?
No. `oai_resp_fc_id` only keeps the function call's ID while its arguments are being generated, so it exists only in `task_result_state` and `server_task_result_cmpl_partial`, not in `server_task_result_cmpl_final`.
```cpp
    {"data", json {
        {"type", "response.function_call_arguments.delta"},
        {"delta", diff.tool_call_delta.arguments},
        {"item_id", "fc_" + oai_resp_fc_id},
```
It's unclear to me: does the `oai_resp_fc_id` value already include the `fc_` prefix?
No, it does not have the "fc_" prefix. It is copied from `diff.tool_call_delta.id` without any prefix.
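Since the stored ID carries no prefix and the prefix is added when the event payload is built, one way to avoid a doubled `fc_fc_` is a small helper that adds the prefix only when it is missing. This is a sketch under that assumption; `with_fc_prefix` is a hypothetical name, not a function from the PR.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not the PR's code): returns `id` with an "fc_" prefix,
// adding the prefix only when it is not already present.
static std::string with_fc_prefix(const std::string & id) {
    static const std::string prefix = "fc_";
    if (id.compare(0, prefix.size(), prefix) == 0) {
        return id; // already prefixed, leave untouched
    }
    return prefix + id;
}

// Usage sketch: an unprefixed ID and an already-prefixed ID
// both end up with exactly one "fc_" prefix.
static void example_usage() {
    assert(with_fc_prefix("wxzvZd6L") == "fc_wxzvZd6L");
    assert(with_fc_prefix("fc_wxzvZd6L") == "fc_wxzvZd6L");
}
```

Routing both the streaming `item_id` and the final `call_id` through one helper like this would keep the two IDs identical for the same call.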
* from previous PR
* Make instruction(system) as first message
* Convert [input_message] (text/image/file)
* Rename convert_responses_to_chatcmpl(body) -> response_body
* Initial tool call support
* Erase instructions field from chatcmpl body
* Feed reasoning texts to chat template
* Use std::vector instead of opaque json array
* Make output_item.added events consistent
* Move `server_task_result_cmpl_partial::update` from header to source
* Match ID of output_item.added and .done events
* Add function_call only if there is no "fc_" prefix
* Add function call output at non-streaming API
* Test if ID is persistent
* Add doc
* Fix style - use trailing comma
* Rewrite state management
* catch up with upstream/master
* Fix style - "type" is the first item of SSE data
* Explicitly check "instructions" from response_body
* Make lambdas static
* Check if reasoning content exists
* Add `oai_resp_id` to task_result_state(also initialized at ctor), server_task_result_cmpl_partial, and server_task_result_cmpl_final
* Reject `input_file` since it is not supported by chatcmpl
* Add "fc_" prefix to non-streaming function call id as coderabbit pointed out

---------

Co-authored-by: openingnow <>
previous PR: #18227
Conversations need to be resolved:
- `initial_events`
- `openai` from `requirements-tool_bench.txt`

This PR implements:
- `reasoning_content`.

Current caveats:
- If there are consecutive function calls, the `response.output_item.added` event will not be emitted for the latter ones.
- `response.output_item.done` events are generated at the end (i.e., when there is both reasoning and a function call, `response.function_call_arguments.delta` is created before the `response.output_item.done` of the reasoning item).

The issue arises from `server_task_result_cmpl_partial::update` and does not seem to be a problem for codex-cli, since it does not check the order of events.

Here's a visualization of how reasoning texts are handled. Let's start with `codex 'Explain this repo in one sentence'`.

First request (Responses format):
```json
{"model":"gpt-oss_local_gguf","instructions":"You are a ...","input":[
  {"type":"message","role":"developer","content":[{"type":"input_text","text":"<permissions instructions>..."}]},
  {"type":"message","role":"user","content":[{"type":"input_text","text":"# AGENTS.md instructions for ..."}]},
  {"type":"message","role":"user","content":[{"type":"input_text","text":"<environment_context>..."}]},
  {"type":"message","role":"user","content":[{"type":"input_text","text":"Explain this repo in one sentence"}]}
],"tools":["..."],"tool_choice":"auto","parallel_tool_calls":false,"reasoning":null,"store":false,"stream":true,"include":[],"prompt_cache_key":"019bd4e7-1366-7571-93a6-f1b960ee1c59"}
```

`convert_responses_to_chatcmpl` converts this into the Chat Completions format:

```json
{"model":"gpt-oss_local_gguf","tool_choice":"auto","parallel_tool_calls":false,"reasoning":null,"store":false,"stream":true,"include":[],"prompt_cache_key":"019bd4e7-1366-7571-93a6-f1b960ee1c59","messages":[
  {"role":"system","content":"You are a ..."},
  {"role":"developer","content":[{"text":"<permissions instructions>...","type":"text"}]},
  {"role":"user","content":[{"text":"# AGENTS.md instructions for ...","type":"text"}]},
  {"role":"user","content":[{"text":"<environment_context>...","type":"text"}]},
  {"role":"user","content":[{"text":"Explain this repo in one sentence","type":"text"}]}
],"tools":["..."]}
```

Generated prompt:
With the prompt, a reasoning text ("We need to explain repo ...") and a function call (`ls -R`) are made.

Codex sends the new request after executing `ls -R`.

Second request (Responses format):
```diff
 {"model":"gpt-oss_local_gguf","instructions":"You are a ...","input":[
   {"type":"message","role":"developer","content":[{"type":"input_text","text":"<permissions instructions>..."}]},
   {"type":"message","role":"user","content":[{"type":"input_text","text":"# AGENTS.md instructions for ..."}]},
   {"type":"message","role":"user","content":[{"type":"input_text","text":"<environment_context>..."}]},
   {"type":"message","role":"user","content":[{"type":"input_text","text":"Explain this repo in one sentence"}]},
+  {"type":"reasoning","summary":[],"content":[{"type":"reasoning_text","text":"We need to explain repo in one sentence. Let's inspect repo."}],"encrypted_content":""},
+  {"type":"function_call","name":"shell","arguments":"{\"command\":[\"bash\",\"-lc\",\"ls -R\"],\"workdir\":\"./foobar\"}","call_id":"fc_wxzvZd6LrQJetz7V9ZxjjmAFObzRzzg0"},
+  {"type":"function_call_output","call_id":"fc_wxzvZd6LrQJetz7V9ZxjjmAFObzRzzg0","output":"{\"output\":\".:\\nfoo.cpp\\n\",\"metadata\":{\"exit_code\":0,\"duration_seconds\":0.0}}"}
 ],"tools":["..."],"tool_choice":"auto","parallel_tool_calls":false,"reasoning":null,"store":false,"stream":true,"include":[],"prompt_cache_key":"019bd4e7-1366-7571-93a6-f1b960ee1c59"}
```

Converted request (Chat Completions format):
```diff
 {"model":"gpt-oss_local_gguf","tool_choice":"auto","parallel_tool_calls":false,"reasoning":null,"store":false,"stream":true,"include":[],"prompt_cache_key":"019bd4e7-1366-7571-93a6-f1b960ee1c59","messages":[
   {"role":"system","content":"You are a ..."},
   {"role":"developer","content":[{"text":"<permissions instructions>...","type":"text"}]},
   {"role":"user","content":[{"text":"# AGENTS.md instructions for ...","type":"text"}]},
   {"role":"user","content":[{"text":"<environment_context>...","type":"text"}]},
   {"role":"user","content":[{"text":"Explain this repo in one sentence","type":"text"}]},
+  {"role":"assistant","tool_calls":[{"function":{"arguments":"{\"command\":[\"bash\",\"-lc\",\"ls -R\"],\"workdir\":\"./foobar\"}","name":"shell"},"id":"fc_wxzvZd6LrQJetz7V9ZxjjmAFObzRzzg0","type":"function"}],"reasoning_content":"We need to explain repo in one sentence. Let's inspect repo."},
+  {"role":"tool","content":"{\"output\":\".:\\nfoo.cpp\\n\",\"metadata\":{\"exit_code\":0,\"duration_seconds\":0.0}}","tool_call_id":"fc_wxzvZd6LrQJetz7V9ZxjjmAFObzRzzg0"}
 ],"tools":["..."]}
```

Generated prompt:
This generates another function call (`sed -n '1,200p' foo.cpp`) and the next (third) request is done in a similar way.

```diff
 <|start|>system<|message|>You are ChatGPT, ...<|end|>
 <|start|>developer<|message|># Instructions\n\nYou are a ...\n\n# Tools\n\n...<|end|>
 <|start|>user<|message|># AGENTS.md instructions for ...<|end|>
 <|start|>user<|message|><environment_context>...<|end|>
 <|start|>user<|message|>Explain this repo in one sentence<|end|>
 <|start|>assistant<|channel|>analysis<|message|>We need to explain repo in one sentence. Let's inspect repo.<|end|>
 <|start|>assistant to=functions.shell<|channel|>commentary json<|message|>"{\"command\":[\"bash\",\"-lc\",\"ls -R\"],\"workdir\":\"./foobar\"}"<|call|>
 <|start|>functions.shell to=assistant<|channel|>commentary<|message|>"{\"output\":\".:\\nfoo.cpp\\n\",\"metadata\":{\"exit_code\":0,\"duration_seconds\":0.0}}"<|end|>
-<|start|>assistant
+<|start|>assistant<|channel|>analysis<|message|>Only one file foo.cpp. Open it.<|end|>
+<|start|>assistant to=functions.shell<|channel|>commentary json<|message|>"{\"command\":[\"bash\",\"-lc\",\"sed -n '1,200p' foo.cpp\"],\"workdir\":\"./foobar\"}"<|call|>
+<|start|>functions.shell to=assistant<|channel|>commentary<|message|>"{\"output\":\"#include <iostream>\\n\\nint main() {\\n std::cout << \\\"Hello!\\\" << std::endl;\\n}\\n\",\"metadata\":{\"exit_code\":0,\"duration_seconds\":0.0}}"<|end|>
+<|start|>assistant
```

This generates:
Next, the user types "Compile and run in one line" and the fourth request is sent. Reasoning contents from the previous turn are excluded (by the chat template).
And so on.
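The item-to-message mapping in the walkthrough can be sketched without any JSON library. This is a simplified illustration, not the PR's `convert_responses_to_chatcmpl`: the struct names, field names, and `to_chat_messages` are hypothetical, and the sketch assumes a `reasoning` item is always folded into the next `function_call` item, as in the second request above.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified stand-ins for the JSON items; all names are hypothetical and
// only mirror the keys seen in the example requests.
struct responses_item {
    std::string type;      // "reasoning", "function_call" or "function_call_output"
    std::string text;      // reasoning text
    std::string name;      // function name
    std::string arguments; // function arguments (JSON string)
    std::string call_id;
    std::string output;    // function_call_output payload
};

struct chat_message {
    std::string role;
    std::string content;
    std::string reasoning_content;
    std::string tool_name;
    std::string tool_arguments;
    std::string tool_call_id;
};

static std::vector<chat_message> to_chat_messages(const std::vector<responses_item> & items) {
    std::vector<chat_message> out;
    std::string pending_reasoning; // held until the next function_call
    for (const auto & item : items) {
        if (item.type == "reasoning") {
            pending_reasoning = item.text;
        } else if (item.type == "function_call") {
            chat_message msg;
            msg.role              = "assistant";
            msg.reasoning_content = pending_reasoning;
            msg.tool_name         = item.name;
            msg.tool_arguments    = item.arguments;
            msg.tool_call_id      = item.call_id;
            pending_reasoning.clear();
            out.push_back(msg);
        } else if (item.type == "function_call_output") {
            chat_message msg;
            msg.role         = "tool";
            msg.content      = item.output;
            msg.tool_call_id = item.call_id;
            out.push_back(msg);
        }
        // "message" items and error handling are omitted in this sketch
    }
    return out;
}

// Usage sketch: three Responses items collapse into two chat messages.
static void example_usage() {
    const std::vector<responses_item> items = {
        {"reasoning", "Let's inspect repo.", "", "", "", ""},
        {"function_call", "", "shell", "{}", "fc_1", ""},
        {"function_call_output", "", "", "", "fc_1", "ok"},
    };
    const std::vector<chat_message> msgs = to_chat_messages(items);
    assert(msgs.size() == 2);
    assert(msgs[0].reasoning_content == "Let's inspect repo.");
    assert(msgs[1].tool_call_id == "fc_1");
}
```

Attaching the reasoning to the assistant message that carries the tool call is what lets the chat template render the `analysis` channel before the call; dropping it on later turns is left to the template, as noted above.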