Bug Description
When the conversation context hits the ctx-size limit during a /v1/responses tool-calling session, the API returns a misleading 500 error that provides no indication of the actual cause (context overflow). This creates unrecoverable retry loops in agentic clients like Codex CLI.
Reproduction
- Use Codex CLI (or any agentic client) against /v1/responses with a model like qwen3.5-27b at ctx-size=32768
- Let the conversation grow through tool call rounds until context approaches 32K tokens
- Observe the failure cascade
What happens
Phase 1: Silent truncation
When context reaches the limit, the model generates a truncated tool call (e.g., only 35 tokens before hitting the ceiling). The server returns this as HTTP 200 with:
- "status": "completed" (should be "incomplete")
- Truncated, invalid JSON in the tool call arguments (e.g., {"command": "cat s, which is cut off at column 20)
- finish_reason: "length" is set but no incomplete_details object
Phase 2: Misleading error on next request
The client includes the truncated tool call in conversation history for the next request. The func_args_not_string() function in common/chat.cpp:1407 tries to parse it and throws:
Failed to parse tool call arguments as JSON: [json.exception.parse_error.101]
parse error at line 1, column 20: syntax error while parsing object -
unexpected end of input; expected '}'
This is returned as a generic 500 with no indication that context overflow was the root cause. The client retries endlessly since the conversation only grows.
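Until the server signals truncation properly, a client could guard against replaying a truncated tool call by checking the arguments before adding them to the history. The following is a minimal sketch of such a guard, not code from llama.cpp or Codex CLI; the function name looks_like_complete_json_object is hypothetical, and the check is only structural (string state and bracket nesting), not a full JSON validator.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Cheap structural check, NOT a full JSON validator: it tracks string state
// and bracket nesting only, which is enough to catch output that was cut off
// in the middle of an object. It does not verify that '{' is closed by '}'
// rather than ']', nor that keys and values are well formed.
bool looks_like_complete_json_object(const std::string & text) {
    const std::size_t first = text.find_first_not_of(" \t\r\n");
    if (first == std::string::npos || text[first] != '{') {
        return false; // tool call arguments are expected to be a JSON object
    }

    int  depth     = 0;
    bool in_string = false;
    bool escaped   = false;

    for (std::size_t i = first; i < text.size(); ++i) {
        const char c = text[i];
        if (in_string) {
            if (escaped) {
                escaped = false;       // character after a backslash is consumed
            } else if (c == '\\') {
                escaped = true;
            } else if (c == '"') {
                in_string = false;     // closing quote
            }
            continue;
        }
        if (c == '"') {
            in_string = true;
        } else if (c == '{' || c == '[') {
            ++depth;
        } else if (c == '}' || c == ']') {
            if (--depth < 0) {
                return false;          // more closers than openers
            }
        }
    }
    assert(depth >= 0);                // negative depth returned early above
    return depth == 0 && !in_string;   // everything opened was closed
}
```

With the truncated arguments from this report, {"command": "cat s, the function returns false because the string is still open at the end of input, so the client could drop that call or stop the loop instead of resending it.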
Expected behavior
Two things need to happen:
1. Truncated responses must signal "status": "incomplete"
Both to_json_oaicompat_resp() (line ~973) and to_json_oaicompat_resp_stream() (line ~1080) currently hardcode "status": "completed". When stop == STOP_TYPE_LIMIT, the response must:
- Set "status": "incomplete"
- Include "incomplete_details": {"reason": "max_output_tokens"} (per OpenAI spec)
- Either omit the truncated tool call entirely, or mark it as incomplete so the client doesn't try to use it
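The mapping described above could look roughly like the sketch below. It is self-contained and does not use the real server types: the enum is a local stand-in (only STOP_TYPE_LIMIT is named in this report, the other values are placeholders), and response_status, responses_status_for and status_fields_json are hypothetical names. The actual functions build JSON objects rather than strings, so this only illustrates the decision, not the patch itself.

```cpp
#include <cassert>
#include <string>

// Local stand-in for the server's stop-type enum. Only STOP_TYPE_LIMIT is
// taken from the issue; the remaining values are placeholders.
enum stop_type { STOP_TYPE_NONE, STOP_TYPE_EOS, STOP_TYPE_WORD, STOP_TYPE_LIMIT };

// Hypothetical holder for the two fields the Responses API output needs.
struct response_status {
    std::string status;            // "completed" or "incomplete"
    std::string incomplete_reason; // empty means: omit incomplete_details
};

// Decide the status from the stop reason instead of hardcoding "completed".
response_status responses_status_for(stop_type stop) {
    if (stop == STOP_TYPE_LIMIT) {
        return {"incomplete", "max_output_tokens"};
    }
    return {"completed", ""};
}

// Render the fields as a JSON fragment (string building keeps the sketch
// dependency-free; the real code would set keys on a JSON object).
std::string status_fields_json(const response_status & rs) {
    assert(!rs.status.empty());
    std::string out = "\"status\": \"" + rs.status + "\"";
    if (!rs.incomplete_reason.empty()) {
        out += ", \"incomplete_details\": {\"reason\": \"" + rs.incomplete_reason + "\"}";
    }
    return out;
}
```

For STOP_TYPE_LIMIT this yields "status": "incomplete" followed by the incomplete_details object; for any other stop type only "status": "completed" is emitted. The same decision would need to be applied in both the non-streaming and the streaming path.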
2. Input validation must return a clear, actionable error
When func_args_not_string() fails to parse tool call arguments from input messages, the server should check whether the token count is near the context limit and return:
- HTTP 400 (not 500 — the input is malformed, it's not a server error)
- A message that explains the actual cause, e.g.:
Tool call arguments in message history contain invalid JSON (truncated at column 20).
This typically happens when a previous response was truncated due to context length limits.
Consider reducing conversation history or increasing ctx-size.
At minimum, even without the smart detection, the error should be a 400 with a message like:
Invalid tool call arguments in input messages: JSON parse error at column 20 (unexpected end of input).
Check that all tool_calls in the conversation history contain valid JSON arguments.
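One possible shape for building that error is sketched below. Everything here is an assumption for illustration: api_error and make_tool_args_error are hypothetical names, the inputs (parse column, prompt token count, context size) would have to be made available at the point where the parse fails, and the 0.9 threshold for "near the limit" is arbitrary.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical error value: an HTTP status plus a human-readable message.
struct api_error {
    int         http_status;
    std::string message;
};

// Build a 400 error for unparseable tool call arguments. If the prompt is
// close to the context size, add the hint about truncation; otherwise fall
// back to the generic advice. near_limit_ratio is an arbitrary assumption.
api_error make_tool_args_error(std::size_t parse_column,
                               std::size_t n_prompt_tokens,
                               std::size_t n_ctx,
                               double      near_limit_ratio = 0.9) {
    assert(near_limit_ratio > 0.0 && near_limit_ratio <= 1.0);

    std::string msg =
        "Invalid tool call arguments in input messages: JSON parse error at column " +
        std::to_string(parse_column) + ".";

    const bool near_limit =
        n_ctx > 0 &&
        static_cast<double>(n_prompt_tokens) >= near_limit_ratio * static_cast<double>(n_ctx);

    if (near_limit) {
        msg += " This typically happens when a previous response was truncated due to"
               " context length limits. Consider reducing conversation history or"
               " increasing ctx-size.";
    } else {
        msg += " Check that all tool_calls in the conversation history contain valid"
               " JSON arguments.";
    }
    return {400, msg}; // malformed input is a client error, not a server error
}
```

With the numbers from this report, make_tool_args_error(20, 32000, 32768) produces a 400 whose message mentions ctx-size, while a short conversation such as make_tool_args_error(20, 1000, 32768) gets only the generic advice.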
Relevant code
common/chat.cpp:1396-1414 — func_args_not_string() throws the misleading error
tools/server/server-task.cpp:~973 — to_json_oaicompat_resp() hardcodes "status": "completed"
tools/server/server-task.cpp:~1080 — to_json_oaicompat_resp_stream() same issue
tools/server/server-context.cpp:1250-1253 — truncation sets STOP_TYPE_LIMIT but this is not propagated to the Responses API status
Impact
This causes agentic tool-calling loops (Codex CLI, etc.) to enter infinite retry cycles when context is exhausted, with no way for the client to diagnose or recover from the issue.