fix(server): Responses API emits incomplete status on truncation (#19) by marksverdhei · Pull Request #82 · heiervang-technologies/ht-llama.cpp

marksverdhei · 2026-06-05T15:28:41Z

Summary

Closes Phase 1 of #19: when generation hits STOP_TYPE_LIMIT (max_output_tokens / ctx-size cap), the OAI Responses code paths used to hardcode "status": "completed" on the top-level response, all output items, and the streaming response.completed SSE event. Agentic clients (Codex CLI, etc.) couldn't tell a finished response from a truncated one — they fed partial output back into conversation history, triggering JSON-parse-failure 500s on the next request and infinite retry loops.

Per the OAI Responses spec:

Top-level status flips to "incomplete" on truncation.
New top-level incomplete_details: { reason: "max_output_tokens" } field.
Per-item status (message / reasoning / function_call) inherits the same value, so clients can detect partial tool_calls / partial messages at the per-item level.
Streaming variant: final SSE event becomes response.incomplete (instead of response.completed), with the same payload shape.
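
The mapping above can be sketched in a few lines. This is an illustration only, written in Python for brevity; the actual change lives in the C++ functions in tools/server/server-task.cpp. The names `build_response`, `final_event_type`, and the string values of the stop-type constants are hypothetical and do not appear in the PR.

```python
from typing import Any

# Hypothetical stand-ins for the server's stop types.
# Only the name STOP_TYPE_LIMIT comes from the PR; the values are made up.
STOP_TYPE_EOS = "eos"
STOP_TYPE_LIMIT = "limit"


def build_response(stop_type: str, output_items: list[dict[str, Any]]) -> dict[str, Any]:
    """Map a stop type onto Responses-style status fields (illustrative)."""
    truncated = stop_type == STOP_TYPE_LIMIT
    status = "incomplete" if truncated else "completed"

    response: dict[str, Any] = {
        "status": status,
        # Every output item (message / reasoning / function_call)
        # inherits the top-level status.
        "output": [{**item, "status": status} for item in output_items],
    }
    if truncated:
        # Only present on truncation.
        response["incomplete_details"] = {"reason": "max_output_tokens"}
    return response


def final_event_type(stop_type: str) -> str:
    """Name of the terminal SSE event in the streaming variant."""
    if stop_type == STOP_TYPE_LIMIT:
        return "response.incomplete"
    return "response.completed"
```

Deriving both the top-level and per-item status from one `truncated` flag keeps the two from drifting apart, which is the property the per-item bullet above relies on.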

What this does not do

Phase 2 of #19 (HTTP 400 + actionable message from func_args_not_string) is intentionally out of scope. That requires typed-exception plumbing through common/chat.cpp into the server error path — a separate, bigger change. Phase 1 alone prevents the cascade in the first place: once clients see truncation as truncation, they don't retry with malformed history.
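
To illustrate why Phase 1 is enough to stop the cascade, here is a minimal sketch of a client-side guard. It is an assumption about how a client could use the new fields, not a description of what Codex CLI or any other client actually does; `record_turn` is a hypothetical helper.

```python
from typing import Any, Optional


def record_turn(history: list[dict[str, Any]], response: dict[str, Any]) -> Optional[str]:
    """Append output items to history only when the response completed.

    Returns None on success. Otherwise returns the truncation reason
    (or "unknown") so the caller can raise the token budget or shorten
    the prompt instead of replaying partial output.
    """
    if response.get("status") == "completed":
        history.extend(response.get("output", []))
        return None
    details = response.get("incomplete_details") or {}
    return details.get("reason", "unknown")
```

With a guard like this, a truncated function_call item never reaches the next request, so the malformed-arguments path that Phase 2 targets is not triggered by truncation.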

Test plan

tools/server/tests/unit/test_compat_oai_responses.py::test_responses_truncation_emits_incomplete_status — non-streaming repro with max_output_tokens: 2 on tinyllama2 (reliably trips STOP_TYPE_LIMIT). Asserts top-level status=incomplete + incomplete_details.reason=max_output_tokens + per-item status.
test_responses_truncation_stream_emits_incomplete_event — streaming repro. Verifies a response.incomplete event arrives with the same payload shape.
The two pre-existing tests, test_responses_with_openai_library and test_responses_stream_with_openai_library, still pass (no happy-path regression).
(manual, post-merge) re-run the original Codex CLI repro from the issue to confirm the retry loop is broken.
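
The checks described in the test plan could look roughly like the sketch below. It does not reproduce the repository's test harness, which is not shown here; `parse_sse` and `assert_truncated` are hypothetical helpers, and the assumption that the terminal event's data carries the full response under a `response` key is not confirmed by this PR.

```python
import json
from typing import Any


def parse_sse(raw: str) -> list[tuple[str, dict[str, Any]]]:
    """Split a raw SSE body into (event type, JSON data) pairs."""
    events = []
    for block in raw.strip().split("\n\n"):
        event_type = None
        data_lines = []
        for line in block.splitlines():
            if line.startswith("event:"):
                event_type = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data_lines.append(line[len("data:"):].strip())
        if event_type and data_lines:
            events.append((event_type, json.loads("\n".join(data_lines))))
    return events


def assert_truncated(response: dict[str, Any]) -> None:
    """Assert the status fields the test plan lists for a truncated response."""
    assert response["status"] == "incomplete"
    assert response["incomplete_details"]["reason"] == "max_output_tokens"
    for item in response["output"]:
        assert item["status"] == "incomplete"
```

For the non-streaming case, `assert_truncated` would be applied to the response body directly; for the streaming case, to the payload of the last event returned by `parse_sse`, after checking that its type is `response.incomplete`.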

Files touched

tools/server/server-task.cpp — both to_json_oaicompat_resp and to_json_oaicompat_resp_stream
tools/server/tests/unit/test_compat_oai_responses.py — 2 new test cases

🤖 Generated with Claude Code

When generation hits `STOP_TYPE_LIMIT` (max_output_tokens / ctx-size cap), the OAI Responses code paths hardcoded `"status": "completed"` everywhere: top-level response, per-message output items, function_call items, and the streaming `response.completed` SSE event. Agentic clients (Codex CLI, etc.) couldn't tell a finished response from a truncated one and ended up feeding partial output back into conversation history, triggering infinite retry loops on JSON-parse failures (issue #19, Phase 2).

Per the OAI Responses spec, branch on the stop type in:

* `server_task_result_cmpl_final::to_json_oaicompat_resp()`: emit `"status": "incomplete"` on the top-level response, all output items inherit the same status, plus `"incomplete_details": {"reason": "max_output_tokens"}` at the top level.
* `to_json_oaicompat_resp_stream()`: same mapping on the per-item statuses, plus the final SSE event becomes `response.incomplete` (vs `response.completed`) with `incomplete_details` on the payload.

Doesn't address Phase 2 of the issue (HTTP 400 + actionable message from `func_args_not_string`); that requires typed exception plumbing through common/chat.cpp into the server error path. Phase 1 alone prevents the cascade in the first place: clients see truncation as truncation, not as a malformed completed response.

Test coverage in test_compat_oai_responses.py:

* `test_responses_truncation_emits_incomplete_status`: non-streaming. `max_output_tokens: 2` on tinyllama2 reliably trips STOP_TYPE_LIMIT; assert status=incomplete + incomplete_details + per-item status.
* `test_responses_truncation_stream_emits_incomplete_event`: streaming. Same setup, verify a `response.incomplete` event arrives with the same payload shape.

This was referenced Jun 5, 2026

Responses API: misleading error on context overflow, must communicate token limit exceeded #19 (Open)
Hivemind Maintenance Tasks Epoch 3 #81 (Closed)
Hivemind Maintenance Tasks Epoch 4 #86 (Closed)
Hivemind Maintenance Tasks Epoch 5 #91 (Closed)

marksverdhei merged commit 87232cc into ht Jun 12, 2026
6 of 12 checks passed

marksverdhei deleted the fix/responses-api-truncation-status branch June 12, 2026 18:35

marksverdhei mentioned this pull request Jun 12, 2026

docs(readme): complete HT Fork Changes inventory with per-change justifications #106 (Merged)

marksverdhei merged 1 commit into ht from fix/responses-api-truncation-status
