Skip to content

[Bug]: Expose model reasoning/thinking blocks in /v1/chat/completions (fix #37044)#37067

Closed
alaamohanad169-ship-it wants to merge 1 commit into
NousResearch:mainfrom
alaamohanad169-ship-it:auto-fix-37044
Closed

[Bug]: Expose model reasoning/thinking blocks in /v1/chat/completions (fix #37044)#37067
alaamohanad169-ship-it wants to merge 1 commit into
NousResearch:mainfrom
alaamohanad169-ship-it:auto-fix-37044

Conversation

@alaamohanad169-ship-it

Copy link
Copy Markdown
Contributor

Summary

Fixes #37044.

The OpenAI-compatible API server adapter was silently dropping the last_reasoning field from the agent result, so downstream UIs (Open WebUI, LobeChat, etc.) connected to the Hermes gateway never saw the model's chain-of-thought even when the model produced a visible reasoning block in the TUI / CLI.

This change adds an opt-in X-Hermes-Expose-Reasoning request header. When set to a truthy value, the gateway surfaces the reasoning/thinking content in BOTH paths:

  • Non-streaming /v1/chat/completions: message.reasoning_content and message.reasoning are added to the assistant message
  • Streaming /v1/chat/completions SSE: delta.reasoning_content and delta.reasoning chunks on the chat.completion.chunk stream

reasoning_content is the de facto standard used by Open WebUI / DeepSeek / OpenRouter / Nous Portal; reasoning is the OpenAI-native field name. Both are emitted for maximum client compatibility.

Wire-format compatibility

Default behaviour is unchanged — no reasoning fields are emitted unless the client opts in via the header, so strict OpenAI parsers that don't know about the extension won't break.

POST /v1/chat/completions
X-Hermes-Expose-Reasoning: true

Changes

  • gateway/platforms/api_server.py:

    • Added _parse_expose_reasoning_header helper (reuses _coerce_request_bool so true|false|1|0|yes|no|on|off all work)
    • Non-streaming: when opted in, add reasoning_content + reasoning to the assistant message
    • Streaming: wired a tool_progress_callback that captures reasoning.available events from the agent and emits them as delta.reasoning_content chunks on the SSE stream
    • Added MAX_REASONING_CHUNK_BYTES (256 KB) defensive cap on per-chunk reasoning payload to prevent memory/bandwidth abuse
  • tests/gateway/test_api_server.py: 6 new tests under TestChatCompletionsEndpoint:

    • test_non_streaming_omits_reasoning_by_default
    • test_non_streaming_exposes_reasoning_when_header_set
    • test_non_streaming_expose_reasoning_header_false_omits
    • test_streaming_omits_reasoning_chunks_by_default
    • test_streaming_exposes_reasoning_chunks_when_header_set
    • test_streaming_reasoning_chunk_capped_at_max_size

Test plan

  • Full tests/gateway/test_api_server.py suite: 162/162 pass
  • tests/run_agent/test_run_agent.py, test_partial_stream_finish_reason.py, test_streaming.py: 401/401 pass
  • Independent reviewer subagent: no security or logic defects (4 non-blocking suggestions, all addressed)
  • Static security scan: clean

Risk & scope

  • Low risk — additive change, default behavior preserved
  • No breaking changes — existing clients see identical responses
  • Scope is contained — touches only gateway/platforms/api_server.py + matching tests
  • No config schema change — opt-in via request header instead of a new gateway.expose_reasoning config flag (per-client granularity without a global config change)

🤖 Generated with [Claude Code]

… /v1/chat/completions (NousResearch#37044)

The OpenAI-compatible API server adapter was silently dropping
`last_reasoning` from the agent result, so downstream UIs (Open
WebUI, LobeChat, etc.) connected to the Hermes gateway never saw
the model's chain-of-thought even when the model produced a
visible reasoning block in the TUI / CLI.

This fix adds an opt-in `X-Hermes-Expose-Reasoning` request
header.  When set to a truthy value, the gateway surfaces the
reasoning/thinking content in BOTH paths:

* non-streaming: `message.reasoning_content` and `message.reasoning`
  (Open WebUI consumes `reasoning_content`, OpenAI-native clients
  consume `reasoning`)
* streaming: `delta.reasoning_content` and `delta.reasoning`
  chunks on the chat.completion.chunk SSE stream

Default behaviour is unchanged: no reasoning fields are emitted
unless the client opts in, preserving wire-format compatibility
with strict OpenAI parsers.

Also adds:
- a 256 KB defensive cap on per-chunk reasoning payload to bound
  memory + bandwidth against a malicious or buggy provider
- 6 regression tests under TestChatCompletionsEndpoint covering
  the default-omit, opt-in, explicit-opt-out, streaming
  reasoning chunks, and size-cap cases
@alt-glitch alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/gateway Gateway runner, session dispatch, delivery platform/webhook Webhook / API server labels Jun 2, 2026
@alaamohanad169-ship-it alaamohanad169-ship-it marked this pull request as ready for review June 3, 2026 00:00
@alaamohanad169-ship-it

Copy link
Copy Markdown
Contributor Author

👋 Friendly nudge — this PR exposes model reasoning/thinking blocks in /v1/chat/completions responses. ✅ CI green, mergeable. Would love a review when someone gets a chance.

@alaamohanad169-ship-it

Copy link
Copy Markdown
Contributor Author

@OutThisLife — exposes model reasoning/thinking blocks in /v1/chat/completions. CI green, mergeable. Would appreciate a review.

@alaamohanad169-ship-it alaamohanad169-ship-it deleted the auto-fix-37044 branch June 6, 2026 02:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery P3 Low — cosmetic, nice to have platform/webhook Webhook / API server type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: API server gateway does not expose model reasoning/thinking blocks in /v1/chat/completions responses

2 participants