Skip to content

[Bug] Thinking model (glm-5-turbo) reasoning tokens exhaust output budget, producing empty responses with no recovery path #9344

@mwtxr

Description

@mwtxr

Bug Description

When using glm-5-turbo (ZhipuAI) as the sole provider, Hermes returns literal (empty) responses on Discord/gateway mid-conversation. The root cause is that reasoning tokens consume the entire output token budget, leaving content as an empty string. Hermes has no mechanism to detect, budget, or recover from this scenario.

This is not a transient failure — it is deterministic and progressive: as conversation context grows, the model spends proportionally more tokens on reasoning, eventually leaving zero tokens for the actual response. Once it starts, every subsequent retry (including Hermes's prefill recovery) makes it worse because appended thinking-only messages grow the context further.

Steps to Reproduce

  1. Configure Hermes with glm-5-turbo as the only provider (ZAI endpoint: https://open.bigmodel.cn/api/coding/paas/v4)
  2. Use gateway mode (Discord in my case, but likely affects all platforms)
  3. Have a multi-turn conversation that accumulates tool call results (e.g., session_search, read_file, web search tools — 29 MCP tool schemas inject ~14.5K tokens of overhead per call)
  4. After several turns, the model returns reasoning_content with substantive thinking but content="" and finish_reason="length"

API Verification

Direct API test confirms the behavior:

Input tokens: ~53,000 (accumulated conversation context)
max_tokens: default (~16K)
Result: reasoning_tokens ~ 16,000, completion_tokens = 16,000, content = ""
finish_reason: "length"

max_tokens: 50 (stress test)
Result: reasoning_tokens = 50, completion_tokens = 50, content = ""
finish_reason: "length" (100% reasoning, 0% content)

The model always produces reasoning_content. When context is large, reasoning scales proportionally and can consume the entire budget.

Why Hermes's Built-in Recovery Fails

  1. Prefill recovery is counterproductive (run_agent.py ~L9028-9041): When reasoning_content exists but content is empty, Hermes appends the thinking-only message to conversation history and retries. This grows context → reasoning takes even more tokens → content stays empty. Each retry compounds the problem.

  2. _last_content_with_tools fallback (L8992-9014): If a tool call preceded the empty response, Hermes uses the tool call's intermediate text as the final response and breaks immediately — bypassing retry mechanisms entirely. See [Bug]: _last_content_with_tools fallback bypasses empty-response retries, causing silent agent loop termination mid-task #7968.

  3. Compression is too late: Compression fires AFTER the API response. By then, the damage is done. With the default context_length guess of 128K (Hermes couldn't detect glm-5-turbo's actual limit), compression threshold was 64K — far above the ~53K context where the model was already failing.

Workaround

Setting these in config.yaml mitigates the issue by triggering compression earlier:

model:
  context_length: 32000  # real effective context for glm-5-turbo
compression:
  threshold: 0.4         # triggers at 12,800 tokens

This is a band-aid. The fundamental issue remains: Hermes has no reasoning token budget awareness.

Comparison with OpenClaw

OpenClaw handles the same model (glm-5-turbo) without this problem. Key differences:

Mechanism Hermes OpenClaw
Empty response handling Sends literal (empty) to user Reasoning suppression — silently drops thinking-only payloads at dispatch layer
Context budgeting Reactive: compresses after API response Proactive: assemble(tokenBudget) enforces budget before API call
Prefill recovery on empty Appends thinking message + retries (grows context, makes it worse) Short-circuits with { ok: true }, no content sent
Context overflow error handling Retries with same context Classified as special error → triggers compaction/reduction
Tool schema overhead All schemas sent every call (29 tools = ~14.5K tokens) Bootstrap file truncation at 20K chars, lazy schema loading

Related Issues

Expected Behavior

  1. Hermes should detect reasoning-only responses (has reasoning_content but empty content) and either:
    • Automatically increase max_tokens and retry (reasoning-aware budgeting)
    • Suppress the empty response entirely (like OpenClaw's reasoning suppression)
    • Trigger compression proactively before the next API call instead of after failure
  2. The (empty) literal should never be sent to a user. At minimum, replace with a user-friendly error message.

Environment

  • Hermes version: v0.8.0 (2026.4.8)
  • Model: glm-5-turbo via ZAI (https://open.bigmodel.cn/api/coding/paas/v4)
  • Provider: OpenAI-compatible (no fallback available)
  • Platform: Discord gateway
  • OS: Ubuntu 24.04 LTS
  • Python: 3.11.15
  • MCP servers: 4 servers (zai_reader, zai_zread, zai_search, zai_vision) — 29 tool schemas

Metadata

Metadata

Assignees

No one assigned

    Labels

    P1High — major feature broken, no workaroundcomp/agentCore agent loop, run_agent.py, prompt builderprovider/zaiZAI providertype/bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions