You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
When using glm-5-turbo (ZhipuAI) as the sole provider, Hermes returns literal (empty) responses on Discord/gateway mid-conversation. The root cause is that reasoning tokens consume the entire output token budget, leaving content as an empty string. Hermes has no mechanism to detect, budget, or recover from this scenario.
This is not a transient failure — it is deterministic and progressive: as conversation context grows, the model spends proportionally more tokens on reasoning, eventually leaving zero tokens for the actual response. Once it starts, every subsequent retry (including Hermes's prefill recovery) makes it worse because appended thinking-only messages grow the context further.
Steps to Reproduce
Configure Hermes with glm-5-turbo as the only provider (ZAI endpoint: https://open.bigmodel.cn/api/coding/paas/v4)
Use gateway mode (Discord in my case, but likely affects all platforms)
Have a multi-turn conversation that accumulates tool call results (e.g., session_search, read_file, web search tools — 29 MCP tool schemas inject ~14.5K tokens of overhead per call)
After several turns, the model returns reasoning_content with substantive thinking but content="" and finish_reason="length"
The model always produces reasoning_content. When context is large, reasoning scales proportionally and can consume the entire budget.
Why Hermes's Built-in Recovery Fails
Prefill recovery is counterproductive (run_agent.py ~L9028-9041): When reasoning_content exists but content is empty, Hermes appends the thinking-only message to conversation history and retries. This grows context → reasoning takes even more tokens → content stays empty. Each retry compounds the problem.
Compression is too late: Compression fires AFTER the API response. By then, the damage is done. With the default context_length guess of 128K (Hermes couldn't detect glm-5-turbo's actual limit), compression threshold was 64K — far above the ~53K context where the model was already failing.
Workaround
Setting these in config.yaml mitigates the issue by triggering compression earlier:
model:
context_length: 32000# real effective context for glm-5-turbocompression:
threshold: 0.4# triggers at 12,800 tokens
This is a band-aid. The fundamental issue remains: Hermes has no reasoning token budget awareness.
Comparison with OpenClaw
OpenClaw handles the same model (glm-5-turbo) without this problem. Key differences:
Mechanism
Hermes
OpenClaw
Empty response handling
Sends literal (empty) to user
Reasoning suppression — silently drops thinking-only payloads at dispatch layer
Context budgeting
Reactive: compresses after API response
Proactive: assemble(tokenBudget) enforces budget before API call
Prefill recovery on empty
Appends thinking message + retries (grows context, makes it worse)
Short-circuits with { ok: true }, no content sent
Context overflow error handling
Retries with same context
Classified as special error → triggers compaction/reduction
Tool schema overhead
All schemas sent every call (29 tools = ~14.5K tokens)
Bootstrap file truncation at 20K chars, lazy schema loading
Bug Description
When using
glm-5-turbo(ZhipuAI) as the sole provider, Hermes returns literal(empty)responses on Discord/gateway mid-conversation. The root cause is that reasoning tokens consume the entire output token budget, leavingcontentas an empty string. Hermes has no mechanism to detect, budget, or recover from this scenario.This is not a transient failure — it is deterministic and progressive: as conversation context grows, the model spends proportionally more tokens on reasoning, eventually leaving zero tokens for the actual response. Once it starts, every subsequent retry (including Hermes's prefill recovery) makes it worse because appended thinking-only messages grow the context further.
Steps to Reproduce
glm-5-turboas the only provider (ZAI endpoint:https://open.bigmodel.cn/api/coding/paas/v4)session_search,read_file, web search tools — 29 MCP tool schemas inject ~14.5K tokens of overhead per call)reasoning_contentwith substantive thinking butcontent=""andfinish_reason="length"API Verification
Direct API test confirms the behavior:
The model always produces
reasoning_content. When context is large, reasoning scales proportionally and can consume the entire budget.Why Hermes's Built-in Recovery Fails
Prefill recovery is counterproductive (
run_agent.py~L9028-9041): Whenreasoning_contentexists butcontentis empty, Hermes appends the thinking-only message to conversation history and retries. This grows context → reasoning takes even more tokens → content stays empty. Each retry compounds the problem._last_content_with_toolsfallback (L8992-9014): If a tool call preceded the empty response, Hermes uses the tool call's intermediate text as the final response and breaks immediately — bypassing retry mechanisms entirely. See [Bug]: _last_content_with_tools fallback bypasses empty-response retries, causing silent agent loop termination mid-task #7968.Compression is too late: Compression fires AFTER the API response. By then, the damage is done. With the default context_length guess of 128K (Hermes couldn't detect glm-5-turbo's actual limit), compression threshold was 64K — far above the ~53K context where the model was already failing.
Workaround
Setting these in
config.yamlmitigates the issue by triggering compression earlier:This is a band-aid. The fundamental issue remains: Hermes has no reasoning token budget awareness.
Comparison with OpenClaw
OpenClaw handles the same model (glm-5-turbo) without this problem. Key differences:
(empty)to userassemble(tokenBudget)enforces budget before API call{ ok: true }, no content sentRelated Issues
(empty)responses on Telegram (different root cause: Ollama concurrency, but same symptom and same missing recovery)_last_content_with_toolsfallback bypasses empty-response retries (PR fix: move _last_content_with_tools fallback after retry mechanisms #8024 submitted but not merged)Expected Behavior
reasoning_contentbut emptycontent) and either:max_tokensand retry (reasoning-aware budgeting)(empty)literal should never be sent to a user. At minimum, replace with a user-friendly error message.Environment
https://open.bigmodel.cn/api/coding/paas/v4)