[Bug] Thinking model (glm-5-turbo) reasoning tokens exhaust output budget, producing empty responses with no recovery path

## Bug Description

When using `glm-5-turbo` (ZhipuAI) as the sole provider, Hermes returns literal `(empty)` responses on Discord/gateway mid-conversation. The root cause is that **reasoning tokens consume the entire output token budget**, leaving `content` as an empty string. Hermes has no mechanism to detect, budget, or recover from this scenario.

This is not a transient failure — it is **deterministic and progressive**: as conversation context grows, the model spends proportionally more tokens on reasoning, eventually leaving zero tokens for the actual response. Once it starts, every subsequent retry (including Hermes's prefill recovery) makes it worse because appended thinking-only messages grow the context further.

## Steps to Reproduce

1. Configure Hermes with `glm-5-turbo` as the only provider (ZAI endpoint: `https://open.bigmodel.cn/api/coding/paas/v4`)
2. Use gateway mode (Discord in my case, but likely affects all platforms)
3. Have a multi-turn conversation that accumulates tool call results (e.g., `session_search`, `read_file`, web search tools — 29 MCP tool schemas inject ~14.5K tokens of overhead per call)
4. After several turns, the model returns `reasoning_content` with substantive thinking but `content=""` and `finish_reason="length"`

## API Verification

Direct API test confirms the behavior:

```
Input tokens: ~53,000 (accumulated conversation context)
max_tokens: default (~16K)
Result: reasoning_tokens ~ 16,000, completion_tokens = 16,000, content = ""
finish_reason: "length"

max_tokens: 50 (stress test)
Result: reasoning_tokens = 50, completion_tokens = 50, content = ""
finish_reason: "length" (100% reasoning, 0% content)
```

The model always produces `reasoning_content`. When context is large, reasoning scales proportionally and can consume the entire budget.

## Why Hermes's Built-in Recovery Fails

1. **Prefill recovery is counterproductive** (`run_agent.py` ~L9028-9041): When `reasoning_content` exists but `content` is empty, Hermes appends the thinking-only message to conversation history and retries. This grows context → reasoning takes even more tokens → content stays empty. Each retry compounds the problem.

2. **`_last_content_with_tools` fallback** (L8992-9014): If a tool call preceded the empty response, Hermes uses the tool call's intermediate text as the final response and breaks immediately — bypassing retry mechanisms entirely. See #7968.

3. **Compression is too late**: Compression fires AFTER the API response. By then, the damage is done. With the default context_length guess of 128K (Hermes couldn't detect glm-5-turbo's actual limit), compression threshold was 64K — far above the ~53K context where the model was already failing.

## Workaround

Setting these in `config.yaml` mitigates the issue by triggering compression earlier:

```yaml
model:
  context_length: 32000  # real effective context for glm-5-turbo
compression:
  threshold: 0.4         # triggers at 12,800 tokens
```

This is a band-aid. The fundamental issue remains: Hermes has no reasoning token budget awareness.

## Comparison with OpenClaw

OpenClaw handles the same model (glm-5-turbo) without this problem. Key differences:

| Mechanism | Hermes | OpenClaw |
|---|---|---|
| Empty response handling | Sends literal `(empty)` to user | Reasoning suppression — silently drops thinking-only payloads at dispatch layer |
| Context budgeting | Reactive: compresses after API response | Proactive: `assemble(tokenBudget)` enforces budget before API call |
| Prefill recovery on empty | Appends thinking message + retries (grows context, makes it worse) | Short-circuits with `{ ok: true }`, no content sent |
| Context overflow error handling | Retries with same context | Classified as special error → triggers compaction/reduction |
| Tool schema overhead | All schemas sent every call (29 tools = ~14.5K tokens) | Bootstrap file truncation at 20K chars, lazy schema loading |

## Related Issues

- #6559 — Literal `(empty)` responses on Telegram (different root cause: Ollama concurrency, but same symptom and same missing recovery)
- #7968 — `_last_content_with_tools` fallback bypasses empty-response retries (PR #8024 submitted but not merged)
- #2128 — Reasoning-only fix leaks empty assistant message into Chat Completions API (closed, but only fixed Responses API path)
- #7729 — GLM-4.7 thinking budget exhausted (closed, fixed thinking tag detection — different but adjacent issue)

## Expected Behavior

1. Hermes should detect reasoning-only responses (has `reasoning_content` but empty `content`) and either:
   - Automatically increase `max_tokens` and retry (reasoning-aware budgeting)
   - Suppress the empty response entirely (like OpenClaw's reasoning suppression)
   - Trigger compression proactively before the next API call instead of after failure
2. The `(empty)` literal should never be sent to a user. At minimum, replace with a user-friendly error message.

## Environment

- **Hermes version**: v0.8.0 (2026.4.8)
- **Model**: glm-5-turbo via ZAI (`https://open.bigmodel.cn/api/coding/paas/v4`)
- **Provider**: OpenAI-compatible (no fallback available)
- **Platform**: Discord gateway
- **OS**: Ubuntu 24.04 LTS
- **Python**: 3.11.15
- **MCP servers**: 4 servers (zai_reader, zai_zread, zai_search, zai_vision) — 29 tool schemas

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug] Thinking model (glm-5-turbo) reasoning tokens exhaust output budget, producing empty responses with no recovery path #9344

Bug Description

Steps to Reproduce

API Verification

Why Hermes's Built-in Recovery Fails

Workaround

Comparison with OpenClaw

Related Issues

Expected Behavior

Environment

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Mechanism	Hermes	OpenClaw
Empty response handling	Sends literal `(empty)` to user	Reasoning suppression — silently drops thinking-only payloads at dispatch layer
Context budgeting	Reactive: compresses after API response	Proactive: `assemble(tokenBudget)` enforces budget before API call
Prefill recovery on empty	Appends thinking message + retries (grows context, makes it worse)	Short-circuits with `{ ok: true }`, no content sent
Context overflow error handling	Retries with same context	Classified as special error → triggers compaction/reduction
Tool schema overhead	All schemas sent every call (29 tools = ~14.5K tokens)	Bootstrap file truncation at 20K chars, lazy schema loading

[Bug] Thinking model (glm-5-turbo) reasoning tokens exhaust output budget, producing empty responses with no recovery path #9344

Description

Bug Description

Steps to Reproduce

API Verification

Why Hermes's Built-in Recovery Fails

Workaround

Comparison with OpenClaw

Related Issues

Expected Behavior

Environment

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions