[Bug]: Hermes Agent gets stuck indefinitely in the middle of execution #7180

### Bug Description

## ROOT CAUSE ANALYSIS: Indefinite Stuck Behavior (April 9-10, 2026)

### EXECUTIVE SUMMARY

Two distinct stuck patterns were identified across 4 sessions. Both trace back to the same root cause: **the GLM model family (specifically GLM-4.5-Air via Z.AI API) returning empty API responses** with no content and no tool calls, causing Hermes' retry loop to exhaust silently and emit "(empty)" responses that appear to the user as a hung/frozen session.

---

### AFFECTED SESSIONS

| Session | Model | Empty Responses | Nature |
|---------|-------|----------------|--------|
| 20260409_184803 | GLM-4.5-Air | 42 | Kanboard research + install. Terminal stuck at end. |
| 20260410_004640 | GLM-4.5-Air | 37 | Kanboard Docker install. Worked through it. |
| 20260409_152415 | glm-5.1 | 19 | Dev team playbook design. Mostly tool-only (normal). |
| 20260410_021022 | glm-5 | 9 | WORK_ITEMS.md migration. Minor stuck spots. |

---

### ROOT CAUSE #1: GLM-4.5-Air Empty API Responses (CRITICAL)

**What happened:** The GLM-4.5-Air model via Z.AI API consistently returned responses with NO text content and NO tool calls. This happened 42 times in one session and 37 times in another.

**The mechanism:**
1. Model returns an API response with `content=""` and no `tool_calls`
2. Hermes detects this as "truly empty" and retries up to 3 times (see `run_agent.py` line ~9336)
3. After 3 retries, it falls through to the `"(empty)"` terminal path (line ~9348)
4. The `(empty)` string is written as the assistant message
5. The conversation loop continues, but the model keeps returning empty responses
6. From the user's perspective, the agent is "stuck" -- nothing is happening, no output is appearing

**Why it looks like a hang:** The retry loop runs silently (no output to user). When retries exhaust, the "(empty)" message produces no visible output. The loop continues requesting more from the API, which keeps returning empty, creating an infinite-looking stuck state.
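The mechanism above can be sketched roughly as follows. This is an illustrative reconstruction from the description in this report, not the actual code in `run_agent.py`; `ModelResponse`, `is_truly_empty`, and `run_turn` are hypothetical names, and only the retry count of 3 and the `"(empty)"` placeholder come from the observed behavior.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

MAX_EMPTY_RETRIES = 3          # retry count as described in this report
EMPTY_PLACEHOLDER = "(empty)"  # placeholder string as described in this report


@dataclass
class ModelResponse:
    """Hypothetical stand-in for one API response."""
    content: str = ""
    tool_calls: List[dict] = field(default_factory=list)


def is_truly_empty(resp: ModelResponse) -> bool:
    # No text and no tool calls: nothing the agent can act on.
    return not resp.content.strip() and not resp.tool_calls


def run_turn(call_model: Callable[[], ModelResponse]) -> Tuple[str, int]:
    """Return (assistant_text, api_calls_made) for a single turn."""
    calls = 0
    for _ in range(1 + MAX_EMPTY_RETRIES):  # first attempt plus retries
        resp = call_model()
        calls += 1
        if not is_truly_empty(resp):
            return resp.content, calls
    # Retries exhausted; nothing was shown to the user along the way.
    return EMPTY_PLACEHOLDER, calls
```

With a model that always returns nothing, `run_turn` makes four API calls and still produces only the placeholder, which matches the silent behavior seen in session 184803.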

**Evidence from the Kanboard install session (184803):**
```
[92] tool: (50K unzip output)
[93] user: "What's the status"
[94] assistant: "(empty)"    <-- STUCK: model returned nothing
[95] user: "Is postgresql installed?"
[96] assistant: "(empty)"    <-- STUCK AGAIN: model returned nothing again
```

The session had to be interrupted with Ctrl+C at this point.

---

### ROOT CAUSE #2: Token Expiration (401 Auth Errors) Breaking Sessions

**What happened:** Multiple sessions hit `401 - "token expired or incorrect"` errors from the Z.AI API.

**Evidence from request dumps:**
```
session_20260409_190617: 401 "token expired or incorrect" (GLM-4.5-Air)
session_20260409_231931: 401 "token expired or incorrect" (glm-5.1)
session_20260410_003732: 401 "token expired or incorrect" (GLM-4.5-Air)
session_20260410_005455: 401 "token expired or incorrect" (GLM-4.5-Air)
```

This caused entire sessions to terminate and require a restart. That is why duplicate sessions exist: `152415` and `231931` contain identical conversations, because the first session died and had to be restarted.
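One way to avoid losing the whole session on a 401 is to refresh credentials and retry the request once before giving up. The sketch below is a minimal illustration under assumptions: `AuthExpiredError`, `send_request`, and `refresh_credentials` are hypothetical names, and whether Z.AI credentials can be refreshed programmatically is not confirmed by this report.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class AuthExpiredError(Exception):
    """Hypothetical exception raised when the provider returns HTTP 401."""


def call_with_auth_refresh(
    send_request: Callable[[], T],
    refresh_credentials: Callable[[], None],
    max_refreshes: int = 1,
) -> T:
    """Retry a request after refreshing credentials, at most max_refreshes times."""
    refreshes = 0
    while True:
        try:
            return send_request()
        except AuthExpiredError:
            if refreshes >= max_refreshes:
                raise  # refresh did not help; surface the error to the caller
            refresh_credentials()
            refreshes += 1
```

Capping the number of refreshes keeps a permanently invalid key from turning into another silent loop.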

---

### ROOT CAUSE #3: Benign Empty Responses (NOT a bug)

Many of the "empty" assistant messages in the dev-team session (152415) are actually **tool-call-only responses** -- the model returned a tool call with no explanatory text. This is NORMAL LLM behavior and was not a stuck issue. The session logging stores these as `content=""` but they have associated tool calls.
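When counting empty responses in a session log, the two cases can be separated by checking for tool calls. The sketch below assumes each logged message is a dict with `role`, `content`, and `tool_calls` keys; that schema and the function names are assumptions for illustration, not the actual Hermes session format.

```python
from collections import Counter
from typing import Iterable


def classify_assistant_message(msg: dict) -> str:
    """Label a logged assistant message as normal, tool_only, or truly_empty."""
    content = (msg.get("content") or "").strip()
    has_tool_calls = bool(msg.get("tool_calls"))
    if content in ("", "(empty)"):
        # Empty text with tool calls is benign; without them it is the bug.
        return "tool_only" if has_tool_calls else "truly_empty"
    return "normal"


def summarize_session(messages: Iterable[dict]) -> Counter:
    """Count assistant messages per category, ignoring other roles."""
    return Counter(
        classify_assistant_message(m)
        for m in messages
        if m.get("role") == "assistant"
    )
```

Only the `truly_empty` count indicates a stuck session; a high `tool_only` count, as in session 152415, is expected.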

---

### CONTRIBUTING FACTORS

1. **Large tool outputs as context poison:** The session that died (184803) had just received a 50K-character unzip output (message 92) right before getting stuck. Large, noisy tool outputs in the context may have caused the model to fail to generate meaningful responses.

2. **Long context windows:** Session 184803 had 97 messages (42 empty). The growing context likely degraded GLM-4.5-Air's ability to respond.

3. **No fallback activation:** When GLM-4.5-Air kept returning empty responses, Hermes did not fail over to a different model/provider. The empty-response retry path does not trigger fallback logic (only the rate-limit and invalid-response paths do).
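If large tool outputs are in fact a trigger, one mitigation would be to cap their size before they enter the context. The helper below is a hypothetical sketch, not an existing Hermes function: it keeps the head and tail of the output and replaces the middle with a marker. The 4000-character default is an arbitrary illustrative value.

```python
def truncate_tool_output(text: str, max_chars: int = 4000) -> str:
    """Keep the first and last max_chars // 2 characters of an oversized output."""
    if len(text) <= max_chars:
        return text
    half = max_chars // 2
    omitted = len(text) - 2 * half
    head = text[:half]
    tail = text[len(text) - half:]  # avoids the text[-0:] pitfall when half == 0
    return f"{head}\n[... {omitted} characters omitted ...]\n{tail}"
```

Keeping both ends preserves the command banner and the final status lines, which are usually the informative parts of an unzip or install log.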

---

### RECOMMENDATIONS

1. **Empty response should trigger fallback** -- After exhausting empty retries, Hermes should attempt `_try_activate_fallback()` before giving up with "(empty)". Currently, only rate-limit and invalid-response paths trigger fallback.

2. **User-visible feedback during retries** -- The retry loop for empty content runs silently. Adding a spinner or status indicator would prevent the "stuck" perception.

3. **Auto-compact on repeated empties** -- If the model returns empty 3+ times in sequence, trigger context compression before retrying. The large context may be the cause.

4. **Token refresh handling** -- The 401 auth errors suggest Z.AI tokens expire mid-session. Hermes should detect 401s and refresh the API key automatically (if the provider supports token refresh) rather than requiring a manual session restart.

5. **Model-specific handling for GLM-4.5-Air** -- This model has a significantly higher empty-response rate than glm-5 or glm-5.1. Consider treating GLM-4.5-Air as a less reliable model and adding extra retry or fallback logic specifically for it.
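Recommendations 1 through 3 could be combined into a single escalation path: retry with visible status, then compact the context, then switch to the fallback. The sketch below is one possible shape under assumptions; `recover_from_empty` and its callback parameters are hypothetical, and `activate_fallback` merely stands in for whatever `_try_activate_fallback()` does in Hermes.

```python
from typing import Callable, Optional


def recover_from_empty(
    call_model: Callable[[], str],
    notify: Callable[[str], None],
    compact_context: Callable[[], None],
    activate_fallback: Callable[[], bool],
    max_retries: int = 3,
) -> Optional[str]:
    """Escalate retry -> compaction -> fallback; return None if all fail."""
    for attempt in range(1, max_retries + 1):
        # Tell the user what is happening instead of retrying silently.
        notify(f"Empty response from model, retry {attempt}/{max_retries}")
        text = call_model()
        if text.strip():
            return text

    notify("Retries exhausted, compacting context")
    compact_context()
    text = call_model()
    if text.strip():
        return text

    notify("Still empty, switching to fallback model")
    if activate_fallback():
        text = call_model()
        if text.strip():
            return text
    return None  # caller should report a clear error rather than "(empty)"
```

Returning `None` instead of a placeholder string leaves it to the caller to show an explicit failure message, so the session never appears frozen.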

### Steps to Reproduce

Hard to reproduce.

### Expected Behavior

When the agent is stuck in a loop where the LLM repeatedly returns empty responses, there should be a configurable, time-bound fallback. Given the recently released async wake-up PR, there should be more graceful ways of handling this 'stuck' situation.
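A time-bound fallback could be built around a small watchdog that tracks how long it has been since the model last produced anything usable. The class below is a hypothetical sketch, not part of Hermes; the timeout would come from user configuration, and the injectable `clock` parameter exists only to make the logic testable.

```python
import time
from typing import Callable


class EmptyResponseWatchdog:
    """Signals when no useful model output has arrived within a time budget."""

    def __init__(
        self,
        timeout_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.timeout_seconds = timeout_seconds
        self._clock = clock
        self._last_progress = clock()

    def record(self, response_text: str, has_tool_calls: bool = False) -> None:
        # Text or tool calls both count as progress; empty responses do not.
        if response_text.strip() or has_tool_calls:
            self._last_progress = self._clock()

    def expired(self) -> bool:
        """True once the time since the last progress reaches the timeout."""
        return self._clock() - self._last_progress >= self.timeout_seconds
```

The conversation loop would call `record()` after each API response and check `expired()` before retrying; once it returns `True`, the loop can fall back to another model or stop with a clear message.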

### Actual Behavior

The Agent seems to be 'stuck'. Nothing happens on screen. The last visible text is the mid-way execution trace frozen in time without task completion.

### Affected Component

Agent Core (conversation loop, context compression, memory)

### Messaging Platform (if gateway-related)

_No response_

### Operating System

Ubuntu 24.04

### Python Version

3.14.2

### Hermes Version

v0.7.0 (2026.4.3)

### Relevant Logs / Traceback

```shell
# Session 20260409_184803 (Kanboard install)
[92] tool: (50K unzip output)
[93] user: "What's the status"
[94] assistant: "(empty)"    <-- STUCK: model returned nothing
[95] user: "Is postgresql installed?"
[96] assistant: "(empty)"    <-- STUCK AGAIN: model returned nothing again

# 401 errors from request dumps
session_20260409_190617: 401 "token expired or incorrect" (GLM-4.5-Air)
session_20260409_231931: 401 "token expired or incorrect" (glm-5.1)
session_20260410_003732: 401 "token expired or incorrect" (GLM-4.5-Air)
session_20260410_005455: 401 "token expired or incorrect" (GLM-4.5-Air)
```

### Root Cause Analysis (optional)

_No response_

### Proposed Fix (optional)

_No response_

### Are you willing to submit a PR for this?

- [ ] I'd like to fix this myself and submit a PR
