Skip to content

[Bug]: Hermes Agent gets stuck indefinitely in the middle of execution #7180

@deepakdgupta1

Description

@deepakdgupta1

Bug Description

ROOT CAUSE ANALYSIS: Indefinite Stuck Behavior (April 9-10, 2026)

EXECUTIVE SUMMARY

Two distinct stuck patterns were identified across 4 sessions. Both trace back to the same root cause: the GLM model family (specifically GLM-4.5-Air via Z.AI API) returning empty API responses with no content and no tool calls, causing Hermes' retry loop to exhaust silently and emit "(empty)" responses that appear to the user as a hung/frozen session.


AFFECTED SESSIONS

Session Model Empty Responses Nature
20260409_184803 GLM-4.5-Air 42 Kanboard research + install. Terminal stuck at end.
20260410_004640 GLM-4.5-Air 37 Kanboard Docker install. Worked through it.
20260409_152415 glm-5.1 19 Dev team playbook design. Mostly tool-only (normal).
20260410_021022 glm-5 9 WORK_ITEMS.md migration. Minor stuck spots.

ROOT CAUSE #1: GLM-4.5-Air Empty API Responses (CRITICAL)

What happened: The GLM-4.5-Air model via Z.AI API consistently returned responses with NO text content and NO tool calls. This happened 42 times in one session and 37 times in another.

The mechanism:

  1. Model returns an API response with content="" and no tool_calls
  2. Hermes detects this as "truly empty" and retries up to 3 times (see run_agent.py line ~9336)
  3. After 3 retries, it falls through to the "(empty)" terminal path (line ~9348)
  4. The (empty) string is written as the assistant message
  5. The conversation loop continues, but the model keeps returning empty responses
  6. From the user's perspective, the agent is "stuck" -- nothing is happening, no output is appearing

Why it looks like a hang: The retry loop runs silently (no output to user). When retries exhaust, the "(empty)" message produces no visible output. The loop continues requesting more from the API, which keeps returning empty, creating an infinite-looking stuck state.

Evidence from the Kanboard install session (184803):

[92] tool: (50K unzip output)
[93] user: "What's the status"
[94] assistant: "(empty)"    <-- STUCK: model returned nothing
[95] user: "Is postgresql installed?"
[96] assistant: "(empty)"    <-- STUCK AGAIN: model returned nothing again

The session had to be Ctrl+C'd at this point.


ROOT CAUSE #2: Token Expiration (401 Auth Errors) Breaking Sessions

What happened: Multiple sessions hit 401 - "token expired or incorrect" errors from the Z.AI API.

Evidence from request dumps:

session_20260409_190617: 401 "token expired or incorrect" (GLM-4.5-Air)
session_20260409_231931: 401 "token expired or incorrect" (glm-5.1)
session_20260410_003732: 401 "token expired or incorrect" (GLM-4.5-Air)
session_20260410_005455: 401 "token expired or incorrect" (GLM-4.5-Air)

This caused entire sessions to terminate and need to be restarted (that's why there are duplicate sessions like 152415 and 231931 which are identical conversations -- you had to restart).


ROOT CAUSE #3: Benign Empty Responses (NOT a bug)

Many of the "empty" assistant messages in the dev-team session (152415) are actually tool-call-only responses -- the model returned a tool call with no explanatory text. This is NORMAL LLM behavior and was not a stuck issue. The session logging stores these as content="" but they have associated tool calls.


CONTRIBUTING FACTORS

  1. Large tool outputs as context poison: The session that died (184803) had just received a 50K-character unzip output (message 92) right before getting stuck. Large, noisy tool outputs in the context may have caused the model to fail to generate meaningful responses.

  2. Long context windows: Session 184803 had 97 messages (42 empty). The growing context likely degraded GLM-4.5-Air's ability to respond.

  3. No fallback activation: When GLM-4.5-Air kept returning empty, Hermes did not failover to a different model/provider. The empty-response retry path does not trigger fallback logic (only rate-limit and invalid-response paths do).


RECOMMENDATIONS

  1. Empty response should trigger fallback -- After exhausting empty retries, Hermes should attempt _try_activate_fallback() before giving up with "(empty)". Currently, only rate-limit and invalid-response paths trigger fallback.

  2. User-visible feedback during retries -- The retry loop for empty content runs silently. Adding a spinner or status indicator would prevent the "stuck" perception.

  3. Auto-compact on repeated empties -- If the model returns empty 3+ times in sequence, trigger context compression before retrying. The large context may be the cause.

  4. Token refresh handling -- The 401 auth errors suggest Z.AI tokens expire mid-session. Hermes should detect 401s and refresh the API key automatically (if the provider supports token refresh) rather than requiring a manual session restart.

  5. Model-specific handling for GLM-4.5-Air -- This model has a significantly higher empty-response rate than glm-5 or glm-5.1. Consider treating GLM-4.5-Air as a less reliable model and adding extra retry or fallback logic specifically for it

Steps to Reproduce

Hard to reproduce.

Expected Behavior

When the agent gets stuck in a loop wherein the LLM response is empty on a continued basis, there must be a configurable time-bound fallback. Given the latest PR release of async wake-up, there must be more graceful ways of handling this 'stuck' situation.

Actual Behavior

The Agent seems to be 'stuck'. Nothing happens on screen. The last visible text is the mid-way execution trace frozen in time without task completion.

Affected Component

Agent Core (conversation loop, context compression, memory)

Messaging Platform (if gateway-related)

No response

Operating System

Ubuntu 24.04

Python Version

3.14.2

Hermes Version

v0.7.0 (2026.4.3)

Relevant Logs / Traceback

## ROOT CAUSE ANALYSIS: Indefinite Stuck Behavior (April 9-10, 2026)

### EXECUTIVE SUMMARY

Two distinct stuck patterns were identified across 4 sessions. Both trace back to the same root cause: **the GLM model family (specifically GLM-4.5-Air via Z.AI API) returning empty API responses** with no content and no tool calls, causing Hermes' retry loop to exhaust silently and emit "(empty)" responses that appear to the user as a hung/frozen session.

---

### AFFECTED SESSIONS

| Session | Model | Empty Responses | Nature |
|---------|-------|----------------|--------|
| 20260409_184803 | GLM-4.5-Air | 42 | Kanboard research + install. Terminal stuck at end. |
| 20260410_004640 | GLM-4.5-Air | 37 | Kanboard Docker install. Worked through it. |
| 20260409_152415 | glm-5.1 | 19 | Dev team playbook design. Mostly tool-only (normal). |
| 20260410_021022 | glm-5 | 9 | WORK_ITEMS.md migration. Minor stuck spots. |

---

### ROOT CAUSE #1: GLM-4.5-Air Empty API Responses (CRITICAL)

**What happened:** The GLM-4.5-Air model via Z.AI API consistently returned responses with NO text content and NO tool calls. This happened 42 times in one session and 37 times in another.

**The mechanism:**
1. Model returns an API response with `content=""` and no `tool_calls`
2. Hermes detects this as "truly empty" and retries up to 3 times (see `run_agent.py` line ~9336)
3. After 3 retries, it falls through to the `"(empty)"` terminal path (line ~9348)
4. The `(empty)` string is written as the assistant message
5. The conversation loop continues, but the model keeps returning empty responses
6. From the user's perspective, the agent is "stuck" -- nothing is happening, no output is appearing

**Why it looks like a hang:** The retry loop runs silently (no output to user). When retries exhaust, the "(empty)" message produces no visible output. The loop continues requesting more from the API, which keeps returning empty, creating an infinite-looking stuck state.

**Evidence from the Kanboard install session (184803):**

[92] tool: (50K unzip output)
[93] user: "What's the status"
[94] assistant: "(empty)"    <-- STUCK: model returned nothing
[95] user: "Is postgresql installed?"
[96] assistant: "(empty)"    <-- STUCK AGAIN: model returned nothing again


The session had to be Ctrl+C'd at this point.

---

### ROOT CAUSE #2: Token Expiration (401 Auth Errors) Breaking Sessions

**What happened:** Multiple sessions hit `401 - "token expired or incorrect"` errors from the Z.AI API.

**Evidence from request dumps:**

session_20260409_190617: 401 "token expired or incorrect" (GLM-4.5-Air)
session_20260409_231931: 401 "token expired or incorrect" (glm-5.1)
session_20260410_003732: 401 "token expired or incorrect" (GLM-4.5-Air)
session_20260410_005455: 401 "token expired or incorrect" (GLM-4.5-Air)


This caused entire sessions to terminate and need to be restarted (that's why there are duplicate sessions like `152415` and `231931` which are identical conversations -- you had to restart).

---

### ROOT CAUSE #3: Benign Empty Responses (NOT a bug)

Many of the "empty" assistant messages in the dev-team session (152415) are actually **tool-call-only responses** -- the model returned a tool call with no explanatory text. This is NORMAL LLM behavior and was not a stuck issue. The session logging stores these as `content=""` but they have associated tool calls.

---

### CONTRIBUTING FACTORS

1. **Large tool outputs as context poison:** The session that died (184803) had just received a 50K-character unzip output (message 92) right before getting stuck. Large, noisy tool outputs in the context may have caused the model to fail to generate meaningful responses.

2. **Long context windows:** Session 184803 had 97 messages (42 empty). The growing context likely degraded GLM-4.5-Air's ability to respond.

3. **No fallback activation:** When GLM-4.5-Air kept returning empty, Hermes did not failover to a different model/provider. The empty-response retry path does not trigger fallback logic (only rate-limit and invalid-response paths do).

---

### RECOMMENDATIONS

1. **Empty response should trigger fallback** -- After exhausting empty retries, Hermes should attempt `_try_activate_fallback()` before giving up with "(empty)". Currently, only rate-limit and invalid-response paths trigger fallback.

2. **User-visible feedback during retries** -- The retry loop for empty content runs silently. Adding a spinner or status indicator would prevent the "stuck" perception.

3. **Auto-compact on repeated empties** -- If the model returns empty 3+ times in sequence, trigger context compression before retrying. The large context may be the cause.

4. **Token refresh handling** -- The 401 auth errors suggest Z.AI tokens expire mid-session. Hermes should detect 401s and refresh the API key automatically (if the provider supports token refresh) rather than requiring a manual session restart.

5. **Model-specific handling for GLM-4.5-Air** -- This model has a significantly higher empty-response rate than glm-5 or glm-5.1. Consider treating GLM-4.5-Air as a less reliable model and adding extra retry or fallback logic specifically for it

Root Cause Analysis (optional)

No response

Proposed Fix (optional)

No response

Are you willing to submit a PR for this?

  • I'd like to fix this myself and submit a PR

Metadata

Metadata

Assignees

No one assigned

    Labels

    type/bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions