[Bug]: Infinite retry loop caused by stream stale timeout during local LLM prefill phase #7069

@clayop

Bug Description

Describe the bug

When using Hermes Agent with a local LLM backend (especially a heavy one), the agent falls into an infinite loop of timeouts and immediate retries if the local model's prompt processing (prefill) time exceeds the default stream timeout (e.g., 180s).

The agent aborts the request ([stream_generate] Aborting request) before the first token is generated, causing the backend to interrupt the prefill (prefill interrupted). The agent then immediately retries the same heavy request, leading to a permanent deadlock where the model never gets enough time to finish the prefill phase.

Additional context

Increasing the stream timeout in the configuration serves as a temporary workaround, but the core issue lies in the retry logic aggressively looping without allowing the local backend sufficient time to complete prompt ingestion.

Steps to Reproduce

  1. Set up Hermes Agent with an MLX-based local LLM backend on an Apple Silicon machine (e.g., Mac M3 Ultra with 512GB Unified RAM).
  2. Load a massive local model, such as baa-ai/GLM-5.1-RAM-420GB-MLX.
  3. Send a complex prompt to the agent. Due to the sheer size of the model (420GB), the initial prompt processing (prefill) naturally takes longer than 180 seconds.
  4. Observe the logs: The agent times out after 180s, the backend throws a prefill interrupted error, and the agent immediately retries, starting the infinite loop.

Expected Behavior

  1. The agent should ideally have a separate, longer timeout configuration for the "Time To First Token" (TTFT) or prefill phase, distinct from the inter-token stream timeout.
  2. When a timeout occurs, the retry mechanism should implement an exponential backoff or limit the maximum number of retries, rather than immediately and endlessly spamming the backend with the same heavy context.
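To make the first point concrete, here is a minimal sketch of a stream wrapper that applies a long timeout to the first chunk only and a shorter one between later chunks. It is not Hermes code: `iter_with_phase_timeouts`, `FirstTokenTimeout`, `StreamStaleTimeout` and both parameter names are hypothetical, and it assumes the backend response is exposed as an async iterator of chunks.

```python
import asyncio
from typing import AsyncIterator, TypeVar

T = TypeVar("T")


class FirstTokenTimeout(TimeoutError):
    """No chunk arrived within the prefill (TTFT) budget."""


class StreamStaleTimeout(TimeoutError):
    """The stream went quiet after it had already produced a chunk."""


async def iter_with_phase_timeouts(
    stream: AsyncIterator[T],
    first_token_timeout: float,
    inter_token_timeout: float,
) -> AsyncIterator[T]:
    """Yield chunks from `stream` with a separate timeout per phase.

    Hypothetical helper: the wait for the first chunk (prefill) uses
    `first_token_timeout`; every later wait uses `inter_token_timeout`.
    """
    iterator = stream.__aiter__()
    timeout = first_token_timeout
    error = FirstTokenTimeout
    while True:
        try:
            chunk = await asyncio.wait_for(iterator.__anext__(), timeout)
        except StopAsyncIteration:
            return  # backend finished the response normally
        except asyncio.TimeoutError:
            raise error(f"no chunk within {timeout}s") from None
        yield chunk
        # The first chunk has arrived, so switch to the stale-stream budget.
        timeout = inter_token_timeout
        error = StreamStaleTimeout
```

With something like this, a caller could allow, say, 1800s for prefill while keeping the existing 180s stale check between tokens. The two distinct exception types also let the retry logic tell "prefill still running" apart from "stream died mid-response".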

Actual Behavior

The agent forcibly aborts the connection after the default 180-second timeout ([stream_generate] Aborting request). This causes the local backend to halt its prompt evaluation (prefill interrupted). Immediately after the timeout, the agent retries the exact same heavy request without any delay or backoff. This results in an infinite retry loop, completely preventing the model from ever finishing the prefill phase and returning a response.
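For contrast with the immediate, unbounded retry described above, a bounded retry with exponential backoff could look roughly like the sketch below. All names (`call_with_backoff`, `backoff_delays`, `RetriesExhausted`) and the default values are assumptions for illustration, not the agent's actual retry code.

```python
import random
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


class RetriesExhausted(RuntimeError):
    """Raised once the retry budget has been used up."""


def backoff_delays(max_retries: int, base_delay: float = 2.0,
                   max_delay: float = 60.0) -> List[float]:
    """Delay before each retry: base_delay * 2**n, capped at max_delay."""
    return [min(max_delay, base_delay * 2 ** n) for n in range(max_retries)]


def call_with_backoff(
    request: Callable[[int], T],
    *,
    max_retries: int = 3,
    base_delay: float = 2.0,
    max_delay: float = 60.0,
    jitter: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `request(attempt)` and retry on TimeoutError, at most max_retries times.

    Hypothetical helper. The attempt number is passed through so the caller
    can, for example, use a longer timeout on later attempts.
    """
    delays = backoff_delays(max_retries, base_delay, max_delay)
    last_error: Optional[BaseException] = None
    for attempt in range(max_retries + 1):
        try:
            return request(attempt)
        except TimeoutError as exc:
            last_error = exc
            if attempt == max_retries:
                break  # budget exhausted, stop instead of looping forever
            delay = delays[attempt]
            # Add a little random jitter on top of the exponential delay.
            sleep(delay + random.uniform(0, jitter * delay))
    raise RetriesExhausted(
        f"gave up after {max_retries + 1} attempts"
    ) from last_error
```

Backoff alone does not help if every attempt uses the same timeout that the prefill already exceeds, so the cap on attempts is the part that actually ends the loop. Passing `attempt` to `request` leaves room for raising the timeout on each retry.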

Affected Component

Other

Messaging Platform (if gateway-related)

Telegram

Operating System

macOS Tahoe

Python Version

3.11

Hermes Version

0.8.0

Relevant Logs / Traceback

Root Cause Analysis (optional)

No response

Proposed Fix (optional)

No response

Are you willing to submit a PR for this?

  • I'd like to fix this myself and submit a PR

Metadata

    Labels

    type/bug (Something isn't working)