Gateway infinite loop: log handler duplication + unbounded interrupt recursion #816

@DaAwesomeRazor

Bug Description

Two bugs combine to create an infinite loop / resource exhaustion in the gateway when the agent hits repeated errors (e.g., context overflow from #813):

1. Log handler duplication — every log line written N times after N messages

AIAgent.__init__() unconditionally calls logging.getLogger().addHandler(_error_file_handler) (line ~310 in run_agent.py). The gateway creates a new AIAgent for every incoming message. After 20 messages in a session, every single log line gets written 20 times to errors.log.

Evidence from production — the same error line repeated 22+ times simultaneously:

2026-03-10 01:58:17,956 ERROR root: API call failed after 6 retries...
2026-03-10 01:58:17,956 ERROR root: API call failed after 6 retries...
2026-03-10 01:58:17,956 ERROR root: API call failed after 6 retries...
... (22 identical lines at the same timestamp)
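The duplication can be reproduced in isolation. The following is a minimal sketch, not the project's code: `Agent` and `count_copies` are hypothetical names, and it assumes each `AIAgent` builds a fresh handler object in `__init__` (`Logger.addHandler` ignores a handler object that is already attached, so re-adding one shared object would not duplicate output). It uses an isolated logger and an in-memory stream rather than the root logger and errors.log.

```python
import io
import logging


class Agent:
    """Hypothetical stand-in for AIAgent: attaches a handler on every construction."""

    def __init__(self, handler: logging.Handler, logger: logging.Logger) -> None:
        # Mirrors the unconditional addHandler() call described above.
        logger.addHandler(handler)


def count_copies(num_messages: int) -> int:
    """Create one Agent per message, log a single error, count the lines written."""
    stream = io.StringIO()
    logger = logging.getLogger(f"dup_demo_{num_messages}")  # isolated, not the root logger
    logger.propagate = False
    logger.setLevel(logging.ERROR)
    for _ in range(num_messages):
        # A fresh handler object per agent, all writing to the same destination.
        Agent(logging.StreamHandler(stream), logger)
    logger.error("API call failed after 6 retries...")
    copies = len(stream.getvalue().splitlines())
    for handler in list(logger.handlers):  # clean up so repeated calls start fresh
        logger.removeHandler(handler)
    return copies


print(count_copies(20))
```

One `logger.error()` call yields one line per attached handler, which matches the identical-timestamp repeats in the production log.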

2. Unbounded interrupt recursion in _run_agent

When a new message arrives while the agent is processing, the gateway interrupts the current run and _run_agent recursively calls itself (line ~2895 in gateway/run.py) with the pending message. There is no depth limit.

If the agent keeps failing (context too large, API returning 400/502, etc.) and the user sends more messages, each recursive call spawns a new agent that also fails, which can be interrupted by yet another message, recursing indefinitely.
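The shape of the recursion can be modelled without the gateway. This is a simplified, synchronous sketch under stated assumptions: `run_agent` and the `pending` queue are hypothetical stand-ins for `_run_agent` and the gateway's pending-message state, and every run is assumed to fail or be interrupted.

```python
from collections import deque


def run_agent(message: str, pending: deque, depth: int = 0) -> int:
    """Hypothetical model of _run_agent; returns the deepest nesting level reached."""
    # The current run fails or is interrupted. If another message arrived in the
    # meantime, recurse with it instead of returning to the caller.
    if pending:
        return run_agent(pending.popleft(), pending, depth + 1)
    return depth


deepest = run_agent("first", deque(f"msg {i}" for i in range(50)))
print(deepest)
```

Nesting depth tracks the number of messages that arrive during failing runs, so nothing in this structure bounds it.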

Steps to Reproduce

  1. Have an active Discord/Telegram session with a large conversation history
  2. Hit a persistent API error (e.g., the context overflow from #813, where Anthropic's "prompt is too long" 400 error is not detected as a context length error and the run aborts instead of compressing, or an API outage)
  3. Send 2-3 more messages while the agent is retrying/failing
  4. Each message interrupts the current run and triggers a recursive _run_agent call
  5. The gateway gets stuck in a loop of failing agent runs, each adding another log handler
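  6. errors.log grows roughly quadratically (the k-th agent run writes each of its log lines k times, so N runs produce on the order of N*N lines) and the CPU spins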

Fix Applied

Log handler deduplication (run_agent.py)

Added a sentinel attribute (_hermes_error_log) to the handler and a check before adding:

_already_has_error_log = any(
    getattr(h, '_hermes_error_log', False) for h in _root_logger.handlers
)
if not _already_has_error_log:
    # ... create and add handler ...
    _error_file_handler._hermes_error_log = True
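For reference, a self-contained version of the same idea might look like the sketch below. Only the `_hermes_error_log` attribute name comes from the fix above. The helper name `ensure_error_log_handler`, its parameters, and the choice of `logging.FileHandler` are assumptions for illustration and are not the actual contents of run_agent.py.

```python
import logging

_SENTINEL = "_hermes_error_log"  # attribute name taken from the fix above


def ensure_error_log_handler(logger: logging.Logger, path: str) -> bool:
    """Attach the error-file handler once; return True only if it was added now."""
    if any(getattr(h, _SENTINEL, False) for h in logger.handlers):
        return False  # a marked handler is already attached, nothing to do
    # delay=True postpones opening the file until the first record is emitted.
    handler = logging.FileHandler(path, delay=True)
    handler.setLevel(logging.ERROR)
    setattr(handler, _SENTINEL, True)  # mark before attaching
    logger.addHandler(handler)
    return True
```

Calling it from every agent constructor is then safe: the first call attaches the handler and later calls return `False` without touching the logger.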

Interrupt recursion depth cap (gateway/run.py)

Added _interrupt_depth parameter to _run_agent, capped at MAX_INTERRUPT_DEPTH = 3:

if _interrupt_depth >= MAX_INTERRUPT_DEPTH:
    return {
        "final_response": "Too many rapid messages while processing. Please wait a moment and try again.",
        ...
    }

The recursive call now passes _interrupt_depth=_interrupt_depth + 1.
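The effect of the cap can be checked with a small synchronous model. As before, this is a sketch rather than the gateway code: `run_agent` and the `pending` queue are hypothetical, the `depth` key exists only so the example can be inspected, and only `MAX_INTERRUPT_DEPTH`, `_interrupt_depth`, and the response text come from the fix above.

```python
from collections import deque
from typing import Any

MAX_INTERRUPT_DEPTH = 3  # value from the fix above


def run_agent(message: str, pending: deque, _interrupt_depth: int = 0) -> dict[str, Any]:
    """Hypothetical model of _run_agent with the interrupt depth cap applied."""
    if _interrupt_depth >= MAX_INTERRUPT_DEPTH:
        # Stop recursing and tell the user to slow down.
        return {
            "final_response": "Too many rapid messages while processing. "
                              "Please wait a moment and try again.",
            "depth": _interrupt_depth,  # illustrative field, not in the real result
        }
    if pending:
        # Interrupted by a newer message: recurse one level deeper.
        return run_agent(pending.popleft(), pending, _interrupt_depth=_interrupt_depth + 1)
    return {"final_response": f"handled: {message}", "depth": _interrupt_depth}


queue = deque(["a", "b", "c", "d", "e"])
result = run_agent("first", queue)
print(result["depth"], len(queue))
```

In this model, messages still queued when the cap is hit stay in `pending`. Whether the real gateway drops or later processes such messages is not stated in this issue.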
