gateway agent timeout (HERMES_AGENT_TIMEOUT) kills legitimate long-running tasks #4815

@BongSuCHOI

Problem

Commit 970042de introduced a hard 10-minute timeout (HERMES_AGENT_TIMEOUT, default 600s) on gateway agent execution via asyncio.wait_for(). This was designed to prevent stuck sessions, but it also kills legitimate long-running tasks — particularly subagent/delegate work with reasoning-heavy models.

When a user sets the agent to work autonomously (e.g., multi-step research, code implementation via delegate_task, or reasoning models with long chain-of-thought), hitting the 10-minute wall results in:

  1. Agent is force-interrupted mid-work (agent.interrupt())
  2. All intermediate progress is lost
  3. User gets a generic error: "Request timed out after 10 minutes. The agent may have been stuck on a tool or API call. Try again, or use /reset to start fresh."
  4. The session transcript is marked as failed: true, breaking conversation continuity

This undermines autonomous/unsupervised agent workflows, which are a core value proposition.

Environment

  • OS: Oracle Linux 9 (aarch64)
  • Python: 3.11.15
  • Hermes Agent: v0.7.0 (v2026.4.3)
  • gateway/run.py lines 6042-6070: asyncio.wait_for(loop.run_in_executor(None, run_sync), timeout=_agent_timeout)
  • cron/scheduler.py: Similar HERMES_CRON_TIMEOUT for cron jobs
  • Timeout is env-var only (HERMES_AGENT_TIMEOUT=600), not exposed in config.yaml or DEFAULT_CONFIG

Steps to Reproduce

  1. Start hermes gateway: hermes gateway
  2. Send a task that requires subagent delegation with a reasoning model (e.g., via Telegram):
    • "Analyze the Hermes codebase for all timeout-related code"
    • This triggers delegate_task, which spawns a subagent doing multiple tool calls
  3. Wait for the agent to work for 10+ minutes
  4. Observe: the agent is interrupted and the user receives:

    ⏱️ Request timed out after 10 minutes. The agent may have been stuck on a tool or API call.
    Try again, or use /reset to start fresh.

Suggested Improvements

1. Activity-based timeout instead of wall-clock timeout

Instead of a fixed wall-clock limit, track the last "active" timestamp (updated on each tool_call completion / API response). Only trigger timeout if there's been no activity for N seconds. This distinguishes "working hard on a complex task" from "hung on a dead API call."

# Pseudocode: reset on each successful tool/API round-trip
# (time.monotonic() is unaffected by system clock adjustments)
self._last_activity = time.monotonic()
# Timeout check: time.monotonic() - self._last_activity > INACTIVITY_TIMEOUT
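
A slightly fuller sketch of the same idea, for discussion only. The names ActivityTracker, run_with_inactivity_timeout, and poll_interval are hypothetical and do not exist in Hermes. It also assumes the agent work is awaitable; since the gateway currently runs the agent in an executor thread, cancelling the wrapper would not stop that thread, and something like agent.interrupt() would still be needed on expiry.

```python
import asyncio
import time


class ActivityTracker:
    """Records the last time the agent made observable progress (hypothetical helper)."""

    def __init__(self) -> None:
        self._last_activity = time.monotonic()

    def touch(self) -> None:
        # Call after each completed tool call or API response.
        self._last_activity = time.monotonic()

    def idle_seconds(self) -> float:
        return time.monotonic() - self._last_activity


async def run_with_inactivity_timeout(coro, tracker, inactivity_timeout, poll_interval=1.0):
    """Await `coro`; raise asyncio.TimeoutError only after a stretch with no activity."""
    task = asyncio.ensure_future(coro)
    try:
        while True:
            # Wake up periodically to check how long the agent has been idle.
            done, _ = await asyncio.wait({task}, timeout=poll_interval)
            if done:
                return task.result()
            if tracker.idle_seconds() > inactivity_timeout:
                raise asyncio.TimeoutError(f"no activity for {inactivity_timeout}s")
    finally:
        # Cancel the work if we are leaving early (timeout or outer cancellation).
        if not task.done():
            task.cancel()
```

With this shape, a task that keeps calling tracker.touch() can run far longer than inactivity_timeout, while a task that stops making progress is cut off after roughly inactivity_timeout plus one poll_interval.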

2. Expose timeout in config.yaml

HERMES_AGENT_TIMEOUT is env-var only. Users shouldn't need to edit .env for a behavioral setting. Add to DEFAULT_CONFIG:

agent:
  gateway_timeout: 600  # seconds, 0 = unlimited
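
One way the lookup could work, as a sketch only. The function name resolve_gateway_timeout is hypothetical, and the precedence (env var overrides config.yaml) is an assumption rather than existing Hermes behavior. Mapping 0 to None relies on the fact that asyncio.wait_for() waits indefinitely when timeout=None.

```python
import os
from typing import Mapping, Optional

DEFAULT_GATEWAY_TIMEOUT = 600.0  # seconds; mirrors the current HERMES_AGENT_TIMEOUT default


def resolve_gateway_timeout(
    config: Mapping, environ: Mapping[str, str] = os.environ
) -> Optional[float]:
    """Return the timeout in seconds, or None for unlimited (hypothetical helper)."""
    # Assumed precedence: an explicit env var wins over config.yaml.
    raw = environ.get("HERMES_AGENT_TIMEOUT")
    if raw is None or raw.strip() == "":
        agent_section = config.get("agent") or {}
        raw = agent_section.get("gateway_timeout", DEFAULT_GATEWAY_TIMEOUT)
    timeout = float(raw)
    if timeout < 0:
        raise ValueError(f"gateway timeout must be >= 0, got {timeout}")
    # 0 means "unlimited"; asyncio.wait_for(timeout=None) never times out.
    return None if timeout == 0 else timeout
```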

3. Timeout extension prompt

When approaching the timeout (e.g., at 80% elapsed), send a non-blocking notification to the user with an option to extend. On platforms that support it (Telegram inline keyboard), offer a "Continue" button that resets the timer.
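
The bookkeeping behind such a prompt could look roughly like this. ExtendableDeadline and its methods are hypothetical names, and the platform-specific part (sending the notification, wiring a "Continue" button to extend()) is deliberately left out.

```python
import time


class ExtendableDeadline:
    """Tracks a timeout that warns once at a fraction of the limit and can be reset."""

    def __init__(self, timeout, warn_fraction=0.8, now=time.monotonic):
        self._timeout = timeout
        self._warn_fraction = warn_fraction
        self._now = now  # injectable clock, useful for testing
        self._started = now()
        self._warned = False

    def elapsed(self):
        return self._now() - self._started

    def should_warn(self):
        # True exactly once per timer period, when the warn threshold is crossed.
        if not self._warned and self.elapsed() >= self._warn_fraction * self._timeout:
            self._warned = True
            return True
        return False

    def expired(self):
        return self.elapsed() >= self._timeout

    def extend(self):
        # Called when the user presses "Continue": restart the timer.
        self._started = self._now()
        self._warned = False
```

A gateway loop would poll should_warn() to decide when to send the notification and expired() to decide when to stop the agent.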

4. Graceful degradation over hard kill

On timeout, instead of agent.interrupt() (which kills everything), consider:

  • Saving the current conversation state for resumption
  • Returning a partial result with what was completed so far
  • Offering /resume or auto-retry with a fresh context window

5. Subagent-aware timeout accounting

When the main agent delegates to a subagent (delegate_task), the subagent runs in a separate process/thread. The main agent is effectively "waiting" (idle). This idle-wait time should not count against the timeout, since the agent isn't stuck — it's waiting for a child to finish.
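
A minimal sketch of how delegation time could be excluded from the parent's budget. PausableClock is a hypothetical name, the class is not thread-safe as written, and wrapping delegate_task in the paused() context is an assumption about where the hook would go.

```python
import time
from contextlib import contextmanager


class PausableClock:
    """Measures elapsed time, excluding intervals spent inside paused() blocks."""

    def __init__(self, now=time.monotonic):
        self._now = now  # injectable clock, useful for testing
        self._started = now()
        self._paused_total = 0.0
        self._pause_started = None
        self._depth = 0  # supports nested delegations

    @contextmanager
    def paused(self):
        if self._depth == 0:
            self._pause_started = self._now()
        self._depth += 1
        try:
            yield
        finally:
            self._depth -= 1
            if self._depth == 0:
                self._paused_total += self._now() - self._pause_started
                self._pause_started = None

    def elapsed(self):
        paused = self._paused_total
        if self._pause_started is not None:
            # Currently paused: do not count the in-progress pause either.
            paused += self._now() - self._pause_started
        return self._now() - self._started - paused
```

The parent would run the delegation inside `with clock.paused():` and compare clock.elapsed() against its timeout. A hung subagent would then never trip the parent's timeout, so the child would presumably need its own limit.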

Workaround

Set a larger timeout via environment variable:

# In ~/.hermes/.env
HERMES_AGENT_TIMEOUT=1800  # 30 minutes
HERMES_CRON_TIMEOUT=1800   # also for cron jobs
