Skip to content

fix: replace wall-clock agent timeout with inactivity-based timeout#4864

Closed
Mibayy wants to merge 1 commit into
NousResearch:mainfrom
Mibayy:fix/activity-based-agent-timeout-4815
Closed

fix: replace wall-clock agent timeout with inactivity-based timeout#4864
Mibayy wants to merge 1 commit into
NousResearch:mainfrom
Mibayy:fix/activity-based-agent-timeout-4815

Conversation

@Mibayy

@Mibayy Mibayy commented Apr 3, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes #4815 — The hard 10-minute HERMES_AGENT_TIMEOUT wall-clock timeout kills legitimate long-running tasks (subagent delegation with reasoning models, multi-step research, autonomous code implementation).

This PR replaces it with an inactivity-based timeout: the timer resets on every tool call, step completion, and streaming API response. It only fires when the agent has been completely idle for the configured duration — correctly detecting hung API calls without killing active work.

How it works

Before: asyncio.wait_for(executor, timeout=600) — fixed wall-clock, kills at 10 min regardless of activity.

After: Poll loop (5s interval) checks time.time() - last_activity against the timeout threshold:

  • tool_progress_callback → resets activity timer (tool execution)
  • step_callback → resets activity timer (iteration completion)
  • stream_delta_callback → resets activity timer (API streaming tokens, throttled to every 2s)

A 30-minute subagent task with continuous tool calls will never timeout. A stuck API call with no activity for 10 minutes will.

Config

Two ways to configure (existing env var still works):

# config.yaml
agent:
  gateway_timeout: 600  # seconds, 0 = unlimited
# .env (takes precedence over config.yaml)
HERMES_AGENT_TIMEOUT=1800

Changes

  • gateway/run.py:
    • Config bridging: agent.gateway_timeoutHERMES_AGENT_TIMEOUT env var
    • Activity tracker _last_activity updated by 3 callbacks
    • Replaced asyncio.wait_for() with inactivity polling loop
    • Support 0 for unlimited timeout
    • Stream delta activity throttled to every 2s

Test plan

  • Long-running task with many tool calls (>10 min) completes successfully
  • Stuck agent (simulate hung API) triggers timeout after configured inactivity period
  • HERMES_AGENT_TIMEOUT=0 allows unlimited execution
  • agent.gateway_timeout in config.yaml is respected
  • Env var HERMES_AGENT_TIMEOUT takes precedence over config.yaml
  • Timeout message correctly says "inactive" not "timed out"

🤖 Generated with Claude Code

…ousResearch#4815)

The hard 10-minute wall-clock timeout killed legitimate long-running
tasks (subagent delegation, reasoning models, multi-step research).

Replace with inactivity-based timeout: the timer resets on every tool
call, step completion, and streaming token. Only fires when the agent
has been completely idle (no callbacks) for the configured duration,
which correctly detects hung API calls without killing active work.

Changes:
- Poll loop (5s interval) checks last_activity timestamp vs timeout
- Activity tracked via tool_progress, step, and stream_delta callbacks
- Stream delta tracking throttled to every 2s to avoid per-token overhead
- Support 0 = unlimited timeout
- Expose via config.yaml: agent.gateway_timeout (bridges to env var)
- HERMES_AGENT_TIMEOUT env var still works, config.yaml doesn't override it

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
teknium1 added a commit that referenced this pull request Apr 6, 2026
…timeout

The gateway previously used a hard wall-clock asyncio.wait_for timeout
that killed agents after a fixed duration regardless of activity. This
punished legitimate long-running tasks (subagent delegation, reasoning
models, multi-step research).

Now uses an inactivity-based polling loop that checks the agent's
built-in activity tracker (get_activity_summary) every 5 seconds. The
agent can run indefinitely as long as it's actively calling tools or
receiving API responses. Only fires when the agent has been completely
idle for the configured duration.

Changes:
- Replace asyncio.wait_for with asyncio.wait poll loop checking
  agent idle time via get_activity_summary()
- Add agent.gateway_timeout config.yaml key (default 1800s, 0=unlimited)
- Update stale session eviction to use agent idle time instead of
  pure wall-clock (prevents evicting active long-running tasks)
- Preserve all existing diagnostic logging and user-facing context

Inspired by PR #4864 (Mibayy) and issue #4815 (BongSuCHOI).
Reimplemented on current main using existing _touch_activity()
infrastructure rather than a parallel tracker.
teknium1 added a commit that referenced this pull request Apr 6, 2026
…timeout (#5389)

The gateway previously used a hard wall-clock asyncio.wait_for timeout
that killed agents after a fixed duration regardless of activity. This
punished legitimate long-running tasks (subagent delegation, reasoning
models, multi-step research).

Now uses an inactivity-based polling loop that checks the agent's
built-in activity tracker (get_activity_summary) every 5 seconds. The
agent can run indefinitely as long as it's actively calling tools or
receiving API responses. Only fires when the agent has been completely
idle for the configured duration.

Changes:
- Replace asyncio.wait_for with asyncio.wait poll loop checking
  agent idle time via get_activity_summary()
- Add agent.gateway_timeout config.yaml key (default 1800s, 0=unlimited)
- Update stale session eviction to use agent idle time instead of
  pure wall-clock (prevents evicting active long-running tasks)
- Preserve all existing diagnostic logging and user-facing context

Inspired by PR #4864 (Mibayy) and issue #4815 (BongSuCHOI).
Reimplemented on current main using existing _touch_activity()
infrastructure rather than a parallel tracker.
@teknium1

teknium1 commented Apr 6, 2026

Copy link
Copy Markdown
Contributor

Merged via #5389. Your design — inactivity-based timeout instead of wall-clock, and the agent.gateway_timeout config.yaml bridging — was the foundation for the implementation. We reimplemented on current main using the existing get_activity_summary() infrastructure from run_agent.py (avoids a parallel activity tracker) and preserved the diagnostic logging that was added after your PR was opened. Thanks for the contribution, @Mibayy!

@teknium1 teknium1 closed this Apr 6, 2026
jinzheming pushed a commit to jinzheming/hermes-agent that referenced this pull request Apr 6, 2026
…timeout (NousResearch#5389)

The gateway previously used a hard wall-clock asyncio.wait_for timeout
that killed agents after a fixed duration regardless of activity. This
punished legitimate long-running tasks (subagent delegation, reasoning
models, multi-step research).

Now uses an inactivity-based polling loop that checks the agent's
built-in activity tracker (get_activity_summary) every 5 seconds. The
agent can run indefinitely as long as it's actively calling tools or
receiving API responses. Only fires when the agent has been completely
idle for the configured duration.

Changes:
- Replace asyncio.wait_for with asyncio.wait poll loop checking
  agent idle time via get_activity_summary()
- Add agent.gateway_timeout config.yaml key (default 1800s, 0=unlimited)
- Update stale session eviction to use agent idle time instead of
  pure wall-clock (prevents evicting active long-running tasks)
- Preserve all existing diagnostic logging and user-facing context

Inspired by PR NousResearch#4864 (Mibayy) and issue NousResearch#4815 (BongSuCHOI).
Reimplemented on current main using existing _touch_activity()
infrastructure rather than a parallel tracker.
saxster pushed a commit to saxster/hermes-agent that referenced this pull request Apr 8, 2026
…timeout (NousResearch#5389)

The gateway previously used a hard wall-clock asyncio.wait_for timeout
that killed agents after a fixed duration regardless of activity. This
punished legitimate long-running tasks (subagent delegation, reasoning
models, multi-step research).

Now uses an inactivity-based polling loop that checks the agent's
built-in activity tracker (get_activity_summary) every 5 seconds. The
agent can run indefinitely as long as it's actively calling tools or
receiving API responses. Only fires when the agent has been completely
idle for the configured duration.

Changes:
- Replace asyncio.wait_for with asyncio.wait poll loop checking
  agent idle time via get_activity_summary()
- Add agent.gateway_timeout config.yaml key (default 1800s, 0=unlimited)
- Update stale session eviction to use agent idle time instead of
  pure wall-clock (prevents evicting active long-running tasks)
- Preserve all existing diagnostic logging and user-facing context

Inspired by PR NousResearch#4864 (Mibayy) and issue NousResearch#4815 (BongSuCHOI).
Reimplemented on current main using existing _touch_activity()
infrastructure rather than a parallel tracker.
Tommyeds pushed a commit to Tommyeds/hermes-agent that referenced this pull request Apr 12, 2026
…timeout (NousResearch#5389)

The gateway previously used a hard wall-clock asyncio.wait_for timeout
that killed agents after a fixed duration regardless of activity. This
punished legitimate long-running tasks (subagent delegation, reasoning
models, multi-step research).

Now uses an inactivity-based polling loop that checks the agent's
built-in activity tracker (get_activity_summary) every 5 seconds. The
agent can run indefinitely as long as it's actively calling tools or
receiving API responses. Only fires when the agent has been completely
idle for the configured duration.

Changes:
- Replace asyncio.wait_for with asyncio.wait poll loop checking
  agent idle time via get_activity_summary()
- Add agent.gateway_timeout config.yaml key (default 1800s, 0=unlimited)
- Update stale session eviction to use agent idle time instead of
  pure wall-clock (prevents evicting active long-running tasks)
- Preserve all existing diagnostic logging and user-facing context

Inspired by PR NousResearch#4864 (Mibayy) and issue NousResearch#4815 (BongSuCHOI).
Reimplemented on current main using existing _touch_activity()
infrastructure rather than a parallel tracker.
angelburgosrosado pushed a commit to angelburgosrosado/hermes-agent that referenced this pull request Apr 27, 2026
…timeout (NousResearch#5389)

The gateway previously used a hard wall-clock asyncio.wait_for timeout
that killed agents after a fixed duration regardless of activity. This
punished legitimate long-running tasks (subagent delegation, reasoning
models, multi-step research).

Now uses an inactivity-based polling loop that checks the agent's
built-in activity tracker (get_activity_summary) every 5 seconds. The
agent can run indefinitely as long as it's actively calling tools or
receiving API responses. Only fires when the agent has been completely
idle for the configured duration.

Changes:
- Replace asyncio.wait_for with asyncio.wait poll loop checking
  agent idle time via get_activity_summary()
- Add agent.gateway_timeout config.yaml key (default 1800s, 0=unlimited)
- Update stale session eviction to use agent idle time instead of
  pure wall-clock (prevents evicting active long-running tasks)
- Preserve all existing diagnostic logging and user-facing context

Inspired by PR NousResearch#4864 (Mibayy) and issue NousResearch#4815 (BongSuCHOI).
Reimplemented on current main using existing _touch_activity()
infrastructure rather than a parallel tracker.
02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026
…timeout (NousResearch#5389)

The gateway previously used a hard wall-clock asyncio.wait_for timeout
that killed agents after a fixed duration regardless of activity. This
punished legitimate long-running tasks (subagent delegation, reasoning
models, multi-step research).

Now uses an inactivity-based polling loop that checks the agent's
built-in activity tracker (get_activity_summary) every 5 seconds. The
agent can run indefinitely as long as it's actively calling tools or
receiving API responses. Only fires when the agent has been completely
idle for the configured duration.

Changes:
- Replace asyncio.wait_for with asyncio.wait poll loop checking
  agent idle time via get_activity_summary()
- Add agent.gateway_timeout config.yaml key (default 1800s, 0=unlimited)
- Update stale session eviction to use agent idle time instead of
  pure wall-clock (prevents evicting active long-running tasks)
- Preserve all existing diagnostic logging and user-facing context

Inspired by PR NousResearch#4864 (Mibayy) and issue NousResearch#4815 (BongSuCHOI).
Reimplemented on current main using existing _touch_activity()
infrastructure rather than a parallel tracker.
olympus-terminal pushed a commit to olympus-terminal/hermes-agent that referenced this pull request May 16, 2026
…timeout (NousResearch#5389)

The gateway previously used a hard wall-clock asyncio.wait_for timeout
that killed agents after a fixed duration regardless of activity. This
punished legitimate long-running tasks (subagent delegation, reasoning
models, multi-step research).

Now uses an inactivity-based polling loop that checks the agent's
built-in activity tracker (get_activity_summary) every 5 seconds. The
agent can run indefinitely as long as it's actively calling tools or
receiving API responses. Only fires when the agent has been completely
idle for the configured duration.

Changes:
- Replace asyncio.wait_for with asyncio.wait poll loop checking
  agent idle time via get_activity_summary()
- Add agent.gateway_timeout config.yaml key (default 1800s, 0=unlimited)
- Update stale session eviction to use agent idle time instead of
  pure wall-clock (prevents evicting active long-running tasks)
- Preserve all existing diagnostic logging and user-facing context

Inspired by PR NousResearch#4864 (Mibayy) and issue NousResearch#4815 (BongSuCHOI).
Reimplemented on current main using existing _touch_activity()
infrastructure rather than a parallel tracker.
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
…timeout (NousResearch#5389)

The gateway previously used a hard wall-clock asyncio.wait_for timeout
that killed agents after a fixed duration regardless of activity. This
punished legitimate long-running tasks (subagent delegation, reasoning
models, multi-step research).

Now uses an inactivity-based polling loop that checks the agent's
built-in activity tracker (get_activity_summary) every 5 seconds. The
agent can run indefinitely as long as it's actively calling tools or
receiving API responses. Only fires when the agent has been completely
idle for the configured duration.

Changes:
- Replace asyncio.wait_for with asyncio.wait poll loop checking
  agent idle time via get_activity_summary()
- Add agent.gateway_timeout config.yaml key (default 1800s, 0=unlimited)
- Update stale session eviction to use agent idle time instead of
  pure wall-clock (prevents evicting active long-running tasks)
- Preserve all existing diagnostic logging and user-facing context

Inspired by PR NousResearch#4864 (Mibayy) and issue NousResearch#4815 (BongSuCHOI).
Reimplemented on current main using existing _touch_activity()
infrastructure rather than a parallel tracker.
Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026
…timeout (NousResearch#5389)

The gateway previously used a hard wall-clock asyncio.wait_for timeout
that killed agents after a fixed duration regardless of activity. This
punished legitimate long-running tasks (subagent delegation, reasoning
models, multi-step research).

Now uses an inactivity-based polling loop that checks the agent's
built-in activity tracker (get_activity_summary) every 5 seconds. The
agent can run indefinitely as long as it's actively calling tools or
receiving API responses. Only fires when the agent has been completely
idle for the configured duration.

Changes:
- Replace asyncio.wait_for with asyncio.wait poll loop checking
  agent idle time via get_activity_summary()
- Add agent.gateway_timeout config.yaml key (default 1800s, 0=unlimited)
- Update stale session eviction to use agent idle time instead of
  pure wall-clock (prevents evicting active long-running tasks)
- Preserve all existing diagnostic logging and user-facing context

Inspired by PR NousResearch#4864 (Mibayy) and issue NousResearch#4815 (BongSuCHOI).
Reimplemented on current main using existing _touch_activity()
infrastructure rather than a parallel tracker.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

gateway agent timeout (HERMES_AGENT_TIMEOUT) kills legitimate long-running tasks

2 participants