
Gateway becomes unresponsive under subagent load on Windows - completion announcements timeout #64253

@L8ton-crypto

Description


Bug

The gateway becomes unresponsive when handling multiple subagent completion announcements, causing cascading timeouts and failed deliveries. This is distinct from #7042, which involved hard crashes: here the gateway process stays alive but stops processing requests effectively.

Symptoms

  1. Subagent completion timeouts - repeated warnings:

     ```
     [warn] Subagent announce completion direct announce agent call transient failure, retrying 2/4 in 5s: gateway timeout after 120000ms
     [warn] Subagent announce completion direct announce agent call transient failure, retrying 3/4 in 10s: gateway timeout after 120000ms
     ```

  2. Provider profile failover - Anthropic requests time out, triggering failover:

     ```
     Profile anthropic:manual timed out. Trying next account...
     embedded run failover decision: rotate_profile, failoverReason: timeout
     ```

  3. Lane wait exceeded - sessions queue up waiting for processing:

     ```
     lane wait exceeded: waitedMs=158509
     ```

  4. Eventually the gateway stops responding to Telegram/channels entirely until restart.

Impact

  • Overnight cron jobs (subagent builds) fail to deliver results
  • Channels go silent even though gateway process is still running
  • Watchdog wrapper doesn't help because process hasn't crashed

Environment

  • OpenClaw `2026.4.5`
  • Windows 10 (x64)
  • Node.js v25.5.0
  • Gateway as Windows Scheduled Task with watchdog wrapper
  • Workload: overnight build cron spawning Sonnet subagents (30-60 min builds)

Reproduction

  1. Configure overnight cron job that spawns subagent builds
  2. Run 2-3 builds back-to-back (e.g. overnight build + dream cycle)
  3. Gateway progressively slows, completion announcements start timing out
  4. Eventually gateway becomes unresponsive to all channels

Logs

From `openclaw-2026-04-10.log`:

  • 08:05:56 - First subagent timeout (retry 2/4)
  • 08:06:31 - Anthropic profile timeout, failover triggered
  • 08:08:01 - Subagent timeout (retry 3/4)
  • Pattern continues with increasing delays
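
To see whether the delays really escalate monotonically across the whole night, the retry warnings can be pulled out of the log mechanically. A hedged sketch; the line format is copied from the warnings quoted under Symptoms, and `extractRetries` is an illustrative name, not an OpenClaw API:

```javascript
// Extract retry attempt / delay / timeout from gateway warning lines.
const RETRY_RE = /retrying (\d+)\/(\d+) in (\d+)s: gateway timeout after (\d+)ms/;

function extractRetries(logText) {
  return logText
    .split('\n')
    .map((line) => line.match(RETRY_RE))
    .filter(Boolean)
    .map(([, attempt, max, delayS, timeoutMs]) => ({
      attempt: Number(attempt),
      max: Number(max),
      delaySeconds: Number(delayS),
      timeoutMs: Number(timeoutMs),
    }));
}

// Against the two warnings quoted under Symptoms:
const sample = [
  '[warn] ... retrying 2/4 in 5s: gateway timeout after 120000ms',
  '[warn] ... retrying 3/4 in 10s: gateway timeout after 120000ms',
].join('\n');
const retries = extractRetries(sample);
// retries[0].delaySeconds === 5, retries[1].delaySeconds === 10
```

Plotting `delaySeconds` over time should make the "increasing delays" pattern above quantitative rather than anecdotal.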

Suggested Investigation

  • Memory leak under sustained subagent load?
  • WebSocket connection pool exhaustion?
  • Anthropic API timeout handling blocking the event loop?
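
The event-loop theory in particular is cheap to test from inside the process. A minimal, generic Node.js probe (nothing here is OpenClaw-specific): if a handler blocks the event loop, the delay before an already-scheduled `setImmediate` callback fires balloons.

```javascript
// Measure how long a setImmediate callback waits before running (ms).
function measureEventLoopLag() {
  return new Promise((resolve) => {
    const start = process.hrtime.bigint();
    setImmediate(() => {
      resolve(Number(process.hrtime.bigint() - start) / 1e6);
    });
  });
}

async function demo() {
  const idleLag = await measureEventLoopLag(); // baseline, typically a few ms
  const pending = measureEventLoopLag();       // scheduled before the block
  const deadline = Date.now() + 50;
  while (Date.now() < deadline) {}             // simulate 50ms of sync blocking
  const blockedLag = await pending;            // reports roughly 50ms or more
  return { idleLag, blockedLag };
}
```

Sampling this on an interval and logging spikes would distinguish a blocked event loop (which would explain symptoms 1-3 hitting at once) from connection-level exhaustion, where requests stall while the loop itself stays responsive.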

Workaround

Currently none - manual gateway restart required once symptoms appear. The watchdog from #7042 only helps with crashes, not unresponsive states.
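
Until a fix lands, the watchdog wrapper could at least be taught to treat sustained unresponsiveness like a crash. A hypothetical sketch of the decision logic only; the probe itself (e.g. a deadline-bounded request against some gateway endpoint) is assumed, and `ResponsivenessWatchdog` is an illustrative name, not part of OpenClaw:

```javascript
// Restart policy: N consecutive failed liveness probes count as
// "unresponsive" even though the process never exited.
class ResponsivenessWatchdog {
  constructor({ maxFailures = 3, onUnresponsive }) {
    this.maxFailures = maxFailures;
    this.onUnresponsive = onUnresponsive;
    this.failures = 0;
  }

  // Feed in one probe result (true = gateway answered in time).
  record(ok) {
    this.failures = ok ? 0 : this.failures + 1;
    if (this.failures >= this.maxFailures) {
      this.failures = 0;     // reset so we don't restart on every later probe
      this.onUnresponsive(); // e.g. kill + respawn the scheduled task
    }
  }
}

// Example: two failures recover; three in a row trigger one restart.
let restarts = 0;
const dog = new ResponsivenessWatchdog({ onUnresponsive: () => restarts++ });
[false, false, true, false, false, false].forEach((ok) => dog.record(ok));
// restarts === 1
```

Requiring several consecutive failures keeps a single slow probe during a heavy build from triggering a spurious restart.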
