## Bug

The gateway becomes unresponsive while handling multiple subagent completion announcements, causing cascading timeouts and failed deliveries. This is distinct from #7042 (which covered hard crashes): here the gateway process stays alive but effectively stops processing requests.
## Symptoms

- Subagent completion timeouts - repeated warnings:

  ```
  [warn] Subagent announce completion direct announce agent call transient failure, retrying 2/4 in 5s: gateway timeout after 120000ms
  [warn] Subagent announce completion direct announce agent call transient failure, retrying 3/4 in 10s: gateway timeout after 120000ms
  ```
- Provider profile failover - Anthropic requests time out, triggering failover:

  ```
  Profile anthropic:manual timed out. Trying next account...
  embedded run failover decision: rotate_profile, failoverReason: timeout
  ```
- Lane wait exceeded - sessions queue up waiting for processing:

  ```
  lane wait exceeded: waitedMs=158509
  ```
- Eventually the gateway stops responding to Telegram/channels entirely until restart.
## Impact

- Overnight cron jobs (subagent builds) fail to deliver their results
- Channels go silent even though the gateway process is still running
- The watchdog wrapper doesn't help because the process never crashes
## Environment

- OpenClaw `2026.4.5`
- Windows 10 (x64)
- Node.js v25.5.0
- Gateway runs as a Windows Scheduled Task with a watchdog wrapper
- Workload: overnight build cron spawning Sonnet subagents (30-60 min builds)
## Reproduction

1. Configure an overnight cron job that spawns subagent builds
2. Run 2-3 builds back-to-back (e.g. overnight build + dream cycle)
3. The gateway progressively slows and completion announcements start timing out (the probe sketch below can capture the slowdown)
4. Eventually the gateway becomes unresponsive on all channels
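To make step 3 measurable rather than anecdotal, an external probe can log gateway round-trip latency while the builds run. This is a minimal sketch, not part of OpenClaw: the port, the `/health` path, and the 120s budget (mirroring the announce timeout in the logs) are assumptions to adjust to the actual deployment.

```typescript
// probe.ts - hypothetical liveness probe; GATEWAY_URL and the /health path
// are assumptions, adjust to wherever the gateway actually listens.
const GATEWAY_URL = process.env.GATEWAY_URL ?? "http://127.0.0.1:3000/health";

async function probe(): Promise<void> {
  const start = Date.now();
  try {
    // Same 120s budget the subagent announce path appears to use.
    const res = await fetch(GATEWAY_URL, { signal: AbortSignal.timeout(120_000) });
    console.log(`${new Date().toISOString()} status=${res.status} latencyMs=${Date.now() - start}`);
  } catch (err) {
    console.log(`${new Date().toISOString()} FAILED after ${Date.now() - start}ms: ${err}`);
  }
}

// One sample every 30s is enough to see the progressive slowdown.
setInterval(probe, 30_000);
probe();
```

A steadily climbing `latencyMs` across back-to-back builds, rather than a sudden cliff, would support the "progressive degradation" pattern described above.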
## Logs

From `openclaw-2026-04-10.log`:
- 08:05:56 - First subagent timeout (retry 2/4)
- 08:06:31 - Anthropic profile timeout, failover triggered
- 08:08:01 - Subagent timeout (retry 3/4)
- Pattern continues with increasing delays
## Suggested Investigation

- Memory leak under sustained subagent load?
- WebSocket connection pool exhaustion?
- Anthropic API timeout handling blocking the event loop? (see the monitoring sketch below)
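A small in-process monitor could separate the first and third hypotheses: if event-loop delay spikes while subagents announce completion, the loop is being starved; if RSS climbs monotonically across builds, that points at a leak. A minimal sketch using Node's built-in `perf_hooks`; the only assumption is that this can be loaded at gateway startup:

```typescript
// loop-monitor.ts - logs event-loop delay percentiles and RSS once a minute.
import { monitorEventLoopDelay } from "node:perf_hooks";

const h = monitorEventLoopDelay({ resolution: 20 });
h.enable();

setInterval(() => {
  // Histogram values are in nanoseconds; convert to ms. A p99 in the
  // hundreds of ms (or seconds) during completion announcements would
  // confirm a blocked event loop rather than a slow network.
  console.log(
    `[loop] mean=${(h.mean / 1e6).toFixed(1)}ms ` +
    `p99=${(h.percentile(99) / 1e6).toFixed(1)}ms ` +
    `max=${(h.max / 1e6).toFixed(1)}ms ` +
    `rss=${(process.memoryUsage().rss / 1048576).toFixed(0)}MB`
  );
  h.reset(); // fresh window each interval
}, 60_000);
```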
## Workaround

Currently none - a manual gateway restart is required once symptoms appear. The watchdog from #7042 only helps with hard crashes, not with this alive-but-unresponsive state. A liveness-based extension is sketched below.
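Until the root cause is found, the existing watchdog could be extended with a liveness check so it also restarts the gateway when it stops answering. This is a sketch under stated assumptions: the health URL is hypothetical (same caveat as the probe above), and `OpenClawGateway` is a placeholder for whatever the Scheduled Task is actually named.

```typescript
// watchdog-liveness.ts - restart the gateway when it stops responding,
// not just when it crashes. Health URL and task name are hypothetical.
import { execSync } from "node:child_process";

const HEALTH_URL = process.env.GATEWAY_HEALTH ?? "http://127.0.0.1:3000/health";
const TASK_NAME = "OpenClawGateway"; // placeholder Scheduled Task name
const MAX_FAILURES = 3;              // restart after 3 consecutive misses
let failures = 0;

async function check(): Promise<void> {
  try {
    const res = await fetch(HEALTH_URL, { signal: AbortSignal.timeout(30_000) });
    if (!res.ok) throw new Error(`status ${res.status}`);
    failures = 0;
  } catch (err) {
    failures += 1;
    console.warn(`[watchdog] liveness check failed (${failures}/${MAX_FAILURES}): ${err}`);
    if (failures >= MAX_FAILURES) {
      // Bounce the Scheduled Task; schtasks is the standard Windows CLI for this.
      execSync(`schtasks /End /TN "${TASK_NAME}"`);
      execSync(`schtasks /Run /TN "${TASK_NAME}"`);
      failures = 0;
    }
  }
}

setInterval(check, 60_000);
```

Three consecutive misses at a 60s interval tolerates the transient slowdowns seen in the logs while still recovering well before the channels have been silent for long.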