Bug type
Regression (worked before, now fails)
Summary
Subagent completion events are permanently dropped when the downstream channel (e.g., Telegram) is temporarily unreachable, because the announce pipeline is coupled to channel delivery. The event is never injected into the parent session, even though the subagent completed successfully and output files exist on disk.
Steps to reproduce
- Configure OpenClaw with a single channel (e.g., Telegram)
- Spawn a subagent worker task
- During worker execution, make the channel API unreachable (e.g., network instability, DNS failure)
- Worker completes and writes output to disk
- Gateway attempts "direct announce agent call" to the parent session
- Agent call triggers LLM response → response delivery to channel → channel API fails
- Agent call times out at `DEFAULT_SUBAGENT_ANNOUNCE_TIMEOUT_MS` (60000ms)
- Gateway retries up to 3 times (`MAX_ANNOUNCE_RETRY_COUNT`) with backoff, then gives up
- Completion event is permanently lost — parent session is never notified
Expected behavior
The completion event should be durably injected into the parent session context regardless of whether the LLM's response can be delivered to the channel. If delivery fails, the response should enter the normal delivery queue with its own retry logic — the announce itself should not be dropped.
Actual behavior
The announce is treated as failed because the response couldn't be delivered, even though the event could have been injected into the session context without requiring delivery.
`sendAnnounce()` calls `callGateway({method: "agent", deliver: true, timeoutMs: announceTimeoutMs})`. The `deliver: true` flag means the call doesn't return until channel delivery completes or times out. When the channel is down, the OS-level network timeout (~63-66s observed) exceeds the announce timeout (60s), so the announce always times out first. After 3 retries, the event is permanently dropped with no fallback to a durable queue.
Dependency chain:
Worker completes
→ Gateway "direct announce agent call" to parent session
→ Parent session LLM generates response
→ Response delivery to channel (Telegram)
→ Telegram API call (sendMessage)
→ FAILS → entire chain times out → event lost
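The chain above can be modeled in a few lines. This is a minimal sketch, not OpenClaw's actual code: `simulateCoupledAnnounce`, `DeliveryAttempt`, and `AnnounceResult` are hypothetical names, and capping the total number of attempts at `MAX_ANNOUNCE_RETRY_COUNT` is an assumption about how the counter is applied. It only illustrates that, while injection waits on delivery, any delivery that settles after the announce timeout makes the whole announce fail.

```typescript
// Constants as quoted in the report; everything else is a hypothetical model.
const DEFAULT_SUBAGENT_ANNOUNCE_TIMEOUT_MS = 60000;
const MAX_ANNOUNCE_RETRY_COUNT = 3;

interface DeliveryAttempt {
  settleMs: number;   // how long channel delivery takes to succeed or fail
  delivered: boolean; // whether the channel accepted the message
}

interface AnnounceResult {
  injected: boolean;
  attempts: number;
  reason: "delivered" | "retry-limit";
}

// Coupled behavior: the announce only counts as successful if delivery
// finishes successfully before the announce timeout.
function simulateCoupledAnnounce(
  deliveries: DeliveryAttempt[],
  timeoutMs: number = DEFAULT_SUBAGENT_ANNOUNCE_TIMEOUT_MS,
  maxAttempts: number = MAX_ANNOUNCE_RETRY_COUNT,
): AnnounceResult {
  let attempts = 0;
  for (const delivery of deliveries.slice(0, maxAttempts)) {
    attempts += 1;
    const timedOut = delivery.settleMs >= timeoutMs;
    if (!timedOut && delivery.delivered) {
      return { injected: true, attempts, reason: "delivered" };
    }
  }
  return { injected: false, attempts, reason: "retry-limit" };
}

// Two durations come from the [ws] log lines; the third is assumed to be similar.
const duringOutage = simulateCoupledAnnounce([
  { settleMs: 66094, delivered: false },
  { settleMs: 63345, delivered: false },
  { settleMs: 66094, delivered: false },
]);
console.log(duringOutage.injected, duringOutage.reason);
```

Under these assumptions every attempt during the outage times out, so the model ends in `retry-limit` without ever injecting the event, which matches the observed give-up log.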
OpenClaw version
2026.3.2
Operating system
Linux 6.6.87.2-microsoft-standard-WSL2 (x64) — WSL2 on Windows 11
Install method
npm
Logs, screenshots, and evidence
Log entries observed in real-time gateway console output during the incident. Console output is ephemeral and was lost on gateway restart — specific runtime values are from observation, not exported logs. Log formats verified against source code.
Gateway announce logs:
01:14:11 — Worker completes, writes output file (31KB)
01:14:15 — Subagent completion direct announce failed: gateway timeout after 60000ms
01:15:11 — Retry 2/4: gateway timeout after 60000ms
01:15:17 — Retry via Telegram fails: [ws] ⇄ res ✗ send 66094ms (Telegram unreachable)
01:15:20 — Subagent announce give up (retry-limit) run=56dbb3dc retries=3 endedAgo=69s
WebSocket delivery errors (gateway/ws subsystem):
[ws] ⇄ res ✗ send 66094ms errorCode=UNAVAILABLE errorMessage=HttpError: Network request for 'sendMessage' failed!
[ws] ⇄ res ✗ send 63345ms errorCode=UNAVAILABLE errorMessage=HttpError: Network request for 'sendMessage' failed!
Source references:
- Announce give-up log: `reply-DFFRlayb.js:24305`
- Default timeout: `DEFAULT_SUBAGENT_ANNOUNCE_TIMEOUT_MS = 6e4` (`reply-DFFRlayb.js:23187`)
- Max retries: `MAX_ANNOUNCE_RETRY_COUNT = 3` (`reply-DFFRlayb.js:24279`)
- Coupling: `sendAnnounce()` → `callGateway({deliver: !requesterIsSubagent})` (`reply-DFFRlayb.js:23496-23518`)
- Grammy HttpError: `toHttpError()` at `grammy/out/web.mjs:2310`
Impact and severity
Severity: High — Transient network issues cause permanent data loss.
- Lost work notifications: Workers complete tasks but the parent session never knows. Output sits on disk unprocessed until someone manually checks.
- Pipeline stalls: Sequential workflows (process result → spawn next task) break silently.
- No recovery path: Once the announce gives up, the event is gone. No mechanism exists to replay missed completions.
- Disproportionate impact: A 2-3 minute channel outage permanently drops events that the gateway could have queued.
- Affects any channel: Triggered by WSL2 DNS issues in this case, but the bug affects any deployment where a channel becomes temporarily unavailable.
Additional information
Trigger context: Observed during a ~7 minute WSL2 network outage (01:07–01:14 AM PST) that caused simultaneous Anthropic API timeouts and Telegram API failures. The network outage is a WSL2-specific trigger, but the coupling bug is platform-independent.
Grammy timeout is not the bottleneck: Grammy's default timeoutSeconds is 500s. The 63-66s failure duration reflects OS-level DNS/TCP timeouts on WSL2's degraded network, not a Grammy or Telegram limit. The announce timeout (60s) fires before the Telegram request even returns its error.
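The timing comparison can be spelled out as a small calculation. The numbers are the ones quoted in this report; the object layout and the `firstToFire` helper are illustrative only.

```typescript
// Timeouts involved in one announce attempt, in milliseconds.
const timeoutsMs: Record<string, number> = {
  announce: 60000,          // DEFAULT_SUBAGENT_ANNOUNCE_TIMEOUT_MS
  osNetworkObserved: 63345, // shortest failed send seen in the [ws] logs
  grammy: 500000,           // Grammy default timeoutSeconds (500s)
};

// Returns the name of the smallest timeout, i.e. the one that fires first.
function firstToFire(timeouts: Record<string, number>): string {
  return Object.entries(timeouts).reduce((a, b) => (b[1] < a[1] ? b : a))[0];
}

// How long after the announce timeout the network error would have arrived.
const marginMs = timeoutsMs.osNetworkObserved - timeoutsMs.announce;

console.log(firstToFire(timeoutsMs), marginMs); // announce 3345
```

Even in the fastest observed failure, the announce gives up about 3.3s before the Telegram error is reported, and the Grammy limit is never reached.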
The delivery queue already solves this for normal messages. Completion events should get the same durability guarantees.
Suggested fixes (in order of preference):
- Decouple announce from delivery — Inject the completion event into the session context without waiting for response delivery. Let the response enter the normal delivery queue separately.
- Persist completion events — Store them durably and retry injection independently of channel availability.
- Fallback to delivery queue — If direct announce fails, enqueue in the delivery queue rather than dropping.
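A rough sketch of the first option, under stated assumptions: `CompletionEvent`, `SessionStore`, `DeliveryQueue`, and `announceDecoupled` are hypothetical names, and the in-memory arrays stand in for whatever persistent storage the gateway would actually use. The point is only the ordering: the event is recorded in the session first, and the response is handed to a queue whose failures cannot undo that step.

```typescript
// Hypothetical types; none of these exist in OpenClaw under these names.
interface CompletionEvent {
  runId: string;
  outputPath: string;
}

// Stand-in for the parent session context (would be persisted in practice).
class SessionStore {
  readonly events: CompletionEvent[] = [];
  inject(event: CompletionEvent): void {
    this.events.push(event);
  }
}

// Stand-in for the normal delivery queue with its own retry logic.
class DeliveryQueue {
  readonly pending: string[] = [];
  enqueue(message: string): void {
    this.pending.push(message);
  }
  // Tries to send every pending message; keeps the ones that fail.
  flush(send: (message: string) => boolean): number {
    const failed = this.pending.filter((m) => !send(m));
    const sent = this.pending.length - failed.length;
    this.pending.length = 0;
    this.pending.push(...failed);
    return sent;
  }
}

function announceDecoupled(
  event: CompletionEvent,
  session: SessionStore,
  queue: DeliveryQueue,
  respond: (event: CompletionEvent) => string,
): void {
  session.inject(event);         // step 1: no channel involved, cannot time out on it
  queue.enqueue(respond(event)); // step 2: delivery is retried independently
}

const session = new SessionStore();
const queue = new DeliveryQueue();
announceDecoupled(
  { runId: "56dbb3dc", outputPath: "/tmp/example-output.md" }, // path is made up
  session,
  queue,
  (e) => `Run ${e.runId} finished`,
);
queue.flush(() => false); // channel down: message stays queued
console.log(session.events.length, queue.pending.length); // 1 1
queue.flush(() => true);  // channel back: message goes out
console.log(queue.pending.length); // 0
```

With this ordering a channel outage delays the user-facing message but does not lose the completion event, which is the durability the delivery queue already gives normal messages.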
Related: OpenClaw #37375 — Discord fetch failure crashes gateway (different bug, same WSL2 networking trigger)