BUG: Subagent completion announce permanently lost when channel delivery is unavailable

### Bug type

Regression (worked before, now fails)

### Summary

Subagent completion events are permanently dropped when the downstream channel (e.g., Telegram) is temporarily unreachable, because the announce pipeline is coupled to channel delivery. The event is never injected into the parent session, even though the subagent completed successfully and output files exist on disk.

### Steps to reproduce

1. Configure OpenClaw with a single channel (e.g., Telegram)
2. Spawn a subagent worker task
3. During worker execution, make the channel API unreachable (e.g., network instability, DNS failure)
4. Worker completes and writes output to disk
5. Gateway attempts "direct announce agent call" to the parent session
6. Agent call triggers LLM response → response delivery to channel → channel API fails
7. Agent call times out at `DEFAULT_SUBAGENT_ANNOUNCE_TIMEOUT_MS` (60000ms)
8. Gateway retries up to 3 times (`MAX_ANNOUNCE_RETRY_COUNT`) with backoff, then gives up
9. Completion event is permanently lost — parent session is never notified

### Expected behavior

The completion event should be durably injected into the parent session context regardless of whether the LLM's response can be delivered to the channel. If delivery fails, the response should enter the normal delivery queue with its own retry logic — the announce itself should not be dropped.

### Actual behavior

The announce is treated as failed because the *response* couldn't be delivered, even though the *event* could have been injected into the session context without requiring delivery.

`sendAnnounce()` calls `callGateway({method: "agent", deliver: true, timeoutMs: announceTimeoutMs})`. The `deliver` flag is set to `!requesterIsSubagent`, so it is `true` whenever the requester is not itself a subagent. With `deliver: true`, the call doesn't return until channel delivery completes or fails. When the channel is down, the delivery attempt takes ~63-66s to fail (observed; an OS-level network timeout), which is longer than the 60s announce timeout, so the announce always times out first. After 3 retries, the event is permanently dropped with no fallback to a durable queue.

Dependency chain:
```
Worker completes
  → Gateway "direct announce agent call" to parent session
    → Parent session LLM generates response
      → Response delivery to channel (Telegram)
        → Telegram API call (sendMessage)
          → FAILS → entire chain times out → event lost
```
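
The timing in this chain can be modelled in a few lines. This is only a sketch, not OpenClaw code: the two constants and the two observed durations come from this report, while `coupledAttempt`, `simulateCoupledAnnounce`, and the third duration are hypothetical. It also assumes the retry limit counts total attempts, which the logs do not make clear.

```typescript
// Values reported in this issue.
const DEFAULT_SUBAGENT_ANNOUNCE_TIMEOUT_MS = 60000;
const MAX_ANNOUNCE_RETRY_COUNT = 3;

type AttemptOutcome = "delivered" | "delivery-error" | "announce-timeout";

// Hypothetical model of one channel delivery: how long it takes to resolve
// and whether it succeeds.
interface DeliveryAttempt {
  durationMs: number;
  ok: boolean;
}

// One coupled announce attempt: the call only returns when delivery resolves,
// unless the announce timeout fires first.
function coupledAttempt(attempt: DeliveryAttempt): AttemptOutcome {
  if (attempt.durationMs > DEFAULT_SUBAGENT_ANNOUNCE_TIMEOUT_MS) {
    return "announce-timeout";
  }
  return attempt.ok ? "delivered" : "delivery-error";
}

// Assumption: the retry limit is treated as the total number of attempts.
function simulateCoupledAnnounce(attempts: DeliveryAttempt[]): {
  announced: boolean;
  outcomes: AttemptOutcome[];
} {
  const outcomes: AttemptOutcome[] = [];
  for (const attempt of attempts.slice(0, MAX_ANNOUNCE_RETRY_COUNT)) {
    const outcome = coupledAttempt(attempt);
    outcomes.push(outcome);
    if (outcome === "delivered") {
      return { announced: true, outcomes };
    }
  }
  // Every attempt failed: in the coupled design the event is dropped here.
  return { announced: false, outcomes };
}

const outage = simulateCoupledAnnounce([
  { durationMs: 66094, ok: false }, // observed
  { durationMs: 63345, ok: false }, // observed
  { durationMs: 65000, ok: false }, // assumed, within the observed range
]);
console.log(outage);
```

Because every delivery failure takes longer than 60s to surface, each attempt in the model ends as an announce timeout rather than a delivery error, and the completion is never recorded.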

### OpenClaw version

2026.3.2

### Operating system

Linux 6.6.87.2-microsoft-standard-WSL2 (x64) — WSL2 on Windows 11

### Install method

npm

### Logs, screenshots, and evidence

The log entries below were observed in real-time gateway console output during the incident. Console output is ephemeral and was lost on gateway restart, so specific runtime values are from observation, not exported logs. Log formats were verified against the source code.

Gateway announce logs:

```text
01:14:11 — Worker completes, writes output file (31KB)
01:14:15 — Subagent completion direct announce failed: gateway timeout after 60000ms
01:15:11 — Retry 2/4: gateway timeout after 60000ms
01:15:17 — Retry via Telegram fails: [ws] ⇄ res ✗ send 66094ms (Telegram unreachable)
01:15:20 — Subagent announce give up (retry-limit) run=56dbb3dc retries=3 endedAgo=69s
```

WebSocket delivery errors (gateway/ws subsystem):

```text
[ws] ⇄ res ✗ send 66094ms errorCode=UNAVAILABLE errorMessage=HttpError: Network request for 'sendMessage' failed!
[ws] ⇄ res ✗ send 63345ms errorCode=UNAVAILABLE errorMessage=HttpError: Network request for 'sendMessage' failed!
```

Source references:

- Announce give-up log: `reply-DFFRlayb.js:24305`
- Default timeout: `DEFAULT_SUBAGENT_ANNOUNCE_TIMEOUT_MS = 6e4` (`reply-DFFRlayb.js:23187`)
- Max retries: `MAX_ANNOUNCE_RETRY_COUNT = 3` (`reply-DFFRlayb.js:24279`)
- Coupling: `sendAnnounce()` → `callGateway({deliver: !requesterIsSubagent})` (`reply-DFFRlayb.js:23496-23518`)
- Grammy HttpError: `toHttpError()` at `grammy/out/web.mjs:2310`

### Impact and severity

**Severity: High** — Transient network issues cause permanent data loss.

- **Lost work notifications**: Workers complete tasks but the parent session never knows. Output sits on disk unprocessed until someone manually checks.
- **Pipeline stalls**: Sequential workflows (process result → spawn next task) break silently.
- **No recovery path**: Once the announce gives up, the event is gone. No mechanism exists to replay missed completions.
- **Disproportionate impact**: A 2-3 minute channel outage permanently drops events that the gateway could have queued.
- **Affects any channel**: Triggered by WSL2 DNS issues in this case, but the bug affects any deployment where a channel becomes temporarily unavailable.

### Additional information

**Trigger context:** Observed during a ~7 minute WSL2 network outage (01:07–01:14 AM PST) that caused simultaneous Anthropic API timeouts and Telegram API failures. The network outage is a WSL2-specific trigger, but the coupling bug is platform-independent.

**Grammy timeout is not the bottleneck:** Grammy's default `timeoutSeconds` is 500s. The 63-66s failure duration reflects OS-level DNS/TCP timeouts on WSL2's degraded network, not a Grammy or Telegram limit. The announce timeout (60s) fires before the Telegram request even returns its error.

**The delivery queue already solves this for normal messages.** Completion events should get the same durability guarantees.

**Suggested fixes (in order of preference):**

1. **Decouple announce from delivery** — Inject the completion event into the session context without waiting for response delivery. Let the response enter the normal delivery queue separately.
2. **Persist completion events** — Store them durably and retry injection independently of channel availability.
3. **Fallback to delivery queue** — If direct announce fails, enqueue in the delivery queue rather than dropping.
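
To make fix 1 concrete, here is a rough in-memory sketch of what decoupling could look like. All names in it (`ParentSession`, `DeliveryQueue`, `announceCompletion`, the `outputPath` value) are hypothetical and do not come from the OpenClaw source. The real delivery queue presumably has its own persistence and backoff, which this sketch does not model. It only covers channel delivery failing; an unreachable LLM API would need its own handling.

```typescript
// Hypothetical shapes for illustration only.
interface CompletionEvent {
  runId: string;
  outputPath: string;
}

interface QueuedDelivery {
  runId: string;
  text: string;
  attempts: number;
}

class ParentSession {
  context: CompletionEvent[] = [];

  // Recording the event touches only session state, never the channel.
  inject(event: CompletionEvent): void {
    this.context.push(event);
  }
}

class DeliveryQueue {
  pending: QueuedDelivery[] = [];
  private send: (text: string) => boolean;

  constructor(send: (text: string) => boolean) {
    this.send = send;
  }

  enqueue(runId: string, text: string): void {
    this.pending.push({ runId, text, attempts: 0 });
  }

  // One delivery pass. Items that fail stay queued for a later pass.
  flush(): number {
    const stillPending: QueuedDelivery[] = [];
    let delivered = 0;
    for (const item of this.pending) {
      item.attempts += 1;
      if (this.send(item.text)) {
        delivered += 1;
      } else {
        stillPending.push(item);
      }
    }
    this.pending = stillPending;
    return delivered;
  }
}

function announceCompletion(
  session: ParentSession,
  queue: DeliveryQueue,
  event: CompletionEvent,
  respond: (event: CompletionEvent) => string,
): void {
  session.inject(event); // step 1: the event is recorded regardless of the channel
  queue.enqueue(event.runId, respond(event)); // step 2: delivery is retried separately
}

// Usage: the channel is down when the worker finishes, then recovers.
let channelUp = false;
const session = new ParentSession();
const queue = new DeliveryQueue(() => channelUp);

announceCompletion(
  session,
  queue,
  { runId: "56dbb3dc", outputPath: "/tmp/worker-output.md" },
  (e) => `Worker ${e.runId} finished`,
);

const deliveredDuringOutage = queue.flush();
channelUp = true;
const deliveredAfterRecovery = queue.flush();
console.log(session.context.length, deliveredDuringOutage, deliveredAfterRecovery);
```

The point of the split is that a channel outage can only delay the user-facing message. The parent session still learns about the completion and can continue a sequential workflow.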

**Related:** OpenClaw #37375 — Discord fetch failure crashes gateway (different bug, same WSL2 networking trigger)
