Subagent completion announce retry-limit logs hide the underlying delivery error #84272

@pearl-dot

Summary

In OpenClaw 2026.5.12, failed subagent completion announcements can end with only:

[warn] Subagent announce give up (retry-limit) run=<runId> child=<childSessionKey> requester=<requesterSessionKey> retries=3 endedAgo=<s>

This line omits the underlying delivery failure. On systems where the gateway LaunchAgent discards stderr, the more useful per-attempt diagnostic is lost as well, so gateway.log alone cannot show whether the failure was a gateway timeout, a Slack/outbound configuration issue, a routed dispatch failure, a model failure, a missing message-tool delivery, or a run with no visible final reply.

Environment

  • OpenClaw: 2026.5.12 (f066dd2)
  • Gateway: macOS LaunchAgent
  • Gateway stdout: ~/.openclaw/logs/gateway.log
  • Gateway stderr: /dev/null

Observed

Live gateway.log had retry-limit warnings for several runs:

2026-05-19T12:32:33.842-04:00 [warn] Subagent announce give up (retry-limit) run=<run-id-a> ... retries=3 endedAgo=22s
2026-05-19T12:51:58.995-04:00 [warn] Subagent announce give up (retry-limit) run=<run-id-b> ... retries=3 endedAgo=22s
2026-05-19T13:03:38.699-04:00 [warn] Subagent announce give up (retry-limit) run=<run-id-c> ... retries=3 endedAgo=10s
2026-05-19T13:20:33.471-04:00 [warn] Subagent announce give up (retry-limit) run=<run-id> ... retries=3 endedAgo=10s

Each run had three preceding subagent_delivery_target fired events with expectsCompletionMessage=true, but the retry-limit warning had no delivery error.
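
To check how many give-ups in a log lack a cause, the warning lines can be scanned mechanically. The following is a minimal sketch, not part of OpenClaw: parseGiveUpLines is a hypothetical helper, and it assumes the key=value layout of the lines quoted in this issue, with the proposed deliveryError field (absent in 2026.5.12) appearing last when present.

```javascript
// Hypothetical helper (not an OpenClaw API): extract fields from
// "Subagent announce give up (retry-limit)" lines in gateway.log text.
function parseGiveUpLines(logText) {
  const marker = "Subagent announce give up (retry-limit)";
  const errKey = " deliveryError="; // proposed field; assumed to be last on the line
  const results = [];
  for (const line of logText.split("\n")) {
    if (!line.includes(marker)) continue;
    const errIdx = line.indexOf(errKey);
    // Parse key=value tokens only from the part before deliveryError,
    // because the error text itself may contain spaces and "=".
    const head = errIdx === -1 ? line : line.slice(0, errIdx);
    const fields = {};
    for (const m of head.matchAll(/(\w+)=(\S+)/g)) fields[m[1]] = m[2];
    results.push({
      run: fields.run ?? null,
      retries: fields.retries !== undefined ? Number(fields.retries) : null,
      endedAgo: fields.endedAgo ?? null,
      deliveryError: errIdx === -1 ? null : line.slice(errIdx + errKey.length),
    });
  }
  return results;
}
```

Entries whose deliveryError is null are the ones an operator cannot diagnose from gateway.log alone.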

Historical gateway.err.log entries show that the per-attempt diagnostic, now missing, previously carried the cause. For example:

Subagent completion direct announce failed for run <historical-run-id>: Error: Outbound not configured for channel: slack
Subagent completion direct announce failed for run <historical-run-id>: routed-dispatch-did-not-queue-final

Other historical causes included gateway timeout after ..., model/fallback failures, and routed-dispatch failures.

Relevant Code Path

  • subagent-registry-32aElbRE.js: resumeSubagentRun() gives up after 3 retries and calls finalizeResumedAnnounceGiveUp({ reason: "retry-limit" }).
  • subagent-registry-32aElbRE.js: onDeliveryResult can format and persist entry.lastAnnounceDeliveryError.
  • subagent-announce-delivery-DzsdC5tX.js: completion messages use direct-primary delivery first, then queue fallback, and concrete failures are returned in delivery.error.
  • subagent-announce-Cdo94lsz.js: direct announce failures can be logged per attempt.

Expected

Operators should be able to diagnose a failed subagent completion announcement from gateway.log alone, even when stderr is discarded.

Proposed Fix

  • Log per-attempt direct completion announce failures through the normal gateway log path.
  • Persist the formatted failure on the run entry via entry.lastAnnounceDeliveryError or equivalent.
  • Include the last known delivery error in Subagent announce give up (retry-limit) and expiry warnings.
  • Include enough phase context to identify whether the direct attempt, queue fallback, or agent-mediated final delivery failed.
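
One way to persist the failure together with phase context is sketched below. This is an illustration under assumptions, not the existing implementation: recordAnnounceFailure, formatDeliveryError, and the phase labels are hypothetical names, and only entry.lastAnnounceDeliveryError comes from the code path described above.

```javascript
// Sketch only: helper names and phase labels are assumptions for illustration.
const ANNOUNCE_PHASES = ["direct", "queue-fallback", "agent-final"];

// Turn an Error, string, or arbitrary value into a single printable string.
function formatDeliveryError(error) {
  if (error instanceof Error) return `${error.name}: ${error.message}`;
  if (typeof error === "string") return error;
  try {
    const json = JSON.stringify(error);
    return json === undefined ? String(error) : json;
  } catch {
    return String(error); // e.g. circular structures
  }
}

// Store the most recent failure on the run entry so that a later
// give-up or expiry warning can include it.
function recordAnnounceFailure(entry, { phase, attempt, error }) {
  if (!ANNOUNCE_PHASES.includes(phase)) {
    throw new Error(`unknown announce phase: ${phase}`);
  }
  // Collapse whitespace so the value stays on one log line.
  const text = formatDeliveryError(error).replace(/\s+/g, " ").trim();
  entry.lastAnnounceDeliveryError = `phase=${phase} attempt=${attempt} ${text}`;
  return entry.lastAnnounceDeliveryError;
}
```

Keeping only the last failure matches the "last known delivery error" wording above; retaining a short history per attempt would be an alternative if operators need to see how the cause changed between retries.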

Suggested warning shape:

[warn] Subagent completion direct announce failed run=<runId> child=<childSessionKey> requester=<requesterSessionKey> attempt=<n> path=direct error=<delivery.error>
[warn] Subagent announce give up (retry-limit) run=<runId> child=<childSessionKey> requester=<requesterSessionKey> retries=<n> endedAgo=<s> deliveryError=<lastAnnounceDeliveryError>
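
The two shapes above could be produced by formatters along these lines. This is a rough sketch: the function and parameter names are hypothetical, and the "unknown" fallback for a missing error is an assumption rather than something the issue specifies.

```javascript
// Hypothetical formatters for the suggested warning shapes (names are assumptions).
function formatAttemptFailure({ runId, child, requester, attempt, path, error }) {
  return (
    `[warn] Subagent completion direct announce failed run=${runId} ` +
    `child=${child} requester=${requester} attempt=${attempt} ` +
    `path=${path} error=${error}`
  );
}

function formatGiveUp({ runId, child, requester, retries, endedAgoSec, lastAnnounceDeliveryError }) {
  // Assumption: print "unknown" when no delivery error was recorded,
  // so the field is always present and easy to search for.
  const deliveryError = lastAnnounceDeliveryError ?? "unknown";
  return (
    `[warn] Subagent announce give up (retry-limit) run=${runId} ` +
    `child=${child} requester=${requester} retries=${retries} ` +
    `endedAgo=${endedAgoSec}s deliveryError=${deliveryError}`
  );
}
```

Placing deliveryError last keeps the existing fields in their current order, so tooling that already parses run, retries, and endedAgo should be unaffected.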

This change is observability-only. It should not alter retry policy, delivery ordering, Slack behavior, cleanup, or hook semantics.

Metadata

    Labels

    • P2: Normal backlog priority with limited blast radius.
    • clawsweeper:fix-shape-clear: ClawSweeper found a clear likely implementation shape for this issue.
    • clawsweeper:queueable-fix: ClawSweeper marked this issue as an existing queue_fix_pr work candidate.
    • clawsweeper:source-repro: ClawSweeper found a high-confidence source-level issue reproduction.
    • impact:message-loss: Channel message delivery can be lost, duplicated, or misrouted.
    • issue-rating: 🦞 diamond lobster: Very strong issue quality with high-confidence source-level or clear reproduction.
