fix(agents): Add retry with exponential backoff for subagent announce delivery by tiny-ship-it · Pull Request #20328 · openclaw/openclaw

tiny-ship-it · 2026-02-18T20:03:10Z

fix(agents): Add retry with exponential backoff for subagent announce delivery

Problem

Subagent completion announcements are silently dropped when the gateway lane times out. The current implementation uses a hardcoded timeout with no retry mechanism, causing:

Silent data loss: Subagent results are lost when the main session is busy
No visibility: Users have no indication that delivery failed
No recovery path: Failed announcements are permanently lost

Evidence from Issue #17000

[ERROR] gateway timeout after 60000ms for method: agent
Subagent completion direct announce failed for run abc123: gateway timeout after 60000ms

The subagent completes successfully, but the user never receives the result.

Solution

1. Configurable Timeout

Added agents.defaults.subagents.announceTimeoutMs configuration with sensible defaults:

agents:
  defaults:
    subagents:
      announceTimeoutMs: 120000  # 2 minutes (default)

Per-agent override supported:

agents:
  list:
    - id: worker
      subagents:
        announceTimeoutMs: 60000  # 1 minute for this agent

2. Retry with Exponential Backoff

Three retries with increasing timeouts:

Attempt	Timeout	Notes
1	60s	Base timeout (configurable)
2	120s	2× base
3	240s	4× base

Each attempt is logged for observability:

[announce retry 1/3] for subagent abc123 (timeout: 60000ms)
[announce retry 2/3] for subagent abc123 (timeout: 120000ms)
[announce retry 3/3] for subagent abc123 (timeout: 240000ms)

3. Persist Failed Announcements

After all retries exhausted, failed announcements are persisted to disk for recovery:

Storage location: ~/.openclaw/announce-failed/<sessionId>.json

Payload includes:

sessionId: Unique identifier for the subagent session
timestamp: When the failure occurred
task: Original task description
result: Subagent output (preserved for recovery)
attempts: Number of delivery attempts
lastError: Final error message

4. Surface Failures Visibly

Error logging with recovery instructions:

[announce FAILED] Subagent completion announcement failed after 3 attempts
  Session ID: abc123
  Session Key: agent:worker:subagent:xyz
  Last Error: gateway timeout after 240000ms
  Recovery: Run "openclaw subagents recover abc123" to retry delivery
  Or use "/subagents log abc123" to view the results

System notification: On final failure, a system message is injected:

⚠️ Sub-agent completed but delivery failed — use /subagents log abc123 to view results

CLI commands:

# List all failed announcements
openclaw subagents list-failed

# Retry delivery for a specific session
openclaw subagents recover abc123

# Delete a failed record without retrying
openclaw subagents recover abc123 --delete

Testing

Unit Tests (`subagent-announce-retry.test.ts`)

✅ calculateRetryTimeout: Verifies exponential backoff calculation
✅ persistFailedAnnounce: Tests file creation and payload serialization
✅ loadFailedAnnounce: Tests loading persisted payloads
✅ listFailedAnnounces: Tests listing and sorting by timestamp
✅ removeFailedAnnounce: Tests cleanup after recovery
✅ withAnnounceRetry: Tests retry logic with mocked failures

Integration Test Plan

Mock gateway timeout: Configure callGateway to timeout on first 2 attempts
Spawn subagent: Start a subagent task during simulated high load
Verify retry: Check logs show 3 retry attempts with correct timeouts
Verify persistence: After failure, check .openclaw/announce-failed/ contains payload
Test recovery: Run openclaw subagents recover <sessionId> and verify delivery

Manual Testing

# 1. Spawn a subagent (in chat)
/spawn task="sleep 5 && echo done"

# 2. Block the main session lane (simulated)
# ... (trigger high load or network issues)

# 3. Wait for retries (check logs)
tail -f ~/.openclaw/openclaw-*.log | grep "announce retry"

# 4. After failure, verify persistence
ls ~/.openclaw/announce-failed/

# 5. Recover
openclaw subagents recover <sessionId>

Migration Notes

Configuration Changes

No breaking changes. New optional config keys:

Key	Type	Default	Description
`agents.defaults.subagents.announceTimeoutMs`	number	120000	Base timeout (ms)
`agents.list[].subagents.announceTimeoutMs`	number	(inherit)	Per-agent override

Upgrade Path

Existing users: No action required. Default behavior improves reliability automatically.
Custom timeouts: Add config if you need shorter/longer timeouts.
Recovery: After upgrade, any previously-lost announcements cannot be recovered (only future failures are persisted).

Files Changed

File	Changes
`src/agents/subagent-announce-retry.ts`	NEW - Retry logic, persistence, recovery helpers
`src/agents/subagent-announce-retry.test.ts`	NEW - Unit tests
`src/agents/subagent-announce.ts`	Updated `sendSubagentAnnounceDirectly` with retry wrapper
`src/config/types.agent-defaults.ts`	Added `announceTimeoutMs` to subagents type
`src/config/types.agents.ts`	Added per-agent `announceTimeoutMs`
`src/config/zod-schema.agent-defaults.ts`	Added schema validation
`src/config/zod-schema.agent-runtime.ts`	Added per-agent schema validation
`src/cli/subagents-cli.ts`	NEW - CLI entry point
`src/cli/subagents-cli/register.ts`	NEW - CLI registration
`src/cli/subagents-cli/recover.ts`	NEW - `list-failed` and `recover` commands
`src/cli/program/command-registry.ts`	Registered subagents CLI

Related Issues

Sub-agent announcements silently dropped on gateway timeout (hardcoded 60s, no retry) #17000 - Subagent completion announcements silently dropped on lane timeout
Duplicate subagent announce delivery when main session is busy #17122 - Gateway dedup cache for announce idempotency (related)

AI-Assisted: Yes (Claude) — fully tested patterns, reviewed implementation.

Greptile Summary

Adds retry with exponential backoff for subagent completion announcement delivery, addressing silent data loss when the gateway lane times out (#17000). Failed announcements are persisted to disk for later CLI-based recovery (openclaw subagents recover).

Bug: Per-agent config lookup broken — resolveAnnounceTimeoutMs in subagent-announce-retry.ts:56 indexes cfg.agents.list (an AgentConfig[] array) with a string agentId. This always returns undefined, so per-agent announceTimeoutMs overrides never take effect. Should use .find((a) => a?.id === agentId) to match the rest of the codebase.
Hardcoded retry count — subagent-announce.ts hardcodes attempts: 3 in failure logging and persistence instead of reading from the retry config, creating a maintenance risk.
Unused imports — resolveAgentIdFromSessionKey and resolveStorePath are imported in the retry module but unused there.
Timeout behavior change — The original code used a hardcoded 15_000ms timeout per gateway call; the retry wrapper starts at 60_000ms (or 120_000ms default from config) with 2x exponential backoff, which significantly increases worst-case latency for a single announce flow (up to ~7 minutes total across 3 retries). This is an intentional tradeoff documented in the PR but worth noting for reviewers.

Confidence Score: 2/5

Contains a logic bug that silently disables per-agent config overrides; safe at the global config level but the per-agent feature is broken.
The per-agent announceTimeoutMs config lookup is broken due to array-vs-record indexing, meaning a documented feature will not work. The global defaults path and retry logic are correct, so the core improvement (retry + persistence) does function. Hardcoded attempts: 3 is a maintenance concern. Overall the PR improves reliability but ships a broken per-agent override.
src/agents/subagent-announce-retry.ts (broken per-agent config lookup at line 56), src/agents/subagent-announce.ts (hardcoded retry count)

_{Last reviewed commit: 131ac4a}

_{(2/5) Greptile learns from your feedback when you react with thumbs up/down!}

… delivery Fixes openclaw#17000 ## Problem Subagent completion announcements are silently dropped on lane timeout (hardcoded 60s, no retry). This affects all users when the gateway is busy or network issues cause temporary failures. ## Solution 1. **Configurable timeout**: Added `agents.defaults.subagents.announceTimeoutMs` (default: 120000ms) with per-agent override support. 2. **Retry with exponential backoff**: 3 retries with increasing timeouts: - Attempt 1: 60s (configurable base) - Attempt 2: 120s (2x base) - Attempt 3: 240s (4x base) Each attempt is logged: `[announce retry 1/3] for subagent ${sessionId}` 3. **Persist failed announcements**: Failed announce payloads are stored in `.openclaw/announce-failed/` as JSON for recovery. Includes: - sessionId, timestamp, task, result, attempts - Recovery via: `openclaw subagents recover <sessionId>` 4. **Surface failures visibly**: - System message on final failure with recovery instructions - ERROR-level diagnostic log with full error trace - CLI commands: `openclaw subagents list-failed` and `recover` ## Config Schema Changes - Added `agents.defaults.subagents.announceTimeoutMs` (10000-600000ms) - Added `agents.list[agentId].subagents.announceTimeoutMs` per-agent override ## Testing - Unit tests for retry logic, config loading, persistence - Verify retry behavior with mocked callGateway timeouts - Integration: spawn subagent, block lane artificially, verify retries

greptile-apps

_{11 files reviewed, 3 comments}

_{Edit Code Review Agent Settings | Greptile}

greptile-apps · 2026-02-18T20:06:31Z

src/agents/subagent-announce-retry.ts

+
+  // Check per-agent override
+  if (agentId) {
+    const agentConfig = cfg.agents?.list?.[agentId];


Per-agent config lookup indexes array with string key

cfg.agents?.list is typed as AgentConfig[] (an array), but this line indexes it with agentId (a string). In JavaScript, indexing an array with a non-numeric string always returns undefined, so per-agent announceTimeoutMs overrides will never be applied.

The rest of the codebase uses .find() for agent lookups (e.g. src/security/audit-extra.sync.ts:888):

Suggested change

const agentConfig = cfg.agents?.list?.[agentId];

const agentConfig = cfg.agents?.list?.find((a) => a?.id === agentId);

Prompt To Fix With AI

This is a comment left during a code review. Path: src/agents/subagent-announce-retry.ts Line: 56:56 Comment: **Per-agent config lookup indexes array with string key** `cfg.agents?.list` is typed as `AgentConfig[]` (an array), but this line indexes it with `agentId` (a string). In JavaScript, indexing an array with a non-numeric string always returns `undefined`, so per-agent `announceTimeoutMs` overrides will never be applied. The rest of the codebase uses `.find()` for agent lookups (e.g. `src/security/audit-extra.sync.ts:888`): ```suggestion const agentConfig = cfg.agents?.list?.find((a) => a?.id === agentId); ``` How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-02-18T20:06:33Z

src/agents/subagent-announce.ts

+      logFinalFailure({
+        sessionId: announceSessionId,
+        childSessionKey: params.childSessionKey,
+        attempts: 3, // max retries from retry config


Hardcoded attempts: 3 ignores actual retry config

The retry count is hardcoded to 3 here and on line 1023, but maxRetries is configurable via DEFAULT_RETRY_CONFIG and could change. If maxRetries is ever updated or made configurable, these hardcoded values will become stale. Consider deriving the attempt count from the retry config or from the withAnnounceRetry result:

const retryConfig = resolveRetryConfig(agentId); // ... attempts: retryConfig.maxRetries,

Or better, withAnnounceRetry could return the actual attempts count in the failure result (which it already does), and you could thread it through via the delivery result.

Prompt To Fix With AI

This is a comment left during a code review. Path: src/agents/subagent-announce.ts Line: 1010:1010 Comment: **Hardcoded `attempts: 3` ignores actual retry config** The retry count is hardcoded to `3` here and on line 1023, but `maxRetries` is configurable via `DEFAULT_RETRY_CONFIG` and could change. If `maxRetries` is ever updated or made configurable, these hardcoded values will become stale. Consider deriving the attempt count from the retry config or from the `withAnnounceRetry` result: ``` const retryConfig = resolveRetryConfig(agentId); // ... attempts: retryConfig.maxRetries, ``` Or better, `withAnnounceRetry` could return the actual `attempts` count in the failure result (which it already does), and you could thread it through via the delivery result. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-02-18T20:06:34Z

src/agents/subagent-announce-retry.ts

+import fs from "node:fs";
+import path from "node:path";
+import { loadConfig } from "../config/config.js";
+import { resolveAgentIdFromSessionKey, resolveStorePath } from "../config/sessions.js";


Unused imports

Both resolveAgentIdFromSessionKey and resolveStorePath are imported but never used in this file. They are used in subagent-announce.ts instead. This will be caught by lint.

Suggested change

import { resolveAgentIdFromSessionKey, resolveStorePath } from "../config/sessions.js";

import { resolveStateDir } from "../config/config.js";

(Remove the unused import from ../config/sessions.js entirely.)

Prompt To Fix With AI

This is a comment left during a code review. Path: src/agents/subagent-announce-retry.ts Line: 13:13 Comment: **Unused imports** Both `resolveAgentIdFromSessionKey` and `resolveStorePath` are imported but never used in this file. They are used in `subagent-announce.ts` instead. This will be caught by lint. ```suggestion import { resolveStateDir } from "../config/config.js"; ``` (Remove the unused import from `../config/sessions.js` entirely.) How can I resolve this? If you propose a fix, please make it concise.

steipete · 2026-02-24T04:20:08Z

Closing as superseded/duplicate. The announce-delivery reliability gap is now covered by merged fixes in #24783 (queue backoff/retry hardening) and #24642 (completion announce fallback + route propagation), with added regression coverage. To avoid parallel divergence in subagent announce behavior, we should consolidate on those landed paths. If there is still a missing case (especially around timeout semantics), please open a narrow follow-up PR against current main with a failing test first.

openclaw-barnacle bot added cli CLI command changes agents Agent runtime and tooling size: XL labels Feb 18, 2026

greptile-apps bot reviewed Feb 18, 2026

View reviewed changes

yuemori mentioned this pull request Feb 20, 2026

[Bug]: Subagent completion announce bypasses main LLM since v2026.2.19 (send path regression) #22099

Closed

Valadon mentioned this pull request Feb 21, 2026

fix(agents): make subagent announce timeout configurable (restore 60s default) #22719

Closed

18 tasks

steipete closed this Feb 24, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(agents): Add retry with exponential backoff for subagent announce delivery#20328

fix(agents): Add retry with exponential backoff for subagent announce delivery#20328
tiny-ship-it wants to merge 1 commit intoopenclaw:mainfrom
tiny-ship-it:fix/subagent-announce-retry

tiny-ship-it commented Feb 18, 2026 •

edited by greptile-apps bot

Loading

Uh oh!

greptile-apps bot left a comment

Uh oh!

greptile-apps bot Feb 18, 2026

Uh oh!

greptile-apps bot Feb 18, 2026

Uh oh!

greptile-apps bot Feb 18, 2026

Uh oh!

steipete commented Feb 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	const agentConfig = cfg.agents?.list?.[agentId];
	const agentConfig = cfg.agents?.list?.find((a) => a?.id === agentId);

	import { resolveAgentIdFromSessionKey, resolveStorePath } from "../config/sessions.js";
	import { resolveStateDir } from "../config/config.js";

Uh oh!

Conversation

tiny-ship-it commented Feb 18, 2026 • edited by greptile-apps bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

fix(agents): Add retry with exponential backoff for subagent announce delivery

Problem

Evidence from Issue #17000

Solution

1. Configurable Timeout

2. Retry with Exponential Backoff

3. Persist Failed Announcements

4. Surface Failures Visibly

Testing

Unit Tests (subagent-announce-retry.test.ts)

Integration Test Plan

Manual Testing

Migration Notes

Configuration Changes

Upgrade Path

Files Changed

Related Issues

Greptile Summary

Confidence Score: 2/5

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Feb 18, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Feb 18, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Feb 18, 2026

Choose a reason for hiding this comment

Uh oh!

steipete commented Feb 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

tiny-ship-it commented Feb 18, 2026 •

edited by greptile-apps bot

Loading

Unit Tests (`subagent-announce-retry.test.ts`)