fix(agents): Add retry with exponential backoff for subagent announce delivery #20328

Closed
tiny-ship-it wants to merge 1 commit into openclaw:main from tiny-ship-it:fix/subagent-announce-retry

Conversation

@tiny-ship-it tiny-ship-it commented Feb 18, 2026

fix(agents): Add retry with exponential backoff for subagent announce delivery

Fixes #17000

Problem

Subagent completion announcements are silently dropped when the gateway lane times out. The current implementation uses a hardcoded timeout with no retry mechanism, causing:

  • Silent data loss: Subagent results are lost when the main session is busy
  • No visibility: Users have no indication that delivery failed
  • No recovery path: Failed announcements are permanently lost

Evidence from Issue #17000

[ERROR] gateway timeout after 60000ms for method: agent
Subagent completion direct announce failed for run abc123: gateway timeout after 60000ms

The subagent completes successfully, but the user never receives the result.

Solution

1. Configurable Timeout

Added agents.defaults.subagents.announceTimeoutMs configuration with sensible defaults:

agents:
  defaults:
    subagents:
      announceTimeoutMs: 120000  # 2 minutes (default)

Per-agent override supported:

agents:
  list:
    - id: worker
      subagents:
        announceTimeoutMs: 60000  # 1 minute for this agent

2. Retry with Exponential Backoff

Three attempts with increasing timeouts:

| Attempt | Timeout | Notes |
|---------|---------|-------|
| 1 | 60s | Base timeout (configurable) |
| 2 | 120s | 2× base |
| 3 | 240s | 4× base |

Each attempt is logged for observability:

[announce retry 1/3] for subagent abc123 (timeout: 60000ms)
[announce retry 2/3] for subagent abc123 (timeout: 120000ms)
[announce retry 3/3] for subagent abc123 (timeout: 240000ms)
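
The schedule above can be sketched roughly as follows. The PR names `calculateRetryTimeout` and `withAnnounceRetry`, but their real signatures are not shown on this page, so the parameter lists, the `RetryConfig` shape, and the `RetryResult` type below are assumptions made for illustration only. Note that in this sketch the "backoff" is the growing per-attempt timeout; no extra sleep between attempts is assumed.

```typescript
// Hypothetical shapes: only the function names come from the PR text.
interface RetryConfig {
  maxRetries: number;    // total delivery attempts (3 in the PR description)
  baseTimeoutMs: number; // timeout used for the first attempt
  multiplier: number;    // growth factor per attempt (2 gives 60s, 120s, 240s)
}

type RetryResult<T> =
  | { ok: true; value: T; attempts: number }
  | { ok: false; error: Error; attempts: number };

// Timeout for a 1-based attempt number: base * multiplier^(attempt - 1).
function calculateRetryTimeout(attempt: number, cfg: RetryConfig): number {
  return cfg.baseTimeoutMs * Math.pow(cfg.multiplier, attempt - 1);
}

// Calls `send` with a growing timeout until it succeeds or attempts run out.
async function withAnnounceRetry<T>(
  send: (timeoutMs: number) => Promise<T>,
  cfg: RetryConfig,
  log: (line: string) => void = console.log,
): Promise<RetryResult<T>> {
  let lastError = new Error("no attempts made");
  for (let attempt = 1; attempt <= cfg.maxRetries; attempt++) {
    const timeoutMs = calculateRetryTimeout(attempt, cfg);
    log(`[announce retry ${attempt}/${cfg.maxRetries}] (timeout: ${timeoutMs}ms)`);
    try {
      const value = await send(timeoutMs);
      return { ok: true, value, attempts: attempt };
    } catch (err) {
      // Remember the failure and fall through to the next attempt.
      lastError = err instanceof Error ? err : new Error(String(err));
    }
  }
  return { ok: false, error: lastError, attempts: cfg.maxRetries };
}
```

Returning the attempt count in both branches lets callers log or persist the real number of attempts instead of assuming one.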

3. Persist Failed Announcements

After all retries are exhausted, failed announcements are persisted to disk for recovery:

Storage location: ~/.openclaw/announce-failed/<sessionId>.json

Payload includes:

  • sessionId: Unique identifier for the subagent session
  • timestamp: When the failure occurred
  • task: Original task description
  • result: Subagent output (preserved for recovery)
  • attempts: Number of delivery attempts
  • lastError: Final error message
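
A minimal sketch of what writing and reading such a record could look like. The field names follow the payload list above and the function names `persistFailedAnnounce` and `loadFailedAnnounce` appear in the PR's test list, but the field types, the `dir` parameter, and the epoch-millisecond timestamp are assumptions, not the PR's actual code.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Field names come from the PR description; the types are assumed.
interface FailedAnnouncePayload {
  sessionId: string;
  timestamp: number; // epoch milliseconds (assumed representation)
  task: string;
  result: string;
  attempts: number;
  lastError: string;
}

// Storage location stated in the PR description.
const DEFAULT_DIR = path.join(os.homedir(), ".openclaw", "announce-failed");

// Writes <dir>/<sessionId>.json and returns the file path.
// Assumes sessionId is already safe to use as a file name.
function persistFailedAnnounce(
  payload: FailedAnnouncePayload,
  dir: string = DEFAULT_DIR,
): string {
  fs.mkdirSync(dir, { recursive: true });
  const file = path.join(dir, `${payload.sessionId}.json`);
  fs.writeFileSync(file, JSON.stringify(payload, null, 2), "utf8");
  return file;
}

// Returns the stored payload, or undefined if no record exists.
function loadFailedAnnounce(
  sessionId: string,
  dir: string = DEFAULT_DIR,
): FailedAnnouncePayload | undefined {
  const file = path.join(dir, `${sessionId}.json`);
  if (!fs.existsSync(file)) return undefined;
  return JSON.parse(fs.readFileSync(file, "utf8")) as FailedAnnouncePayload;
}
```

Making the directory a parameter keeps the helpers testable against a temporary directory rather than the user's home directory.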

4. Surface Failures Visibly

Error logging with recovery instructions:

[announce FAILED] Subagent completion announcement failed after 3 attempts
  Session ID: abc123
  Session Key: agent:worker:subagent:xyz
  Last Error: gateway timeout after 240000ms
  Recovery: Run "openclaw subagents recover abc123" to retry delivery
  Or use "/subagents log abc123" to view the results

System notification: On final failure, a system message is injected:

⚠️ Sub-agent completed but delivery failed — use /subagents log abc123 to view results

CLI commands:

# List all failed announcements
openclaw subagents list-failed

# Retry delivery for a specific session
openclaw subagents recover abc123

# Delete a failed record without retrying
openclaw subagents recover abc123 --delete

Testing

Unit Tests (subagent-announce-retry.test.ts)

  • calculateRetryTimeout: Verifies exponential backoff calculation
  • persistFailedAnnounce: Tests file creation and payload serialization
  • loadFailedAnnounce: Tests loading persisted payloads
  • listFailedAnnounces: Tests listing and sorting by timestamp
  • removeFailedAnnounce: Tests cleanup after recovery
  • withAnnounceRetry: Tests retry logic with mocked failures

Integration Test Plan

  1. Mock gateway timeout: Configure callGateway to time out on the first 2 attempts
  2. Spawn subagent: Start a subagent task during simulated high load
  3. Verify retry: Check that the logs show 3 retry attempts with the correct timeouts
  4. Verify persistence: After failure, check that ~/.openclaw/announce-failed/ contains the payload
  5. Test recovery: Run openclaw subagents recover <sessionId> and verify delivery

Manual Testing

# 1. Spawn a subagent (in chat)
/spawn task="sleep 5 && echo done"

# 2. Block the main session lane (simulated)
# ... (trigger high load or network issues)

# 3. Wait for retries (check logs)
tail -f ~/.openclaw/openclaw-*.log | grep "announce retry"

# 4. After failure, verify persistence
ls ~/.openclaw/announce-failed/

# 5. Recover
openclaw subagents recover <sessionId>

Migration Notes

Configuration Changes

No breaking changes. New optional config keys:

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| agents.defaults.subagents.announceTimeoutMs | number | 120000 | Base timeout (ms) |
| agents.list[].subagents.announceTimeoutMs | number | (inherit) | Per-agent override |

Upgrade Path

  1. Existing users: No action required. Default behavior improves reliability automatically.
  2. Custom timeouts: Add config if you need shorter/longer timeouts.
  3. Recovery: After upgrade, any previously-lost announcements cannot be recovered (only future failures are persisted).

Files Changed

| File | Changes |
|------|---------|
| src/agents/subagent-announce-retry.ts | NEW - Retry logic, persistence, recovery helpers |
| src/agents/subagent-announce-retry.test.ts | NEW - Unit tests |
| src/agents/subagent-announce.ts | Updated sendSubagentAnnounceDirectly with retry wrapper |
| src/config/types.agent-defaults.ts | Added announceTimeoutMs to subagents type |
| src/config/types.agents.ts | Added per-agent announceTimeoutMs |
| src/config/zod-schema.agent-defaults.ts | Added schema validation |
| src/config/zod-schema.agent-runtime.ts | Added per-agent schema validation |
| src/cli/subagents-cli.ts | NEW - CLI entry point |
| src/cli/subagents-cli/register.ts | NEW - CLI registration |
| src/cli/subagents-cli/recover.ts | NEW - list-failed and recover commands |
| src/cli/program/command-registry.ts | Registered subagents CLI |

Related Issues


AI-Assisted: Yes (Claude) — fully tested patterns, reviewed implementation.

Greptile Summary

Adds retry with exponential backoff for subagent completion announcement delivery, addressing silent data loss when the gateway lane times out (#17000). Failed announcements are persisted to disk for later CLI-based recovery (openclaw subagents recover).

  • Bug: Per-agent config lookup broken. resolveAnnounceTimeoutMs in subagent-announce-retry.ts:56 indexes cfg.agents.list (an AgentConfig[] array) with a string agentId. This always returns undefined, so per-agent announceTimeoutMs overrides never take effect. Should use .find((a) => a?.id === agentId) to match the rest of the codebase.
  • Hardcoded retry count. subagent-announce.ts hardcodes attempts: 3 in failure logging and persistence instead of reading from the retry config, creating a maintenance risk.
  • Unused imports. resolveAgentIdFromSessionKey and resolveStorePath are imported in the retry module but unused there.
  • Timeout behavior change — The original code used a hardcoded 15_000ms timeout per gateway call; the retry wrapper starts at 60_000ms (or 120_000ms default from config) with 2x exponential backoff, which significantly increases worst-case latency for a single announce flow (up to ~7 minutes total across 3 retries). This is an intentional tradeoff documented in the PR but worth noting for reviewers.

Confidence Score: 2/5

  • Contains a logic bug that silently disables per-agent config overrides; safe at the global config level but the per-agent feature is broken.
  • The per-agent announceTimeoutMs config lookup is broken due to array-vs-record indexing, meaning a documented feature will not work. The global defaults path and retry logic are correct, so the core improvement (retry + persistence) does function. Hardcoded attempts: 3 is a maintenance concern. Overall the PR improves reliability but ships a broken per-agent override.
  • src/agents/subagent-announce-retry.ts (broken per-agent config lookup at line 56), src/agents/subagent-announce.ts (hardcoded retry count)

Last reviewed commit: 131ac4a


… delivery

Fixes openclaw#17000

## Problem
Subagent completion announcements are silently dropped on lane timeout
(hardcoded 60s, no retry). This affects all users when the gateway is
busy or network issues cause temporary failures.

## Solution
1. **Configurable timeout**: Added `agents.defaults.subagents.announceTimeoutMs`
   (default: 120000ms) with per-agent override support.

2. **Retry with exponential backoff**: 3 retries with increasing timeouts:
   - Attempt 1: 60s (configurable base)
   - Attempt 2: 120s (2x base)
   - Attempt 3: 240s (4x base)
   Each attempt is logged: `[announce retry 1/3] for subagent ${sessionId}`

3. **Persist failed announcements**: Failed announce payloads are stored in
   `.openclaw/announce-failed/` as JSON for recovery. Includes:
   - sessionId, timestamp, task, result, attempts
   - Recovery via: `openclaw subagents recover <sessionId>`

4. **Surface failures visibly**:
   - System message on final failure with recovery instructions
   - ERROR-level diagnostic log with full error trace
   - CLI commands: `openclaw subagents list-failed` and `recover`

## Config Schema Changes
- Added `agents.defaults.subagents.announceTimeoutMs` (10000-600000ms)
- Added `agents.list[agentId].subagents.announceTimeoutMs` per-agent override

## Testing
- Unit tests for retry logic, config loading, persistence
- Verify retry behavior with mocked callGateway timeouts
- Integration: spawn subagent, block lane artificially, verify retries

@openclaw-barnacle (bot) added the labels cli (CLI command changes), agents (Agent runtime and tooling), and size: XL on Feb 18, 2026

@greptile-apps greptile-apps bot left a comment

11 files reviewed, 3 comments


// Check per-agent override
if (agentId) {
const agentConfig = cfg.agents?.list?.[agentId];

Per-agent config lookup indexes array with string key

cfg.agents?.list is typed as AgentConfig[] (an array), but this line indexes it with agentId (a string). In JavaScript, indexing an array with a non-numeric string returns undefined unless the string happens to name an existing array property (such as length), so per-agent announceTimeoutMs overrides will never be applied.

The rest of the codebase uses .find() for agent lookups (e.g. src/security/audit-extra.sync.ts:888):

Suggested change
- const agentConfig = cfg.agents?.list?.[agentId];
+ const agentConfig = cfg.agents?.list?.find((a) => a?.id === agentId);
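
To illustrate the difference between the two lookups, here is a small self-contained sketch. The `AgentConfig` and `Config` interfaces are simplified stand-ins for the project's real types, and the body of `resolveAnnounceTimeoutMs` is a guess at the intended behavior (per-agent override, then global default, then 120000 ms as stated in the PR description), not the PR's actual implementation.

```typescript
// Simplified stand-ins for the real config types (assumed shapes).
interface AgentConfig {
  id: string;
  subagents?: { announceTimeoutMs?: number };
}
interface Config {
  agents?: {
    defaults?: { subagents?: { announceTimeoutMs?: number } };
    list?: AgentConfig[];
  };
}

const DEFAULT_ANNOUNCE_TIMEOUT_MS = 120_000; // default stated in the PR description

// Per-agent override first, then the global default, then the built-in default.
function resolveAnnounceTimeoutMs(cfg: Config, agentId?: string): number {
  if (agentId) {
    const agentConfig = cfg.agents?.list?.find((a) => a?.id === agentId);
    const override = agentConfig?.subagents?.announceTimeoutMs;
    if (override !== undefined) return override;
  }
  return cfg.agents?.defaults?.subagents?.announceTimeoutMs ?? DEFAULT_ANNOUNCE_TIMEOUT_MS;
}

const cfg: Config = {
  agents: { list: [{ id: "worker", subagents: { announceTimeoutMs: 60_000 } }] },
};

// Indexing the array by agent id, as the original line does, finds nothing.
// The cast only exists to let this compile; the runtime behavior is the point.
const broken = (cfg.agents?.list as unknown as Record<string, AgentConfig | undefined>)["worker"];

console.log(broken);                                  // undefined
console.log(resolveAnnounceTimeoutMs(cfg, "worker")); // 60000
console.log(resolveAnnounceTimeoutMs(cfg, "other"));  // 120000
```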

logFinalFailure({
sessionId: announceSessionId,
childSessionKey: params.childSessionKey,
attempts: 3, // max retries from retry config

Hardcoded attempts: 3 ignores actual retry config

The retry count is hardcoded to 3 here and on line 1023, but maxRetries is configurable via DEFAULT_RETRY_CONFIG and could change. If maxRetries is ever updated or made configurable, these hardcoded values will become stale. Consider deriving the attempt count from the retry config or from the withAnnounceRetry result:

const retryConfig = resolveRetryConfig(agentId);
// ...
attempts: retryConfig.maxRetries,

Or better, withAnnounceRetry could return the actual attempts count in the failure result (which it already does), and you could thread it through via the delivery result.
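
One way the second option could look, as a sketch only: the `DeliveryResult` and `FinalFailure` shapes and the `toFinalFailure` helper are hypothetical names introduced here, since the page does not show what `withAnnounceRetry` or `logFinalFailure` actually accept or return.

```typescript
// Hypothetical result shape carrying the real attempt count.
type DeliveryResult =
  | { ok: true; attempts: number }
  | { ok: false; attempts: number; lastError: string };

// Hypothetical subset of the fields passed to failure logging and persistence.
interface FinalFailure {
  sessionId: string;
  attempts: number;
  lastError: string;
}

// Builds the failure record from the retry result instead of a literal 3.
function toFinalFailure(
  sessionId: string,
  result: DeliveryResult,
): FinalFailure | undefined {
  if (result.ok) return undefined;
  return { sessionId, attempts: result.attempts, lastError: result.lastError };
}

const failure = toFinalFailure("abc123", {
  ok: false,
  attempts: 3,
  lastError: "gateway timeout after 240000ms",
});
console.log(failure);
```

With this shape, changing the retry count in one place would flow through to both the log line and the persisted record.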

import fs from "node:fs";
import path from "node:path";
import { loadConfig } from "../config/config.js";
import { resolveAgentIdFromSessionKey, resolveStorePath } from "../config/sessions.js";

Unused imports

Both resolveAgentIdFromSessionKey and resolveStorePath are imported but never used in this file. They are used in subagent-announce.ts instead. This will be caught by lint.

Suggested change
- import { resolveAgentIdFromSessionKey, resolveStorePath } from "../config/sessions.js";
+ import { resolveStateDir } from "../config/config.js";

(Remove the unused import from ../config/sessions.js entirely.)

@steipete

Closing as superseded/duplicate. The announce-delivery reliability gap is now covered by merged fixes in #24783 (queue backoff/retry hardening) and #24642 (completion announce fallback + route propagation), with added regression coverage. To avoid parallel divergence in subagent announce behavior, we should consolidate on those landed paths. If there is still a missing case (especially around timeout semantics), please open a narrow follow-up PR against current main with a failing test first.

@steipete steipete closed this Feb 24, 2026
Development

Successfully merging this pull request may close these issues.

Sub-agent announcements silently dropped on gateway timeout (hardcoded 60s, no retry)

2 participants