fix(agents): Add retry with exponential backoff for subagent announce delivery#20328
fix(agents): Add retry with exponential backoff for subagent announce delivery#20328tiny-ship-it wants to merge 1 commit intoopenclaw:mainfrom
Conversation
… delivery Fixes openclaw#17000 ## Problem Subagent completion announcements are silently dropped on lane timeout (hardcoded 60s, no retry). This affects all users when the gateway is busy or network issues cause temporary failures. ## Solution 1. **Configurable timeout**: Added `agents.defaults.subagents.announceTimeoutMs` (default: 120000ms) with per-agent override support. 2. **Retry with exponential backoff**: 3 retries with increasing timeouts: - Attempt 1: 60s (configurable base) - Attempt 2: 120s (2x base) - Attempt 3: 240s (4x base) Each attempt is logged: `[announce retry 1/3] for subagent ${sessionId}` 3. **Persist failed announcements**: Failed announce payloads are stored in `.openclaw/announce-failed/` as JSON for recovery. Includes: - sessionId, timestamp, task, result, attempts - Recovery via: `openclaw subagents recover <sessionId>` 4. **Surface failures visibly**: - System message on final failure with recovery instructions - ERROR-level diagnostic log with full error trace - CLI commands: `openclaw subagents list-failed` and `recover` ## Config Schema Changes - Added `agents.defaults.subagents.announceTimeoutMs` (10000-600000ms) - Added `agents.list[agentId].subagents.announceTimeoutMs` per-agent override ## Testing - Unit tests for retry logic, config loading, persistence - Verify retry behavior with mocked callGateway timeouts - Integration: spawn subagent, block lane artificially, verify retries
|
|
||
| // Check per-agent override | ||
| if (agentId) { | ||
| const agentConfig = cfg.agents?.list?.[agentId]; |
There was a problem hiding this comment.
Per-agent config lookup indexes array with string key
cfg.agents?.list is typed as AgentConfig[] (an array), but this line indexes it with agentId (a string). In JavaScript, indexing an array with a non-numeric string always returns undefined, so per-agent announceTimeoutMs overrides will never be applied.
The rest of the codebase uses .find() for agent lookups (e.g. src/security/audit-extra.sync.ts:888):
| const agentConfig = cfg.agents?.list?.[agentId]; | |
| const agentConfig = cfg.agents?.list?.find((a) => a?.id === agentId); |
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/agents/subagent-announce-retry.ts
Line: 56:56
Comment:
**Per-agent config lookup indexes array with string key**
`cfg.agents?.list` is typed as `AgentConfig[]` (an array), but this line indexes it with `agentId` (a string). In JavaScript, indexing an array with a non-numeric string always returns `undefined`, so per-agent `announceTimeoutMs` overrides will never be applied.
The rest of the codebase uses `.find()` for agent lookups (e.g. `src/security/audit-extra.sync.ts:888`):
```suggestion
const agentConfig = cfg.agents?.list?.find((a) => a?.id === agentId);
```
How can I resolve this? If you propose a fix, please make it concise.| logFinalFailure({ | ||
| sessionId: announceSessionId, | ||
| childSessionKey: params.childSessionKey, | ||
| attempts: 3, // max retries from retry config |
There was a problem hiding this comment.
Hardcoded attempts: 3 ignores actual retry config
The retry count is hardcoded to 3 here and on line 1023, but maxRetries is configurable via DEFAULT_RETRY_CONFIG and could change. If maxRetries is ever updated or made configurable, these hardcoded values will become stale. Consider deriving the attempt count from the retry config or from the withAnnounceRetry result:
const retryConfig = resolveRetryConfig(agentId);
// ...
attempts: retryConfig.maxRetries,
Or better, withAnnounceRetry could return the actual attempts count in the failure result (which it already does), and you could thread it through via the delivery result.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/agents/subagent-announce.ts
Line: 1010:1010
Comment:
**Hardcoded `attempts: 3` ignores actual retry config**
The retry count is hardcoded to `3` here and on line 1023, but `maxRetries` is configurable via `DEFAULT_RETRY_CONFIG` and could change. If `maxRetries` is ever updated or made configurable, these hardcoded values will become stale. Consider deriving the attempt count from the retry config or from the `withAnnounceRetry` result:
```
const retryConfig = resolveRetryConfig(agentId);
// ...
attempts: retryConfig.maxRetries,
```
Or better, `withAnnounceRetry` could return the actual `attempts` count in the failure result (which it already does), and you could thread it through via the delivery result.
How can I resolve this? If you propose a fix, please make it concise.| import fs from "node:fs"; | ||
| import path from "node:path"; | ||
| import { loadConfig } from "../config/config.js"; | ||
| import { resolveAgentIdFromSessionKey, resolveStorePath } from "../config/sessions.js"; |
There was a problem hiding this comment.
Unused imports
Both resolveAgentIdFromSessionKey and resolveStorePath are imported but never used in this file. They are used in subagent-announce.ts instead. This will be caught by lint.
| import { resolveAgentIdFromSessionKey, resolveStorePath } from "../config/sessions.js"; | |
| import { resolveStateDir } from "../config/config.js"; |
(Remove the unused import from ../config/sessions.js entirely.)
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/agents/subagent-announce-retry.ts
Line: 13:13
Comment:
**Unused imports**
Both `resolveAgentIdFromSessionKey` and `resolveStorePath` are imported but never used in this file. They are used in `subagent-announce.ts` instead. This will be caught by lint.
```suggestion
import { resolveStateDir } from "../config/config.js";
```
(Remove the unused import from `../config/sessions.js` entirely.)
How can I resolve this? If you propose a fix, please make it concise.|
Closing as superseded/duplicate. The announce-delivery reliability gap is now covered by merged fixes in #24783 (queue backoff/retry hardening) and #24642 (completion announce fallback + route propagation), with added regression coverage. To avoid parallel divergence in subagent announce behavior, we should consolidate on those landed paths. If there is still a missing case (especially around timeout semantics), please open a narrow follow-up PR against current main with a failing test first. |
fix(agents): Add retry with exponential backoff for subagent announce delivery
Fixes #17000
Problem
Subagent completion announcements are silently dropped when the gateway lane times out. The current implementation uses a hardcoded timeout with no retry mechanism, causing:
Evidence from Issue #17000
The subagent completes successfully, but the user never receives the result.
Solution
1. Configurable Timeout
Added
agents.defaults.subagents.announceTimeoutMsconfiguration with sensible defaults:Per-agent override supported:
2. Retry with Exponential Backoff
Three retries with increasing timeouts:
Each attempt is logged for observability:
3. Persist Failed Announcements
After all retries exhausted, failed announcements are persisted to disk for recovery:
Storage location:
~/.openclaw/announce-failed/<sessionId>.jsonPayload includes:
sessionId: Unique identifier for the subagent sessiontimestamp: When the failure occurredtask: Original task descriptionresult: Subagent output (preserved for recovery)attempts: Number of delivery attemptslastError: Final error message4. Surface Failures Visibly
Error logging with recovery instructions:
System notification: On final failure, a system message is injected:
CLI commands:
Testing
Unit Tests (
subagent-announce-retry.test.ts)calculateRetryTimeout: Verifies exponential backoff calculationpersistFailedAnnounce: Tests file creation and payload serializationloadFailedAnnounce: Tests loading persisted payloadslistFailedAnnounces: Tests listing and sorting by timestampremoveFailedAnnounce: Tests cleanup after recoverywithAnnounceRetry: Tests retry logic with mocked failuresIntegration Test Plan
callGatewayto timeout on first 2 attempts.openclaw/announce-failed/contains payloadopenclaw subagents recover <sessionId>and verify deliveryManual Testing
Migration Notes
Configuration Changes
No breaking changes. New optional config keys:
agents.defaults.subagents.announceTimeoutMsagents.list[].subagents.announceTimeoutMsUpgrade Path
Files Changed
src/agents/subagent-announce-retry.tssrc/agents/subagent-announce-retry.test.tssrc/agents/subagent-announce.tssendSubagentAnnounceDirectlywith retry wrappersrc/config/types.agent-defaults.tsannounceTimeoutMsto subagents typesrc/config/types.agents.tsannounceTimeoutMssrc/config/zod-schema.agent-defaults.tssrc/config/zod-schema.agent-runtime.tssrc/cli/subagents-cli.tssrc/cli/subagents-cli/register.tssrc/cli/subagents-cli/recover.tslist-failedandrecovercommandssrc/cli/program/command-registry.tsRelated Issues
AI-Assisted: Yes (Claude) — fully tested patterns, reviewed implementation.
Greptile Summary
Adds retry with exponential backoff for subagent completion announcement delivery, addressing silent data loss when the gateway lane times out (#17000). Failed announcements are persisted to disk for later CLI-based recovery (
openclaw subagents recover).resolveAnnounceTimeoutMsinsubagent-announce-retry.ts:56indexescfg.agents.list(anAgentConfig[]array) with a stringagentId. This always returnsundefined, so per-agentannounceTimeoutMsoverrides never take effect. Should use.find((a) => a?.id === agentId)to match the rest of the codebase.subagent-announce.tshardcodesattempts: 3in failure logging and persistence instead of reading from the retry config, creating a maintenance risk.resolveAgentIdFromSessionKeyandresolveStorePathare imported in the retry module but unused there.15_000mstimeout per gateway call; the retry wrapper starts at60_000ms(or120_000msdefault from config) with 2x exponential backoff, which significantly increases worst-case latency for a single announce flow (up to ~7 minutes total across 3 retries). This is an intentional tradeoff documented in the PR but worth noting for reviewers.Confidence Score: 2/5
announceTimeoutMsconfig lookup is broken due to array-vs-record indexing, meaning a documented feature will not work. The global defaults path and retry logic are correct, so the core improvement (retry + persistence) does function. Hardcodedattempts: 3is a maintenance concern. Overall the PR improves reliability but ships a broken per-agent override.src/agents/subagent-announce-retry.ts(broken per-agent config lookup at line 56),src/agents/subagent-announce.ts(hardcoded retry count)Last reviewed commit: 131ac4a
(2/5) Greptile learns from your feedback when you react with thumbs up/down!