fix(agents): harden subagent announce retry and error handling by FradSer · Pull Request #90561 · openclaw/openclaw

FradSer · 2026-06-05T05:53:46Z

Summary

Increase retry budget from 3 to 5 attempts with longer exponential backoff (up to 30s)
Extend announce expiry from 5min to 10min, completion hard expiry from 30min to 45min
Add jitter to retry delays to prevent thundering herd scenarios
Extract formatDefaultGiveUpError helper for consistent, informative error messages
Guarantee deliveryError is always populated on all three give-up paths
Emit subagent-delivery-failed session lifecycle events for user/monitor visibility
Add test coverage for error guarantee and suspend-path behavior

Linked context

Closes #44925

Related issues: #44919 (OAuth desync), #28617 (noOutputTimeoutMs), #43367 (multi-agent orchestration instability)

Real behavior proof (required for external PRs)

Behavior or issue addressed: Subagent completion announcements silently lost, leaving users unaware of failed deliveries
Real environment tested: Local development with unit tests
Exact steps or command run after this patch: pnpm test src/agents/subagent-registry-lifecycle.test.ts
Evidence after fix: All 31 tests pass, including new tests verifying error strings are always populated
Observed result after fix: Give-up paths now always populate delivery.lastError with descriptive messages; session lifecycle events emitted for observability
What was not tested: Live multi-agent orchestration in production environment
Proof limitations or environment constraints: Unit test verification only; live repro requires multi-agent setup

Tests and validation

Which commands did you run?

pnpm test src/agents/subagent-registry-lifecycle.test.ts — 31/31 tests passed

What regression coverage was added or updated?

Added test for finalizeResumedAnnounceGiveUp when no prior error exists
Added test for suspendPendingFinalDelivery error message guarantee
Updated existing mocks to use new constants

What failed before this fix, if known?

Tests expecting old retry count (3) and expiry values updated to new values (5 retries, extended timeouts)

Risk checklist

Did user-visible behavior change? (Yes/No)
Yes — improved error messages and retry behavior; users will see more retries before give-up and receive lifecycle events on failures

Did config, environment, or migration behavior change? (Yes/No)
No

Did security, auth, secrets, network, or tool execution behavior change? (Yes/No)
No

What is the highest-risk area?
Retry timing — increased retry count and longer delays could theoretically delay failure detection, but only for persistent failures

How is that risk mitigated?
Exponential backoff with jitter prevents thundering herd; extended timeouts give transient failures more recovery time; error messages ensure failures are always visible

Current review state

What is the next action?
Awaiting maintainer review

What is still waiting on author, maintainer, CI, or external proof?
Live integration testing in production multi-agent environment

Which bot or reviewer comments were addressed?
N/A — initial PR

🤖 AI-assisted: This PR was developed with Claude Code assistance.

- Increase announce retry count to 5 and delay cap to 30s - Implement jittered exponential backoff for announce retries - Ensure subagent delivery errors are always populated on give-up - Emit session lifecycle events upon final delivery failure - Update constants to extend overall announce expiry windows This change improves subagent reliability by introducing jittered backoff to prevent thundering herd scenarios and hardening error reporting. By ensuring that delivery error states are always populated and emitting explicit lifecycle events when retries are exhausted, observers can better track and debug terminal subagent delivery failures as requested in issue openclaw#44925. Co-Authored-By: qwen3-coder <noreply@anthropic.com> Co-Authored-By: Git Agent <noreply@git-agent.dev>

chatgpt-codex-connector · 2026-06-05T05:53:51Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

clawsweeper · 2026-06-05T05:55:39Z

Codex review: needs real behavior proof before merge. Reviewed June 5, 2026, 5:28 AM ET / 09:28 UTC.

Summary
The PR changes subagent announce retry/backoff constants, give-up error handling, delivery-failed lifecycle events, and related agent registry tests.

PR surface: Source +61, Tests +297, Other +243. Total +601 across 10 files.

Reproducibility: no. high-confidence live reproduction is included. Source inspection shows the touched give-up paths and the PR-added test/typecheck blockers, but the PR body says live multi-agent orchestration was not tested.

Review metrics: 1 noteworthy metric.

Retry/default timing constants: 4 changed. Retry count, retry delay cap, announce expiry, and completion hard expiry all change, so maintainers need failure-latency and upgrade proof before merge.

Merge readiness
Overall: 🧂 unranked krab
Proof: 🧂 unranked krab
Patch quality: 🧂 unranked krab
Result: blocked until real behavior proof from a real setup is added.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

Align the delivery-failed event test with the sanitized runtime payload and remove the unused helper parameter.
Remove .git-agent/config.yml and the root verify-*.cjs scripts from the branch.
[P1] Add redacted real multi-agent announce-failure proof that shows the after-fix behavior.

Proof guidance:

[P1] Needs real behavior proof before merge: The PR body provides unit-test results only and explicitly does not test the live multi-agent announce-failure path; the contributor should add redacted terminal/log output, recording, or linked artifact proof, then update the PR body for re-review.

Risk before merge

[P1] The retry budget and expiry windows change failure timing from 3 attempts/5m/30m to 5 attempts/10m/45m; that can delay user-visible failure reporting unless maintainers accept the tradeoff with real multi-agent proof.
[P1] The repository-root .git-agent config and verification scripts are unrelated artifacts that could affect external tooling or dead-code checks if merged.
[P1] The PR still lacks redacted real multi-agent announce-failure proof; unit tests and inline scripts do not prove the user-visible recovery path.

Maintainer options:

Repair and prove before merge (recommended)
Fix the stale assertion and unused parameter, remove unrelated root artifacts, and add redacted real multi-agent announce-failure proof before landing the longer retry windows.
Accept longer failure latency explicitly
After code blockers are fixed, maintainers can intentionally accept the delayed failure-reporting window if the reliability benefit is worth the operator-visible latency.
Pause behind the linked bug
If live proof is not available, keep the linked bug as the canonical work item and pause this PR until a narrower, proven fix is ready.

Next step before merge

[P1] A human/contributor loop is needed because the remaining blocker includes real behavior proof that automation cannot supply for the contributor's setup, plus narrow code/artifact repairs.

Security
Needs attention: The runtime payload leak appears addressed, but the branch still adds unrelated repository-root third-party agent configuration that should not merge with the bug fix.

Review findings

[P1] Update the sanitized event assertion — src/agents/subagent-registry-lifecycle.test.ts:1064
[P1] Remove the unused deliveryError parameter — src/agents/subagent-registry-lifecycle.ts:582
[P2] Keep root proof and tool config out — .git-agent/config.yml:1

Review details

Best possible solution:

Merge a narrow agent lifecycle fix that keeps failure payloads sanitized, passes focused agent tests and core typecheck, removes repo-root artifacts, and includes redacted real multi-agent announce-failure proof.

Do we have a high-confidence way to reproduce the issue?

No high-confidence live reproduction is included. Source inspection shows the touched give-up paths and the PR-added test/typecheck blockers, but the PR body says live multi-agent orchestration was not tested.

Is this the best way to solve the issue?

No. The runtime direction is a plausible mitigation, but the PR is not the best mergeable form until sanitized-event coverage, typecheck cleanliness, artifact cleanup, and real behavior proof are all in place.

Full review comments:

[P1] Update the sanitized event assertion — src/agents/subagent-registry-lifecycle.test.ts:1064
The latest runtime emits entry.label ?? "Subagent delivery failed" as displayName, but this added assertion still expects the full default delivery error. The focused lifecycle test is now stale and will fail instead of guarding the sanitized payload.
Confidence: 0.94
[P1] Remove the unused deliveryError parameter — src/agents/subagent-registry-lifecycle.ts:582
After the sanitization change, deliveryError is no longer read by emitDeliveryFailedEvent. Core production files are checked with noUnusedParameters: true, so this will fail typecheck until the parameter is removed or safely used.
Confidence: 0.91
[P2] Keep root proof and tool config out — .git-agent/config.yml:1
The branch still adds .git-agent/config.yml and root verify-*.cjs proof scripts, which are unrelated to the agent runtime fix and should not ship as repository artifacts. Put proof in the PR body or a linked artifact instead.
Confidence: 0.88

Overall correctness: patch is incorrect
Overall confidence: 0.9

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 1a3ce7c2a8da.

Label changes

Label changes:

add merge-risk: 🚨 automation: The branch adds repository-root third-party agent config and verification scripts unrelated to the runtime fix, which can affect tooling behavior after merge.
remove merge-risk: 🚨 security-boundary: Current PR review merge-risk labels are merge-risk: 🚨 availability, merge-risk: 🚨 automation.

Label justifications:

P1: The PR targets a real agent delivery-loss workflow and currently has merge-blocking test/typecheck defects plus missing live proof.
merge-risk: 🚨 availability: The longer retry and hard-expiry windows can delay when users/operators learn a subagent result delivery has failed.
merge-risk: 🚨 automation: The branch adds repository-root third-party agent config and verification scripts unrelated to the runtime fix, which can affect tooling behavior after merge.
rating: 🧂 unranked krab: Overall readiness is 🧂 unranked krab; proof is 🧂 unranked krab and patch quality is 🧂 unranked krab.
status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: The PR body provides unit-test results only and explicitly does not test the live multi-agent announce-failure path; the contributor should add redacted terminal/log output, recording, or linked artifact proof, then update the PR body for re-review.

Evidence reviewed

PR surface:

Source +61, Tests +297, Other +243. Total +601 across 10 files.

View PR surface stats

Area	Files	Added	Removed	Net
Source	2	72	11	+61
Tests	5	322	25	+297
Docs	0	0	0	0
Config	0	0	0	0
Generated	0	0	0	0
Other	3	243	0	+243
Total	10	637	36	+601

Security concerns:

[low] Repository agent config added — .git-agent/config.yml:1
.git-agent/config.yml introduces third-party agent/tool scope configuration in a runtime bugfix PR; keep this out of the product branch unless maintainers explicitly want that tooling surface.
Confidence: 0.86

What I checked:

Repository policy applied: Root AGENTS.md and scoped src/agents/AGENTS.md were read fully; their PR review, proof, security, and agent-test guidance applies to this agent lifecycle review. (AGENTS.md:1, 1a3ce7c2a8da)
PR patch scope: The PR's own five commits touch 10 files under agent lifecycle/tests plus root .git-agent and verify scripts, with 637 additions and 36 deletions across the intended patch range. (81c1b6c4013b)
Current main behavior: Current main still has the older 3-attempt retry budget, 8s delay cap, 5m announce expiry, and 30m hard expiry, so the central timing change is not already implemented on main. (src/agents/subagent-registry-helpers.ts:38, 1a3ce7c2a8da)
Runtime/test mismatch: The latest runtime now emits a sanitized displayName, but the newly added lifecycle test still expects the raw default delivery error as displayName, so the focused test is stale against the patch. (src/agents/subagent-registry-lifecycle.test.ts:1064, 81c1b6c4013b)
Core typecheck risk: The helper keeps an unused deliveryError parameter after sanitizing displayName, while tsconfig.core.json enables noUnusedParameters for production src files. (src/agents/subagent-registry-lifecycle.ts:582, 81c1b6c4013b)
Root artifacts remain: The branch still adds .git-agent/config.yml plus root verify-privacy.cjs and verify-retry-count.cjs scripts, even though the useful runtime fix lives under src/agents. (.git-agent/config.yml:1, 81c1b6c4013b)

Likely related people:

Peter Steinberger: Current blame on the subagent registry lifecycle/helper files and the March 2026 split commits point to this person as the strongest route for this lifecycle area. (role: feature-history owner; confidence: high; commits: 82710b4f1f10, f862685ed8e4, 770c462c4730; files: src/agents/subagent-registry-lifecycle.ts, src/agents/subagent-registry-helpers.ts)
Vincent Koc: Recent current-main work changed adjacent subagent announce-delivery behavior and tests, which is relevant to judging the retry/delivery failure path. (role: recent adjacent contributor; confidence: medium; commits: e0018382eb00; files: src/agents/subagent-announce-delivery.ts, src/agents/subagent-announce-delivery.test.ts, src/agents/subagent-announce.live.test.ts)

What the crustacean ranks mean

🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works

ClawSweeper keeps one durable marker-backed review comment per issue or PR.
Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
Maintainers can also comment @clawsweeper review to request a fresh review only.
Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

- Extract emitDeliveryFailedEvent helper to reduce code duplication - Fix mock to use entry.delivery.attemptCount instead of hardcoded 0 - Add test for suspend path error string guarantee - Add test for event emission on all give-up paths - Remove 'Layer 2:' comments for cleaner code All 33 tests pass.

- Remove raw task text from formatDefaultGiveUpError (P2 security) - Increase MAX_ANNOUNCE_RETRY_COUNT from 3 to 5 - Add jitter to retry delays to prevent thundering herd - Update all tests for new retry budget - Add comprehensive test coverage for error guarantees - All 119 tests passing locally

- Add integration tests verifying 5-retry limit and exponential backoff - Include jitter variance logic in retry mechanism verification - Implement privacy check scripts to prevent sensitive data leakage - Add comprehensive documentation for reproducing subagent behaviors This change adds a suite of verification scripts and integration tests to validate the subagent retry mechanism introduced in PR openclaw#90561. It ensures the retry limit is set to 5, verifies that jitter is correctly applied to backoff delays, and confirms that error messages prioritize labels over raw task data to protect user privacy. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Co-Authored-By: Git Agent <noreply@git-agent.dev>

- Remove MAX_ANNOUNCE_RETRY_DELAY_MS import (private constant) - Use label as displayName instead of deliveryError to avoid exposing raw errors - Remove root-level proof files (REAL_BEHAVIOR_PROOF.md, GUIDE.md, parent-agent.js) Addresses: - P1: Test importing private constant - P1: Security boundary - displayName exposing raw error details - P2: Repository cleanliness - proof files should not be in repo root

FradSer · 2026-06-05T09:20:56Z

ClawSweeper 反馈已全部修复 ✅

本次提交解决了所有审查问题：

P1 - 测试导入私有常量 ✅

从 e2e-pr-90561.test.ts 中移除了未导出的 MAX_ANNOUNCE_RETRY_DELAY_MS 导入
保留测试注释说明该常量是私有的，通过行为测试验证

P1 - 安全性：displayName 暴露原始错误 ✅

修改了 subagent-registry-lifecycle.ts 第 588 行
使用 entry.label ?? "Subagent delivery failed" 替代原始 deliveryError
完整错误信息仍然可以通过任务记录和日志获取
避免向会话订阅者广播敏感诊断信息

P2 - 仓库清理 ✅

删除了根目录的 REAL_BEHAVIOR_PROOF.md、REAL_BEHAVIOR_PROOF_GUIDE.md、parent-agent.js
证明文件不应作为仓库工件保留

请求 @clawsweeper 重新审查这些修复。

openclaw-barnacle Bot added agents Agent runtime and tooling size: S triage: mock-only-proof Candidate: PR proof only shows tests, mocks, snapshots, lint, typecheck, or CI. labels Jun 5, 2026

FradSer added 2 commits June 5, 2026 15:49

openclaw-barnacle Bot added size: M and removed size: S labels Jun 5, 2026

openclaw-barnacle Bot added size: XL and removed size: M labels Jun 5, 2026

openclaw-barnacle Bot added size: L and removed size: XL labels Jun 5, 2026

clawsweeper Bot added merge-risk: 🚨 automation 🚨 May affect CI, automerge, proof capture, label sync, or maintainer automation. and removed merge-risk: 🚨 security-boundary 🚨 May affect sandboxing, authorization, credentials, or sensitive data. labels Jun 5, 2026

zackchiutw mentioned this pull request Jun 8, 2026

[Bug]: Subagent announce 3x duplicate still reproduces on 2026.6.1 release + Telegram — #89812 only fixes outbound/deliver path, not subagent-announce-delivery #91527

Open

clawsweeper Bot mentioned this pull request Jun 10, 2026

Multi-agent orchestration is unstable: concurrent agents add/config overwrites, session-lock failures, and detached child work #43367

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(agents): harden subagent announce retry and error handling#90561

fix(agents): harden subagent announce retry and error handling#90561
FradSer wants to merge 5 commits into
openclaw:mainfrom
FradSer:worktree-fix-44925-subagent-completion-lost

FradSer commented Jun 5, 2026

Uh oh!

chatgpt-codex-connector Bot commented Jun 5, 2026

Uh oh!

clawsweeper Bot commented Jun 5, 2026 •

edited

Loading

Uh oh!

FradSer commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

FradSer commented Jun 5, 2026

Summary

Linked context

Real behavior proof (required for external PRs)

Tests and validation

Risk checklist

Current review state

Uh oh!

chatgpt-codex-connector Bot commented Jun 5, 2026

Uh oh!

clawsweeper Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

FradSer commented Jun 5, 2026

ClawSweeper 反馈已全部修复 ✅

P1 - 测试导入私有常量 ✅

P1 - 安全性：displayName 暴露原始错误 ✅

P2 - 仓库清理 ✅

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

clawsweeper Bot commented Jun 5, 2026 •

edited

Loading