fix(agents): harden subagent announce retry and error handling#90561
fix(agents): harden subagent announce retry and error handling#90561FradSer wants to merge 5 commits into
Conversation
- Increase announce retry count to 5 and delay cap to 30s - Implement jittered exponential backoff for announce retries - Ensure subagent delivery errors are always populated on give-up - Emit session lifecycle events upon final delivery failure - Update constants to extend overall announce expiry windows This change improves subagent reliability by introducing jittered backoff to prevent thundering herd scenarios and hardening error reporting. By ensuring that delivery error states are always populated and emitting explicit lifecycle events when retries are exhausted, observers can better track and debug terminal subagent delivery failures as requested in issue openclaw#44925. Co-Authored-By: qwen3-coder <noreply@anthropic.com> Co-Authored-By: Git Agent <noreply@git-agent.dev>
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
|
Codex review: needs real behavior proof before merge. Reviewed June 5, 2026, 5:28 AM ET / 09:28 UTC. Summary PR surface: Source +61, Tests +297, Other +243. Total +601 across 10 files. Reproducibility: no. high-confidence live reproduction is included. Source inspection shows the touched give-up paths and the PR-added test/typecheck blockers, but the PR body says live multi-agent orchestration was not tested. Review metrics: 1 noteworthy metric.
Merge readiness Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch. Rank-up moves:
Proof guidance:
Risk before merge
Maintainer options:
Next step before merge
Security Review findings
Review detailsBest possible solution: Merge a narrow agent lifecycle fix that keeps failure payloads sanitized, passes focused agent tests and core typecheck, removes repo-root artifacts, and includes redacted real multi-agent announce-failure proof. Do we have a high-confidence way to reproduce the issue? No high-confidence live reproduction is included. Source inspection shows the touched give-up paths and the PR-added test/typecheck blockers, but the PR body says live multi-agent orchestration was not tested. Is this the best way to solve the issue? No. The runtime direction is a plausible mitigation, but the PR is not the best mergeable form until sanitized-event coverage, typecheck cleanliness, artifact cleanup, and real behavior proof are all in place. Full review comments:
Overall correctness: patch is incorrect AGENTS.md: found and applied where relevant. Codex review notes: model gpt-5.5, reasoning high; reviewed against 1a3ce7c2a8da. Label changesLabel changes:
Label justifications:
Evidence reviewedPR surface: Source +61, Tests +297, Other +243. Total +601 across 10 files. View PR surface stats
Security concerns:
What I checked:
Likely related people:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. How this review workflow works
|
- Extract emitDeliveryFailedEvent helper to reduce code duplication - Fix mock to use entry.delivery.attemptCount instead of hardcoded 0 - Add test for suspend path error string guarantee - Add test for event emission on all give-up paths - Remove 'Layer 2:' comments for cleaner code All 33 tests pass.
- Remove raw task text from formatDefaultGiveUpError (P2 security) - Increase MAX_ANNOUNCE_RETRY_COUNT from 3 to 5 - Add jitter to retry delays to prevent thundering herd - Update all tests for new retry budget - Add comprehensive test coverage for error guarantees - All 119 tests passing locally
- Add integration tests verifying 5-retry limit and exponential backoff - Include jitter variance logic in retry mechanism verification - Implement privacy check scripts to prevent sensitive data leakage - Add comprehensive documentation for reproducing subagent behaviors This change adds a suite of verification scripts and integration tests to validate the subagent retry mechanism introduced in PR openclaw#90561. It ensures the retry limit is set to 5, verifies that jitter is correctly applied to backoff delays, and confirms that error messages prioritize labels over raw task data to protect user privacy. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Co-Authored-By: Git Agent <noreply@git-agent.dev>
- Remove MAX_ANNOUNCE_RETRY_DELAY_MS import (private constant) - Use label as displayName instead of deliveryError to avoid exposing raw errors - Remove root-level proof files (REAL_BEHAVIOR_PROOF.md, GUIDE.md, parent-agent.js) Addresses: - P1: Test importing private constant - P1: Security boundary - displayName exposing raw error details - P2: Repository cleanliness - proof files should not be in repo root
ClawSweeper 反馈已全部修复 ✅本次提交解决了所有审查问题: P1 - 测试导入私有常量 ✅
P1 - 安全性:displayName 暴露原始错误 ✅
P2 - 仓库清理 ✅
请求 @clawsweeper 重新审查这些修复。 |
Summary
formatDefaultGiveUpErrorhelper for consistent, informative error messagesdeliveryErroris always populated on all three give-up pathssubagent-delivery-failedsession lifecycle events for user/monitor visibilityLinked context
Closes #44925
Related issues: #44919 (OAuth desync), #28617 (noOutputTimeoutMs), #43367 (multi-agent orchestration instability)
Real behavior proof (required for external PRs)
pnpm test src/agents/subagent-registry-lifecycle.test.tsdelivery.lastErrorwith descriptive messages; session lifecycle events emitted for observabilityTests and validation
Which commands did you run?
pnpm test src/agents/subagent-registry-lifecycle.test.ts— 31/31 tests passedWhat regression coverage was added or updated?
finalizeResumedAnnounceGiveUpwhen no prior error existssuspendPendingFinalDeliveryerror message guaranteeWhat failed before this fix, if known?
Risk checklist
Did user-visible behavior change? (
Yes/No)Yes — improved error messages and retry behavior; users will see more retries before give-up and receive lifecycle events on failures
Did config, environment, or migration behavior change? (
Yes/No)No
Did security, auth, secrets, network, or tool execution behavior change? (
Yes/No)No
What is the highest-risk area?
Retry timing — increased retry count and longer delays could theoretically delay failure detection, but only for persistent failures
How is that risk mitigated?
Exponential backoff with jitter prevents thundering herd; extended timeouts give transient failures more recovery time; error messages ensure failures are always visible
Current review state
What is the next action?
Awaiting maintainer review
What is still waiting on author, maintainer, CI, or external proof?
Live integration testing in production multi-agent environment
Which bot or reviewer comments were addressed?
N/A — initial PR
🤖 AI-assisted: This PR was developed with Claude Code assistance.