
fix(announce): break infinite retry loop with max attempts and expiry #18444

Merged
steipete merged 2 commits into openclaw:main from widingmarcus-cyber:fix/announce-loop-18264-v2
Feb 16, 2026

widingmarcus-cyber commented Feb 16, 2026

Fixes #18264

Problem

When runSubagentAnnounceFlow repeatedly returns false (deferred), finalizeSubagentCleanup resets cleanupHandled = false and removes the entry from resumedRuns, allowing retryDeferredCompletedAnnounces to pick it up again immediately. If the underlying condition persists (stale registry data, transient state after cron deletion), this creates an infinite loop delivering 100+ announces over 3+ hours.

The loop persists even after the originating cron job is disabled/deleted, because the subagent registry is persisted to disk and restored on gateway restart.
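
To illustrate why a restart does not clear the loop, here is a minimal sketch. It assumes the registry is serialized as JSON, which this PR does not confirm. `PersistedRunSketch`, `saveRegistry`, `loadRegistry`, and `pendingAnnounces` are hypothetical names, not functions from the codebase.

```typescript
// Hypothetical, trimmed record: only the fields needed for this illustration.
interface PersistedRunSketch {
  runId: string;
  endedAt?: number; // ms epoch when the run ended
  cleanupHandled: boolean;
}

// Assumption: the registry is written to disk as a JSON array of records.
function saveRegistry(runs: Map<string, PersistedRunSketch>): string {
  return JSON.stringify(Array.from(runs.values()));
}

// Rebuild the in-memory map from the serialized form (e.g. on gateway restart).
function loadRegistry(serialized: string): Map<string, PersistedRunSketch> {
  const records = JSON.parse(serialized) as PersistedRunSketch[];
  return new Map(
    records.map((r): [string, PersistedRunSketch] => [r.runId, r]),
  );
}

// Ended runs whose cleanup is not handled are still candidates for another announce.
function pendingAnnounces(runs: Map<string, PersistedRunSketch>): string[] {
  return Array.from(runs.values())
    .filter((r) => r.endedAt !== undefined && !r.cleanupHandled)
    .map((r) => r.runId);
}
```

In this sketch, an entry saved with `cleanupHandled: false` is still returned by `pendingAnnounces` after a save/load round trip, regardless of whether the cron job that created it still exists.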

Root Cause

In subagent-registry.ts, finalizeSubagentCleanup():

if (!didAnnounce) {
  entry.cleanupHandled = false;    // allows retry
  resumedRuns.delete(runId);       // allows resumeSubagentRun to fire again
  // No limit on retries -> infinite loop
}

Fix

  1. Max retry count: Track announceRetryCount on SubagentRunRecord. After MAX_ANNOUNCE_RETRY_COUNT (3) failed attempts, mark the entry as completed and stop retrying.
  2. Expiration: Announce entries older than ANNOUNCE_EXPIRY_MS (5 min since endedAt) are force-expired in both resumeSubagentRun and retryDeferredCompletedAnnounces.
  3. Persistence: Both new fields (announceRetryCount, lastAnnounceRetryAt) persist to the registry, surviving gateway restarts.
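
The three points above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: the record is trimmed to the fields discussed here, `hasExceededAnnounceBudget` and `finalizeAfterFailedAnnounce` are hypothetical helper names, and whether the real comparison uses `>` or `>=` is an assumption. The constant values (3 attempts, 5 minutes) are the ones stated above.

```typescript
// Hypothetical, trimmed stand-in for SubagentRunRecord.
interface RunRecordSketch {
  runId: string;
  endedAt?: number; // ms epoch when the run ended
  cleanupHandled: boolean;
  cleanupCompletedAt?: number;
  announceRetryCount?: number;
  lastAnnounceRetryAt?: number;
}

const MAX_ANNOUNCE_RETRY_COUNT = 3;
const ANNOUNCE_EXPIRY_MS = 5 * 60 * 1000;

// True when the entry has used up its retries or is too old to announce.
function hasExceededAnnounceBudget(entry: RunRecordSketch, now: number): boolean {
  const exhausted = (entry.announceRetryCount ?? 0) >= MAX_ANNOUNCE_RETRY_COUNT;
  const expired =
    entry.endedAt !== undefined && now - entry.endedAt > ANNOUNCE_EXPIRY_MS;
  return exhausted || expired;
}

// Called after an announce attempt was deferred (did not announce).
// Returns true if the entry may be retried, false if it was force-completed.
function finalizeAfterFailedAnnounce(
  entry: RunRecordSketch,
  resumedRuns: Set<string>,
  now: number,
): boolean {
  // Record the failed attempt; both fields would be persisted with the record.
  entry.announceRetryCount = (entry.announceRetryCount ?? 0) + 1;
  entry.lastAnnounceRetryAt = now;

  if (hasExceededAnnounceBudget(entry, now)) {
    // Give up: mark cleanup as done so no retry path picks the entry up again.
    entry.cleanupHandled = true;
    entry.cleanupCompletedAt = now;
    return false;
  }

  // Previous behavior, now bounded: allow one more retry.
  entry.cleanupHandled = false;
  resumedRuns.delete(entry.runId);
  return true;
}
```

With this shape, a recently ended entry is retried after its first and second deferred attempts and is marked completed on the third, while an entry that ended more than five minutes ago is marked completed on its next failed attempt.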

Testing

  • All 10 existing tests pass (nested, steer-restart, including the 'retries deferred parent cleanup' test)
  • 2 new regression tests added for the loop guard
  • 3 announce-queue tests pass

Greptile Summary

This PR adds loop-guard protections to the subagent announce retry mechanism to fix an infinite retry loop (#18264). When runSubagentAnnounceFlow repeatedly returns false, the code now tracks retry attempts (announceRetryCount) and enforces both a max retry count (3) and a time-based expiry (5 min since endedAt). Guards are added at three points: resumeSubagentRun, finalizeSubagentCleanup, and retryDeferredCompletedAnnounces.

  • Max retry + expiry guards: Well-placed at all three entry points that could re-trigger the loop, with consistent logic and persistence across restarts.
  • Bug: replaceSubagentRunAfterSteer does not reset announceRetryCount/lastAnnounceRetryAt when creating a replacement run — a steered replacement could inherit an exhausted retry budget and be immediately force-expired.
  • Tests: Two new regression tests verify field persistence and entry skipping, though test coverage for the finalizeSubagentCleanup retry-counting path would strengthen confidence.

Confidence Score: 3/5

  • The core fix is sound but has a bug in the steer-restart path that could suppress announce delivery for replacement runs.
  • The main loop-guard logic is correct and well-structured, with guards at all three re-entry points. However, replaceSubagentRunAfterSteer not resetting the new retry fields is a real bug that could cause steered replacement runs to silently lose announce delivery. This is the same category of issue (silent announce loss) that the PR is trying to fix.
  • src/agents/subagent-registry.ts — the replaceSubagentRunAfterSteer function needs to reset announceRetryCount and lastAnnounceRetryAt on the replacement record.

Last reviewed commit: efe0541

…openclaw#18264)

When runSubagentAnnounceFlow returns false (deferred), finalizeSubagentCleanup
resets cleanupHandled=false and removes from resumedRuns, allowing
retryDeferredCompletedAnnounces to pick it up again. If the underlying
condition persists (stale registry data, transient state), this creates an
infinite loop delivering 100+ announces over hours.

Fix:
- Add announceRetryCount + lastAnnounceRetryAt to SubagentRunRecord
- finalizeSubagentCleanup: after MAX_ANNOUNCE_RETRY_COUNT (3) failed attempts
  or ANNOUNCE_EXPIRY_MS (5 min) since endedAt, mark as completed and stop
- resumeSubagentRun: skip entries that have exhausted retries or expired
- retryDeferredCompletedAnnounces: force-expire stale entries

openclaw-barnacle Bot added the labels agents (Agent runtime and tooling) and size: S on Feb 16, 2026

greptile-apps Bot left a comment

2 files reviewed, 1 comment

greptile-apps Bot commented Feb 16, 2026

Additional Comments (1)

src/agents/subagent-registry.ts
Stale retry state carried into replacement run

replaceSubagentRunAfterSteer spreads ...source but doesn't explicitly reset announceRetryCount or lastAnnounceRetryAt. If the previous run had exhausted its retry budget before the steer-restart, the replacement run inherits those values and could be immediately force-expired by the guards in resumeSubagentRun (lines 109-113) without ever attempting announce delivery.

The other completion-related fields (endedAt, outcome, cleanupCompletedAt, cleanupHandled, suppressAnnounceReason) are all explicitly reset here — the new fields should be too.

  const next: SubagentRunRecord = {
    ...source,
    runId: nextRunId,
    startedAt: now,
    endedAt: undefined,
    outcome: undefined,
    cleanupCompletedAt: undefined,
    cleanupHandled: false,
    suppressAnnounceReason: undefined,
    announceRetryCount: undefined,
    lastAnnounceRetryAt: undefined,
    archiveAtMs,
    runTimeoutSeconds,
  };
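
A small standalone sketch of the mechanism described in this comment: object spread copies every own property of the source, so any field that is not explicitly overridden is inherited by the replacement. The record type is trimmed, and `makeReplacement` with its `resetRetryState` flag is a hypothetical stand-in used only to contrast the two behaviors. It is not the signature of `replaceSubagentRunAfterSteer`.

```typescript
// Hypothetical, trimmed stand-in for SubagentRunRecord.
interface SteerRecordSketch {
  runId: string;
  startedAt: number;
  endedAt?: number;
  cleanupHandled: boolean;
  announceRetryCount?: number;
  lastAnnounceRetryAt?: number;
}

// Builds a replacement record from `source`. When `resetRetryState` is false,
// the retry fields are left to whatever the spread copied over.
function makeReplacement(
  source: SteerRecordSketch,
  nextRunId: string,
  now: number,
  resetRetryState: boolean,
): SteerRecordSketch {
  const next: SteerRecordSketch = {
    ...source, // copies announceRetryCount and lastAnnounceRetryAt too
    runId: nextRunId,
    startedAt: now,
    endedAt: undefined,
    cleanupHandled: false,
  };
  if (resetRetryState) {
    // The explicit reset suggested in this review comment.
    next.announceRetryCount = undefined;
    next.lastAnnounceRetryAt = undefined;
  }
  return next;
}
```

Calling `makeReplacement` on a source whose `announceRetryCount` is already 3 yields a replacement that still carries 3 when `resetRetryState` is false, and `undefined` when it is true.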

Address review feedback: the spread operator carries stale retry state
into replacement runs, potentially causing immediate force-expiration
without ever attempting announce delivery.

steipete merged commit de900ba into openclaw:main on Feb 16, 2026
23 checks passed

Labels

agents Agent runtime and tooling size: S

Development

Successfully merging this pull request may close these issues.

[Bug]: Gateway announcement delivery loop - infinite retries of completed subagent announcements

2 participants