
fix(cron): stale announce retries drain one-per-trigger instead of bulk flush #2088

@alexey-pelykh

Description


Summary

When cron announce delivery fails for multiple consecutive runs (e.g., due to a misconfigured delivery target), the failed entries accumulate in the subagent registry. Once the configuration is fixed, these entries drain one per heartbeat/trigger instead of being flushed in a single pass: each trigger processes exactly one entry, so the user must manually trigger N times to clear a backlog of N entries.

Reproduction

  1. Configure an isolated cron job with a broken delivery target (e.g., user:D... instead of channel:D...)
  2. Let 6+ scheduled runs execute — each produces output but delivery fails
  3. Fix the delivery target
  4. Observe: each subsequent trigger delivers one old result, not the full queue
  5. Must trigger 6+ times to see all past results + fresh output

Expected Behavior

When delivery starts working again, the system should flush the entire pending queue in one pass, oldest-to-newest. In normal operation the queue holds a single entry (the latest run). While delivery is broken the queue accumulates, and once delivery recovers, all entries should be delivered sequentially in a single pass, preserving chronological order.

This is NOT a staleness problem — every run's output is valuable (it was the delivery that failed, not the content). The user expects to receive all missed results in order once the delivery issue is resolved.

Root Cause

In src/agents/subagent-registry.ts:

  1. One-at-a-time processing: retryDeferredCompletedAnnounces (lines 750-776) triggers resumeSubagentRun for pending entries, but only one entry is processed per cycle
  2. Requires external trigger: Each retry needs a heartbeat tick or manual cron run to process the next entry
  3. No queue drain loop: There's no mechanism to say "delivery is working now, flush all pending entries for this job"

Proposed Fix

When a deferred announce succeeds (delivery confirmed), immediately check for additional pending entries for the same cron job and process them in a loop until the queue is empty:

retryDeferredCompletedAnnounces:
  for each pending entry (oldest first):
    attempt delivery
    if success → continue to next entry
    if failure → stop (delivery still broken)
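
The pseudocode above could be sketched in TypeScript roughly as follows. This is a sketch only: the actual entry type and delivery call in subagent-registry.ts are not shown in this issue, so PendingAnnounce, createdAt, and deliver are assumed names.

```typescript
// Hypothetical sketch of the proposed drain loop; field and function
// names are assumptions, not the real subagent-registry.ts API.
interface PendingAnnounce {
  runId: string;
  createdAt: number; // epoch ms, used for oldest-first ordering
}

// Flush pending entries oldest-to-newest; stop at the first failure so a
// still-broken target keeps the remainder queued for the next trigger.
async function drainDeferredAnnounces(
  queue: PendingAnnounce[],
  deliver: (entry: PendingAnnounce) => Promise<boolean>,
): Promise<PendingAnnounce[]> {
  const pending = [...queue].sort((a, b) => a.createdAt - b.createdAt);
  while (pending.length > 0) {
    const ok = await deliver(pending[0]);
    if (!ok) break; // delivery still broken: stop and retry later
    pending.shift(); // delivered: drop it and continue with the next entry
  }
  return pending; // entries still awaiting delivery
}
```

Because the loop only advances on success, a queue of one behaves exactly as today, while a recovered outage drains the whole backlog in a single call.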

This gives the correct behavior:

  • Normal operation: the queue holds one entry, which is delivered immediately
  • Recovery after outage: the queue holds N entries; all N are flushed oldest-to-newest in one pass
  • Partial recovery: if delivery breaks again mid-flush, processing stops at the failure point and the remainder stays queued for the next trigger
