fix(continuation): #990 Pillar-0: exp-backoff on busy-skip re-arm (storm-killer, architecturally-neutral) #994
…ested harden (#990 Pillar-0)

The PRE-drive busy-skip re-arm (requests-in-flight/draining) requeued at a flat BUSY_RETRY_MS=1s with no backoff, spinning a chronically-busy seat at ~1Hz forever (the storm). Replace with exponential backoff on a NEW busySkipCount counter that is DISTINCT from retryCount and never feeds the transient-error fail-bound. Rate-cap-forever: the flow keeps deferring at a decaying rate (1s, 2s, 4s, ..., capped at maxDelayMs) and delivers the instant the seat quiets, never dropped (#952 never-penalize survives).

- computeBusySkipBackoffMs(busySkipCount, ceilingMs) = min(ceilingMs, BUSY_RETRY_MS*2^n).
- busySkipCount persisted in PendingWorkState (same shape as retryCount), reset to 0 on drive (markPendingWorkTurnGranted) so a deferred-then-granted flow is never permanently backed off. Busy-skip never passes retryCount.
- The interrupted/threw retryCount-bounded path (bucket-3) is unchanged.
- :259 dedup: verified consumePendingWork filters terminal status structurally; hardened it to also skip cancelRequestedAt-marked flows so a cancel-requested wake (pre-reaper finalize window) is never granted a turn.
- Tests: exp-backoff progression (RED vs old flat-1s), distinct-from-retryCount/rate-cap-forever across 20 skips, reset-on-drive, cancel-requested + succeeded dedup gates, helper unit curve+cap. Fixed pre-existing oxlint nits in the same test file.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
6168d1f into frond-scribe/20260609/assembly-token-wiring
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 56bd568fea
     */
    export function computeBusySkipBackoffMs(busySkipCount: number, ceilingMs: number): number {
      const exponent = Math.max(0, busySkipCount);
      return Math.min(ceilingMs, BUSY_RETRY_MS * 2 ** exponent);
Guard busy backoff against zero ceilings
When agents.defaults.continuation.maxDelayMs is configured below BUSY_RETRY_MS (especially 0, which the continuation schema/resolver allow; see src/config/zod-schema.core.ts:888 and src/auto-reply/continuation/config.ts:31), this returns 0, so the retry path sets dueAt = Date.now() and armWorkTimer schedules setTimeout(0). While the session stays active, the main lane is nonempty, or the gateway is draining, each timer immediately re-consumes and requeues again, turning the intended backoff into a tight event-loop/CPU storm; keep a nonzero floor for busy retries independent of the scheduling ceiling.
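One possible shape for the floor suggested above, as a minimal sketch rather than the PR's actual fix. `BUSY_RETRY_MS = 1000` is taken from the PR description; `BUSY_RETRY_FLOOR_MS` and the function name `computeBusySkipBackoffMsWithFloor` are hypothetical, and choosing the floor equal to the base delay is an assumption.

```typescript
// Base busy-retry delay; the value 1000 comes from the PR description.
const BUSY_RETRY_MS = 1000;

// Hypothetical floor (assumption): never re-arm a busy skip sooner than the base delay,
// regardless of how low the configured ceiling is.
const BUSY_RETRY_FLOOR_MS = BUSY_RETRY_MS;

function computeBusySkipBackoffMsWithFloor(busySkipCount: number, ceilingMs: number): number {
  // Negative counts are treated as 0, matching the helper under review.
  const exponent = Math.max(0, busySkipCount);
  // Exponential curve capped at the configured ceiling.
  const capped = Math.min(ceilingMs, BUSY_RETRY_MS * 2 ** exponent);
  // The floor is applied last, so a ceiling of 0 can no longer yield setTimeout(0).
  return Math.max(BUSY_RETRY_FLOOR_MS, capped);
}
```

With this ordering a ceiling below the floor degrades to a flat delay at the floor instead of a zero-delay loop; a ceiling above the floor behaves exactly as the reviewed helper does.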
    }),
    dueAt: params.dueAt,
    ...(params.retryCount !== undefined ? { retryCount: params.retryCount } : {}),
    ...(params.busySkipCount !== undefined ? { busySkipCount: params.busySkipCount } : {}),
Reset stale busy-skip counts on other retries
Because nextState is copied from the current durable state and this new branch only overwrites busySkipCount when the caller supplies one, a transient-error retry after a busy-deferred flow preserves the old busy counter. If the seat was busy several times, then briefly got far enough to throw from getReplyFromConfig, and is busy again on the next retry, the first busy skip of that new streak immediately uses the old capped backoff (for example 60s) instead of the intended 1s for a consecutive PRE-drive streak, delaying delivery after intermittent errors; clear the counter on non-busy requeues or make callers explicitly reset it.
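The "clear the counter on non-busy requeues" option could look roughly like the sketch below. It is an illustration under assumptions: `PendingWorkState` here is a simplified stand-in with only the fields named in this PR, and `RequeueReason`, `RequeueParams`, and `nextRequeueState` are hypothetical names, not the repository's API.

```typescript
// Simplified stand-in for the durable state; only fields named in the PR are modeled.
interface PendingWorkState {
  dueAt: number;
  retryCount?: number;
  busySkipCount?: number;
}

// Hypothetical discriminator for why the flow is being requeued.
type RequeueReason = "busy-skip" | "transient-error";

interface RequeueParams {
  dueAt: number;
  reason: RequeueReason;
}

function nextRequeueState(current: PendingWorkState, params: RequeueParams): PendingWorkState {
  if (params.reason === "busy-skip") {
    // A busy skip extends the current streak and leaves retryCount untouched.
    return {
      ...current,
      dueAt: params.dueAt,
      busySkipCount: (current.busySkipCount ?? 0) + 1,
    };
  }
  // Any non-busy requeue ends the busy streak, so the next busy skip starts at the base delay.
  return {
    ...current,
    dueAt: params.dueAt,
    retryCount: (current.retryCount ?? 0) + 1,
    busySkipCount: 0,
  };
}
```

Deriving both counters from the requeue reason in one place means a caller cannot forget to reset `busySkipCount`, which is the failure mode described in the comment.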
    if (candidate.cancelRequestedAt != null) {
      return false;
Do not strand cancelled continuation flows
If cancelFlowById leaves a managed continuation flow non-terminal with cancelRequestedAt after the revision-conflict path described here, this filter now makes the dispatcher ignore it forever, but the liveness checks still count queued/running continuation flows without excluding cancelRequestedAt (hasLiveOrRecentlyDispatchedContinuationWork is what subagent cleanup waits on). The only maintenance finalizer I found is the tasks command path, not an automatic runtime reaper, so a cancelled child-session wake in that state can keep cleanup retrying indefinitely; either finalize/mark it here or exclude cancel-requested flows from the pending/liveness predicates.
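The second option (excluding cancel-requested flows from the liveness predicates) might be sketched as follows. Everything here is an assumption for illustration: `ContinuationFlow`, `FlowStatus`, `isDispatchable`, and `hasLiveContinuationWork` are hypothetical stand-ins and do not reproduce `hasLiveOrRecentlyDispatchedContinuationWork` or the real flow record.

```typescript
// Hypothetical, simplified flow record; the status names follow the wording used in this PR.
type FlowStatus = "queued" | "running" | "succeeded" | "cancelled";

interface ContinuationFlow {
  id: string;
  status: FlowStatus;
  cancelRequestedAt?: number | null;
}

// Single predicate shared by the dispatcher filter and the liveness check,
// so the two can never disagree about a cancel-requested flow.
function isDispatchable(flow: ContinuationFlow): boolean {
  const nonTerminal = flow.status === "queued" || flow.status === "running";
  return nonTerminal && flow.cancelRequestedAt == null;
}

// Liveness as seen by cleanup: only flows the dispatcher would still grant a turn count as live.
function hasLiveContinuationWork(flows: ContinuationFlow[]): boolean {
  return flows.some(isDispatchable);
}
```

Under this sketch a flow left non-terminal with `cancelRequestedAt` set is ignored by both the dispatcher and cleanup, so cleanup is not kept waiting on work that will never run.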
#990 Pillar-0: the busy-retry-loop storm-killer (ship-now slice)

Decoupled, architecturally-neutral slice of #990. Cures the live busy-skip spin (tonight's multi-DGX storm + the `14b1e6f9` orphan exhibited it). Stacks on assembly `0ea26acd` (a437ca7 + #988-P2-1). Design home: #990 issuecomment-4677664820.

The fix (two parts)

1. `BUSY_RETRY_MS=1000` flat/uncapped re-arm (the ~1Hz spin on `requests-in-flight`) becomes exponential: `min(BUSY_RETRY_MS * 2^busySkipCount, CEILING)`. `busySkipCount` is a NEW per-flow counter, distinct from `retryCount`: it never increments `retryCount`, never trips `MAX_TRANSIENT_ERROR_RETRY_COUNT`, and resets to 0 on actual drive (ran).
2. `:259` succeeded-gate verify + harden: confirmed the requeue read-guard skips terminal (succeeded/cancelled) via the consume-filter; hardened to also honor `cancel_requested_at`.

Invariants (locked give-up-policy)

…retryCount-bounded) UNCHANGED. Architecturally-neutral: touches no outcome-classification.

Verification (lane self-report `56bd568`)

`work-dispatch.test.ts` 35/35 green + RED-verified (old flat-1s fails the backoff-progression assertion). `tsgo:core` + `tsgo:extensions` exit 0, `oxlint` + `oxfmt` clean.

Reviewers: π (outcome-model owner) + 🪨 (continuation-lifecycle pair). Folds into the consolidated pass; PR-2 (discriminator) + PR-3 (locus-3) stack next.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
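The counter behavior described in the fix can be traced with a small simulation. This is a sketch under assumptions: the helper body mirrors the reviewed diff hunk and `BUSY_RETRY_MS = 1000` comes from the description, while `FlowCounters`, `onBusySkip`, `onDrive`, `simulateBusySkips`, and the 60000 ms ceiling are hypothetical and only illustrate the intended invariants.

```typescript
const BUSY_RETRY_MS = 1000;

// Same curve as the helper in the reviewed diff hunk.
function computeBusySkipBackoffMs(busySkipCount: number, ceilingMs: number): number {
  const exponent = Math.max(0, busySkipCount);
  return Math.min(ceilingMs, BUSY_RETRY_MS * 2 ** exponent);
}

// Hypothetical per-flow counters; the real state lives in PendingWorkState.
interface FlowCounters {
  retryCount: number;
  busySkipCount: number;
}

// A busy skip uses the current count for its delay, then extends the streak.
// retryCount is copied through unchanged, so the transient-error bound is never fed.
function onBusySkip(c: FlowCounters, ceilingMs: number): { next: FlowCounters; delayMs: number } {
  const delayMs = computeBusySkipBackoffMs(c.busySkipCount, ceilingMs);
  return { next: { ...c, busySkipCount: c.busySkipCount + 1 }, delayMs };
}

// An actual drive resets the busy streak so the flow is not permanently backed off.
function onDrive(c: FlowCounters): FlowCounters {
  return { ...c, busySkipCount: 0 };
}

function simulateBusySkips(
  skips: number,
  ceilingMs: number,
): { delays: number[]; counters: FlowCounters } {
  let counters: FlowCounters = { retryCount: 0, busySkipCount: 0 };
  const delays: number[] = [];
  for (let i = 0; i < skips; i++) {
    const step = onBusySkip(counters, ceilingMs);
    delays.push(step.delayMs);
    counters = step.next;
  }
  return { delays, counters };
}
```

Calling `simulateBusySkips(20, 60000)` yields delays of 1000, 2000, 4000, 8000, 16000, 32000 and then 60000 for every later skip, with `retryCount` still 0 after all 20 skips; `onDrive` then returns the flow to a `busySkipCount` of 0.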