fix(continuation): #990 Pillar-0: exp-backoff on busy-skip re-arm (storm-killer, architecturally-neutral) #994

Merged: scribe-dandelion-cult merged 1 commit into frond-scribe/20260609/assembly-token-wiring from codeagent/990-pillar0-exp-backoff on Jun 11, 2026

@scribe-dandelion-cult

#990 Pillar-0: the busy-retry-loop storm-killer (ship-now slice)

Decoupled, architecturally-neutral slice of #990. Cures the live busy-skip spin (tonight's multi-DGX storm + the 14b1e6f9 orphan exhibited it). Stacks on assembly 0ea26acd (a437ca7 + #988-P2-1). Design home: #990 issuecomment-4677664820.

The fix (two parts)

  1. exp-backoff on the busy-skip re-arm: the flat, uncapped BUSY_RETRY_MS=1000 re-arm (the ~1Hz spin on requests-in-flight) becomes exponential: min(BUSY_RETRY_MS * 2^busySkipCount, CEILING). busySkipCount is a NEW per-flow counter, distinct from retryCount: it never increments retryCount, never trips MAX_TRANSIENT_ERROR_RETRY_COUNT, and resets to 0 on actual drive (ran).
  2. :259 succeeded-gate verify + harden: confirmed the requeue read-guard skips terminal (succeeded/cancelled) via the consume-filter; hardened to also honor cancel_requested_at.
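
As a rough illustration of the progression in part 1, the sketch below tabulates the delay for the first eight consecutive busy skips. The 60s ceiling is an assumed value chosen for illustration (the PR only names BUSY_RETRY_MS=1000 and an unspecified CEILING), and busySkipDelayMs is a hypothetical stand-in for the real helper.

```typescript
// Assumption: BUSY_RETRY_MS = 1000 as stated in the PR; CEILING_MS = 60s is illustrative only.
const BUSY_RETRY_MS = 1000;
const CEILING_MS = 60_000;

// Hypothetical stand-in for the helper: min(ceiling, base * 2^n), with n clamped at 0.
function busySkipDelayMs(busySkipCount: number, ceilingMs: number): number {
  const exponent = Math.max(0, busySkipCount);
  return Math.min(ceilingMs, BUSY_RETRY_MS * 2 ** exponent);
}

// Delays for busySkipCount = 0..7:
// [1000, 2000, 4000, 8000, 16000, 32000, 60000, 60000]
const progression: number[] = Array.from({ length: 8 }, (_, n) =>
  busySkipDelayMs(n, CEILING_MS),
);
```

With these assumed values the old flat re-arm would fire eight times in eight seconds, while the backed-off re-arm spreads the same eight attempts over roughly three minutes.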

Invariants (locked give-up-policy)

Verification (lane self-report 56bd568)

  • work-dispatch.test.ts 35/35 green + RED-verified (old flat-1s fails the backoff-progression assertion).
  • Full suite 83/88 shards; the 5 fails (telegram net-timeout, agents-core thinking-default drift, secrets contract, memory-lancedb envelope, tooling pnpm-path) touch zero of these files + import none of these modules = known baseline-env.
  • tsgo:core+tsgo:extensions exit 0, oxlint+oxfmt clean.

Reviewers: 🌊 (outcome-model owner) + 🪨 (continuation-lifecycle pair). Folds into the consolidated pass; PR-2 (discriminator) + PR-3 (locus-3) stack next.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ested harden (#990 Pillar-0)

The PRE-drive busy-skip re-arm (requests-in-flight/draining) requeued at a flat
BUSY_RETRY_MS=1s with no backoff, spinning a chronically-busy seat at ~1Hz forever
(the storm). Replace with exponential backoff on a NEW busySkipCount counter that is
DISTINCT from retryCount and never feeds the transient-error fail-bound. The policy is
rate-cap-forever: the flow keeps deferring at a decaying rate (1s,2s,4s,...,capped at
maxDelayMs) and delivers the instant the seat quiets, never dropped (#952 never-penalize
survives).

- computeBusySkipBackoffMs(busySkipCount, ceilingMs) = min(ceilingMs, BUSY_RETRY_MS*2^n).
- busySkipCount persisted in PendingWorkState (same shape as retryCount), reset to 0 on
  drive (markPendingWorkTurnGranted) so a deferred-then-granted flow is never permanently
  backed off. Busy-skip never passes retryCount.
- The interrupted/threw retryCount-bounded path (bucket-3) is unchanged.
- :259 dedup: verified consumePendingWork filters terminal status structurally; hardened
  it to also skip cancelRequestedAt-marked flows so a cancel-requested wake (pre-reaper
  finalize window) is never granted a turn.
- Tests: exp-backoff progression (RED vs old flat-1s), distinct-from-retryCount/rate-cap-
  forever across 20 skips, reset-on-drive, cancel-requested + succeeded dedup gates, helper
  unit curve+cap. Fixed pre-existing oxlint nits in the same test file.
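
The counter behaviour described in the bullets above can be sketched as two state transitions. This is a minimal model under stated assumptions, not the merged code: the PendingWorkState shape is reduced to three fields, onBusySkip and onTurnGranted are hypothetical names (the commit names markPendingWorkTurnGranted for the reset), and the delay is assumed to be computed from the counter before it is incremented.

```typescript
const BUSY_RETRY_MS = 1000;

// Hypothetical reduced shape; the real PendingWorkState carries more fields.
interface PendingWorkState {
  dueAt: number;
  retryCount: number;
  busySkipCount: number;
}

// PRE-drive busy skip: push dueAt out by the backoff and bump only busySkipCount.
// retryCount is deliberately untouched, so the transient-error bound is never fed.
function onBusySkip(state: PendingWorkState, now: number, ceilingMs: number): PendingWorkState {
  const exponent = Math.max(0, state.busySkipCount);
  const delay = Math.min(ceilingMs, BUSY_RETRY_MS * 2 ** exponent);
  return { ...state, dueAt: now + delay, busySkipCount: state.busySkipCount + 1 };
}

// Turn granted (the flow actually drives): the busy streak is over, so reset the counter.
function onTurnGranted(state: PendingWorkState): PendingWorkState {
  return { ...state, busySkipCount: 0 };
}
```

Under this model, twenty consecutive skips leave retryCount at its starting value and the delay pinned at the ceiling, which matches the rate-cap-forever test described above.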

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
scribe-dandelion-cult merged commit 6168d1f into frond-scribe/20260609/assembly-token-wiring on Jun 11, 2026
130 of 142 checks passed

chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 56bd568fea

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

*/
export function computeBusySkipBackoffMs(busySkipCount: number, ceilingMs: number): number {
  const exponent = Math.max(0, busySkipCount);
  return Math.min(ceilingMs, BUSY_RETRY_MS * 2 ** exponent);

P2: Guard busy backoff against zero ceilings

When agents.defaults.continuation.maxDelayMs is configured below BUSY_RETRY_MS, especially 0, which the continuation schema/resolver allow (src/config/zod-schema.core.ts:888, src/auto-reply/continuation/config.ts:31), this returns 0, so the retry path sets dueAt = Date.now() and armWorkTimer schedules setTimeout(0). While the session stays active, the main lane is nonempty, or the gateway is draining, each timer immediately re-consumes and requeues again, turning the intended backoff into a tight event-loop/CPU storm; keep a nonzero floor for busy retries independent of the scheduling ceiling.

Useful? React with 👍 / 👎.
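
One way to read the suggestion above is to clamp the ceiling so it can never fall below the base delay. The sketch below is a hypothetical variant, not the fix that was or will be merged; computeBusySkipBackoffMsWithFloor is an invented name, and non-finite ceilings are not handled.

```typescript
const BUSY_RETRY_MS = 1000;

// Hypothetical variant of the helper quoted in the review.
// Assumption: the floor is BUSY_RETRY_MS itself, independent of the configured ceiling.
function computeBusySkipBackoffMsWithFloor(busySkipCount: number, ceilingMs: number): number {
  const exponent = Math.max(0, busySkipCount);
  // A ceiling of 0 (or anything below the base delay) is raised to the base delay,
  // so the timer can never be armed with a zero delay.
  const effectiveCeilingMs = Math.max(BUSY_RETRY_MS, ceilingMs);
  return Math.min(effectiveCeilingMs, BUSY_RETRY_MS * 2 ** exponent);
}
```

With a ceiling of 0 this returns 1000 for every busySkipCount, which degrades to the old flat 1s re-arm rather than to a setTimeout(0) loop. Ceilings at or above the base delay behave exactly as in the quoted helper.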

}),
dueAt: params.dueAt,
...(params.retryCount !== undefined ? { retryCount: params.retryCount } : {}),
...(params.busySkipCount !== undefined ? { busySkipCount: params.busySkipCount } : {}),

P2: Reset stale busy-skip counts on other retries

Because nextState is copied from the current durable state and this new branch only overwrites busySkipCount when the caller supplies one, a transient-error retry after a busy-deferred flow preserves the old busy counter. If the seat was busy several times, then briefly got far enough to throw from getReplyFromConfig, and is busy again on the next retry, the first busy skip of that new streak immediately uses the old capped backoff (for example 60s) instead of the intended 1s for a consecutive PRE-drive streak, delaying delivery after intermittent errors; clear the counter on non-busy requeues or make callers explicitly reset it.
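
The difference between the quoted spread pattern and the suggested reset can be shown with two small requeue functions. Both are hypothetical reductions of the real requeue path: the state and parameter shapes are trimmed to the fields under discussion, and the function names are invented for this sketch.

```typescript
// Hypothetical reduced shapes; the real state and params carry more fields.
interface PendingWorkState {
  dueAt: number;
  retryCount?: number;
  busySkipCount?: number;
}
interface RequeueParams {
  dueAt: number;
  retryCount?: number;
  busySkipCount?: number;
}

// Mirrors the quoted diff: a counter is only overwritten when the caller supplies it,
// so a transient-error requeue inherits whatever busySkipCount was already persisted.
function requeuePreserving(current: PendingWorkState, params: RequeueParams): PendingWorkState {
  return {
    ...current,
    dueAt: params.dueAt,
    ...(params.retryCount !== undefined ? { retryCount: params.retryCount } : {}),
    ...(params.busySkipCount !== undefined ? { busySkipCount: params.busySkipCount } : {}),
  };
}

// One possible reading of the suggestion: any requeue that does not pass a
// busySkipCount is treated as a non-busy requeue and clears the counter.
function requeueResetting(current: PendingWorkState, params: RequeueParams): PendingWorkState {
  return {
    ...current,
    dueAt: params.dueAt,
    ...(params.retryCount !== undefined ? { retryCount: params.retryCount } : {}),
    busySkipCount: params.busySkipCount ?? 0,
  };
}
```

For a flow that was busy-skipped six times and then throws once, requeuePreserving keeps busySkipCount at 6 while requeueResetting returns it to 0, so the next busy streak starts again from the base delay.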

Comment on lines +181 to +182
if (candidate.cancelRequestedAt != null) {
return false;

P2: Do not strand cancelled continuation flows

If cancelFlowById leaves a managed continuation flow non-terminal with cancelRequestedAt after the revision-conflict path described here, this filter now makes the dispatcher ignore it forever, but the liveness checks still count queued/running continuation flows without excluding cancelRequestedAt (hasLiveOrRecentlyDispatchedContinuationWork is what subagent cleanup waits on). The only maintenance finalizer I found is the tasks command path, not an automatic runtime reaper, so a cancelled child-session wake in that state can keep cleanup retrying indefinitely; either finalize/mark it here or exclude cancel-requested flows from the pending/liveness predicates.
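
The second option named above, excluding cancel-requested flows from the liveness predicate, might look like the sketch below. Everything here is assumed for illustration: the Flow shape, the status names, and both function names are hypothetical, and the real hasLiveOrRecentlyDispatchedContinuationWork may weigh other conditions (such as recent dispatch) that this model omits.

```typescript
// Hypothetical flow model; status names are assumed from the PR text.
type FlowStatus = "queued" | "running" | "succeeded" | "cancelled";

interface Flow {
  id: string;
  status: FlowStatus;
  cancelRequestedAt?: number | null;
}

// Dispatcher-side filter as described in the PR: terminal flows and
// cancel-requested flows are never granted a turn.
function isDispatchable(flow: Flow): boolean {
  if (flow.status === "succeeded" || flow.status === "cancelled") {
    return false;
  }
  return flow.cancelRequestedAt == null;
}

// Liveness predicate aligned with that filter: a queued/running flow that the
// dispatcher will never pick up no longer counts as live work.
function hasLiveContinuationWork(flows: Flow[]): boolean {
  return flows.some(
    (flow) =>
      (flow.status === "queued" || flow.status === "running") &&
      flow.cancelRequestedAt == null,
  );
}
```

Keeping the two predicates in agreement means a flow the dispatcher ignores cannot hold cleanup open. Whether a still-running cancel-requested flow should count as live is a design decision this sketch does not settle.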