fix(streams+agent): re-dial dead outbound subscribers, resurrect scheduled LLM turns after DO eviction#1471
fix(streams+agent): re-dial dead outbound subscribers, resurrect scheduled LLM turns after DO eviction#1471jonastemplestein wants to merge 1 commit into
Conversation
…duled LLM turns after DO eviction Two root-cause fixes from the 2026-06-10 prd Slack agent outage, where a deploy landed 3s after a Slack message, evicted the agent host DO mid-run, and the agent never responded — to that message or any later one. M1 (streams kernel): batch delivery into a dead subscriber stub was fire-and-forget with rejections discarded, so after a subscriber DO died the broken connection stayed in #connections and #reconcile never re-dialed it — delivery stalled for the rest of the stream incarnation. retainProcessEventBatch now surfaces delivery rejections (and sync stub throws) via onDeliveryError; the Stream DO drops the connection and re-dials, and the subscriber re-handshakes from its durable checkpoint. Also drop the Object.hasOwn guard that kept onRpcBroken from ever being wired for capnweb proxy stubs (defensively, since property access on a native RPC stub can fabricate a pipelined fake). Agent processor: agent/llm-request-scheduled is durable but the debounce timer lived only in memory. A rehydrated instance never re-armed it, and default-policy inputs only reset the nonexistent timer (#resetScheduledLlmRequestTimer no-ops without a scheduledEvent), so the stream wedged permanently. The scheduled state now records scheduledOffset; snapshot() (called on every host re-handshake) and processEventBatch re-arm the lost timer, with a history-read fallback for pre-fix checkpoints; the handoff append retries instead of dropping the turn. All recovery is idempotent — llm-request-requested is keyed off the scheduled event's offset. Regression tests verified to fail on the unfixed code: stream-redial.workers.test.ts (abort the runner DO mid-subscription, assert post-abort appends still deliver and the subscriber checkpoint advances), rpc-lifecycle.test.ts, and four recovery tests in the agent implementation.test.ts. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 714b21a. Configure here.
| debounceMs, | ||
| scheduledEvent: { offset: scheduledEvent.offset }, | ||
| }); | ||
| }); |
There was a problem hiding this comment.
Async recovery ignores cancelled schedule
Medium Severity
The history fallback in #ensureScheduledLlmRequestTimer arms a debounce timer from a captured currentRequest after readStreamEvents finishes, without checking whether that schedule was cancelled meanwhile. #requestScheduledLlmWork then appends llm-request-requested from rebuilt history without confirming currentRequest is still scheduled for that requestId, so a cancelled turn can still trigger an LLM handoff.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 714b21a. Configure here.
|
folding into 1460 |
This merge commit carries three things (merged with a dirty tree): 1. Merge of origin/main (itx SSR/loader prefetch, StreamState truth-telling, toAfterOffset rename). Conflicts: kept the subscriber descriptors on the two inbound subscribe call sites while adopting main's renamed helper. 2. Tidy-up: shared/presence-events.ts is gone. The presence event types, StreamSubscriberDescriptor, and ProcessorContractAnnouncement live in the core processor contract (contract files are the schema/type layer); processors that reconcile on subscriber-connected now dep [CoreProcessorContract]. Same theme: the legacy built-in/external-url subscriber shapes and their parse-then-drop transform are deleted — subscription-configured only accepts callable subscribers now. 3. Port of PR #1471 (prd Slack outage root-cause fixes) onto this branch's structure: - rpc-lifecycle: onDeliveryError observes batch-delivery rejections and sync stub throws (dispose only after settle); the Object.hasOwn guard that kept onRpcBroken from wiring for capnweb stubs is gone. Its unit tests ported verbatim. - CoreStreamProcessor.openConnection drops the connection (reason delivery-failed, with the presence fact) and re-dials outbound subscribers on delivery failure. The end-to-end abort/re-dial workerd regression tests ported verbatim and pass against this structure. - Agent scheduler: scheduledOffset is optional (old checkpoints still parse) with a history-read fallback, and the llm-request-requested handoff append re-arms the timer and retries on failure instead of dropping the turn. 1471's snapshot()/processEventBatch re-arm hooks are deliberately NOT ported: in this branch's architecture every re-handshake delivers a subscriber-connected fact, which is the one recovery trigger. Their recovery scenarios are covered by three new connected-driven tests (history fallback, no-double-arm with a live timer, handoff retry). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
Absorbed into #1460 (commit 76e501a), adapted to that branch's structure:
|


Context
Root-cause fixes from today's prd Slack agent outage (second incident of the day). A deploy landed at 16:02:09 UTC — three seconds after a Slack message in
#...C08R1SMTZGD— evicted the agent host DO mid-run, and the agent never responded to that message or any later one. Diagnosis trail: thread stream/agents/slack/c08r1smtzgd/ts-1781107326-880549shows full agent setup then nothing (no LLM request, no error), and observability shows zero completedSlackAgentDurableObjectinvocations account-wide after the deploy.Two independent bugs compose into "agent permanently silent":
1. Streams kernel: dead outbound connections never re-dialed (M1 from
tasks/streams-review-fixes.md)Batch delivery into a subscriber stub was fire-and-forget with the result disposed immediately and rejections discarded. When a subscriber DO dies (deploy eviction, abort), the broken connection stays in
#connections, the pump keeps advancing its cursor into the dead stub, and#reconcileskips keys that already have a connection — so the subscriber is never re-dialed and delivery stalls silently for the rest of the stream incarnation.retainProcessEventBatchnow takesonDeliveryErrorand observes the delivery result (rejections and synchronous stub throws), disposing only after settle (disposing a pending Cap'n Web promise cancels the call).Object.hasOwnguard that preventedonRpcBrokenfrom ever being wired for Cap'n Web proxy stubs (they expose no own descriptors). The wiring is defensive because property access on a native Workers RPC stub can fabricate a pipelined method that only fails at call time.2. Agent processor: scheduled LLM turn dies with the DO and wedges the stream
agent/llm-request-scheduledis durable, but the debounce timer lived only in memory. After eviction+rehydration nothing re-armed it — and#resetScheduledLlmRequestTimerno-ops when there is no in-memoryscheduledEvent, so every later default-policy input was silently swallowed too. Streams evicted betweenllm-request-scheduledandllm-request-requestedwere permanently wedged.currentRequeststate now recordsscheduledOffset(optional, so old checkpoints still parse).snapshot()— called by the host on every subscription (re-)handshake, even with no new events — andprocessEventBatchre-arm the lost timer; a history-read fallback covers checkpoints written beforescheduledOffsetexisted, so already-wedged prod streams recover on their next re-handshake after this deploys.llm-request-requestedis keyed off the scheduled event's offset.Regression tests (all verified to fail on the unfixed code)
packages/streams/src/workers/durable-objects/stream-redial.workers.test.ts— end-to-end in workerd: Stream DO → Callable dial →StreamProcessorRunner(echo), thenctx.abort()the runner to simulate a deploy and assert the next append still delivers and the subscriber's durable checkpoint advances. (The Stream-side cursor advances even into a dead stub — that is the bug — so the assertions are on actual delivery and the subscriber checkpoint.)packages/streams/src/workers/rpc-lifecycle.test.ts— proxy-shapedonRpcBrokenwiring, pipelined-fake survival, delivery rejection reporting, settle-then-dispose.apps/os/.../agent/implementation.test.ts— four recovery tests: re-arm viasnapshot()rehydration, re-arm via batch replay, history fallback for old checkpoints, handoff append retry.pnpm typecheck && pnpm lint && pnpm format && pnpm testgreen, plus the streams workers suite.🤖 Generated with Claude Code
Note
High Risk
Changes core stream delivery and agent LLM scheduling paths that run on every deploy/eviction; mistakes could duplicate LLM requests or drop events, though idempotency keys and generation gates mitigate double-handoff.
Overview
Fixes a production outage where deploy eviction left Slack agents permanently silent by addressing two independent failure modes that composed together.
Streams (M1): Outbound batch delivery into a dead subscriber DO no longer leaves a zombie connection in
#connections.retainProcessEventBatchnow acceptsonDeliveryError, observes rejections/sync throws, and disposes RPC results only after settle;onRpcBrokenwiring no longer uses the brokenObject.hasOwnguard. The Stream DO closes the connection and calls#reconcile()on delivery failure so subscribers are re-dialed and re-handshake from durable checkpoints.Agent processor: Scheduled LLM turns persist
scheduledOffsetoncurrentRequestso a rehydrated DO can rebuild the in-memory debounce timer. Recovery runs onsnapshot()(subscription re-handshake) and afterprocessEventBatchreplay; old checkpoints withoutscheduledOffsetfall back to stream history. Failedllm-request-requestedappends re-arm the timer instead of wedging the stream.Regression coverage:
stream-redial.workers.test.ts,rpc-lifecycle.test.ts, and agentimplementation.test.tsscheduled-recovery suite.Reviewed by Cursor Bugbot for commit 714b21a. Bugbot is set up for automated code reviews on this repo. Configure here.
Environment Config Lease
No active environment config lease.
OS
Status: released
Commit:
714b21aPreview: https://os.iterate-preview-4.com
Summary: Preview app released.
Workflow run
Updated: 2026-06-10T19:11:51.700Z