fix(streams+agent): re-dial dead outbound subscribers, resurrect scheduled LLM turns after DO eviction by jonastemplestein · Pull Request #1471 · iterate/iterate

jonastemplestein · 2026-06-10T18:36:44Z

Context

Root-cause fixes from today's prd Slack agent outage (second incident of the day). A deploy that landed at 16:02:09 UTC, three seconds after a Slack message in #...C08R1SMTZGD, evicted the agent host DO mid-run, and the agent never responded to that message or any later one. Diagnosis trail: the thread stream /agents/slack/c08r1smtzgd/ts-1781107326-880549 shows full agent setup and then nothing (no LLM request, no error), and observability shows zero completed SlackAgentDurableObject invocations account-wide after the deploy.

Two independent bugs compose into "agent permanently silent":

1. Streams kernel: dead outbound connections never re-dialed (M1 from `tasks/streams-review-fixes.md`)

Batch delivery into a subscriber stub was fire-and-forget with the result disposed immediately and rejections discarded. When a subscriber DO dies (deploy eviction, abort), the broken connection stays in #connections, the pump keeps advancing its cursor into the dead stub, and #reconcile skips keys that already have a connection — so the subscriber is never re-dialed and delivery stalls silently for the rest of the stream incarnation.

- retainProcessEventBatch now takes onDeliveryError and observes the delivery result (rejections and synchronous stub throws), disposing only after settle (disposing a pending Cap'n Web promise cancels the call).
- The Stream DO drops the connection on a failed delivery and re-dials (outbound); the subscriber re-handshakes from its durable checkpoint, and the existing generation gate drops stragglers.
- Dropped the Object.hasOwn guard that prevented onRpcBroken from ever being wired for Cap'n Web proxy stubs (they expose no own descriptors). The wiring is defensive because property access on a native Workers RPC stub can fabricate a pipelined method that only fails at call time.
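
For readers without the diff open, the delivery change can be sketched roughly as follows. This is a simplified illustration, not the code in packages/streams: `SubscriberStub`, `DeliveryResult`, `deliverBatch`, `ConnectionTable`, and the plain `dispose()` method are all hypothetical stand-ins, and the real kernel also tracks cursors and generations that are omitted here.

```typescript
// Hypothetical shapes for illustration only; not the real packages/streams types.
type DeliveryResult = PromiseLike<unknown> & { dispose?: () => void };

interface SubscriberStub {
  processEventBatch(batch: readonly unknown[]): DeliveryResult;
}

// Observe the delivery result instead of firing and forgetting.
async function deliverBatch(
  stub: SubscriberStub,
  batch: readonly unknown[],
  onDeliveryError: (error: unknown) => void,
): Promise<void> {
  let result: DeliveryResult | undefined;
  try {
    result = stub.processEventBatch(batch); // a broken stub may throw synchronously
    await result; // or the call may reject later
  } catch (error) {
    onDeliveryError(error);
  } finally {
    // Dispose only after the call has settled; disposing earlier could cancel it.
    result?.dispose?.();
  }
}

// Assumed connection bookkeeping: drop the dead entry so reconcile re-dials it.
class ConnectionTable {
  readonly connections = new Map<string, SubscriberStub>();
  dial: (key: string) => SubscriberStub;

  constructor(dial: (key: string) => SubscriberStub) {
    this.dial = dial;
  }

  reconcile(keys: readonly string[]): void {
    for (const key of keys) {
      // Keys that already have a connection are skipped, as in the description above.
      if (!this.connections.has(key)) this.connections.set(key, this.dial(key));
    }
  }

  async deliver(key: string, batch: readonly unknown[]): Promise<void> {
    const stub = this.connections.get(key);
    if (stub === undefined) return;
    await deliverBatch(stub, batch, () => {
      // Only drop the entry if it is still the stub that failed.
      if (this.connections.get(key) === stub) this.connections.delete(key);
      this.reconcile([key]);
    });
  }
}
```

The point of the sketch is the ordering: the failure is reported first, the dead entry is removed, and only then can the skip-if-present reconcile step dial a fresh stub.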

2. Agent processor: scheduled LLM turn dies with the DO and wedges the stream

agent/llm-request-scheduled is durable, but the debounce timer lived only in memory. After eviction+rehydration nothing re-armed it — and #resetScheduledLlmRequestTimer no-ops when there is no in-memory scheduledEvent, so every later default-policy input was silently swallowed too. Streams evicted between llm-request-scheduled and llm-request-requested were permanently wedged.

- The scheduled currentRequest state now records scheduledOffset (optional, so old checkpoints still parse).
- snapshot(), which the host calls on every subscription (re-)handshake even with no new events, and processEventBatch both re-arm the lost timer; a history-read fallback covers checkpoints written before scheduledOffset existed, so already-wedged prod streams recover on their next re-handshake after this deploys.
- The handoff append retries instead of dropping the turn. All recovery is idempotent: llm-request-requested is keyed off the scheduled event's offset.
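
A rough sketch of the recovery shape described above, under stated assumptions: `CurrentRequest`, `createRecovery`, `ensureTimer`, and the `scheduled:<offset>` key format are hypothetical names chosen for illustration, the history lookup is synchronous here for brevity (the real fallback reads stream history asynchronously), and the real processor carries more state.

```typescript
// Hypothetical, simplified state; the real checkpoint has more fields.
type CurrentRequest =
  | { status: "idle" }
  | { status: "scheduled"; requestId: string; debounceMs: number; scheduledOffset?: number };

type RequestedEvent = { type: "llm-request-requested"; idempotencyKey: string };

function createRecovery(deps: {
  schedule: (ms: number, fire: () => void) => void;
  append: (event: RequestedEvent) => void;
  findScheduledOffset: (requestId: string) => number | undefined;
}) {
  let armed = false; // in-memory only, so it is lost when the DO is evicted

  function arm(debounceMs: number, offset: number): void {
    armed = true;
    deps.schedule(debounceMs, () => {
      armed = false;
      try {
        // Keying on the scheduled event's offset is what makes a repeat append safe.
        deps.append({ type: "llm-request-requested", idempotencyKey: `scheduled:${offset}` });
      } catch {
        // Retry rather than dropping the turn.
        arm(debounceMs, offset);
      }
    });
  }

  return {
    // Call on rehydration (e.g. from a snapshot or batch hook). Returns true if a timer was armed.
    ensureTimer(state: CurrentRequest): boolean {
      if (state.status !== "scheduled" || armed) return false;
      // Old checkpoints lack scheduledOffset, so fall back to a history lookup.
      const offset = state.scheduledOffset ?? deps.findScheduledOffset(state.requestId);
      if (offset === undefined) return false;
      arm(state.debounceMs, offset);
      return true;
    },
  };
}
```

In a real runtime `schedule` would presumably wrap `setTimeout` or a DO alarm; injecting it keeps the sketch testable without real time passing.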

Regression tests (all verified to fail on the unfixed code)

- packages/streams/src/workers/durable-objects/stream-redial.workers.test.ts: end-to-end in workerd: Stream DO → Callable dial → StreamProcessorRunner (echo), then ctx.abort() the runner to simulate a deploy and assert the next append still delivers and the subscriber's durable checkpoint advances. (The Stream-side cursor advances even into a dead stub, which is the bug, so the assertions are on actual delivery and the subscriber checkpoint.)
- packages/streams/src/workers/rpc-lifecycle.test.ts: proxy-shaped onRpcBroken wiring, pipelined-fake survival, delivery rejection reporting, settle-then-dispose.
- apps/os/.../agent/implementation.test.ts: four recovery tests: re-arm via snapshot() rehydration, re-arm via batch replay, history fallback for old checkpoints, handoff append retry.

pnpm typecheck && pnpm lint && pnpm format && pnpm test green, plus the streams workers suite.

🤖 Generated with Claude Code

Note

High Risk
Changes core stream delivery and agent LLM scheduling paths that run on every deploy/eviction; mistakes could duplicate LLM requests or drop events, though idempotency keys and generation gates mitigate double-handoff.

Overview
Fixes a production outage where deploy eviction left Slack agents permanently silent by addressing two independent failure modes that composed together.

Streams (M1): Outbound batch delivery into a dead subscriber DO no longer leaves a zombie connection in #connections. retainProcessEventBatch now accepts onDeliveryError, observes rejections/sync throws, and disposes RPC results only after settle; onRpcBroken wiring no longer uses the broken Object.hasOwn guard. The Stream DO closes the connection and calls #reconcile() on delivery failure so subscribers are re-dialed and re-handshake from durable checkpoints.

Agent processor: Scheduled LLM turns persist scheduledOffset on currentRequest so a rehydrated DO can rebuild the in-memory debounce timer. Recovery runs on snapshot() (subscription re-handshake) and after processEventBatch replay; old checkpoints without scheduledOffset fall back to stream history. Failed llm-request-requested appends re-arm the timer instead of wedging the stream.

Regression coverage: stream-redial.workers.test.ts, rpc-lifecycle.test.ts, and agent implementation.test.ts scheduled-recovery suite.

Reviewed by Cursor Bugbot for commit 714b21a. Bugbot is set up for automated code reviews on this repo.

Environment Config Lease

No active environment config lease.

OS

Status: released
Commit: 714b21a
Preview: https://os.iterate-preview-4.com
Summary: Preview app released.
Workflow run
Updated: 2026-06-10T19:11:51.700Z

…duled LLM turns after DO eviction

Two root-cause fixes from the 2026-06-10 prd Slack agent outage, where a deploy landed 3s after a Slack message, evicted the agent host DO mid-run, and the agent never responded to that message or any later one.

M1 (streams kernel): batch delivery into a dead subscriber stub was fire-and-forget with rejections discarded, so after a subscriber DO died the broken connection stayed in #connections and #reconcile never re-dialed it, and delivery stalled for the rest of the stream incarnation. retainProcessEventBatch now surfaces delivery rejections (and sync stub throws) via onDeliveryError; the Stream DO drops the connection and re-dials, and the subscriber re-handshakes from its durable checkpoint. Also drop the Object.hasOwn guard that kept onRpcBroken from ever being wired for capnweb proxy stubs (defensively, since property access on a native RPC stub can fabricate a pipelined fake).

Agent processor: agent/llm-request-scheduled is durable but the debounce timer lived only in memory. A rehydrated instance never re-armed it, and default-policy inputs only reset the nonexistent timer (#resetScheduledLlmRequestTimer no-ops without a scheduledEvent), so the stream wedged permanently. The scheduled state now records scheduledOffset; snapshot() (called on every host re-handshake) and processEventBatch re-arm the lost timer, with a history-read fallback for pre-fix checkpoints; the handoff append retries instead of dropping the turn. All recovery is idempotent: llm-request-requested is keyed off the scheduled event's offset.

Regression tests verified to fail on the unfixed code: stream-redial.workers.test.ts (abort the runner DO mid-subscription, assert post-abort appends still deliver and the subscriber checkpoint advances), rpc-lifecycle.test.ts, and four recovery tests in the agent implementation.test.ts.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.


cursor · 2026-06-10T18:39:25Z

+        debounceMs,
+        scheduledEvent: { offset: scheduledEvent.offset },
+      });
+    });


Async recovery ignores cancelled schedule

Medium Severity

The history fallback in #ensureScheduledLlmRequestTimer arms a debounce timer from a captured currentRequest after readStreamEvents finishes, without checking whether that schedule was cancelled meanwhile. #requestScheduledLlmWork then appends llm-request-requested from rebuilt history without confirming currentRequest is still scheduled for that requestId, so a cancelled turn can still trigger an LLM handoff.

Additional Locations (1)

apps/os/src/domains/agents/stream-processors/agent/implementation.ts#L455-L473
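
One possible way to close the race Bugbot describes is to re-read the current request after the asynchronous history read and arm only if the same schedule is still live. The sketch below is an assumption about how that could look, not what this PR or #1460 does; `armFromHistory`, `getCurrentRequest`, `readScheduledOffset`, and `arm` are hypothetical names.

```typescript
// Hypothetical, simplified state for illustration.
type CurrentRequest =
  | { status: "idle" }
  | { status: "scheduled"; requestId: string };

async function armFromHistory(
  getCurrentRequest: () => CurrentRequest,
  readScheduledOffset: (requestId: string) => Promise<number | undefined>,
  arm: (requestId: string, offset: number) => void,
): Promise<boolean> {
  const captured = getCurrentRequest();
  if (captured.status !== "scheduled") return false;

  const offset = await readScheduledOffset(captured.requestId);

  // Re-read after the await: the schedule may have been cancelled or replaced meanwhile.
  const latest = getCurrentRequest();
  if (latest.status !== "scheduled" || latest.requestId !== captured.requestId) return false;
  if (offset === undefined) return false;

  arm(captured.requestId, offset);
  return true;
}
```

The same requestId comparison would apply at handoff time, before appending llm-request-requested.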


jonastemplestein · 2026-06-10T18:58:14Z

folding into 1460

This merge commit carries three things (merged with a dirty tree):

1. Merge of origin/main (itx SSR/loader prefetch, StreamState truth-telling, toAfterOffset rename). Conflicts: kept the subscriber descriptors on the two inbound subscribe call sites while adopting main's renamed helper.

2. Tidy-up: shared/presence-events.ts is gone. The presence event types, StreamSubscriberDescriptor, and ProcessorContractAnnouncement live in the core processor contract (contract files are the schema/type layer); processors that reconcile on subscriber-connected now dep [CoreProcessorContract]. Same theme: the legacy built-in/external-url subscriber shapes and their parse-then-drop transform are deleted; subscription-configured only accepts callable subscribers now.

3. Port of PR #1471 (prd Slack outage root-cause fixes) onto this branch's structure:
   - rpc-lifecycle: onDeliveryError observes batch-delivery rejections and sync stub throws (dispose only after settle); the Object.hasOwn guard that kept onRpcBroken from wiring for capnweb stubs is gone. Its unit tests ported verbatim.
   - CoreStreamProcessor.openConnection drops the connection (reason delivery-failed, with the presence fact) and re-dials outbound subscribers on delivery failure. The end-to-end abort/re-dial workerd regression tests ported verbatim and pass against this structure.
   - Agent scheduler: scheduledOffset is optional (old checkpoints still parse) with a history-read fallback, and the llm-request-requested handoff append re-arms the timer and retries on failure instead of dropping the turn.

1471's snapshot()/processEventBatch re-arm hooks are deliberately NOT ported: in this branch's architecture every re-handshake delivers a subscriber-connected fact, which is the one recovery trigger. Their recovery scenarios are covered by three new connected-driven tests (history fallback, no-double-arm with a live timer, handoff retry).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

jonastemplestein · 2026-06-10T19:11:09Z

Absorbed into #1460 (commit 76e501a), adapted to that branch's structure:

- M1 kernel fix ported wholesale: rpc-lifecycle.ts (onDeliveryError + dropped Object.hasOwn guard) and its unit tests verbatim; the Stream-DO-side drop-and-redial now lives in CoreStreamProcessor.openConnection (connection management moved there on that branch) and additionally appends a subscriber-disconnected presence fact with reason delivery-failed. The end-to-end abort/re-dial workerd regression tests were ported verbatim and pass; they also caught a missing hunk during the port, so they bite.
- Agent scheduler: #1460 (streams: subscriber presence facts + reconciler homogenization) already had the durable scheduled-offset + recovery (triggered by the subscriber-connected presence fact that every re-handshake appends, rather than by snapshot()/processEventBatch hooks). Adopted from here: scheduledOffset being optional with the history-read fallback (so pre-fix prod checkpoints parse and recover), and the handoff append retry. Your recovery scenarios are covered by three new connected-driven tests.
- tasks/streams-review-fixes.md M1 checkboxes carried over.

cursor Bot reviewed Jun 10, 2026


jonastemplestein closed this Jun 10, 2026

jonastemplestein wants to merge 1 commit into main from slender-amusement

jonastemplestein commented Jun 10, 2026 · edited by iterate-bot
