Skip to content

fix(streams+agent): re-dial dead outbound subscribers, resurrect scheduled LLM turns after DO eviction#1471

Closed
jonastemplestein wants to merge 1 commit into
mainfrom
slender-amusement
Closed

fix(streams+agent): re-dial dead outbound subscribers, resurrect scheduled LLM turns after DO eviction#1471
jonastemplestein wants to merge 1 commit into
mainfrom
slender-amusement

Conversation

@jonastemplestein

@jonastemplestein jonastemplestein commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Context

Root-cause fixes from today's prd Slack agent outage (second incident of the day). A deploy landed at 16:02:09 UTC — three seconds after a Slack message in #...C08R1SMTZGD — evicted the agent host DO mid-run, and the agent never responded to that message or any later one. Diagnosis trail: thread stream /agents/slack/c08r1smtzgd/ts-1781107326-880549 shows full agent setup then nothing (no LLM request, no error), and observability shows zero completed SlackAgentDurableObject invocations account-wide after the deploy.

Two independent bugs compose into "agent permanently silent":

1. Streams kernel: dead outbound connections never re-dialed (M1 from tasks/streams-review-fixes.md)

Batch delivery into a subscriber stub was fire-and-forget with the result disposed immediately and rejections discarded. When a subscriber DO dies (deploy eviction, abort), the broken connection stays in #connections, the pump keeps advancing its cursor into the dead stub, and #reconcile skips keys that already have a connection — so the subscriber is never re-dialed and delivery stalls silently for the rest of the stream incarnation.

  • retainProcessEventBatch now takes onDeliveryError and observes the delivery result (rejections and synchronous stub throws), disposing only after settle (disposing a pending Cap'n Web promise cancels the call).
  • The Stream DO drops the connection on a failed delivery and re-dials (outbound); the subscriber re-handshakes from its durable checkpoint, and the existing generation gate drops stragglers.
  • Dropped the Object.hasOwn guard that prevented onRpcBroken from ever being wired for Cap'n Web proxy stubs (they expose no own descriptors). The wiring is defensive because property access on a native Workers RPC stub can fabricate a pipelined method that only fails at call time.

2. Agent processor: scheduled LLM turn dies with the DO and wedges the stream

agent/llm-request-scheduled is durable, but the debounce timer lived only in memory. After eviction+rehydration nothing re-armed it — and #resetScheduledLlmRequestTimer no-ops when there is no in-memory scheduledEvent, so every later default-policy input was silently swallowed too. Streams evicted between llm-request-scheduled and llm-request-requested were permanently wedged.

  • The scheduled currentRequest state now records scheduledOffset (optional, so old checkpoints still parse).
  • snapshot() — called by the host on every subscription (re-)handshake, even with no new events — and processEventBatch re-arm the lost timer; a history-read fallback covers checkpoints written before scheduledOffset existed, so already-wedged prod streams recover on their next re-handshake after this deploys.
  • The handoff append retries instead of dropping the turn. All recovery is idempotent: llm-request-requested is keyed off the scheduled event's offset.

Regression tests (all verified to fail on the unfixed code)

  • packages/streams/src/workers/durable-objects/stream-redial.workers.test.ts — end-to-end in workerd: Stream DO → Callable dial → StreamProcessorRunner (echo), then ctx.abort() the runner to simulate a deploy and assert the next append still delivers and the subscriber's durable checkpoint advances. (The Stream-side cursor advances even into a dead stub — that is the bug — so the assertions are on actual delivery and the subscriber checkpoint.)
  • packages/streams/src/workers/rpc-lifecycle.test.ts — proxy-shaped onRpcBroken wiring, pipelined-fake survival, delivery rejection reporting, settle-then-dispose.
  • apps/os/.../agent/implementation.test.ts — four recovery tests: re-arm via snapshot() rehydration, re-arm via batch replay, history fallback for old checkpoints, handoff append retry.

pnpm typecheck && pnpm lint && pnpm format && pnpm test green, plus the streams workers suite.

🤖 Generated with Claude Code


Note

High Risk
Changes core stream delivery and agent LLM scheduling paths that run on every deploy/eviction; mistakes could duplicate LLM requests or drop events, though idempotency keys and generation gates mitigate double-handoff.

Overview
Fixes a production outage where deploy eviction left Slack agents permanently silent by addressing two independent failure modes that composed together.

Streams (M1): Outbound batch delivery into a dead subscriber DO no longer leaves a zombie connection in #connections. retainProcessEventBatch now accepts onDeliveryError, observes rejections/sync throws, and disposes RPC results only after settle; onRpcBroken wiring no longer uses the broken Object.hasOwn guard. The Stream DO closes the connection and calls #reconcile() on delivery failure so subscribers are re-dialed and re-handshake from durable checkpoints.

Agent processor: Scheduled LLM turns persist scheduledOffset on currentRequest so a rehydrated DO can rebuild the in-memory debounce timer. Recovery runs on snapshot() (subscription re-handshake) and after processEventBatch replay; old checkpoints without scheduledOffset fall back to stream history. Failed llm-request-requested appends re-arm the timer instead of wedging the stream.

Regression coverage: stream-redial.workers.test.ts, rpc-lifecycle.test.ts, and agent implementation.test.ts scheduled-recovery suite.

Reviewed by Cursor Bugbot for commit 714b21a. Bugbot is set up for automated code reviews on this repo. Configure here.

Environment Config Lease

No active environment config lease.

OS

Status: released
Commit: 714b21a
Preview: https://os.iterate-preview-4.com
Summary: Preview app released.
Workflow run
Updated: 2026-06-10T19:11:51.700Z

…duled LLM turns after DO eviction

Two root-cause fixes from the 2026-06-10 prd Slack agent outage, where a
deploy landed 3s after a Slack message, evicted the agent host DO mid-run,
and the agent never responded — to that message or any later one.

M1 (streams kernel): batch delivery into a dead subscriber stub was
fire-and-forget with rejections discarded, so after a subscriber DO died
the broken connection stayed in #connections and #reconcile never
re-dialed it — delivery stalled for the rest of the stream incarnation.
retainProcessEventBatch now surfaces delivery rejections (and sync stub
throws) via onDeliveryError; the Stream DO drops the connection and
re-dials, and the subscriber re-handshakes from its durable checkpoint.
Also drop the Object.hasOwn guard that kept onRpcBroken from ever being
wired for capnweb proxy stubs (defensively, since property access on a
native RPC stub can fabricate a pipelined fake).

Agent processor: agent/llm-request-scheduled is durable but the debounce
timer lived only in memory. A rehydrated instance never re-armed it, and
default-policy inputs only reset the nonexistent timer
(#resetScheduledLlmRequestTimer no-ops without a scheduledEvent), so the
stream wedged permanently. The scheduled state now records
scheduledOffset; snapshot() (called on every host re-handshake) and
processEventBatch re-arm the lost timer, with a history-read fallback for
pre-fix checkpoints; the handoff append retries instead of dropping the
turn. All recovery is idempotent — llm-request-requested is keyed off the
scheduled event's offset.

Regression tests verified to fail on the unfixed code:
stream-redial.workers.test.ts (abort the runner DO mid-subscription,
assert post-abort appends still deliver and the subscriber checkpoint
advances), rpc-lifecycle.test.ts, and four recovery tests in the agent
implementation.test.ts.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 714b21a. Configure here.

debounceMs,
scheduledEvent: { offset: scheduledEvent.offset },
});
});

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Async recovery ignores cancelled schedule

Medium Severity

The history fallback in #ensureScheduledLlmRequestTimer arms a debounce timer from a captured currentRequest after readStreamEvents finishes, without checking whether that schedule was cancelled meanwhile. #requestScheduledLlmWork then appends llm-request-requested from rebuilt history without confirming currentRequest is still scheduled for that requestId, so a cancelled turn can still trigger an LLM handoff.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 714b21a. Configure here.

@jonastemplestein

Copy link
Copy Markdown
Contributor Author

folding into 1460

jonastemplestein added a commit that referenced this pull request Jun 10, 2026
This merge commit carries three things (merged with a dirty tree):

1. Merge of origin/main (itx SSR/loader prefetch, StreamState truth-telling,
   toAfterOffset rename). Conflicts: kept the subscriber descriptors on the
   two inbound subscribe call sites while adopting main's renamed helper.

2. Tidy-up: shared/presence-events.ts is gone. The presence event types,
   StreamSubscriberDescriptor, and ProcessorContractAnnouncement live in the
   core processor contract (contract files are the schema/type layer);
   processors that reconcile on subscriber-connected now dep
   [CoreProcessorContract]. Same theme: the legacy built-in/external-url
   subscriber shapes and their parse-then-drop transform are deleted —
   subscription-configured only accepts callable subscribers now.

3. Port of PR #1471 (prd Slack outage root-cause fixes) onto this branch's
   structure:
   - rpc-lifecycle: onDeliveryError observes batch-delivery rejections and
     sync stub throws (dispose only after settle); the Object.hasOwn guard
     that kept onRpcBroken from wiring for capnweb stubs is gone. Its unit
     tests ported verbatim.
   - CoreStreamProcessor.openConnection drops the connection (reason
     delivery-failed, with the presence fact) and re-dials outbound
     subscribers on delivery failure. The end-to-end abort/re-dial workerd
     regression tests ported verbatim and pass against this structure.
   - Agent scheduler: scheduledOffset is optional (old checkpoints still
     parse) with a history-read fallback, and the llm-request-requested
     handoff append re-arms the timer and retries on failure instead of
     dropping the turn. 1471's snapshot()/processEventBatch re-arm hooks are
     deliberately NOT ported: in this branch's architecture every
     re-handshake delivers a subscriber-connected fact, which is the one
     recovery trigger. Their recovery scenarios are covered by three new
     connected-driven tests (history fallback, no-double-arm with a live
     timer, handoff retry).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@jonastemplestein

Copy link
Copy Markdown
Contributor Author

Absorbed into #1460 (commit 76e501a), adapted to that branch's structure:

  • M1 kernel fix ported wholesale: rpc-lifecycle.ts (onDeliveryError + dropped Object.hasOwn guard) and its unit tests verbatim; the Stream-DO-side drop-and-redial now lives in CoreStreamProcessor.openConnection (connection management moved there on that branch) and additionally appends a subscriber-disconnected presence fact with reason delivery-failed. The end-to-end abort/re-dial workerd regression tests ported verbatim and pass — they also caught a missing hunk during the port, so they bite.
  • Agent scheduler: streams: subscriber presence facts + reconciler homogenization #1460 already had the durable scheduled-offset + recovery (triggered by the subscriber-connected presence fact every re-handshake appends, rather than snapshot()/processEventBatch hooks). Adopted from here: scheduledOffset being optional with the history-read fallback (so pre-fix prod checkpoints parse and recover), and the handoff append retry. Your recovery scenarios are covered by three new connected-driven tests.
  • tasks/streams-review-fixes.md M1 checkboxes carried over.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant