agent: recover triggering inputs skipped by the side-effect anchor #1481
On a freshly bootstrapped Slack thread stream, the slack-agent processor renders the webhook into a triggering agent/input-added before the agent processor's subscription is configured (the AGENT DO wake hook appends those subscription-configured events after D1 reads and workspace setup, and slack-agent reliably wins that race). The host anchors side effects at the subscription-configured offset, so the trigger is reduced as historical replay and its scheduling side effect never runs: no llm-request-scheduled, no llm-request-requested, no LLM turn. The agent silently never replies to the first message of every new thread.

Regression from #1402; observed in prd on 2026-06-10 (/agents/slack/c08r1smtzgd/ts-1781124999-011519: input at offset 9, agent subscription configured at offset 15, stream ends with no request events).

The fix makes the trigger a durable obligation in reduced state instead of a fire-and-forget side effect:

- AgentState gains pendingTriggerOffset: set by a triggering input-added, cleared by llm-request-scheduled / llm-request-requested / llm-request-queued. If it survives in reduced state, the scheduling side effect never ran.
- The subscriber-connected reconciliation (which always runs live, because the presence fact lands above the anchor) recovers it: schedule a request when idle, or append the queued fact when a request is in flight. Appends are keyed off the trigger event exactly like the live path, so raced duplicates dedup in the stream. The recovery is gated on pendingTriggerOffset <= sideEffectsAfterOffset so it fires only for anchor-skipped triggers and never races the live input-added handler.
- StreamProcessor's processEvent args now expose sideEffectsAfterOffset (the default batch fan-out already had it) so handlers can tell whether an earlier event's side effects were skipped as historical.
Covered by unit tests replaying the prod stream shape, plus a token-gated e2e test that drives a plain routed Slack message through the full webhook → input → scheduled → requested → completed chain against a live deployment.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Multiple subscriber-connected facts land in quick succession (one per co-hosted processor), so recovery appends can race each other, and a batch retry can re-run the live path's append. The stream dedups them under the shared idempotency key and returns the existing event, but the timer was armed with the local requestId; the handoff re-reads durable history and bails on a mismatch, wedging the turn until the next subscriber-connected. Adopt the committed payload's requestId so every racer converges on the durable schedule.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
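The requestId adoption described in this commit can be sketched as follows. This is a minimal illustration, not the code from the PR: `FakeStream`, `RequestScheduler`, `ScheduledPayload`, and the `armedRequestId` field are hypothetical stand-ins, and the debounce timer is modelled as a plain field. Only the key shape `agent/llm-request-scheduled@<offset>` is taken from the PR description.

```typescript
// Hypothetical shapes for illustration; the real payload and stream API may differ.
interface ScheduledPayload {
  requestId: string;
  triggerOffset: number;
}

// In-memory stand-in for a stream that dedups appends by idempotency key
// and hands the already-committed payload back to later racers.
class FakeStream {
  private committed = new Map<string, ScheduledPayload>();

  append(idempotencyKey: string, payload: ScheduledPayload): ScheduledPayload {
    const existing = this.committed.get(idempotencyKey);
    if (existing !== undefined) return existing;
    this.committed.set(idempotencyKey, payload);
    return payload;
  }
}

class RequestScheduler {
  // Stands in for the debounce timer: records the requestId it was armed with.
  armedRequestId: string | null = null;
  private stream: FakeStream;
  private localRequestId: string;

  constructor(stream: FakeStream, localRequestId: string) {
    this.stream = stream;
    this.localRequestId = localRequestId;
  }

  scheduleForTrigger(triggerOffset: number): string {
    const committed = this.stream.append(`agent/llm-request-scheduled@${triggerOffset}`, {
      requestId: this.localRequestId,
      triggerOffset,
    });
    // Adopt the committed requestId. It differs from the local one whenever
    // another racer's append won the dedup.
    this.armedRequestId = committed.requestId;
    return committed.requestId;
  }
}
```

With two schedulers racing on the same trigger offset, both end up armed with the first committed requestId, so a handoff that re-reads durable history sees a matching id for either racer.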
E2E verification record (run against the leased
Prd smoke test on the real 🤖 Generated with Claude Code |
Prd smoke test — passed ✅ Verified on the real
Note: a bot-authored root message (posted with the Doppler CI bot token) correctly does not trigger the agent — 🤖 Generated with Claude Code |
…s at the routing hop, pre-warmed hosts (#1494)

## The problem

A Slack message in prd took **~14s to get the 👀 reaction** and ~20s to get a reply (example: `iterate` project, thread `ts-1781170058-112929`). Hop-by-hop, from the message's Slack `ts`:

| Δ | what happened |
|---|---|
| +0.9s | Slack delivered the webhook; Slack was fast |
| **+6.5s** | nothing of ours executed anywhere: cold instantiation of SlackIntegrationDO + the integration StreamDO (handler: 8.1s wall, **5ms CPU**). Slack's 3s retry queued behind the same gate and doubled the work |
| +2.1s | integration DO init + subscription + append + routing |
| **+3.0s** | cold instantiation of the new thread StreamDO |
| **+1.4s** | cold dial of the SLACK_AGENT host DO → input rendered → eyes at ~14s |
| +6s | LLM leg (openai-ws connect 1.1s, gpt-5.5 ~2s, itx exec) → reply at ~20s |

Two multiplying causes: **the deployed script was 89.1 MB** (50 MB sourcemaps + browser-only modules uploaded as worker modules by alchemy's noBundle glob over `dist/server`; the live server graph is ~34 MB, the entrypoint 1.75 MB), and every cold DO isolate loads all of it, times **3–4 distinct DOs chained serially** on the webhook path. The warm path was always fine (webhook 1–6ms, appends 20–100ms): this is cold-start tax, not stream-architecture tax.

## The fixes (no change to the streams/processors idea)

1. **`prune-server-bundle.ts`** (runs between build and asset preupload): deletes every `dist/server` module unreachable from the entrypoint via import/`new URL` literals (browser web workers + their wasm that the SSR build emits), plus all sourcemaps **except the entrypoint's own** (small; the one Cloudflare can symbolicate worker stack traces with; chunk maps are browser code and pure ballast inside a worker script). Validated against the extracted prd bundle: keeps exactly the 186-module live graph, deletes the 3 browser-only modules + chunk maps.
2. **Append-only webhook ack**: the handler no longer awaits `SlackIntegrationDO.initialize()` before responding: only the durable append gates the 200; initialize + catch-up moved to `waitUntil`. Order-independent (existing integrations have their subscription on the stream; new ones pick the webhook up via replay). Stops the >3s Slack retry storm.
3. **👀 at the routing hop**: the slack router reports routed webhooks to its host (`acknowledgeRoutedWebhook`) and SlackIntegrationDO adds the reaction immediately (one hop from ingress instead of three cold DO hops downstream), gated by the same payload-only rules the slack-agent applies (no bot messages, no reaction events, no bot-user actions). slack-agent still adds it on catch-up; `already_reacted` makes the pair idempotent.
4. **Pre-warmed hosts** (`prewarmRoutedStreamHosts`): for a newly routed thread, the SLACK_AGENT and AGENT host DOs `initialize()` concurrently with the bootstrap append instead of serially after each dial. Everything either side appends is idempotency-keyed and order-independent (the anchor-skip recovery from #1481 covers trigger ordering).

## Measured

Dev-stage deploys of this branch (`os-dev-jonas`):

- prd today: **89.1 MB**
- this branch pre-#1486 baseline: **34.1 MB**
- this branch on latest main (includes #1486's SSR-graph shrink, 186→178 live modules): **28.3 MB**, 3.1× smaller; app smoke-tested (sign-in 200)
- prune log on the real prd bundle: `kept 186 modules, deleted 3 unreachable modules + 180 sourcemaps (55.0 MB)`

Expected effect: each cold DO instantiation drops from multi-second to sub-second, and the eyes ack stops depending on the deepest part of the chain. Worth re-measuring the full message→eyes timing in prd after this deploys.

## Trade-offs / notes

- Chunk-level deployed stack traces lose symbolication (entrypoint map kept). Symbolicate locally against the build output if needed.
- The prune is conservative: anything referenced by a quoted relative specifier (`from`, `import()`, `export from`, `new URL`) stays. The unreachable set on the real bundle is exactly the browser-only web workers + wasm.
- Follow-up idea (not this PR): split app-vs-platform workers so UI deploys stop evicting agent/stream DOs (the 2026-06-10 deploy-race incident), and consider per-DO-class workers for deploy isolation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Changes production Slack webhook timing, adds best-effort Slack API calls on the routing path, and alters deploy artifacts via bundle pruning; behavior is designed to be idempotent but affects a critical user-visible path.
>
> **Overview**
> Cuts Slack cold-path latency by shrinking the deployed worker and parallelizing work on the webhook path.
>
> **Deploy:** Adds `prune-server-bundle` to the Alchemy build (after Vite, before asset preupload). It strips unreachable `dist/server` modules and most sourcemaps so each cold Durable Object isolate loads a much smaller script.
>
> **Webhook ingress:** The Slack webhook handler now returns `{ ok: true }` after the durable stream append only; `SlackIntegrationDO.initialize()` / `ensureReady()` run in `waitUntil`, avoiding >3s acks and Slack retries.
>
> **Routing hop:** `SlackProcessor` gains optional `acknowledgeRoutedWebhook` and `prewarmRoutedStreamHosts`. The integration DO adds the 👀 reaction at route time (via `eyesReactionTargetFromWebhookPayload` + `reactions.add`) and pre-initializes `SLACK_AGENT` and `AGENT` DOs in parallel with new-thread bootstrap. Downstream slack-agent behavior stays idempotent (`already_reacted`).
>
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit 8cf05b1. Bugbot is set up for automated code reviews on this repo.
Configure [here](https://www.cursor.com/dashboard/bugbot).</sup>

<!-- /CURSOR_SUMMARY -->

<!-- CLOUDFLARE_PREVIEW -->
## Environment Config Lease

Lease: `preview-3`
Doppler config: `preview_3`
Type: `environment-config-lease`
Leased until: 2026-06-11T11:58:15.710Z

### OS

Status: deployed
Commit: `8cf05b1`
Preview: https://os.iterate-preview-3.com
[Workflow run](https://github.com/iterate/iterate/actions/runs/27342005408)
Updated: 2026-06-11T11:02:23.553Z

### Semaphore

Status: deployed
Commit: `8cf05b1`
Preview: https://semaphore.iterate-preview-3.com
[Workflow run](https://github.com/iterate/iterate/actions/runs/27342005408)
Updated: 2026-06-11T10:59:58.014Z
<!-- /CLOUDFLARE_PREVIEW -->

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
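The reachability prune described in the #1494 commit message above can be sketched roughly as follows. This is an illustrative sketch over an in-memory file map, not the contents of `prune-server-bundle.ts`: `reachableFrom`, `filesToDelete`, `resolveRelative`, and the regex are assumptions made for the example. It follows the stated rule that anything referenced by a quoted relative specifier stays, and that only the entrypoint's own sourcemap is kept.

```typescript
// Matches any quoted relative specifier ("./x" or "../x"), whether it appears in
// `from`, `import()`, `export from`, or `new URL`. Deliberately conservative.
const RELATIVE_SPECIFIER = /["'](\.{1,2}\/[^"']+)["']/;

// Resolve a relative specifier against the importing file's directory.
function resolveRelative(fromFile: string, specifier: string): string {
  const parts = fromFile.split("/").slice(0, -1);
  for (const segment of specifier.split("/")) {
    if (segment === "." || segment === "") continue;
    if (segment === "..") parts.pop();
    else parts.push(segment);
  }
  return parts.join("/");
}

// Depth-first walk from the entrypoint over every quoted relative specifier.
function reachableFrom(files: Map<string, string>, entry: string): Set<string> {
  const seen = new Set<string>();
  const stack: string[] = [entry];
  while (stack.length > 0) {
    const file = stack.pop() as string;
    const source = files.get(file);
    if (source === undefined || seen.has(file)) continue;
    seen.add(file);
    const pattern = new RegExp(RELATIVE_SPECIFIER.source, "g");
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(source)) !== null) {
      const specifier = match[1];
      if (specifier !== undefined) stack.push(resolveRelative(file, specifier));
    }
  }
  return seen;
}

// Everything unreachable is deleted, except the entrypoint's own sourcemap.
function filesToDelete(files: Map<string, string>, entry: string): string[] {
  const keep = reachableFrom(files, entry);
  keep.add(`${entry}.map`);
  return Array.from(files.keys())
    .filter((file) => !keep.has(file))
    .sort();
}
```

Because the walk treats any quoted relative string as a reference, it can only over-keep, never over-delete, which matches the conservative trade-off noted in the commit message.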
What was broken
Slack agents in prd receive messages but never reply: no LLM request is ever made. Observed live on 2026-06-10 in `iterate` project stream `/agents/slack/c08r1smtzgd/ts-1781124999-011519`:

- slack-agent rendered the webhook into a triggering `agent/input-added` at offset 9 (20:56:46.2)
- the agent processor's `subscription-configured` event landed at offset 15 (20:56:46.7); the AGENT DO wake hook appends it after D1 reads and workspace setup, so slack-agent reliably wins this race on a cold thread
- the host anchors side effects at the subscription-configured offset (`stream-processor-host.ts`), so the input at offset 9 was reduced as historical replay and its scheduling side effect was skipped: no `llm-request-scheduled`, no `llm-request-requested`, no `openai-ws` activity, no reply, and nothing ever retriggers it

The anchor mechanism is correct for re-attach (don't re-fire historical LLM requests), but it shipped in #1402 without anything making the first message of a new thread durable. Regression from #1402, same symptom as #1372 but a different mechanism. Every first message of every new prod Slack thread is dropped.
The fix
Make the trigger a durable obligation in reduced state instead of a fire-and-forget side effect:
- `AgentState.pendingTriggerOffset`: set by a triggering `input-added`, cleared by `llm-request-scheduled` / `llm-request-requested` / `llm-request-queued`. If it survives in reduced state, the scheduling side effect never ran.
- The `subscriber-connected` reconciliation recovers it (the presence fact always lands above the anchor, so this handler always runs live): schedule a request when idle, append the queued fact when a request is in flight (never interrupts in-flight work). Appends are keyed off the trigger event exactly like the live path (`agent/llm-request-scheduled@<offset>`), so raced duplicates dedup in the stream.
- Recovery is gated on `pendingTriggerOffset <= sideEffectsAfterOffset`, so it fires only for anchor-skipped triggers and never races the live `input-added` handler. Crash/restart cases above the anchor remain owned by the existing scheduled-phase reconciliation.
- `StreamProcessor.processEvent` args now expose `sideEffectsAfterOffset` (the batch-level hook already had it); the core processor's inline path passes 0 (inline appends are always live).

The scheduled phase needs no queued fact on recovery: its handoff rebuilds the request body from full committed history, which already includes the skipped trigger.
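As a rough illustration of the obligation tracking and the recovery gate above, here is a minimal sketch. It is not the processor's actual code: the event union, the `triggers` flag, the `requestInFlight` field, the `llm-request-completed` event name, and the queued-fact key format are simplifying assumptions. Only `pendingTriggerOffset`, `sideEffectsAfterOffset`, and the `agent/llm-request-scheduled@<offset>` key come from the description.

```typescript
// Hypothetical, simplified event shapes for illustration only.
type AgentEvent =
  | { type: "input-added"; offset: number; triggers: boolean }
  | { type: "llm-request-scheduled"; offset: number }
  | { type: "llm-request-requested"; offset: number }
  | { type: "llm-request-queued"; offset: number }
  | { type: "llm-request-completed"; offset: number } // assumed event name
  | { type: "subscriber-connected"; offset: number };

interface AgentState {
  pendingTriggerOffset: number | null;
  requestInFlight: boolean; // assumed: true between requested and completed
}

const initialState: AgentState = { pendingTriggerOffset: null, requestInFlight: false };

// Pure reducer: runs for every event, including historical replay below the anchor.
function reduce(state: AgentState, event: AgentEvent): AgentState {
  switch (event.type) {
    case "input-added":
      return event.triggers ? { ...state, pendingTriggerOffset: event.offset } : state;
    case "llm-request-scheduled":
    case "llm-request-queued":
      return { ...state, pendingTriggerOffset: null };
    case "llm-request-requested":
      return { pendingTriggerOffset: null, requestInFlight: true };
    case "llm-request-completed":
      return { ...state, requestInFlight: false };
    default:
      return state;
  }
}

type Recovery =
  | { kind: "none" }
  | { kind: "schedule"; idempotencyKey: string }
  | { kind: "queue"; idempotencyKey: string };

// Decision taken when a subscriber-connected fact is processed live.
function recoverOnSubscriberConnected(state: AgentState, sideEffectsAfterOffset: number): Recovery {
  const pending = state.pendingTriggerOffset;
  // Only anchor-skipped triggers are recovered; triggers above the anchor
  // belong to the live input-added handler.
  if (pending === null || pending > sideEffectsAfterOffset) return { kind: "none" };
  return state.requestInFlight
    ? { kind: "queue", idempotencyKey: `agent/llm-request-queued@${pending}` } // assumed key format
    : { kind: "schedule", idempotencyKey: `agent/llm-request-scheduled@${pending}` };
}
```

Replaying the prod shape (triggering input at offset 9, anchor at offset 15) leaves `pendingTriggerOffset` at 9, so the reconciliation chooses to schedule under the key `agent/llm-request-scheduled@9`; a trigger at an offset above the anchor yields no recovery.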
Verification
- Unit tests replay the prod stream shape: the anchor-skipped trigger recovers under the `llm-request-scheduled@9` key; non-triggering inputs don't recover; in-flight requests get a queued fact; live triggers aren't double-scheduled.
- Token-gated e2e test (`schedules and completes an LLM request for a plain routed Slack message`) drives a real Slack root message + routed webhook through webhook → input → scheduled → requested → completed(success) against a live deployment.
- `pnpm typecheck && pnpm lint && pnpm format && pnpm test` all green.

🤖 Generated with Claude Code
Note
High Risk
Changes core agent LLM scheduling and subscriber-connected reconciliation on a production outage path; incorrect gating could double-schedule or miss triggers on every new Slack thread.
Overview
Fixes first-message silence on new Slack thread streams when a triggering `input-added` lands before the agent subscription is configured: the host's side-effect anchor replays that input into state but skips scheduling, so no LLM turn ever starts.

Agent processor now records `pendingTriggerOffset` in reduced state for triggering inputs and clears it when a durable schedule/request/queue fact exists. On `subscriber-connected`, when that offset is at or below the anchor, it recovers the missed obligation (`llm-request-scheduled` when idle, with the same idempotency key as the live path, or `llm-request-queued` when a request is already in flight) without double-scheduling live triggers above the anchor. `#appendLlmRequestScheduled` arms the debounce timer with the committed `requestId` after idempotent dedup so raced recovery paths don't wedge the handoff.

Streams: `processEvent` receives `sideEffectsAfterOffset` so reconcilers can detect anchor-skipped side effects; the core inline path passes `0` (always live).

Verification: new unit coverage for anchor-skip recovery, deduped schedule, queue-when-busy, and no recovery for non-triggering inputs; token-gated e2e asserts routed Slack webhook → scheduled → requested → completed(success).
Reviewed by Cursor Bugbot for commit eb3a7ac. Bugbot is set up for automated code reviews on this repo. Configure here.
Environment Config Lease
No active environment config lease.
OS
Status: released
Commit: `eb3a7ac`
Preview: https://os.iterate-preview-4.com
Summary: Preview app released.
Workflow run
Updated: 2026-06-10T22:04:58.830Z