agent: recover triggering inputs skipped by the side-effect anchor by jonastemplestein · Pull Request #1481 · iterate/iterate

jonastemplestein · 2026-06-10T21:50:55Z

What was broken

Slack agents in prd receive messages but never reply — no LLM request is ever made. Observed live on 2026-06-10 in iterate project stream /agents/slack/c08r1smtzgd/ts-1781124999-011519:

the slack-agent processor rendered the webhook into a triggering agent/input-added at offset 9 (20:56:46.2)
the agent processor's subscription-configured event landed at offset 15 (20:56:46.7) — the AGENT DO wake hook appends it after D1 reads and workspace setup, so slack-agent reliably wins this race on a cold thread
the host anchors side effects at the subscription-configured offset (stream-processor-host.ts), so the input at offset 9 was reduced as historical replay and its scheduling side effect was skipped: no llm-request-scheduled, no llm-request-requested, no openai-ws activity, no reply — and nothing ever retriggers it
visible fingerprint in the stream: capability-noted renders exist only for offsets above the anchor (18–23), none for 8–9

The anchor mechanism is correct for re-attach (don't re-fire historical LLM requests), but it shipped in #1402 without anything making the first message of a new thread durable. Regression from #1402, same symptom as #1372 but a different mechanism. Every first message of every new prod Slack thread is dropped.
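
The anchoring behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in stream-processor-host.ts: the `StreamEvent`, `Processor`, and `replayWithAnchor` names are hypothetical, and only `sideEffectsAfterOffset` and the event names come from the PR text.

```typescript
// Hypothetical shapes; the real host types may differ.
type StreamEvent = { offset: number; type: string };

type Processor<S> = {
  reduce: (state: S, event: StreamEvent) => S;
  // Side effects (for example scheduling an LLM request) for a single event.
  sideEffects: (state: S, event: StreamEvent) => string[];
};

// Reduces every event into state, but only runs side effects for events
// strictly above the anchor (the subscription-configured offset).
function replayWithAnchor<S>(
  processor: Processor<S>,
  initial: S,
  events: StreamEvent[],
  sideEffectsAfterOffset: number,
): { state: S; effects: string[] } {
  let state = initial;
  const effects: string[] = [];
  for (const event of events) {
    state = processor.reduce(state, event);
    if (event.offset > sideEffectsAfterOffset) {
      effects.push(...processor.sideEffects(state, event));
    }
  }
  return { state, effects };
}

// Toy processor: counts inputs and wants to schedule a request per input.
const counter: Processor<{ inputs: number }> = {
  reduce: (s, e) => (e.type === "agent/input-added" ? { inputs: s.inputs + 1 } : s),
  sideEffects: (_s, e) =>
    e.type === "agent/input-added" ? [`agent/llm-request-scheduled@${e.offset}`] : [],
};

// The incident shape: trigger at offset 9, anchor at offset 15.
const incident = replayWithAnchor(
  counter,
  { inputs: 0 },
  [
    { offset: 9, type: "agent/input-added" },
    { offset: 15, type: "subscription-configured" },
  ],
  15,
);
```

In this sketch the input at offset 9 is counted in state but produces no scheduling effect, which is the silent-drop shape seen in the incident stream.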

The fix

Make the trigger a durable obligation in reduced state instead of a fire-and-forget side effect:

AgentState.pendingTriggerOffset — set by a triggering input-added, cleared by llm-request-scheduled / llm-request-requested / llm-request-queued. If it survives in reduced state, the scheduling side effect never ran.
subscriber-connected reconciliation recovers it (the presence fact always lands above the anchor, so this handler always runs live): schedule a request when idle, append the queued fact when a request is in flight (never interrupts in-flight work). Appends are keyed off the trigger event exactly like the live path (agent/llm-request-scheduled@<offset>), so raced duplicates dedup in the stream.
Gated on pendingTriggerOffset <= sideEffectsAfterOffset so recovery fires only for anchor-skipped triggers and never races the live input-added handler. Crash/restart cases above the anchor remain owned by the existing scheduled-phase reconciliation.
StreamProcessor.processEvent args now expose sideEffectsAfterOffset (the batch-level hook already had it); the core processor's inline path passes 0 (inline appends are always live).
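
A minimal sketch of the obligation tracking and the gated recovery listed above, under simplifying assumptions. The event payloads, the `triggering` and `requestInFlight` fields, the `Recovery` type, the function names, and the key format for the queued fact are all hypothetical; only `pendingTriggerOffset`, `sideEffectsAfterOffset`, the event names, and the `agent/llm-request-scheduled@<offset>` key come from the description.

```typescript
// Hypothetical, simplified event and state shapes.
type AgentEvent =
  | { type: "input-added"; offset: number; triggering: boolean }
  | { type: "llm-request-scheduled"; offset: number }
  | { type: "llm-request-requested"; offset: number }
  | { type: "llm-request-queued"; offset: number }
  | { type: "llm-request-completed"; offset: number }
  | { type: "subscriber-connected"; offset: number };

type AgentState = {
  pendingTriggerOffset: number | null;
  requestInFlight: boolean; // assumed stand-in for "a request is in flight"
};

const initialState: AgentState = { pendingTriggerOffset: null, requestInFlight: false };

// Pure reducer: a triggering input sets the obligation, any durable
// schedule/request/queue fact clears it.
function reduceAgent(state: AgentState, event: AgentEvent): AgentState {
  switch (event.type) {
    case "input-added":
      return event.triggering ? { ...state, pendingTriggerOffset: event.offset } : state;
    case "llm-request-scheduled":
    case "llm-request-queued":
      return { ...state, pendingTriggerOffset: null };
    case "llm-request-requested":
      return { pendingTriggerOffset: null, requestInFlight: true };
    case "llm-request-completed":
      return { ...state, requestInFlight: false };
    default:
      return state;
  }
}

type Recovery =
  | { kind: "none" }
  | { kind: "schedule"; idempotencyKey: string }
  | { kind: "queue"; idempotencyKey: string };

// Decision made while handling subscriber-connected. Recovery only fires for
// triggers at or below the anchor, so it never races the live input handler.
function recoverOnSubscriberConnected(
  state: AgentState,
  sideEffectsAfterOffset: number,
): Recovery {
  const pending = state.pendingTriggerOffset;
  if (pending === null || pending > sideEffectsAfterOffset) return { kind: "none" };
  return state.requestInFlight
    ? // The queued key format is an assumption made for this sketch.
      { kind: "queue", idempotencyKey: `agent/llm-request-queued@${pending}` }
    : { kind: "schedule", idempotencyKey: `agent/llm-request-scheduled@${pending}` };
}

function replayAgent(events: AgentEvent[]): AgentState {
  return events.reduce(reduceAgent, initialState);
}
```

Replaying the prod stream shape (a triggering input at offset 9, anchor at 15) through `replayAgent` and then calling `recoverOnSubscriberConnected(state, 15)` yields a `schedule` decision keyed `agent/llm-request-scheduled@9`, while a trigger at offset 26 with the same anchor yields `none`.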

The scheduled phase needs no queued fact on recovery: its handoff rebuilds the request body from full committed history, which already includes the skipped trigger.

Verification

Unit tests replay the prod stream shape: trigger below anchor + subscriber-connected above → exactly one llm-request-scheduled@9; non-triggering inputs don't recover; in-flight requests get a queued fact; live triggers aren't double-scheduled.
New token-gated e2e (schedules and completes an LLM request for a plain routed Slack message) drives a real Slack root message + routed webhook through webhook → input → scheduled → requested → completed(success) against a live deployment.
pnpm typecheck && pnpm lint && pnpm format && pnpm test all green.
E2E run against the preview deployment with the real Slack bot token: results to follow in a comment.

🤖 Generated with Claude Code

Note

High Risk
Changes core agent LLM scheduling and subscriber-connected reconciliation on a production outage path; incorrect gating could double-schedule or miss triggers on every new Slack thread.

Overview
Fixes first-message silence on new Slack thread streams when a triggering input-added lands before the agent subscription is configured: the host’s side-effect anchor replays that input into state but skips scheduling, so no LLM turn ever starts.

Agent processor now records pendingTriggerOffset in reduced state for triggering inputs and clears it when a durable schedule/request/queue fact exists. On subscriber-connected, when that offset is at or below the anchor, it recovers the missed obligation—llm-request-scheduled when idle (same idempotency key as the live path) or llm-request-queued when a request is already in flight—without double-scheduling live triggers above the anchor. #appendLlmRequestScheduled arms the debounce timer with the committed requestId after idempotent dedup so raced recovery paths don’t wedge the handoff.

Streams: processEvent receives sideEffectsAfterOffset so reconcilers can detect anchor-skipped side effects; the core inline path passes 0 (always live).

Verification: new unit coverage for anchor-skip recovery, deduped schedule, queue-when-busy, and no recovery for non-triggering inputs; token-gated e2e asserts routed Slack webhook → scheduled → requested → completed(success).

Reviewed by Cursor Bugbot for commit eb3a7ac. Bugbot is set up for automated code reviews on this repo.

Environment Config Lease

No active environment config lease.

OS

Status: released
Commit: eb3a7ac
Preview: https://os.iterate-preview-4.com
Summary: Preview app released.
Workflow run
Updated: 2026-06-10T22:04:58.830Z

On a freshly bootstrapped Slack thread stream, the slack-agent processor renders the webhook into a triggering agent/input-added before the agent processor's subscription is configured (the AGENT DO wake hook appends those subscription-configured events after D1 reads and workspace setup, and slack-agent reliably wins that race). The host anchors side effects at the subscription-configured offset, so the trigger is reduced as historical replay and its scheduling side effect never runs: no llm-request-scheduled, no llm-request-requested, no LLM turn. The agent silently never replies to the first message of every new thread.

Regression from #1402; observed in prd on 2026-06-10 (/agents/slack/c08r1smtzgd/ts-1781124999-011519: input at offset 9, agent subscription configured at offset 15, stream ends with no request events).

The fix makes the trigger a durable obligation in reduced state instead of a fire-and-forget side effect:

- AgentState gains pendingTriggerOffset: set by a triggering input-added, cleared by llm-request-scheduled / llm-request-requested / llm-request-queued. If it survives in reduced state, the scheduling side effect never ran.
- The subscriber-connected reconciliation (which always runs live, because the presence fact lands above the anchor) recovers it: schedule a request when idle, or append the queued fact when a request is in flight. Appends are keyed off the trigger event exactly like the live path, so raced duplicates dedup in the stream. The recovery is gated on pendingTriggerOffset <= sideEffectsAfterOffset so it fires only for anchor-skipped triggers and never races the live input-added handler.
- StreamProcessor's processEvent args now expose sideEffectsAfterOffset (the default batch fan-out already had it) so handlers can tell whether an earlier event's side effects were skipped as historical.

Covered by unit tests replaying the prod stream shape, plus a token-gated e2e test that drives a plain routed Slack message through the full webhook → input → scheduled → requested → completed chain against a live deployment.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Multiple subscriber-connected facts land in quick succession (one per co-hosted processor), so recovery appends can race each other, and a batch retry can re-run the live path's append. The stream dedups them under the shared idempotency key and returns the existing event, but the timer was armed with the local requestId; the handoff re-reads durable history and bails on a mismatch, wedging the turn until the next subscriber-connected. Adopt the committed payload's requestId so every racer converges on the durable schedule.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
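
The convergence described in this commit message can be illustrated with a small in-memory sketch. `FakeStream`, `ScheduledEvent`, and `appendScheduledAndPickRequestId` are hypothetical stand-ins, not the real stream API or the actual `#appendLlmRequestScheduled` method; the sketch only assumes that an append under an existing idempotency key returns the already-committed event.

```typescript
type ScheduledEvent = { idempotencyKey: string; requestId: string };

// In-memory stand-in for a stream that dedups appends by idempotency key
// and hands back the event that was actually committed.
class FakeStream {
  private byKey = new Map<string, ScheduledEvent>();

  append(event: ScheduledEvent): ScheduledEvent {
    const existing = this.byKey.get(event.idempotencyKey);
    if (existing !== undefined) return existing;
    this.byKey.set(event.idempotencyKey, event);
    return event;
  }
}

// Returns the requestId the debounce timer should be armed with: the
// committed one, which may differ from the caller's local requestId when
// another racer appended first.
function appendScheduledAndPickRequestId(
  stream: FakeStream,
  triggerOffset: number,
  localRequestId: string,
): string {
  const committed = stream.append({
    idempotencyKey: `agent/llm-request-scheduled@${triggerOffset}`,
    requestId: localRequestId,
  });
  return committed.requestId;
}

// Two racers recover the same trigger at offset 9 with different local ids.
const stream = new FakeStream();
const firstRacer = appendScheduledAndPickRequestId(stream, 9, "req-a");
const secondRacer = appendScheduledAndPickRequestId(stream, 9, "req-b");
```

Both racers end up holding `req-a`, so a handoff that re-reads durable history sees a matching requestId regardless of which caller armed the timer.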

jonastemplestein · 2026-06-10T22:10:32Z

E2E verification record (run against the leased preview-4 deployment of eb3a7ac, with the real Slack bot token from Doppler preview_4):

✅ schedules and completes an LLM request for a plain routed Slack message — the new regression e2e — passed in 16.7s: real root message posted to #slack-agent-e2e-test, routed webhook, agent/input-added → llm-request-scheduled → llm-request-requested → llm-request-completed(success), no stream/error-occurred.
✅ routes Slack webhooks into slack-agent streams and executes bang command replies
✅ lets a real agent conversation post to Slack through codemode
Full agents.e2e.test.ts run: 8/10 passed. The 2 failures (recovers and still replies when the agent host DO is killed mid-turn timeout; project config worker customizes fresh agents git-push ok:false) coincided exactly with the post-merge Preview / cleanup destroying the preview worker mid-suite (preview-4 now 522s; isolated retries fail at projects.create with MALFORMED_ORPC_ERROR_RESPONSE). Not code regressions — the kill-recovery branch is condition-for-condition identical and covered by unit tests.

Prd smoke test on the real iterate project to follow after the main deploy.

🤖 Generated with Claude Code

jonastemplestein · 2026-06-10T22:17:04Z

Prd smoke test — passed ✅

Verified on the real iterate project (prj_d871bac9722d45aba4e3dbb50057900d) after the main deploy, in the exact thread from the original incident (#test-blank, root ts 1781124999.011519):

Injected a human-shaped slack/webhook-received (thread reply mentioning the bot) into /integrations/slack; the real prd pipeline routed it to the existing thread stream.
The chain that was silently missing during the outage now fires end to end: agent/input-added@42 → agent/llm-request-scheduled@42 → llm-request-requested@47 → openai-ws/llm-request-started → agent/llm-request-completed (success) → itx/execution-completed { ok: true, durationMs: 7077 }.
The agent posted a real reply in the real Slack thread with the project's bot token: "Acknowledged — smoke test reply received in this thread." (1781129770.658629), confirmed via conversations.replies using the Doppler prd Slack token.

Note: a bot-authored root message (posted with the Doppler CI bot token) correctly does not trigger the agent — slack-agent skips bot messages at input render. The thread streams still bootstrap with both subscription sets in either race order (verified in the fresh stream ts-1781129521-178669: agent subscriptions at offsets 13–15, input at 26).

🤖 Generated with Claude Code

…s at the routing hop, pre-warmed hosts (#1494)

## The problem

A Slack message in prd took **~14s to get the 👀 reaction** and ~20s to get a reply (example: `iterate` project, thread `ts-1781170058-112929`). Hop-by-hop, from the message's Slack `ts`:

| Δ | what happened |
|---|---|
| +0.9s | Slack delivered the webhook; Slack was fast |
| **+6.5s** | nothing of ours executed anywhere: cold instantiation of SlackIntegrationDO + the integration StreamDO (handler: 8.1s wall, **5ms CPU**). Slack's 3s retry queued behind the same gate and doubled the work |
| +2.1s | integration DO init + subscription + append + routing |
| **+3.0s** | cold instantiation of the new thread StreamDO |
| **+1.4s** | cold dial of the SLACK_AGENT host DO → input rendered → eyes at ~14s |
| +6s | LLM leg (openai-ws connect 1.1s, gpt-5.5 ~2s, itx exec) → reply at ~20s |

Two multiplying causes: **the deployed script was 89.1 MB** (50 MB sourcemaps + browser-only modules uploaded as worker modules by alchemy's noBundle glob over `dist/server`; the live server graph is ~34 MB, the entrypoint 1.75 MB), which every cold DO isolate loads in full, times **3–4 distinct DOs chained serially** on the webhook path. The warm path was always fine (webhook 1–6ms, appends 20–100ms): this is cold-start tax, not stream-architecture tax.

## The fixes (no change to the streams/processors idea)

1. **`prune-server-bundle.ts`** (runs between build and asset preupload): deletes every `dist/server` module unreachable from the entrypoint via import/`new URL` literals (browser web workers + their wasm that the SSR build emits), plus all sourcemaps **except the entrypoint's own** (small; the one Cloudflare can symbolicate worker stack traces with; chunk maps are browser code and pure ballast inside a worker script). Validated against the extracted prd bundle: keeps exactly the 186-module live graph, deletes the 3 browser-only modules + chunk maps.
2. **Append-only webhook ack**: the handler no longer awaits `SlackIntegrationDO.initialize()` before responding: only the durable append gates the 200; initialize + catch-up moved to `waitUntil`. Order-independent (existing integrations have their subscription on the stream; new ones pick the webhook up via replay). Stops the >3s Slack retry storm.
3. **👀 at the routing hop**: the slack router reports routed webhooks to its host (`acknowledgeRoutedWebhook`) and SlackIntegrationDO adds the reaction immediately (one hop from ingress instead of three cold DO hops downstream), gated by the same payload-only rules the slack-agent applies (no bot messages, no reaction events, no bot-user actions). slack-agent still adds it on catch-up; `already_reacted` makes the pair idempotent.
4. **Pre-warmed hosts** (`prewarmRoutedStreamHosts`): for a newly routed thread, the SLACK_AGENT and AGENT host DOs `initialize()` concurrently with the bootstrap append instead of serially after each dial. Everything either side appends is idempotency-keyed and order-independent (the anchor-skip recovery from #1481 covers trigger ordering).

## Measured

Dev-stage deploys of this branch (`os-dev-jonas`):

- prd today: **89.1 MB**
- this branch pre-#1486 baseline: **34.1 MB**
- this branch on latest main (includes #1486's SSR-graph shrink, 186→178 live modules): **28.3 MB**, 3.1× smaller; app smoke-tested (sign-in 200)
- prune log on the real prd bundle: `kept 186 modules, deleted 3 unreachable modules + 180 sourcemaps (55.0 MB)`

Expected effect: each cold DO instantiation drops from multi-second to sub-second, and the eyes ack stops depending on the deepest part of the chain. Worth re-measuring the full message→eyes timing in prd after this deploys.

## Trade-offs / notes

- Chunk-level deployed stack traces lose symbolication (entrypoint map kept). Symbolicate locally against the build output if needed.
- The prune is conservative: anything referenced by a quoted relative specifier (`from`, `import()`, `export from`, `new URL`) stays. The unreachable set on the real bundle is exactly the browser-only web workers + wasm.
- Follow-up idea (not this PR): split app-vs-platform workers so UI deploys stop evicting agent/stream DOs (the 2026-06-10 deploy-race incident), and consider per-DO-class workers for deploy isolation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---

> [!NOTE]
> **Medium Risk**
> Changes production Slack webhook timing, adds best-effort Slack API calls on the routing path, and alters deploy artifacts via bundle pruning; behavior is designed to be idempotent but affects a critical user-visible path.
>
> **Overview**
> Cuts Slack cold-path latency by shrinking the deployed worker and parallelizing work on the webhook path.
>
> **Deploy:** Adds `prune-server-bundle` to the Alchemy build (after Vite, before asset preupload). It strips unreachable `dist/server` modules and most sourcemaps so each cold Durable Object isolate loads a much smaller script.
>
> **Webhook ingress:** The Slack webhook handler now returns `{ ok: true }` after the durable stream append only; `SlackIntegrationDO.initialize()` / `ensureReady()` run in `waitUntil`, avoiding >3s acks and Slack retries.
>
> **Routing hop:** `SlackProcessor` gains optional `acknowledgeRoutedWebhook` and `prewarmRoutedStreamHosts`. The integration DO adds the 👀 reaction at route time (via `eyesReactionTargetFromWebhookPayload` + `reactions.add`) and pre-initializes `SLACK_AGENT` and `AGENT` DOs in parallel with new-thread bootstrap. Downstream slack-agent behavior stays idempotent (`already_reacted`).
>
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit 8cf05b1. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup>

## Environment Config Lease

Lease: `preview-3`
Doppler config: `preview_3`
Type: `environment-config-lease`
Leased until: 2026-06-11T11:58:15.710Z

### OS

Status: deployed
Commit: `8cf05b1`
Preview: https://os.iterate-preview-3.com
[Workflow run](https://github.com/iterate/iterate/actions/runs/27342005408)
Updated: 2026-06-11T11:02:23.553Z

### Semaphore

Status: deployed
Commit: `8cf05b1`
Preview: https://semaphore.iterate-preview-3.com
[Workflow run](https://github.com/iterate/iterate/actions/runs/27342005408)
Updated: 2026-06-11T10:59:58.014Z

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
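
The conservative reachability rule attributed to `prune-server-bundle.ts` in the commit message above could look roughly like this. It is an in-memory sketch under stated assumptions, not the actual script: `resolveRelative`, `reachableModules`, the regular expression, and the example file names are hypothetical, and the sketch treats any quoted relative specifier as a reference, which over-approximates the `from` / `import()` / `export from` / `new URL` cases.

```typescript
// Resolve a relative specifier such as "./a.js" or "../b.wasm" against the
// path of the module that references it.
function resolveRelative(importer: string, specifier: string): string {
  const parts = importer.split("/").slice(0, -1);
  for (const segment of specifier.split("/")) {
    if (segment === "." || segment === "") continue;
    if (segment === "..") parts.pop();
    else parts.push(segment);
  }
  return parts.join("/");
}

// Walk the module graph from the entrypoint. `modules` maps a module path to
// its source text; anything never reached is a candidate for deletion.
function reachableModules(modules: Map<string, string>, entrypoint: string): Set<string> {
  const seen = new Set<string>();
  const queue: string[] = [entrypoint];
  while (queue.length > 0) {
    const current = queue.pop() as string;
    if (seen.has(current)) continue;
    const source = modules.get(current);
    if (source === undefined) continue; // not part of the bundle
    seen.add(current);
    // Fresh regex per module so lastIndex state never leaks between sources.
    const relativeSpecifier = /["'](\.{1,2}\/[^"']+)["']/g;
    let match: RegExpExecArray | null;
    while ((match = relativeSpecifier.exec(source)) !== null) {
      queue.push(resolveRelative(current, match[1] as string));
    }
  }
  return seen;
}

// Made-up example bundle: one chunk, one wasm file, one browser-only worker.
const bundle = new Map<string, string>([
  ["dist/server/index.js", 'import { a } from "./chunks/a.js";'],
  ["dist/server/chunks/a.js", 'export const a = new URL("../worker.wasm", import.meta.url);'],
  ["dist/server/worker.wasm", ""],
  ["dist/server/browser-worker.js", 'import "./chunks/a.js";'],
]);

const kept = reachableModules(bundle, "dist/server/index.js");
```

Here `kept` contains the entrypoint, the chunk, and the wasm file, while `dist/server/browser-worker.js` is never reached because nothing in the live graph references it.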

jonastemplestein and others added 2 commits June 10, 2026 22:50

jonastemplestein merged commit 9ad6f0d into main Jun 10, 2026
9 checks passed

jonastemplestein deleted the ahead-nautilus branch June 10, 2026 22:03

jonastemplestein mentioned this pull request Jun 11, 2026

Slack latency: 3× smaller worker script, append-only webhook ack, eyes at the routing hop, pre-warmed hosts #1494

Merged
