
agent: recover triggering inputs skipped by the side-effect anchor#1481

Merged
jonastemplestein merged 2 commits into main from ahead-nautilus on Jun 10, 2026

@jonastemplestein jonastemplestein commented Jun 10, 2026


What was broken

Slack agents in prd receive messages but never reply — no LLM request is ever made. Observed live on 2026-06-10 in iterate project stream /agents/slack/c08r1smtzgd/ts-1781124999-011519:

  • the slack-agent processor rendered the webhook into a triggering agent/input-added at offset 9 (20:56:46.2)
  • the agent processor's subscription-configured event landed at offset 15 (20:56:46.7) — the AGENT DO wake hook appends it after D1 reads and workspace setup, so slack-agent reliably wins this race on a cold thread
  • the host anchors side effects at the subscription-configured offset (stream-processor-host.ts), so the input at offset 9 was reduced as historical replay and its scheduling side effect was skipped: no llm-request-scheduled, no llm-request-requested, no openai-ws activity, no reply — and nothing ever retriggers it
  • visible fingerprint in the stream: capability-noted renders exist only for offsets above the anchor (18–23), none for 8–9

The anchor mechanism is correct for re-attach (don't re-fire historical LLM requests), but it shipped in #1402 without anything making the first message of a new thread durable. Regression from #1402, same symptom as #1372 but a different mechanism. Every first message of every new prod Slack thread is dropped.
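
The anchor behaviour described above can be sketched as a single comparison. This is a minimal model, not the code in stream-processor-host.ts: the `StreamEvent` shape and the `runsSideEffects` helper are hypothetical, and offset 18 for the presence fact is an assumed value inside the 18–23 range quoted above. Offsets 9 and 15 come from the incident stream.

```typescript
// Hypothetical, simplified model of the host's side-effect anchor.
type StreamEvent = { offset: number; type: string };

// Events at or below the anchor are reduced as historical replay;
// only events above the anchor run their side effects.
function runsSideEffects(event: StreamEvent, sideEffectsAfterOffset: number): boolean {
  return event.offset > sideEffectsAfterOffset;
}

// Offset of the agent processor's subscription-configured event in the incident.
const anchor = 15;

// The triggering input rendered by slack-agent, which won the race.
const trigger: StreamEvent = { offset: 9, type: "agent/input-added" };

// Assumed offset for a presence fact landing above the anchor.
const presence: StreamEvent = { offset: 18, type: "subscriber-connected" };

// The trigger is treated as replay, so its scheduling side effect is skipped.
const triggerIsLive = runsSideEffects(trigger, anchor);

// The presence fact is live, which is why reconciliation can hang off it.
const presenceIsLive = runsSideEffects(presence, anchor);
```

In this model the trigger's state change is still reduced, which is what leaves room for a durable marker in reduced state to record the skipped obligation.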

The fix

Make the trigger a durable obligation in reduced state instead of a fire-and-forget side effect:

  • AgentState.pendingTriggerOffset — set by a triggering input-added, cleared by llm-request-scheduled / llm-request-requested / llm-request-queued. If it survives in reduced state, the scheduling side effect never ran.
  • subscriber-connected reconciliation recovers it (the presence fact always lands above the anchor, so this handler always runs live): schedule a request when idle, append the queued fact when a request is in flight (never interrupts in-flight work). Appends are keyed off the trigger event exactly like the live path (agent/llm-request-scheduled@<offset>), so raced duplicates dedup in the stream.
  • Gated on pendingTriggerOffset <= sideEffectsAfterOffset so recovery fires only for anchor-skipped triggers and never races the live input-added handler. Crash/restart cases above the anchor remain owned by the existing scheduled-phase reconciliation.
  • StreamProcessor.processEvent args now expose sideEffectsAfterOffset (the batch-level hook already had it); the core processor's inline path passes 0 (inline appends are always live).

The scheduled phase needs no queued fact on recovery: its handoff rebuilds the request body from full committed history, which already includes the skipped trigger.
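
A minimal sketch of the obligation and its recovery gate, under stated assumptions. Only `pendingTriggerOffset`, `sideEffectsAfterOffset`, the event names, and the `agent/llm-request-scheduled@<offset>` key format come from this PR; the `requestInFlight` flag, the `triggering` field, the `Recovery` type, and both function names are hypothetical stand-ins for whatever the agent processor actually uses.

```typescript
type AgentEvent =
  | { type: "agent/input-added"; offset: number; triggering: boolean }
  | { type: "agent/llm-request-scheduled"; offset: number }
  | { type: "agent/llm-request-requested"; offset: number }
  | { type: "agent/llm-request-queued"; offset: number }
  | { type: "agent/llm-request-completed"; offset: number };

interface AgentState {
  pendingTriggerOffset: number | null;
  requestInFlight: boolean; // assumption: simplified idle/in-flight tracking
}

const initialState: AgentState = { pendingTriggerOffset: null, requestInFlight: false };

// Pure reducer: runs for replayed and live events alike.
function reduce(state: AgentState, event: AgentEvent): AgentState {
  switch (event.type) {
    case "agent/input-added":
      return event.triggering ? { ...state, pendingTriggerOffset: event.offset } : state;
    case "agent/llm-request-scheduled":
    case "agent/llm-request-requested":
      return { pendingTriggerOffset: null, requestInFlight: true };
    case "agent/llm-request-queued":
      return { ...state, pendingTriggerOffset: null };
    case "agent/llm-request-completed":
      return { ...state, requestInFlight: false };
  }
}

type Recovery =
  | { kind: "none" }
  | { kind: "schedule"; idempotencyKey: string }
  | { kind: "queue"; triggerOffset: number };

// Decision taken when a live subscriber-connected fact is processed.
function recoverOnSubscriberConnected(state: AgentState, sideEffectsAfterOffset: number): Recovery {
  const pending = state.pendingTriggerOffset;
  if (pending === null) return { kind: "none" };
  // Triggers above the anchor belong to the live input-added handler.
  if (pending > sideEffectsAfterOffset) return { kind: "none" };
  // Never interrupt in-flight work: record a queued fact instead.
  if (state.requestInFlight) return { kind: "queue", triggerOffset: pending };
  // Same key as the live path, so raced duplicates dedup in the stream.
  return { kind: "schedule", idempotencyKey: `agent/llm-request-scheduled@${pending}` };
}

// Prod stream shape: trigger at offset 9, anchor at offset 15.
const replay: AgentEvent[] = [{ type: "agent/input-added", offset: 9, triggering: true }];
const idleState = replay.reduce(reduce, initialState);
const idleRecovery = recoverOnSubscriberConnected(idleState, 15);

const busyReplay: AgentEvent[] = [
  { type: "agent/llm-request-requested", offset: 3 },
  { type: "agent/input-added", offset: 9, triggering: true },
];
const busyRecovery = recoverOnSubscriberConnected(busyReplay.reduce(reduce, initialState), 15);
```

Keeping the decision a pure function of reduced state and the anchor makes each case in the Verification list below expressible as a plain replay.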

Verification

  • Unit tests replay the prod stream shape: trigger below anchor + subscriber-connected above → exactly one llm-request-scheduled@9; non-triggering inputs don't recover; in-flight requests get a queued fact; live triggers aren't double-scheduled.
  • New token-gated e2e (schedules and completes an LLM request for a plain routed Slack message) drives a real Slack root message + routed webhook through webhook → input → scheduled → requested → completed(success) against a live deployment.
  • pnpm typecheck && pnpm lint && pnpm format && pnpm test all green.
  • E2E run against the preview deployment with the real Slack bot token: results to follow in a comment.

🤖 Generated with Claude Code


Note

High Risk
Changes core agent LLM scheduling and subscriber-connected reconciliation on a production outage path; incorrect gating could double-schedule or miss triggers on every new Slack thread.

Overview
Fixes first-message silence on new Slack thread streams when a triggering input-added lands before the agent subscription is configured: the host’s side-effect anchor replays that input into state but skips scheduling, so no LLM turn ever starts.

Agent processor now records pendingTriggerOffset in reduced state for triggering inputs and clears it when a durable schedule/request/queue fact exists. On subscriber-connected, when that offset is at or below the anchor, it recovers the missed obligation—llm-request-scheduled when idle (same idempotency key as the live path) or llm-request-queued when a request is already in flight—without double-scheduling live triggers above the anchor. #appendLlmRequestScheduled arms the debounce timer with the committed requestId after idempotent dedup so raced recovery paths don’t wedge the handoff.

Streams: processEvent receives sideEffectsAfterOffset so reconcilers can detect anchor-skipped side effects; the core inline path passes 0 (always live).

Verification: new unit coverage for anchor-skip recovery, deduped schedule, queue-when-busy, and no recovery for non-triggering inputs; token-gated e2e asserts routed Slack webhook → scheduled → requested → completed(success).

Reviewed by Cursor Bugbot for commit eb3a7ac. Bugbot is set up for automated code reviews on this repo. Configure here.

Environment Config Lease

No active environment config lease.

OS

Status: released
Commit: eb3a7ac
Preview: https://os.iterate-preview-4.com
Summary: Preview app released.
Workflow run
Updated: 2026-06-10T22:04:58.830Z

jonastemplestein and others added 2 commits June 10, 2026 22:50
On a freshly bootstrapped Slack thread stream, the slack-agent processor
renders the webhook into a triggering agent/input-added before the agent
processor's subscription is configured (the AGENT DO wake hook appends
those subscription-configured events after D1 reads and workspace setup,
and slack-agent reliably wins that race). The host anchors side effects
at the subscription-configured offset, so the trigger is reduced as
historical replay and its scheduling side effect never runs: no
llm-request-scheduled, no llm-request-requested, no LLM turn — the agent
silently never replies to the first message of every new thread.
Regression from #1402; observed in prd on 2026-06-10
(/agents/slack/c08r1smtzgd/ts-1781124999-011519: input at offset 9,
agent subscription configured at offset 15, stream ends with no request
events).

The fix makes the trigger a durable obligation in reduced state instead
of a fire-and-forget side effect:

- AgentState gains pendingTriggerOffset: set by a triggering
  input-added, cleared by llm-request-scheduled / llm-request-requested
  / llm-request-queued. If it survives in reduced state, the scheduling
  side effect never ran.
- The subscriber-connected reconciliation (which always runs live — the
  presence fact lands above the anchor) recovers it: schedule a request
  when idle, or append the queued fact when a request is in flight.
  Appends are keyed off the trigger event exactly like the live path,
  so raced duplicates dedup in the stream. The recovery is gated on
  pendingTriggerOffset <= sideEffectsAfterOffset so it fires only for
  anchor-skipped triggers and never races the live input-added handler.
- StreamProcessor's processEvent args now expose sideEffectsAfterOffset
  (the default batch fan-out already had it) so handlers can tell
  whether an earlier event's side effects were skipped as historical.

Covered by unit tests replaying the prod stream shape, plus a
token-gated e2e test that drives a plain routed Slack message through
the full webhook → input → scheduled → requested → completed chain
against a live deployment.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Multiple subscriber-connected facts land in quick succession (one per
co-hosted processor), so recovery appends can race each other — and a
batch retry can re-run the live path's append. The stream dedups them
under the shared idempotency key and returns the existing event, but the
timer was armed with the local requestId; the handoff re-reads durable
history and bails on a mismatch, wedging the turn until the next
subscriber-connected. Adopt the committed payload's requestId so every
racer converges on the durable schedule.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
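
The requestId adoption described in the second commit can be sketched with an in-memory stand-in for the stream. Everything here is illustrative: `InMemoryStream`, `ScheduledPayload`, and the free function `appendLlmRequestScheduled` are hypothetical, and the real `#appendLlmRequestScheduled` arms a debounce timer rather than returning a value. Only the dedup-by-idempotency-key behaviour and the key format come from this PR.

```typescript
interface ScheduledPayload {
  requestId: string;
}

interface CommittedEvent {
  idempotencyKey: string;
  payload: ScheduledPayload;
}

// Hypothetical in-memory stream with idempotent append: a second append
// under the same key returns the event that was committed first.
class InMemoryStream {
  private byKey = new Map<string, CommittedEvent>();

  append(idempotencyKey: string, payload: ScheduledPayload): CommittedEvent {
    const existing = this.byKey.get(idempotencyKey);
    if (existing !== undefined) return existing;
    const committed: CommittedEvent = { idempotencyKey, payload };
    this.byKey.set(idempotencyKey, committed);
    return committed;
  }
}

// Returns the requestId the caller should arm its timer with.
function appendLlmRequestScheduled(
  stream: InMemoryStream,
  triggerOffset: number,
  localRequestId: string,
): string {
  const committed = stream.append(`agent/llm-request-scheduled@${triggerOffset}`, {
    requestId: localRequestId,
  });
  // Adopt the committed requestId, not the local one, so a racer that lost
  // the dedup still matches what the handoff re-reads from durable history.
  return committed.payload.requestId;
}

const stream = new InMemoryStream();
const firstRacer = appendLlmRequestScheduled(stream, 9, "req-a");
const secondRacer = appendLlmRequestScheduled(stream, 9, "req-b");
```

Both racers end up holding `req-a`. Arming with the local id would leave the second racer holding `req-b`, which is the mismatch the commit message describes.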
@jonastemplestein jonastemplestein merged commit 9ad6f0d into main Jun 10, 2026
9 checks passed
@jonastemplestein jonastemplestein deleted the ahead-nautilus branch June 10, 2026 22:03

E2E verification record (run against the leased preview-4 deployment of eb3a7ac, with the real Slack bot token from Doppler preview_4):

  • schedules and completes an LLM request for a plain routed Slack message (the new regression e2e) passed in 16.7s: real root message posted to #slack-agent-e2e-test, routed webhook, agent/input-added → llm-request-scheduled → llm-request-requested → llm-request-completed(success), no stream/error-occurred.
  • routes Slack webhooks into slack-agent streams and executes bang command replies
  • lets a real agent conversation post to Slack through codemode
  • Full agents.e2e.test.ts run: 8/10 passed. The 2 failures ("recovers and still replies when the agent host DO is killed mid-turn": timeout; "project config worker customizes fresh agents": git-push ok:false) coincided exactly with the post-merge Preview / cleanup job destroying the preview worker mid-suite (preview-4 now returns 522s; isolated retries fail at projects.create with MALFORMED_ORPC_ERROR_RESPONSE). These are not code regressions: the kill-recovery branch is condition-for-condition identical and covered by unit tests.

Prd smoke test on the real iterate project to follow after the main deploy.

🤖 Generated with Claude Code


Prd smoke test — passed

Verified on the real iterate project (prj_d871bac9722d45aba4e3dbb50057900d) after the main deploy, in the exact thread from the original incident (#test-blank, root ts 1781124999.011519):

  • Injected a human-shaped slack/webhook-received (thread reply mentioning the bot) into /integrations/slack; the real prd pipeline routed it to the existing thread stream.
  • The chain that was silently missing during the outage now fires end to end: agent/input-added@42 → agent/llm-request-scheduled@42 → llm-request-requested@47 → openai-ws/llm-request-started → agent/llm-request-completed (success) → itx/execution-completed { ok: true, durationMs: 7077 }.
  • The agent posted a real reply in the real Slack thread with the project's bot token: "Acknowledged — smoke test reply received in this thread." (1781129770.658629), confirmed via conversations.replies using the Doppler prd Slack token.

Note: a bot-authored root message (posted with the Doppler CI bot token) correctly does not trigger the agent — slack-agent skips bot messages at input render. The thread streams still bootstrap with both subscription sets in either race order (verified in the fresh stream ts-1781129521-178669: agent subscriptions at offsets 13–15, input at 26).

🤖 Generated with Claude Code

jonastemplestein added a commit that referenced this pull request Jun 11, 2026
…s at the routing hop, pre-warmed hosts (#1494)

## The problem

A Slack message in prd took **~14s to get the 👀 reaction** and ~20s to
get a reply (example: `iterate` project, thread `ts-1781170058-112929`).
Hop-by-hop, from the message's Slack `ts`:

| Δ | what happened |
|---|---|
| +0.9s | Slack delivered the webhook; Slack was fast |
| **+6.5s** | nothing of ours executed anywhere: cold instantiation of SlackIntegrationDO + the integration StreamDO (handler: 8.1s wall, **5ms CPU**). Slack's 3s retry queued behind the same gate and doubled the work |
| +2.1s | integration DO init + subscription + append + routing |
| **+3.0s** | cold instantiation of the new thread StreamDO |
| **+1.4s** | cold dial of the SLACK_AGENT host DO → input rendered → eyes at ~14s |
| +6s | LLM leg (openai-ws connect 1.1s, gpt-5.5 ~2s, itx exec) → reply at ~20s |

Two multiplying causes: **the deployed script was 89.1 MB** (50 MB
sourcemaps + browser-only modules uploaded as worker modules by
alchemy's noBundle glob over `dist/server`; the live server graph is ~34
MB, the entrypoint 1.75 MB) — and every cold DO isolate loads all of it
— times **3–4 distinct DOs chained serially** on the webhook path. The
warm path was always fine (webhook 1–6ms, appends 20–100ms): this is
cold-start tax, not stream-architecture tax.
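
As a quick arithmetic check, the hop deltas in the table can be summed to confirm the ~14s and ~20s figures. The sketch below only restates the table; the `cold` flag marks the three rows the table describes as cold instantiation or cold dial, and the labels are shortened paraphrases. Values are held in tenths of a second to keep the sums exact.

```typescript
// Hop deltas from the table, in tenths of a second to avoid float drift.
const hops = [
  { label: "Slack delivers webhook", tenths: 9, cold: false },
  { label: "cold SlackIntegrationDO + integration StreamDO", tenths: 65, cold: true },
  { label: "integration DO init + subscription + append + routing", tenths: 21, cold: false },
  { label: "cold thread StreamDO", tenths: 30, cold: true },
  { label: "cold dial of SLACK_AGENT host DO", tenths: 14, cold: true },
];

const sum = (values: number[]): number => values.reduce((a, b) => a + b, 0);

// Time from the Slack ts to the eyes reaction: 139 tenths = 13.9s (the "~14s").
const eyesTenths = sum(hops.map((hop) => hop.tenths));

// Portion spent in rows flagged as cold starts: 109 tenths = 10.9s.
const coldTenths = sum(hops.filter((hop) => hop.cold).map((hop) => hop.tenths));

// Adding the +6s LLM leg: 199 tenths = 19.9s (the "~20s").
const replyTenths = eyesTenths + 60;
```

On these numbers roughly 10.9s of the 13.9s before the reaction sits in the three cold hops, which matches the cold-start reading given above.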

## The fixes (no change to the streams/processors idea)

1. **`prune-server-bundle.ts`** (runs between build and asset
preupload): deletes every `dist/server` module unreachable from the
entrypoint via import/`new URL` literals (browser web workers + their
wasm that the SSR build emits), plus all sourcemaps **except the
entrypoint's own** (small; the one Cloudflare can symbolicate worker
stack traces with — chunk maps are browser code and pure ballast inside
a worker script). Validated against the extracted prd bundle: keeps
exactly the 186-module live graph, deletes the 3 browser-only modules +
chunk maps.
2. **Append-only webhook ack**: the handler no longer awaits
`SlackIntegrationDO.initialize()` before responding — only the durable
append gates the 200; initialize + catch-up moved to `waitUntil`.
Order-independent (existing integrations have their subscription on the
stream; new ones pick the webhook up via replay). Stops the >3s Slack
retry storm.
3. **👀 at the routing hop**: the slack router reports routed webhooks to
its host (`acknowledgeRoutedWebhook`) and SlackIntegrationDO adds the
reaction immediately — one hop from ingress instead of three cold DO
hops downstream — gated by the same payload-only rules the slack-agent
applies (no bot messages, no reaction events, no bot-user actions).
slack-agent still adds it on catch-up; `already_reacted` makes the pair
idempotent.
4. **Pre-warmed hosts** (`prewarmRoutedStreamHosts`): for a newly routed
thread, the SLACK_AGENT and AGENT host DOs `initialize()` concurrently
with the bootstrap append instead of serially after each dial.
Everything either side appends is idempotency-keyed and
order-independent (the anchor-skip recovery from #1481 covers trigger
ordering).
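
The payload-only gating in fix 3 might look roughly like the sketch below. The field names (`bot_id`, `subtype`, `user`, `ts`, `channel`) and the `reaction_added` / `reaction_removed` event types are from the Slack Events API, and `channel` / `timestamp` / `name` are the `reactions.add` arguments. The function `eyesTargetFor`, the exact rule set, and the sample ids are assumptions, not the body of `eyesReactionTargetFromWebhookPayload`.

```typescript
// Subset of Slack Events API fields this sketch relies on.
interface SlackEvent {
  type: string; // e.g. "message", "app_mention", "reaction_added"
  channel?: string;
  ts?: string;
  user?: string;
  bot_id?: string;
  subtype?: string;
}

// Arguments for a reactions.add call.
interface EyesTarget {
  channel: string;
  timestamp: string;
  name: "eyes";
}

// Hypothetical gate: decide from the payload alone whether to add the reaction.
function eyesTargetFor(event: SlackEvent, botUserId: string): EyesTarget | null {
  // No reaction events.
  if (event.type === "reaction_added" || event.type === "reaction_removed") return null;
  // No bot messages.
  if (event.bot_id !== undefined || event.subtype === "bot_message") return null;
  // No actions taken by the bot user itself.
  if (event.user === botUserId) return null;
  // Nothing to react to without a channel and message timestamp.
  if (event.channel === undefined || event.ts === undefined) return null;
  return { channel: event.channel, timestamp: event.ts, name: "eyes" };
}

// Treat Slack's already_reacted error as success so the routing hop and
// the slack-agent catch-up can both attempt the reaction safely.
function isBenignReactionError(slackError: string): boolean {
  return slackError === "already_reacted";
}

const humanMessage: SlackEvent = { type: "message", channel: "C123", ts: "1781170058.112929", user: "U1" };
const botMessage: SlackEvent = { type: "message", channel: "C123", ts: "1781170059.000100", bot_id: "B1" };

const humanTarget = eyesTargetFor(humanMessage, "UBOT");
const botTarget = eyesTargetFor(botMessage, "UBOT");
```

Because the gate needs nothing beyond the webhook payload, it can run at the routing hop before any downstream DO has been instantiated.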

## Measured

Dev-stage deploys of this branch (`os-dev-jonas`):

- prd today: **89.1 MB**
- this branch pre-#1486 baseline: **34.1 MB**
- this branch on latest main (includes #1486's SSR-graph shrink, 186→178
live modules): **28.3 MB** — 3.1× smaller; app smoke-tested (sign-in
200)
- prune log on the real prd bundle: `kept 186 modules, deleted 3
unreachable modules + 180 sourcemaps (55.0 MB)`

Expected effect: each cold DO instantiation drops from multi-second to
sub-second, and the eyes ack stops depending on the deepest part of the
chain. Worth re-measuring the full message→eyes timing in prd after this
deploys.

## Trade-offs / notes

- Chunk-level deployed stack traces lose symbolication (entrypoint map
kept). Symbolicate locally against the build output if needed.
- The prune is conservative: anything referenced by a quoted relative
specifier (`from`, `import()`, `export from`, `new URL`) stays. The
unreachable set on the real bundle is exactly the browser-only web
workers + wasm.
- Follow-up idea (not this PR): split app-vs-platform workers so UI
deploys stop evicting agent/stream DOs (the 2026-06-10 deploy-race
incident), and consider per-DO-class workers for deploy isolation.
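
The conservative prune rule in the notes above can be illustrated with a reachability walk over an in-memory module map. This is a sketch under assumptions, not `prune-server-bundle.ts`: the regex, `resolveRelative`, `planPrune`, and the sample file names are all hypothetical, and the real script works on files under `dist/server`.

```typescript
// Quoted relative specifiers following `from`, `import(`, or `new URL(`.
const SPECIFIER_SOURCE = "(?:from\\s*|import\\s*\\(\\s*|new\\s+URL\\s*\\(\\s*)[\"'](\\.{1,2}\\/[^\"']+)[\"']";

// Resolve a relative specifier against the importing module's directory.
function resolveRelative(fromPath: string, specifier: string): string {
  const parts = fromPath.split("/").slice(0, -1);
  specifier.split("/").forEach((segment) => {
    if (segment === "." || segment === "") return;
    if (segment === "..") parts.pop();
    else parts.push(segment);
  });
  return parts.join("/");
}

// Walk the graph from the entrypoint, then keep the entrypoint's own sourcemap.
function planPrune(modules: Map<string, string>, entry: string): { keep: string[]; remove: string[] } {
  const seen = new Set<string>();
  const queue: string[] = [entry];
  while (queue.length > 0) {
    const current = queue.pop() as string;
    const source = modules.get(current);
    if (source === undefined || seen.has(current)) continue;
    seen.add(current);
    const pattern = new RegExp(SPECIFIER_SOURCE, "g");
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(source)) !== null) {
      const specifier = match[1];
      if (specifier !== undefined) queue.push(resolveRelative(current, specifier));
    }
  }
  const entryMap = entry + ".map";
  const all = Array.from(modules.keys());
  const keep = all.filter((path) => seen.has(path) || path === entryMap).sort();
  const remove = all.filter((path) => !seen.has(path) && path !== entryMap).sort();
  return { keep, remove };
}

// Tiny made-up bundle: one live graph, one browser-only worker, two sourcemaps.
const bundle = new Map<string, string>([
  ["index.js", 'import { a } from "./chunks/a.js";'],
  ["index.js.map", "{}"],
  ["chunks/a.js", 'export * from "./b.js"; export const a = new URL("../assets/data.wasm", import.meta.url);'],
  ["chunks/a.js.map", "{}"],
  ["chunks/b.js", "export const b = 1;"],
  ["assets/data.wasm", ""],
  ["chunks/browser-worker.js", 'import("./browser.wasm");'],
  ["chunks/browser.wasm", ""],
]);

const plan = planPrune(bundle, "index.js");
```

For this sample, `plan.remove` holds the browser-only worker, its wasm, and the chunk sourcemap, while the entrypoint map survives. A regex over quoted specifiers errs toward keeping files, which is the conservative direction for a deploy artifact.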

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Changes production Slack webhook timing, adds best-effort Slack API
calls on the routing path, and alters deploy artifacts via bundle
pruning; behavior is designed to be idempotent but affects a critical
user-visible path.
> 
> **Overview**
> Cuts Slack cold-path latency by shrinking the deployed worker and
parallelizing work on the webhook path.
> 
> **Deploy:** Adds `prune-server-bundle` to the Alchemy build (after
Vite, before asset preupload). It strips unreachable `dist/server`
modules and most sourcemaps so each cold Durable Object isolate loads a
much smaller script.
> 
> **Webhook ingress:** The Slack webhook handler now returns `{ ok: true
}` after the durable stream append only;
`SlackIntegrationDO.initialize()` / `ensureReady()` run in `waitUntil`,
avoiding >3s acks and Slack retries.
> 
> **Routing hop:** `SlackProcessor` gains optional
`acknowledgeRoutedWebhook` and `prewarmRoutedStreamHosts`. The
integration DO adds the 👀 reaction at route time (via
`eyesReactionTargetFromWebhookPayload` + `reactions.add`) and
pre-initializes `SLACK_AGENT` and `AGENT` DOs in parallel with
new-thread bootstrap. Downstream slack-agent behavior stays idempotent
(`already_reacted`).
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
8cf05b1. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

<!-- CLOUDFLARE_PREVIEW -->
## Environment Config Lease
Lease: `preview-3`
Doppler config: `preview_3`
Type: `environment-config-lease`
Leased until: 2026-06-11T11:58:15.710Z

### OS
Status: deployed
Commit: `8cf05b1`
Preview: https://os.iterate-preview-3.com
[Workflow
run](https://github.com/iterate/iterate/actions/runs/27342005408)
Updated: 2026-06-11T11:02:23.553Z

### Semaphore
Status: deployed
Commit: `8cf05b1`
Preview: https://semaphore.iterate-preview-3.com
[Workflow
run](https://github.com/iterate/iterate/actions/runs/27342005408)
Updated: 2026-06-11T10:59:58.014Z
<!-- /CLOUDFLARE_PREVIEW -->

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>