
Fix Slack agents never responding (regression from #1370) #1372

Merged
jonastemplestein merged 1 commit into main from caterwauling-periodical on Jun 5, 2026


@jonastemplestein jonastemplestein commented Jun 5, 2026

Contributor

What was broken

Slack agents stopped responding. A user @mention becomes an agent/input-added event on the routed agent stream, but nothing consumes it — the LLM processors (agent-chat / agent / the provider processor) were never registered on Slack-routed streams. Observed live on templestein2 stream /agents/slack/c09trdv61v4/ts-1780670924-517029: slack-agent ran (produced the input), but no agent/LLM processor existed, so no reply.

Root cause — regression from #1370 (streams runtime cutover)

  • Old runtime: routedStreamBootstrapEvents subscribed a callable to AGENT DO afterAppend. Invoking it woke the AgentDurableObject, whose onInstanceWake registered agent-chat/agent/LLM/agent-host and seeded setup events.
  • After #1370 ([codex] cut over OS streams runtime): that was replaced with a built-in agent-host processor whose afterAppend only runs ensureChildAgentRunner (+ codemode handlers). It never wakes the agent for its own stream, so the LLM processors are never registered.
  • What actually starts a processor is the subscription-configured event (Stream#reconcileOutboundConnections dials a runner per subscription key). Those were never appended for the LLM processors on routed streams.
  • Compounding: ensureChildAgentRunner compared against the legacy events.iterate.com/core/child-stream-created, but the new runtime emits events.iterate.com/stream/child-stream-created.
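The prefix mismatch in the last bullet can be illustrated with a minimal sketch. Only the two event type strings come from this PR; the constant names, the `StreamEvent` shape, and the `isChildStreamCreated` helper are hypothetical and not taken from the codebase.

```ts
// Hypothetical constant names; the string values are the ones quoted in this PR.
const LEGACY_CHILD_STREAM_CREATED = "events.iterate.com/core/child-stream-created";
const CHILD_STREAM_CREATED = "events.iterate.com/stream/child-stream-created";

// Hypothetical minimal event shape for illustration.
interface StreamEvent {
  type: string;
  offset: number;
}

// Returns true when the event's type equals the type the handler expects.
function isChildStreamCreated(event: StreamEvent, expectedType: string): boolean {
  return event.type === expectedType;
}

// The new runtime emits the /stream/ form (the offset here is arbitrary).
const emitted: StreamEvent = { type: CHILD_STREAM_CREATED, offset: 2 };

// Comparing against the legacy /core/ type never matches, so the handler is a silent no-op.
const matchesLegacy = isChildStreamCreated(emitted, LEGACY_CHILD_STREAM_CREATED);
// Comparing against the /stream/ type matches.
const matchesNew = isChildStreamCreated(emitted, CHILD_STREAM_CREATED);
```

A strict string comparison fails without any error, which is presumably why the mismatch went unnoticed until replies stopped.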

Dashboard-created agents were unaffected because new.tsx explicitly subscribes the full processor set. PR #1371 would not have fixed this — it only adds stream-processor-registered marker events (which don't start processors) and only to the UI flow.

The fix

  • Wake the agent for routed streams (ensureAgentRunnerForOwnStream): when agent-host runs on a routed agent stream, it initializes that stream's AgentDurableObject on the stream/created event. onInstanceWake then registers the LLM processors and setup events; the resulting subscription-configured events are what reconcileOutboundConnections dials. Verified the runner replays from offset 0 (replayAfterOffset: snapshot?.offset ?? 0), so agent-host reliably sees stream/created (always offset 1).
  • Dedupe the agent-host runner: align the bootstrap agent-host subscription key with the canonical AgentDurableObject key (runner DOs are keyed by ${namespace}:${path}:${subscriptionKey}), so the two declarations resolve to one runner.
  • Use the new-runtime events.iterate.com/stream/ prefix at the OS call-sites that compare against new-runtime core events: the broken child-stream-created check (local constants), the project agents-root jsonata matcher, and the stream-composer UI examples.
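The runner dedupe in the second bullet follows from the key format `${namespace}:${path}:${subscriptionKey}`. The sketch below assumes that format only; the `RunnerDeclaration` type, the helper names, and all namespace, path, and subscription key values are hypothetical.

```ts
// Hypothetical shape of one runner declaration.
interface RunnerDeclaration {
  namespace: string;
  path: string;
  subscriptionKey: string;
}

// Key format as described in the PR: ${namespace}:${path}:${subscriptionKey}.
function runnerKey(d: RunnerDeclaration): string {
  return `${d.namespace}:${d.path}:${d.subscriptionKey}`;
}

// Number of distinct runners a set of declarations resolves to.
function countRunners(declarations: RunnerDeclaration[]): number {
  return new Set(declarations.map(runnerKey)).size;
}

const namespace = "example-namespace"; // hypothetical
const path = "/agents/slack/example-thread"; // hypothetical

// Mismatched subscription keys: two declarations, two runners.
const runnersBefore = countRunners([
  { namespace, path, subscriptionKey: "bootstrap-agent-host" }, // hypothetical key
  { namespace, path, subscriptionKey: "agent-host" }, // hypothetical key
]);

// Aligned subscription keys: both declarations resolve to the same runner.
const runnersAfter = countRunners([
  { namespace, path, subscriptionKey: "agent-host" },
  { namespace, path, subscriptionKey: "agent-host" },
]);
```

Under these assumptions `runnersBefore` is 2 and `runnersAfter` is 1, which is the effect the key alignment aims for.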

Verification

  • Full-repo pnpm typecheck (18 projects), oxlint, oxfmt
  • Affected OS unit tests pass
  • Not yet deployed / round-tripped on a preview — recommend a preview deploy + one test @mention before prod (Preview / e2e is otherwise skipped on PRs).

Scope note

An earlier revision also did a broad events.iterate.com/core/ → /stream/ rename across the shared package and the separate events.iterate.com app. That broke the events runtime e2e (append → 500) and is unrelated to this bug, so it was reverted out of this PR. If we still want the events platform on the /stream/ prefix, it needs its own investigation, tracked as a follow-up.

Follow-ups (found while auditing #1370)

  • ProjectDurableObject.afterAppend is orphaned: the old runtime forwarded lifecycle events to the project config-worker afterAppend hook; the new built-in project-lifecycle processor has no afterAppend and nothing calls it.
  • CodemodeSession.afterAppend is orphaned (likely benign — resolves locally via appendAndConsume).
  • agents.e2e.test.ts has vacuous /core/error-occurred assertions; worth real Slack-path e2e coverage so this can't regress silently.

🤖 Generated with Claude Code

Environment Config Lease

No active environment config lease.

OS

Status: released
Commit: 2ba51d8
Preview: https://os.iterate-preview-2.com
Summary: Preview app released.
Workflow run
Updated: 2026-06-05T19:28:38.611Z


Note

Medium Risk
Touches agent stream bootstrap and durable-object initialization on every routed agent stream; wrong event matching or wake ordering could affect non-Slack agents, but changes are narrowly scoped to host processor and Slack routing.

Overview
Restores Slack-routed agent streams after the streams runtime cutover (#1370): routed streams only got slack-agent + agent-host, so agent/input-added events were never consumed because LLM processors were never registered.

ensureAgentRunnerForOwnStream initializes the stream’s AgentDurableObject on events.iterate.com/stream/created (via agent-host afterAppend, using keepAlive to avoid deadlocking catch-up). That runs onInstanceWake, which appends the LLM processor subscriptions and setup events.
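As a rough illustration of the wake path described above, the sketch below replays a stream and wakes the agent when it sees events.iterate.com/stream/created. It is not the actual ensureAgentRunnerForOwnStream: the helper names, the synchronous `wakeAgent` callback, and the exclusive reading of `replayAfterOffset` are assumptions, and keepAlive is not modelled.

```ts
const STREAM_CREATED = "events.iterate.com/stream/created";

// Hypothetical minimal event shape.
interface StreamEvent {
  type: string;
  offset: number;
}

// Hypothetical handler: wakes the agent only for the stream's creation event.
function wakeOnStreamCreated(
  streamPath: string,
  event: StreamEvent,
  wakeAgent: (path: string) => void,
): boolean {
  if (event.type !== STREAM_CREATED) return false;
  wakeAgent(streamPath);
  return true;
}

// Replays events strictly after replayAfterOffset (assumed semantics) and
// returns how many times the agent was woken.
function replayAndCountWakes(events: StreamEvent[], replayAfterOffset: number): number {
  const woken: string[] = [];
  const streamPath = "/agents/slack/example-thread"; // hypothetical path
  for (const event of events) {
    if (event.offset <= replayAfterOffset) continue;
    wakeOnStreamCreated(streamPath, event, (path) => {
      woken.push(path);
    });
  }
  return woken.length;
}

const events: StreamEvent[] = [
  { type: STREAM_CREATED, offset: 1 }, // stream/created is always offset 1
  { type: "agent/input-added", offset: 2 }, // hypothetical offset
];

const wakesFromZero = replayAndCountWakes(events, 0); // sees stream/created
const wakesFromOne = replayAndCountWakes(events, 1); // would miss stream/created
```

The second call shows why replaying from offset 0 matters: a runner that started after offset 1 would never see the creation event and so would never wake the agent.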

Slack bootstrap now uses agentProcessorSubscriptionConfiguredEvent so the routed agent-host subscription key matches AgentDurableObject, deduping to a single runner.

Event type alignment for the new runtime: child-stream-created and related core lifecycle types use events.iterate.com/stream/… instead of legacy …/core/… in agent host logic, the agents-root jsonata matcher, and stream composer presets.

Reviewed by Cursor Bugbot for commit 2ba51d8. Bugbot is set up for automated code reviews on this repo.

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit 24d0074.

Comment thread apps/os/src/domains/agents/durable-objects/agent-durable-object.ts
@jonastemplestein jonastemplestein force-pushed the caterwauling-periodical branch from 24d0074 to 31f5867 Compare June 5, 2026 19:10
@jonastemplestein jonastemplestein changed the title Fix Slack agents never responding (regression from #1370) + /stream/ event prefix everywhere Fix Slack agents never responding (regression from #1370) Jun 5, 2026
Slack-routed agent streams never registered the LLM processors
(agent-chat/agent/provider), so a user message landed as an
agent/input-added event that nothing consumed — the agent never replied.

Regression from #1370 (streams runtime cutover): the routed bootstrap used
to subscribe a callable to AgentDurableObject.afterAppend (which woke the
agent and registered its processors via onInstanceWake); the cutover
replaced it with a built-in agent-host processor that never wakes the agent
for its own stream. What actually starts a processor is the
subscription-configured event, which was never appended for the LLM
processors on routed streams.

Fix:
- agent-host now wakes the AgentDurableObject for its own stream on
  stream/created (ensureAgentRunnerForOwnStream); onInstanceWake registers
  agent-chat/agent/LLM + setup events, whose subscription-configured events
  Stream#reconcileOutboundConnections then dials into runners. Verified the
  runner replays from offset 0, so agent-host reliably sees stream/created
  (always offset 1).
- Align the bootstrap agent-host subscription key with the canonical
  AgentDurableObject key so the two declarations dedupe to a single runner.
- Use the new-runtime event prefix (events.iterate.com/stream/) in the OS
  call-sites that compare against new-runtime core events: the
  child-stream-created check (was the legacy /core/ prefix, which the new
  runtime never emits), the project agents-root jsonata matcher, and the
  stream-composer UI examples.

Scope note: a broader /core/ -> /stream/ rename across the shared package and
the separate events.iterate.com app was reverted — it broke the events
runtime e2e (append 500s) and is unrelated to this bug. Tracked as a
follow-up.

Verified: full-repo typecheck, oxlint, oxfmt, affected OS unit tests pass.
Not yet deployed/round-tripped on a preview.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jonastemplestein jonastemplestein force-pushed the caterwauling-periodical branch from 31f5867 to 2ba51d8 Compare June 5, 2026 19:18
@jonastemplestein jonastemplestein merged commit 88d26e2 into main Jun 5, 2026
10 checks passed
@jonastemplestein jonastemplestein deleted the caterwauling-periodical branch June 5, 2026 19:26
jonastemplestein added a commit that referenced this pull request Jun 10, 2026
…1481)

## What was broken

Slack agents in prd receive messages but never reply — no LLM request is
ever made. Observed live on 2026-06-10 in `iterate` project stream
`/agents/slack/c08r1smtzgd/ts-1781124999-011519`:

- the slack-agent processor rendered the webhook into a triggering
`agent/input-added` at **offset 9** (20:56:46.2)
- the agent processor's `subscription-configured` event landed at
**offset 15** (20:56:46.7) — the AGENT DO wake hook appends it after D1
reads and workspace setup, so slack-agent reliably wins this race on a
cold thread
- the host anchors side effects at the subscription-configured offset
(`stream-processor-host.ts`), so the input at offset 9 was reduced as
historical replay and its scheduling side effect was skipped: no
`llm-request-scheduled`, no `llm-request-requested`, no `openai-ws`
activity, no reply — and nothing ever retriggers it
- visible fingerprint in the stream: capability-noted renders exist only
for offsets above the anchor (18–23), none for 8–9

The anchor mechanism is correct for re-attach (don't re-fire historical
LLM requests), but it shipped in #1402 without anything making the
*first* message of a new thread durable. Regression from #1402, same
symptom as #1372 but a different mechanism. **Every first message of
every new prod Slack thread is dropped.**
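
The anchor behaviour described above can be sketched as follows. This is a simplified model, not the code in `stream-processor-host.ts`: the `replay` function, the result shape, and the event at offset 18 are hypothetical, while offsets 9 and 15 mirror the observed stream.

```ts
// Hypothetical minimal event shape.
interface StreamEvent {
  offset: number;
  type: string;
}

interface ReplayResult {
  reducedOffsets: number[]; // events folded into state
  sideEffectOffsets: number[]; // events whose side effects actually ran
}

// Every event is reduced, but side effects run only above the anchor.
function replay(events: StreamEvent[], sideEffectsAfterOffset: number): ReplayResult {
  const result: ReplayResult = { reducedOffsets: [], sideEffectOffsets: [] };
  for (const event of events) {
    result.reducedOffsets.push(event.offset);
    if (event.offset > sideEffectsAfterOffset) {
      result.sideEffectOffsets.push(event.offset);
    }
  }
  return result;
}

const observed: StreamEvent[] = [
  { offset: 9, type: "agent/input-added" }, // the triggering input
  { offset: 15, type: "subscription-configured" }, // becomes the anchor
  { offset: 18, type: "agent/input-added" }, // hypothetical later input
];

// The anchor is the subscription-configured offset.
const replayed = replay(observed, 15);
```

In this model the input at offset 9 reaches state but never schedules anything, matching the dropped first message.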

## The fix

Make the trigger a durable obligation in reduced state instead of a
fire-and-forget side effect:

- **`AgentState.pendingTriggerOffset`** — set by a triggering
`input-added`, cleared by `llm-request-scheduled` /
`llm-request-requested` / `llm-request-queued`. If it survives in
reduced state, the scheduling side effect never ran.
- **`subscriber-connected` reconciliation recovers it** (the presence
fact always lands above the anchor, so this handler always runs live):
schedule a request when idle, append the queued fact when a request is
in flight (never interrupts in-flight work). Appends are keyed off the
trigger event exactly like the live path
(`agent/llm-request-scheduled@<offset>`), so raced duplicates dedup in
the stream.
- **Gated on `pendingTriggerOffset <= sideEffectsAfterOffset`** so
recovery fires only for anchor-skipped triggers and never races the live
`input-added` handler. Crash/restart cases above the anchor remain owned
by the existing scheduled-phase reconciliation.
- `StreamProcessor.processEvent` args now expose
`sideEffectsAfterOffset` (the batch-level hook already had it); the core
processor's inline path passes 0 (inline appends are always live).
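
A minimal sketch of the obligation tracking and gated recovery listed above, under simplifying assumptions: the event union, the `requestInFlight` flag, and both function names are hypothetical. Only `pendingTriggerOffset`, `sideEffectsAfterOffset`, the event names, and the `agent/llm-request-scheduled@<offset>` key format come from this description.

```ts
type AgentEvent =
  | { type: "input-added"; offset: number; triggering: boolean }
  | { type: "llm-request-scheduled" | "llm-request-requested" | "llm-request-queued"; offset: number }
  | { type: "subscriber-connected"; offset: number };

interface AgentState {
  pendingTriggerOffset: number | null;
  requestInFlight: boolean; // hypothetical flag; the real state shape may differ
}

// Pure reducer: a triggering input sets the obligation, a durable LLM request fact clears it.
function reduce(state: AgentState, event: AgentEvent): AgentState {
  switch (event.type) {
    case "input-added":
      return event.triggering ? { ...state, pendingTriggerOffset: event.offset } : state;
    case "llm-request-scheduled":
    case "llm-request-requested":
    case "llm-request-queued":
      return { ...state, pendingTriggerOffset: null };
    default:
      return state;
  }
}

type Recovery =
  | { kind: "none" }
  | { kind: "schedule"; idempotencyKey: string }
  | { kind: "queue"; triggerOffset: number };

// Runs on subscriber-connected. Recovers only triggers at or below the anchor.
function recoverOnSubscriberConnected(state: AgentState, sideEffectsAfterOffset: number): Recovery {
  const pending = state.pendingTriggerOffset;
  if (pending === null || pending > sideEffectsAfterOffset) return { kind: "none" };
  if (state.requestInFlight) return { kind: "queue", triggerOffset: pending };
  // Same key as the live path, so a raced duplicate dedups in the stream.
  return { kind: "schedule", idempotencyKey: `agent/llm-request-scheduled@${pending}` };
}

const initial: AgentState = { pendingTriggerOffset: null, requestInFlight: false };
const afterTrigger = reduce(initial, { type: "input-added", offset: 9, triggering: true });

const recovered = recoverOnSubscriberConnected(afterTrigger, 15); // trigger below anchor
const liveTrigger = recoverOnSubscriberConnected({ ...afterTrigger, pendingTriggerOffset: 18 }, 15);
const whileBusy = recoverOnSubscriberConnected({ ...afterTrigger, requestInFlight: true }, 15);
const cleared = reduce(afterTrigger, { type: "llm-request-scheduled", offset: 16 });
```

Keeping the reducer pure and the recovery decision a separate function mirrors the split described above: state records the obligation, and the `subscriber-connected` handler decides whether to act on it.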

The scheduled phase needs no queued fact on recovery: its handoff
rebuilds the request body from full committed history, which already
includes the skipped trigger.

## Verification

- Unit tests replay the prod stream shape: trigger below anchor +
subscriber-connected above → exactly one `llm-request-scheduled@9`;
non-triggering inputs don't recover; in-flight requests get a queued
fact; live triggers aren't double-scheduled.
- New token-gated e2e (`schedules and completes an LLM request for a
plain routed Slack message`) drives a real Slack root message + routed
webhook through webhook → input → scheduled → requested →
completed(success) against a live deployment.
- `pnpm typecheck && pnpm lint && pnpm format && pnpm test` all green.
- E2E run against the preview deployment with the real Slack bot token:
results to follow in a comment.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **High Risk**
> Changes core agent LLM scheduling and subscriber-connected
reconciliation on a production outage path; incorrect gating could
double-schedule or miss triggers on every new Slack thread.
> 
> **Overview**
> Fixes **first-message silence** on new Slack thread streams when a
triggering `input-added` lands **before** the agent subscription is
configured: the host’s side-effect anchor replays that input into state
but skips scheduling, so no LLM turn ever starts.
> 
> **Agent processor** now records **`pendingTriggerOffset`** in reduced
state for triggering inputs and clears it when a durable
schedule/request/queue fact exists. On **`subscriber-connected`**, when
that offset is at or below the anchor, it **recovers** the missed
obligation—`llm-request-scheduled` when idle (same idempotency key as
the live path) or **`llm-request-queued`** when a request is already in
flight—without double-scheduling live triggers above the anchor.
**`#appendLlmRequestScheduled`** arms the debounce timer with the
**committed** `requestId` after idempotent dedup so raced recovery paths
don’t wedge the handoff.
> 
> **Streams**: `processEvent` receives **`sideEffectsAfterOffset`** so
reconcilers can detect anchor-skipped side effects; the core inline path
passes **`0`** (always live).
> 
> **Verification**: new unit coverage for anchor-skip recovery, deduped
schedule, queue-when-busy, and no recovery for non-triggering inputs;
token-gated e2e asserts routed Slack webhook → scheduled → requested →
completed(success).
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
eb3a7ac. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

<!-- CLOUDFLARE_PREVIEW -->
## Environment Config Lease
<!-- CLOUDFLARE_PREVIEW_STATE -->
<!--
{
  "apps": {
    "os": {
      "appDisplayName": "OS",
      "appSlug": "os",
      "status": "deployed",
      "updatedAt": "2026-06-10T21:58:59.268Z",
      "headSha": "eb3a7ac0bb17f468c1d5490f0b6951bfe612374e",
      "message": null,
      "publicUrl": "https://os.iterate-preview-4.com",
      "runUrl": "https://github.com/iterate/iterate/actions/runs/27308869802",
      "shortSha": "eb3a7ac"
    }
  },
  "environmentConfigLease": {
    "dopplerConfig": "preview_4",
    "leasedUntil": 1781132168766,
    "leaseId": "29fbdda0-4a62-44f1-8b9f-ebe4adac552c",
    "slug": "preview-4",
    "type": "environment-config-lease"
  }
}
-->
<!-- /CLOUDFLARE_PREVIEW_STATE -->
Lease: `preview-4`
Doppler config: `preview_4`
Type: `environment-config-lease`
Leased until: 2026-06-10T22:56:08.766Z

### OS
Status: deployed
Commit: `eb3a7ac`
Preview: https://os.iterate-preview-4.com
[Workflow
run](https://github.com/iterate/iterate/actions/runs/27308869802)
Updated: 2026-06-10T21:58:59.268Z
<!-- /CLOUDFLARE_PREVIEW -->

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>