Skip to content

streams: subscriber presence facts + reconciler homogenization#1460

Merged
jonastemplestein merged 10 commits into
mainfrom
dedicated-barometer
Jun 10, 2026
Merged

streams: subscriber presence facts + reconciler homogenization#1460
jonastemplestein merged 10 commits into
mainfrom
dedicated-barometer

Conversation

@jonastemplestein

@jonastemplestein jonastemplestein commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

What

Implements the plan of record in tasks/streams-core-processor-host-homogenization.md (one combined PR, as planned). The unifying idea: every processor's processEvent is a reconciler — it reconciles non-serializable runtime state (in-flight LLM calls, debounce timers, delivery connections) against reduced state, by appending events or mutating runtime state. This PR lands the missing trigger for that reconciliation and fixes the two verified agent-wedge bugs from the audit (tasks/agents-system-audit-and-reconciler-design.md).

Presence facts

  • New events stream/subscriber-connected / stream/subscriber-disconnected, appended by the stream itself in openConnection/close — exactly once per actual open/close, hence no idempotency keys. One per connection (per subscriptionKey), covering processor subscriptions and inbound (browser / orpc-bridge) subscribe() alike.
  • Key safety property: a connected event is appended after the replay cursor is fixed, so it is always the tail of any batch it shares — state-at-event == batch-final state, making per-event reconciliation checks safe.
  • Subscribers pass an optional descriptor (incarnationId, label, processor contract announcement). stream/processor-registered is gone; contract announcements ride the connect event and feed processorsBySlug.

Core homogenization

  • The live connections map + outbound dialing move from the Stream DO into CoreStreamProcessor as instance state + processEvent side effects ("configured but not connected → dial", triggered by the woken and subscription-configured facts). Boot reconciliation is no longer special-cased in the constructor — the woken fact drives it.
  • Core reduced state gains a live presence roster (connectionsByKey): woken clears it (all connections died with the previous incarnation; survivors re-dial and re-land), connected adds, disconnected removes. No heartbeats, no staleness.

Bug fixes (both had red tests first)

  • Crash mid-LLM-request (audit §2.2): provider recovery now triggers on subscriber-connected instead of incidentally on the next consumed batch, and appends an explicit llm-request-attempt-failed {reason: "host-restarted"} before re-executing. Both processEventBatch overrides deleted. Previously the conversation stalled until the next user message.
  • Debounce wedge (audit §2.1): the scheduled phase now carries scheduledAtOffset; a fresh incarnation converts a timer-less scheduled request into llm-request-requested on connect, with the same idempotency key the dead timer would have used (the two paths converge). The reset-timer path re-arms from durable state instead of silently bailing. Previously a crash in the ~1s debounce window wedged the agent until an interrupt-policy input.

Crash injection

  • AgentDurableObject.kill() + project.agents.kill oRPC procedure.
  • New e2e test kills the agent host twice in one conversation — inside the debounce window and mid-LLM-request — and asserts the agent still replies.

Breaking

Existing stream data is incompatible (core state schema + event vocabulary changed). Streams must be reset on deploy. Decided explicitly in planning: no backwards compatibility, prod DB is disposable.

Testing

  • pnpm typecheck, pnpm lint, pnpm format, pnpm test (root) all green.
  • New unit tests: provider recovery with zero new domain events (+ negative: ordinary batches don't trigger recovery), scheduler recovery on connect (+ cancelled-in-history negative, + re-arm-after-restart), roster reduction (connect/disconnect/woken/reconnect), contract announcements from connect events.
  • Workerd e2e (packages/streams/example-app, STREAM_STAGING_E2E=true against local dev): all green after updating expectations to the new wire behavior (presence events visible in delivery).
  • NOT yet run: apps/os e2e against a deployed environment (including the new crash test) — needs a preview slot; draft until then.

🤖 Generated with Claude Code


Note

High Risk
Breaking stream core state (v4) requires reset on deploy; changes agent turn scheduling, LLM retry semantics, and stream delivery/re-dial behavior on production Durable Objects.

Overview
Replaces stream/processor-registered with durable stream/subscriber-connected / stream/subscriber-disconnected presence facts on every subscribe/open, carrying optional subscriber identity and processor contract announcements into core reduced state (connectionsByKey, reshaped processorsBySlug). Core stream processor now owns live delivery connections and outbound dial reconciliation (moved out of the Stream DO); woken clears the roster and drives reconnect instead of constructor special-casing. Core state version 4 — existing streams must be reset on deploy.

Agent crash recovery: scheduled LLM work gains scheduledOffset and reconciles on subscriber-connected (same idempotency key as the debounce timer); handoff append failures re-arm instead of wedging. Cloudflare AI / OpenAI WS reconcile dangling in-flight requests on connect only (not every batch), append llm-request-attempt-failed before retry, and claim IDs to avoid duplicate executions when several connect facts land in one batch.

Delivery reliability: outbound batch delivery rejections are observed via retainProcessEventBatch onDeliveryError; onRpcBroken wiring fixed for Cap'n Web proxy stubs; dead outbound connections drop and re-dial (new workerd regression test).

Ops / testing: project.agents.kill / AgentDurableObject.kill() for aborting the host DO; new OS e2e kills mid-debounce and mid-LLM-request. Subscribe callers pass subscriber descriptors (browser, orpc-bridge, streams-capability).

Reviewed by Cursor Bugbot for commit 095dd2b. Bugbot is set up for automated code reviews on this repo. Configure here.

Environment Config Lease

No active environment config lease.

OS

Status: released
Commit: 095dd2b
Preview: https://os.iterate-preview-2.com
Summary: Preview app released.
Workflow run
Updated: 2026-06-10T19:41:37.977Z

Semaphore

Status: released
Commit: 76e501a
Preview: https://semaphore.iterate-preview-2.com
Summary: Preview app released.
Workflow run
Updated: 2026-06-10T19:41:27.082Z

jonastemplestein and others added 2 commits June 10, 2026 16:02
Every processor's processEvent is now understood as a reconciler: it
reconciles non-serializable runtime state (in-flight LLM calls, debounce
timers, delivery connections) against reduced state, by appending events
or mutating runtime state. This lands the missing trigger for that
reconciliation and fixes two verified agent-wedge bugs.

- New presence facts: stream/subscriber-connected and
  stream/subscriber-disconnected, appended by the stream itself in
  openConnection/close — exactly once per actual open/close, so no
  idempotency keys. One per connection (per subscriptionKey), for
  processor subscriptions and inbound (browser/bridge) subscribe() alike.
  A connected event is always the tail of any batch it shares (appended
  after the replay cursor is fixed), so state-at-event == batch-final
  state and per-event reconciliation checks are safe.

- Core processor homogenized: the live connections map and outbound
  dialing move from the Stream DO into CoreStreamProcessor as instance
  state + processEvent side effects ("configured but not connected →
  dial", triggered by the woken and subscription-configured facts). The
  Stream DO shrinks to log + pump + thin shell; its constructor no longer
  special-cases boot reconciliation — the woken fact drives it.

- Presence roster in core reduced state (connectionsByKey): who is
  attached to this stream right now. woken clears it (all connections
  died with the previous incarnation; survivors re-dial), connected adds,
  disconnected removes. No heartbeats, no staleness.

- stream/processor-registered is gone: processor contract announcements
  ride the connect event's subscriber descriptor and feed processorsBySlug.

- Provider crash recovery (cloudflare-ai + openai-ws) now triggers on
  subscriber-connected instead of incidentally on the next consumed
  batch, and records an explicit llm-request-attempt-failed
  {reason: host-restarted} before re-executing. Both processEventBatch
  overrides deleted. Previously a host crash mid-LLM-request stalled the
  conversation until the next user message.

- Scheduler debounce wedge fixed: the scheduled phase now carries
  scheduledAtOffset, so a fresh incarnation converts a timer-less
  scheduled request into llm-request-requested on connect (same
  idempotency key as the dead timer would have used), and the
  reset-timer path re-arms from durable state instead of silently
  bailing. Previously a crash in the debounce window wedged the agent
  until an interrupt-policy input arrived.

- AgentDurableObject.kill() + project.agents.kill oRPC procedure for
  crash injection; e2e test kills the host both in the debounce window
  and mid-LLM-request and asserts the agent still replies.

BREAKING: existing stream data is incompatible (core state schema and
event vocabulary changed); streams must be reset. Decided explicitly —
no backwards compatibility, prod DB is disposable.

Plan of record: tasks/streams-core-processor-host-homogenization.md
Audit findings: tasks/agents-system-audit-and-reconciler-design.md

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@jonastemplestein jonastemplestein marked this pull request as ready for review June 10, 2026 15:18
Comment thread apps/os/src/domains/agents/stream-processors/cloudflare-ai/implementation.ts Outdated
jonastemplestein and others added 2 commits June 10, 2026 16:40
Conflict resolutions:
- stream-processor-host: keep main's openSubscription/generation-gate
  structure (Stage 1 fixes); the subscriber-connected presence descriptor
  moves into openSubscription, and the processor-registered append stays
  deleted (announcements ride the connect fact). Recovery re-subscriptions
  pass the same host incarnationId — each (re)open genuinely is a new
  connection.
- agent contract/implementation: main's codemode rip replaced
  codemode/tool-provider-registered with agent/capability-noted; combined
  with this branch's StreamPresenceEvents dep, subscriber-connected
  consumption, scheduledAtOffset, and the scheduler reconciliation case.
- stream-review-regressions fake stream now appends a subscriber-connected
  fact on subscribeOutbound like the real Stream DO (the old test relied on
  the deleted processor-registered append for the initial checkpoint write).
- AnyHostedProcessor.contract.events widened structurally (payloadSchema
  member) so main's new test processors satisfy the announcement type.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Addresses the Cursor Bugbot finding on #1460 (parallel recovery duplicate
LLM runs): one batch routinely carries several subscriber-connected events
(an agent host re-handshake appends one per co-hosted processor
subscription), and their blocking reconciles run concurrently under
Promise.all. Each could observe the same dangling started request before
any claimed it, starting duplicate in-flight LLM executions.

The dangling ids are now claimed into #executedLlmRequestIds synchronously,
before the first await; claims still held when a pass fails are released so
a later connected event can retry. Regression tests ingest a two-connect
batch and assert exactly one execution.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Comment thread packages/streams/src/workers/durable-objects/stream.ts
jonastemplestein and others added 2 commits June 10, 2026 18:53
Conflict resolutions: codemode contract deleted (accepted main's deletion;
the agent contract already dropped that dep in the previous merge), and the
crash-recovery e2e test updated to main's itx script style
(async (itx) => itx.chat.sendMessage).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Addresses the Cursor Bugbot finding on #1460: the core reduced-state shape
changed (connectionsByKey roster added, processorsBySlug reshaped around
connect-event announcements) but the version gate still said 3, so an old
KV snapshot would be schema-parsed on boot and throw instead of rebuilding
from the event log. Replaying old logs through the new reducer is safe:
the removed processor-registered events fall through the wildcard default
and only advance the offset counters.

Adds a workerd regression test that plants a v3-shaped snapshot and
asserts boot rebuilds from SQL without throwing.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 4174f8a. Configure here.

Comment thread apps/os/src/domains/agents/stream-processors/agent/contract.ts
jonastemplestein and others added 3 commits June 10, 2026 19:03
The subscriber-connected fact's push frame can arrive during the subscribe
RPC round trip (before the test starts capturing frames) on a real network,
but after it on localhost — the CI staging worker's latency flipped the
order and failed the exact two-frame enumeration. The test's actual claim
is unchanged (no subscriber-originated return traffic; only push/release
frames inbound): the published event's frame is asserted exactly, the
connected fact's frame loosely and only when captured.

Verified against the deployed PR staging worker.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…er specs

Every browser stream page now appends a subscriber-connected fact when its
runtime subscribes (and a disconnected fact when the connection tears down),
so the viewer's event counts shift: fresh streams baseline at 3 events
(created + woken + connected), reloads add a disconnected + connected pair,
and a kill adds woken + connected on the rebooted incarnation. Updated the
11 count-asserting specs accordingly; all 26 pass locally against a dev
worker.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This merge commit carries three things (merged with a dirty tree):

1. Merge of origin/main (itx SSR/loader prefetch, StreamState truth-telling,
   toAfterOffset rename). Conflicts: kept the subscriber descriptors on the
   two inbound subscribe call sites while adopting main's renamed helper.

2. Tidy-up: shared/presence-events.ts is gone. The presence event types,
   StreamSubscriberDescriptor, and ProcessorContractAnnouncement live in the
   core processor contract (contract files are the schema/type layer);
   processors that reconcile on subscriber-connected now dep
   [CoreProcessorContract]. Same theme: the legacy built-in/external-url
   subscriber shapes and their parse-then-drop transform are deleted —
   subscription-configured only accepts callable subscribers now.

3. Port of PR #1471 (prd Slack outage root-cause fixes) onto this branch's
   structure:
   - rpc-lifecycle: onDeliveryError observes batch-delivery rejections and
     sync stub throws (dispose only after settle); the Object.hasOwn guard
     that kept onRpcBroken from wiring for capnweb stubs is gone. Its unit
     tests ported verbatim.
   - CoreStreamProcessor.openConnection drops the connection (reason
     delivery-failed, with the presence fact) and re-dials outbound
     subscribers on delivery failure. The end-to-end abort/re-dial workerd
     regression tests ported verbatim and pass against this structure.
   - Agent scheduler: scheduledOffset is optional (old checkpoints still
     parse) with a history-read fallback, and the llm-request-requested
     handoff append re-arms the timer and retries on failure instead of
     dropping the turn. 1471's snapshot()/processEventBatch re-arm hooks are
     deliberately NOT ported: in this branch's architecture every
     re-handshake delivers a subscriber-connected fact, which is the one
     recovery trigger. Their recovery scenarios are covered by three new
     connected-driven tests (history fallback, no-double-arm with a live
     timer, handoff retry).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The ported M1 fix observed every delivery's result to detect dead stubs —
but pulling a Cap'n Web promise makes the remote send a resolve frame per
batch, so every browser tab started producing per-batch return traffic
(caught by the no-return-traffic wire e2e), and the staging run drowned in
internal errors.

Split the concern by direction, matching where each liveness signal is
reliable: outbound connections (native Workers RPC into subscriber host
DOs, where onRpcBroken can be a pipelined fabrication) pass onDeliveryError
and pay the pull; inbound connections (capnweb via the fronting worker,
where onRpcBroken on the terminated socket is dependable) keep the
zero-return-traffic fast path — retainProcessEventBatch only attaches to
the result when a handler is provided, and disposes synchronously
otherwise. Unit test pins the fast path; the abort/re-dial regressions
still pass.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@jonastemplestein jonastemplestein merged commit 0a54e41 into main Jun 10, 2026
9 checks passed
@jonastemplestein jonastemplestein deleted the dedicated-barometer branch June 10, 2026 19:40
jonastemplestein added a commit that referenced this pull request Jun 10, 2026
The presence-roster work (#1460) moved the delivery pump into
CoreStreamProcessor.openConnection; re-applied the D20 protocol there:
batches carry state read alongside streamMaxOffset, every subscription
gets an initial push, events: false is state-only mode. subscribe()
keeps both new options (events + subscriber) end to end, including the
capnweb RpcTarget override.

Tests reconciled with presence facts (every subscribe/unsubscribe now
appends a fact, so exact batch counts/empty-events assertions raced):
assert on delivered content and per-batch invariants instead.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
jonastemplestein added a commit that referenced this pull request Jun 10, 2026
…ook (useItx) (#1472)

## What

Two halves of one decision: stream subscriptions become the ONE reactive
primitive, and the browser layer collapses onto it — one hook, no query
cache, no SSR, no reconnect machinery. Net **−951 lines** against main,
with the deleted test files making the suite smaller than it was.

### Half 1 — the protocol: subscriptions carry reduced state (DECISIONS
D20)

- **Every batch carries `state`.** The pump (now in
`CoreStreamProcessor.openConnection` after the presence-roster merge)
attaches the core reduced state to every delivery, read in the same
synchronous block as `streamMaxOffset` so the two always correspond. The
OS capability projects it through the exact mapping `getState()` uses
(`coreStateToStreamState` is the single shared projection), so
subscribe-state === getState-state by construction.
- **`events: false` = state-only mode.** Same batches with `events: []`
(empty array, not omitted — shape stability), one delivery per state
advance with missed appends coalesced. Replay without events is
meaningless, so state-only subscriptions are implicitly live-from-now.
- **The initial push.** EVERY subscription immediately receives one
batch (current state + `streamMaxOffset`, folded into the replay batch
when there is one), so a subscriber paints its first render from the
subscription alone — no separate getState call.
- **Kernel surface.** `ItxStream.subscribe(onBatch, { afterOffset,
events? })` plus `ItxStream.onStateChange(cb)` sugar (state-only
subscribe with explicit callback-stub retention).

### Half 2 — the browser layer: ONE hook (DECISIONS D21, supersedes D19
and one-socket-per-tab)

The TanStack-Query-for-itx + SSR + shared-socket architecture was too
complicated. `apps/os/src/itx/use-itx.ts` replaces all of it:

- **`useItx(context?)`** suspends (React 19 `use()` on a per-context
module-singleton promise) until a WebSocket to `/api/itx[/<ctx>]` is
connected, and returns the live `RpcStub<Itx>`.
`getBrowserItx(context?)` is the same singleton for non-hook code.
That's the whole API.
- **No reconnect machinery.** Socket death evicts the map entry;
subscribers re-render via useSyncExternalStore, dial fresh, re-suspend —
the initial push repaints them with current state. No backoff, no offset
resume, no liveness probes. Multiple sockets per tab are fine.
- **No SSR for itx components — the hook THROWS on the server.** A
forever-pending `use()` during streaming SSR would hold the response
stream open (React waits for suspended boundaries); a throw inside a
Suspense boundary streams the fallback and recovers client-side, and
outside one it fails loudly instead of hanging the worker. Consumers sit
under `ssr: false` routes (streams pages), `<ClientOnly>` (the repl's
activity tail), or the admin layout's client-only connect gate.

Consumers rewritten: the stream tree browser is now LIVE (one
`onStateChange` per loaded node, refresh = re-subscribe; react-query
gone), the streams index/detail routes use `useItx` under Suspense
(loader prefetch deleted), breadcrumb popovers fetch childPaths on open
into local state, and `ItxActivityTail` is a kernel `subscribe` from
"start" into component state (500-event cap, offset-deduped
re-subscribe). The repl keeps its own `createBrowserReplSession` — it
needs dispose/reconnect-on-demand semantics the singleton deliberately
lacks.

### Deleted

`apps/os/src/itx/react/` (provider, backoff-reconnect connection, query
bridge, stream-tail multiplexer, `useStreamEvents`, all their tests),
`itx/server.ts` (`getServerItx`), `itx/loader.ts`
(`getLoaderItx`/`prefetchItxQuery`), `lib/itx-queries.ts`, and the
`itx-server-handle` worker test harness (+ its package.json script and
knip entries). `itx/errors.ts` stays (error codes still matter to catch
blocks); its test moved next to it.

### Docs

DECISIONS **D20** + **D21**, `itx-orpc-replacement-plan.md` browser/SSR
sections rewritten to the one-hook model, `tasks/os-orpc-teardown.md`
updated (getServerItx marked superseded; route-conversion guidance now
references `useItx`).

## Notes

- Merged with `main`'s subscriber-presence work (#1460): the D20 pump
changes were re-applied inside `CoreStreamProcessor.openConnection`, and
`subscribe()` carries both new options (`events`, `subscriber`). Tests
that asserted exact batch counts/empty-events were reconciled with
presence facts (every subscribe/unsubscribe appends one) to assert
delivered content and per-batch invariants instead.
- `subscribeOutbound` (processor hosts) shares the pump, so outbound
connections also get state-bearing batches and an initial `events: []`
batch on (re)connect — `StreamProcessor.ingest` no-ops on empty batches.

## Tests

- **Pump level** (`packages/streams` stream.workers.test.ts): state
matches the DO's own reduced state on every batch; immediate initial
batch on live-only subscribe; initial push folds into replay; `events:
false` state-only mode — all green alongside main's presence-roster
tests.
- **Capability loopback** (`pnpm test:itx-stream-subscribe`, 13
passing): batch state === getState shape end-to-end, live-only initial
push, state-only mode, `onStateChange` across real Workers RPC hops.
- **Hook core** (`apps/os/src/itx/use-itx.test.ts`): the connection map
— stable entry per context, eviction + notification on death,
identity-guarded eviction, fresh dial after eviction.
- **e2e** (itx-subscribe.e2e.test.ts, runs in preview CI) unchanged.
- Full: `pnpm typecheck && pnpm lint && pnpm knip && pnpm format && pnpm
test` all green at root after merging main.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Large architectural cutover across streams protocol, OS UI, and
deleted SSR/prefetch paths; live views may stall until refresh if Stream
DOs evict without server-side subscription recovery.
> 
> **Overview**
> **Stream subscriptions** now carry reduced **`state`** on every batch
(same shape as `getState()`), with an **immediate initial push** so UIs
can paint without a separate fetch. **`events: false`** is state-only
mode (live-from-now, coalesced updates). **`ItxStream.onStateChange`**
is the reactive sugar on top.
> 
> **Browser itx** collapses to **`useItx` / `getBrowserItx`**:
per-context WebSocket singletons, Suspense until connected, **no SSR**
(throws on server), **no** TanStack Query bridge, reconnect backoff,
stream-tail multiplexer, or **`getServerItx`** loader prefetch.
**`ItxProvider`** and **`~/itx/react/`** are removed.
> 
> **UI** is rewired: the stream tree uses live **`onStateChange`** per
node; streams routes are **`ssr: false`** + Suspense; breadcrumbs
**`getState`** on popover open; **`ItxActivityTail`** uses kernel
**`subscribe`** from `"start"`. Docs/decisions (**D20**, **D21**) and
tests are updated; the itx-server-handle harness is deleted.
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
3d7754a. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

<!-- CLOUDFLARE_PREVIEW -->
## Environment Config Lease
<!-- CLOUDFLARE_PREVIEW_STATE -->
<!--
{
  "apps": {
    "os": {
      "appDisplayName": "OS",
      "appSlug": "os",
      "status": "deployed",
      "updatedAt": "2026-06-10T20:20:50.857Z",
      "headSha": "3d7754a7838bb5e0da93ea4cf213644ba5559652",
      "message": null,
      "publicUrl": "https://os.iterate-preview-6.com",
"runUrl": "https://github.com/iterate/iterate/actions/runs/27303571586",
      "shortSha": "3d7754a"
    }
  },
  "environmentConfigLease": {
    "dopplerConfig": "preview_6",
    "leasedUntil": 1781126280948,
    "leaseId": "d920039e-82ea-4bbd-b93a-affdf57e3fc1",
    "slug": "preview-6",
    "type": "environment-config-lease"
  }
}
-->
<!-- /CLOUDFLARE_PREVIEW_STATE -->
Lease: `preview-6`
Doppler config: `preview_6`
Type: `environment-config-lease`
Leased until: 2026-06-10T21:18:00.948Z

### OS
Status: deployed
Commit: `3d7754a`
Preview: https://os.iterate-preview-6.com
[Workflow
run](https://github.com/iterate/iterate/actions/runs/27303571586)
Updated: 2026-06-10T20:20:50.857Z
<!-- /CLOUDFLARE_PREVIEW -->

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
jonastemplestein added a commit that referenced this pull request Jun 10, 2026
#1479)

## Streams review — Stages 4–7 (browser robustness, perf, dead code,
docs)

The final batch of the `packages/streams` adversarial review
(`tasks/streams-review-fixes.md`), bundled into one PR per request.
Stages 0/1/3 already shipped (#1455, #1458, #1459); **Stage 2 (lazy
init) was abandoned** — #1460's subscriber-presence model appends a
`subscriber-connected` fact on every `subscribe()`, which fundamentally
conflicts with "don't initialize storage until the first `append()`".
`created`/`woken` stay eager.

All findings were re-verified against current `main` after #1460/#1457
moved the codebase; items those PRs already fixed or made obsolete are
called out below.

### Stage 4 — browser-runtime correctness
- **C1-browser** — the server delivery pump is fire-and-forget for
inbound subscribers and never reports a failed `ingest`, so a browser
mirror that fails to apply a batch silently desyncs forever. The browser
now self-heals: `ingestWithSelfHeal` resubscribes from the persisted
checkpoint with bounded exponential backoff on ingest failure.
- **B1** — connection-epoch guard so a stale connection's late status
callback can't clobber the live connection.
- **B2** — `appendBatch`/`runtimeState` await connection readiness
(`whenStreamReady`) instead of throwing "disposed" during a transient
reconnect; only throw when actually disposed.
- **B3** — the `ROLLBACK` in `stream-db.worker.ts` `batch()` is now in
its own try/catch so it can't mask the original error (which
`withBusyRetry`/`isBusyError` need to see).
- **B4** — reconcile guards on stream incarnation (server `createdAt` +
a `mirror_meta` table): rebuild the mirror on incarnation change rather
than trusting the offset comparison after a `reset()`.
- **B5** — `AbortSignal` on the Web Lock request (released on disposal)
+ the request rejection is surfaced instead of `void`-swallowed.
- **B6** — query lifecycle: arm GC at query creation (not only on
unsubscribe), skip listenerless queries in `#onChange`, and
equality-check before notifying to avoid spurious `useSyncExternalStore`
churn.

### Stage 5 — performance
- **P1** — `browser-event-feed` O(n²) write amplification: coalesce to
one op per `local_index` (was a cumulative op per event, each
re-serializing the whole accumulated array).
- **P2** — bound group rows with `MAX_GROUP_EVENTS = 200` so one
dominant event type can't grow a single blob unboundedly.
- **P3** — drop the per-event full-state `stateSchema.parse` in core
`reduce`; state is validated only at the KV/recovery trust boundary now.
- **P4** — already fixed by #1460 (subscribe override forwards
`eventTypes`; filtering is server-side). No change.

### Stage 6 — dead code / elegance (~247 lines deleted)
- **E1** dead exports in `shared/stream-processors.ts`; **E2**
`waitForOpen`; **E3** unreachable circuit-breaker `paused` guard; **E5**
dead `waitForEvent`/`messageInbox.error` machinery in `subscription.ts`;
**E6** redundant circuit-breaker `consumes` entries.
- **E4 obsolete** (#1460 removed `processor-registered`; the shared
`circuit-breaker-types` is a live, separately-imported hierarchy, not
dead code). **E7 declined** — post-#1460 the subscribe override
genuinely needs to retain the client callback / wire `onRpcBroken`, so
folding it into the generic `makeRpcTargetClass` isn't worth it.

### Stage 7 — docs
- **D1** `design.md`: status banner, fixed 1-based offsets + core-event
examples, the callable subscriber spec, SQLite/512KB-chunking storage,
live-tail `replayAfterOffset` default, OPFS mirror; superseded banners
on the never-shipped `implementProcessor`/`connectStream` sections.
- **D2** `README.md`: real browser export
(`withStreamConnectionFromBrowser`), sync `Disposable`, `?path=` route
map; new "Append & subscription semantics" section (D5).
- **D3** ADR 0001 superseded with the shipped per-`(namespace, path,
slug)` singleton model.
- **D4** remaining comment drift (`beforeAppend`→`validateAppend`,
`afterAppendBatch`→`processEventBatch`).

### Testing
`pnpm typecheck`, `pnpm lint`, node (79) + workers (15) suites, and
example-app typecheck all green. Example-app Playwright e2e run locally
against a clean `vite dev` (real Stream DO via Miniflare): **26
passed**. `origin/main` merged in (no `packages/streams` overlap; the
new project-config stream-processor consumers don't touch any deleted
exports).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Touches browser reconnect, mirror discard, and ingest self-heal paths
that affect live viewers and local SQLite mirrors; server stream DO
behavior is mostly unchanged aside from core reduce hot-path
optimization.
> 
> **Overview**
> Closes out **Stages 4–7** of the `packages/streams` adversarial
review: browser mirror reliability, feed/core hot-path perf, dead-code
removal, and doc alignment. Stage 2 (lazy stream init) stays
**abandoned** and is only noted in `tasks/streams-review-fixes.md`.
> 
> **Browser runtime** (`stream-browser-store.ts` and friends) is
reworked so the OPFS mirror can recover instead of silently drifting.
Inbound delivery is fire-and-forget, so failed `ingest` now triggers
**`ingestWithSelfHeal`** (resubscribe from the last checkpoint with
capped exponential backoff). Reconnects go through one
**`scheduleReconnect`** path with a **connection epoch** so stale
WebSocket callbacks cannot tear down the replacement connection.
**`appendBatch` / `runtimeState`** use **`callWhenReady`** /
**`whenStreamReady`** during transient reconnects instead of throwing
“disposed”. Mirror trust uses server **`createdAt`** as incarnation
identity via new **`mirror_meta`** helpers in
**`stream-browser-db.ts`**. Web Lock **`release()`** aborts pending lock
requests; the SQLite worker preserves the original batch error if
`ROLLBACK` fails; reactive queries GC earlier, skip unobserved
refreshes, and avoid notify when snapshots are unchanged.
> 
> **Performance:** `browser-event-feed` **`planFeedOps`** coalesces one
SQL op per group row and caps groups at **`MAX_GROUP_EVENTS` (200)**.
Core **`reduce`** no longer exit-parses the full `stateSchema` on every
appended event.
> 
> **Cleanup:** ~247 lines removed (unused `stream-processors` exports,
`waitForOpen`, dead subscription `waitForEvent` machinery, trimmed
circuit-breaker `consumes` and an unreachable guard). **Docs:** `README`
(real browser API, `?path=` routes, append/subscription semantics),
`design.md` and ADR 0001 marked superseded where they diverge from
shipped code.
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
f52f305. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

<!-- CLOUDFLARE_PREVIEW -->
## Environment Config Lease
<!-- CLOUDFLARE_PREVIEW_STATE -->
<!--
{
  "apps": {
    "os": {
      "appDisplayName": "OS",
      "appSlug": "os",
      "status": "deployed",
      "updatedAt": "2026-06-10T22:02:58.860Z",
      "headSha": "f52f305a9f1ab43c9749f80e30bcd96705cbce59",
      "message": null,
      "publicUrl": "https://os.iterate-preview-3.com",
"runUrl": "https://github.com/iterate/iterate/actions/runs/27309066809",
      "shortSha": "f52f305"
    }
  },
  "environmentConfigLease": {
    "dopplerConfig": "preview_3",
    "leasedUntil": 1781132382486,
    "leaseId": "52648479-bd91-469d-91e5-a3338534feae",
    "slug": "preview-3",
    "type": "environment-config-lease"
  }
}
-->
<!-- /CLOUDFLARE_PREVIEW_STATE -->
Lease: `preview-3`
Doppler config: `preview_3`
Type: `environment-config-lease`
Leased until: 2026-06-10T22:59:42.486Z

### OS
Status: deployed
Commit: `f52f305`
Preview: https://os.iterate-preview-3.com
[Workflow
run](https://github.com/iterate/iterate/actions/runs/27309066809)
Updated: 2026-06-10T22:02:58.860Z
<!-- /CLOUDFLARE_PREVIEW -->

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
jonastemplestein added a commit that referenced this pull request Jun 10, 2026
…1483)

Steps B and C of the agents roadmap (after #1460 and #1475): the LLM
request handoff stops embedding the conversation, and the two LLM
request processors become deliberate, tidy siblings sharing pure
helpers.

## Request-by-reference (no more embedded body)

`agent/llm-request-requested` used to carry the full chat request. Since
the conversation grows with the stream, every request stored a complete
copy of it — O(N²) stream growth. The `llmRequestId` already IS the
requested event's offset, so the body is redundant: providers can
rebuild it from committed history.

Before:

```ts
// agent processor, on handoff
payload: {
  model,
  runOpts,
  body: buildLlmChatRequest(stateAtRequest), // full conversation, every time
}
```

After:

```ts
// agent processor: just the reference + how to run it
payload: { model: stateAtRequest.llmConfig.model, runOpts: stateAtRequest.llmConfig.runOpts }

// provider, at execution time (both cloudflare-ai and openai-ws):
// Request-by-reference: the requested event carries no body; rebuild the
// chat request from committed history up to the request's own offset.
const body = buildAgentLlmRequestBody({
  events: await this.deps.readStreamEvents(),
  llmRequestId, // === the requested event's offset
});
```

The rebuild reduces history `events.filter((e) => e.offset <=
llmRequestId)` through the same `reduceAgentEvents` +
`buildLlmChatRequest` pair the agent itself uses, so the model-visible
context is reproducible from the stream forever — including for
crash-recovery retries, which re-derive exactly what the dead
incarnation would have sent.

**Breaking change** to the `agent/llm-request-requested` payload (and
`cloudflare-ai/llm-request-started`, which also embedded the body). No
backcompat bridge; prd gets redeployed.

## Providers as siblings, not an abstraction

`cloudflare-ai` and `openai-ws` were ~500/790-line copy-pasted state
machines. Rather than an abstract base class, they're now deliberate
siblings: same method names, same control flow, same comments where the
logic matches — each keeps its own event types (which scales better as
providers diverge) and its own transport (one `AI.run()` call vs a
shared Responses WebSocket). What they share are four stateless
functions in `llm-request-helpers.ts`:

```ts
buildAgentLlmRequestBody({ events, llmRequestId });     // the request-by-reference rebuild
isAgentLlmRequestStillCurrent({ events, llmRequestId }); // stale-output guard before agent-visible appends
findDanglingLlmRequestIds({ requests, executedLlmRequestIds }); // crash-recovery candidates
parseLlmRequestRequestedEventAt({ events, llmRequestId });      // typed re-derivation for recovery
```

Both implementation files open with the same note: *"When you fix
something here, check whether the sibling needs the same fix."*

## Bounded execution-claims set

`#executedLlmRequestIds` (the instance-scoped set distinguishing "this
incarnation is executing it" from "a dead one was") previously only
grew. Both siblings now drop a claim when the request's own completed
fact reduces back:

```ts
case "events.iterate.com/cloudflare-ai/llm-request-completed":
  // The completed fact is durable; this instance can never need to
  // (re-)execute this request again, so drop the claim — this is what
  // keeps the executed set bounded.
  this.#executedLlmRequestIds.delete(event.payload.llmRequestId);
  return;
```

## Tests

- New in both provider suites: *rebuilds the chat request from history
up to the request's offset* — history rows after the requested event's
offset are excluded from what the model sees.
- The agent handoff test now asserts the requested payload is `{ model,
runOpts }` with no `body`.
- Provider fixtures lost their embedded bodies; conversation content now
flows through `readStreamEvents` history, matching production.

`pnpm typecheck && pnpm lint && pnpm format && pnpm test` all green.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Breaking event payload shapes for `llm-request-requested` and provider
started events; correctness now depends on history reads and
offset-bounded rebuild at execution time, including crash recovery
paths.
> 
> **Overview**
> **Request-by-reference LLM handoff** stops embedding the full
conversation on `agent/llm-request-requested` (and drops `body` from
`cloudflare-ai/llm-request-started`). Handoffs are now `{ model, runOpts
}` only; `llmRequestId` stays the requested event’s offset, and
**cloudflare-ai** / **openai-ws** rebuild model input at execution via
shared `llm-request-helpers.ts` (`buildAgentLlmRequestBody` reduces
history with `offset <= llmRequestId`).
> 
> The two provider processors are aligned as **siblings** (shared
helpers for rebuild, still-current checks, dangling recovery, typed
re-parse) instead of duplicated logic. **`#executedLlmRequestIds`** now
drops entries when each request’s provider **completed** event reduces,
so long-lived instances don’t grow the claim set forever.
> 
> Tests assert handoff payloads have no `body` and that providers
exclude stream rows after the request offset when building chat input.
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
83ed554. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

<!-- CLOUDFLARE_PREVIEW -->
## Environment Config Lease
<!-- CLOUDFLARE_PREVIEW_STATE -->
<!--
{
  "apps": {
    "os": {
      "appDisplayName": "OS",
      "appSlug": "os",
      "status": "deployed",
      "updatedAt": "2026-06-10T22:06:49.277Z",
      "headSha": "83ed55497ec2f4d04aeb172f36d2dddd34b8dcfc",
      "message": null,
      "publicUrl": "https://os.iterate-preview-7.com",
"runUrl": "https://github.com/iterate/iterate/actions/runs/27309177659",
      "shortSha": "83ed554"
    }
  },
  "environmentConfigLease": {
    "dopplerConfig": "preview_7",
    "leasedUntil": 1781132576419,
    "leaseId": "7f1ba814-5c43-4cdd-ae72-2f569050a9d5",
    "slug": "preview-7",
    "type": "environment-config-lease"
  }
}
-->
<!-- /CLOUDFLARE_PREVIEW_STATE -->
Lease: `preview-7`
Doppler config: `preview_7`
Type: `environment-config-lease`
Leased until: 2026-06-10T23:02:56.419Z

### OS
Status: deployed
Commit: `83ed554`
Preview: https://os.iterate-preview-7.com
[Workflow
run](https://github.com/iterate/iterate/actions/runs/27309177659)
Updated: 2026-06-10T22:06:49.277Z
<!-- /CLOUDFLARE_PREVIEW -->

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
jonastemplestein added a commit that referenced this pull request Jun 10, 2026
The last batch of outstanding work from the agents audit (after #1460,
#1475, #1483): the section-3 dead-code sweep and the UI shrink. Net
**−188 lines** (−503/+315).

## One agent setup form

`agents/new.tsx` and `agents/new-preset.tsx` were ~270-line,
~90%-identical forms (provider/model/runOpts/system-prompt/custom-events
fields plus the YAML preview pane). They now share one
`AgentSetupFormPage` component; each route keeps only what genuinely
differs — its path normalization, its preview builder, and its submit
mutation:

```tsx
<AgentSetupFormPage
  title="New Agent"
  pathLabel="Agent path"
  buildPreview={(values) => buildPreviewEvents({ projectId: project.id, values })}
  submitIdleLabel="Create agent"
  isPending={createAgent.isPending}
  onSubmit={({ preview }) => createAgent.mutate(preview)}
  ...
/>
```

The routes drop from ~270 lines each to ~125, and the next form tweak
happens once instead of twice.

## Legacy Slack preset filter deleted

`agent-presets.ts` carried `isLegacyGeneratedSlackOpenAiPreset` — a
content-sniffing filter that suppressed an old auto-generated
`/agents/slack` preset by matching its system-prompt text. Checked prd
before deleting: the iterate project has **zero** stored presets, so the
filter guards nothing. The intentional behavior next to it (Slack agents
never inherit the generic `/agents` preset) stays, with its tests.

## Stale migration headers

Every processor under
`apps/os/src/domains/{agents,slack}/stream-processors/` opened with
"Migrated from `packages/shared/src/stream-processors/...`" — a
directory that no longer exists. Those provenance paragraphs are gone.
Where they carried a live constraint, the constraint survives in its own
words:

```ts
// Appended event types, payload shapes, and idempotency-key derivations
// (`agent/<key>@<sourceOffset>`) are stable wire formats — changing them
// breaks dedup against events already committed to streams.
```

## Audit bug status (no code change needed)

- **2.4 zombie `pendingTriggerCount`** — fixed by construction since
#1460: the reconcilers guarantee dangling requests reach a terminal
event, and a queued count only ever represents real user inputs whose
follow-up turn rebuilds from full history.
- **2.3 cancellation check-then-act race** — still a theoretical window;
closing it needs conditional appends (append-if-still-current at the
stream layer), which is its own design, not a cleanup.

`pnpm typecheck && pnpm lint && pnpm format && pnpm test` all green.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Low Risk**
> Mostly UI deduplication and comment edits; Slack preset selection is
slightly broader if old auto-generated presets exist in storage, which
the PR assumes is empty.
> 
> **Overview**
> Introduces **`AgentSetupFormPage`** so **New Agent** and **New Agent
Preset** share one form (provider, model, run options, system prompt,
custom events YAML, live preview). Each route only keeps path handling,
its preview builder, and submit logic—roughly halving page size.
> 
> **Removes** the legacy **`isLegacyGeneratedSlackOpenAiPreset`** filter
and its test from `agent-presets.ts`. Slack agents still only match
Slack-scoped presets; stored `/agents/slack` presets are no longer
sniffed and ignored by prompt text.
> 
> **Trims** stale “migrated from `packages/shared`…” headers across
agent and Slack stream-processor modules, leaving short notes where wire
formats and idempotency keys must stay stable.
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
c9cf738. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

<!-- CLOUDFLARE_PREVIEW -->
## Environment Config Lease
<!-- CLOUDFLARE_PREVIEW_STATE -->
<!--
{
  "apps": {
    "os": {
      "appDisplayName": "OS",
      "appSlug": "os",
      "status": "deployed",
      "updatedAt": "2026-06-10T22:24:45.981Z",
      "headSha": "c9cf738e2dad4877de8286cf370cf50cedd81eeb",
      "message": null,
      "publicUrl": "https://os.iterate-preview-4.com",
"runUrl": "https://github.com/iterate/iterate/actions/runs/27310094398",
      "shortSha": "c9cf738"
    }
  },
  "environmentConfigLease": {
    "dopplerConfig": "preview_4",
    "leasedUntil": 1781133683378,
    "leaseId": "51ccdfaf-05f8-4f1c-a8ed-1b26dec7500b",
    "slug": "preview-4",
    "type": "environment-config-lease"
  }
}
-->
<!-- /CLOUDFLARE_PREVIEW_STATE -->
Lease: `preview-4`
Doppler config: `preview_4`
Type: `environment-config-lease`
Leased until: 2026-06-10T23:21:23.378Z

### OS
Status: deployed
Commit: `c9cf738`
Preview: https://os.iterate-preview-4.com
[Workflow
run](https://github.com/iterate/iterate/actions/runs/27310094398)
Updated: 2026-06-10T22:24:45.981Z
<!-- /CLOUDFLARE_PREVIEW -->

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
jonastemplestein added a commit that referenced this pull request Jun 11, 2026
Post-merge grooming after the agents workstream landed (#1460, #1475,
#1483, #1484). Grooming rules (docs/tasks-grooming.md) say tasks are
deleted when done:

- **Deleted** `tasks/streams-core-processor-host-homogenization.md` —
the plan of record for what shipped in #1460.
- **Deleted** `tasks/agents-system-audit-and-reconciler-design.md` — the
audit knowledge dump; every verified bug and design direction in it is
now either shipped or carried by a live task file.
- **Updated** the two deferred follow-ups
(`streams-core-clock-durable-timers.md`,
`streams-event-kinds-metadata.md`) to drop their `dependsOn`/background
references to the deleted docs, pointing at the merged PRs instead.
- **Added** `tasks/streams-conditional-appends.md` — the one audit
finding that survived everything: the check-then-act window between a
provider's still-current check and its `agent/output-added` append.
Backlog, with the conditional-append direction written down so it isn't
lost with the audit doc.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Low Risk**
> Documentation-only changes under `tasks/` with no runtime or API
impact.
> 
> **Overview**
> **Grooms the `tasks/` backlog** after agents/streams work landed in
#1460 and related PRs, per `docs/tasks-grooming.md` (delete tasks when
done).
> 
> **Removes** the shipped plan-of-record
(`streams-core-processor-host-homogenization.md`) and the umbrella
audit/knowledge dump (`agents-system-audit-and-reconciler-design.md`),
since their content is either merged or split elsewhere.
> 
> **Refreshes** deferred follow-ups:
`streams-core-clock-durable-timers.md` and
`streams-event-kinds-metadata.md` drop `dependsOn` on deleted tasks and
cite PR #1460 in background instead of dead links.
> 
> **Adds** `streams-conditional-appends.md` (backlog) to capture the
remaining audit item—the LLM output **check-then-act** race—and the
direction (stream-level conditional append / CAS), so it isn’t lost with
the audit doc.
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
d9b9d7f. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant