Rebuild voice agent processor on stream host by jonastemplestein · Pull Request #1349 · iterate/iterate

jonastemplestein · 2026-05-19T15:11:15Z

What changed

Stream-based realtime voice agent on the stream-processor host, now as a single unified voice-agent processor:

One processor contract + implementation for Gemini Live, OpenAI Realtime, and Grok Realtime; the backend is selected from setup-configured state. Provider differences live in one endpoint table (URL, headers, session setup message, message handler).
Six generic provider-* audit/status events with a provider payload field replace the previous 27 per-provider event types. consumes is trimmed to the four events the processor acts on, so its own audit appends no longer round-trip through ingest.
Provider message audit events redact large strings (base64 PCM), so audio is stored exactly once per direction. Per-connection promise chains serialize message handling and appends, guaranteeing output frames land in stream order.
Gemini goAway surfaces as a going-away status event; provider error messages are recorded without tearing down the session; agent/input-added is referenced via processorDeps instead of being redeclared.
Gemini "required" messageAgent tool choice is enforced through the system instruction — Gemini Live v1beta rejects toolConfig anywhere in BidiGenerateContentSetup (close 1007), so the previous API-level approach silently killed sessions.
Retired voice-agent/<provider> slugs map to a no-op processor so pre-existing stream subscriptions don't error; new streams subscribe only the unified slug.
Browser console: resubscribes with backoff from the last seen offset, never awaits playback or AudioContext.resume() inside the subscription loop, batches mic appends behind a bounded drop-oldest queue, and tracks played offsets with a monotonic counter.
Audio worklets: allocation-free ring buffers, box low-pass filtering for mic downsampling (no more aliasing decimation), and underruns counted only when audio resumes shortly after a drain.
Voice code agent uses the standard agent workspace and chat tool; it replies into the voice stream by appending input-text-appended events from codemode.
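
The per-connection promise chain mentioned above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `SerialQueue`, `appendFrame`, and the delays are hypothetical names and values chosen for the example.

```typescript
/** Runs queued async tasks one at a time, in the order they were enqueued. */
class SerialQueue {
  private tail: Promise<void> = Promise.resolve();

  enqueue<T>(task: () => Promise<T>): Promise<T> {
    const result = this.tail.then(task);
    // Keep the chain usable after a rejection; the caller still sees the error via `result`.
    this.tail = result.then(
      () => {},
      () => {},
    );
    return result;
  }
}

// Hypothetical usage: one queue per provider connection.
const appended: number[] = [];
const queue = new SerialQueue();

// Stand-in for a stream append whose latency differs per frame.
function appendFrame(frame: number, delayMs: number): Promise<void> {
  return new Promise((resolve) =>
    setTimeout(() => {
      appended.push(frame);
      resolve();
    }, delayMs),
  );
}

// Frame 1 is the slowest append, yet it still lands first because tasks run serially.
const done = Promise.all([
  queue.enqueue(() => appendFrame(1, 30)),
  queue.enqueue(() => appendFrame(2, 10)),
  queue.enqueue(() => appendFrame(3, 0)),
]);
```

Without the chain, the three appends would complete in order of latency (3, 2, 1) rather than arrival order.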

Why

Proves that a voice client can append PCM input frames to a stream, have a realtime voice backend process them, and receive PCM output frames back through the same stream — with one canonical contract instead of per-provider forks.

Validation

pnpm typecheck (repo-wide), pnpm lint, pnpm format
packages/shared test suite incl. new coverage: per-provider audio forwarding, audit redaction, output frame ordering, speaker-buffer-clear, goAway, resampling, reducers (19 tests)
apps/os unit tests (27 files)
Live smoke test against all three real provider APIs (text in → PCM out, in-memory stream): connected/ready, speech-level audio (RMS 1700–3900), correct transcripts, ordered frames, redacted audits with no audio leakage; full messageAgent handoff loop verified on Gemini (tool call → agent/input-added → code-agent reply → spoken audio)
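
For reference, audit redaction of large strings could look something like the sketch below. The threshold, the marker format, and the name `redactLargeStrings` are assumptions for illustration; the source only states that large strings (base64 PCM) are redacted from provider message audit events.

```typescript
// Hypothetical threshold: longer strings are treated as bulk data such as base64 PCM.
const MAX_AUDIT_STRING_LENGTH = 256;

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

/** Returns a deep copy of `value` with every oversized string replaced by a short marker. */
function redactLargeStrings(value: Json, maxLength: number = MAX_AUDIT_STRING_LENGTH): Json {
  if (typeof value === "string") {
    return value.length > maxLength ? `[redacted ${value.length} chars]` : value;
  }
  if (Array.isArray(value)) {
    return value.map((item) => redactLargeStrings(item, maxLength));
  }
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: Json } = {};
    for (const [key, child] of Object.entries(value)) {
      out[key] = redactLargeStrings(child, maxLength);
    }
    return out;
  }
  // Numbers, booleans, and null pass through unchanged.
  return value;
}
```

A test along the lines of "no audio leakage" can then assert that no string in the audited payload exceeds the threshold.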

Environment Config Lease

Lease: preview-6
Doppler config: preview_6
Type: environment-config-lease
Leased until: 2026-06-10T12:05:46.980Z

OS

Status: deployed
Commit: 2a894e2
Preview: https://os.iterate-preview-6.com
Workflow run
Updated: 2026-06-10T11:08:10.672Z

Semaphore

Status: deployed
Commit: 2a894e2
Preview: https://semaphore.iterate-preview-6.com
Workflow run
Updated: 2026-06-10T11:08:05.650Z

jonastemplestein · 2026-06-10T10:58:34Z

Rebuilt this PR from current origin/main and folded in the follow-up from #1351. The voice provider adapter now runs as hosted processors on AgentDurableObject via the new stream processor host instead of the old standalone stream processor DO split.

Verified locally:
- pnpm --dir apps/os typecheck
- pnpm --dir apps/os exec vitest run src/domains/agents/stream-processors/voice-agent/implementation.test.ts

Collapse the three per-provider voice processors into one voice-agent processor that picks its backend from setup-configured state:

- Replace 27 per-provider audit event types with six generic provider-* events carrying a provider field; trim consumes to the four events the processor acts on so its own audit appends no longer re-enter ingest.
- Redact large strings (base64 PCM) from provider message audit events; audio now lives exactly once per direction in the stream.
- Serialize provider message handling and stream appends per connection so output frames land in stream order, and serialize input forwarding without blocking sends behind audit append round trips.
- Surface Gemini goAway as a going-away status event; record provider error messages without tearing down the session; reference the agent contract via processorDeps instead of redeclaring agent/input-added.
- Map retired voice-agent/<provider> slugs to a no-op processor so old stream subscriptions stop erroring; new streams subscribe only the unified slug.

Console: resubscribe with backoff from the last seen offset, stop awaiting playback (and AudioContext.resume) inside the subscription loop, batch mic appends behind a bounded drop-oldest queue with a dropped-frames metric, and track played offsets with a monotonic counter.

Worklets: ring buffers instead of push/splice arrays on the audio thread, box low-pass filtering for mic downsampling instead of aliasing nearest-neighbor decimation, and underruns counted only when audio resumes shortly after a drain.

Tests: port to the unified processor and add coverage for audio forwarding per provider, audit redaction, output frame ordering, speaker-buffer-clear, goAway, resampling, and reducers.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
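
To illustrate the difference between box low-pass downsampling and nearest-neighbor decimation, here is a minimal sketch. The function name and the sample rates in the usage note are assumptions; the worklet in this PR may differ (for example, it is described as allocation-free, which this sketch is not).

```typescript
/**
 * Downsamples by averaging every window of input samples that maps to one
 * output sample (a box filter), instead of picking a single sample per window.
 */
function downsampleBox(input: Float32Array, inputRate: number, outputRate: number): Float32Array {
  if (outputRate > inputRate) {
    throw new Error("downsampleBox only reduces the sample rate");
  }
  const ratio = inputRate / outputRate;
  const outputLength = Math.floor(input.length / ratio);
  const output = new Float32Array(outputLength);
  for (let i = 0; i < outputLength; i++) {
    const start = Math.floor(i * ratio);
    // Always average at least one sample, and never read past the end of the input.
    const end = Math.min(input.length, Math.max(start + 1, Math.floor((i + 1) * ratio)));
    let sum = 0;
    for (let j = start; j < end; j++) sum += input[j];
    output[i] = sum / (end - start);
  }
  return output;
}
```

With a hypothetical 48 kHz to 24 kHz conversion, an input alternating between +1 and -1 (a tone at the input Nyquist frequency) averages to silence, whereas taking every second sample would turn it into a constant full-scale offset.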

Live smoke testing showed Gemini Live v1beta closes the socket with 1007 ("Unknown name toolConfig at 'setup'") for any function-calling config in BidiGenerateContentSetup — the field has never been accepted, so the required messageAgent option silently killed Gemini sessions. Enforce it through the system instruction instead, which the live API honors. Verified against the real Gemini Live, OpenAI Realtime, and Grok Realtime APIs: text in -> speech-level PCM out with correct transcripts, ordered frames, redacted audit events, and the full messageAgent handoff loop (tool call -> agent/input-added -> code-agent reply -> spoken audio). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
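
A rough sketch of enforcing the tool choice through the system instruction rather than through a setup field is shown below. The directive wording, `ToolChoice`, and `buildSystemInstruction` are hypothetical; the source does not quote the actual instruction text or show how the setup message is assembled.

```typescript
type ToolChoice = "auto" | "required";

// Hypothetical wording, for illustration only.
const REQUIRED_TOOL_DIRECTIVE =
  "You must call the messageAgent tool for every user request instead of answering directly.";

/** Appends the directive to the prompt only when the tool choice is "required". */
function buildSystemInstruction(basePrompt: string, toolChoice: ToolChoice): string {
  if (toolChoice !== "required") return basePrompt;
  return `${basePrompt.trimEnd()}\n\n${REQUIRED_TOOL_DIRECTIVE}`;
}
```

The point of the design is that nothing tool-choice related is added to the setup message itself, so the provider has no unknown field to reject.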

Voice agent streams now get the standard agent workspace and chat tool behavior: drop the voice-specific ctx.chat.sendMessage rerouting and the appendVoiceAgentTextInput helper (the code agent appends voice-agent text input events directly from codemode), and build voice code-agent setup events from the default preset instead of threading baseEvents through.

cursor

Cursor Bugbot has reviewed your changes and found 4 potential issues.

Reviewed by Cursor Bugbot for commit 67353f3.

cursor · 2026-06-10T13:56:28Z

+        stopCapture();
+        setProviderStatus("Stream paused");
+      }
+    } finally {


Failed mic batch dropped

Medium Severity

In flushInputFrames, pending input events are removed from the queue before appendBatch completes. If the request fails, those frames are not put back, so mic audio for that batch is lost until the user speaks again.
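
One possible way to address this, sketched under assumptions: put the failed batch back at the front of the queue and let the drop-oldest bound apply afterwards. `InputFrameQueue`, `MAX_QUEUED_FRAMES`, and the `appendBatch` callback signature are illustrative and not taken from the PR.

```typescript
// Hypothetical capacity; the source only says the queue is bounded and drops the oldest frames.
const MAX_QUEUED_FRAMES = 4;

class InputFrameQueue<T> {
  private frames: T[] = [];
  droppedFrames = 0;

  get size(): number {
    return this.frames.length;
  }

  push(frame: T): void {
    this.frames.push(frame);
    this.trim();
  }

  /** Sends everything queued; on failure the batch is restored ahead of newer frames. */
  async flush(appendBatch: (batch: T[]) => Promise<void>): Promise<boolean> {
    if (this.frames.length === 0) return true;
    const batch = this.frames;
    this.frames = [];
    try {
      await appendBatch(batch);
      return true;
    } catch {
      // Frames captured during the failed request stay behind the restored batch.
      this.frames = [...batch, ...this.frames];
      this.trim();
      return false;
    }
  }

  /** Enforces the bound by discarding the oldest frames and counting them. */
  private trim(): void {
    const overflow = this.frames.length - MAX_QUEUED_FRAMES;
    if (overflow > 0) {
      this.frames.splice(0, overflow);
      this.droppedFrames += overflow;
    }
  }
}
```

The bound still protects against unbounded growth during a long outage, but a single transient failure no longer loses the batch.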

cursor · 2026-06-10T13:56:28Z

+    lastPlayedOffsetRef.current = event.offset;
+
+    const payload = parseAudioPayload(event.payload);
+    if (!payload || payload.sampleRate !== VOICE_AGENT_OUTPUT_SAMPLE_RATE) return;


Skipped audio marks offset played

Low Severity

playOutputEvent advances lastPlayedOffsetRef before validating the frame payload. Invalid or wrong-rate frames are never played but are treated as already handled, so they cannot be retried if the same offset is seen again.
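
A minimal sketch of the ordering this comment suggests: validate first, advance the offset only once the frame is accepted. The payload shape, the sample-rate value, and `OutputPlayer` are assumptions; only the names `parseAudioPayload` and the offset tracking come from the diff, and the stand-in parser here is not the PR's.

```typescript
// Assumed value for illustration; the PR uses VOICE_AGENT_OUTPUT_SAMPLE_RATE.
const OUTPUT_SAMPLE_RATE = 24000;

interface OutputEvent {
  offset: number;
  payload: unknown;
}

interface AudioPayload {
  sampleRate: number;
  pcmBase64: string;
}

// Hypothetical stand-in for the PR's parseAudioPayload.
function parseAudioPayload(payload: unknown): AudioPayload | null {
  if (typeof payload !== "object" || payload === null) return null;
  const candidate = payload as Record<string, unknown>;
  if (typeof candidate.sampleRate !== "number" || typeof candidate.pcmBase64 !== "string") {
    return null;
  }
  return { sampleRate: candidate.sampleRate, pcmBase64: candidate.pcmBase64 };
}

class OutputPlayer {
  lastPlayedOffset = -1;
  played: number[] = [];

  /** Returns true only when the frame was accepted for playback. */
  handle(event: OutputEvent): boolean {
    if (event.offset <= this.lastPlayedOffset) return false; // already played
    const payload = parseAudioPayload(event.payload);
    if (!payload || payload.sampleRate !== OUTPUT_SAMPLE_RATE) return false; // offset untouched
    this.played.push(event.offset); // stand-in for scheduling real playback
    this.lastPlayedOffset = event.offset;
    return true;
  }
}
```

Whether skipped frames should be retried at all is a product decision; the sketch only shows that the two behaviours can be separated.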

cursor · 2026-06-10T13:56:28Z

-        typeof systemPrompt === "string" && !systemPrompt.includes("ctx.streams.append({ event:")
-      );
-    });
+    return input.existingEvents.some((event) => event.type === input.event.type);


Default prompt upgrade blocked

Medium Severity

hasEquivalentDefaultSetupEvent now treats any existing system-prompt-updated event as sufficient, so ensureAgentSetupEvents skips appending the current default system prompt when an older prompt is already on the stream.
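
One way the equivalence check could take the payload into account is sketched below. The event shape and the JSON comparison are assumptions for illustration; the real function takes an `input` object, and the intended upgrade semantics (for example, whether a user-customised prompt should block the default) are not stated in the source.

```typescript
interface SetupEvent {
  type: string;
  payload: unknown;
}

/**
 * Treats a setup event as already present only if an existing event has the
 * same type and the same payload. JSON.stringify is key-order sensitive, so
 * this assumes payloads are built the same way each time.
 */
function hasEquivalentSetupEvent(existingEvents: SetupEvent[], event: SetupEvent): boolean {
  const wanted = JSON.stringify(event.payload);
  return existingEvents.some(
    (existing) => existing.type === event.type && JSON.stringify(existing.payload) === wanted,
  );
}
```

With this check, a stream that only carries an older default prompt would still receive the current one.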

cursor · 2026-06-10T13:56:29Z

+          const event = Event.parse(value);
+          if (event.type === OUTPUT_AUDIO_FRAME_EVENT_TYPE) {
+            await playOutputEvent(event);
+          }


POC stream blocks playback

Medium Severity

The voice POC event loop awaits playOutputEvent for each output frame, and ensureOutputAudio awaits AudioContext.resume(). Under autoplay policy that can stall the subscription and delay or stop further stream events.

Additional Locations (1)

apps/os/src/routes/_app/orgs/$organizationSlug/projects/$projectSlug/voice-agent-poc.tsx#L144-L148
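
A sketch of the non-blocking pattern the PR description applies to the browser console, which would also address this comment: playback is chained so frames keep their order, but the subscription loop never awaits it. `consume` and the numeric events are hypothetical stand-ins for the real stream subscription and frame events.

```typescript
/**
 * Reads every event from the stream without waiting for playback.
 * Playback calls are chained so they still run one after another, in order.
 */
async function consume(
  events: AsyncIterable<number>,
  play: (event: number) => Promise<void>,
): Promise<{ received: number[]; playback: Promise<void> }> {
  const received: number[] = [];
  let playback: Promise<void> = Promise.resolve();
  for await (const event of events) {
    received.push(event);
    // Deliberately not awaited: a stalled AudioContext.resume() must not stall the loop.
    playback = playback.then(() => play(event)).catch(() => {});
  }
  return { received, playback };
}

// Hypothetical three-frame stream for demonstration.
async function* threeFrames(): AsyncGenerator<number> {
  yield 1;
  yield 2;
  yield 3;
}
```

Even if `play` never resolves (as can happen under autoplay policy), the loop still receives all three frames.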

Replaces the `merge-to-main-slack` workflow (one Slack message per merged PR, noisy on busy days) with a workflow that maintains **at most one message per day** in `#ci`: a one-line PR dashboard summary, with the full per-PR breakdown in a single threaded reply. Both are created on the first PR event of the day and updated in place after that.

Channel message:

> **PR dashboard 10th June**: 51 merged · 9 closed without merging · 4 opened · 2 older still open (details in thread)

Threaded reply (rendered from real data):

> **Merged:**
> • [#1410 Fix 5-min logout, deploy-time JWKS, and stream append skeleton flash](#1410) by jonas (ad6da76)
> • [#1407 itx: contexts, capabilities, and the one true handle](#1407) by jonas (f256768)
> …
> **Closed without merging:**
> • [#1440 Migrate captun to published npm 0.0.3](#1440) by misha
> …
> **Opened:**
> • [#1448 Replace per-merge Slack messages with a daily PR dashboard](#1448) by misha (draft)
> …
> Old: [#1349](#1349), [#1355](#1355)

How it works:

- Content is refetched from the GitHub search API on every run (merged / closed-unmerged / opened-and-still-open today, plus older open PRs), so the message is self-healing: there is no incremental state to corrupt.
- The day's message timestamps live in a repo Actions variable (`SLACK_PR_DASHBOARD_STATE`, `{date, channel, ts, details_ts}`), written with the same `ITERATE_BOT_GITHUB_TOKEN` the nag workflow uses. No new Slack scopes needed: `chat.update` uses the `chat:write` the bot already exercises.
- Targets `#ci`, adopting #1452's decision to move merge announcements out of `#building` (that PR edited the workflow this one deletes; the conflict is resolved here by keeping the deletion).
- The threaded details go out as chunked mrkdwn section blocks rather than one `text` param: on busy days a single text field hits `chat.update`'s `msg_too_long` (`postMessage` truncates, `update` rejects; found by e2e-testing against today's ~50 merges).
- Plain-text author names (no @-mentions) since the messages update many times a day.
- Testable two ways: pushing any `*pr-dashboard*` branch runs it for real against `#misha-test` with a separate state variable (create, update-in-place, and threading paths all verified this way, e.g. runs [27280068182](https://github.com/iterate/iterate/actions/runs/27280068182), [27288814028](https://github.com/iterate/iterate/actions/runs/27288814028)), and `node cli.ts github-script pr-dashboard.update_dashboard.update_pr_dashboard --github-token ...` does a local dry run that prints both messages.

Task file: `tasks/slack-daily-pr-dashboard.md`.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>

Add voice agent processor POC

ffa1c6a

jonastemplestein marked this pull request as ready for review May 19, 2026 15:12

cursor Bot reviewed May 19, 2026

Comment thread apps/os/src/routes/_app/orgs/$organizationSlug/projects/$projectSlug/voice-agent-poc.tsx

jonastemplestein added 14 commits May 19, 2026 16:28

Split voice agent stream processors

81487f1

Add Grok ask-agent voice bridge

42c1f92

Smoke test Grok ask-agent bridge

dd5e22c

Add Gemini and OpenAI ask-agent tool bridge

14636d8

Clarify code agent voice responses

4cb9926

Tighten voice agent handoff prompts

7cd2c30

Rename voice tool to messageAgent

446b125

Clarify voice agent relay handoff

9ea5270

Clarify caller terminology for voice handoff

0f09a9f

Cap Alchemy worker tags for OS preview deploys

9be9263

Move voice agents under agents stream path

1895d66

Route voice agent chat responses to voice input

fb48aa9

Honor Gemini required messageAgent tool choice

ed2beef

Show legacy voice agent streams

637cd1a

jonastemplestein force-pushed the electric-artichoke branch from ffa1c6a to 6323aed Compare June 10, 2026 10:58

jonastemplestein mentioned this pull request Jun 10, 2026

Split voice agent stream processors #1351

Closed

jonastemplestein changed the title ~~[codex] Add voice agent processor POC~~ Rebuild voice agent processor on stream host Jun 10, 2026

jonastemplestein force-pushed the electric-artichoke branch from f3e2864 to 8840112 Compare June 10, 2026 11:01

cursor Bot reviewed Jun 10, 2026

Comment thread apps/os/src/domains/agents/stream-processors/voice-agent/implementation.ts Outdated

Comment thread apps/os/src/components/voice-agent-stream-console.tsx

jonastemplestein force-pushed the electric-artichoke branch from 8840112 to 2a894e2 Compare June 10, 2026 11:04

cursor Bot reviewed Jun 10, 2026

Comment thread apps/os/src/routes/_app/projects/$projectSlug/voice-agents/index.tsx Outdated

mmkal mentioned this pull request Jun 10, 2026

Replace per-merge Slack messages with a daily PR dashboard #1448

Merged

jonastemplestein and others added 2 commits June 10, 2026 14:51

jonastemplestein force-pushed the electric-artichoke branch from 2a894e2 to 67353f3 Compare June 10, 2026 13:54

cursor Bot reviewed Jun 10, 2026

jonastemplestein wants to merge 18 commits into main from electric-artichoke
