Rebuild voice agent processor on stream host #1349
ffa1c6a to 6323aed
Rebuilt this PR from current origin/main and folded in the follow-up from #1351. The voice provider adapter now runs as hosted processors on AgentDurableObject via the new stream processor host instead of the old standalone stream processor DO split.

Verified locally:
- pnpm --dir apps/os typecheck
- pnpm --dir apps/os exec vitest run src/domains/agents/stream-processors/voice-agent/implementation.test.ts
f3e2864 to 8840112
8840112 to 2a894e2
Collapse the three per-provider voice processors into one voice-agent processor that picks its backend from setup-configured state:

- Replace 27 per-provider audit event types with six generic provider-* events carrying a provider field; trim consumes to the four events the processor acts on so its own audit appends no longer re-enter ingest.
- Redact large strings (base64 PCM) from provider message audit events; audio now lives exactly once per direction in the stream.
- Serialize provider message handling and stream appends per connection so output frames land in stream order, and serialize input forwarding without blocking sends behind audit append round trips.
- Surface Gemini goAway as a going-away status event; record provider error messages without tearing down the session; reference the agent contract via processorDeps instead of redeclaring agent/input-added.
- Map retired voice-agent/<provider> slugs to a no-op processor so old stream subscriptions stop erroring; new streams subscribe only the unified slug.

Console: resubscribe with backoff from the last seen offset, stop awaiting playback (and AudioContext.resume) inside the subscription loop, batch mic appends behind a bounded drop-oldest queue with a dropped-frames metric, and track played offsets with a monotonic counter.

Worklets: ring buffers instead of push/splice arrays on the audio thread, box low-pass filtering for mic downsampling instead of aliasing nearest-neighbor decimation, and underruns counted only when audio resumes shortly after a drain.

Tests: port to the unified processor and add coverage for audio forwarding per provider, audit redaction, output frame ordering, speaker-buffer-clear, goAway, resampling, and reducers.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
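The Worklets change in the commit message above replaces nearest-neighbor decimation with box low-pass filtering. A minimal sketch of that idea, assuming mono Float32 samples; `downsampleBox` and its signature are hypothetical and are not the worklet code from this PR:

```typescript
// Hypothetical helper (not the PR's worklet code): downsample mono PCM by
// averaging each window of input samples, i.e. a box low-pass filter, rather
// than keeping every Nth sample, which aliases high frequencies.
function downsampleBox(input: Float32Array, inputRate: number, outputRate: number): Float32Array {
  if (inputRate <= 0 || outputRate <= 0) {
    throw new RangeError("sample rates must be positive");
  }
  if (outputRate >= inputRate) return input.slice(); // nothing to decimate

  const ratio = inputRate / outputRate; // input samples per output sample
  const outLength = Math.floor(input.length / ratio);
  const output = new Float32Array(outLength);

  for (let i = 0; i < outLength; i++) {
    const start = Math.floor(i * ratio);
    // Always average at least one sample, and never read past the input.
    const end = Math.min(input.length, Math.max(start + 1, Math.floor((i + 1) * ratio)));
    let sum = 0;
    for (let j = start; j < end; j++) sum += input[j];
    output[i] = sum / (end - start);
  }
  return output;
}

// Usage: 48 kHz mic audio down to 16 kHz averages every 3 samples.
const downsampled = downsampleBox(Float32Array.of(0, 3, 6, 1, 1, 1), 48000, 16000);
console.log(Array.from(downsampled)); // values 3 and 1
```

A box filter is a crude low-pass, but it costs one add per input sample, which is why it suits an audio-thread worklet better than a longer FIR kernel.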
Live smoke testing showed Gemini Live v1beta closes the socket with 1007
("Unknown name toolConfig at 'setup'") for any function-calling config in
BidiGenerateContentSetup — the field has never been accepted, so the
required messageAgent option silently killed Gemini sessions. Enforce it
through the system instruction instead, which the live API honors.
Verified against the real Gemini Live, OpenAI Realtime, and Grok Realtime
APIs: text in -> speech-level PCM out with correct transcripts, ordered
frames, redacted audit events, and the full messageAgent handoff loop
(tool call -> agent/input-added -> code-agent reply -> spoken audio).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Voice agent streams now get the standard agent workspace and chat tool behavior: drop the voice-specific ctx.chat.sendMessage rerouting and the appendVoiceAgentTextInput helper (the code agent appends voice-agent text input events directly from codemode), and build voice code-agent setup events from the default preset instead of threading baseEvents through.
2a894e2 to 67353f3
Cursor Bugbot has reviewed your changes and found 4 potential issues.
Reviewed by Cursor Bugbot for commit 67353f3. Configure here.
        stopCapture();
        setProviderStatus("Stream paused");
      }
    } finally {
Failed mic batch dropped
Medium Severity
In flushInputFrames, pending input events are removed from the queue before appendBatch completes. If the request fails, those frames are not put back, so mic audio for that batch is lost until the user speaks again.
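One way to avoid losing a failed batch is to remove frames from the queue only after the append succeeds. The sketch below illustrates that under assumptions: `BoundedFrameQueue`, `InputFrame`, and this `flushInputFrames` signature are hypothetical and are not the console code from this PR.

```typescript
// Hypothetical sketch (names are not from the PR): a bounded drop-oldest queue
// whose frames are removed only after the batch append has succeeded.
type InputFrame = { seq: number };

class BoundedFrameQueue {
  private frames: InputFrame[] = [];
  private readonly capacity: number;
  droppedFrames = 0; // metric: frames evicted because the queue was full

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  push(frame: InputFrame): void {
    this.frames.push(frame);
    while (this.frames.length > this.capacity) {
      this.frames.shift(); // drop the oldest frame
      this.droppedFrames++;
    }
  }

  // Look at the next batch without removing it.
  peek(max: number): InputFrame[] {
    return this.frames.slice(0, max);
  }

  // Remove frames that were sent. Matching by identity stays correct even if
  // some of them were already evicted while the append was in flight.
  commit(sent: InputFrame[]): void {
    const sentSet = new Set(sent);
    this.frames = this.frames.filter((frame) => !sentSet.has(frame));
  }

  get size(): number {
    return this.frames.length;
  }
}

async function flushInputFrames(
  queue: BoundedFrameQueue,
  appendBatch: (frames: InputFrame[]) => Promise<void>,
  batchSize: number,
): Promise<boolean> {
  const batch = queue.peek(batchSize);
  if (batch.length === 0) return true;
  try {
    await appendBatch(batch);
  } catch {
    return false; // frames stay queued, so the next flush retries them
  }
  queue.commit(batch);
  return true;
}
```

This sketch assumes only one flush is in flight at a time; concurrent flushes would peek the same frames and append them twice. The capacity bound still applies while a request is pending, so a long outage drops the oldest audio instead of growing memory.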
    lastPlayedOffsetRef.current = event.offset;

    const payload = parseAudioPayload(event.payload);
    if (!payload || payload.sampleRate !== VOICE_AGENT_OUTPUT_SAMPLE_RATE) return;
Skipped audio marks offset played
Low Severity
playOutputEvent advances lastPlayedOffsetRef before validating the frame payload. Invalid or wrong-rate frames are never played but are treated as already handled, so they cannot be retried if the same offset is seen again.
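If retrying skipped frames is the desired behavior, the offset can be advanced only after the frame has been validated and handed to playback. A minimal sketch of that ordering; the payload shape, the `24000` rate, and the names `createOutputPlayer` and `parseAudioPayloadSketch` are assumptions, not the PR's code:

```typescript
// Hypothetical sketch: the payload shape and the 24000 value are assumptions
// standing in for the PR's parseAudioPayload and VOICE_AGENT_OUTPUT_SAMPLE_RATE.
type OutputEvent = { offset: number; payload: unknown };
type AudioPayload = { sampleRate: number; pcmBase64: string };

const ASSUMED_OUTPUT_SAMPLE_RATE = 24000;

function parseAudioPayloadSketch(payload: unknown): AudioPayload | null {
  if (typeof payload !== "object" || payload === null) return null;
  const candidate = payload as Record<string, unknown>;
  if (typeof candidate.sampleRate !== "number" || typeof candidate.pcmBase64 !== "string") {
    return null;
  }
  return { sampleRate: candidate.sampleRate, pcmBase64: candidate.pcmBase64 };
}

function createOutputPlayer(play: (payload: AudioPayload) => void) {
  let lastPlayedOffset = -1;
  return {
    // Returns true only when the frame was actually handed to playback.
    handle(event: OutputEvent): boolean {
      if (event.offset <= lastPlayedOffset) return false; // already played
      const payload = parseAudioPayloadSketch(event.payload);
      if (!payload || payload.sampleRate !== ASSUMED_OUTPUT_SAMPLE_RATE) {
        return false; // skipped: offset is left untouched
      }
      play(payload);
      lastPlayedOffset = event.offset; // advance only after a successful play
      return true;
    },
    lastPlayedOffset: () => lastPlayedOffset,
  };
}
```

The trade-off is that a permanently invalid frame is re-examined every time its offset is seen again, which is cheap here because validation does no I/O.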
        typeof systemPrompt === "string" && !systemPrompt.includes("ctx.streams.append({ event:")
      );
    });
    return input.existingEvents.some((event) => event.type === input.event.type);
Default prompt upgrade blocked
Medium Severity
hasEquivalentDefaultSetupEvent now treats any existing system-prompt-updated event as sufficient, so ensureAgentSetupEvents skips appending the current default system prompt when an older prompt is already on the stream.
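One possible direction is to compare prompt content instead of matching on event type alone. The sketch below is only an illustration: the `SetupEvent` shape, the `systemPrompt` payload field, and `hasEquivalentSetupEvent` are assumptions and may not match the real types.

```typescript
// Hypothetical sketch: the event shape and the payload field name are
// assumptions, not the PR's actual types.
type SetupEvent = { type: string; payload: { systemPrompt?: string } };

const SYSTEM_PROMPT_EVENT_TYPE = "system-prompt-updated";

function hasEquivalentSetupEvent(existing: SetupEvent[], candidate: SetupEvent): boolean {
  if (candidate.type !== SYSTEM_PROMPT_EVENT_TYPE) {
    // Other setup events only need to exist once.
    return existing.some((event) => event.type === candidate.type);
  }
  // For prompts, only the most recent prompt event is in effect, so compare
  // its text with the current default instead of matching on type alone.
  const prompts = existing.filter((event) => event.type === SYSTEM_PROMPT_EVENT_TYPE);
  const latest = prompts[prompts.length - 1];
  return latest !== undefined && latest.payload.systemPrompt === candidate.payload.systemPrompt;
}
```

A real fix would also need to tell an outdated default apart from a deliberately customized prompt, which this sketch does not attempt.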
    const event = Event.parse(value);
    if (event.type === OUTPUT_AUDIO_FRAME_EVENT_TYPE) {
      await playOutputEvent(event);
    }
POC stream blocks playback
Medium Severity
The voice POC event loop awaits playOutputEvent for each output frame, and ensureOutputAudio awaits AudioContext.resume(). Under autoplay policy that can stall the subscription and delay or stop further stream events.
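A common way to keep the subscription loop moving is to hand frames to a promise chain and not await it in the loop. The sketch below shows that pattern under assumptions: `createPlaybackChain`, `consumeStream`, and the `"output-audio-frame"` type string are hypothetical stand-ins, not the POC's code.

```typescript
// Hypothetical sketch (names are not from the PR): hand output frames to a
// promise chain so the subscription loop never waits on playback.
type StreamEvent = { type: string; offset: number };

function createPlaybackChain(
  play: (event: StreamEvent) => Promise<void>,
  onError: (error: unknown) => void,
) {
  let tail: Promise<void> = Promise.resolve();
  return {
    // Frames still play in arrival order because each one chains on the last.
    enqueue(event: StreamEvent): void {
      tail = tail.then(() => play(event)).catch(onError);
    },
    // Resolves once everything enqueued so far has played or failed.
    idle(): Promise<void> {
      return tail;
    },
  };
}

async function consumeStream(
  events: AsyncIterable<StreamEvent>,
  playback: { enqueue(event: StreamEvent): void },
  onOtherEvent: (event: StreamEvent) => void,
): Promise<void> {
  for await (const event of events) {
    if (event.type === "output-audio-frame") {
      playback.enqueue(event); // deliberately not awaited
    } else {
      onOtherEvent(event);
    }
  }
}
```

If `play` stalls, for example while an audio context waits for a user gesture, later stream events are still read and dispatched; only the queued audio waits. The chain is unbounded in this sketch, so a real client would likely cap or drop stale frames.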
Replaces the `merge-to-main-slack` workflow (one Slack message per merged PR — noisy on busy days) with a workflow that maintains **at most one message per day** in `#ci`: a one-line PR dashboard summary, with the full per-PR breakdown in a single threaded reply. Both are created on the first PR event of the day and updated in place after that.

Channel message:

> **PR dashboard 10th June** — 51 merged · 9 closed without merging · 4 opened · 2 older still open (details in thread)

Threaded reply (rendered from real data):

> **Merged:**
> • [#1410 Fix 5-min logout, deploy-time JWKS, and stream append skeleton flash](#1410) by jonas (ad6da76)
> • [#1407 itx: contexts, capabilities, and the one true handle](#1407) by jonas (f256768)
> …
> **Closed without merging:**
> • [#1440 Migrate captun to published npm 0.0.3](#1440) by misha
> …
> **Opened:**
> • [#1448 Replace per-merge Slack messages with a daily PR dashboard](#1448) by misha (draft)
> …
> Old: [#1349](#1349), [#1355](#1355)

How it works:

- Content is refetched from the GitHub search API on every run (merged / closed-unmerged / opened-and-still-open today, plus older open PRs), so the message is self-healing — no incremental state to corrupt.
- The day's message timestamps live in a repo Actions variable (`SLACK_PR_DASHBOARD_STATE`, `{date, channel, ts, details_ts}`), written with the same `ITERATE_BOT_GITHUB_TOKEN` the nag workflow uses. No new Slack scopes needed: `chat.update` uses the `chat:write` the bot already exercises.
- Targets `#ci`, adopting #1452's decision to move merge announcements out of `#building` (that PR edited the workflow this one deletes; the conflict is resolved here by keeping the deletion).
- The threaded details go out as chunked mrkdwn section blocks rather than one `text` param: on busy days a single text field hits `chat.update`'s `msg_too_long` (`postMessage` truncates, `update` rejects — found by e2e-testing against today's ~50 merges).
- Plain-text author names (no @-mentions) since the messages update many times a day.
- Testable two ways: pushing any `*pr-dashboard*` branch runs it for real against `#misha-test` with a separate state variable (create, update-in-place, and threading paths all verified this way — e.g. runs [27280068182](https://github.com/iterate/iterate/actions/runs/27280068182), [27288814028](https://github.com/iterate/iterate/actions/runs/27288814028)), and `node cli.ts github-script pr-dashboard.update_dashboard.update_pr_dashboard --github-token ...` does a local dry run that prints both messages.

Task file: `tasks/slack-daily-pr-dashboard.md`.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>


What changed
Stream-based realtime voice agent on the stream-processor host, now as a single unified `voice-agent` processor:
- The backend is picked from `setup-configured` state. Provider differences live in one endpoint table (URL, headers, session setup message, message handler).
- Six generic `provider-*` audit/status events with a `provider` payload field replace the previous 27 per-provider event types. `consumes` is trimmed to the four events the processor acts on, so its own audit appends no longer round-trip through ingest.
- Gemini `goAway` surfaces as a `going-away` status event; provider error messages are recorded without tearing down the session; `agent/input-added` is referenced via `processorDeps` instead of being redeclared.
- The required `messageAgent` option is enforced through the system instruction: Gemini Live v1beta rejects `toolConfig` anywhere in `BidiGenerateContentSetup` (close 1007), so the previous API-level approach silently killed sessions.
- Retired `voice-agent/<provider>` slugs map to a no-op processor so pre-existing stream subscriptions don't error; new streams subscribe only the unified slug.
- The console resubscribes with backoff from the last seen offset, stops awaiting playback and `AudioContext.resume()` inside the subscription loop, batches mic appends behind a bounded drop-oldest queue, and tracks played offsets with a monotonic counter.
- The code agent appends voice-agent `input-text-appended` events directly from codemode.

Why
Proves that a voice client can append PCM input frames to a stream, have a realtime voice backend process them, and receive PCM output frames back through the same stream — with one canonical contract instead of per-provider forks.
Validation
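- `pnpm typecheck` (repo-wide), `pnpm lint`, `pnpm format`
- `packages/shared` test suite incl. new coverage: per-provider audio forwarding, audit redaction, output frame ordering, speaker-buffer-clear, `goAway`, resampling, reducers (19 tests)
- Live smoke test against the real Gemini Live, OpenAI Realtime, and Grok Realtime APIs, including the full `messageAgent` handoff loop (tool call → `agent/input-added` → code-agent reply → spoken audio)

Environment Config Lease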
Lease: preview-6
Doppler config: preview_6
Type: environment-config-lease
Leased until: 2026-06-10T12:05:46.980Z
OS
Status: deployed
Commit: 2a894e2
Preview: https://os.iterate-preview-6.com
Workflow run
Updated: 2026-06-10T11:08:10.672Z
Semaphore
Status: deployed
Commit: 2a894e2
Preview: https://semaphore.iterate-preview-6.com
Workflow run
Updated: 2026-06-10T11:08:05.650Z