Rebuild voice agent processor on stream host #1349

Open. jonastemplestein wants to merge 18 commits into main from electric-artichoke

@jonastemplestein commented May 19, 2026 (Contributor)

What changed

Stream-based realtime voice agent on the stream-processor host, now as a single unified voice-agent processor:

  • One processor contract + implementation for Gemini Live, OpenAI Realtime, and Grok Realtime; the backend is selected from setup-configured state. Provider differences live in one endpoint table (URL, headers, session setup message, message handler).
  • Six generic provider-* audit/status events with a provider payload field replace the previous 27 per-provider event types. The processor's consumes list is trimmed to the four events it acts on, so its own audit appends no longer round-trip through ingest.
  • Provider message audit events redact large strings (base64 PCM), so audio is stored exactly once per direction. Per-connection promise chains serialize message handling and appends, guaranteeing output frames land in stream order.
  • Gemini goAway surfaces as a going-away status event; provider error messages are recorded without tearing down the session; agent/input-added is referenced via processorDeps instead of being redeclared.
  • Gemini "required" messageAgent tool choice is enforced through the system instruction — Gemini Live v1beta rejects toolConfig anywhere in BidiGenerateContentSetup (close 1007), so the previous API-level approach silently killed sessions.
  • Retired voice-agent/<provider> slugs map to a no-op processor so pre-existing stream subscriptions don't error; new streams subscribe only the unified slug.
  • Browser console: resubscribes with backoff from the last seen offset, never awaits playback or AudioContext.resume() inside the subscription loop, batches mic appends behind a bounded drop-oldest queue, and tracks played offsets with a monotonic counter.
  • Audio worklets: allocation-free ring buffers, box low-pass filtering for mic downsampling (no more aliasing decimation), and underruns counted only when audio resumes shortly after a drain.
  • Voice code agent uses the standard agent workspace and chat tool; it replies into the voice stream by appending input-text-appended events from codemode.
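
The per-connection serialization mentioned in the list above can be sketched as a small promise chain. This is a minimal illustration, not the PR's implementation: `SerialQueue`, `simulateWork`, and `demoOrdering` are hypothetical names, and the "latency" is simulated with microtask yields rather than real provider traffic.

```typescript
// Hypothetical helper, not taken from the PR: runs async tasks strictly one
// at a time, in the order they were enqueued.
class SerialQueue {
  private tail: Promise<void> = Promise.resolve();

  enqueue<T>(task: () => Promise<T>): Promise<T> {
    const result = this.tail.then(task);
    // Swallow the outcome on the chain so one failed task does not block
    // later ones; the caller still sees the failure through `result`.
    this.tail = result.then(
      () => undefined,
      () => undefined,
    );
    return result;
  }
}

// Yields to the microtask queue `ticks` times to simulate uneven latency.
async function simulateWork(ticks: number): Promise<void> {
  for (let i = 0; i < ticks; i++) await Promise.resolve();
}

// Frame 0 takes longest to handle, yet frames are appended in arrival order.
async function demoOrdering(): Promise<number[]> {
  const appended: number[] = [];
  const queue = new SerialQueue();
  const latencies = [30, 5, 15];
  await Promise.all(
    latencies.map((ticks, frame) =>
      queue.enqueue(async () => {
        await simulateWork(ticks);
        appended.push(frame);
      }),
    ),
  );
  return appended;
}
```

Without the queue, the frame with the shortest handling time would be appended first. With it, each task starts only after the previous one settles, which is the property that keeps output frames in stream order.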

Why

Proves that a voice client can append PCM input frames to a stream, have a realtime voice backend process them, and receive PCM output frames back through the same stream — with one canonical contract instead of per-provider forks.

Validation

  • pnpm typecheck (repo-wide), pnpm lint, pnpm format
  • packages/shared test suite incl. new coverage: per-provider audio forwarding, audit redaction, output frame ordering, speaker-buffer-clear, goAway, resampling, reducers (19 tests)
  • apps/os unit tests (27 files)
  • Live smoke test against all three real provider APIs (text in → PCM out, in-memory stream): connected/ready, speech-level audio (RMS 1700–3900), correct transcripts, ordered frames, redacted audits with no audio leakage; full messageAgent handoff loop verified on Gemini (tool call → agent/input-added → code-agent reply → spoken audio)
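
For readers unfamiliar with the RMS figure quoted in the smoke test, here is a minimal sketch of how such a level could be computed. It assumes the values are measured on signed 16-bit PCM samples, which the description does not state explicitly; `rmsInt16` is a hypothetical name.

```typescript
// Root-mean-square level of signed 16-bit PCM samples.
// Returns 0 for an empty buffer instead of dividing by zero.
function rmsInt16(samples: Int16Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    sumSquares += samples[i] * samples[i];
  }
  return Math.sqrt(sumSquares / samples.length);
}
```

Under that assumption, digital silence gives an RMS of 0 and a full-scale square wave gives 32767, so a reading in the low thousands indicates audible signal rather than silence or line noise.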

Environment Config Lease

Lease: preview-6
Doppler config: preview_6
Type: environment-config-lease
Leased until: 2026-06-10T12:05:46.980Z

OS

Status: deployed
Commit: 2a894e2
Preview: https://os.iterate-preview-6.com
Workflow run
Updated: 2026-06-10T11:08:10.672Z

Semaphore

Status: deployed
Commit: 2a894e2
Preview: https://semaphore.iterate-preview-6.com
Workflow run
Updated: 2026-06-10T11:08:05.650Z

@jonastemplestein marked this pull request as ready for review May 19, 2026 15:12
@jonastemplestein (Contributor, Author)

Rebuilt this PR from current origin/main and folded in the follow-up from #1351. The voice provider adapter now runs as hosted processors on AgentDurableObject via the new stream processor host instead of the old standalone stream processor DO split.

Verified locally:
- pnpm --dir apps/os typecheck
- pnpm --dir apps/os exec vitest run src/domains/agents/stream-processors/voice-agent/implementation.test.ts

@jonastemplestein changed the title from "[codex] Add voice agent processor POC" to "Rebuild voice agent processor on stream host" Jun 10, 2026
Comment thread apps/os/src/domains/agents/stream-processors/voice-agent/implementation.ts Outdated
Comment thread apps/os/src/components/voice-agent-stream-console.tsx
Comment thread apps/os/src/routes/_app/projects/$projectSlug/voice-agents/index.tsx Outdated

Collapse the three per-provider voice processors into one voice-agent
processor that picks its backend from setup-configured state:

- Replace 27 per-provider audit event types with six generic provider-*
  events carrying a provider field; trim consumes to the four events the
  processor acts on so its own audit appends no longer re-enter ingest.
- Redact large strings (base64 PCM) from provider message audit events;
  audio now lives exactly once per direction in the stream.
- Serialize provider message handling and stream appends per connection
  so output frames land in stream order, and serialize input forwarding
  without blocking sends behind audit append round trips.
- Surface Gemini goAway as a going-away status event; record provider
  error messages without tearing down the session; reference the agent
  contract via processorDeps instead of redeclaring agent/input-added.
- Map retired voice-agent/<provider> slugs to a no-op processor so old
  stream subscriptions stop erroring; new streams subscribe only the
  unified slug.
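
The audit redaction described in the list above could look roughly like the following. This is a sketch under assumptions: the 256-character threshold, the placeholder format, and the names `redactLargeStrings` and `MAX_AUDIT_STRING_LENGTH` are illustrative and not taken from the PR.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Hypothetical threshold: strings longer than this are treated as payload
// data (for example base64 PCM) rather than metadata.
const MAX_AUDIT_STRING_LENGTH = 256;

// Returns a deep copy of `value` with every long string replaced by a short
// placeholder that records only the original length.
function redactLargeStrings(value: Json, maxLength: number = MAX_AUDIT_STRING_LENGTH): Json {
  if (typeof value === "string") {
    return value.length > maxLength ? `[redacted ${value.length} chars]` : value;
  }
  if (Array.isArray(value)) {
    return value.map((item) => redactLargeStrings(item, maxLength));
  }
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: Json } = {};
    for (const [key, child] of Object.entries(value)) {
      out[key] = redactLargeStrings(child, maxLength);
    }
    return out;
  }
  return value; // numbers, booleans, and null pass through unchanged
}
```

Redacting by size rather than by field name means the same function works for every provider's message shape, at the cost of also redacting any long non-audio string.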

Console: resubscribe with backoff from the last seen offset, stop
awaiting playback (and AudioContext.resume) inside the subscription
loop, batch mic appends behind a bounded drop-oldest queue with a
dropped-frames metric, and track played offsets with a monotonic
counter.
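
A bounded drop-oldest queue with a dropped-frames metric, as described above, might be sketched like this. `DropOldestQueue` and its members are hypothetical names, and the capacity is whatever the caller chooses; none of this is taken from the console code.

```typescript
// Hypothetical bounded queue: when full, the oldest item is discarded so the
// newest audio is kept, and each drop is counted as a metric.
class DropOldestQueue<T> {
  private items: T[] = [];
  private capacity: number;
  droppedCount = 0;

  constructor(capacity: number) {
    if (capacity < 1) throw new Error("capacity must be at least 1");
    this.capacity = capacity;
  }

  push(item: T): void {
    if (this.items.length >= this.capacity) {
      this.items.shift();
      this.droppedCount += 1;
    }
    this.items.push(item);
  }

  // Removes and returns up to `maxBatch` items, oldest first.
  takeBatch(maxBatch: number): T[] {
    return this.items.splice(0, maxBatch);
  }

  get size(): number {
    return this.items.length;
  }
}
```

Dropping the oldest frames favors low latency over completeness: if appends fall behind, the backend hears recent speech instead of a growing backlog.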

Worklets: ring buffers instead of push/splice arrays on the audio
thread, box low-pass filtering for mic downsampling instead of aliasing
nearest-neighbor decimation, and underruns counted only when audio
resumes shortly after a drain.
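
To illustrate the difference between box low-pass downsampling and nearest-neighbor decimation, here is a minimal sketch. The function name and the sample rates in the usage note are assumptions, and unlike the worklet code described above this version allocates its output for clarity.

```typescript
// Downsamples by averaging each window of input samples that maps to one
// output sample (a box filter), instead of keeping one sample per window.
function downsampleBox(input: Float32Array, inputRate: number, outputRate: number): Float32Array {
  if (outputRate >= inputRate) return input.slice(); // no downsampling needed
  const ratio = inputRate / outputRate;
  const outputLength = Math.floor(input.length / ratio);
  const output = new Float32Array(outputLength);
  for (let i = 0; i < outputLength; i++) {
    const start = Math.floor(i * ratio);
    // Guarantee at least one sample per window and stay inside the input.
    const end = Math.min(input.length, Math.max(start + 1, Math.floor((i + 1) * ratio)));
    let sum = 0;
    for (let j = start; j < end; j++) sum += input[j];
    output[i] = sum / (end - start);
  }
  return output;
}
```

For example, halving the rate of the alternating signal `[1, -1, 1, -1]` yields `[0, 0]` here, whereas keeping every second sample would yield `[1, 1]`, a constant offset that was never in the input. That spurious low-frequency content is the aliasing the box filter suppresses.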

Tests: port to the unified processor and add coverage for audio
forwarding per provider, audit redaction, output frame ordering,
speaker-buffer-clear, goAway, resampling, and reducers.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
jonastemplestein and others added 2 commits June 10, 2026 14:51
Live smoke testing showed Gemini Live v1beta closes the socket with 1007
("Unknown name toolConfig at 'setup'") for any function-calling config in
BidiGenerateContentSetup — the field has never been accepted, so the
required messageAgent option silently killed Gemini sessions. Enforce it
through the system instruction instead, which the live API honors.

Verified against the real Gemini Live, OpenAI Realtime, and Grok Realtime
APIs: text in -> speech-level PCM out with correct transcripts, ordered
frames, redacted audit events, and the full messageAgent handoff loop
(tool call -> agent/input-added -> code-agent reply -> spoken audio).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Voice agent streams now get the standard agent workspace and chat tool
behavior: drop the voice-specific ctx.chat.sendMessage rerouting and the
appendVoiceAgentTextInput helper (the code agent appends voice-agent
text input events directly from codemode), and build voice code-agent
setup events from the default preset instead of threading baseEvents
through.

@cursor (bot) left a comment

Cursor Bugbot has reviewed your changes and found 4 potential issues.

Reviewed by Cursor Bugbot for commit 67353f3.

stopCapture();
setProviderStatus("Stream paused");
}
} finally {

Failed mic batch dropped

Medium Severity

In flushInputFrames, pending input events are removed from the queue before appendBatch completes. If the request fails, those frames are not put back, so mic audio for that batch is lost until the user speaks again.
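
One possible shape of a fix is to put a failed batch back at the front of the queue. The sketch below is not the console's actual `flushInputFrames`; `flushWithRequeue` and `AppendBatch` are hypothetical, and the queue is modeled as a plain array.

```typescript
type AppendBatch<T> = (batch: T[]) => Promise<void>;

// Sends up to `maxBatch` queued frames. If the append fails, the batch is
// restored at the front of the queue so the next flush retries it in order.
// Returns true when the batch was sent (or there was nothing to send).
async function flushWithRequeue<T>(
  queue: T[],
  appendBatch: AppendBatch<T>,
  maxBatch: number,
): Promise<boolean> {
  const batch = queue.splice(0, maxBatch);
  if (batch.length === 0) return true;
  try {
    await appendBatch(batch);
    return true;
  } catch {
    // Frames pushed while the request was in flight stay behind the retry.
    queue.unshift(...batch);
    return false;
  }
}
```

With a bounded queue, re-queuing can push the length past the cap, so a real fix would need to decide whether retried frames or newly captured frames are dropped first.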

lastPlayedOffsetRef.current = event.offset;

const payload = parseAudioPayload(event.payload);
if (!payload || payload.sampleRate !== VOICE_AGENT_OUTPUT_SAMPLE_RATE) return;

Skipped audio marks offset played

Low Severity

playOutputEvent advances lastPlayedOffsetRef before validating the frame payload. Invalid or wrong-rate frames are never played but are treated as already handled, so they cannot be retried if the same offset is seen again.
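
The ordering the comment asks for, validate first and advance the offset second, can be sketched as follows. The types and `acceptFrame` are hypothetical, offsets are assumed to be increasing numbers, and the expected sample rate is passed in rather than taken from the real constant.

```typescript
interface OutputFrame {
  offset: number; // assumed numeric and increasing, for illustration
  sampleRate: number;
}

interface PlaybackState {
  lastPlayedOffset: number;
}

// Returns true when the frame should be played. The offset only advances for
// frames that pass validation, so a rejected frame is not marked as handled.
function acceptFrame(
  state: PlaybackState,
  frame: OutputFrame,
  expectedSampleRate: number,
): boolean {
  if (frame.offset <= state.lastPlayedOffset) return false; // already played
  if (frame.sampleRate !== expectedSampleRate) return false; // leave offset untouched
  state.lastPlayedOffset = frame.offset;
  return true;
}
```

Whether retrying is useful depends on whether a frame at the same offset can ever arrive with a different payload; if not, marking it as handled may be the simpler behavior.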

typeof systemPrompt === "string" && !systemPrompt.includes("ctx.streams.append({ event:")
);
});
return input.existingEvents.some((event) => event.type === input.event.type);

Default prompt upgrade blocked

Medium Severity

hasEquivalentDefaultSetupEvent now treats any existing system-prompt-updated event as sufficient, so ensureAgentSetupEvents skips appending the current default system prompt when an older prompt is already on the stream.

const event = Event.parse(value);
if (event.type === OUTPUT_AUDIO_FRAME_EVENT_TYPE) {
await playOutputEvent(event);
}

POC stream blocks playback

Medium Severity

The voice POC event loop awaits playOutputEvent for each output frame, and ensureOutputAudio awaits AudioContext.resume(). Under autoplay policy that can stall the subscription and delay or stop further stream events.

Additional Locations (1)
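
A read loop that does not wait on playback could be sketched as below. `readWithoutBlocking` is a hypothetical name, the event source is any async iterable, and playback errors are swallowed only to keep the sketch short; real code would report them.

```typescript
// Reads every event from the subscription and hands each one to `play`
// without awaiting it, so a stalled playback call cannot stall the reader.
// Playback calls are still chained so frames play in arrival order.
async function readWithoutBlocking<T>(
  events: AsyncIterable<T>,
  play: (event: T) => Promise<void>,
): Promise<number> {
  let playback: Promise<void> = Promise.resolve();
  let seen = 0;
  for await (const event of events) {
    seen += 1;
    playback = playback.then(() => play(event)).catch(() => undefined);
  }
  return seen; // number of events read, regardless of playback progress
}
```

Even if `play` never resolves, for instance because an audio context stays suspended under autoplay policy, the loop still consumes the whole stream. The trade-off is that pending frames accumulate behind the stalled call, so a real implementation would likely cap or discard them.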

mmkal added a commit that referenced this pull request Jun 10, 2026
Replaces the `merge-to-main-slack` workflow (one Slack message per
merged PR — noisy on busy days) with a workflow that maintains **at most
one message per day** in `#ci`: a one-line PR dashboard summary, with
the full per-PR breakdown in a single threaded reply. Both are created
on the first PR event of the day and updated in place after that.

Channel message:

> **PR dashboard 10th June** — 51 merged · 9 closed without merging · 4
opened · 2 older still open (details in thread)

Threaded reply (rendered from real data):

> **Merged:**
> • [#1410 Fix 5-min logout, deploy-time JWKS, and stream append
skeleton flash](#1410) by jonas
(ad6da76)
> • [#1407 itx: contexts, capabilities, and the one true
handle](#1407) by jonas (f256768)
> …
> **Closed without merging:**
> • [#1440 Migrate captun to published npm
0.0.3](#1440) by misha
> …
> **Opened:**
> • [#1448 Replace per-merge Slack messages with a daily PR
dashboard](#1448) by misha
(draft)
> …
> Old: [#1349](#1349),
[#1355](#1355)

How it works:

- Content is refetched from the GitHub search API on every run (merged /
closed-unmerged / opened-and-still-open today, plus older open PRs), so
the message is self-healing — no incremental state to corrupt.
- The day's message timestamps live in a repo Actions variable
(`SLACK_PR_DASHBOARD_STATE`, `{date, channel, ts, details_ts}`), written
with the same `ITERATE_BOT_GITHUB_TOKEN` the nag workflow uses. No new
Slack scopes needed: `chat.update` uses the `chat:write` the bot already
exercises.
- Targets `#ci`, adopting #1452's decision to move merge announcements
out of `#building` (that PR edited the workflow this one deletes; the
conflict is resolved here by keeping the deletion).
- The threaded details go out as chunked mrkdwn section blocks rather
than one `text` param: on busy days a single text field hits
`chat.update`'s `msg_too_long` (`postMessage` truncates, `update`
rejects — found by e2e-testing against today's ~50 merges).
- Plain-text author names (no @-mentions) since the messages update many
times a day.
- Testable two ways: pushing any `*pr-dashboard*` branch runs it for
real against `#misha-test` with a separate state variable (create,
update-in-place, and threading paths all verified this way — e.g. runs
[27280068182](https://github.com/iterate/iterate/actions/runs/27280068182),
[27288814028](https://github.com/iterate/iterate/actions/runs/27288814028)),
and `node cli.ts github-script
pr-dashboard.update_dashboard.update_pr_dashboard --github-token ...`
does a local dry run that prints both messages.

Task file: `tasks/slack-daily-pr-dashboard.md`.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>