Split voice agent stream processors#1351

Closed
jonastemplestein wants to merge 15 commits into electric-artichoke from voice-agent-processor-split

@jonastemplestein jonastemplestein commented May 19, 2026

Summary

  • replace the voice-agent-specific Durable Object binding with a generic STREAM_PROCESSOR runner and registry
  • subscribe both the passive voice-agent protocol processor and the selected provider adapter for new voice-agent streams
  • add canonical input/output text events and forward input text to Gemini Live, OpenAI Realtime, and Grok Realtime adapters
  • wire new voice-agent streams to the existing AGENT Durable Object from the passive voice-agent processor side effect
  • expose messageAgent realtime tools for Gemini Live, OpenAI Realtime, and Grok Realtime that append events.iterate.com/agent/input-added
  • wrap code-agent text before sending it to voice providers so the realtime model relays it to the human it is speaking to, including clarifying questions, rather than answering it itself
  • add local smoke modes and a direct Gemini Live tool probe for proving provider tool-call wiring outside the full stream path

Verification

  • pnpm --filter os typecheck
  • pnpm --dir packages/shared test:stream-processors
  • pnpm format:check packages/shared/src/stream-processors/voice-agent/contract.ts packages/shared/src/stream-processors/voice-agent/implementation.ts packages/shared/src/stream-processors/voice-agent/implementation.test.ts apps/os/scripts/voice-agent-e2e.ts apps/os/scripts/gemini-live-tool-probe.ts apps/os/src/domains/voice-agents/voice-agent-code-agent.ts
  • pnpm lint packages/shared/src/stream-processors/voice-agent/contract.ts packages/shared/src/stream-processors/voice-agent/implementation.ts packages/shared/src/stream-processors/voice-agent/implementation.test.ts apps/os/scripts/voice-agent-e2e.ts apps/os/scripts/gemini-live-tool-probe.ts apps/os/src/domains/voice-agents/voice-agent-code-agent.ts
  • focused prompt-handoff checks after follow-ups: packages/shared/src/stream-processors/voice-agent/implementation.ts, packages/shared/src/stream-processors/voice-agent/implementation.test.ts, apps/os/src/domains/voice-agents/voice-agent-code-agent.ts

Local realtime e2e

Ran the original audio in/out smoke against local OS before the message-agent bridge:

  • Gemini Live: passed, provider connected, setup completed, output audio/text returned, outputBytes=10084
  • OpenAI Realtime: passed, provider connected, setup completed, output audio/text returned, outputBytes=19200
  • Grok Realtime: passed, provider connected, setup completed, output audio/text returned, outputBytes=26880

Ran direct Gemini Live tool probe against Doppler config dev_localhost:

doppler run --project os --config dev_localhost -- tsx scripts/gemini-live-tool-probe.ts --timeout-ms 12000

Observed Gemini toolCall.functionCalls[0].name === "messageAgent" using the docs-shaped function declaration.
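For orientation, the check above can be sketched as follows. This is a minimal illustration, not the probe script itself: the description text and the `message` parameter are assumptions inferred from the smoke prompt ("Use message: ..."), and `GeminiServerMessage` models only the one field the probe inspects.

```typescript
// Hypothetical Gemini-style function declaration for messageAgent.
// The description and parameter wording are illustrative assumptions.
const messageAgentDeclaration = {
  name: "messageAgent",
  description: "Send a message to the background code agent.",
  parameters: {
    type: "OBJECT",
    properties: {
      message: { type: "STRING", description: "Text to hand to the agent." },
    },
    required: ["message"],
  },
} as const;

// Minimal shape of the server message field the probe looks at.
type GeminiServerMessage = {
  toolCall?: {
    functionCalls?: { id?: string; name: string; args?: Record<string, unknown> }[];
  };
};

// True when the first function call targets messageAgent, mirroring
// toolCall.functionCalls[0].name === "messageAgent".
function isMessageAgentCall(msg: GeminiServerMessage): boolean {
  return msg.toolCall?.functionCalls?.[0]?.name === messageAgentDeclaration.name;
}
```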

Ran text-mode message-agent bridge smokes against local OS at http://127.0.0.1:5176 with Doppler config dev_localhost:

doppler run --project os --config dev_localhost -- tsx scripts/voice-agent-e2e.ts \
  --base-url http://127.0.0.1:5176 \
  --provider gemini-live \
  --input-mode text \
  --expect-message-agent \
  --prompt "Call the messageAgent function now. Use message: Fetch example.com and tell me what it says. Do not answer in natural language before the function call." \
  --timeout-ms 180000

Gemini result: ok: true, stream /voice-agents/e2e-mpd08lmh, messageAgentInputAdded: true, agentOutputAdded: true, codeAgentVoiceTextAdded: true, codeAgentVoiceText: "Example.com says: Example Domain. This domain is for use in documentation examples without needing permission. Avoid use in operations."

doppler run --project os --config dev_localhost -- tsx scripts/voice-agent-e2e.ts \
  --base-url http://127.0.0.1:5176 \
  --provider openai-realtime \
  --input-mode text \
  --expect-message-agent \
  --prompt "Call the messageAgent function now. Use message: Fetch example.com and tell me what it says. Do not answer in natural language before the function call." \
  --timeout-ms 180000

OpenAI result: ok: true, stream /voice-agents/e2e-mpd0dax0, messageAgentInputAdded: true, agentOutputAdded: true, codeAgentVoiceTextAdded: true, codeAgentVoiceText: "Example.com says: Example Domain. This domain is for use in documentation examples without needing permission. Avoid use in operations."

doppler run --project os --config dev_localhost -- tsx scripts/voice-agent-e2e.ts \
  --base-url http://127.0.0.1:5176 \
  --provider grok-realtime \
  --input-mode text \
  --expect-message-agent \
  --prompt "Call the messageAgent function now. Use message: Fetch example.com and tell me what it says. Do not answer in natural language before the function call." \
  --timeout-ms 180000

Grok regression result: ok: true, stream /voice-agents/e2e-mpd0du05, messageAgentInputAdded: true, agentOutputAdded: true, codeAgentVoiceTextAdded: true, codeAgentVoiceText: "Example.com says: Example Domain. This domain is for use in documentation examples without needing permission. Avoid use in operations."

Prompt-handoff regression coverage now verifies that if the background agent appends a caller-facing question such as What occupation should I put on your profile?, all three provider adapters send it with instructions to ask the human they are speaking to rather than answer it themselves.
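The wrapping this coverage exercises could look roughly like the sketch below. The function name `wrapCodeAgentTextForVoice` and the instruction wording are hypothetical; the actual prompt text lives in the voice-agent implementation and may differ.

```typescript
// Hypothetical wrapper: frames code-agent text so the realtime model relays it
// to the human instead of treating it as a question addressed to itself.
function wrapCodeAgentTextForVoice(text: string): string {
  return [
    "The background agent has sent a message for the human you are speaking to.",
    "Relay it to them in your own voice.",
    "If it is a question, ask the human; do not answer it yourself.",
    "",
    `Message: ${JSON.stringify(text)}`,
  ].join("\n");
}

// Example: a caller-facing question from the background agent.
const wrapped = wrapCodeAgentTextForVoice("What occupation should I put on your profile?");
```

Quoting the message with `JSON.stringify` is a design choice in this sketch: it keeps the agent's text visibly separate from the relay instructions.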

Note: the original audio CLI returned ok=true based on output audio, but turnCompleted=false for the first three provider runs. The audio in/out proof works; the script should not currently treat provider turn-complete events as a reliable completion condition.
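A completion check along these lines would match that behaviour. It is a sketch under assumptions: `SmokeObservation` and `smokePassed` are hypothetical names, not taken from voice-agent-e2e.ts.

```typescript
// Hypothetical summary of what one audio smoke run observed.
type SmokeObservation = {
  providerConnected: boolean;
  outputBytes: number;
  turnCompleted: boolean;
};

// Success is keyed on provider output audio. turnCompleted is recorded for
// diagnostics but deliberately not required, since providers did not emit it
// reliably in the runs described above.
function smokePassed(obs: SmokeObservation): boolean {
  return obs.providerConnected && obs.outputBytes > 0;
}
```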

Environment Config Lease

Lease: preview-6
Doppler config: preview_6
Type: environment-config-lease
Leased until: 2026-05-19T23:05:00.080Z

OS

Status: deployed
Commit: 637cd1a
Preview: https://os.iterate-preview-6.com
Workflow run
Updated: 2026-05-19T22:07:36.660Z


Note

High Risk
High risk because it refactors durable-object bindings/subscriptions and realtime voice processing/tool-calling paths, which can break live voice sessions or agent handoff if misconfigured.

Overview
Stream processor refactor: Replaces the voice-agent-specific Durable Object binding (VOICE_AGENT) with a generic STREAM_PROCESSOR Durable Object that selects a processor via processorSlug, and updates OS runtime/context/exports accordingly.
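A generic runner that selects a processor by `processorSlug` can be pictured as a small registry. The sketch below is illustrative only: `StreamEvent`, `StreamProcessor`, and `ProcessorRegistry` are hypothetical names, and the real Durable Object wiring is omitted.

```typescript
// Hypothetical event and processor shapes for illustration.
type StreamEvent = { type: string; payload?: unknown };
type StreamProcessor = {
  slug: string;
  process(event: StreamEvent): StreamEvent[];
};

// Maps processorSlug values to processor implementations.
class ProcessorRegistry {
  private processors = new Map<string, StreamProcessor>();

  register(processor: StreamProcessor): void {
    if (this.processors.has(processor.slug)) {
      throw new Error(`Duplicate processor slug: ${processor.slug}`);
    }
    this.processors.set(processor.slug, processor);
  }

  // Fails loudly on unknown slugs so a misconfigured subscription is caught early.
  resolve(processorSlug: string): StreamProcessor {
    const processor = this.processors.get(processorSlug);
    if (!processor) {
      throw new Error(`Unknown processor slug: ${processorSlug}`);
    }
    return processor;
  }
}
```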

Voice agent pipeline split: New voice-agent streams now subscribe to two processors: a canonical voice-agent protocol processor plus a provider-specific adapter (voice-agent/gemini-live, voice-agent/openai-realtime, voice-agent/grok-realtime), with new slugs/helpers to generate subscription names.
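As a rough sketch of the two-subscription setup: the provider slugs below come from this description, while the protocol processor slug `voice-agent`, the helper names, and the `slug:path` subscription-name format are assumptions made for illustration.

```typescript
const PROVIDERS = ["gemini-live", "openai-realtime", "grok-realtime"] as const;
type VoiceProvider = (typeof PROVIDERS)[number];

// Assumed slug for the passive protocol processor.
const PROTOCOL_PROCESSOR_SLUG = "voice-agent";

// Every new voice-agent stream gets the protocol processor plus one adapter.
function processorSlugsFor(provider: VoiceProvider): [string, string] {
  return [PROTOCOL_PROCESSOR_SLUG, `voice-agent/${provider}`];
}

// Hypothetical naming scheme: one subscription per (processor, stream) pair.
function subscriptionNames(streamPath: string, provider: VoiceProvider): string[] {
  return processorSlugsFor(provider).map((slug) => `${slug}:${streamPath}`);
}
```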

Text + tool-call bridge: Introduces canonical voice-agent input/output text events, forwards input text to providers, exposes a messageAgent tool for Gemini/OpenAI/Grok that appends events.iterate.com/agent/input-added, and wraps code-agent text so providers relay it to the human (not answer/acknowledge).
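The tool-call half of the bridge can be sketched like this. Only the event type string is taken from this description; the payload shape, the `append` callback, and `handleMessageAgentCall` are hypothetical.

```typescript
const AGENT_INPUT_ADDED = "events.iterate.com/agent/input-added";

// Assumed payload shape for the appended event.
type AppendedEvent = { type: string; payload: { content: string } };
type ToolResult = { ok: true } | { ok: false; error: string };

// Validates the provider's tool-call arguments and appends an agent input event.
function handleMessageAgentCall(
  args: unknown,
  append: (event: AppendedEvent) => void,
): ToolResult {
  const message = (args as { message?: unknown } | null | undefined)?.message;
  if (typeof message !== "string" || message.trim() === "") {
    // Report the problem to the provider instead of appending an empty input.
    return { ok: false, error: "messageAgent requires a non-empty string `message`" };
  }
  append({ type: AGENT_INPUT_ADDED, payload: { content: message } });
  return { ok: true };
}
```

Returning a structured result rather than throwing lets each provider adapter translate a failure into its own tool-response format.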

OS/UI/scripts updates: Voice-agent routes and tooling move to /agents/voice/* (while listing supports legacy /voice-agents/*), the stream console displays the new output-text events, and new/updated scripts (voice-agent-e2e text mode + expectations, gemini-live-tool-probe) validate provider tool-calling. Also updates the Alchemy patch to cap tags and include artifacts bindings in metadata.

Reviewed by Cursor Bugbot for commit 637cd1a. Bugbot is set up for automated code reviews on this repo. Configure here.

@jonastemplestein jonastemplestein marked this pull request as ready for review May 19, 2026 15:57
Comment thread apps/os/src/domains/agents/durable-objects/agent-durable-object.ts Outdated
Comment thread packages/shared/src/stream-processors/voice-agent/implementation.ts Outdated
Comment thread apps/os/src/components/voice-agent-stream-console.tsx
Comment thread packages/shared/src/stream-processors/voice-agent/implementation.ts
Comment thread packages/shared/src/stream-processors/voice-agent/implementation.ts
Comment thread apps/os/src/domains/agents/durable-objects/agent-durable-object.ts
Comment thread apps/os/src/domains/voice-agents/voice-agent-subscription.ts
Comment thread apps/os/src/domains/agents/durable-objects/agent-durable-object.ts Outdated
@jonastemplestein jonastemplestein force-pushed the voice-agent-processor-split branch from 27f0a4b to 9be9263 Compare May 19, 2026 21:08

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit ed2beef. Configure here.

@jonastemplestein

Consolidated this work into #1349, which has been rebuilt on top of current origin/main with the new stream processor host abstraction.
