Commit 93bbbe5

feat: add browser realtime talk transports

1 parent 5dd1e26

26 files changed

Lines changed: 2609 additions & 321 deletions

CHANGELOG.md

Lines changed: 1 addition & 0 deletions

@@ -9,6 +9,7 @@ Docs: https://docs.openclaw.ai
 ### Changes

 - Plugins/runtime: expose provider-backed thinking policy and normalization through `api.runtime.agent`, letting tool plugins validate thinking overrides without duplicating provider/model level lists. Thanks @openclaw.
+- Control UI/Talk: add a generic browser realtime transport contract, Google Live browser Talk sessions with constrained ephemeral tokens, and a Gateway relay for backend-only realtime voice plugins. Thanks @VACInc.
 - Providers: add Cerebras as a bundled plugin with onboarding, static model catalog, docs, and manifest-owned endpoint metadata.
 - Memory/OpenAI-compatible: add optional `memorySearch.inputType`, `queryInputType`, and `documentInputType` config for asymmetric embedding endpoints, including direct query embeddings and provider batch indexing. Carries forward #63313 and #60727. Thanks @HOYALIM and @prospect1314521.
 - Ollama/memory: add model-specific retrieval query prefixes for `nomic-embed-text`, `qwen3-embedding`, and `mxbai-embed-large` memory-search queries while leaving document batches unchanged. Carries forward #45013. Thanks @laolin5564.
Lines changed: 2 additions & 2 deletions

@@ -1,2 +1,2 @@
-b81647828ee6599cdd1d76d96ea02c92ccdebb8c1b3b443cefe10ca8bd2ddbfe plugin-sdk-api-baseline.json
-ca9f3569352522621857b51872f30b3c31881505fd9eff2451b1b46d77670726 plugin-sdk-api-baseline.jsonl
+7178659d932136074130426d08e596738a991c6812b2494149427d1f822f1be8 plugin-sdk-api-baseline.json
+fc1e3ab9f21b6f7b6a55498cf5ee322d62dccf4c23322f0ba27559e55a59f901 plugin-sdk-api-baseline.jsonl

docs/providers/google.md

Lines changed: 9 additions & 3 deletions

@@ -352,11 +352,17 @@ SDK rejects language-code hints on this API path.
 </Note>

 <Note>
-Control UI Talk browser sessions still require a realtime voice provider with a
-browser WebRTC session implementation. Today that path is OpenAI Realtime; the
-Google provider is for backend realtime bridges.
+Control UI Talk supports Google Live browser sessions with constrained one-use
+tokens. Backend-only realtime voice providers can also run through the generic
+Gateway relay transport, which keeps provider credentials on the Gateway.
 </Note>

+For maintainer live verification, run
+`OPENAI_API_KEY=... GEMINI_API_KEY=... node --import tsx scripts/dev/realtime-talk-live-smoke.ts`.
+The Google leg mints the same constrained Live API token shape used by Control
+UI Talk, opens the browser WebSocket endpoint, sends the initial setup payload,
+and waits for `setupComplete`.
+
 ## Advanced configuration

 <AccordionGroup>
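As context for the Google note above, the constrained one-use token the Gateway mints can be sketched as a plain request object. This is a hypothetical sketch: the field names mirror the `authTokens.create` parameter shape exercised by this commit's tests, while the model, voice, and instruction values are illustrative only.

```typescript
// Hypothetical sketch of the one-use constrained Live API token request the
// Gateway mints for a browser Talk session. Field names mirror the
// authTokens.create shape asserted in this commit's tests; values are examples.
const tokenRequest = {
  config: {
    uses: 1, // single-use: the browser can open exactly one Live session
    liveConnectConstraints: {
      model: "gemini-live-2.5-flash-preview",
      config: {
        responseModalities: ["AUDIO"],
        systemInstruction: "Speak briefly.", // locked in server-side by the Gateway
        speechConfig: {
          voiceConfig: { prebuiltVoiceConfig: { voiceName: "Puck" } },
        },
      },
    },
  },
};

// The browser receives only the resulting token name (e.g. "auth_tokens/...")
// and the constrained WebSocket endpoint; the real API key stays on the Gateway.
```

Because the model, instructions, and tool declarations are baked into the token's `liveConnectConstraints`, a browser holding the token cannot widen the session beyond what the Gateway approved.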

docs/providers/openai.md

Lines changed: 11 additions & 1 deletion

@@ -546,7 +546,17 @@ Legacy `plugins.entries.openai.config.personality` is still read as a compatibil
 | API key | `...openai.apiKey` | Falls back to `OPENAI_API_KEY` |

 <Note>
-Supports Azure OpenAI via `azureEndpoint` and `azureDeployment` config keys. Supports bidirectional tool calling. Uses G.711 u-law audio format.
+Supports Azure OpenAI via `azureEndpoint` and `azureDeployment` config keys for backend realtime bridges. Supports bidirectional tool calling. Uses G.711 u-law audio format.
+</Note>
+
+<Note>
+Control UI Talk uses OpenAI browser realtime sessions with a Gateway-minted
+ephemeral client secret and a direct browser WebRTC SDP exchange against the
+OpenAI Realtime API. Maintainer live verification is available with
+`OPENAI_API_KEY=... GEMINI_API_KEY=... node --import tsx scripts/dev/realtime-talk-live-smoke.ts`;
+the OpenAI leg mints a client secret in Node, generates a browser SDP offer
+with fake microphone media, posts it to OpenAI, and applies the SDP answer
+without logging secrets.
 </Note>

 </Accordion>
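The SDP exchange mentioned in the note above can be sketched as a pure helper that assembles the HTTP request a browser would send. This is a sketch under assumptions: the endpoint shape and headers follow OpenAI's documented Realtime WebRTC flow (raw SDP posted with the ephemeral client secret as a Bearer token), while the function name and model value are illustrative.

```typescript
// Hypothetical sketch: assemble the request that exchanges a browser WebRTC
// SDP offer for an SDP answer against the OpenAI Realtime API. Only the
// Gateway-minted ephemeral client secret appears here, never the standard key.
function buildSdpExchangeRequest(
  clientSecret: string,
  model: string,
  sdpOffer: string,
): { url: string; method: string; headers: Record<string, string>; body: string } {
  return {
    url: `https://api.openai.com/v1/realtime?model=${encodeURIComponent(model)}`,
    method: "POST",
    headers: {
      Authorization: `Bearer ${clientSecret}`,
      "Content-Type": "application/sdp", // the offer goes up as raw SDP text
    },
    body: sdpOffer,
  };
}

// Usage (not executed here): fetch(req.url, req) returns the SDP answer text,
// which the browser applies via RTCPeerConnection.setRemoteDescription.
const req = buildSdpExchangeRequest("ek_...", "gpt-realtime", "v=0\r\n");
```

The smoke script described above performs this same exchange from Node with a generated offer and fake microphone media, which is why it can verify the path without a real browser.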

docs/web/control-ui.md

Lines changed: 5 additions & 3 deletions

@@ -87,7 +87,7 @@ The Control UI can localize itself on first load based on your browser locale. T
 <AccordionGroup>
 <Accordion title="Chat and Talk">
 - Chat with the model via Gateway WS (`chat.history`, `chat.send`, `chat.abort`, `chat.inject`).
-- Talk to OpenAI Realtime directly from the browser via WebRTC. The Gateway mints a short-lived Realtime client secret with `talk.realtime.session`; the browser sends microphone audio directly to OpenAI and relays `openclaw_agent_consult` tool calls back through `chat.send` for the larger configured OpenClaw model.
+- Talk through browser realtime sessions. OpenAI uses direct WebRTC, Google Live uses a constrained one-use browser token over WebSocket, and backend-only realtime voice plugins use the Gateway relay transport. The relay keeps provider credentials on the Gateway while the browser streams microphone PCM through `talk.realtime.relay*` RPCs and sends `openclaw_agent_consult` tool calls back through `chat.send` for the larger configured OpenClaw model.
 - Stream tool calls + live tool output cards in Chat (agent events).
 </Accordion>
 <Accordion title="Channels, instances, sessions, dreams">
@@ -144,11 +144,13 @@ The Control UI can localize itself on first load based on your browser locale. T
 - The chat header model and thinking pickers patch the active session immediately through `sessions.patch`; they are persistent session overrides, not one-turn-only send options.
 - When fresh Gateway session usage reports show high context pressure, the chat composer area shows a context notice and, at recommended compaction levels, a compact button that runs the normal session compaction path. Stale token snapshots are hidden until the Gateway reports fresh usage again.
 </Accordion>
-<Accordion title="Talk mode (browser WebRTC)">
-Talk mode uses a registered realtime voice provider that supports browser WebRTC sessions. Configure OpenAI with `talk.provider: "openai"` plus `talk.providers.openai.apiKey`, or reuse the Voice Call realtime provider config. The browser never receives the standard OpenAI API key; it receives only the ephemeral Realtime client secret. Google Live realtime voice is supported for backend Voice Call and Google Meet bridges, but not this browser WebRTC path yet. The Realtime session prompt is assembled by the Gateway; `talk.realtime.session` does not accept caller-provided instruction overrides.
+<Accordion title="Talk mode (browser realtime)">
+Talk mode uses a registered realtime voice provider. Configure OpenAI with `talk.provider: "openai"` plus `talk.providers.openai.apiKey`, or configure Google with `talk.provider: "google"` plus `talk.providers.google.apiKey`; Voice Call realtime provider config can still be reused as the fallback. The browser never receives a standard provider API key. OpenAI receives an ephemeral Realtime client secret for WebRTC. Google Live receives a one-use constrained Live API auth token for a browser WebSocket session, with instructions and tool declarations locked into the token by the Gateway. Providers that only expose a backend realtime bridge run through the Gateway relay transport, so credentials and vendor sockets stay server-side while browser audio moves through authenticated Gateway RPCs. The Realtime session prompt is assembled by the Gateway; `talk.realtime.session` does not accept caller-provided instruction overrides.

 In the Chat composer, the Talk control is the waves button next to the microphone dictation button. When Talk starts, the composer status row shows `Connecting Talk...`, then `Talk live` while audio is connected, or `Asking OpenClaw...` while a realtime tool call is consulting the configured larger model through `chat.send`.

+Maintainer live smoke: `OPENAI_API_KEY=... GEMINI_API_KEY=... node --import tsx scripts/dev/realtime-talk-live-smoke.ts` verifies the OpenAI browser WebRTC SDP exchange, Google Live constrained-token browser WebSocket setup, and the Gateway relay browser adapter with fake microphone media. The command prints provider status only and does not log secrets.
+
 </Accordion>
 <Accordion title="Stop and abort">
 - Click **Stop** (calls `chat.abort`).
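The relay path in the docs above streams microphone PCM through Gateway RPCs. A minimal sketch of the sample conversion a browser adapter would perform, assuming 16 kHz mono Float32 samples from the Web Audio API (the function name is illustrative, not from the commit):

```typescript
// Hypothetical sketch: convert Float32 microphone samples (range [-1, 1]) from
// the Web Audio API into the pcm16 frames the relay transport streams through
// the Gateway RPCs.
function floatToPcm16(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1], then scale to the signed 16-bit range; assignment to
    // an Int16Array truncates any fractional part.
    const s = Math.max(-1, Math.min(1, samples[i]));
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}

const frame = floatToPcm16(new Float32Array([0, 0.5, 1, -1]));
// frame is [0, 16383, 32767, -32768]
```

The asymmetric scale factors keep full-scale input exactly within the `int16` range (`-32768` to `32767`) without overflow.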

extensions/google/realtime-voice-provider.test.ts

Lines changed: 91 additions & 2 deletions

@@ -20,19 +20,25 @@ type MockGoogleLiveConnectParams = {
   };
 };

-const { connectMock, session } = vi.hoisted(() => {
+const { connectMock, createTokenMock, session } = vi.hoisted(() => {
   const session: MockGoogleLiveSession = {
     close: vi.fn(),
     sendClientContent: vi.fn(),
     sendRealtimeInput: vi.fn(),
     sendToolResponse: vi.fn(),
   };
   const connectMock = vi.fn(async (_params: MockGoogleLiveConnectParams) => session);
-  return { connectMock, session };
+  const createTokenMock = vi.fn(async (_params: unknown) => ({
+    name: "auth_tokens/browser-session",
+  }));
+  return { connectMock, createTokenMock, session };
 });

 vi.mock("./google-genai-runtime.js", () => ({
   createGoogleGenAI: vi.fn(() => ({
+    authTokens: {
+      create: createTokenMock,
+    },
     live: {
       connect: connectMock,
     },
@@ -50,6 +56,7 @@ function lastConnectParams(): MockGoogleLiveConnectParams {
 describe("buildGoogleRealtimeVoiceProvider", () => {
   beforeEach(() => {
     connectMock.mockClear();
+    createTokenMock.mockClear();
     session.close.mockClear();
     session.sendClientContent.mockClear();
     session.sendRealtimeInput.mockClear();
@@ -223,6 +230,88 @@ describe("buildGoogleRealtimeVoiceProvider", () => {
     expect(lastConnectParams().config).not.toHaveProperty("temperature");
   });

+  it("creates constrained browser sessions for Google Live Talk", async () => {
+    const provider = buildGoogleRealtimeVoiceProvider();
+
+    const session = await provider.createBrowserSession?.({
+      providerConfig: {
+        apiKey: "gemini-key",
+        model: "gemini-live-2.5-flash-preview",
+        voice: "Puck",
+        temperature: 0.4,
+      },
+      instructions: "Speak briefly.",
+      tools: [
+        {
+          type: "function",
+          name: "openclaw_agent_consult",
+          description: "Ask OpenClaw",
+          parameters: {
+            type: "object",
+            properties: {
+              question: { type: "string" },
+            },
+            required: ["question"],
+          },
+        },
+      ],
+    });
+
+    expect(createTokenMock).toHaveBeenCalledTimes(1);
+    expect(createTokenMock.mock.calls[0]?.[0]).toMatchObject({
+      config: {
+        uses: 1,
+        liveConnectConstraints: {
+          model: "gemini-live-2.5-flash-preview",
+          config: {
+            responseModalities: ["AUDIO"],
+            temperature: 0.4,
+            systemInstruction: "Speak briefly.",
+            speechConfig: {
+              voiceConfig: {
+                prebuiltVoiceConfig: {
+                  voiceName: "Puck",
+                },
+              },
+            },
+            tools: [
+              {
+                functionDeclarations: [
+                  {
+                    name: "openclaw_agent_consult",
+                    behavior: "NON_BLOCKING",
+                  },
+                ],
+              },
+            ],
+          },
+        },
+      },
+    });
+    expect(session).toMatchObject({
+      provider: "google",
+      transport: "json-pcm-websocket",
+      protocol: "google-live-bidi",
+      clientSecret: "auth_tokens/browser-session",
+      websocketUrl:
+        "wss://generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1alpha.GenerativeService.BidiGenerateContentConstrained",
+      audio: {
+        inputEncoding: "pcm16",
+        inputSampleRateHz: 16000,
+        outputEncoding: "pcm16",
+        outputSampleRateHz: 24000,
+      },
+      initialMessage: {
+        setup: {
+          model: "models/gemini-live-2.5-flash-preview",
+          generationConfig: {
+            responseModalities: ["AUDIO"],
+          },
+        },
+      },
+    });
+  });
+
   it("waits for setup completion before draining audio and firing ready", async () => {
     const provider = buildGoogleRealtimeVoiceProvider();
     const onReady = vi.fn();
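Two pure helpers a browser adapter for this `json-pcm-websocket` transport might use can be sketched against the session shape asserted above. This is a hypothetical sketch: the `access_token` query-parameter name is an assumption for illustration, while the `setupComplete` check mirrors the "waits for setup completion" behavior exercised by the tests.

```typescript
// Hypothetical sketch: attach the one-use token to the constrained Live
// WebSocket endpoint. The access_token parameter name is an assumption.
function buildLiveSocketUrl(websocketUrl: string, clientSecret: string): string {
  const url = new URL(websocketUrl);
  url.searchParams.set("access_token", clientSecret);
  return url.toString();
}

// A Live server frame signals readiness via a `setupComplete` field; audio
// should only be drained once this arrives.
function isSetupComplete(rawMessage: string): boolean {
  try {
    const msg = JSON.parse(rawMessage) as { setupComplete?: unknown };
    return msg.setupComplete !== undefined;
  } catch {
    return false; // non-JSON frames are not setup acks
  }
}

const socketUrl = buildLiveSocketUrl(
  "wss://generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1alpha.GenerativeService.BidiGenerateContentConstrained",
  "auth_tokens/browser-session",
);
```

After connecting, the adapter would send the `initialMessage` setup payload from the session object, wait until `isSetupComplete` fires, and only then begin streaming pcm16 microphone frames.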
