Commit 24853ce

docs: outline unified talk API
1 parent 1f7d0ef commit 24853ce

13 files changed

Lines changed: 625 additions & 13 deletions

docs/.i18n/glossary.zh-CN.json

Lines changed: 4 additions & 0 deletions
@@ -35,6 +35,10 @@
     "source": "Channel message API",
     "target": "频道消息 API"
   },
+  {
+    "source": "Talk mode",
+    "target": "Talk 模式"
+  },
   {
     "source": "Azure Speech",
     "target": "Azure Speech"

docs/gateway/config-agents.md

Lines changed: 12 additions & 0 deletions
@@ -1384,6 +1384,18 @@ Defaults for Talk mode (macOS/iOS/Android).
       speechLocale: "ru-RU",
       silenceTimeoutMs: 1500,
       interruptOnSpeech: true,
+      realtime: {
+        provider: "openai",
+        providers: {
+          openai: {
+            model: "gpt-realtime",
+            voice: "alloy",
+          },
+        },
+        mode: "realtime",
+        transport: "webrtc",
+        brain: "agent-consult",
+      },
     },
   }
 ```

docs/gateway/doctor.md

Lines changed: 2 additions & 1 deletion
@@ -166,7 +166,7 @@ That stages grounded durable candidates into the short-term dreaming store while
 <Accordion title="1. Config normalization">
 If the config contains legacy value shapes (for example `messages.ackReaction` without a channel-specific override), doctor normalizes them into the current schema.

-That includes legacy Talk flat fields. Current public Talk config is `talk.provider` + `talk.providers.<provider>`. Doctor rewrites old `talk.voiceId` / `talk.voiceAliases` / `talk.modelId` / `talk.outputFormat` / `talk.apiKey` shapes into the provider map.
+That includes legacy Talk flat fields. Current public Talk speech config is `talk.provider` + `talk.providers.<provider>`, and realtime voice config is `talk.realtime.*`. Doctor rewrites old `talk.voiceId` / `talk.voiceAliases` / `talk.modelId` / `talk.outputFormat` / `talk.apiKey` shapes into the provider map, and rewrites legacy top-level realtime selectors (`talk.mode`, `talk.transport`, `talk.brain`, `talk.model`, `talk.voice`) into `talk.realtime`.

 Doctor also warns when `plugins.allow` is non-empty and tool policy uses
 wildcard or plugin-owned tool entries. `tools.allow: ["*"]` only matches tools
@@ -199,6 +199,7 @@ That stages grounded durable candidates into the short-term dreaming store while
 - `routing.bindings` → top-level `bindings`
 - `routing.agents`/`routing.defaultAgentId` → `agents.list` + `agents.list[].default`
 - legacy `talk.voiceId`/`talk.voiceAliases`/`talk.modelId`/`talk.outputFormat`/`talk.apiKey` → `talk.provider` + `talk.providers.<provider>`
+- legacy top-level realtime Talk selectors (`talk.mode`/`talk.transport`/`talk.brain`/`talk.model`/`talk.voice`) + `talk.provider`/`talk.providers` → `talk.realtime`
 - `routing.agentToAgent` → `tools.agentToAgent`
 - `routing.transcribeAudio` → `tools.media.audio.models`
 - `messages.tts.<provider>` (`openai`/`elevenlabs`/`microsoft`/`edge`) → `messages.tts.providers.<provider>`
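
The realtime-selector migration described in this file can be pictured with a small sketch. It is not Doctor's implementation: `migrateRealtimeSelectors` and `TalkConfig` are hypothetical names, only the key names come from the text above, and the sketch assumes an explicit `talk.realtime` value wins over a legacy one. It moves only `mode`, `transport`, and `brain`, because the diff does not say where legacy `model`, `voice`, `provider`, or `providers` land inside `talk.realtime`.

```typescript
// Hypothetical shape: a loose bag of Talk config keys.
type TalkConfig = Record<string, unknown>;

// Legacy top-level selectors named in the doctor.md change.
const LEGACY_SELECTORS = ["mode", "transport", "brain"] as const;

function migrateRealtimeSelectors(talk: TalkConfig): TalkConfig {
  const next: TalkConfig = { ...talk };
  const existing = next.realtime;
  // Copy any existing talk.realtime object so the input is never mutated.
  const realtime: Record<string, unknown> =
    typeof existing === "object" && existing !== null
      ? { ...(existing as Record<string, unknown>) }
      : {};
  let moved = false;
  for (const key of LEGACY_SELECTORS) {
    if (key in next) {
      // Assumption: a value already set under talk.realtime takes precedence.
      if (!(key in realtime)) realtime[key] = next[key];
      delete next[key];
      moved = true;
    }
  }
  if (moved) next.realtime = realtime;
  return next;
}
```

Returning a new object rather than mutating the input keeps the sketch easy to test and makes a before/after comparison straightforward.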

docs/gateway/protocol.md

Lines changed: 11 additions & 1 deletion
@@ -253,7 +253,8 @@ base method scope:

 Nodes declare capability claims at connect time:

-- `caps`: high-level capability categories.
+- `caps`: high-level capability categories such as `camera`, `canvas`, `screen`,
+`location`, `voice`, and `talk`.
 - `commands`: command allowlist for invoke.
 - `permissions`: granular toggles (e.g. `screen.record`, `camera.capture`).

@@ -361,8 +362,17 @@ enumeration of `src/gateway/server-methods/*.ts`.
 </Accordion>

 <Accordion title="Talk and TTS">
+- `talk.catalog` returns the read-only Talk provider catalog for speech, streaming transcription, and realtime voice. It includes provider ids, labels, configured state, exposed model/voice ids, canonical modes, transports, brain strategies, and realtime audio/capability flags without returning provider secrets or mutating global config.
 - `talk.config` returns the effective Talk config payload; `includeSecrets` requires `operator.talk.secrets` (or `operator.admin`).
+- `talk.handoff.create` creates an expiring managed-room handoff for an existing session key. The result contains a room id, room URL, bearer token, optional session-scoped provider/model/voice selection, mode, transport, brain strategy, and expiry for a first-party walkie-talkie client. `brain: "direct-tools"` requires `operator.admin`.
+- `talk.handoff.join` validates a handoff id plus bearer token, emits `session.ready` or `session.replaced` room events as needed, and returns room/session metadata plus recent Talk events without the plaintext token or stored token hash.
+- `talk.handoff.turnStart`, `talk.handoff.turnEnd`, and `talk.handoff.turnCancel` let a first-party managed-room client drive the room turn lifecycle with `turn.started`, `turn.ended`, and `turn.cancelled` Talk events.
+- `talk.handoff.revoke` invalidates an unexpired handoff, emits `session.closed`, and makes later joins fail.
 - `talk.mode` sets/broadcasts the current Talk mode state for WebChat/Control UI clients.
+- `talk.realtime.session` creates a browser realtime session using canonical transports (`webrtc`, `provider-websocket`, or `gateway-relay`). It accepts optional `mode`, `transport`, and `brain` selectors, but currently only public browser `mode: "realtime"` plus `brain: "agent-consult"` is supported; `managed-room` remains reserved for handoff clients until the browser owns a real room client.
+- `talk.realtime.relayAudio`, `talk.realtime.relayCancel`, `talk.realtime.relayMark`, `talk.realtime.relayStop`, and `talk.realtime.relayToolResult` control Gateway-owned realtime relay sessions. Relay cancellation clears provider output and aborts any linked agent consult run.
+- `talk.realtime.toolCall` lets browser-owned realtime transports forward provider tool calls to Gateway policy. The first supported tool is `openclaw_agent_consult`; clients receive a run id and wait for normal chat lifecycle events before submitting the provider-specific tool result. Gateway relay clients include `relaySessionId` so turn cancellation can abort the consult.
+- `talk.transcription.session` creates a transcription-only Gateway relay over the configured streaming STT provider. Clients send PCM frames through `talk.transcription.relayAudio`, cancel an active turn with `talk.transcription.relayCancel`, receive `talk.transcription.relay` events with common Talk envelopes, and close with `talk.transcription.relayStop`.
 - `talk.speak` synthesizes speech through the active Talk speech provider.
 - `tts.status` returns TTS enabled state, active provider, fallback providers, and provider config state.
 - `tts.providers` returns the visible TTS provider inventory.
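
The selector rules stated for `talk.realtime.session` can be sketched as a client-side pre-check. This is an illustration under stated assumptions, not the Gateway's validation code: `checkBrowserSelectors`, `SessionSelectors`, and the `reason` strings are hypothetical, and treating `managed-room` as a transport value follows the wording in `docs/nodes/talk.md` rather than a confirmed schema.

```typescript
// Canonical browser transports listed in the protocol.md change.
const BROWSER_TRANSPORTS: readonly string[] = ["webrtc", "provider-websocket", "gateway-relay"];

// Hypothetical request shape: all three selectors are optional.
interface SessionSelectors {
  mode?: string;
  transport?: string;
  brain?: string;
}

type Check = { ok: true } | { ok: false; reason: string };

function checkBrowserSelectors(s: SessionSelectors): Check {
  // Only mode "realtime" is currently supported for public browser sessions.
  if (s.mode !== undefined && s.mode !== "realtime") {
    return { ok: false, reason: `unsupported mode: ${s.mode}` };
  }
  // Only brain "agent-consult" is currently supported.
  if (s.brain !== undefined && s.brain !== "agent-consult") {
    return { ok: false, reason: `unsupported brain: ${s.brain}` };
  }
  // "managed-room" is reserved for handoff clients, so it is rejected here.
  if (s.transport !== undefined && !BROWSER_TRANSPORTS.some((t) => t === s.transport)) {
    return { ok: false, reason: `unsupported transport: ${s.transport}` };
  }
  return { ok: true };
}
```

A first-party client would more likely read the valid combinations from `talk.catalog` than hard-code them; the constants here only mirror the prose.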

docs/nodes/index.md

Lines changed: 3 additions & 0 deletions
@@ -197,6 +197,9 @@ Node commands must pass two gates before they can be invoked:

 Windows and macOS companion nodes allow safe declared commands such as
 `canvas.*`, `camera.list`, `location.get`, and `screen.snapshot` by default.
+Trusted nodes that advertise the `talk` capability or declare `talk.*` commands
+also allow declared push-to-talk commands (`talk.ptt.start`, `talk.ptt.stop`,
+`talk.ptt.cancel`, `talk.ptt.once`) by default, independent of platform label.
 Dangerous or privacy-heavy commands such as `camera.snap`, `camera.clip`, and
 `screen.record` still require explicit opt-in with
 `gateway.nodes.allowCommands`. `gateway.nodes.denyCommands` always wins over
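
The push-to-talk default described in this hunk can be sketched as a single predicate. This is a simplified model, not the Gateway's gate: `NodeInfo`, `NodePolicy`, and `isCommandAllowed` are hypothetical, allow and deny entries are matched exactly (no wildcard handling), and the platform-specific safe defaults for `canvas.*` and similar commands are left out.

```typescript
// Hypothetical view of what a node reports at connect time.
interface NodeInfo {
  trusted: boolean;
  caps: string[];
  commands: string[]; // commands the node declares
}

// Hypothetical view of gateway.nodes.allowCommands / denyCommands.
interface NodePolicy {
  allowCommands: string[];
  denyCommands: string[];
}

const PTT_COMMANDS = ["talk.ptt.start", "talk.ptt.stop", "talk.ptt.cancel", "talk.ptt.once"];

function isCommandAllowed(node: NodeInfo, policy: NodePolicy, command: string): boolean {
  // Gate 1: the node must declare the command.
  if (node.commands.indexOf(command) === -1) return false;
  // denyCommands always wins, including over the defaults below.
  if (policy.denyCommands.indexOf(command) !== -1) return false;
  // Explicit opt-in.
  if (policy.allowCommands.indexOf(command) !== -1) return true;
  // Default: trusted Talk-capable nodes may use declared push-to-talk commands.
  const talkCapable =
    node.caps.indexOf("talk") !== -1 || node.commands.some((c) => c.indexOf("talk.") === 0);
  return node.trusted && talkCapable && PTT_COMMANDS.indexOf(command) !== -1;
}
```

Checking the deny list before any allow path is what makes the rule "deny always wins" hold regardless of how the command would otherwise be permitted.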

docs/nodes/talk.md

Lines changed: 34 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -1,18 +1,28 @@
11
---
2-
summary: "Talk mode: continuous speech conversations with configured TTS providers"
2+
summary: "Talk mode: continuous speech conversations across local STT/TTS and realtime voice"
33
read_when:
44
- Implementing Talk mode on macOS/iOS/Android
55
- Changing voice/TTS/interrupt behavior
66
title: "Talk mode"
77
---
88

9-
Talk mode is a continuous voice conversation loop:
9+
Talk mode has two runtime shapes:
10+
11+
- Native macOS/iOS/Android Talk uses local speech recognition, Gateway chat, and `talk.speak` TTS. Nodes advertise the `talk` capability and declare the `talk.*` commands they support.
12+
- Browser Talk uses `talk.realtime.session` with canonical transports: `webrtc`, `provider-websocket`, or `gateway-relay`. `managed-room` is reserved for Gateway handoff rooms.
13+
- Transcription-only clients use `talk.transcription.session` plus `talk.transcription.relayAudio`, `talk.transcription.relayCancel`, and `talk.transcription.relayStop` when they need captions or dictation without an assistant voice response.
14+
15+
Native Talk is a continuous voice conversation loop:
1016

1117
1. Listen for speech
12-
2. Send transcript to the model (main session, chat.send)
18+
2. Send transcript to the model through the active session
1319
3. Wait for the response
1420
4. Speak it via the configured Talk provider (`talk.speak`)
1521

22+
Browser realtime Talk forwards provider tool calls through `talk.realtime.toolCall`; browser clients do not call `chat.send` directly for realtime consults.
23+
24+
Transcription-only Talk emits the same common Talk event envelope as realtime and STT/TTS sessions, but uses `mode: "transcription"` and `brain: "none"`. It is for captions, dictation, and observe-only speech capture; one-shot uploaded voice notes still use the media/audio path.
25+
1626
## Behavior (macOS)
1727

1828
- **Always-on overlay** while Talk mode is enabled.
@@ -66,6 +76,19 @@ Supported keys:
       speechLocale: "ru-RU",
       silenceTimeoutMs: 1500,
       interruptOnSpeech: true,
+      realtime: {
+        provider: "openai",
+        providers: {
+          openai: {
+            apiKey: "openai_api_key",
+            model: "gpt-realtime",
+            voice: "alloy",
+          },
+        },
+        mode: "realtime",
+        transport: "webrtc",
+        brain: "agent-consult",
+      },
     },
   }
 ```
@@ -79,6 +102,11 @@ Defaults:
 - `providers.elevenlabs.modelId`: defaults to `eleven_v3` when unset.
 - `providers.mlx.modelId`: defaults to `mlx-community/Soprano-80M-bf16` when unset.
 - `providers.elevenlabs.apiKey`: falls back to `ELEVENLABS_API_KEY` (or gateway shell profile if available).
+- `realtime.provider`: selects the active browser/server realtime voice provider. Use `openai` for WebRTC, `google` for provider WebSocket, or a bridge-only provider through Gateway relay.
+- `realtime.providers.<provider>` stores provider-owned realtime config. The browser receives only ephemeral or constrained session credentials, never a standard API key.
+- `realtime.brain`: `agent-consult` routes realtime tool calls through Gateway policy; `direct-tools` is owner-only compatibility behavior; `none` is for transcription or external orchestration.
+- `talk.catalog` exposes each provider's valid modes, transports, brain strategies, realtime audio formats, and capability flags so first-party Talk clients can avoid unsupported combinations.
+- Streaming transcription providers are discovered through `talk.catalog.transcription`. The current Gateway relay uses the Voice Call streaming provider config until the dedicated Talk transcription config surface is added.
 - `speechLocale`: optional BCP 47 locale id for on-device Talk speech recognition on iOS/macOS. Leave unset to use the device default.
 - `outputFormat`: defaults to `pcm_44100` on macOS/iOS and `pcm_24000` on Android (set `mp3_*` to force MP3 streaming)

@@ -103,7 +131,9 @@ Defaults:
 ## Notes

 - Requires Speech + Microphone permissions.
-- Uses `chat.send` against session key `main`.
+- Native Talk uses the active Gateway session and only falls back to history polling when response events are unavailable.
+- Browser realtime Talk uses `talk.realtime.toolCall` for `openclaw_agent_consult` instead of exposing `chat.send` to provider-owned browser sessions.
+- Transcription-only Talk uses `talk.transcription.session`, `talk.transcription.relayAudio`, `talk.transcription.relayCancel`, and `talk.transcription.relayStop`; clients subscribe to `talk.transcription.relay` events for partial/final transcript updates.
 - The gateway resolves Talk playback through `talk.speak` using the active Talk provider. Android falls back to local system TTS only when that RPC is unavailable.
 - macOS local MLX playback uses the bundled `openclaw-mlx-tts` helper when present, or an executable on `PATH`. Set `OPENCLAW_MLX_TTS_BIN` to point at a custom helper binary during development.
 - `stability` for `eleven_v3` is validated to `0.0`, `0.5`, or `1.0`; other models accept `0..1`.
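
The `stability` rule in the last context line of this file is easy to state as a predicate. A minimal sketch, assuming the check is a plain comparison on the model id: `isValidStability` is a hypothetical name, and whether the real validator rejects, clamps, or rounds out-of-range values is not stated in the diff.

```typescript
// Returns true when `stability` is acceptable for the given ElevenLabs model id.
function isValidStability(modelId: string, stability: number): boolean {
  if (modelId === "eleven_v3") {
    // eleven_v3 accepts only three discrete values.
    return stability === 0 || stability === 0.5 || stability === 1;
  }
  // Other models accept the closed range 0..1 (NaN fails both comparisons).
  return stability >= 0 && stability <= 1;
}
```

For example, `isValidStability("eleven_v3", 0.3)` is `false`, while the same value passes for any other model id.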

docs/platforms/ios.md

Lines changed: 4 additions & 0 deletions
@@ -263,6 +263,10 @@ openclaw nodes invoke --node "iOS Node" --command canvas.snapshot --params '{"ma
 ## Voice wake + talk mode

 - Voice wake and talk mode are available in Settings.
+- Talk-capable iOS nodes advertise the `talk` capability and can declare
+`talk.ptt.start`, `talk.ptt.stop`, `talk.ptt.cancel`, and `talk.ptt.once`;
+the Gateway allows those push-to-talk commands by default for trusted
+Talk-capable nodes.
 - iOS may suspend background audio; treat voice features as best-effort when the app is not active.

 ## Common errors
