openclaw
diff --git a/docs/.i18n/glossary.zh-CN.json b/docs/.i18n/glossary.zh-CN.json
Lines changed: 4 additions & 0 deletions
diff --git a/docs/gateway/config-agents.md b/docs/gateway/config-agents.md
Lines changed: 12 additions & 0 deletions
diff --git a/docs/gateway/doctor.md b/docs/gateway/doctor.md
Lines changed: 2 additions & 1 deletion
diff --git a/docs/gateway/protocol.md b/docs/gateway/protocol.md
Lines changed: 11 additions & 1 deletion
diff --git a/docs/nodes/index.md b/docs/nodes/index.md
Lines changed: 3 additions & 0 deletions
diff --git a/docs/nodes/talk.md b/docs/nodes/talk.md
Lines changed: 34 additions & 4 deletions
diff --git a/docs/platforms/ios.md b/docs/platforms/ios.md
Lines changed: 4 additions & 0 deletions
@@ -35,6 +35,10 @@
     "source": "Channel message API",
     "target": "频道消息 API"
   },
+  {
+    "source": "Talk mode",
+    "target": "Talk 模式"
+  },
   {
     "source": "Azure Speech",
     "target": "Azure Speech"
 
@@ -1384,6 +1384,18 @@ Defaults for Talk mode (macOS/iOS/Android).
     speechLocale: "ru-RU",
     silenceTimeoutMs: 1500,
     interruptOnSpeech: true,
+    realtime: {
+      provider: "openai",
+      providers: {
+        openai: {
+          model: "gpt-realtime",
+          voice: "alloy",
+        },
+      },
+      mode: "realtime",
+      transport: "webrtc",
+      brain: "agent-consult",
+    },
   },
 }
 ```
 
@@ -166,7 +166,7 @@ That stages grounded durable candidates into the short-term dreaming store while
   <Accordion title="1. Config normalization">
     If the config contains legacy value shapes (for example `messages.ackReaction` without a channel-specific override), doctor normalizes them into the current schema.
 
-    That includes legacy Talk flat fields. Current public Talk config is `talk.provider` + `talk.providers.<provider>`. Doctor rewrites old `talk.voiceId` / `talk.voiceAliases` / `talk.modelId` / `talk.outputFormat` / `talk.apiKey` shapes into the provider map.
+    That includes legacy Talk flat fields. Current public Talk speech config is `talk.provider` + `talk.providers.<provider>`, and realtime voice config is `talk.realtime.*`. Doctor rewrites old `talk.voiceId` / `talk.voiceAliases` / `talk.modelId` / `talk.outputFormat` / `talk.apiKey` shapes into the provider map, and rewrites legacy top-level realtime selectors (`talk.mode`, `talk.transport`, `talk.brain`, `talk.model`, `talk.voice`) into `talk.realtime`.
 
     Doctor also warns when `plugins.allow` is non-empty and tool policy uses
     wildcard or plugin-owned tool entries. `tools.allow: ["*"]` only matches tools
@@ -199,6 +199,7 @@ That stages grounded durable candidates into the short-term dreaming store while
     - `routing.bindings` → top-level `bindings`
     - `routing.agents`/`routing.defaultAgentId` → `agents.list` + `agents.list[].default`
     - legacy `talk.voiceId`/`talk.voiceAliases`/`talk.modelId`/`talk.outputFormat`/`talk.apiKey` → `talk.provider` + `talk.providers.<provider>`
+    - legacy top-level realtime Talk selectors (`talk.mode`/`talk.transport`/`talk.brain`/`talk.model`/`talk.voice`) + `talk.provider`/`talk.providers` → `talk.realtime`
     - `routing.agentToAgent` → `tools.agentToAgent`
     - `routing.transcribeAudio` → `tools.media.audio.models`
     - `messages.tts.<provider>` (`openai`/`elevenlabs`/`microsoft`/`edge`) → `messages.tts.providers.<provider>`
 
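The realtime migration described in the doctor.md hunk above can be sketched roughly as follows. This is a minimal illustration, not the doctor implementation: the `TalkConfig` shape, the `migrateLegacyRealtime` name, and the explicit `providerId` argument are all hypothetical, and the sketch assumes that `model` and `voice` land under `talk.realtime.providers.<provider>` (as in the config example) and that values already present under `talk.realtime` take precedence over legacy ones.

```typescript
// Hypothetical shapes for illustration only; the real schema may differ.
interface RealtimeProviderConfig {
  model?: string;
  voice?: string;
}

interface RealtimeConfig {
  provider?: string;
  providers?: Record<string, RealtimeProviderConfig>;
  mode?: string;
  transport?: string;
  brain?: string;
}

interface TalkConfig {
  // Legacy top-level realtime selectors.
  mode?: string;
  transport?: string;
  brain?: string;
  model?: string;
  voice?: string;
  // Current shape.
  realtime?: RealtimeConfig;
  silenceTimeoutMs?: number;
  interruptOnSpeech?: boolean;
}

// Moves legacy top-level selectors into talk.realtime.
// Assumption: existing talk.realtime values win over legacy values.
function migrateLegacyRealtime(talk: TalkConfig, providerId: string): TalkConfig {
  const { mode, transport, brain, model, voice, ...rest } = talk;
  const legacy = [mode, transport, brain, model, voice];
  if (legacy.every((value) => value === undefined)) {
    return talk; // Nothing to migrate.
  }

  const existing: RealtimeConfig = rest.realtime ?? {};
  const existingProviders = existing.providers ?? {};
  const existingProvider: RealtimeProviderConfig = existingProviders[providerId] ?? {};

  const realtime: RealtimeConfig = {
    ...existing,
    provider: existing.provider ?? providerId,
    mode: existing.mode ?? mode,
    transport: existing.transport ?? transport,
    brain: existing.brain ?? brain,
    providers: {
      ...existingProviders,
      // Assumption: model/voice belong to the provider-owned config.
      [providerId]: {
        ...existingProvider,
        model: existingProvider.model ?? model,
        voice: existingProvider.voice ?? voice,
      },
    },
  };

  // The legacy keys are dropped because they were destructured out of `rest`.
  return { ...rest, realtime };
}
```

Passing the provider id explicitly sidesteps the question of how doctor picks the realtime provider when `talk.provider` also names a speech provider; the diff does not spell that rule out.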
@@ -253,7 +253,8 @@ base method scope:
 
 Nodes declare capability claims at connect time:
 
-- `caps`: high-level capability categories.
+- `caps`: high-level capability categories such as `camera`, `canvas`, `screen`,
+  `location`, `voice`, and `talk`.
 - `commands`: command allowlist for invoke.
 - `permissions`: granular toggles (e.g. `screen.record`, `camera.capture`).
 
@@ -361,8 +362,17 @@ enumeration of `src/gateway/server-methods/*.ts`.
   </Accordion>
 
   <Accordion title="Talk and TTS">
+    - `talk.catalog` returns the read-only Talk provider catalog for speech, streaming transcription, and realtime voice. It includes provider ids, labels, configured state, exposed model/voice ids, canonical modes, transports, brain strategies, and realtime audio/capability flags without returning provider secrets or mutating global config.
     - `talk.config` returns the effective Talk config payload; `includeSecrets` requires `operator.talk.secrets` (or `operator.admin`).
+    - `talk.handoff.create` creates an expiring managed-room handoff for an existing session key. The result contains a room id, room URL, bearer token, optional session-scoped provider/model/voice selection, mode, transport, brain strategy, and expiry for a first-party walkie-talkie client. `brain: "direct-tools"` requires `operator.admin`.
+    - `talk.handoff.join` validates a handoff id plus bearer token, emits `session.ready` or `session.replaced` room events as needed, and returns room/session metadata plus recent Talk events without the plaintext token or stored token hash.
+    - `talk.handoff.turnStart`, `talk.handoff.turnEnd`, and `talk.handoff.turnCancel` let a first-party managed-room client drive the room turn lifecycle with `turn.started`, `turn.ended`, and `turn.cancelled` Talk events.
+    - `talk.handoff.revoke` invalidates an unexpired handoff, emits `session.closed`, and makes later joins fail.
     - `talk.mode` sets/broadcasts the current Talk mode state for WebChat/Control UI clients.
+    - `talk.realtime.session` creates a browser realtime session using canonical transports (`webrtc`, `provider-websocket`, or `gateway-relay`). It accepts optional `mode`, `transport`, and `brain` selectors, but public browser sessions currently support only `mode: "realtime"` with `brain: "agent-consult"`; `managed-room` remains reserved for handoff clients until the browser owns a real room client.
+    - `talk.realtime.relayAudio`, `talk.realtime.relayCancel`, `talk.realtime.relayMark`, `talk.realtime.relayStop`, and `talk.realtime.relayToolResult` control Gateway-owned realtime relay sessions. Relay cancellation clears provider output and aborts any linked agent consult run.
+    - `talk.realtime.toolCall` lets browser-owned realtime transports forward provider tool calls to Gateway policy. The first supported tool is `openclaw_agent_consult`; clients receive a run id and wait for normal chat lifecycle events before submitting the provider-specific tool result. Gateway relay clients include `relaySessionId` so turn cancellation can abort the consult.
+    - `talk.transcription.session` creates a transcription-only Gateway relay over the configured streaming STT provider. Clients send PCM frames through `talk.transcription.relayAudio`, cancel an active turn with `talk.transcription.relayCancel`, receive `talk.transcription.relay` events with common Talk envelopes, and close with `talk.transcription.relayStop`.
     - `talk.speak` synthesizes speech through the active Talk speech provider.
     - `tts.status` returns TTS enabled state, active provider, fallback providers, and provider config state.
     - `tts.providers` returns the visible TTS provider inventory.
 
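The selector rules stated for `talk.realtime.session` in the hunk above could be checked along these lines. This is a sketch under stated assumptions: the `validateRealtimeSelectors` function, the `RealtimeSessionSelectors` type, and the error strings are hypothetical and not taken from the Gateway source; only the transport names and the supported `mode`/`brain` pair come from the diff.

```typescript
// Canonical browser transports named in the protocol docs.
const CANONICAL_TRANSPORTS: readonly string[] = [
  "webrtc",
  "provider-websocket",
  "gateway-relay",
];

// Hypothetical request fragment: all three selectors are optional.
interface RealtimeSessionSelectors {
  mode?: string;
  transport?: string;
  brain?: string;
}

type SelectorCheck = { ok: true } | { ok: false; reason: string };

// Rejects selector combinations the docs describe as unsupported for
// public browser sessions. Unset selectors are treated as "use the default".
function validateRealtimeSelectors(selectors: RealtimeSessionSelectors): SelectorCheck {
  const { mode, transport, brain } = selectors;

  if (transport !== undefined && CANONICAL_TRANSPORTS.indexOf(transport) === -1) {
    // `managed-room` lands here: it is reserved for handoff clients.
    return { ok: false, reason: `unsupported transport: ${transport}` };
  }
  if (mode !== undefined && mode !== "realtime") {
    return { ok: false, reason: `unsupported mode: ${mode}` };
  }
  if (brain !== undefined && brain !== "agent-consult") {
    return { ok: false, reason: `unsupported brain: ${brain}` };
  }
  return { ok: true };
}
```

A first-party client would more likely read the valid combinations from `talk.catalog` than hard-code them; the constants here only mirror the prose.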
@@ -197,6 +197,9 @@ Node commands must pass two gates before they can be invoked:
 
 Windows and macOS companion nodes allow safe declared commands such as
 `canvas.*`, `camera.list`, `location.get`, and `screen.snapshot` by default.
+Trusted nodes that advertise the `talk` capability or declare `talk.*` commands
+also allow declared push-to-talk commands (`talk.ptt.start`, `talk.ptt.stop`,
+`talk.ptt.cancel`, `talk.ptt.once`) by default, independent of platform label.
 Dangerous or privacy-heavy commands such as `camera.snap`, `camera.clip`, and
 `screen.record` still require explicit opt-in with
 `gateway.nodes.allowCommands`. `gateway.nodes.denyCommands` always wins over
 
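The default push-to-talk allowance added in the hunk above might look like the following check. It is only a sketch: `NodeClaims`, its `trusted` flag, and `isPttAllowedByDefault` are hypothetical names, and deny entries are assumed to be exact command names rather than patterns.

```typescript
// Push-to-talk commands listed in the docs.
const PTT_COMMANDS: readonly string[] = [
  "talk.ptt.start",
  "talk.ptt.stop",
  "talk.ptt.cancel",
  "talk.ptt.once",
];

// Hypothetical view of what a node declared at connect time.
interface NodeClaims {
  trusted: boolean;
  caps: string[];
  commands: string[];
}

// Returns true when a push-to-talk command is allowed without explicit opt-in.
function isPttAllowedByDefault(
  node: NodeClaims,
  command: string,
  denyCommands: string[] = [],
): boolean {
  // gateway.nodes.denyCommands always wins (assumed exact-match here).
  if (denyCommands.indexOf(command) !== -1) return false;
  // Only the four push-to-talk commands get this default.
  if (PTT_COMMANDS.indexOf(command) === -1) return false;
  if (!node.trusted) return false;
  // The node must have declared the command itself.
  if (node.commands.indexOf(command) === -1) return false;
  // Mirrors the wording "advertise the talk capability or declare talk.* commands".
  // Once the command is declared this is always true; it is kept for clarity.
  const declaresTalk = node.commands.some((name) => name.indexOf("talk.") === 0);
  return node.caps.indexOf("talk") !== -1 || declaresTalk;
}
```

The platform label is deliberately absent from `NodeClaims`, since the docs state the allowance is independent of it.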
@@ -1,18 +1,28 @@
 ---
-summary: "Talk mode: continuous speech conversations with configured TTS providers"
+summary: "Talk mode: continuous speech conversations across local STT/TTS and realtime voice"
 read_when:
   - Implementing Talk mode on macOS/iOS/Android
   - Changing voice/TTS/interrupt behavior
 title: "Talk mode"
 ---
 
-Talk mode is a continuous voice conversation loop:
+Talk mode has three runtime shapes:
+
+- Native macOS/iOS/Android Talk uses local speech recognition, Gateway chat, and `talk.speak` TTS. Nodes advertise the `talk` capability and declare the `talk.*` commands they support.
+- Browser Talk uses `talk.realtime.session` with canonical transports: `webrtc`, `provider-websocket`, or `gateway-relay`. `managed-room` is reserved for Gateway handoff rooms.
+- Transcription-only clients use `talk.transcription.session` plus `talk.transcription.relayAudio`, `talk.transcription.relayCancel`, and `talk.transcription.relayStop` when they need captions or dictation without an assistant voice response.
+
+Native Talk is a continuous voice conversation loop:
 
 1. Listen for speech
-2. Send transcript to the model (main session, chat.send)
+2. Send transcript to the model through the active session
 3. Wait for the response
 4. Speak it via the configured Talk provider (`talk.speak`)
 
+Browser realtime Talk forwards provider tool calls through `talk.realtime.toolCall`; browser clients do not call `chat.send` directly for realtime consults.
+
+Transcription-only Talk emits the same common Talk event envelope as realtime and STT/TTS sessions, but uses `mode: "transcription"` and `brain: "none"`. It is for captions, dictation, and observe-only speech capture; one-shot uploaded voice notes still use the media/audio path.
+
 ## Behavior (macOS)
 
 - **Always-on overlay** while Talk mode is enabled.
@@ -66,6 +76,19 @@ Supported keys:
     speechLocale: "ru-RU",
     silenceTimeoutMs: 1500,
     interruptOnSpeech: true,
+    realtime: {
+      provider: "openai",
+      providers: {
+        openai: {
+          apiKey: "openai_api_key",
+          model: "gpt-realtime",
+          voice: "alloy",
+        },
+      },
+      mode: "realtime",
+      transport: "webrtc",
+      brain: "agent-consult",
+    },
   },
 }
 ```
@@ -79,6 +102,11 @@ Defaults:
 - `providers.elevenlabs.modelId`: defaults to `eleven_v3` when unset.
 - `providers.mlx.modelId`: defaults to `mlx-community/Soprano-80M-bf16` when unset.
 - `providers.elevenlabs.apiKey`: falls back to `ELEVENLABS_API_KEY` (or gateway shell profile if available).
+- `realtime.provider`: selects the active browser/server realtime voice provider. Use `openai` for WebRTC, `google` for provider WebSocket, or a bridge-only provider through Gateway relay.
+- `realtime.providers.<provider>` stores provider-owned realtime config. The browser receives only ephemeral or constrained session credentials, never a standard API key.
+- `realtime.brain`: `agent-consult` routes realtime tool calls through Gateway policy; `direct-tools` is owner-only compatibility behavior; `none` is for transcription or external orchestration.
+- `talk.catalog` exposes each provider's valid modes, transports, brain strategies, realtime audio formats, and capability flags so first-party Talk clients can avoid unsupported combinations.
+- Streaming transcription providers are discovered through `talk.catalog.transcription`. The current Gateway relay uses the Voice Call streaming provider config until the dedicated Talk transcription config surface is added.
 - `speechLocale`: optional BCP 47 locale id for on-device Talk speech recognition on iOS/macOS. Leave unset to use the device default.
 - `outputFormat`: defaults to `pcm_44100` on macOS/iOS and `pcm_24000` on Android (set `mp3_*` to force MP3 streaming)
 
@@ -103,7 +131,9 @@ Defaults:
 ## Notes
 
 - Requires Speech + Microphone permissions.
-- Uses `chat.send` against session key `main`.
+- Native Talk uses the active Gateway session and only falls back to history polling when response events are unavailable.
+- Browser realtime Talk uses `talk.realtime.toolCall` for `openclaw_agent_consult` instead of exposing `chat.send` to provider-owned browser sessions.
+- Transcription-only Talk uses `talk.transcription.session`, `talk.transcription.relayAudio`, `talk.transcription.relayCancel`, and `talk.transcription.relayStop`; clients subscribe to `talk.transcription.relay` events for partial/final transcript updates.
 - The gateway resolves Talk playback through `talk.speak` using the active Talk provider. Android falls back to local system TTS only when that RPC is unavailable.
 - macOS local MLX playback uses the bundled `openclaw-mlx-tts` helper when present, or an executable on `PATH`. Set `OPENCLAW_MLX_TTS_BIN` to point at a custom helper binary during development.
 - `stability` for `eleven_v3` is validated to `0.0`, `0.5`, or `1.0`; other models accept `0..1`.
 
@@ -263,6 +263,10 @@ openclaw nodes invoke --node "iOS Node" --command canvas.snapshot --params '{"ma
 ## Voice wake + talk mode
 
 - Voice wake and talk mode are available in Settings.
+- Talk-capable iOS nodes advertise the `talk` capability and can declare
+  `talk.ptt.start`, `talk.ptt.stop`, `talk.ptt.cancel`, and `talk.ptt.once`;
+  the Gateway allows those push-to-talk commands by default for trusted
+  Talk-capable nodes.
 - iOS may suspend background audio; treat voice features as best-effort when the app is not active.
 
 ## Common errors