openclaw
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 1 addition & 0 deletions
diff --git a/docs/.generated/plugin-sdk-api-baseline.sha256 b/docs/.generated/plugin-sdk-api-baseline.sha256
Lines changed: 2 additions & 2 deletions
diff --git a/docs/channels/discord.md b/docs/channels/discord.md
Lines changed: 2 additions & 1 deletion
diff --git a/docs/providers/elevenlabs.md b/docs/providers/elevenlabs.md
Lines changed: 7 additions & 0 deletions
diff --git a/docs/providers/google.md b/docs/providers/google.md
Lines changed: 5 additions & 0 deletions
diff --git a/docs/tools/tts.md b/docs/tools/tts.md
Lines changed: 17 additions & 17 deletions
diff --git a/extensions/discord/src/voice/manager.e2e.test.ts b/extensions/discord/src/voice/manager.e2e.test.ts
Lines changed: 54 additions & 1 deletion
diff --git a/extensions/discord/src/voice/segment.ts b/extensions/discord/src/voice/segment.ts
Lines changed: 28 additions & 12 deletions
@@ -6,6 +6,7 @@ Docs: https://docs.openclaw.ai
 
 ### Changes
 
+- Discord/voice: stream ElevenLabs TTS directly into Discord playback and send ElevenLabs latency optimization as the documented query parameter so spoken replies can start sooner.
 - Discord/voice: keep TTS playback running when another user starts speaking, ignore new capture during playback to avoid feedback loops, and downgrade expected receive-stream aborts to verbose diagnostics.
 - Telegram: treat successful same-chat `message` tool outbound sends during an inbound telegram turn as delivered when deciding whether to emit the rewritten silent reply fallback (#78685). Thanks @neeravmakwana.
 - Gateway/tasks: reconcile stale CLI run-context tasks whose live run context disappeared even when a child session row remains, and apply the default bounded reload deferral timeout to channel hot reloads so stale task records cannot block Discord/Slack/Telegram reloads forever.
 
@@ -1,2 +1,2 @@
-bf73d3d6b83410753ee782289e4748c96d97bc76459b116e5e03c678996da360  plugin-sdk-api-baseline.json
-f6a9f57d7b632391061c5bac78366bcb01318e0fde26a437e48606bdb70fe9fa  plugin-sdk-api-baseline.jsonl
+e26753b5aaa10cd98cb0e07fca4034c091471cf434239cc3597b62b5a62b082b  plugin-sdk-api-baseline.json
+7b998abde706a1afe4d1e4475a87069c31f673c3c90b8a7f23f7ba8cff6d1c85  plugin-sdk-api-baseline.jsonl
@@ -1207,6 +1207,7 @@ Notes:
 - `voice.reconnectGraceMs` controls how long OpenClaw waits for a disconnected voice session to begin reconnecting before destroying it. Default: `15000`.
 - Voice playback does not stop just because another user starts speaking. To avoid feedback loops, OpenClaw ignores new voice capture while TTS is playing; speak after playback finishes for the next turn.
 - `voice.captureSilenceGraceMs` controls how long OpenClaw waits after Discord reports a speaker has stopped before finalizing that audio segment for STT. Default: `2500`; raise this if Discord splits normal pauses into choppy partial transcripts.
+- When ElevenLabs is the selected TTS provider, Discord voice playback uses streaming TTS and starts from the provider response stream. Providers without streaming support fall back to the synthesized temp-file path.
 - OpenClaw also watches receive decrypt failures and auto-recovers by leaving/rejoining the voice channel after repeated failures in a short window.
 - If receive logs repeatedly show `DecryptionFailed(UnencryptedWhenPassthroughDisabled)` after updating, collect a dependency report and logs. The bundled `@discordjs/voice` line includes the upstream padding fix from discord.js PR #11449, which closed discord.js issue #11419.
 - `The operation was aborted` receive events are expected when OpenClaw finalizes a captured speaker segment; they are verbose diagnostics, not warnings.
@@ -1217,7 +1218,7 @@ Voice channel pipeline:
 - `tools.media.audio` handles STT, for example `openai/gpt-4o-mini-transcribe`.
 - The transcript is sent through Discord ingress and routing while the response LLM runs with a voice-output policy that hides the agent `tts` tool and asks for returned text, because Discord voice owns final TTS playback.
 - `voice.model`, when set, overrides only the response LLM for this voice-channel turn.
-- `voice.tts` is merged over `messages.tts`; the resulting audio is played in the joined channel.
+- `voice.tts` is merged over `messages.tts`; streaming-capable providers feed the player directly, otherwise the resulting audio file is played in the joined channel.
 
 Credentials are resolved per component: LLM route auth for `voice.model`, STT auth for `tools.media.audio`, and TTS auth for `messages.tts`/`voice.tts`.
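
Reviewer sketch for the `voice.tts` bullet above: one way the "stream first, otherwise synthesized file" selection could look. This is a minimal illustration under stated assumptions, not the code in this PR. The result shapes mirror the mocks in `manager.e2e.test.ts`; `TtsRuntime`, `resolveVoiceReplyAudio`, the `"file"` mode name, and the request shape passed to `textToSpeech` are hypothetical.

```typescript
// Result shapes modeled on the mocks in manager.e2e.test.ts (assumption).
type StreamResult =
  | { success: true; audioStream: ReadableStream<Uint8Array>; release: () => Promise<void> }
  | { success: false; error: string };
type FileResult = { success: true; audioPath: string } | { success: false; error: string };

// Only mode "stream" appears in the diff; "file" is a hypothetical name for the other branch.
type VoiceReplyAudio =
  | { mode: "stream"; audioStream: ReadableStream<Uint8Array>; release: () => Promise<void> }
  | { mode: "file"; audioPath: string };

// Hypothetical slice of the TTS runtime used by Discord voice.
interface TtsRuntime {
  textToSpeechStream(req: {
    text: string;
    channel: string;
    disableFallback: boolean;
  }): Promise<StreamResult>;
  textToSpeech(req: { text: string; channel: string }): Promise<FileResult>;
}

async function resolveVoiceReplyAudio(
  tts: TtsRuntime,
  text: string,
): Promise<VoiceReplyAudio | undefined> {
  // Ask the selected provider for a stream only; do not let it fall back internally.
  const streamed = await tts.textToSpeechStream({
    text,
    channel: "discord",
    disableFallback: true,
  });
  if (streamed.success) {
    return { mode: "stream", audioStream: streamed.audioStream, release: streamed.release };
  }
  // Provider cannot stream: synthesize to a temp file instead.
  const file = await tts.textToSpeech({ text, channel: "discord" });
  return file.success ? { mode: "file", audioPath: file.audioPath } : undefined;
}
```

Keeping the two outcomes in one discriminated union lets the playback code branch on `mode` and release stream resources only when a stream was actually opened.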
 
 
@@ -46,6 +46,13 @@ export ELEVENLABS_API_KEY="..."
 Set `modelId` to `eleven_v3` to use ElevenLabs v3 TTS. OpenClaw keeps
 `eleven_multilingual_v2` as the default for existing installs.
 
+Discord voice channels use ElevenLabs' streaming TTS endpoint when ElevenLabs is
+the selected `voice.tts`/`messages.tts` provider. Playback starts from the
+returned audio stream instead of waiting for OpenClaw to download and write the
+whole audio file first. `latencyTier` maps to ElevenLabs'
+`optimize_streaming_latency` query parameter for models that accept it; OpenClaw
+omits that parameter for `eleven_v3`, which rejects it.
+
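
Reviewer sketch for the `latencyTier` paragraph above: how the tier might be attached as a query parameter and skipped for `eleven_v3`. This is an illustration only, not OpenClaw's implementation. `buildElevenLabsStreamUrl` and `MODELS_WITHOUT_LATENCY_PARAM` are hypothetical names, and the endpoint path is assumed from ElevenLabs' public streaming text-to-speech API.

```typescript
// Assumption: only `eleven_v3` is named in the docs change as rejecting the parameter.
const MODELS_WITHOUT_LATENCY_PARAM = new Set<string>(["eleven_v3"]);

// Hypothetical helper that builds the streaming TTS request URL.
function buildElevenLabsStreamUrl(params: {
  voiceId: string;
  modelId: string;
  latencyTier?: number;
}): string {
  const url = new URL(
    `https://api.elevenlabs.io/v1/text-to-speech/${encodeURIComponent(params.voiceId)}/stream`,
  );
  const acceptsLatency = !MODELS_WITHOUT_LATENCY_PARAM.has(params.modelId);
  if (params.latencyTier !== undefined && acceptsLatency) {
    // Sent in the query string rather than in the JSON request body.
    url.searchParams.set("optimize_streaming_latency", String(params.latencyTier));
  }
  return url.toString();
}
```

With `modelId: "eleven_multilingual_v2"` and `latencyTier: 3` the URL ends in `?optimize_streaming_latency=3`; with `modelId: "eleven_v3"` the query string is left empty.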
 ## Speech-to-text
 
 Use Scribe v2 for inbound audio attachments and short recorded voice segments:
 
@@ -287,6 +287,11 @@ The bundled `google` speech provider uses the Gemini API TTS path with
 - Output: WAV for regular TTS attachments, Opus for voice-note targets, PCM for Talk/telephony
 - Voice-note output: Google PCM is wrapped as WAV and transcoded to 48 kHz Opus with `ffmpeg`
 
+Google's batch Gemini TTS path returns generated audio in the completed
+`generateContent` response. For lowest-latency spoken conversations, use the
+Google realtime voice provider backed by the Gemini Live API instead of batch
+TTS.
+
 To use Google as the default TTS provider:
 
 ```json5
 
@@ -60,23 +60,23 @@ speech.
 
 ## Supported providers
 
-| Provider          | Auth                                                                                                             | Notes                                                                   |
-| ----------------- | ---------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------- |
-| **Azure Speech**  | `AZURE_SPEECH_KEY` + `AZURE_SPEECH_REGION` (also `AZURE_SPEECH_API_KEY`, `SPEECH_KEY`, `SPEECH_REGION`)          | Native Ogg/Opus voice-note output and telephony.                        |
-| **DeepInfra**     | `DEEPINFRA_API_KEY`                                                                                              | OpenAI-compatible TTS. Defaults to `hexgrad/Kokoro-82M`.                |
-| **ElevenLabs**    | `ELEVENLABS_API_KEY` or `XI_API_KEY`                                                                             | Voice cloning, multilingual, deterministic via `seed`.                  |
-| **Google Gemini** | `GEMINI_API_KEY` or `GOOGLE_API_KEY`                                                                             | Gemini API TTS; persona-aware via `promptTemplate: "audio-profile-v1"`. |
-| **Gradium**       | `GRADIUM_API_KEY`                                                                                                | Voice-note and telephony output.                                        |
-| **Inworld**       | `INWORLD_API_KEY`                                                                                                | Streaming TTS API. Native Opus voice-note and PCM telephony.            |
-| **Local CLI**     | none                                                                                                             | Runs a configured local TTS command.                                    |
-| **Microsoft**     | none                                                                                                             | Public Edge neural TTS via `node-edge-tts`. Best-effort, no SLA.        |
-| **MiniMax**       | `MINIMAX_API_KEY` (or Token Plan: `MINIMAX_OAUTH_TOKEN`, `MINIMAX_CODE_PLAN_KEY`, `MINIMAX_CODING_API_KEY`)      | T2A v2 API. Defaults to `speech-2.8-hd`.                                |
-| **OpenAI**        | `OPENAI_API_KEY`                                                                                                 | Also used for auto-summary; supports persona `instructions`.            |
-| **OpenRouter**    | `OPENROUTER_API_KEY` (can reuse `models.providers.openrouter.apiKey`)                                            | Default model `hexgrad/kokoro-82m`.                                     |
-| **Volcengine**    | `VOLCENGINE_TTS_API_KEY` or `BYTEPLUS_SEED_SPEECH_API_KEY` (legacy AppID/token: `VOLCENGINE_TTS_APPID`/`_TOKEN`) | BytePlus Seed Speech HTTP API.                                          |
-| **Vydra**         | `VYDRA_API_KEY`                                                                                                  | Shared image, video, and speech provider.                               |
-| **xAI**           | `XAI_API_KEY`                                                                                                    | xAI batch TTS. Native Opus voice-note is **not** supported.             |
-| **Xiaomi MiMo**   | `XIAOMI_API_KEY`                                                                                                 | MiMo TTS through Xiaomi chat completions.                               |
+| Provider          | Auth                                                                                                             | Notes                                                                                       |
+| ----------------- | ---------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------- |
+| **Azure Speech**  | `AZURE_SPEECH_KEY` + `AZURE_SPEECH_REGION` (also `AZURE_SPEECH_API_KEY`, `SPEECH_KEY`, `SPEECH_REGION`)          | Native Ogg/Opus voice-note output and telephony.                                            |
+| **DeepInfra**     | `DEEPINFRA_API_KEY`                                                                                              | OpenAI-compatible TTS. Defaults to `hexgrad/Kokoro-82M`.                                    |
+| **ElevenLabs**    | `ELEVENLABS_API_KEY` or `XI_API_KEY`                                                                             | Voice cloning, multilingual, deterministic via `seed`; streamed for Discord voice playback. |
+| **Google Gemini** | `GEMINI_API_KEY` or `GOOGLE_API_KEY`                                                                             | Gemini API batch TTS; persona-aware via `promptTemplate: "audio-profile-v1"`.               |
+| **Gradium**       | `GRADIUM_API_KEY`                                                                                                | Voice-note and telephony output.                                                            |
+| **Inworld**       | `INWORLD_API_KEY`                                                                                                | Streaming TTS API. Native Opus voice-note and PCM telephony.                                |
+| **Local CLI**     | none                                                                                                             | Runs a configured local TTS command.                                                        |
+| **Microsoft**     | none                                                                                                             | Public Edge neural TTS via `node-edge-tts`. Best-effort, no SLA.                            |
+| **MiniMax**       | `MINIMAX_API_KEY` (or Token Plan: `MINIMAX_OAUTH_TOKEN`, `MINIMAX_CODE_PLAN_KEY`, `MINIMAX_CODING_API_KEY`)      | T2A v2 API. Defaults to `speech-2.8-hd`.                                                    |
+| **OpenAI**        | `OPENAI_API_KEY`                                                                                                 | Also used for auto-summary; supports persona `instructions`.                                |
+| **OpenRouter**    | `OPENROUTER_API_KEY` (can reuse `models.providers.openrouter.apiKey`)                                            | Default model `hexgrad/kokoro-82m`.                                                         |
+| **Volcengine**    | `VOLCENGINE_TTS_API_KEY` or `BYTEPLUS_SEED_SPEECH_API_KEY` (legacy AppID/token: `VOLCENGINE_TTS_APPID`/`_TOKEN`) | BytePlus Seed Speech HTTP API.                                                              |
+| **Vydra**         | `VYDRA_API_KEY`                                                                                                  | Shared image, video, and speech provider.                                                   |
+| **xAI**           | `XAI_API_KEY`                                                                                                    | xAI batch TTS. Native Opus voice-note is **not** supported.                                 |
+| **Xiaomi MiMo**   | `XIAOMI_API_KEY`                                                                                                 | MiMo TTS through Xiaomi chat completions.                                                   |
 
 If multiple providers are configured, the selected one is used first and the
 others are fallback options. Auto-summary uses `summaryModel` (or
 
@@ -10,9 +10,11 @@ const {
   joinVoiceChannelMock,
   entersStateMock,
   createAudioPlayerMock,
+  createAudioResourceMock,
   resolveAgentRouteMock,
   agentCommandMock,
   transcribeAudioFileMock,
+  textToSpeechStreamMock,
   textToSpeechMock,
 } = vi.hoisted(() => {
   type EventHandler = (...args: unknown[]) => unknown;
@@ -94,6 +96,7 @@ const {
     entersStateMock: vi.fn(async (_target?: unknown, _state?: string, _timeoutMs?: number) => {
       return undefined;
     }),
+    createAudioResourceMock: vi.fn(),
     createAudioPlayerMock: vi.fn(() => ({
       on: vi.fn(),
       off: vi.fn(),
@@ -104,6 +107,9 @@ const {
     resolveAgentRouteMock: vi.fn(() => ({ agentId: "agent-1", sessionKey: "discord:g1:c1" })),
     agentCommandMock: vi.fn(async (_opts?: unknown, _runtime?: unknown) => ({ payloads: [] })),
     transcribeAudioFileMock: vi.fn(async () => ({ text: "hello from voice" })),
+    textToSpeechStreamMock: vi.fn(
+      async (): Promise<unknown> => ({ success: false, error: "stream unavailable" }),
+    ),
     textToSpeechMock: vi.fn(async () => ({ success: true, audioPath: "/tmp/voice.mp3" })),
   };
 });
@@ -121,7 +127,7 @@ vi.mock("./sdk-runtime.js", () => ({
       Connecting: "connecting",
     },
     createAudioPlayer: createAudioPlayerMock,
-    createAudioResource: vi.fn(),
+    createAudioResource: createAudioResourceMock,
     entersState: entersStateMock,
     getVoiceConnection: getVoiceConnectionMock,
     joinVoiceChannel: joinVoiceChannelMock,
@@ -154,6 +160,7 @@ vi.mock("../runtime.js", () => ({
       transcribeAudioFile: transcribeAudioFileMock,
     },
     tts: {
+      textToSpeechStream: textToSpeechStreamMock,
       textToSpeech: textToSpeechMock,
     },
   }),
@@ -207,8 +214,11 @@ describe("DiscordVoiceManager", () => {
     agentCommandMock.mockResolvedValue({ payloads: [] });
     transcribeAudioFileMock.mockReset();
     transcribeAudioFileMock.mockResolvedValue({ text: "hello from voice" });
+    textToSpeechStreamMock.mockReset();
+    textToSpeechStreamMock.mockResolvedValue({ success: false, error: "stream unavailable" });
     textToSpeechMock.mockReset();
     textToSpeechMock.mockResolvedValue({ success: true, audioPath: "/tmp/voice.mp3" });
+    createAudioResourceMock.mockClear();
   });
 
   const createManager = (
@@ -750,6 +760,49 @@ describe("DiscordVoiceManager", () => {
     );
   });
 
+  it("plays streaming TTS audio before falling back to a synthesized file", async () => {
+    const release = vi.fn(async () => undefined);
+    textToSpeechStreamMock.mockResolvedValue({
+      success: true,
+      audioStream: new ReadableStream<Uint8Array>({
+        start(controller) {
+          controller.enqueue(new Uint8Array([1, 2, 3]));
+          controller.close();
+        },
+      }),
+      release,
+    });
+    agentCommandMock.mockResolvedValueOnce({
+      payloads: [{ text: "hello back" }],
+    } as never);
+
+    const client = createClient();
+    client.fetchMember.mockResolvedValue({
+      nickname: "Guest Nick",
+      user: {
+        id: "u-guest",
+        username: "guest",
+        globalName: "Guest",
+        discriminator: "4321",
+      },
+    });
+    const manager = createManager({ groupPolicy: "open" }, client, {
+      commands: { useAccessGroups: false },
+    });
+    await processVoiceSegment(manager, "u-guest");
+
+    expect(textToSpeechStreamMock).toHaveBeenCalledWith(
+      expect.objectContaining({
+        channel: "discord",
+        disableFallback: true,
+        text: "hello back",
+      }),
+    );
+    expect(textToSpeechMock).not.toHaveBeenCalled();
+    expect(createAudioResourceMock).toHaveBeenCalledWith(expect.anything());
+    await vi.waitFor(() => expect(release).toHaveBeenCalledTimes(1));
+  });
+
   it("passes per-channel system prompt overrides to voice agent runs", async () => {
     const client = createClient();
     client.fetchMember.mockResolvedValue({
 
@@ -1,4 +1,5 @@
 import path from "node:path";
+import { Readable } from "node:stream";
 import { agentCommandFromIngress } from "openclaw/plugin-sdk/agent-runtime";
 import type { DiscordAccountConfig, OpenClawConfig } from "openclaw/plugin-sdk/config-types";
 import type { RuntimeEnv } from "openclaw/plugin-sdk/runtime-env";
@@ -139,18 +140,33 @@ export async function processDiscordVoiceSegment(params: {
   );
 
   params.enqueuePlayback(entry, async () => {
-    logVoiceVerbose(
-      `playback start: guild ${entry.guildId} channel ${entry.channelId} file ${path.basename(voiceReplyAudio.audioPath)}`,
-    );
     const voiceSdk = loadDiscordVoiceSdk();
-    const resource = voiceSdk.createAudioResource(voiceReplyAudio.audioPath);
-    entry.player.play(resource);
-    await voiceSdk
-      .entersState(entry.player, voiceSdk.AudioPlayerStatus.Playing, PLAYBACK_READY_TIMEOUT_MS)
-      .catch(() => undefined);
-    await voiceSdk
-      .entersState(entry.player, voiceSdk.AudioPlayerStatus.Idle, SPEAKING_READY_TIMEOUT_MS)
-      .catch(() => undefined);
-    logVoiceVerbose(`playback done: guild ${entry.guildId} channel ${entry.channelId}`);
+    const releaseAudioStream =
+      voiceReplyAudio.mode === "stream" ? voiceReplyAudio.release : undefined;
+    try {
+      if (voiceReplyAudio.mode === "stream") {
+        logVoiceVerbose(`playback start: guild ${entry.guildId} channel ${entry.channelId} stream`);
+        const nodeStream = Readable.fromWeb(
+          voiceReplyAudio.audioStream as import("node:stream/web").ReadableStream<Uint8Array>,
+        );
+        const resource = voiceSdk.createAudioResource(nodeStream);
+        entry.player.play(resource);
+      } else {
+        logVoiceVerbose(
+          `playback start: guild ${entry.guildId} channel ${entry.channelId} file ${path.basename(voiceReplyAudio.audioPath)}`,
+        );
+        const resource = voiceSdk.createAudioResource(voiceReplyAudio.audioPath);
+        entry.player.play(resource);
+      }
+      await voiceSdk
+        .entersState(entry.player, voiceSdk.AudioPlayerStatus.Playing, PLAYBACK_READY_TIMEOUT_MS)
+        .catch(() => undefined);
+      await voiceSdk
+        .entersState(entry.player, voiceSdk.AudioPlayerStatus.Idle, SPEAKING_READY_TIMEOUT_MS)
+        .catch(() => undefined);
+      logVoiceVerbose(`playback done: guild ${entry.guildId} channel ${entry.channelId}`);
+    } finally {
+      await releaseAudioStream?.();
+    }
   });
 }
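
Reviewer sketch of the pattern in the `segment.ts` hunk above, reduced to a standalone form: convert the web `ReadableStream` with `Readable.fromWeb`, consume it, and run `release` in `finally` so cleanup happens even if playback throws. `playStreamAndRelease` and `drain` are hypothetical; `drain` stands in for handing the stream to `createAudioResource` and the player.

```typescript
import { Readable } from "node:stream";
import type { ReadableStream as NodeWebReadableStream } from "node:stream/web";

// Hypothetical stand-in for the audio player: reads the stream to the end
// and reports how many bytes it saw.
async function drain(stream: Readable): Promise<number> {
  let total = 0;
  for await (const chunk of stream) {
    total += (chunk as Uint8Array).byteLength;
  }
  return total;
}

async function playStreamAndRelease(
  audioStream: ReadableStream<Uint8Array>,
  release: () => Promise<void>,
): Promise<number> {
  try {
    // The global ReadableStream type and node:stream/web's type differ in
    // TypeScript, so a cast is needed before calling Readable.fromWeb.
    const nodeStream = Readable.fromWeb(
      audioStream as unknown as NodeWebReadableStream<Uint8Array>,
    );
    return await drain(nodeStream);
  } finally {
    // Runs on success and on failure, like the finally block in the hunk above.
    await release();
  }
}
```

Running `release` from `finally` keeps provider-side resources (for example an open HTTP response) from leaking when the player never reaches `Idle` or the stream errors mid-playback.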