Commit 5aefe6a

feat: stream elevenlabs tts into discord voice
1 parent 85b914a commit 5aefe6a

25 files changed

Lines changed: 607 additions & 57 deletions

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -6,6 +6,7 @@ Docs: https://docs.openclaw.ai

 ### Changes

+- Discord/voice: stream ElevenLabs TTS directly into Discord playback and send ElevenLabs latency optimization as the documented query parameter so spoken replies can start sooner.
 - Discord/voice: keep TTS playback running when another user starts speaking, ignore new capture during playback to avoid feedback loops, and downgrade expected receive-stream aborts to verbose diagnostics.
 - Telegram: treat successful same-chat `message` tool outbound sends during an inbound telegram turn as delivered when deciding whether to emit the rewritten silent reply fallback (#78685). Thanks @neeravmakwana.
 - Gateway/tasks: reconcile stale CLI run-context tasks whose live run context disappeared even when a child session row remains, and apply the default bounded reload deferral timeout to channel hot reloads so stale task records cannot block Discord/Slack/Telegram reloads forever.
Lines changed: 2 additions & 2 deletions
@@ -1,2 +1,2 @@
-bf73d3d6b83410753ee782289e4748c96d97bc76459b116e5e03c678996da360 plugin-sdk-api-baseline.json
-f6a9f57d7b632391061c5bac78366bcb01318e0fde26a437e48606bdb70fe9fa plugin-sdk-api-baseline.jsonl
+e26753b5aaa10cd98cb0e07fca4034c091471cf434239cc3597b62b5a62b082b plugin-sdk-api-baseline.json
+7b998abde706a1afe4d1e4475a87069c31f673c3c90b8a7f23f7ba8cff6d1c85 plugin-sdk-api-baseline.jsonl

docs/channels/discord.md

Lines changed: 2 additions & 1 deletion
@@ -1207,6 +1207,7 @@ Notes:
 - `voice.reconnectGraceMs` controls how long OpenClaw waits for a disconnected voice session to begin reconnecting before destroying it. Default: `15000`.
 - Voice playback does not stop just because another user starts speaking. To avoid feedback loops, OpenClaw ignores new voice capture while TTS is playing; speak after playback finishes for the next turn.
 - `voice.captureSilenceGraceMs` controls how long OpenClaw waits after Discord reports a speaker has stopped before finalizing that audio segment for STT. Default: `2500`; raise this if Discord splits normal pauses into choppy partial transcripts.
+- When ElevenLabs is the selected TTS provider, Discord voice playback uses streaming TTS and starts from the provider response stream. Providers without streaming support fall back to the synthesized temp-file path.
 - OpenClaw also watches receive decrypt failures and auto-recovers by leaving/rejoining the voice channel after repeated failures in a short window.
 - If receive logs repeatedly show `DecryptionFailed(UnencryptedWhenPassthroughDisabled)` after updating, collect a dependency report and logs. The bundled `@discordjs/voice` line includes the upstream padding fix from discord.js PR #11449, which closed discord.js issue #11419.
 - `The operation was aborted` receive events are expected when OpenClaw finalizes a captured speaker segment; they are verbose diagnostics, not warnings.
@@ -1217,7 +1218,7 @@ Voice channel pipeline:
 - `tools.media.audio` handles STT, for example `openai/gpt-4o-mini-transcribe`.
 - The transcript is sent through Discord ingress and routing while the response LLM runs with a voice-output policy that hides the agent `tts` tool and asks for returned text, because Discord voice owns final TTS playback.
 - `voice.model`, when set, overrides only the response LLM for this voice-channel turn.
-- `voice.tts` is merged over `messages.tts`; the resulting audio is played in the joined channel.
+- `voice.tts` is merged over `messages.tts`; streaming-capable providers feed the player directly, otherwise the resulting audio file is played in the joined channel.

 Credentials are resolved per component: LLM route auth for `voice.model`, STT auth for `tools.media.audio`, and TTS auth for `messages.tts`/`voice.tts`.
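The stream-first, file-fallback behavior described in these doc changes can be pictured roughly as follows. This is a minimal sketch, not the code from this commit: `TtsRuntime`, `VoiceReplyAudio`, `resolveVoiceReplyAudio`, and the `"file"` mode label are hypothetical names, and the result shapes are inferred from the test mocks later in the diff (`success`, `audioStream`, `release`, `audioPath`, `disableFallback`).

```typescript
// Hypothetical result shapes, modeled on the mocks in manager.e2e.test.ts.
type StreamResult =
  | { success: true; audioStream: ReadableStream<Uint8Array>; release?: () => Promise<void> }
  | { success: false; error: string };
type FileResult = { success: true; audioPath: string } | { success: false; error: string };

// Discriminated union consumed by playback; "file" is an assumed label for the non-stream branch.
type VoiceReplyAudio =
  | { mode: "stream"; audioStream: ReadableStream<Uint8Array>; release?: () => Promise<void> }
  | { mode: "file"; audioPath: string };

// Hypothetical slice of the TTS runtime used for illustration only.
interface TtsRuntime {
  textToSpeechStream(req: { text: string; disableFallback: boolean }): Promise<StreamResult>;
  textToSpeech(req: { text: string }): Promise<FileResult>;
}

async function resolveVoiceReplyAudio(
  tts: TtsRuntime,
  text: string,
): Promise<VoiceReplyAudio | undefined> {
  // Ask only the selected provider for a stream; do not let the stream call try other providers.
  const streamed = await tts.textToSpeechStream({ text, disableFallback: true });
  if (streamed.success) {
    return { mode: "stream", audioStream: streamed.audioStream, release: streamed.release };
  }
  // No streaming support (or the stream failed): synthesize to a file instead.
  const file = await tts.textToSpeech({ text });
  return file.success ? { mode: "file", audioPath: file.audioPath } : undefined;
}
```

Tagging the result with a `mode` field lets the playback code branch once and keeps the `release` callback attached to the stream it belongs to.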

docs/providers/elevenlabs.md

Lines changed: 7 additions & 0 deletions
@@ -46,6 +46,13 @@ export ELEVENLABS_API_KEY="..."
 Set `modelId` to `eleven_v3` to use ElevenLabs v3 TTS. OpenClaw keeps
 `eleven_multilingual_v2` as the default for existing installs.

+Discord voice channels use ElevenLabs' streaming TTS endpoint when ElevenLabs is
+the selected `voice.tts`/`messages.tts` provider. Playback starts from the
+returned audio stream instead of waiting for OpenClaw to download and write the
+whole audio file first. `latencyTier` maps to ElevenLabs'
+`optimize_streaming_latency` query parameter for models that accept it; OpenClaw
+omits that parameter for `eleven_v3`, which rejects it.
+
 ## Speech-to-text

 Use Scribe v2 for inbound audio attachments and short recorded voice segments:
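As an illustration of the `latencyTier` handling documented in this hunk, the sketch below builds a streaming request URL and attaches `optimize_streaming_latency` only for models that accept it. The helper name `buildStreamUrl`, the numeric tier type, and the direct tier-to-value mapping are assumptions for illustration; the commit's actual request builder is not shown in this diff. The endpoint path follows ElevenLabs' public text-to-speech streaming API.

```typescript
// Assumed base URL constant; OpenClaw may resolve this from configuration instead.
const ELEVENLABS_BASE_URL = "https://api.elevenlabs.io";

// Hypothetical helper: returns the streaming TTS URL for a voice and model.
function buildStreamUrl(voiceId: string, modelId: string, latencyTier?: number): string {
  const url = new URL(
    `/v1/text-to-speech/${encodeURIComponent(voiceId)}/stream`,
    ELEVENLABS_BASE_URL,
  );
  // Per the docs above, eleven_v3 rejects the parameter, so it is left off for that model.
  if (latencyTier !== undefined && modelId !== "eleven_v3") {
    url.searchParams.set("optimize_streaming_latency", String(latencyTier));
  }
  return url.toString();
}
```

Under these assumptions, `buildStreamUrl("abc", "eleven_multilingual_v2", 3)` yields `https://api.elevenlabs.io/v1/text-to-speech/abc/stream?optimize_streaming_latency=3`, while the same call with `"eleven_v3"` yields the URL without a query string.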

docs/providers/google.md

Lines changed: 5 additions & 0 deletions
@@ -287,6 +287,11 @@ The bundled `google` speech provider uses the Gemini API TTS path with
 - Output: WAV for regular TTS attachments, Opus for voice-note targets, PCM for Talk/telephony
 - Voice-note output: Google PCM is wrapped as WAV and transcoded to 48 kHz Opus with `ffmpeg`

+Google's batch Gemini TTS path returns generated audio in the completed
+`generateContent` response. For lowest-latency spoken conversations, use the
+Google realtime voice provider backed by the Gemini Live API instead of batch
+TTS.
+
 To use Google as the default TTS provider:

 ```json5

docs/tools/tts.md

Lines changed: 17 additions & 17 deletions
@@ -60,23 +60,23 @@ speech.

 ## Supported providers

-| Provider          | Auth                                                                                                             | Notes                                                                   |
-| ----------------- | ---------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------- |
-| **Azure Speech**  | `AZURE_SPEECH_KEY` + `AZURE_SPEECH_REGION` (also `AZURE_SPEECH_API_KEY`, `SPEECH_KEY`, `SPEECH_REGION`)          | Native Ogg/Opus voice-note output and telephony.                        |
-| **DeepInfra**     | `DEEPINFRA_API_KEY`                                                                                              | OpenAI-compatible TTS. Defaults to `hexgrad/Kokoro-82M`.                |
-| **ElevenLabs**    | `ELEVENLABS_API_KEY` or `XI_API_KEY`                                                                             | Voice cloning, multilingual, deterministic via `seed`.                  |
-| **Google Gemini** | `GEMINI_API_KEY` or `GOOGLE_API_KEY`                                                                             | Gemini API TTS; persona-aware via `promptTemplate: "audio-profile-v1"`. |
-| **Gradium**       | `GRADIUM_API_KEY`                                                                                                | Voice-note and telephony output.                                        |
-| **Inworld**       | `INWORLD_API_KEY`                                                                                                | Streaming TTS API. Native Opus voice-note and PCM telephony.            |
-| **Local CLI**     | none                                                                                                             | Runs a configured local TTS command.                                    |
-| **Microsoft**     | none                                                                                                             | Public Edge neural TTS via `node-edge-tts`. Best-effort, no SLA.        |
-| **MiniMax**       | `MINIMAX_API_KEY` (or Token Plan: `MINIMAX_OAUTH_TOKEN`, `MINIMAX_CODE_PLAN_KEY`, `MINIMAX_CODING_API_KEY`)      | T2A v2 API. Defaults to `speech-2.8-hd`.                                |
-| **OpenAI**        | `OPENAI_API_KEY`                                                                                                 | Also used for auto-summary; supports persona `instructions`.            |
-| **OpenRouter**    | `OPENROUTER_API_KEY` (can reuse `models.providers.openrouter.apiKey`)                                            | Default model `hexgrad/kokoro-82m`.                                     |
-| **Volcengine**    | `VOLCENGINE_TTS_API_KEY` or `BYTEPLUS_SEED_SPEECH_API_KEY` (legacy AppID/token: `VOLCENGINE_TTS_APPID`/`_TOKEN`) | BytePlus Seed Speech HTTP API.                                          |
-| **Vydra**         | `VYDRA_API_KEY`                                                                                                  | Shared image, video, and speech provider.                               |
-| **xAI**           | `XAI_API_KEY`                                                                                                    | xAI batch TTS. Native Opus voice-note is **not** supported.             |
-| **Xiaomi MiMo**   | `XIAOMI_API_KEY`                                                                                                 | MiMo TTS through Xiaomi chat completions.                               |
+| Provider          | Auth                                                                                                             | Notes                                                                                       |
+| ----------------- | ---------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------- |
+| **Azure Speech**  | `AZURE_SPEECH_KEY` + `AZURE_SPEECH_REGION` (also `AZURE_SPEECH_API_KEY`, `SPEECH_KEY`, `SPEECH_REGION`)          | Native Ogg/Opus voice-note output and telephony.                                            |
+| **DeepInfra**     | `DEEPINFRA_API_KEY`                                                                                              | OpenAI-compatible TTS. Defaults to `hexgrad/Kokoro-82M`.                                    |
+| **ElevenLabs**    | `ELEVENLABS_API_KEY` or `XI_API_KEY`                                                                             | Voice cloning, multilingual, deterministic via `seed`; streamed for Discord voice playback. |
+| **Google Gemini** | `GEMINI_API_KEY` or `GOOGLE_API_KEY`                                                                             | Gemini API batch TTS; persona-aware via `promptTemplate: "audio-profile-v1"`.               |
+| **Gradium**       | `GRADIUM_API_KEY`                                                                                                | Voice-note and telephony output.                                                            |
+| **Inworld**       | `INWORLD_API_KEY`                                                                                                | Streaming TTS API. Native Opus voice-note and PCM telephony.                                |
+| **Local CLI**     | none                                                                                                             | Runs a configured local TTS command.                                                        |
+| **Microsoft**     | none                                                                                                             | Public Edge neural TTS via `node-edge-tts`. Best-effort, no SLA.                            |
+| **MiniMax**       | `MINIMAX_API_KEY` (or Token Plan: `MINIMAX_OAUTH_TOKEN`, `MINIMAX_CODE_PLAN_KEY`, `MINIMAX_CODING_API_KEY`)      | T2A v2 API. Defaults to `speech-2.8-hd`.                                                    |
+| **OpenAI**        | `OPENAI_API_KEY`                                                                                                 | Also used for auto-summary; supports persona `instructions`.                                |
+| **OpenRouter**    | `OPENROUTER_API_KEY` (can reuse `models.providers.openrouter.apiKey`)                                            | Default model `hexgrad/kokoro-82m`.                                                         |
+| **Volcengine**    | `VOLCENGINE_TTS_API_KEY` or `BYTEPLUS_SEED_SPEECH_API_KEY` (legacy AppID/token: `VOLCENGINE_TTS_APPID`/`_TOKEN`) | BytePlus Seed Speech HTTP API.                                                              |
+| **Vydra**         | `VYDRA_API_KEY`                                                                                                  | Shared image, video, and speech provider.                                                   |
+| **xAI**           | `XAI_API_KEY`                                                                                                    | xAI batch TTS. Native Opus voice-note is **not** supported.                                 |
+| **Xiaomi MiMo**   | `XIAOMI_API_KEY`                                                                                                 | MiMo TTS through Xiaomi chat completions.                                                   |

 If multiple providers are configured, the selected one is used first and the
 others are fallback options. Auto-summary uses `summaryModel` (or

extensions/discord/src/voice/manager.e2e.test.ts

Lines changed: 54 additions & 1 deletion
@@ -10,9 +10,11 @@ const {
   joinVoiceChannelMock,
   entersStateMock,
   createAudioPlayerMock,
+  createAudioResourceMock,
   resolveAgentRouteMock,
   agentCommandMock,
   transcribeAudioFileMock,
+  textToSpeechStreamMock,
   textToSpeechMock,
 } = vi.hoisted(() => {
   type EventHandler = (...args: unknown[]) => unknown;
@@ -94,6 +96,7 @@ const {
     entersStateMock: vi.fn(async (_target?: unknown, _state?: string, _timeoutMs?: number) => {
       return undefined;
     }),
+    createAudioResourceMock: vi.fn(),
     createAudioPlayerMock: vi.fn(() => ({
       on: vi.fn(),
       off: vi.fn(),
@@ -104,6 +107,9 @@ const {
     resolveAgentRouteMock: vi.fn(() => ({ agentId: "agent-1", sessionKey: "discord:g1:c1" })),
     agentCommandMock: vi.fn(async (_opts?: unknown, _runtime?: unknown) => ({ payloads: [] })),
     transcribeAudioFileMock: vi.fn(async () => ({ text: "hello from voice" })),
+    textToSpeechStreamMock: vi.fn(
+      async (): Promise<unknown> => ({ success: false, error: "stream unavailable" }),
+    ),
     textToSpeechMock: vi.fn(async () => ({ success: true, audioPath: "/tmp/voice.mp3" })),
   };
 });
@@ -121,7 +127,7 @@ vi.mock("./sdk-runtime.js", () => ({
       Connecting: "connecting",
     },
     createAudioPlayer: createAudioPlayerMock,
-    createAudioResource: vi.fn(),
+    createAudioResource: createAudioResourceMock,
     entersState: entersStateMock,
     getVoiceConnection: getVoiceConnectionMock,
     joinVoiceChannel: joinVoiceChannelMock,
@@ -154,6 +160,7 @@ vi.mock("../runtime.js", () => ({
       transcribeAudioFile: transcribeAudioFileMock,
     },
     tts: {
+      textToSpeechStream: textToSpeechStreamMock,
       textToSpeech: textToSpeechMock,
     },
   }),
@@ -207,8 +214,11 @@ describe("DiscordVoiceManager", () => {
     agentCommandMock.mockResolvedValue({ payloads: [] });
     transcribeAudioFileMock.mockReset();
     transcribeAudioFileMock.mockResolvedValue({ text: "hello from voice" });
+    textToSpeechStreamMock.mockReset();
+    textToSpeechStreamMock.mockResolvedValue({ success: false, error: "stream unavailable" });
     textToSpeechMock.mockReset();
     textToSpeechMock.mockResolvedValue({ success: true, audioPath: "/tmp/voice.mp3" });
+    createAudioResourceMock.mockClear();
   });

   const createManager = (
@@ -750,6 +760,49 @@ describe("DiscordVoiceManager", () => {
     );
   });

+  it("plays streaming TTS audio before falling back to a synthesized file", async () => {
+    const release = vi.fn(async () => undefined);
+    textToSpeechStreamMock.mockResolvedValue({
+      success: true,
+      audioStream: new ReadableStream<Uint8Array>({
+        start(controller) {
+          controller.enqueue(new Uint8Array([1, 2, 3]));
+          controller.close();
+        },
+      }),
+      release,
+    });
+    agentCommandMock.mockResolvedValueOnce({
+      payloads: [{ text: "hello back" }],
+    } as never);
+
+    const client = createClient();
+    client.fetchMember.mockResolvedValue({
+      nickname: "Guest Nick",
+      user: {
+        id: "u-guest",
+        username: "guest",
+        globalName: "Guest",
+        discriminator: "4321",
+      },
+    });
+    const manager = createManager({ groupPolicy: "open" }, client, {
+      commands: { useAccessGroups: false },
+    });
+    await processVoiceSegment(manager, "u-guest");
+
+    expect(textToSpeechStreamMock).toHaveBeenCalledWith(
+      expect.objectContaining({
+        channel: "discord",
+        disableFallback: true,
+        text: "hello back",
+      }),
+    );
+    expect(textToSpeechMock).not.toHaveBeenCalled();
+    expect(createAudioResourceMock).toHaveBeenCalledWith(expect.anything());
+    await vi.waitFor(() => expect(release).toHaveBeenCalledTimes(1));
+  });
+
   it("passes per-channel system prompt overrides to voice agent runs", async () => {
     const client = createClient();
     client.fetchMember.mockResolvedValue({

extensions/discord/src/voice/segment.ts

Lines changed: 28 additions & 12 deletions
@@ -1,4 +1,5 @@
 import path from "node:path";
+import { Readable } from "node:stream";
 import { agentCommandFromIngress } from "openclaw/plugin-sdk/agent-runtime";
 import type { DiscordAccountConfig, OpenClawConfig } from "openclaw/plugin-sdk/config-types";
 import type { RuntimeEnv } from "openclaw/plugin-sdk/runtime-env";
@@ -139,18 +140,33 @@ export async function processDiscordVoiceSegment(params: {
   );

   params.enqueuePlayback(entry, async () => {
-    logVoiceVerbose(
-      `playback start: guild ${entry.guildId} channel ${entry.channelId} file ${path.basename(voiceReplyAudio.audioPath)}`,
-    );
     const voiceSdk = loadDiscordVoiceSdk();
-    const resource = voiceSdk.createAudioResource(voiceReplyAudio.audioPath);
-    entry.player.play(resource);
-    await voiceSdk
-      .entersState(entry.player, voiceSdk.AudioPlayerStatus.Playing, PLAYBACK_READY_TIMEOUT_MS)
-      .catch(() => undefined);
-    await voiceSdk
-      .entersState(entry.player, voiceSdk.AudioPlayerStatus.Idle, SPEAKING_READY_TIMEOUT_MS)
-      .catch(() => undefined);
-    logVoiceVerbose(`playback done: guild ${entry.guildId} channel ${entry.channelId}`);
+    const releaseAudioStream =
+      voiceReplyAudio.mode === "stream" ? voiceReplyAudio.release : undefined;
+    try {
+      if (voiceReplyAudio.mode === "stream") {
+        logVoiceVerbose(`playback start: guild ${entry.guildId} channel ${entry.channelId} stream`);
+        const nodeStream = Readable.fromWeb(
+          voiceReplyAudio.audioStream as import("node:stream/web").ReadableStream<Uint8Array>,
+        );
+        const resource = voiceSdk.createAudioResource(nodeStream);
+        entry.player.play(resource);
+      } else {
+        logVoiceVerbose(
+          `playback start: guild ${entry.guildId} channel ${entry.channelId} file ${path.basename(voiceReplyAudio.audioPath)}`,
+        );
+        const resource = voiceSdk.createAudioResource(voiceReplyAudio.audioPath);
+        entry.player.play(resource);
+      }
+      await voiceSdk
+        .entersState(entry.player, voiceSdk.AudioPlayerStatus.Playing, PLAYBACK_READY_TIMEOUT_MS)
+        .catch(() => undefined);
+      await voiceSdk
+        .entersState(entry.player, voiceSdk.AudioPlayerStatus.Idle, SPEAKING_READY_TIMEOUT_MS)
+        .catch(() => undefined);
+      logVoiceVerbose(`playback done: guild ${entry.guildId} channel ${entry.channelId}`);
+    } finally {
+      await releaseAudioStream?.();
+    }
   });
 }