Commit ec8dbc4

feat(tts): add xiaomi mimo speech provider
1 parent e10f200 commit ec8dbc4

10 files changed

Lines changed: 789 additions & 10 deletions

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -11,6 +11,7 @@ Docs: https://docs.openclaw.ai
 - Diagnostics/OTEL: support `OPENCLAW_OTEL_PRELOADED=1` so the plugin can reuse an already-registered OpenTelemetry SDK while keeping OpenClaw diagnostic listeners wired. (#71450) Thanks @vincentkoc and @jlapenna.
 - Control UI: refine the agent Tool Access panel with compact live-tool chips, collapsible tool groups, direct per-tool toggles, and clearer runtime/source provenance. (#71405) Thanks @BunsDev.
 - Memory-core/hybrid search: expose raw `vectorScore` and `textScore` alongside the combined `score` on hybrid memory search results, so callers can inspect vector-versus-text retrieval contribution before temporal decay or MMR reordering. Fixes #68166. (#68286) Thanks @ajfonthemove.
+- Providers/Xiaomi: add MiMo TTS as a bundled speech provider with MP3/WAV output and voice-note Opus transcoding. Fixes #52376. (#55614) Thanks @zoujiejun.

 ### Fixes

docs/providers/xiaomi.md

Lines changed: 40 additions & 0 deletions
@@ -53,6 +53,46 @@ OpenAI-compatible endpoint with API-key authentication.
 The default model ref is `xiaomi/mimo-v2-flash`. The provider is injected automatically when `XIAOMI_API_KEY` is set or an auth profile exists.
 </Tip>

+## Text-to-speech
+
+The bundled `xiaomi` plugin also registers Xiaomi MiMo as a speech provider for
+`messages.tts`. It calls Xiaomi's chat-completions TTS contract with the text as
+an `assistant` message and optional style guidance as a `user` message.
+
+| Property | Value                                    |
+| -------- | ---------------------------------------- |
+| TTS id   | `xiaomi` (`mimo` alias)                  |
+| Auth     | `XIAOMI_API_KEY`                         |
+| API      | `POST /v1/chat/completions` with `audio` |
+| Default  | `mimo-v2.5-tts`, voice `mimo_default`    |
+| Output   | MP3 by default; WAV when configured      |
+
+```json5
+{
+  messages: {
+    tts: {
+      auto: "always",
+      provider: "xiaomi",
+      providers: {
+        xiaomi: {
+          apiKey: "xiaomi_api_key",
+          model: "mimo-v2.5-tts",
+          voice: "mimo_default",
+          format: "mp3",
+          style: "Bright, natural, conversational tone.",
+        },
+      },
+    },
+  },
+}
+```
+
+Supported built-in voices include `mimo_default`, `default_zh`, `default_en`,
+`Mia`, `Chloe`, `Milo`, and `Dean`. `mimo-v2-tts` is supported for older MiMo
+TTS accounts; the default uses the current MiMo-V2.5 TTS model. For voice-note
+targets such as Feishu and Telegram, OpenClaw transcodes Xiaomi output to 48kHz
+Opus with `ffmpeg` before delivery.
+
 ## Config example

 ```json5

docs/tools/tts.md

Lines changed: 45 additions & 7 deletions
@@ -7,7 +7,7 @@ read_when:
 title: "Text-to-speech"
 ---

-OpenClaw can convert outbound replies into audio using ElevenLabs, Google Gemini, Gradium, Microsoft, MiniMax, OpenAI, Vydra, or xAI.
+OpenClaw can convert outbound replies into audio using ElevenLabs, Google Gemini, Gradium, Microsoft, MiniMax, OpenAI, Vydra, xAI, or Xiaomi MiMo.
 It works anywhere OpenClaw can send audio.

 ## Supported services
@@ -20,6 +20,7 @@ It works anywhere OpenClaw can send audio.
 - **OpenAI** (primary or fallback provider; also used for summaries)
 - **Vydra** (primary or fallback provider; shared image, video, and speech provider)
 - **xAI** (primary or fallback provider; uses the xAI TTS API)
+- **Xiaomi MiMo** (primary or fallback provider; uses MiMo TTS through Xiaomi chat completions)

 ### Microsoft speech notes

@@ -36,7 +37,7 @@ or ElevenLabs.

 ## Optional keys

-If you want OpenAI, ElevenLabs, Google Gemini, Gradium, MiniMax, Vydra, or xAI:
+If you want OpenAI, ElevenLabs, Google Gemini, Gradium, MiniMax, Vydra, xAI, or Xiaomi MiMo:

 - `ELEVENLABS_API_KEY` (or `XI_API_KEY`)
 - `GEMINI_API_KEY` (or `GOOGLE_API_KEY`)
@@ -45,6 +46,7 @@ If you want OpenAI, ElevenLabs, Google Gemini, Gradium, MiniMax, Vydra, or xAI:
 - `OPENAI_API_KEY`
 - `VYDRA_API_KEY`
 - `XAI_API_KEY`
+- `XIAOMI_API_KEY`

 Microsoft speech does **not** require an API key.

@@ -60,6 +62,7 @@ so that provider must also be authenticated if you enable summaries.
 - [ElevenLabs Authentication](https://elevenlabs.io/docs/api-reference/authentication)
 - [Gradium](/providers/gradium)
 - [MiniMax T2A v2 API](https://platform.minimaxi.com/document/T2A%20V2)
+- [Xiaomi MiMo speech synthesis](/providers/xiaomi#text-to-speech)
 - [node-edge-tts](https://github.com/SchneeHertz/node-edge-tts)
 - [Microsoft Speech output formats](https://learn.microsoft.com/azure/ai-services/speech-service/rest-text-to-speech#audio-outputs)
 - [xAI Text to Speech](https://docs.x.ai/developers/rest-api-reference/inference/voice#text-to-speech-rest)
@@ -231,6 +234,34 @@ Resolution order is `messages.tts.providers.xai.apiKey` -> `XAI_API_KEY`.
 Current live voices are `ara`, `eve`, `leo`, `rex`, `sal`, and `una`; `eve` is
 the default. `language` accepts a BCP-47 tag or `auto`.

+### Xiaomi MiMo primary
+
+```json5
+{
+  messages: {
+    tts: {
+      auto: "always",
+      provider: "xiaomi",
+      providers: {
+        xiaomi: {
+          apiKey: "xiaomi_api_key",
+          baseUrl: "https://api.xiaomimimo.com/v1",
+          model: "mimo-v2.5-tts",
+          voice: "mimo_default",
+          format: "mp3",
+          style: "Bright, natural, conversational tone.",
+        },
+      },
+    },
+  },
+}
+```
+
+Xiaomi MiMo TTS uses the same `XIAOMI_API_KEY` path as the bundled Xiaomi model
+provider. The speech provider id is `xiaomi`; `mimo` is accepted as an alias.
+The target text is sent as the assistant message, matching Xiaomi's TTS
+contract. Optional `style` is sent as a user instruction and is not spoken.
+
 ### OpenRouter primary

 ```json5
@@ -345,7 +376,7 @@ Then run:
 - `tagged` only sends audio when the reply includes `[[tts:key=value]]` directives or a `[[tts:text]]...[[/tts:text]]` block.
 - `enabled`: legacy toggle (doctor migrates this to `auto`).
 - `mode`: `"final"` (default) or `"all"` (includes tool/block replies).
-- `provider`: speech provider id such as `"elevenlabs"`, `"google"`, `"gradium"`, `"microsoft"`, `"minimax"`, `"openai"`, `"vydra"`, or `"xai"` (fallback is automatic).
+- `provider`: speech provider id such as `"elevenlabs"`, `"google"`, `"gradium"`, `"microsoft"`, `"minimax"`, `"openai"`, `"vydra"`, `"xai"`, or `"xiaomi"` (fallback is automatic).
 - If `provider` is **unset**, OpenClaw uses the first configured speech provider in registry auto-select order.
 - Legacy `provider: "edge"` config is repaired by `openclaw doctor --fix` and
 rewritten to `provider: "microsoft"`.
@@ -359,7 +390,7 @@ Then run:
 - `maxTextLength`: hard cap for TTS input (chars). `/tts audio` fails if exceeded.
 - `timeoutMs`: request timeout (ms).
 - `prefsPath`: override the local prefs JSON path (provider/limit/summary).
-- `apiKey` values fall back to env vars (`ELEVENLABS_API_KEY`/`XI_API_KEY`, `GEMINI_API_KEY`/`GOOGLE_API_KEY`, `GRADIUM_API_KEY`, `MINIMAX_API_KEY`, `OPENAI_API_KEY`, `VYDRA_API_KEY`, `XAI_API_KEY`).
+- `apiKey` values fall back to env vars (`ELEVENLABS_API_KEY`/`XI_API_KEY`, `GEMINI_API_KEY`/`GOOGLE_API_KEY`, `GRADIUM_API_KEY`, `MINIMAX_API_KEY`, `OPENAI_API_KEY`, `VYDRA_API_KEY`, `XAI_API_KEY`, `XIAOMI_API_KEY`).
 - `providers.elevenlabs.baseUrl`: override ElevenLabs API base URL.
 - `providers.openai.baseUrl`: override the OpenAI TTS endpoint.
 - Resolution order: `messages.tts.providers.openai.baseUrl` -> `OPENAI_TTS_BASE_URL` -> `https://api.openai.com/v1`
@@ -391,6 +422,12 @@ Then run:
 - `providers.xai.language`: BCP-47 language code or `auto` (default `en`).
 - `providers.xai.responseFormat`: `mp3`, `wav`, `pcm`, `mulaw`, or `alaw` (default `mp3`).
 - `providers.xai.speed`: provider-native speed override.
+- `providers.xiaomi.apiKey`: Xiaomi MiMo API key (env: `XIAOMI_API_KEY`).
+- `providers.xiaomi.baseUrl`: override the Xiaomi MiMo API base URL (default `https://api.xiaomimimo.com/v1`, env: `XIAOMI_BASE_URL`).
+- `providers.xiaomi.model`: TTS model (default `mimo-v2.5-tts`, env: `XIAOMI_TTS_MODEL`; `mimo-v2-tts` is also supported).
+- `providers.xiaomi.voice`: MiMo voice id (default `mimo_default`, env: `XIAOMI_TTS_VOICE`).
+- `providers.xiaomi.format`: `mp3` or `wav` (default `mp3`, env: `XIAOMI_TTS_FORMAT`).
+- `providers.xiaomi.style`: optional natural-language style instruction sent as the user message; it is not spoken.
 - `providers.openrouter.apiKey`: OpenRouter API key (env: `OPENROUTER_API_KEY`; can reuse `models.providers.openrouter.apiKey`).
 - `providers.openrouter.baseUrl`: override the OpenRouter TTS base URL (default `https://openrouter.ai/api/v1`; legacy `https://openrouter.ai/v1` is normalized).
 - `providers.openrouter.model`: OpenRouter TTS model id (default `hexgrad/kokoro-82m`; `modelId` is also accepted).
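
Reviewer sketch (not part of the commit): the docs in this diff say the target text travels as the `assistant` message and the optional `style` as a `user` message on `POST /v1/chat/completions` with an `audio` field. The snippet below is a minimal illustration of how such a request body could be assembled. The function name, the types, and the exact shape of the `audio` object (`format` plus `voice`) are assumptions made for illustration; the contents of `speech-provider.ts` are not shown on this page.

```typescript
// Hypothetical types; the real provider internals are not visible in this diff.
type ChatMessage = { role: "user" | "assistant"; content: string };

type MimoTtsOptions = {
  text: string;
  model?: string;
  voice?: string;
  format?: "mp3" | "wav";
  style?: string;
};

type MimoTtsRequestBody = {
  model: string;
  messages: ChatMessage[];
  // Assumption: the `audio` field carries format and voice.
  audio: { format: "mp3" | "wav"; voice: string };
};

function buildMimoTtsRequestBody(options: MimoTtsOptions): MimoTtsRequestBody {
  const messages: ChatMessage[] = [];

  // Style guidance is an instruction, not spoken text, so it goes in as a user message.
  const style = options.style?.trim();
  if (style) {
    messages.push({ role: "user", content: style });
  }

  // The text to be spoken is sent as the assistant message.
  messages.push({ role: "assistant", content: options.text });

  return {
    model: options.model ?? "mimo-v2.5-tts", // documented default model
    messages,
    audio: {
      format: options.format ?? "mp3", // documented default format
      voice: options.voice ?? "mimo_default", // documented default voice
    },
  };
}

const exampleBody = buildMimoTtsRequestBody({
  text: "Hello from OpenClaw.",
  style: "Bright, natural, conversational tone.",
});
console.log(JSON.stringify(exampleBody, null, 2));
```

Keeping the body builder as a pure function makes the message ordering easy to unit test without a live API key.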
@@ -432,9 +469,9 @@ Here you go.

 Available directive keys (when enabled):

-- `provider` (registered speech provider id, for example `openai`, `elevenlabs`, `google`, `gradium`, `minimax`, `microsoft`, `vydra`, or `xai`; requires `allowProvider: true`)
-- `voice` (OpenAI or Gradium voice), `voiceName` / `voice_name` / `google_voice` (Google voice), or `voiceId` (ElevenLabs / Gradium / MiniMax / xAI)
-- `model` (OpenAI TTS model, ElevenLabs model id, or MiniMax model) or `google_model` (Google TTS model)
+- `provider` (registered speech provider id, for example `openai`, `elevenlabs`, `google`, `gradium`, `minimax`, `microsoft`, `vydra`, `xai`, or `xiaomi`; requires `allowProvider: true`)
+- `voice` (OpenAI, Gradium, or Xiaomi voice), `voiceName` / `voice_name` / `google_voice` (Google voice), or `voiceId` (ElevenLabs / Gradium / MiniMax / xAI)
+- `model` (OpenAI TTS model, ElevenLabs model id, MiniMax model, or Xiaomi MiMo TTS model) or `google_model` (Google TTS model)
 - `stability`, `similarityBoost`, `style`, `speed`, `useSpeakerBoost`
 - `vol` / `volume` (MiniMax volume, 0-10)
 - `pitch` (MiniMax integer pitch, -12 to 12; fractional values are truncated before the MiniMax request)
@@ -498,6 +535,7 @@ These override `messages.tts.*` for that host.
 - **Other channels**: MP3 (`mp3_44100_128` from ElevenLabs, `mp3` from OpenAI).
 - 44.1kHz / 128kbps is the default balance for speech clarity.
 - **MiniMax**: MP3 (`speech-2.8-hd` model, 32kHz sample rate) for normal audio attachments. For voice-note targets such as Feishu and Telegram, OpenClaw transcodes the MiniMax MP3 to 48kHz Opus with `ffmpeg` before delivery.
+- **Xiaomi MiMo**: MP3 by default, or WAV when configured. For voice-note targets such as Feishu and Telegram, OpenClaw transcodes Xiaomi output to 48kHz Opus with `ffmpeg` before delivery.
 - **Google Gemini**: Gemini API TTS returns raw 24kHz PCM. OpenClaw wraps it as WAV for audio attachments and returns PCM directly for Talk/telephony. Native Opus voice-note format is not supported by this path.
 - **Gradium**: WAV for audio attachments, Opus for voice-note targets, and `ulaw_8000` at 8 kHz for telephony.
 - **xAI**: MP3 by default; `responseFormat` may be `mp3`, `wav`, `pcm`, `mulaw`, or `alaw`. OpenClaw uses xAI's batch REST TTS endpoint and returns a complete audio attachment; xAI's streaming TTS WebSocket is not used by this provider path. Native Opus voice-note format is not supported by this path.
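
Reviewer sketch (not part of the commit): the output-format notes say Xiaomi MP3 or WAV output is transcoded to 48kHz Opus with `ffmpeg` for voice-note targets. One plausible argument list for that step is shown below. The helper name is hypothetical and the mono downmix is an assumption; the flags themselves (`-i`, `-vn`, `-c:a libopus`, `-ar`, `-ac`) are standard `ffmpeg` options, but the commit's actual invocation is not visible on this page.

```typescript
// Hypothetical helper: builds ffmpeg arguments for an Opus voice note.
function buildOpusTranscodeArgs(inputPath: string, outputPath: string): string[] {
  return [
    "-y", // overwrite the output file if it already exists
    "-i", inputPath, // source audio (MP3 or WAV from the provider)
    "-vn", // ignore any video or cover-art stream
    "-c:a", "libopus", // encode audio with the Opus codec
    "-ar", "48000", // 48kHz sample rate, as stated in the docs
    "-ac", "1", // assumption: mono is enough for speech
    outputPath, // for example a path ending in ".opus"
  ];
}

// The argument list can be handed to a process spawner of your choice.
console.log(["ffmpeg", ...buildOpusTranscodeArgs("speech.mp3", "speech.opus")].join(" "));
```

Building the arguments separately from the process call keeps the transcoding policy testable on machines without `ffmpeg` installed.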

extensions/google/google.live.test.ts

Lines changed: 2 additions & 2 deletions
@@ -43,7 +43,7 @@ describeLive("google plugin live", () => {
     const speechProvider = requireRegisteredProvider(speechProviders, "google");
     const mediaProvider = requireRegisteredProvider(mediaProviders, "google");

-    const phrase = "Testing Google audio transcription with OpenClaw.";
+    const phrase = "Testing Google audio transcription with pineapple.";
     const audioFile = await speechProvider.synthesize({
       text: phrase,
       cfg: { plugins: { enabled: true } } as never,
@@ -62,7 +62,7 @@ describeLive("google plugin live", () => {

     const normalized = normalizeTranscriptForMatch(transcript?.text ?? "");
     expect(normalized).toContain("google");
-    expect(normalized).toContain("openclaw");
+    expect(normalized).toContain("pineapple");
   }, 180_000);

   it("runs Gemini web search through the registered provider tool", async () => {

extensions/minimax/minimax.live.test.ts

Lines changed: 53 additions & 1 deletion
@@ -1,14 +1,30 @@
 import { describe, expect, it } from "vitest";
 import { isLiveTestEnabled } from "../../src/agents/live-test-helpers.js";
+import {
+  registerProviderPlugin,
+  requireRegisteredProvider,
+} from "../../test/helpers/plugins/provider-registration.js";
+import plugin from "./index.js";
+import { buildMinimaxSpeechProvider } from "./speech-provider.js";
 import { createMiniMaxWebSearchProvider } from "./src/minimax-web-search-provider.js";

+const MINIMAX_API_KEY = process.env.MINIMAX_API_KEY?.trim() ?? "";
 const MINIMAX_SEARCH_KEY =
   process.env.MINIMAX_CODE_PLAN_KEY?.trim() ||
   process.env.MINIMAX_CODING_API_KEY?.trim() ||
-  process.env.MINIMAX_API_KEY?.trim() ||
+  MINIMAX_API_KEY ||
   "";
 const describeLive =
   isLiveTestEnabled() && MINIMAX_SEARCH_KEY.length > 0 ? describe : describe.skip;
+const describeTtsLive =
+  isLiveTestEnabled() && MINIMAX_API_KEY.length > 0 ? describe : describe.skip;
+
+const registerMinimaxPlugin = () =>
+  registerProviderPlugin({
+    plugin,
+    id: "minimax",
+    name: "MiniMax Provider",
+  });

 describeLive("minimax plugin live", () => {
   it("runs MiniMax web search through the provider tool", async () => {
@@ -25,3 +41,39 @@ describeLive("minimax plugin live", () => {
     expect(Array.isArray(result?.results)).toBe(true);
   }, 120_000);
 });
+
+describeTtsLive("minimax tts live", () => {
+  it("synthesizes TTS through the registered speech provider", async () => {
+    const { speechProviders } = await registerMinimaxPlugin();
+    const provider = requireRegisteredProvider(speechProviders, "minimax");
+
+    const audioFile = await provider.synthesize({
+      text: "OpenClaw MiniMax text to speech integration test OK.",
+      cfg: { plugins: { enabled: true } } as never,
+      providerConfig: { apiKey: MINIMAX_API_KEY },
+      target: "audio-file",
+      timeoutMs: 90_000,
+    });
+
+    expect(audioFile.outputFormat).toBe("mp3");
+    expect(audioFile.fileExtension).toBe(".mp3");
+    expect(audioFile.audioBuffer.byteLength).toBeGreaterThan(512);
+  }, 120_000);
+
+  it("synthesizes MiniMax TTS as an Opus voice note", async () => {
+    const provider = buildMinimaxSpeechProvider();
+
+    const voiceNote = await provider.synthesize({
+      text: "OpenClaw MiniMax voice note test OK.",
+      cfg: { plugins: { enabled: true } } as never,
+      providerConfig: { apiKey: MINIMAX_API_KEY },
+      target: "voice-note",
+      timeoutMs: 90_000,
+    });
+
+    expect(voiceNote.outputFormat).toBe("opus");
+    expect(voiceNote.fileExtension).toBe(".opus");
+    expect(voiceNote.voiceCompatible).toBe(true);
+    expect(voiceNote.audioBuffer.byteLength).toBeGreaterThan(512);
+  }, 120_000);
+});
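
Reviewer sketch (not part of the commit): the change above splits key resolution so that the web-search suite can run with a coding-plan key while the TTS suite only runs when `MINIMAX_API_KEY` itself is set. The gating logic can be expressed as a pure function, which makes the two conditions easier to see. The helper names and the `Env` type are hypothetical; the environment variable names come from the diff.

```typescript
// Hypothetical stand-in for process.env so the logic stays pure and testable.
type Env = Record<string, string | undefined>;

// Returns the first value that is non-empty after trimming, or "".
function firstNonEmpty(...values: Array<string | undefined>): string {
  for (const value of values) {
    const trimmed = value?.trim();
    if (trimmed) {
      return trimmed;
    }
  }
  return "";
}

function resolveMinimaxLiveSuites(env: Env, liveEnabled: boolean) {
  // TTS needs the plain API key.
  const apiKey = firstNonEmpty(env["MINIMAX_API_KEY"]);
  // Web search prefers the coding-plan keys and falls back to the plain API key.
  const searchKey = firstNonEmpty(
    env["MINIMAX_CODE_PLAN_KEY"],
    env["MINIMAX_CODING_API_KEY"],
    apiKey,
  );
  return {
    apiKey,
    searchKey,
    runSearch: liveEnabled && searchKey.length > 0,
    runTts: liveEnabled && apiKey.length > 0,
  };
}

// With only a coding-plan key set, search runs but TTS is skipped.
console.log(resolveMinimaxLiveSuites({ MINIMAX_CODE_PLAN_KEY: "plan-key" }, true));
```

In a Vitest file the two booleans would map onto `describe` versus `describe.skip`, as the diff does with `describeLive` and `describeTtsLive`.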

extensions/xiaomi/index.ts

Lines changed: 4 additions & 0 deletions
@@ -2,6 +2,7 @@ import { defineSingleProviderPluginEntry } from "openclaw/plugin-sdk/provider-en
 import { PROVIDER_LABELS } from "openclaw/plugin-sdk/provider-usage";
 import { applyXiaomiConfig, XIAOMI_DEFAULT_MODEL_REF } from "./onboard.js";
 import { buildXiaomiProvider } from "./provider-catalog.js";
+import { buildXiaomiSpeechProvider } from "./speech-provider.js";

 const PROVIDER_ID = "xiaomi";

@@ -40,4 +41,7 @@ export default defineSingleProviderPluginEntry({
       windows: [],
     }),
   },
+  register(api) {
+    api.registerSpeechProvider(buildXiaomiSpeechProvider());
+  },
 });
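
Reviewer sketch (not part of the commit): the plugin entry registers one speech provider, and the docs state that its id is `xiaomi` with `mimo` accepted as an alias. A minimal way to picture id-plus-alias lookup is a map keyed by every accepted name. Everything below (the `SpeechProviderInfo` type, the `SpeechProviderRegistry` class, and its methods) is hypothetical and only illustrates the idea; it is not OpenClaw's plugin SDK.

```typescript
// Hypothetical provider descriptor; the SDK's real interface is not shown in this diff.
type SpeechProviderInfo = {
  id: string;
  aliases?: string[];
};

// Hypothetical registry that resolves a provider by its id or by any alias.
class SpeechProviderRegistry {
  private readonly byName = new Map<string, SpeechProviderInfo>();

  register(provider: SpeechProviderInfo): void {
    for (const name of [provider.id, ...(provider.aliases ?? [])]) {
      this.byName.set(name, provider);
    }
  }

  resolve(name: string): SpeechProviderInfo | undefined {
    return this.byName.get(name.trim());
  }
}

const registry = new SpeechProviderRegistry();
registry.register({ id: "xiaomi", aliases: ["mimo"] });

// Both names resolve to the same provider entry.
console.log(registry.resolve("mimo")?.id);
```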

extensions/xiaomi/openclaw.plugin.json

Lines changed: 3 additions & 0 deletions
@@ -2,6 +2,9 @@
   "id": "xiaomi",
   "enabledByDefault": true,
   "providers": ["xiaomi"],
+  "contracts": {
+    "speechProviders": ["xiaomi"]
+  },
   "providerAuthEnvVars": {
     "xiaomi": ["XIAOMI_API_KEY"]
   },
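
Reviewer sketch (not part of the commit): the manifest declares `XIAOMI_API_KEY` as the provider's auth env var, and the TTS docs list an env var and a default for most `providers.xiaomi.*` options. The snippet below shows one way those could be resolved, assuming the order is explicit config first, then environment, then the documented default. That order is an assumption modeled on the resolution order the docs state for other providers; the function and type names are hypothetical.

```typescript
// Hypothetical stand-in for process.env so the logic stays pure and testable.
type Env = Record<string, string | undefined>;

// Mirrors the documented `messages.tts.providers.xiaomi` options (style omitted: it has no env var).
type XiaomiTtsConfig = {
  apiKey?: string;
  baseUrl?: string;
  model?: string;
  voice?: string;
  format?: string;
};

// Assumed precedence: config value, then env value, then fallback.
function pick(
  configValue: string | undefined,
  envValue: string | undefined,
  fallback = "",
): string {
  return configValue?.trim() || envValue?.trim() || fallback;
}

function resolveXiaomiTtsSettings(config: XiaomiTtsConfig, env: Env) {
  return {
    apiKey: pick(config.apiKey, env["XIAOMI_API_KEY"]),
    baseUrl: pick(config.baseUrl, env["XIAOMI_BASE_URL"], "https://api.xiaomimimo.com/v1"),
    model: pick(config.model, env["XIAOMI_TTS_MODEL"], "mimo-v2.5-tts"),
    voice: pick(config.voice, env["XIAOMI_TTS_VOICE"], "mimo_default"),
    format: pick(config.format, env["XIAOMI_TTS_FORMAT"], "mp3"),
  };
}

// Config overrides env for voice; env supplies the key; defaults fill the rest.
console.log(
  resolveXiaomiTtsSettings({ voice: "Mia" }, { XIAOMI_API_KEY: "key", XIAOMI_TTS_VOICE: "Dean" }),
);
```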
