feat(tts): add xiaomi mimo speech provider

steipete · steipete · commit ec8dbc459558 · 2026-04-25T09:48:05.000+01:00
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -11,6 +11,7 @@ Docs: https://docs.openclaw.ai
 - Diagnostics/OTEL: support `OPENCLAW_OTEL_PRELOADED=1` so the plugin can reuse an already-registered OpenTelemetry SDK while keeping OpenClaw diagnostic listeners wired. (#71450) Thanks @vincentkoc and @jlapenna.
 - Control UI: refine the agent Tool Access panel with compact live-tool chips, collapsible tool groups, direct per-tool toggles, and clearer runtime/source provenance. (#71405) Thanks @BunsDev.
 - Memory-core/hybrid search: expose raw `vectorScore` and `textScore` alongside the combined `score` on hybrid memory search results, so callers can inspect vector-versus-text retrieval contribution before temporal decay or MMR reordering. Fixes #68166. (#68286) Thanks @ajfonthemove.
+- Providers/Xiaomi: add MiMo TTS as a bundled speech provider with MP3/WAV output and voice-note Opus transcoding. Fixes #52376. (#55614) Thanks @zoujiejun.
 
 ### Fixes
 
diff --git a/docs/providers/xiaomi.md b/docs/providers/xiaomi.md
@@ -53,6 +53,46 @@ OpenAI-compatible endpoint with API-key authentication.
 The default model ref is `xiaomi/mimo-v2-flash`. The provider is injected automatically when `XIAOMI_API_KEY` is set or an auth profile exists.
 </Tip>
 
+## Text-to-speech
+
+The bundled `xiaomi` plugin also registers Xiaomi MiMo as a speech provider for
+`messages.tts`. It follows Xiaomi's chat-completions TTS contract, sending the
+text as an `assistant` message and optional style guidance as a `user` message.
+
+| Property | Value                                    |
+| -------- | ---------------------------------------- |
+| TTS id   | `xiaomi` (`mimo` alias)                  |
+| Auth     | `XIAOMI_API_KEY`                         |
+| API      | `POST /v1/chat/completions` with `audio` |
+| Default  | `mimo-v2.5-tts`, voice `mimo_default`    |
+| Output   | MP3 by default; WAV when configured      |
+
+```json5
+{
+  messages: {
+    tts: {
+      auto: "always",
+      provider: "xiaomi",
+      providers: {
+        xiaomi: {
+          apiKey: "xiaomi_api_key",
+          model: "mimo-v2.5-tts",
+          voice: "mimo_default",
+          format: "mp3",
+          style: "Bright, natural, conversational tone.",
+        },
+      },
+    },
+  },
+}
+```
+
+Supported built-in voices include `mimo_default`, `default_zh`, `default_en`,
+`Mia`, `Chloe`, `Milo`, and `Dean`. `mimo-v2-tts` is supported for older MiMo
+TTS accounts; the default uses the current MiMo-V2.5 TTS model. For voice-note
+targets such as Feishu and Telegram, OpenClaw transcodes Xiaomi output to 48kHz
+Opus with `ffmpeg` before delivery.
+
 ## Config example
 
 ```json5
diff --git a/docs/tools/tts.md b/docs/tools/tts.md
@@ -7,7 +7,7 @@ read_when:
 title: "Text-to-speech"
 ---
 
-OpenClaw can convert outbound replies into audio using ElevenLabs, Google Gemini, Gradium, Microsoft, MiniMax, OpenAI, Vydra, or xAI.
+OpenClaw can convert outbound replies into audio using ElevenLabs, Google Gemini, Gradium, Microsoft, MiniMax, OpenAI, Vydra, xAI, or Xiaomi MiMo.
 It works anywhere OpenClaw can send audio.
 
 ## Supported services
@@ -20,6 +20,7 @@ It works anywhere OpenClaw can send audio.
 - **OpenAI** (primary or fallback provider; also used for summaries)
 - **Vydra** (primary or fallback provider; shared image, video, and speech provider)
 - **xAI** (primary or fallback provider; uses the xAI TTS API)
+- **Xiaomi MiMo** (primary or fallback provider; uses MiMo TTS through Xiaomi chat completions)
 
 ### Microsoft speech notes
 
@@ -36,7 +37,7 @@ or ElevenLabs.
 
 ## Optional keys
 
-If you want OpenAI, ElevenLabs, Google Gemini, Gradium, MiniMax, Vydra, or xAI:
+If you want OpenAI, ElevenLabs, Google Gemini, Gradium, MiniMax, Vydra, xAI, or Xiaomi MiMo:
 
 - `ELEVENLABS_API_KEY` (or `XI_API_KEY`)
 - `GEMINI_API_KEY` (or `GOOGLE_API_KEY`)
@@ -45,6 +46,7 @@ If you want OpenAI, ElevenLabs, Google Gemini, Gradium, MiniMax, Vydra, or xAI:
 - `OPENAI_API_KEY`
 - `VYDRA_API_KEY`
 - `XAI_API_KEY`
+- `XIAOMI_API_KEY`
 
 Microsoft speech does **not** require an API key.
 
@@ -60,6 +62,7 @@ so that provider must also be authenticated if you enable summaries.
 - [ElevenLabs Authentication](https://elevenlabs.io/docs/api-reference/authentication)
 - [Gradium](/providers/gradium)
 - [MiniMax T2A v2 API](https://platform.minimaxi.com/document/T2A%20V2)
+- [Xiaomi MiMo speech synthesis](/providers/xiaomi#text-to-speech)
 - [node-edge-tts](https://github.com/SchneeHertz/node-edge-tts)
 - [Microsoft Speech output formats](https://learn.microsoft.com/azure/ai-services/speech-service/rest-text-to-speech#audio-outputs)
 - [xAI Text to Speech](https://docs.x.ai/developers/rest-api-reference/inference/voice#text-to-speech-rest)
@@ -231,6 +234,34 @@ Resolution order is `messages.tts.providers.xai.apiKey` -> `XAI_API_KEY`.
 Current live voices are `ara`, `eve`, `leo`, `rex`, `sal`, and `una`; `eve` is
 the default. `language` accepts a BCP-47 tag or `auto`.
 
+### Xiaomi MiMo primary
+
+```json5
+{
+  messages: {
+    tts: {
+      auto: "always",
+      provider: "xiaomi",
+      providers: {
+        xiaomi: {
+          apiKey: "xiaomi_api_key",
+          baseUrl: "https://api.xiaomimimo.com/v1",
+          model: "mimo-v2.5-tts",
+          voice: "mimo_default",
+          format: "mp3",
+          style: "Bright, natural, conversational tone.",
+        },
+      },
+    },
+  },
+}
+```
+
+Xiaomi MiMo TTS uses the same `XIAOMI_API_KEY` path as the bundled Xiaomi model
+provider. The speech provider id is `xiaomi`; `mimo` is accepted as an alias.
+The target text is sent as the assistant message, matching Xiaomi's TTS
+contract. Optional `style` is sent as a user instruction and is not spoken.
+
 ### OpenRouter primary
 
 ```json5
@@ -345,7 +376,7 @@ Then run:
   - `tagged` only sends audio when the reply includes `[[tts:key=value]]` directives or a `[[tts:text]]...[[/tts:text]]` block.
 - `enabled`: legacy toggle (doctor migrates this to `auto`).
 - `mode`: `"final"` (default) or `"all"` (includes tool/block replies).
-- `provider`: speech provider id such as `"elevenlabs"`, `"google"`, `"gradium"`, `"microsoft"`, `"minimax"`, `"openai"`, `"vydra"`, or `"xai"` (fallback is automatic).
+- `provider`: speech provider id such as `"elevenlabs"`, `"google"`, `"gradium"`, `"microsoft"`, `"minimax"`, `"openai"`, `"vydra"`, `"xai"`, or `"xiaomi"` (fallback is automatic).
 - If `provider` is **unset**, OpenClaw uses the first configured speech provider in registry auto-select order.
 - Legacy `provider: "edge"` config is repaired by `openclaw doctor --fix` and
   rewritten to `provider: "microsoft"`.
@@ -359,7 +390,7 @@ Then run:
 - `maxTextLength`: hard cap for TTS input (chars). `/tts audio` fails if exceeded.
 - `timeoutMs`: request timeout (ms).
 - `prefsPath`: override the local prefs JSON path (provider/limit/summary).
-- `apiKey` values fall back to env vars (`ELEVENLABS_API_KEY`/`XI_API_KEY`, `GEMINI_API_KEY`/`GOOGLE_API_KEY`, `GRADIUM_API_KEY`, `MINIMAX_API_KEY`, `OPENAI_API_KEY`, `VYDRA_API_KEY`, `XAI_API_KEY`).
+- `apiKey` values fall back to env vars (`ELEVENLABS_API_KEY`/`XI_API_KEY`, `GEMINI_API_KEY`/`GOOGLE_API_KEY`, `GRADIUM_API_KEY`, `MINIMAX_API_KEY`, `OPENAI_API_KEY`, `VYDRA_API_KEY`, `XAI_API_KEY`, `XIAOMI_API_KEY`).
 - `providers.elevenlabs.baseUrl`: override ElevenLabs API base URL.
 - `providers.openai.baseUrl`: override the OpenAI TTS endpoint.
   - Resolution order: `messages.tts.providers.openai.baseUrl` -> `OPENAI_TTS_BASE_URL` -> `https://api.openai.com/v1`
@@ -391,6 +422,12 @@ Then run:
 - `providers.xai.language`: BCP-47 language code or `auto` (default `en`).
 - `providers.xai.responseFormat`: `mp3`, `wav`, `pcm`, `mulaw`, or `alaw` (default `mp3`).
 - `providers.xai.speed`: provider-native speed override.
+- `providers.xiaomi.apiKey`: Xiaomi MiMo API key (env: `XIAOMI_API_KEY`).
+- `providers.xiaomi.baseUrl`: override the Xiaomi MiMo API base URL (default `https://api.xiaomimimo.com/v1`, env: `XIAOMI_BASE_URL`).
+- `providers.xiaomi.model`: TTS model (default `mimo-v2.5-tts`, env: `XIAOMI_TTS_MODEL`; `mimo-v2-tts` is also supported).
+- `providers.xiaomi.voice`: MiMo voice id (default `mimo_default`, env: `XIAOMI_TTS_VOICE`).
+- `providers.xiaomi.format`: `mp3` or `wav` (default `mp3`, env: `XIAOMI_TTS_FORMAT`).
+- `providers.xiaomi.style`: optional natural-language style instruction sent as the user message; it is not spoken.
 - `providers.openrouter.apiKey`: OpenRouter API key (env: `OPENROUTER_API_KEY`; can reuse `models.providers.openrouter.apiKey`).
 - `providers.openrouter.baseUrl`: override the OpenRouter TTS base URL (default `https://openrouter.ai/api/v1`; legacy `https://openrouter.ai/v1` is normalized).
 - `providers.openrouter.model`: OpenRouter TTS model id (default `hexgrad/kokoro-82m`; `modelId` is also accepted).
@@ -432,9 +469,9 @@ Here you go.
 
 Available directive keys (when enabled):
 
-- `provider` (registered speech provider id, for example `openai`, `elevenlabs`, `google`, `gradium`, `minimax`, `microsoft`, `vydra`, or `xai`; requires `allowProvider: true`)
-- `voice` (OpenAI or Gradium voice), `voiceName` / `voice_name` / `google_voice` (Google voice), or `voiceId` (ElevenLabs / Gradium / MiniMax / xAI)
-- `model` (OpenAI TTS model, ElevenLabs model id, or MiniMax model) or `google_model` (Google TTS model)
+- `provider` (registered speech provider id, for example `openai`, `elevenlabs`, `google`, `gradium`, `minimax`, `microsoft`, `vydra`, `xai`, or `xiaomi`; requires `allowProvider: true`)
+- `voice` (OpenAI, Gradium, or Xiaomi voice), `voiceName` / `voice_name` / `google_voice` (Google voice), or `voiceId` (ElevenLabs / Gradium / MiniMax / xAI)
+- `model` (OpenAI TTS model, ElevenLabs model id, MiniMax model, or Xiaomi MiMo TTS model) or `google_model` (Google TTS model)
 - `stability`, `similarityBoost`, `style`, `speed`, `useSpeakerBoost`
 - `vol` / `volume` (MiniMax volume, 0-10)
 - `pitch` (MiniMax integer pitch, -12 to 12; fractional values are truncated before the MiniMax request)
@@ -498,6 +535,7 @@ These override `messages.tts.*` for that host.
 - **Other channels**: MP3 (`mp3_44100_128` from ElevenLabs, `mp3` from OpenAI).
   - 44.1kHz / 128kbps is the default balance for speech clarity.
 - **MiniMax**: MP3 (`speech-2.8-hd` model, 32kHz sample rate) for normal audio attachments. For voice-note targets such as Feishu and Telegram, OpenClaw transcodes the MiniMax MP3 to 48kHz Opus with `ffmpeg` before delivery.
+- **Xiaomi MiMo**: MP3 by default, or WAV when configured. For voice-note targets such as Feishu and Telegram, OpenClaw transcodes Xiaomi output to 48kHz Opus with `ffmpeg` before delivery.
 - **Google Gemini**: Gemini API TTS returns raw 24kHz PCM. OpenClaw wraps it as WAV for audio attachments and returns PCM directly for Talk/telephony. Native Opus voice-note format is not supported by this path.
 - **Gradium**: WAV for audio attachments, Opus for voice-note targets, and `ulaw_8000` at 8 kHz for telephony.
 - **xAI**: MP3 by default; `responseFormat` may be `mp3`, `wav`, `pcm`, `mulaw`, or `alaw`. OpenClaw uses xAI's batch REST TTS endpoint and returns a complete audio attachment; xAI's streaming TTS WebSocket is not used by this provider path. Native Opus voice-note format is not supported by this path.
diff --git a/extensions/google/google.live.test.ts b/extensions/google/google.live.test.ts
@@ -43,7 +43,7 @@ describeLive("google plugin live", () => {
     const speechProvider = requireRegisteredProvider(speechProviders, "google");
     const mediaProvider = requireRegisteredProvider(mediaProviders, "google");
 
-    const phrase = "Testing Google audio transcription with OpenClaw.";
+    const phrase = "Testing Google audio transcription with pineapple.";
     const audioFile = await speechProvider.synthesize({
       text: phrase,
       cfg: { plugins: { enabled: true } } as never,
@@ -62,7 +62,7 @@ describeLive("google plugin live", () => {
 
     const normalized = normalizeTranscriptForMatch(transcript?.text ?? "");
     expect(normalized).toContain("google");
-    expect(normalized).toContain("openclaw");
+    expect(normalized).toContain("pineapple");
   }, 180_000);
 
   it("runs Gemini web search through the registered provider tool", async () => {
diff --git a/extensions/minimax/minimax.live.test.ts b/extensions/minimax/minimax.live.test.ts
@@ -1,14 +1,30 @@
 import { describe, expect, it } from "vitest";
 import { isLiveTestEnabled } from "../../src/agents/live-test-helpers.js";
+import {
+  registerProviderPlugin,
+  requireRegisteredProvider,
+} from "../../test/helpers/plugins/provider-registration.js";
+import plugin from "./index.js";
+import { buildMinimaxSpeechProvider } from "./speech-provider.js";
 import { createMiniMaxWebSearchProvider } from "./src/minimax-web-search-provider.js";
 
+const MINIMAX_API_KEY = process.env.MINIMAX_API_KEY?.trim() ?? "";
 const MINIMAX_SEARCH_KEY =
   process.env.MINIMAX_CODE_PLAN_KEY?.trim() ||
   process.env.MINIMAX_CODING_API_KEY?.trim() ||
-  process.env.MINIMAX_API_KEY?.trim() ||
+  MINIMAX_API_KEY ||
   "";
 const describeLive =
   isLiveTestEnabled() && MINIMAX_SEARCH_KEY.length > 0 ? describe : describe.skip;
+const describeTtsLive =
+  isLiveTestEnabled() && MINIMAX_API_KEY.length > 0 ? describe : describe.skip;
+
+const registerMinimaxPlugin = () =>
+  registerProviderPlugin({
+    plugin,
+    id: "minimax",
+    name: "MiniMax Provider",
+  });
 
 describeLive("minimax plugin live", () => {
   it("runs MiniMax web search through the provider tool", async () => {
@@ -25,3 +41,39 @@ describeLive("minimax plugin live", () => {
     expect(Array.isArray(result?.results)).toBe(true);
   }, 120_000);
 });
+
+describeTtsLive("minimax tts live", () => {
+  it("synthesizes TTS through the registered speech provider", async () => {
+    const { speechProviders } = await registerMinimaxPlugin();
+    const provider = requireRegisteredProvider(speechProviders, "minimax");
+
+    const audioFile = await provider.synthesize({
+      text: "OpenClaw MiniMax text to speech integration test OK.",
+      cfg: { plugins: { enabled: true } } as never,
+      providerConfig: { apiKey: MINIMAX_API_KEY },
+      target: "audio-file",
+      timeoutMs: 90_000,
+    });
+
+    expect(audioFile.outputFormat).toBe("mp3");
+    expect(audioFile.fileExtension).toBe(".mp3");
+    expect(audioFile.audioBuffer.byteLength).toBeGreaterThan(512);
+  }, 120_000);
+
+  it("synthesizes MiniMax TTS as an Opus voice note", async () => {
+    const provider = buildMinimaxSpeechProvider();
+
+    const voiceNote = await provider.synthesize({
+      text: "OpenClaw MiniMax voice note test OK.",
+      cfg: { plugins: { enabled: true } } as never,
+      providerConfig: { apiKey: MINIMAX_API_KEY },
+      target: "voice-note",
+      timeoutMs: 90_000,
+    });
+
+    expect(voiceNote.outputFormat).toBe("opus");
+    expect(voiceNote.fileExtension).toBe(".opus");
+    expect(voiceNote.voiceCompatible).toBe(true);
+    expect(voiceNote.audioBuffer.byteLength).toBeGreaterThan(512);
+  }, 120_000);
+});
diff --git a/extensions/xiaomi/index.ts b/extensions/xiaomi/index.ts
@@ -2,6 +2,7 @@ import { defineSingleProviderPluginEntry } from "openclaw/plugin-sdk/provider-en
 import { PROVIDER_LABELS } from "openclaw/plugin-sdk/provider-usage";
 import { applyXiaomiConfig, XIAOMI_DEFAULT_MODEL_REF } from "./onboard.js";
 import { buildXiaomiProvider } from "./provider-catalog.js";
+import { buildXiaomiSpeechProvider } from "./speech-provider.js";
 
 const PROVIDER_ID = "xiaomi";
 
@@ -40,4 +41,7 @@ export default defineSingleProviderPluginEntry({
       windows: [],
     }),
   },
+  register(api) {
+    api.registerSpeechProvider(buildXiaomiSpeechProvider());
+  },
 });
diff --git a/extensions/xiaomi/openclaw.plugin.json b/extensions/xiaomi/openclaw.plugin.json
@@ -2,6 +2,9 @@
   "id": "xiaomi",
   "enabledByDefault": true,
   "providers": ["xiaomi"],
+  "contracts": {
+    "speechProviders": ["xiaomi"]
+  },
   "providerAuthEnvVars": {
     "xiaomi": ["XIAOMI_API_KEY"]
   },
diff --git a/extensions/xiaomi/speech-provider.test.ts b/extensions/xiaomi/speech-provider.test.ts
diff --git a/extensions/xiaomi/speech-provider.ts b/extensions/xiaomi/speech-provider.ts
diff --git a/extensions/xiaomi/xiaomi.live.test.ts b/extensions/xiaomi/xiaomi.live.test.ts
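
Note: the body of `extensions/xiaomi/speech-provider.ts` is not shown in this diff. As a rough illustration of the request shape the docs above describe (target text as the `assistant` message, optional `style` as a `user` message, `POST /v1/chat/completions` with `audio`), a minimal sketch might look like the following. The function name `buildMimoTtsRequest`, the type names, the message ordering, and the `format`/`voice` field names inside `audio` are assumptions for illustration and are not taken from the commit.

```typescript
// Hypothetical request types. The field names inside `audio` are assumed,
// the docs in this diff only say the request carries an `audio` object.
type MimoRole = "user" | "assistant";

interface MimoMessage {
  role: MimoRole;
  content: string;
}

interface MimoTtsRequest {
  model: string;
  messages: MimoMessage[];
  audio: { format: "mp3" | "wav"; voice: string };
}

interface MimoTtsOptions {
  text: string;
  style?: string;
  model?: string;
  voice?: string;
  format?: "mp3" | "wav";
}

// Builds a chat-completions style payload: optional style guidance goes in a
// `user` message (not spoken), the text to speak goes in the `assistant` message.
function buildMimoTtsRequest(opts: MimoTtsOptions): MimoTtsRequest {
  const messages: MimoMessage[] = [];
  const style = opts.style?.trim();
  if (style) {
    // Assumed ordering: style instruction first, then the text to speak.
    messages.push({ role: "user", content: style });
  }
  messages.push({ role: "assistant", content: opts.text });

  return {
    // Defaults mirror the documented ones: mimo-v2.5-tts, mimo_default, mp3.
    model: opts.model ?? "mimo-v2.5-tts",
    messages,
    audio: {
      format: opts.format ?? "mp3",
      voice: opts.voice ?? "mimo_default",
    },
  };
}
```

For example, `buildMimoTtsRequest({ text: "Hello", style: "Bright, natural, conversational tone." })` yields two messages and the documented defaults, while omitting `style` yields only the `assistant` message. The real provider may structure the payload differently.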