A real-time text-to-speech MCP App with karaoke-style text highlighting, powered by Kyutai's Pocket TTS.
Add to your MCP client configuration (stdio transport):
{
"mcpServers": {
"say": {
"command": "uv",
"args": [
"run",
"--default-index",
"https://pypi.org/simple",
"https://raw.githubusercontent.com/modelcontextprotocol/ext-apps/refs/heads/main/examples/say-server/server.py",
"--stdio"
]
}
}
}

To test local modifications, use this configuration (replace ~/code/ext-apps with your clone path):
{
"mcpServers": {
"say": {
"command": "bash",
"args": [
"-c",
"uv run --index https://pypi.org/simple ~/code/ext-apps/examples/say-server/server.py --stdio"
]
}
}
}

This example showcases several MCP App capabilities:
- Single-file executable: Python server with embedded React UI - no build step required
- Partial tool inputs (`ontoolinputpartial`): The view receives streaming text as it's being generated
- Queue-based streaming: Demonstrates how to stream text out and audio in via a polling tool (adds text to an input queue, retrieves audio chunks from an output queue)
- Model context updates: The view updates the LLM with playback progress ("Playing: ...snippet...")
- Native theming: Uses CSS variables for automatic dark/light mode adaptation
- Fullscreen mode: Toggle fullscreen via the `requestDisplayMode()` API, press Escape to exit
- Multi-view speak lock: Coordinates multiple TTS views via localStorage so only one plays at a time
- Hidden tools (`visibility: ["app"]`): Private tools only accessible to the view, not the model
- External links (`openLink`): Attribution popup uses `app.openLink()` to open external URLs
- CSP metadata: Resource declares required domains (`esm.sh`) for in-browser transpilation
- Streaming TTS: Audio starts playing as text is being generated
- Karaoke highlighting: Words are highlighted in sync with speech
- Interactive controls: Click to pause/resume, double-click to restart
- Low latency: Uses a polling-based queue for minimal delay
- uv - fast Python package manager
The server is a single self-contained Python file that can be run directly from GitHub:
# Run directly from GitHub (uv auto-installs dependencies)
uv run https://raw.githubusercontent.com/modelcontextprotocol/ext-apps/main/examples/say-server/server.py

The server will be available at http://localhost:3109/mcp.
Run directly from GitHub using the official uv Docker image:
docker run --rm -it \
-p 3109:3109 \
-v ~/.cache/huggingface-docker-say-server:/root/.cache/huggingface \
ghcr.io/astral-sh/uv:debian \
uv run https://raw.githubusercontent.com/modelcontextprotocol/ext-apps/main/examples/say-server/server.py

Add to your Claude Desktop config (~/Library/Application Support/Claude/claude_desktop_config.json):
{
"mcpServers": {
"say": {
"command": "uv",
"args": [
"run",
"https://raw.githubusercontent.com/modelcontextprotocol/ext-apps/main/examples/say-server/server.py",
"--stdio"
]
}
}
}

Connect to http://localhost:3109/mcp and call the say tool:
{
"name": "say",
"arguments": {
"text": "Hello, world! This is a streaming TTS demo."
}
}

The default voice is cosette. Use the list_voices tool or pass a voice parameter to say:
- alba, marius, javert, jean - from alba-mackenna (CC BY 4.0)
- cosette, eponine, azelma, fantine - from VCTK dataset (CC BY 4.0)
You can also use HuggingFace URLs or local file paths:
{"text": "Hello!", "voice": "hf://kyutai/tts-voices/voice-donations/alice.wav"}
{"text": "Hello!", "voice": "/path/to/my-voice.wav"}

See the kyutai/tts-voices repository for more voice collections.
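
The three kinds of voice value above (predefined name, `hf://` URL, local path) and the shape of a `say` call can be sketched in Python. This is only an illustration: `classify_voice` and `build_say_request` are hypothetical helpers, not functions from server.py, and the server's actual resolution logic may differ. The message follows the MCP JSON-RPC `tools/call` shape; a real client would also need to complete the MCP initialization handshake before sending it.

```python
import json

# Voice names listed in this README, treated here as the predefined set.
PREDEFINED_VOICES = {
    "alba", "marius", "javert", "jean",
    "cosette", "eponine", "azelma", "fantine",
}


def classify_voice(voice: str) -> str:
    """Return 'predefined', 'huggingface', or 'local' (hypothetical helper)."""
    if voice in PREDEFINED_VOICES:
        return "predefined"
    if voice.startswith("hf://"):
        return "huggingface"
    # Assumption: anything else is interpreted as a local file path.
    return "local"


def build_say_request(text: str, voice: str = "cosette", request_id: int = 1) -> str:
    """Build a JSON-RPC 2.0 `tools/call` message for the `say` tool."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "say", "arguments": {"text": text, "voice": voice}},
    })
```

Calling `build_say_request("Hello!", voice="alba")` returns a JSON string whose `params.arguments` match the `say` examples shown earlier.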
The entire server is contained in a single server.py file:
- `say` tool: Public tool that triggers the view with text to speak
- Private tools (`create_tts_queue`, `add_tts_text`, `poll_tts_audio`, etc.): Hidden from the model, only callable by the view
- Embedded React view: Uses Babel standalone for in-browser JSX transpilation - no build step needed
- TTS backend: Manages per-request audio queues using Pocket TTS
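
To make the per-request queue idea concrete, here is a minimal sketch, assuming one text input queue and one audio output queue per request. The function names mirror the private tools listed above, but their signatures, the `_queues` registry, and `fake_tts_step` (a stand-in for the real Pocket TTS worker) are assumptions for illustration, not the code in server.py.

```python
import queue
import uuid

# Hypothetical registry: queue id -> (text input queue, audio output queue).
_queues: dict[str, tuple[queue.Queue, queue.Queue]] = {}


def create_tts_queue() -> str:
    """Create a queue pair for one request and return its id."""
    queue_id = uuid.uuid4().hex
    _queues[queue_id] = (queue.Queue(), queue.Queue())
    return queue_id


def add_tts_text(queue_id: str, text: str) -> None:
    """Append newly arrived text to the request's input queue."""
    _queues[queue_id][0].put(text)


def fake_tts_step(queue_id: str) -> None:
    """Stand-in for the TTS worker: turn one pending text item into 'audio'."""
    text_in, audio_out = _queues[queue_id]
    try:
        text = text_in.get_nowait()
    except queue.Empty:
        return
    audio_out.put(text.encode("utf-8"))  # placeholder bytes, not real audio


def poll_tts_audio(queue_id: str) -> list[bytes]:
    """Drain and return every audio chunk that is ready, without blocking."""
    audio_out = _queues[queue_id][1]
    chunks = []
    while True:
        try:
            chunks.append(audio_out.get_nowait())
        except queue.Empty:
            return chunks
```

Because polling never blocks, a caller can request audio at any time and simply receives an empty list when nothing has been generated yet.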
The view communicates with the server via MCP tool calls:
- Receives streaming text via the `ontoolinputpartial` callback
- Incrementally sends new text to the server as it arrives (via `add_tts_text`)
- Polls for generated audio chunks while TTS runs in parallel
- Plays audio via Web Audio API with synchronized text highlighting
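
The "incrementally sends new text" step amounts to forwarding only the part of each partial input that has not been sent yet. The sketch below shows that idea in Python for consistency with the server; the real view is written in React, and `PartialTextTracker` is a hypothetical name. It assumes each partial input extends the previous one.

```python
class PartialTextTracker:
    """Tracks how much of a growing text has already been forwarded."""

    def __init__(self) -> None:
        self.sent = ""

    def next_delta(self, partial_text: str) -> str:
        """Return only the unseen suffix of `partial_text`.

        Assumes each partial input extends the previous one; if it does not,
        the whole text is treated as new (a simplification).
        """
        if partial_text.startswith(self.sent):
            delta = partial_text[len(self.sent):]
        else:
            delta = partial_text
        self.sent = partial_text
        return delta
```

Each non-empty delta would then be passed to the server (for example through `add_tts_text`), so text already queued for synthesis is never sent twice.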
When multiple TTS views exist in the same browser (e.g., multiple chat messages each with their own say view), they coordinate via localStorage to ensure only one plays at a time:
- Unique view IDs: Each view receives a UUID via `toolResult._meta.viewUUID`
- Announce on Play: When starting, a view writes `{uuid, timestamp}` to `localStorage["mcp-tts-playing"]`
- Poll for Conflicts: Every 200ms, playing views check if another view took the lock
- Yield Gracefully: If another view started playing, pause and yield
- Clean Up: On pause/finish, clear the lock (only if owned)
This "last writer wins" protocol ensures a seamless experience: clicking play on any view immediately pauses others, without requiring cross-iframe postMessage coordination.
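
The protocol can be simulated in a few lines of Python. This is a sketch under stated assumptions, not the view's code: a plain dict stands in for localStorage, `SpeakLockView` is a hypothetical class, and `check_lock` is called by hand where the real view would run it on a 200 ms timer.

```python
import time

LOCK_KEY = "mcp-tts-playing"


class SpeakLockView:
    """One TTS view; `storage` is a dict standing in for localStorage."""

    def __init__(self, uuid: str, storage: dict) -> None:
        self.uuid = uuid
        self.storage = storage
        self.playing = False

    def play(self) -> None:
        # Announce on play: overwrite the lock, whoever held it before.
        self.storage[LOCK_KEY] = {"uuid": self.uuid, "timestamp": time.time()}
        self.playing = True

    def check_lock(self) -> None:
        # Poll for conflicts: yield if a different view now owns the lock.
        owner = self.storage.get(LOCK_KEY)
        if self.playing and owner is not None and owner["uuid"] != self.uuid:
            self.playing = False

    def pause(self) -> None:
        self.playing = False
        owner = self.storage.get(LOCK_KEY)
        if owner is not None and owner["uuid"] == self.uuid:
            del self.storage[LOCK_KEY]  # clean up only a lock we own
```

With two views sharing one storage dict, calling `play()` on the second and then `check_lock()` on the first pauses the first, and the first view's `pause()` leaves the second view's lock untouched.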
- Persist caret position in localStorage (resume from where you left off)
- Click anywhere in text to move the cursor/playback position
This project uses Pocket TTS by Kyutai - a fantastic open-source text-to-speech model. Thank you to the Kyutai team for making this technology available!
The server includes modified Pocket TTS code to support streaming text input (text can be fed incrementally while audio generation runs in parallel). A PR contributing this functionality back to the original repo is planned.
This example is MIT licensed.
This project uses the following open-source components:
| Component | License | Description |
|---|---|---|
| pocket-tts | MIT | Python TTS library |
| Kyutai TTS model | CC-BY 4.0 | Text-to-speech model weights |
| kyutai/tts-voices | Mixed (see below) | Voice prompt files |
The predefined voices in this example use CC-BY 4.0 licensed collections:
| Collection | License | Commercial Use |
|---|---|---|
| alba-mackenna | CC-BY 4.0 | ✅ Yes (with attribution) |
| vctk | CC-BY 4.0 | ✅ Yes (with attribution) |
| cml-tts/fr | CC-BY 4.0 | ✅ Yes (with attribution) |
| voice-donations | CC0 (Public Domain) | ✅ Yes |
| expresso | CC-BY-NC 4.0 | ❌ Non-commercial only |
| ears | CC-BY-NC 4.0 | ❌ Non-commercial only |
If you use voices from the expresso/ or ears/ collections, your use is restricted to non-commercial purposes.
