Summary
Add a server-side speech-to-text option for webchat, allowing voice input to be processed by local Whisper (or another configured STT backend) instead of relying on the browser's Web Speech API.
Problem
The current webchat voice input uses the browser's native SpeechRecognition API (Web Speech API), which has significant limitations:
- Safari: Limited/broken support, especially on macOS
- Privacy: Chrome's Web Speech API sends audio to Google servers
- Network dependency: Doesn't work offline or in restricted network environments
- Inconsistent: Behavior varies across browsers and versions
Proposed Solution
Add a second voice input button (or toggle) that:
- Uses the MediaRecorder API to capture audio locally (widely supported, including Safari)
- Sends the audio blob to the Gateway via WebSocket or HTTP upload
- Has the Gateway transcribe it using the configured tools.media.audio backend (e.g., local Whisper)
- Returns transcribed text to the input field
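The client side of these steps could look roughly like the sketch below. It is only an illustration: the endpoint path `/api/webchat/transcribe`, the `{ text }` response shape, and the helper names `pickMimeType`, `recordAudio`, and `transcribe` are assumptions, not existing OpenClaw APIs. The browser calls (`getUserMedia`, `MediaRecorder`, `MediaRecorder.isTypeSupported`, `fetch`) are standard.

```typescript
// Hypothetical endpoint path; this issue does not define one.
const TRANSCRIBE_URL = "/api/webchat/transcribe";

// Container formats to try, in order. Chrome and Firefox usually record
// WebM or Ogg, while Safari usually records MP4.
const CANDIDATE_TYPES = [
  "audio/webm;codecs=opus",
  "audio/webm",
  "audio/mp4",
  "audio/ogg;codecs=opus",
];

/** Return the first supported MIME type, or undefined to use the browser default. */
function pickMimeType(
  isSupported: (type: string) => boolean,
  candidates: string[] = CANDIDATE_TYPES,
): string | undefined {
  return candidates.find((type) => isSupported(type));
}

/** Record from the microphone until `stop` resolves, then return one audio Blob. */
async function recordAudio(stop: Promise<void>): Promise<Blob> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const mimeType = pickMimeType((type) => MediaRecorder.isTypeSupported(type));
  const recorder = new MediaRecorder(stream, mimeType ? { mimeType } : undefined);

  const chunks: Blob[] = [];
  recorder.ondataavailable = (event) => {
    if (event.data.size > 0) chunks.push(event.data);
  };
  const stopped = new Promise<void>((resolve) => {
    recorder.onstop = () => resolve();
  });

  recorder.start();
  await stop; // e.g. resolved when the user releases the mic button
  recorder.stop(); // delivers the final chunk, then fires "stop"
  await stopped;

  stream.getTracks().forEach((track) => track.stop()); // release the microphone
  return new Blob(chunks, { type: recorder.mimeType });
}

/** Upload the recording and return the transcribed text (assumed JSON shape). */
async function transcribe(audio: Blob): Promise<string> {
  const response = await fetch(TRANSCRIBE_URL, {
    method: "POST",
    headers: { "Content-Type": audio.type || "application/octet-stream" },
    body: audio,
  });
  if (!response.ok) {
    throw new Error(`Transcription failed: HTTP ${response.status}`);
  }
  const body = (await response.json()) as { text: string };
  return body.text;
}
```

A button handler would call `recordAudio` with a promise that resolves when the user stops recording, pass the resulting blob to `transcribe`, and write the returned text into the input field. Probing MIME types matters because browsers do not record in the same container, so the server should not assume WebM.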
UI Suggestion
- Keep existing browser STT button (for users who prefer it)
- Add new "Server STT" button with distinct icon (e.g., server + mic)
- Or: single button with config option to choose backend
Backend
The existing tools.media.audio config already supports the local Whisper CLI:
{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [
          { type: "cli", command: "whisper", args: ["{{MediaPath}}", "--model", "tiny", "--language", "zh"] }
        ]
      }
    }
  }
}
The only missing piece is a new endpoint that accepts audio uploads from webchat and returns the transcription.
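As a rough sketch of how such an endpoint could reuse the config above, the helper below turns an uploaded recording into a command invocation. It assumes that `{{MediaPath}}` is replaced with the path of the saved upload; the names `CliModel`, `extensionFor`, and `planTranscription`, and the temp-file layout, are hypothetical and not taken from the OpenClaw codebase.

```typescript
// Shape of one CLI entry in tools.media.audio.models, as shown in the config above.
interface CliModel {
  type: "cli";
  command: string;
  args: string[];
}

// Assumed mapping from upload Content-Type to a file extension for the temp file.
const EXTENSIONS: Record<string, string> = {
  "audio/webm": "webm",
  "audio/mp4": "mp4",
  "audio/ogg": "ogg",
  "audio/wav": "wav",
};

/** Map a Content-Type such as "audio/webm;codecs=opus" to a file extension. */
function extensionFor(contentType: string): string {
  const base = contentType.split(";")[0].trim().toLowerCase();
  return EXTENSIONS[base] ?? "bin";
}

/** Decide where to store the upload and which command line to run on it. */
function planTranscription(
  model: CliModel,
  contentType: string,
  tmpDir: string,
  uploadId: string,
): { mediaPath: string; command: string; args: string[] } {
  const mediaPath = `${tmpDir}/${uploadId}.${extensionFor(contentType)}`;
  // Assumption: every occurrence of the {{MediaPath}} placeholder is substituted.
  const args = model.args.map((arg) => arg.split("{{MediaPath}}").join(mediaPath));
  return { mediaPath, command: model.command, args };
}
```

The endpoint would write the request body to `mediaPath`, run `command` with `args` (for example via Node's `child_process.execFile`, which avoids shell interpolation of the path), and return the transcription as the response. The `uploadId` should be generated server-side rather than taken from the client, so that it cannot be used to influence the file path.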
Benefits
- Works consistently across all browsers (Safari, Firefox, Chrome)
- Privacy-friendly (audio stays on user's server)
- Works offline/air-gapped
- Leverages existing media understanding infrastructure
- Users can choose accuracy vs speed (tiny/base/medium models)
Environment
- OpenClaw version: 2026.3.8
- Browser: Safari (macOS)
- Server: Ubuntu ARM64 with Whisper installed