[Feature Request] Add server-side STT option for webchat voice input #47311

@jmomford

Summary

Add a server-side speech-to-text (STT) option for webchat, so that voice input can be processed by local Whisper (or another configured STT backend) instead of relying on the browser's Web Speech API.

Problem

The current webchat voice input uses the browser's native SpeechRecognition API (Web Speech API), which has significant limitations:

  • Safari: Limited/broken support, especially on macOS
  • Privacy: Chrome's Web Speech API sends audio to Google servers
  • Network dependency: Doesn't work offline or in restricted network environments
  • Inconsistent: Behavior varies across browsers and versions

Proposed Solution

Add a second voice input button (or toggle) that:

  1. Uses the MediaRecorder API to capture audio locally (widely supported, including Safari)
  2. Sends the audio blob to Gateway via WebSocket or HTTP upload
  3. Has Gateway transcribe it using the configured tools.media.audio backend (e.g., local Whisper)
  4. Returns the transcribed text to the input field
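
A minimal browser-side sketch of steps 1, 2, and 4, assuming the HTTP upload variant. The endpoint path `/api/stt/transcribe` and the `{ text }` response shape are hypothetical, since no such endpoint exists yet; the MIME type list is also an assumption about what each browser's MediaRecorder typically accepts.

```typescript
// Candidate recording formats in preference order (assumed list; the
// browser decides at runtime which of these it actually supports).
const CANDIDATE_MIME_TYPES = [
  "audio/webm;codecs=opus",
  "audio/ogg;codecs=opus",
  "audio/mp4",
];

// Returns the first candidate the given predicate accepts, or undefined
// so the caller can fall back to the browser's default format.
export function pickMimeType(
  isSupported: (mimeType: string) => boolean,
): string | undefined {
  return CANDIDATE_MIME_TYPES.find(isSupported);
}

// Records from the microphone until `stopSignal` resolves, uploads the
// audio, and resolves with the transcribed text.
// NOTE: `endpoint` and the `{ text }` response body are hypothetical.
export async function recordAndTranscribe(
  stopSignal: Promise<void>,
  endpoint = "/api/stt/transcribe",
): Promise<string> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const mimeType = pickMimeType((t) => MediaRecorder.isTypeSupported(t));
  const recorder = new MediaRecorder(stream, mimeType ? { mimeType } : undefined);

  // Collect encoded chunks as the recorder emits them.
  const chunks: Blob[] = [];
  recorder.ondataavailable = (event) => {
    if (event.data.size > 0) chunks.push(event.data);
  };
  const stopped = new Promise<void>((resolve) => {
    recorder.onstop = () => resolve();
  });

  recorder.start();
  await stopSignal;
  recorder.stop();
  await stopped;
  stream.getTracks().forEach((track) => track.stop()); // release the microphone

  const blob = new Blob(chunks, { type: recorder.mimeType });
  const response = await fetch(endpoint, {
    method: "POST",
    headers: { "Content-Type": blob.type },
    body: blob,
  });
  if (!response.ok) {
    throw new Error(`Transcription failed with HTTP ${response.status}`);
  }
  const { text } = (await response.json()) as { text: string };
  return text;
}
```

The caller would resolve `stopSignal` when the user releases or re-clicks the button, then write the returned text into the input field. Sending the recorder's actual MIME type lets the server pick a matching file extension for the STT backend.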

UI Suggestion

  • Keep existing browser STT button (for users who prefer it)
  • Add new "Server STT" button with distinct icon (e.g., server + mic)
  • Or: single button with config option to choose backend

Backend

The existing tools.media.audio config already supports a local Whisper CLI:

{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [
          { type: "cli", command: "whisper", args: ["{{MediaPath}}", "--model", "tiny", "--language", "zh"] }
        ]
      }
    }
  }
}

The only missing piece is a new endpoint that accepts audio uploads from webchat and returns the transcription.
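
As a rough illustration of what such an endpoint could do internally, the sketch below writes the uploaded audio to a temp file, substitutes `{{MediaPath}}` in the configured args, and runs the CLI. The `CliModel` type, `expandArgs`, and `transcribe` are hypothetical names, not the Gateway's actual internals, and reading the transcript from stdout is an assumption that may not hold for every backend.

```typescript
import { execFile } from "node:child_process";
import { mkdtemp, rm, writeFile } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

// Mirrors one entry of tools.media.audio.models from the config above
// (hypothetical type name).
interface CliModel {
  type: "cli";
  command: string;
  args: string[];
}

// Replaces every {{MediaPath}} placeholder with the real file path.
export function expandArgs(args: string[], mediaPath: string): string[] {
  return args.map((arg) => arg.split("{{MediaPath}}").join(mediaPath));
}

// Writes the uploaded audio to a private temp directory, runs the
// configured CLI on it, and returns the trimmed stdout.
// ASSUMPTION: the backend prints the transcript to stdout.
// `extension` should come from a server-side allow-list keyed on the
// upload's Content-Type, never directly from client input.
export async function transcribe(
  audio: Uint8Array,
  extension: string,
  model: CliModel,
): Promise<string> {
  const dir = await mkdtemp(join(tmpdir(), "stt-"));
  const mediaPath = join(dir, `input.${extension}`);
  try {
    await writeFile(mediaPath, audio);
    // execFile passes args as an array, so no shell quoting is involved.
    const { stdout } = await execFileAsync(
      model.command,
      expandArgs(model.args, mediaPath),
    );
    return stdout.trim();
  } finally {
    // Remove the audio as soon as transcription finishes or fails.
    await rm(dir, { recursive: true, force: true });
  }
}
```

An HTTP or WebSocket handler would call `transcribe` with the request body and reply with the resulting text. Deleting the temp directory in `finally` keeps recorded audio from lingering on disk, which matters for the privacy benefit listed below.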

Benefits

  • Works consistently across all browsers (Safari, Firefox, Chrome)
  • Privacy-friendly (audio stays on user's server)
  • Works offline/air-gapped
  • Leverages existing media understanding infrastructure
  • Users can trade accuracy against speed by choosing a model size (e.g., tiny, base, medium)

Environment

  • OpenClaw version: 2026.3.8
  • Browser: Safari (macOS)
  • Server: Ubuntu ARM64 with Whisper installed
