Summary
Add a pluggable STT provider interface to the voice-call plugin, mirroring the existing TTS provider pattern (api.registerSpeechProvider).
Problem to solve
The voice-call plugin's streaming STT is hardcoded to openai-realtime in three places: the zod schema enum, the initializeMediaStreaming() method, and the openclaw.plugin.json config schema. There is no way to use an alternative STT provider (AWS Transcribe, Deepgram, local Whisper, etc.) without patching compiled dist files. This forces all voice-call users to depend on OpenAI for transcription regardless of their infrastructure preferences.
Proposed solution
Add api.registerRealtimeTranscriptionProvider(provider) to the plugin SDK. The provider interface already exists informally — OpenAIRealtimeSTTProvider has a clean contract: createSession() returning a session with sendAudio(), onTranscript(), onPartial(), onSpeechStart(), close(), and isConnected(). Making this pluggable requires two changes: expanding the sttProvider config to accept any registered provider ID, and having initializeMediaStreaming() resolve the provider from the registry instead of directly instantiating OpenAIRealtimeSTTProvider.
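To make the proposal concrete, here is a minimal sketch of the contract and registry. The interface names follow the OpenAIRealtimeSTTProvider shape described above; the registry internals and the registerRealtimeTranscriptionProvider / resolve functions are assumptions about how the SDK could expose this, not the shipped API.

```typescript
// Hypothetical provider contract, mirroring the informal
// OpenAIRealtimeSTTProvider interface (names assumed, not authoritative).
interface STTSession {
  sendAudio(chunk: Uint8Array): void;
  onTranscript(cb: (text: string) => void): void;
  onPartial(cb: (text: string) => void): void;
  onSpeechStart(cb: () => void): void;
  close(): Promise<void>;
  isConnected(): boolean;
}

interface RealtimeTranscriptionProvider {
  id: string; // e.g. "openai-realtime", "aws-transcribe"
  createSession(opts?: Record<string, unknown>): Promise<STTSession>;
}

// Minimal registry: initializeMediaStreaming() would look up the configured
// sttProvider ID here instead of instantiating OpenAIRealtimeSTTProvider.
const sttProviders = new Map<string, RealtimeTranscriptionProvider>();

function registerRealtimeTranscriptionProvider(
  p: RealtimeTranscriptionProvider,
): void {
  if (sttProviders.has(p.id)) {
    throw new Error(`STT provider already registered: ${p.id}`);
  }
  sttProviders.set(p.id, p);
}

function resolveSTTProvider(id: string): RealtimeTranscriptionProvider {
  const p = sttProviders.get(id);
  if (!p) {
    const known = [...sttProviders.keys()].join(", ") || "none";
    throw new Error(`Unknown sttProvider "${id}" (registered: ${known})`);
  }
  return p;
}
```

With this shape, the zod schema's enum could relax to a plain string validated against the registry at startup, so third-party plugins register providers without touching the voice-call plugin's dist files.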
Alternatives considered
- Patching the compiled dist files after each update (current workaround — fragile, requires a systemd watcher or manual reapplication)
- Using a SOCKS proxy or external adapter to intercept the OpenAI WebSocket and redirect to another provider (over-engineered, adds latency)
- Disabling streaming STT and using buffered transcription (loses real-time capability)
Impact
Opens the voice-call plugin to the broader STT ecosystem. AWS Transcribe, Deepgram, Azure Speech, and local Whisper all have streaming transcription APIs. Users running on their own hardware (edge boxes, self-hosted) benefit most — they can choose providers based on cost, latency, privacy, or regulatory requirements rather than being locked to OpenAI.
Evidence/examples
I've built an AWS Transcribe STT provider that implements the same interface as OpenAIRealtimeSTTProvider, including mu-law to PCM decoding for Twilio, speech detection from partial results, and configurable silence thresholds. Full source: https://github.com/agenticbrian/openclaw-provider-aws-polly/blob/master/transcribe-stt.js — ready to contribute as a PR if the pluggable interface lands.
Additional information
Also related: the responseAgent config field from #9635 would complement this — currently const agentId = "main" is hardcoded in the response generator, requiring a separate patch to route voice calls to a specific agent. Together, pluggable STT + responseAgent would make the voice-call plugin fully configurable for multi-agent, multi-provider voice setups.