Summary
Add a pluggable STT provider interface to the voice-call plugin, mirroring the existing TTS provider pattern (api.registerSpeechProvider).
Problem to solve
The voice-call plugin's streaming STT is hardcoded to openai-realtime in three places: the zod schema enum, the initializeMediaStreaming() method, and the openclaw.plugin.json config schema. There is no way to use an alternative STT provider (AWS Transcribe, Deepgram, local Whisper, etc.) without patching compiled dist files. This forces all voice-call users to depend on OpenAI for transcription regardless of their infrastructure preferences.
Proposed solution
Add api.registerRealtimeTranscriptionProvider(provider) to the plugin SDK. The provider interface already exists informally — OpenAIRealtimeSTTProvider has a clean contract: createSession() returning a session with sendAudio(), onTranscript(), onPartial(), onSpeechStart(), close(), and isConnected(). Making this pluggable requires two changes: expanding the sttProvider config to accept any registered provider ID, and having initializeMediaStreaming() resolve the provider from the registry instead of directly instantiating OpenAIRealtimeSTTProvider.
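To make the proposal concrete, here is a minimal sketch of the contract and registry. The interface names follow the OpenAIRealtimeSTTProvider shape described above; the registry internals and the registerRealtimeTranscriptionProvider / resolve functions are assumptions about how the SDK could expose this, not the shipped API.

```typescript
// Hypothetical provider contract, mirroring the informal
// OpenAIRealtimeSTTProvider interface (names assumed, not authoritative).
interface STTSession {
  sendAudio(chunk: Uint8Array): void;
  onTranscript(cb: (text: string) => void): void;
  onPartial(cb: (text: string) => void): void;
  onSpeechStart(cb: () => void): void;
  close(): Promise<void>;
  isConnected(): boolean;
}

interface RealtimeTranscriptionProvider {
  id: string; // e.g. "openai-realtime", "aws-transcribe"
  createSession(opts?: Record<string, unknown>): Promise<STTSession>;
}

// Minimal registry: initializeMediaStreaming() would look up the configured
// sttProvider ID here instead of instantiating OpenAIRealtimeSTTProvider.
const sttProviders = new Map<string, RealtimeTranscriptionProvider>();

function registerRealtimeTranscriptionProvider(
  p: RealtimeTranscriptionProvider,
): void {
  if (sttProviders.has(p.id)) {
    throw new Error(`STT provider already registered: ${p.id}`);
  }
  sttProviders.set(p.id, p);
}

function resolveSTTProvider(id: string): RealtimeTranscriptionProvider {
  const p = sttProviders.get(id);
  if (!p) {
    const known = [...sttProviders.keys()].join(", ") || "none";
    throw new Error(`Unknown sttProvider "${id}" (registered: ${known})`);
  }
  return p;
}
```

With this shape, the zod schema's enum could relax to a plain string validated against the registry at startup, so third-party plugins register providers without touching the voice-call plugin's dist files.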
Alternatives considered
- Patching the compiled dist files after each update (current workaround — fragile, requires a systemd watcher or manual reapplication)
- Using a SOCKS proxy or external adapter to intercept the OpenAI WebSocket and redirect to another provider (over-engineered, adds latency)
- Disabling streaming STT and using buffered transcription (loses real-time capability)
Impact
Opens the voice-call plugin to the broader STT ecosystem. AWS Transcribe, Deepgram, Azure Speech, and local Whisper all have streaming transcription APIs. Users running on their own hardware (edge boxes, self-hosted) benefit most — they can choose providers based on cost, latency, privacy, or regulatory requirements rather than being locked to OpenAI.
Evidence/examples
I've built an AWS Transcribe STT provider that implements the same interface as OpenAIRealtimeSTTProvider, including mu-law to PCM decoding for Twilio, speech detection from partial results, and configurable silence thresholds. Full source: https://github.com/agenticbrian/openclaw-provider-aws-polly/blob/master/transcribe-stt.js — ready to contribute as a PR if the pluggable interface lands.
Additional information
Also related: the responseAgent config field from #9635 would complement this — currently const agentId = "main" is hardcoded in the response generator, requiring a separate patch to route voice calls to a specific agent. Together, pluggable STT + responseAgent would make the voice-call plugin fully configurable for multi-agent, multi-provider voice setups.