Bug type
Behavior bug (incorrect output/state without crash)
Beta release blocker
No
Summary
Twilio inbound voice-call media streams connect and initial TTS plays, but OpenAI realtime transcription times out during the live call, so caller speech is never transcribed or routed to the agent.
Steps to reproduce
Steps to reproduce:
- Start OpenClaw 2026.4.27 on Ubuntu 24.04 with voice-call enabled.
- Configure voice-call with Twilio, streaming.enabled: true, streaming.provider: "openai", and streaming.providers.openai.model: "gpt-4o-transcribe".
- Configure TTS with OpenAI gpt-4o-mini-tts.
- Call the configured Twilio number from an allowlisted caller.
- Observe the Twilio media stream connect.
- Speak during or after the initial greeting.
Expected behavior
After the Twilio media stream connects, OpenAI realtime transcription should connect successfully, caller speech should be transcribed, and the transcript should be routed to the voice-call agent response path.
Actual behavior
The Twilio media stream connects and the initial greeting eventually plays, but STT fails with the error "OpenAI realtime transcription connection timeout". No user transcript is recorded, and the call remains effectively deaf until it disconnects or ends.
OpenClaw version
2026.4.27
Operating system
Ubuntu 24.04.4 LTS / Linux 6.8.0-110-generic x86_64
Install method
npm global
Model
Voice-call streaming STT: openai/gpt-4o-transcribe, Voice-call TTS: openai/gpt-4o-mini-tts, agent model codex-5.5
Provider / routing chain
Twilio inbound call -> Tailscale Funnel HTTPS/WSS -> OpenClaw voice-call webhook/media stream -> OpenAI Realtime transcription API
Additional provider/model setup details
Relevant redacted voice-call config:
{
  "provider": "twilio",
  "publicUrl": "https://<tailscale-host>/voice/webhook",
  "serve": {
    "port": 3334,
    "bind": "127.0.0.1",
    "path": "/voice/webhook"
  },
  "inboundPolicy": "allowlist",
  "streaming": {
    "enabled": true,
    "provider": "openai",
    "streamPath": "/voice/stream",
    "providers": {
      "openai": {
        "apiKey": "***",
        "model": "gpt-4o-transcribe",
        "silenceDurationMs": 800,
        "vadThreshold": 0.5
      }
    }
  },
  "realtime": {
    "enabled": false
  },
  "tts": {
    "provider": "openai",
    "providers": {
      "openai": {
        "apiKey": "***",
        "model": "gpt-4o-mini-tts",
        "voice": "alloy"
      }
    },
    "timeoutMs": 30000
  }
}
Direct probes from the same machine succeeded:
- OpenAI gpt-4o-mini-tts request returned 200 in about 1.2s.
- Direct OpenAI realtime transcription WebSocket opened and returned transcription_session.created in about 1.1s.
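For anyone repeating the probe timings above, a minimal timing harness could look like the sketch below. It is not the script used for this report. `timed_probe` and `open_session` are hypothetical names, and `fake_open_session` only sleeps; a real probe would replace it with a coroutine that opens the transcription WebSocket and returns once the first server event (here, transcription_session.created) arrives.

```python
import asyncio
import time


async def timed_probe(open_session, timeout_s=10.0):
    """Await open_session() and return (ok, elapsed_seconds).

    open_session is a hypothetical zero-argument coroutine function that
    completes once the first server event has been received.
    """
    started = time.monotonic()
    try:
        await asyncio.wait_for(open_session(), timeout=timeout_s)
        ok = True
    except asyncio.TimeoutError:
        # The probe did not complete within timeout_s.
        ok = False
    return ok, time.monotonic() - started


async def fake_open_session():
    # Stand-in for the real WebSocket handshake: sleeps instead of connecting.
    await asyncio.sleep(0.05)
    return "transcription_session.created"


if __name__ == "__main__":
    ok, elapsed = asyncio.run(timed_probe(fake_open_session, timeout_s=1.0))
    print(f"ok={ok} elapsed={elapsed:.2f}s")
```

Running the same harness from inside the voice-call process, rather than from a separate shell, would help show whether the slow connect is specific to the runtime path.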
Logs, screenshots, and evidence
07:55:06 [voice-call] Inbound call accepted: +<PHONE_NUMBER_REDACTED> is in allowlist
07:55:06 [voice-call] Created inbound call record: 41be546b-d1db-4f1a-b613-b4155a8821db from +<PHONE_NUMBER_REDACTED>
07:55:07 [MediaStream] Twilio connected
07:55:07 [MediaStream] Stream started: MZd0ddb4a2aa6561e185e88e481c1523b0 (call: CA0c67464cb2ddbccd522404560efbe0e5)
07:55:07 [voice-call] Media stream connected: CA0c67464cb2ddbccd522404560efbe0e5 -> MZd0ddb4a2aa6561e185e88e481c1523b0
07:55:07 [voice-call] Speaking initial message for call 41be546b-d1db-4f1a-b613-b4155a8821db (mode: conversation)
07:55:19 [MediaStream] Transcription session error: OpenAI realtime transcription connection timeout
07:55:19 [MediaStream] STT connection failed (TTS still works): OpenAI realtime transcription connection timeout
07:57:04 [MediaStream] Stream stopped: MZd0ddb4a2aa6561e185e88e481c1523b0
07:57:04 [voice-call] Media stream disconnected: CA0c67464cb2ddbccd522404560efbe0e5 (MZd0ddb4a2aa6561e185e88e481c1523b0)
07:57:05 [MediaStream] WebSocket closed (code: 1005, reason: none)
07:57:06 [voice-call] Auto-ending call 41be546b-d1db-4f1a-b613-b4155a8821db after stream disconnect grace
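The timing in the log above can be read off with a short script. This sketch copies four of the log lines and computes two gaps: how long STT was pending before the timeout, and how long the call stayed connected afterwards without transcription. The helper names are illustrative. The 12-second gap is consistent with a fixed connect timeout, but the log does not state the configured value, so that is an inference only.

```python
from datetime import datetime

# Lines copied from the log excerpt in this report.
LOG_LINES = [
    "07:55:07 [MediaStream] Twilio connected",
    "07:55:07 [voice-call] Speaking initial message for call 41be546b-d1db-4f1a-b613-b4155a8821db (mode: conversation)",
    "07:55:19 [MediaStream] Transcription session error: OpenAI realtime transcription connection timeout",
    "07:57:04 [MediaStream] Stream stopped: MZd0ddb4a2aa6561e185e88e481c1523b0",
]


def first_timestamp(lines, needle):
    """Return the HH:MM:SS timestamp of the first line containing needle."""
    for line in lines:
        if needle in line:
            return datetime.strptime(line.split(" ", 1)[0], "%H:%M:%S")
    raise ValueError(f"no log line contains {needle!r}")


def seconds_between(lines, start_needle, end_needle):
    """Whole seconds from the first start_needle line to the first end_needle line."""
    delta = first_timestamp(lines, end_needle) - first_timestamp(lines, start_needle)
    return int(delta.total_seconds())


stt_pending_s = seconds_between(LOG_LINES, "Twilio connected", "connection timeout")
deaf_s = seconds_between(LOG_LINES, "connection timeout", "Stream stopped")

print(stt_pending_s)  # 12
print(deaf_s)         # 105
```

For comparison, the direct probe from the same host received transcription_session.created in about 1.1s.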
Persisted call record evidence shows only the bot greeting transcript, with no user transcript:
{
  "callId": "41be546b-d1db-4f1a-b613-b4155a8821db",
  "state": "speaking",
  "transcript": [
    {
      "speaker": "bot",
      "text": "Hello! How can I help you today?",
      "isFinal": true
    }
  ]
}
Impact and severity
Affected: voice-call plugin users using Twilio inbound calls with OpenAI realtime transcription.
Severity: High; inbound conversation mode is unusable because caller speech is not transcribed.
Frequency: Observed repeatedly across multiple inbound call attempts in this setup.
Consequence: Calls connect and may play the greeting, but the assistant cannot hear/respond to the caller.
Additional information
A direct OpenAI realtime transcription WebSocket probe from the same host succeeds quickly, so this does not appear to be basic OpenAI network reachability. The failure appears specific to the live voice-call media stream runtime path.
Potentially relevant observation: the initial greeting begins immediately after media stream connect, while STT connection is still pending. In observed calls, STT times out and user speech is never captured.
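Because a single timeout currently leaves the call without STT until it ends, one possible mitigation is to retry the STT connection instead of giving up after the first attempt. The sketch below illustrates that idea only; it is not OpenClaw's code and not a confirmed fix. `connect_with_retry`, its parameters, and the `connect` callable are all hypothetical.

```python
import asyncio


async def connect_with_retry(connect, attempts=3, timeout_s=10.0, backoff_s=0.5):
    """Try connect() up to `attempts` times, each bounded by timeout_s.

    connect is a hypothetical zero-argument coroutine function that returns
    a session object once the STT connection is ready.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return await asyncio.wait_for(connect(), timeout=timeout_s)
        except (asyncio.TimeoutError, OSError) as err:
            last_error = err
            if attempt < attempts:
                # Linear backoff before the next attempt.
                await asyncio.sleep(backoff_s * attempt)
    raise ConnectionError(f"STT connect failed after {attempts} attempts") from last_error


if __name__ == "__main__":
    calls = {"n": 0}

    async def flaky_connect():
        # Simulates one hung attempt followed by a successful one.
        calls["n"] += 1
        if calls["n"] == 1:
            await asyncio.sleep(1)
        return "session"

    result = asyncio.run(
        connect_with_retry(flaky_connect, attempts=3, timeout_s=0.05, backoff_s=0.01)
    )
    print(result, calls["n"])
```

Whether a retry would help depends on why the first connection stalls. If the stall is caused by the greeting and the STT connect contending in the same runtime path, starting the STT connection before the greeting may matter more than retrying.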