
[Bug]: voice-call OpenAI realtime transcription times out during Twilio media stream while direct WebSocket succeeds #75197

@donkeykong91

Description


Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

No

Summary

Twilio inbound voice-call media streams connect and initial TTS plays, but OpenAI realtime transcription times out during the live call, so caller speech is never transcribed or routed to the agent.

Steps to reproduce


  1. Start OpenClaw 2026.4.27 on Ubuntu 24.04 with voice-call enabled.
  2. Configure voice-call with Twilio, streaming.enabled: true, streaming.provider: "openai", and streaming.providers.openai.model: "gpt-4o-transcribe".
  3. Configure TTS with OpenAI gpt-4o-mini-tts.
  4. Call the configured Twilio number from an allowlisted caller.
  5. Observe the Twilio media stream connect.
  6. Speak during or after the initial greeting.

Expected behavior

After Twilio media stream connects, OpenAI realtime transcription should connect successfully, caller speech should be transcribed, and the transcript should be routed to the voice-call agent response path.

Actual behavior

The Twilio media stream connects and the initial greeting eventually plays, but STT fails with "OpenAI realtime transcription connection timeout". No user transcript is recorded, and the call remains effectively deaf until the stream disconnects and the call ends.

OpenClaw version

2026.4.27

Operating system

Ubuntu 24.04.4 LTS / Linux 6.8.0-110-generic x86_64

Install method

npm global

Model

Voice-call streaming STT: openai/gpt-4o-transcribe, Voice-call TTS: openai/gpt-4o-mini-tts, agent model codex-5.5

Provider / routing chain

Twilio inbound call -> Tailscale Funnel HTTPS/WSS -> OpenClaw voice-call webhook/media stream -> OpenAI Realtime transcription API

Additional provider/model setup details

Relevant redacted voice-call config:

{
  "provider": "twilio",
  "publicUrl": "https://<tailscale-host>/voice/webhook",
  "serve": {
    "port": 3334,
    "bind": "127.0.0.1",
    "path": "/voice/webhook"
  },
  "inboundPolicy": "allowlist",
  "streaming": {
    "enabled": true,
    "provider": "openai",
    "streamPath": "/voice/stream",
    "providers": {
      "openai": {
        "apiKey": "***",
        "model": "gpt-4o-transcribe",
        "silenceDurationMs": 800,
        "vadThreshold": 0.5
      }
    }
  },
  "realtime": {
    "enabled": false
  },
  "tts": {
    "provider": "openai",
    "providers": {
      "openai": {
        "apiKey": "***",
        "model": "gpt-4o-mini-tts",
        "voice": "alloy"
      }
    },
    "timeoutMs": 30000
  }
}

Direct probes from the same machine succeeded:

  1. An OpenAI gpt-4o-mini-tts request returned 200 in about 1.2s.
  2. A direct OpenAI realtime transcription WebSocket opened and returned transcription_session.created in about 1.1s.
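For reproducibility, the direct WebSocket probe can be sketched roughly as below. This is an assumption-laden sketch, not the exact probe used: the endpoint URL, `intent=transcription` query parameter, and `OpenAI-Beta: realtime=v1` header reflect OpenAI's published realtime API shape, and the third-party `websockets` package is assumed.

```python
# Sketch of a direct OpenAI realtime transcription WebSocket probe.
# Endpoint URL and headers are assumptions based on OpenAI's public
# realtime API docs; they may differ from what OpenClaw uses internally.
import asyncio
import json
import os

REALTIME_URL = "wss://api.openai.com/v1/realtime?intent=transcription"


def build_probe_headers(api_key: str) -> dict:
    """Headers for the realtime transcription WebSocket (assumed API shape)."""
    return {
        "Authorization": f"Bearer {api_key}",
        "OpenAI-Beta": "realtime=v1",
    }


async def probe(api_key: str, timeout: float = 10.0) -> str:
    """Open the WebSocket and return the first server event type.

    Expected first event in the working case: "transcription_session.created".
    """
    import websockets  # third-party; header kwarg name varies by version

    async with websockets.connect(
        REALTIME_URL, additional_headers=build_probe_headers(api_key)
    ) as ws:
        first = json.loads(await asyncio.wait_for(ws.recv(), timeout))
        return first.get("type", "")
```

Running `asyncio.run(probe(os.environ["OPENAI_API_KEY"]))` from the affected host returned `transcription_session.created` in about a second, which is what rules out basic network reachability.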

Logs, screenshots, and evidence

07:55:06 [voice-call] Inbound call accepted: +<PHONE_NUMBER_REDACTED> is in allowlist
07:55:06 [voice-call] Created inbound call record: 41be546b-d1db-4f1a-b613-b4155a8821db from +<PHONE_NUMBER_REDACTED>
07:55:07 [MediaStream] Twilio connected
07:55:07 [MediaStream] Stream started: MZd0ddb4a2aa6561e185e88e481c1523b0 (call: CA0c67464cb2ddbccd522404560efbe0e5)
07:55:07 [voice-call] Media stream connected: CA0c67464cb2ddbccd522404560efbe0e5 -> MZd0ddb4a2aa6561e185e88e481c1523b0
07:55:07 [voice-call] Speaking initial message for call 41be546b-d1db-4f1a-b613-b4155a8821db (mode: conversation)
07:55:19 [MediaStream] Transcription session error: OpenAI realtime transcription connection timeout
07:55:19 [MediaStream] STT connection failed (TTS still works): OpenAI realtime transcription connection timeout
07:57:04 [MediaStream] Stream stopped: MZd0ddb4a2aa6561e185e88e481c1523b0
07:57:04 [voice-call] Media stream disconnected: CA0c67464cb2ddbccd522404560efbe0e5 (MZd0ddb4a2aa6561e185e88e481c1523b0)
07:57:05 [MediaStream] WebSocket closed (code: 1005, reason: none)
07:57:06 [voice-call] Auto-ending call 41be546b-d1db-4f1a-b613-b4155a8821db after stream disconnect grace

The persisted call record shows only the bot greeting transcript, with no user transcript:

{
  "callId": "41be546b-d1db-4f1a-b613-b4155a8821db",
  "state": "speaking",
  "transcript": [
    {
      "speaker": "bot",
      "text": "Hello! How can I help you today?",
      "isFinal": true
    }
  ]
}

Impact and severity

Affected: voice-call plugin users using Twilio inbound calls with OpenAI realtime transcription.
Severity: High; inbound conversation mode is unusable because caller speech is not transcribed.
Frequency: Observed repeatedly across multiple inbound call attempts in this setup.
Consequence: Calls connect and may play the greeting, but the assistant cannot hear/respond to the caller.

Additional information

A direct OpenAI realtime transcription WebSocket probe from the same host succeeds quickly, so this does not appear to be basic OpenAI network reachability. The failure appears specific to the live voice-call media stream runtime path.

Potentially relevant observation: the initial greeting begins immediately after media stream connect, while STT connection is still pending. In observed calls, STT times out and user speech is never captured.
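If the race described above is the cause, one possible mitigation shape (a hypothetical sketch, not OpenClaw's actual code; `connect_stt` and `play_greeting` are placeholder callables) is to gate the greeting on the STT session becoming ready, and degrade explicitly on timeout:

```python
# Hypothetical mitigation sketch: wait for the STT session before speaking,
# instead of starting greeting playback while the STT connect is pending.
import asyncio

STT_CONNECT_TIMEOUT_S = 12.0  # roughly matches the 07:55:07 -> 07:55:19 gap in the logs


async def handle_stream(connect_stt, play_greeting):
    """connect_stt/play_greeting stand in for the real media-stream handlers."""
    try:
        # Ensure the transcription session exists before the bot speaks, so
        # the caller's first utterance after the greeting can be captured.
        stt = await asyncio.wait_for(connect_stt(), STT_CONNECT_TIMEOUT_S)
    except asyncio.TimeoutError:
        stt = None  # degraded TTS-only call, as observed in this report
    await play_greeting()
    return stt
```

Whether serializing these steps is acceptable (it delays the greeting by the STT connect time) is a design call for the maintainers; the sketch only illustrates the suspected ordering issue.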

Metadata

Assignees: none
Labels: bug (Something isn't working)
Milestone: none