
WhatsApp voice messages broken on all model providers - audio sent as image with undefined MIME type #13924

@josteins

Description


Bug Report: WhatsApp Voice Messages Broken on All Model Providers

Summary

WhatsApp voice messages fail on all model providers (Google Antigravity, OpenAI, Anthropic/OpenRouter) because OpenClaw sends audio content with incorrect MIME type formatting, treating audio as image content.

Environment

  • OpenClaw Version: 2026.2.9
  • OS: Ubuntu 22.04 (Linux 5.15.0-151-generic x86_64)
  • Node: 22.22.0
  • WhatsApp Channel: Baileys (web)

Steps to Reproduce

  1. Configure OpenClaw with WhatsApp channel
  2. Set any model provider as primary (tested: google-antigravity, openai, openrouter)
  3. Send a voice message from WhatsApp (phone or desktop)
  4. Observe error in response

Expected Behavior

Voice messages should either:

  1. Be transcribed automatically before sending to models that don't support audio, OR
  2. Be sent with correct audio MIME type to models that do support audio (e.g., GPT-4o)

Actual Behavior

OpenClaw sends the audio content as if it were an image, with a wrong or undefined MIME type:

Error Messages by Provider

| Provider | Error |
| --- | --- |
| Google Antigravity | Cloud Code Assist API error (400): `Unsupported MIME type:` |
| OpenAI GPT-4o | HTTP 400: `Invalid 'input[1].content[0].image_url'. Expected a base64-encoded data URL with an image MIME type (e.g. 'data:image/png;base64,...'), but got unsupported MIME type 'undefined'.` |
| OpenRouter/Claude | `messages.0.content.0.image.source.base64.media_type: Input should be 'image/jpeg', 'image/gif', 'image/webp' or 'image/png'` |

Technical Details

  • WhatsApp voice messages arrive as audio/ogg; codecs=opus
  • Files are correctly downloaded to ~/.openclaw/media/inbound/*.ogg
  • When building the API request, OpenClaw appears to:
    1. Detect media content
    2. Treat it as an image (wrong content type handling)
    3. Set MIME type incorrectly or as undefined
    4. Send to model API, which rejects it
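The broken step above (treating audio as an image) suggests the fix is a MIME-based switch when building content parts. A minimal sketch, assuming a hypothetical `buildContentPart` helper and the OpenAI Chat Completions part shapes (`image_url` for images, `input_audio` for audio-capable models) — these names are illustrative, not actual OpenClaw internals:

```typescript
// Hypothetical sketch of MIME-aware content-part construction.
// Part shapes follow the OpenAI Chat Completions API; function and
// type names are illustrative, not actual OpenClaw internals.
type ContentPart =
  | { type: "image_url"; image_url: { url: string } }
  | { type: "input_audio"; input_audio: { data: string; format: "wav" | "mp3" } }
  | { type: "text"; text: string };

function buildContentPart(mime: string, base64: string): ContentPart {
  if (mime.startsWith("image/")) {
    // Images: base64 data URL with an image MIME type.
    return {
      type: "image_url",
      image_url: { url: `data:${mime};base64,${base64}` },
    };
  }
  if (mime.startsWith("audio/")) {
    // Audio: send as input_audio, never as image_url.
    // WhatsApp's "audio/ogg; codecs=opus" would need transcoding to a
    // supported format (wav/mp3) before this point.
    return { type: "input_audio", input_audio: { data: base64, format: "wav" } };
  }
  throw new Error(`Unsupported MIME type: ${mime}`);
}
```

The reported errors are consistent with the `audio/*` branch being absent, so `audio/ogg; codecs=opus` falls through to the image path with an undefined MIME type.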

Log Evidence

Inbound message +447432727972 -> +447432727972 (direct, audio/ogg; codecs=opus, 67 chars)

The audio is correctly identified on inbound but incorrectly formatted in the outbound request to the model API.

Suggested Fix

Option A: Auto-transcription

  • Add config option: channels.whatsapp.transcribeAudio: true
  • When enabled, auto-invoke openai-whisper-api skill for voice messages before sending to model
  • Send transcribed text instead of audio content

Option B: Proper audio content handling

  • Detect audio MIME types (audio/*)
  • For models with native audio support (GPT-4o, Gemini), send with correct audio content type
  • For models without audio support, fall back to transcription
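Options A and B could share one routing rule: transcribe unless the target model natively accepts audio. A sketch under that assumption — the capability set, `routeAudio`, and the `transcribe` stub (standing in for the `openai-whisper-api` skill) are all hypothetical names, not existing OpenClaw code:

```typescript
// Illustrative routing for inbound WhatsApp audio; all identifiers are
// hypothetical except the skill/provider names quoted from this report.
const NATIVE_AUDIO_MODELS = new Set(["gpt-4o-audio-preview", "gemini-2.0-flash"]);

interface AudioMessage {
  mime: string;   // e.g. "audio/ogg; codecs=opus"
  base64: string; // downloaded media, base64-encoded
}

// Stub for the openai-whisper-api skill mentioned in this report.
async function transcribe(msg: AudioMessage): Promise<string> {
  return "(transcription placeholder)";
}

async function routeAudio(
  model: string,
  msg: AudioMessage,
  transcribeAudio: boolean, // channels.whatsapp.transcribeAudio (Option A)
): Promise<{ kind: "audio" | "text"; payload: string }> {
  if (!transcribeAudio && NATIVE_AUDIO_MODELS.has(model)) {
    // Option B: model accepts audio natively; send it with the correct
    // audio content type instead of an image part.
    return { kind: "audio", payload: msg.base64 };
  }
  // Option A, or the fallback for audio-incapable models:
  // transcribe first and send plain text.
  return { kind: "text", payload: await transcribe(msg) };
}
```

Either branch avoids the current failure mode, since audio never reaches a provider disguised as an image.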

Workaround

None currently available. Text messages work fine; only voice messages are affected.

Additional Context

  • The openai-whisper-api skill is installed and working for manual transcription
  • A transcribe-with-retry.sh script exists but isn't auto-invoked
  • This affects both phone and desktop WhatsApp voice messages
  • Issue started appearing after testing multiple model providers

Related

  • WhatsApp media download 0-byte issue (intermittent, separate bug)
