Bug Report: WhatsApp Voice Messages Broken on All Model Providers
Summary
WhatsApp voice messages fail on all model providers (Google Antigravity, OpenAI, Anthropic/OpenRouter) because OpenClaw sends audio content with incorrect MIME type formatting, treating audio as image content.
Environment
- OpenClaw Version: 2026.2.9
- OS: Ubuntu 22.04 (Linux 5.15.0-151-generic x86_64)
- Node: 22.22.0
- WhatsApp Channel: Baileys (web)
Steps to Reproduce
- Configure OpenClaw with WhatsApp channel
- Set any model provider as primary (tested: google-antigravity, openai, openrouter)
- Send a voice message from WhatsApp (phone or desktop)
- Observe error in response
Expected Behavior
Voice messages should either:
- Be transcribed automatically before sending to models that don't support audio, OR
- Be sent with correct audio MIME type to models that do support audio (e.g., GPT-4o)
Actual Behavior
OpenClaw sends the audio content as if it were an image, with a wrong or undefined MIME type.
Error Messages by Provider
| Provider | Error |
|---|---|
| Google Antigravity | Cloud Code Assist API error (400): Unsupported MIME type: |
| OpenAI GPT-4o | HTTP 400: Invalid 'input[1].content[0].image_url'. Expected a base64-encoded data URL with an image MIME type (e.g. 'data:image/png;base64,...'), but got unsupported MIME type 'undefined'. |
| OpenRouter/Claude | messages.0.content.0.image.source.base64.media_type: Input should be 'image/jpeg', 'image/gif', 'image/webp' or 'image/png' |
Technical Details
- WhatsApp voice messages arrive as `audio/ogg; codecs=opus`
- Files are correctly downloaded to `~/.openclaw/media/inbound/*.ogg`
- When building the API request, OpenClaw appears to:
  - Detect media content
  - Treat it as an image (wrong content-type handling)
  - Set the MIME type incorrectly or as `undefined`
  - Send it to the model API, which rejects it
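The MIME handling above suggests the base type is never separated from its codec parameters. A minimal sketch (helper names are illustrative, not OpenClaw's actual code) of how `audio/ogg; codecs=opus` could be normalized and classified:

```typescript
// Hypothetical sketch: strip codec parameters so "audio/ogg; codecs=opus"
// matches against known base types instead of falling through to undefined.
function baseMimeType(raw: string): string {
  // "audio/ogg; codecs=opus" -> "audio/ogg"
  return raw.split(";")[0].trim().toLowerCase();
}

function isAudio(raw: string): boolean {
  return baseMimeType(raw).startsWith("audio/");
}

console.log(baseMimeType("audio/ogg; codecs=opus")); // "audio/ogg"
console.log(isAudio("audio/ogg; codecs=opus"));      // true
console.log(isAudio("image/png"));                   // false
```

A naive exact-string comparison against image MIME types would explain all three provider errors: the codec suffix prevents any match, and the content falls through as an image part with `undefined` type.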
Log Evidence
```
Inbound message +447432727972 -> +447432727972 (direct, audio/ogg; codecs=opus, 67 chars)
```
The audio is correctly identified on the inbound side, but incorrectly formatted on the outbound request to the model API.
Suggested Fix
Option A: Auto-transcription
- Add a config option: `channels.whatsapp.transcribeAudio: true`
- When enabled, auto-invoke the `openai-whisper-api` skill for voice messages before sending to the model
- Send the transcribed text instead of the audio content
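If Option A were adopted, the configuration might look like the following (the key name is the one proposed above; the actual shape of OpenClaw's config file is an assumption):

```json
{
  "channels": {
    "whatsapp": {
      "transcribeAudio": true
    }
  }
}
```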
Option B: Proper audio content handling
- Detect audio MIME types (`audio/*`)
- For models with native audio support (GPT-4o, Gemini), send with the correct audio content type
- For models without audio support, fall back to transcription
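Option B could be sketched as a single dispatch point, branching on model capability. All names here (`ModelInfo`, `buildVoicePart`, the `transcribe` callback) are hypothetical, not OpenClaw's real API:

```typescript
// Hypothetical sketch of Option B: route voice content by model capability.
type ContentPart =
  | { type: "text"; text: string }
  | { type: "audio"; mimeType: string; dataBase64: string };

interface ModelInfo {
  id: string;
  supportsAudio: boolean;
}

async function buildVoicePart(
  model: ModelInfo,
  mimeType: string, // e.g. "audio/ogg; codecs=opus"
  dataBase64: string,
  transcribe: (b64: string) => Promise<string>,
): Promise<ContentPart> {
  const base = mimeType.split(";")[0].trim(); // strip codec parameters
  if (model.supportsAudio) {
    // Native audio: send as an audio part, never as image_url.
    return { type: "audio", mimeType: base, dataBase64 };
  }
  // No native support: transcribe first, send plain text.
  return { type: "text", text: await transcribe(dataBase64) };
}
```

The key point is that the image path is never reachable for `audio/*` input; the only two outcomes are a properly typed audio part or transcribed text.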
Workaround
None currently available. Text messages work fine; only voice messages are affected.
Additional Context
- The `openai-whisper-api` skill is installed and working for manual transcription
- A `transcribe-with-retry.sh` script exists but isn't auto-invoked
- This affects both phone and desktop WhatsApp voice messages
- The issue started appearing after testing multiple model providers
Related
- WhatsApp media download 0-byte issue (intermittent, separate bug)