Summary
Voice/audio messages sent via Telegram (and other messaging platforms) are downloaded successfully but not automatically transcribed, even when OpenAI Whisper is configured in tools.media.audio. Audio appears as media:audio placeholders in the agent's context, requiring manual transcription via curl. This breaks the natural flow of voice-based interaction on mobile platforms where voice notes are a primary input method.
Proposed solution
Automatically transcribe incoming audio/voice messages using the configured tools.media.audio provider (OpenAI Whisper, etc.) and inject the transcript into the agent's message context. The transcript should replace or annotate the media:audio placeholder so the agent can respond naturally to voice input.
Implementation details:
• Add an autoTranscribe flag to tools.media.audio (default: true when audio models are configured)
• Process audio attachments through the configured model pipeline automatically
• Inject transcripts with clear formatting: [Voice note transcription: "text here"]
• Respect existing tools.media.audio.scope permissions
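The steps above could be sketched roughly as follows. This is a minimal illustration, not OpenClaw's actual code: the `AudioConfig`, `InboundMessage`, and `Transcriber` shapes, the function names, and the reduction of scope handling to a single `scope.default` check are all assumptions made for the example. Only the config key names and the transcript format come from this request.

```typescript
// Hypothetical shapes for illustration only; not taken from the OpenClaw codebase.
interface AudioConfig {
  enabled: boolean;
  autoTranscribe?: boolean;            // proposed flag; unset means "use the default"
  models: string[];                    // stand-in for the entries in tools.media.audio.models
  scope: { default: "allow" | "deny" }; // assumed simplification of scope permissions
}

interface InboundMessage {
  text: string;        // agent-visible text, may contain media:audio placeholders
  audio: Uint8Array[]; // downloaded audio attachments, in placeholder order
}

// Assumed provider hook, e.g. a wrapper around the configured Whisper model.
type Transcriber = (clip: Uint8Array) => Promise<string>;

const PLACEHOLDER = "media:audio";

// Default to true when audio models are configured, and respect scope.
function shouldAutoTranscribe(cfg: AudioConfig): boolean {
  if (!cfg.enabled || cfg.scope.default !== "allow") return false;
  return cfg.autoTranscribe ?? cfg.models.length > 0;
}

// Transcribe one clip; fall back to the placeholder if the provider fails.
async function annotate(clip: Uint8Array, transcribe: Transcriber): Promise<string> {
  try {
    const transcript = (await transcribe(clip)).trim();
    return `[Voice note transcription: "${transcript}"]`;
  } catch {
    return PLACEHOLDER;
  }
}

// Replace each placeholder with the transcript of the matching attachment.
async function injectTranscripts(
  msg: InboundMessage,
  cfg: AudioConfig,
  transcribe: Transcriber,
): Promise<string> {
  if (!shouldAutoTranscribe(cfg)) return msg.text;
  const parts = msg.text.split(PLACEHOLDER);
  let out = parts[0];
  for (let i = 1; i < parts.length; i++) {
    const clip = msg.audio[i - 1];
    // Leave the placeholder untouched when no attachment matches it.
    out += (clip ? await annotate(clip, transcribe) : PLACEHOLDER) + parts[i];
  }
  return out;
}
```

Falling back to the original placeholder on a provider error keeps the current behavior as the worst case, so a failed transcription never drops the message.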
Alternatives considered
- Manual transcription commands – Current workaround where the agent manually calls Whisper API via curl for each voice note. This defeats the purpose of seamless voice interaction.
- External webhook preprocessing – Set up a separate service to intercept Telegram webhooks, transcribe audio, and forward modified messages to OpenClaw. This adds unnecessary infrastructure complexity.
- Custom skill/hook – Build a local hook to process audio on inbound messages. This should be a core feature, not a per-deployment workaround.
Additional context
• Version: OpenClaw 2026.2.9
• Platform: Telegram (likely affects WhatsApp, Signal, Slack, Discord, and other voice-capable channels)
• Config tested:
  - OpenAI auth profile configured (openai:default)
  - tools.media.audio.enabled: true
  - tools.media.audio.scope.default: "allow"
  - OpenAI Whisper model defined in tools.media.audio.models
  - API key working (manual curl tests succeed)
• Related docs: Voice message sending is documented in docs/channels/telegram.md (lines 442-462) but only covers outbound, not inbound transcription
• Similar feature: Image understanding already works automatically via tools.media.image – audio should follow the same pattern