[Feature]: Automatic Voice Note Transcription #14374

**Summary**

Voice/audio messages sent via Telegram (and other messaging platforms) are downloaded successfully but not automatically transcribed, even when OpenAI Whisper is configured in tools.media.audio. Audio appears as <media:audio> placeholders in the agent's context, requiring manual transcription via curl. This breaks the natural flow of voice-based interaction on mobile platforms where voice notes are a primary input method.

**Proposed solution**

Automatically transcribe incoming audio/voice messages using the configured tools.media.audio provider (OpenAI Whisper, etc.) and inject the transcript into the agent's message context. The transcript should replace or annotate the <media:audio> placeholder so the agent can respond naturally to voice input.
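
To illustrate the "replace or annotate" behavior, here is a minimal Python sketch. It assumes the placeholder appears literally as `<media:audio>` in the message body and uses the bracketed format proposed in this issue. The function names (`format_transcript`, `inject_transcripts`) and the `annotate` switch are hypothetical and are not part of OpenClaw.

```python
import re

# Placeholder as it appears in the agent's context today (from this issue).
PLACEHOLDER = "<media:audio>"


def format_transcript(text: str) -> str:
    """Wrap a transcript in the bracketed format proposed in this issue."""
    # Escape embedded double quotes so the wrapper stays unambiguous.
    escaped = text.strip().replace('"', '\\"')
    return f'[Voice note transcription: "{escaped}"]'


def inject_transcripts(body: str, transcripts: list[str], annotate: bool = False) -> str:
    """Replace (or annotate) each placeholder with its transcript, in order.

    Placeholders without a matching transcript are left untouched, so a
    failed transcription never removes information from the context.
    """
    remaining = iter(transcripts)

    def substitute(match: re.Match) -> str:
        transcript = next(remaining, None)
        if transcript is None:
            return match.group(0)
        formatted = format_transcript(transcript)
        # annotate=True keeps the placeholder and appends the transcript.
        return f"{match.group(0)} {formatted}" if annotate else formatted

    return re.sub(re.escape(PLACEHOLDER), substitute, body)


print(inject_transcripts("<media:audio>", ["Remind me to call Sam at five"]))
```

Leaving unmatched placeholders in place is one possible design choice: the agent can still see that audio was attached even when transcription fails.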

**Implementation details:**

• Add an autoTranscribe flag to tools.media.audio (default: true when audio models are configured)
• Process audio attachments through the configured model pipeline automatically
• Inject transcripts with clear formatting: [Voice note transcription: "text here"]
• Respect existing tools.media.audio.scope permissions
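
The gating logic implied by the bullets above could look roughly like the following Python sketch. The `autoTranscribe` flag is the one proposed here and does not exist yet. The dict layout mirrors the config keys listed in this issue, but the shape of each `models` entry, the `"deny"` fallback, and the reading of `scope.default` as a simple allow/deny switch are assumptions. Real scope evaluation may involve more than the `default` field.

```python
from typing import Any, Mapping


def should_auto_transcribe(audio_cfg: Mapping[str, Any]) -> bool:
    """Decide whether inbound audio should be transcribed automatically.

    `audio_cfg` stands in for the tools.media.audio section of the config.
    """
    if not audio_cfg.get("enabled", False):
        return False
    # Assumption: only scope.default is consulted, and anything but "allow" denies.
    if audio_cfg.get("scope", {}).get("default", "deny") != "allow":
        return False
    has_models = bool(audio_cfg.get("models"))
    flag = audio_cfg.get("autoTranscribe")
    if flag is None:
        # Proposed default: true when audio models are configured.
        return has_models
    # An explicit flag is honored, but transcription still needs a model.
    return bool(flag) and has_models


# Hypothetical config matching the setup described under "Additional context".
example_cfg = {
    "enabled": True,
    "scope": {"default": "allow"},
    "models": [{"provider": "openai", "model": "whisper-1"}],
}
print(should_auto_transcribe(example_cfg))
```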

**Alternatives considered**

1. Manual transcription commands – Current workaround where the agent manually calls the Whisper API via curl for each voice note. This defeats the purpose of seamless voice interaction.
2. External webhook preprocessing – Set up a separate service to intercept Telegram webhooks, transcribe audio, and forward modified messages to OpenClaw. This adds unnecessary infrastructure complexity.
3. Custom skill/hook – Build a local hook to process audio on inbound messages. This should be a core feature, not a per-deployment workaround.
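
For reference, the manual workaround in alternative 1 amounts to one multipart request per voice note against OpenAI's audio transcription endpoint. The Python sketch below uses only the standard library. The helper names are hypothetical, and the `whisper-1` model name is assumed to match what is configured in tools.media.audio.models.

```python
import json
import urllib.request
import uuid

API_URL = "https://api.openai.com/v1/audio/transcriptions"


def build_transcription_request(
    audio: bytes, filename: str, api_key: str, model: str = "whisper-1"
) -> urllib.request.Request:
    """Build the multipart/form-data POST that the curl workaround sends."""
    boundary = uuid.uuid4().hex
    parts = [
        # Form field: model name.
        f'--{boundary}\r\nContent-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n".encode(),
        # Form field: the audio file itself.
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n".encode(),
        audio,
        f"\r\n--{boundary}--\r\n".encode(),
    ]
    return urllib.request.Request(
        API_URL,
        data=b"".join(parts),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
        method="POST",
    )


def transcribe(audio: bytes, filename: str, api_key: str) -> str:
    """Send the request and return the transcript text (requires network access)."""
    request = build_transcription_request(audio, filename, api_key)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["text"]
```

Doing this by hand for every inbound voice note is the overhead the proposed feature would remove.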

**Additional context**

• Version: OpenClaw 2026.2.9
• Platform: Telegram (likely affects WhatsApp, Signal, Slack, Discord, and other voice-capable channels)
• Config tested:
  • OpenAI auth profile configured (openai:default)
  • tools.media.audio.enabled: true
  • tools.media.audio.scope.default: "allow"
  • OpenAI Whisper model defined in tools.media.audio.models
  • API key working (manual curl tests succeed)

• Related docs: Voice message sending is documented in docs/channels/telegram.md (lines 442-462), but that section covers only outbound voice messages, not inbound transcription
• Similar feature: Image understanding already works automatically via tools.media.image – audio should follow the same pattern
