[Feature]: Automatic Voice Note Transcription #14374

@machaltitude

Summary

Voice/audio messages sent via Telegram (and likely via other messaging platforms) are downloaded successfully but are not automatically transcribed, even when OpenAI Whisper is configured in tools.media.audio. The audio appears only as media:audio placeholders in the agent's context, so each voice note has to be transcribed manually via curl. This breaks the natural flow of voice-based interaction on mobile platforms, where voice notes are a primary input method.
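
For reference, the manual workaround described above amounts to roughly the following, shown here as a TypeScript sketch rather than the actual curl command. It assumes Node 18+ (global fetch, FormData, Blob), OpenAI's public transcription endpoint, and the whisper-1 model; the function names are illustrative and are not part of OpenClaw.

```typescript
// Sketch of the manual workaround: send one downloaded voice note to
// OpenAI's transcription endpoint and read back the text.
// Assumptions: Node 18+ globals, the "whisper-1" model, illustrative names.

const OPENAI_TRANSCRIPTIONS_URL = "https://api.openai.com/v1/audio/transcriptions";

// Build the multipart form the endpoint expects: a "file" part and a "model" field.
function buildTranscriptionForm(audio: Blob, filename: string, model = "whisper-1"): FormData {
  const form = new FormData();
  form.append("file", audio, filename);
  form.append("model", model);
  return form;
}

// Post the form and return the transcript text from the JSON response.
async function transcribeVoiceNote(audio: Blob, filename: string, apiKey: string): Promise<string> {
  const response = await fetch(OPENAI_TRANSCRIPTIONS_URL, {
    method: "POST",
    headers: { Authorization: `Bearer ${apiKey}` },
    body: buildTranscriptionForm(audio, filename),
  });
  if (!response.ok) {
    throw new Error(`Transcription failed: HTTP ${response.status}`);
  }
  const payload = (await response.json()) as { text: string };
  return payload.text;
}
```

Doing this by hand for every voice note is the friction this request aims to remove.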

Proposed solution

Automatically transcribe incoming audio/voice messages using the configured tools.media.audio provider (OpenAI Whisper, etc.) and inject the transcript into the agent's message context. The transcript should replace or annotate the media:audio placeholder so the agent can respond naturally to voice input.

Implementation details:

• Add an autoTranscribe flag to tools.media.audio (default: true when audio models are configured)
• Process audio attachments through the configured model pipeline automatically
• Inject transcripts with clear formatting: [Voice note transcription: "text here"]
• Respect existing tools.media.audio.scope permissions
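
A minimal sketch of how these points could fit together, under stated assumptions: every type and function name below is hypothetical, the scope check is reduced to a single default value, and the transcriber is passed in as a plain function. Only the autoTranscribe flag, the media:audio placeholder, and the transcript format are taken from the proposal itself.

```typescript
// Hypothetical sketch of the proposed behavior; this is not OpenClaw's actual code.

type ScopeDecision = "allow" | "deny";

interface AudioConfig {
  enabled: boolean;
  autoTranscribe?: boolean;           // proposed flag; unset means "use the default"
  models: string[];                   // stand-in for the configured audio models
  scope: { default: ScopeDecision };  // simplified: real scope rules may be richer
}

interface InboundMessage {
  text: string;        // may contain the media:audio placeholder
  audio?: Uint8Array;  // downloaded voice note, if any
}

type Transcriber = (audio: Uint8Array) => Promise<string>;

const AUDIO_PLACEHOLDER = "media:audio";

// Proposed default: on whenever at least one audio model is configured.
function shouldAutoTranscribe(config: AudioConfig): boolean {
  if (!config.enabled || config.scope.default !== "allow") return false;
  return config.autoTranscribe ?? config.models.length > 0;
}

function formatTranscript(transcript: string): string {
  return `[Voice note transcription: "${transcript.trim()}"]`;
}

// Replace the placeholder with the transcript, or append it if no placeholder is present.
async function injectTranscript(
  message: InboundMessage,
  config: AudioConfig,
  transcribe: Transcriber,
): Promise<InboundMessage> {
  if (!message.audio || !shouldAutoTranscribe(config)) return message;
  try {
    const note = formatTranscript(await transcribe(message.audio));
    const text = message.text.includes(AUDIO_PLACEHOLDER)
      ? message.text.replace(AUDIO_PLACEHOLDER, () => note)
      : `${message.text}\n${note}`.trim();
    return { ...message, text };
  } catch {
    // On transcription failure, keep the original placeholder so nothing is lost.
    return message;
  }
}
```

Passing the transcriber in as a function keeps the sketch independent of any particular provider; falling back to the untouched message on failure preserves today's behavior.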

Alternatives considered

  1. Manual transcription commands – The current workaround, in which the agent manually calls the Whisper API via curl for each voice note. This defeats the purpose of seamless voice interaction.
  2. External webhook preprocessing – Set up a separate service to intercept Telegram webhooks, transcribe audio, and forward modified messages to OpenClaw. This adds unnecessary infrastructure complexity.
  3. Custom skill/hook – Build a local hook to process audio on inbound messages. This should be a core feature, not a per-deployment workaround.

Additional context

• Version: OpenClaw 2026.2.9
• Platform: Telegram (likely affects WhatsApp, Signal, Slack, Discord, and other voice-capable channels)
• Config tested:
  • OpenAI auth profile configured (openai:default)
  • tools.media.audio.enabled: true
  • tools.media.audio.scope.default: "allow"
  • OpenAI Whisper model defined in tools.media.audio.models
  • API key working (manual curl tests succeed)

• Related docs: Voice message sending is documented in docs/channels/telegram.md (lines 442-462), but that section covers only outbound voice messages, not inbound transcription
• Similar feature: Image understanding already works automatically via tools.media.image – audio should follow the same pattern

Metadata

Labels: enhancement (New feature or request)