BlueBubbles native iOS voice-memo delivery broken end-to-end with ElevenLabs (and other non-Azure TTS providers) #72506

@omarshahine

Summary

Sending TTS audio to a BlueBubbles iMessage chat through the bundled tts agent tool (or the tts.convert RPC) always renders as a plain audio attachment in iMessage, never as a native iOS voice memo (the bubble with the waveform / scrubber UI). Two distinct gaps in the same pipeline make this delivery mode unreachable for any non-Azure TTS provider, even though every other link in the chain works.

Pipeline (what should happen)

For native voice-memo rendering, the chain must complete:

  1. TTS provider returns voiceCompatible: true for the synthesized clip.
  2. The bundled tts agent tool sets details.media.audioAsVoice = true based on result.audioAsVoice || result.voiceCompatible (src/agents/tools/tts-tool.ts:97).
  3. The reply-delivery layer propagates audioAsVoice through to the BlueBubbles channel monitor (extensions/bluebubbles/src/monitor-processing.ts:1689 reads payload.audioAsVoice === true into asVoice).
  4. extensions/bluebubbles/src/attachments.ts:134-188 flips wantsVoice = true and adds the isAudioMessage=true form field on the upload.
  5. The BlueBubbles server converts MP3 → CAF and posts via the private API as a native iMessage voice memo.
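
Steps 2 through 4 amount to a single boolean carried across three layers. The following is a minimal sketch under simplified assumptions: the type and function names (TtsResult, ReplyPayload, toolMediaDetails, channelAsVoice, buildUploadFields) are hypothetical stand-ins rather than actual OpenClaw exports; only the field names voiceCompatible, audioAsVoice, and isAudioMessage come from the files cited above.

```typescript
// Hypothetical, simplified shapes; not the real OpenClaw types.
interface TtsResult {
  audioAsVoice?: boolean;
  voiceCompatible?: boolean;
}

interface ReplyPayload {
  audioAsVoice?: boolean;
}

// Step 2: the tts tool promotes either flag to details.media.audioAsVoice.
function toolMediaDetails(result: TtsResult): ReplyPayload {
  return { audioAsVoice: Boolean(result.audioAsVoice || result.voiceCompatible) };
}

// Step 3: the channel monitor treats only a strict `true` as a voice request.
function channelAsVoice(payload: ReplyPayload): boolean {
  return payload.audioAsVoice === true;
}

// Step 4: the upload carries isAudioMessage=true only when voice was requested.
function buildUploadFields(asVoice: boolean): Record<string, string> {
  const fields: Record<string, string> = {};
  if (asVoice) {
    fields.isAudioMessage = "true";
  }
  return fields;
}

// voiceCompatible: false is what ElevenLabs reports today, so no field is added.
const plain = buildUploadFields(channelAsVoice(toolMediaDetails({ voiceCompatible: false })));
const voice = buildUploadFields(channelAsVoice(toolMediaDetails({ voiceCompatible: true })));
console.log(plain, voice);
```

Because every later step only forwards the flag, a provider that never reports voiceCompatible: true leaves the isAudioMessage branch unreachable no matter how well the rest is wired.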

Where the chain breaks

Gap 1 — target=voice-note is never set when delivering TTS to BlueBubbles

extensions/elevenlabs/speech-provider.ts:514 only marks voiceCompatible: true when req.target === "voice-note". But there's no path that sets target = "voice-note" automatically based on the destination channel:

  • tts.convert RPC handler (src/gateway/server-methods/tts.ts:92-144) does not accept a target param. It calls textToSpeech({ text, cfg, channel, overrides, disableFallback }), so the channel is forwarded, but I cannot find any branch in the runtime that maps channel === "bluebubbles" → target = "voice-note".
  • The bundled tts agent tool (src/agents/tools/tts-tool.ts) likewise has no target param in TtsToolSchema and does not set one explicitly.
  • Adding [[audio_as_voice]] to the input text (or passing "target": "voice-note" directly to tts.convert) does not cause the synthesis to flip: voiceCompatible stays false (verified on v2026.4.24, see repro below).
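
The missing branch could be as small as a lookup from channel to synthesis target. The sketch below is an assumption about shape only: resolveTtsTarget, VOICE_NOTE_CHANNELS, and the "audio-file" fallback value are hypothetical names, not existing OpenClaw code, and which channels belong in the set is a maintainer decision.

```typescript
// Hypothetical target type; "audio-file" is a placeholder for the non-voice default.
type TtsTarget = "voice-note" | "audio-file";

// Hypothetical: channels whose delivery layer can render native voice memos.
const VOICE_NOTE_CHANNELS: ReadonlySet<string> = new Set(["bluebubbles"]);

// An explicit target from the caller wins; otherwise the channel decides.
function resolveTtsTarget(channel?: string, explicit?: TtsTarget): TtsTarget {
  if (explicit !== undefined) {
    return explicit;
  }
  if (channel !== undefined && VOICE_NOTE_CHANNELS.has(channel)) {
    return "voice-note";
  }
  return "audio-file";
}

console.log(resolveTtsTarget("bluebubbles"), resolveTtsTarget(undefined));
```

Keeping the explicit argument first lets tts.convert or the tts tool expose target later without changing the channel default.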

Gap 2 — ElevenLabs returns opus for voice-note target, but BlueBubbles rejects opus

Even if Gap 1 were closed, extensions/elevenlabs/speech-provider.ts:469-513 defaults to opus_48000_64 with file extension .opus whenever req.target === "voice-note":

const outputFormat =
  trimToUndefined(overrides.outputFormat) ??
  (req.target === "voice-note" ? "opus_48000_64" : "mp3_44100_128");
// ...
fileExtension: req.target === "voice-note" ? ".opus" : ".mp3",
voiceCompatible: req.target === "voice-note",

But extensions/bluebubbles/src/attachments.ts:170-188 requires MP3 or CAF for isAudioMessage=true and explicitly rejects opus:

if (isAudioMessage) {
  const voiceInfo = resolveVoiceInfo(filename, contentType);
  if (!voiceInfo.isAudio) { throw new Error("BlueBubbles voice messages require audio media (mp3 or caf)."); }
  if (voiceInfo.isMp3) { /* ok */ }
  else if (voiceInfo.isCaf) { /* ok */ }
  else { throw new Error("BlueBubbles voice messages require mp3 or caf audio (convert before sending)."); }
}

So overriding outputFormat: "mp3_44100_128" to coax MP3 out doesn't fix it either, because fileExtension is hardcoded to .opus whenever target === "voice-note", regardless of the actual format. BlueBubbles would receive an .opus filename with MP3 bytes, so voiceInfo.isMp3 (derived from the filename) would be false and the upload would be rejected.
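
The mismatch can be reproduced in isolation. The sketch below mirrors the two quoted snippets under simplifying assumptions: synthesizeMeta and isAcceptedVoiceFilename are hypothetical stand-ins, and the filename check assumes detection by extension only, as described above (the real resolveVoiceInfo also receives contentType).

```typescript
interface SynthMeta {
  outputFormat: string;
  fileExtension: string;
  voiceCompatible: boolean;
}

// Mirrors the quoted ElevenLabs logic: the extension follows the target, not the format.
function synthesizeMeta(target: string | undefined, overrideFormat?: string): SynthMeta {
  const outputFormat =
    overrideFormat ?? (target === "voice-note" ? "opus_48000_64" : "mp3_44100_128");
  return {
    outputFormat,
    fileExtension: target === "voice-note" ? ".opus" : ".mp3",
    voiceCompatible: target === "voice-note",
  };
}

// Assumption: extension-only detection of the two formats BlueBubbles accepts.
function isAcceptedVoiceFilename(filename: string): boolean {
  const lower = filename.toLowerCase();
  return lower.endsWith(".mp3") || lower.endsWith(".caf");
}

// Pin MP3 output while targeting voice-note, as a user would via config.
const pinned = synthesizeMeta("voice-note", "mp3_44100_128");
const filename = `clip${pinned.fileExtension}`;
console.log(pinned.outputFormat, filename, isAcceptedVoiceFilename(filename));
```

This prints mp3_44100_128 clip.opus false: the bytes would be MP3, but the filename still says opus, so the voice-message check fails.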

Net effect

There is no provider+channel combination today (other than possibly Azure Speech, which has explicit voiceNoteOutputFormat config) that can produce a TTS clip BlueBubbles will accept as a native voice memo. The isAudioMessage/asVoice plumbing on the BlueBubbles side is fully wired and works (extensions/bluebubbles/src/actions.ts:448 accepts an explicit asVoice param on direct attachment posts) — but the agent-facing surfaces (tts tool, tts.convert, auto-reply delivery) cannot reach it for synthesized speech.

Reproduction

Environment:

  • OpenClaw v2026.4.24 (file-backed secrets, macOS LaunchAgent, BlueBubbles bundled channel)
  • BlueBubbles server with private API enabled (verified separately — asVoice works for non-TTS attachments via bluebubbles_send_attachment with asVoice: true)

Config:

{
  messages: {
    tts: {
      provider: "elevenlabs",
      providers: {
        elevenlabs: {
          apiKey: "<literal sk_… key (workaround for #72496)>",
          voiceId: "<voice-id>",
          model: "eleven_v3",
          outputFormat: "mp3_44100_128"
        }
      }
    }
  }
}

Tests (all return voiceCompatible: false):

openclaw gateway call tts.convert --params '{"text":"hi"}'
openclaw gateway call tts.convert --params '{"text":"hi","channel":"bluebubbles"}'
openclaw gateway call tts.convert --params '{"text":"hi","channel":"bluebubbles","target":"voice-note"}'
openclaw gateway call tts.convert --params '{"text":"[[audio_as_voice]] hi","channel":"bluebubbles"}'

The agent-facing tts tool gives the same result: on BlueBubbles delivery, the tool result details show provider: "elevenlabs" but no audioAsVoice flag in details.media, and BlueBubbles renders a generic audio attachment instead of a native voice memo.

Suggested fix

Two complementary changes that together unblock the pipeline:

  1. Auto-target voice-note for voice-capable channels (or expose target on the agent surface). When textToSpeech({ channel }) is called with a channel whose downstream supports voice-memo rendering (BlueBubbles, WhatsApp, Telegram voice notes, etc.), set target = "voice-note" by default. Alternatively/additionally, expose target as a parameter on tts.convert and the bundled tts agent tool's input schema so callers can opt in explicitly. Also consider honoring [[audio_as_voice]] reply directives at the synthesis stage (today they only affect downstream delivery).

  2. Honor outputFormat override for voice-note in ElevenLabs (and friends), and align fileExtension. In extensions/elevenlabs/speech-provider.ts:469-513, derive fileExtension from the resolved outputFormat rather than hardcoding .opus for voice-note. That lets users pin outputFormat: "mp3_44100_128" and have ElevenLabs return MP3 with .mp3 extension while still marking voiceCompatible: true. (Optional: add a sibling voiceNoteOutputFormat config field matching the Azure provider's pattern, for symmetry.)
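
For the second change, deriving the extension from the resolved format might look like the sketch below. extensionForOutputFormat is a hypothetical helper, not existing code; it assumes the codec is the prefix before the first underscore, which holds for the two format strings quoted in this issue, and it deliberately returns undefined for anything else so the caller can keep its current default.

```typescript
// Hypothetical helper: map a resolved output format string to a file extension.
// Only the two codecs that appear in this issue are recognized.
function extensionForOutputFormat(outputFormat: string): string | undefined {
  const format = outputFormat.toLowerCase();
  if (format.startsWith("mp3_")) {
    return ".mp3";
  }
  if (format.startsWith("opus_")) {
    return ".opus";
  }
  return undefined; // unknown codec: let the caller fall back to its existing default
}

console.log(extensionForOutputFormat("mp3_44100_128"), extensionForOutputFormat("opus_48000_64"));
```

With a helper along these lines, a pinned outputFormat of mp3_44100_128 would yield a .mp3 filename even when target is voice-note, which is what the BlueBubbles check needs.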

Both changes are relatively contained, but neither is sufficient alone: closing only Gap 1 routes us into the opus-rejection trap, and closing only Gap 2 fixes a code path that stays unreachable until Gap 1 is closed.

No PII

All voice IDs, key material, file paths, and account-specific identifiers are placeholders. Reproduces on a clean LaunchAgent install with any ElevenLabs voice and a BlueBubbles server with the private API enabled.
