fix(tts): pre-transcode synthesized audio to channel-preferred container before voice-memo delivery #72586

Merged: omarshahine merged 1 commit into openclaw:main from omarshahine:fix/72506-bb-pre-encode-caf-tts on Apr 27, 2026
Conversation

@omarshahine omarshahine commented Apr 27, 2026

Contributor

Summary

Fixes #72506. After end-to-end testing on a real macOS + BlueBubbles + ElevenLabs stack, voice-memo replies from agents now render as native iMessage voice-memo bubbles (waveform UI, real duration) instead of plain file attachments.

The fix is a small, opt-in channel capability (tts.voice.preferAudioFileFormat) plus a macOS afconvert-backed pre-transcode in the speech-core pipeline. BlueBubbles declares preferAudioFileFormat: "caf" and the speech-core layer transcodes synthesized MP3 to opus-in-CAF before handing the file to the channel. Other channels are unaffected.

Diagnostic journey

The discovery process iterated through three CAF flavors. The descriptor block matters at every hop along OpenClaw → BlueBubbles server → Messages.app private API → iMessage:

| Pre-encoded CAF flavor | BlueBubbles' internal CAF→MP3 conversion | iMessage rendering |
|---|---|---|
| (no fix; MP3 + isAudioMessage) | Renames to .caf, conversion fails (race) | Plain audio attachment |
| PCM int16 @ 44.1 kHz | Conversion fails | Voice-memo bubble, 0 s duration |
| AAC @ 22.05 kHz mono | Conversion succeeds → silent downgrade | Plain audio attachment |
| Opus @ 24 kHz mono | n/a (passes through) | Native voice memo, real duration + waveform |

What unlocked it was inspecting an Apple-recorded voice memo: the native iMessage Audio Message.caf that Apple's Messages.app produces when the user holds the mic button. Its descriptor is exactly 1 ch, 24000 Hz, opus, 480 frames/packet, and afconvert -f caff -d opus@24000 -c 1 produces a byte-identical container. iMessage uses that descriptor block as its native voice-memo recognizer; anything else gets downgraded somewhere along the path.
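To make the descriptor check concrete, here is a minimal TypeScript sketch that reads the Audio Description chunk of a CAF buffer and compares it with the values quoted above. It assumes the standard CAF layout (8-byte file header, 12-byte chunk header, 32-byte `desc` body, all big-endian) and that Opus is stored under the four-character code `opus`. `readCafDescription` and `looksLikeVoiceMemo` are hypothetical names for illustration and are not part of this PR.

```typescript
// Hypothetical helpers for illustration; not code from this PR.
interface CafDescription {
  sampleRate: number;
  formatId: string;
  framesPerPacket: number;
  channelsPerFrame: number;
}

function readCafDescription(bytes: Uint8Array): CafDescription | undefined {
  // 8-byte file header + 12-byte chunk header + 32-byte description body.
  if (bytes.byteLength < 52) return undefined;
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const fourCc = (offset: number): string =>
    String.fromCharCode(
      view.getUint8(offset),
      view.getUint8(offset + 1),
      view.getUint8(offset + 2),
      view.getUint8(offset + 3),
    );

  // File header: "caff" tag followed by a 16-bit version (expected to be 1).
  if (fourCc(0) !== "caff" || view.getUint16(4) !== 1) return undefined;
  // Assumption: the description chunk comes first and is exactly 32 bytes.
  if (fourCc(8) !== "desc") return undefined;
  if (view.getUint32(12) !== 0 || view.getUint32(16) !== 32) return undefined;

  return {
    sampleRate: view.getFloat64(20),
    formatId: fourCc(28),
    framesPerPacket: view.getUint32(40),
    channelsPerFrame: view.getUint32(44),
  };
}

// Compares against the descriptor quoted in the PR description.
function looksLikeVoiceMemo(bytes: Uint8Array): boolean {
  const d = readCafDescription(bytes);
  return (
    d !== undefined &&
    d.formatId === "opus" &&
    d.sampleRate === 24000 &&
    d.channelsPerFrame === 1 &&
    d.framesPerPacket === 480
  );
}
```

At 24 kHz, 480 frames per packet works out to 20 ms of audio per packet. A check like this could be used in a test to assert that a transcoded buffer carries the expected descriptor, assuming the layout above holds for the files in question.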

The AAC row in particular was the surprising one: BlueBubbles' internal CAF→MP3 conversion succeeded against AAC-CAF, and BlueBubbles' code path then sent the converted MP3 as audio/mp3 instead of forwarding the original CAF, silently downgrading from voice-memo bubble to plain attachment. PCM-CAF tripped the same conversion logic in the failure direction, which (counter-intuitively) made BlueBubbles fall back to forwarding the CAF — getting most of the way to a voice memo, except iMessage couldn't compute a duration from raw-PCM CAF, so the bubble showed 0 s.

A second, independent gap surfaced along the way: OpenClaw's auto-reply host-local-media validator uses the bundled file-type library to verify outbound buffers, and file-type v22 has no native CAF detector. Without the magic-byte fallback below, the validator drops the pre-transcoded buffer as an unknown binary blob and the agent ends up sending "⚠️ Media failed." instead of the voice memo. Adding a four-byte caff magic sniff in src/media/mime.ts returns audio/x-caf, which the validator already classifies as audio.

Pipeline pieces

  • src/channels/plugins/types.core.ts — extend ChannelTtsVoiceDeliveryCapabilities with optional preferAudioFileFormat?: string. Doc comment explains the intent.
  • extensions/speech-core/src/audio-transcode.ts (new) — transcodeAudioBuffer helper. macOS-only afconvert path; quietly returns undefined on any unsupported pair, missing platform, or process failure. Ships the MP3→CAF recipe used by BlueBubbles voice memos (-f caff -d opus@24000 -c 1) and a CAF→m4a fallback for symmetry with what BlueBubbles itself attempts.
  • extensions/speech-core/src/tts.ts — call the helper between synthesis and file-write inside textToSpeech. When transcoded, swap audioBuffer / fileExtension / outputFormat and use the new values for both the on-disk path and the shouldDeliverTtsAsVoice decision so the resulting audioAsVoice flag reflects the actual file shape that lands on the channel.
  • extensions/bluebubbles/src/channel-shared.ts — declare preferAudioFileFormat: "caf" on BlueBubbles capabilities, with a comment pointing at the Messages.app voice-memo descriptor so future readers know what the format choice protects.
  • src/media/mime.ts — add audio/x-caf → .caf to EXT_BY_MIME, plus a small caff-magic-bytes fallback in sniffMime so host-local validators recognize CAF as audio when file-type doesn't.
  • Tests:
    • extensions/speech-core/src/audio-transcode.test.ts (new) — covers the no-op cases (matching extensions, unsupported recipe, empty source) and platform-portable assertion that off-Darwin always returns undefined without invoking the binary.
    • src/media/mime.test.ts — adds two regression cases for the CAF magic-byte sniff (with and without a corroborating filename).
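The transcode step described in the list above can be sketched roughly as follows. This is a simplified illustration, not the PR's `transcodeAudioBuffer`: the names `pickRecipe` and `transcodeSketch` are hypothetical, and the only details taken from the PR are the `/usr/bin/afconvert` binary, the `-f caff -d opus@24000 -c 1` recipe, the argument order (recipe, input path, output path), and the behavior of returning `undefined` on any unsupported pair, non-macOS platform, or process failure.

```typescript
import { execFile } from "node:child_process";
import { mkdtemp, readFile, rm, writeFile } from "node:fs/promises";
import { tmpdir } from "node:os";
import * as path from "node:path";

// Recipe from the PR description: Opus at 24 kHz, mono, in a CAF container.
function pickRecipe(source: string, target: string): string[] | undefined {
  if (source === "mp3" && target === "caf") {
    return ["-f", "caff", "-d", "opus@24000", "-c", "1"];
  }
  return undefined; // unsupported pair: caller keeps its original buffer
}

// Hypothetical wrapper: resolves to the transcoded bytes, or undefined on any failure.
async function transcodeSketch(
  audio: Uint8Array,
  source: string,
  target: string,
  timeoutMs = 5000,
): Promise<Buffer | undefined> {
  const recipe = pickRecipe(source, target);
  if (process.platform !== "darwin" || !recipe || audio.byteLength === 0) {
    return undefined;
  }
  const dir = await mkdtemp(path.join(tmpdir(), "tts-transcode-"));
  const inPath = path.join(dir, `in.${source}`);
  const outPath = path.join(dir, `out.${target}`);
  try {
    await writeFile(inPath, audio, { mode: 0o600 });
    await new Promise<void>((resolve, reject) => {
      execFile(
        "/usr/bin/afconvert",
        [...recipe, inPath, outPath],
        { timeout: timeoutMs },
        (err) => (err ? reject(err) : resolve()),
      );
    });
    return await readFile(outPath);
  } catch {
    return undefined; // process failure or timeout: fall back silently
  } finally {
    await rm(dir, { recursive: true, force: true });
  }
}
```

Because `pickRecipe` only matches the literal pair `mp3` and `caf`, the extension strings interpolated into the temp file names are never attacker-shaped in this sketch.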

Behavior matrix

| Host platform | Channel preferAudioFileFormat | Source format | Result |
|---|---|---|---|
| macOS | caf | mp3 | Pre-transcoded to opus-in-CAF; uploaded with isAudioMessage=true; renders as native voice-memo bubble in iMessage |
| macOS | unset (other channels) | any | Unchanged behavior |
| Linux/Windows | caf | mp3 | transcodeAudioBuffer returns undefined; original MP3 buffer preserved (BlueBubbles is macOS-only anyway) |
| any | matches source already | any | Helper returns undefined; no extra work |
| any | recipe not implemented | any | Helper returns undefined; original buffer preserved |
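The matrix above can also be read as a small pure decision function. The sketch below is one possible encoding, assuming the only implemented recipe is MP3 to CAF; `planVoiceDelivery`, `DeliveryPlan`, and the reason strings are hypothetical names and do not appear in the PR.

```typescript
// Hypothetical encoding of the behavior matrix; names are illustrative only.
type DeliveryPlan =
  | { action: "keep"; reason: "no-preference" | "already-preferred" | "no-recipe" | "off-darwin" }
  | { action: "transcode"; from: string; to: string };

// Assumption: MP3 -> CAF is the only recipe wired up for pre-transcoding.
const IMPLEMENTED_RECIPES = new Set<string>(["mp3->caf"]);

function normalizeExt(ext: string): string {
  return ext.trim().toLowerCase().replace(/^\./, "");
}

function planVoiceDelivery(input: {
  platform: string;
  sourceExtension: string;
  preferAudioFileFormat?: string;
}): DeliveryPlan {
  const source = normalizeExt(input.sourceExtension);
  const preferred = input.preferAudioFileFormat
    ? normalizeExt(input.preferAudioFileFormat)
    : undefined;

  if (!preferred) return { action: "keep", reason: "no-preference" };
  if (preferred === source) return { action: "keep", reason: "already-preferred" };
  if (!IMPLEMENTED_RECIPES.has(`${source}->${preferred}`)) {
    return { action: "keep", reason: "no-recipe" };
  }
  if (input.platform !== "darwin") return { action: "keep", reason: "off-darwin" };
  return { action: "transcode", from: source, to: preferred };
}
```

Every "keep" outcome corresponds to a row where the helper returns undefined and the original buffer is preserved; only the first row of the matrix produces a transcode.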

Tests

  • pnpm exec vitest run src/media/mime.test.ts extensions/speech-core/src/audio-transcode.test.ts — 63/63 pass (includes existing tests; new cases for CAF sniff + transcode no-op paths).
  • pnpm exec tsc --noEmit -p tsconfig.json clean.
  • End-to-end manual on macOS Apple Silicon + BlueBubbles + ElevenLabs: [[tts:...]] directive in agent reply → native iMessage voice-memo bubble with real duration and waveform.

Test plan

  • Unit tests pass on macOS Apple Silicon
  • TypeScript checks pass
  • E2E: real device renders the result as a native voice-memo bubble
  • Reviewer with a BlueBubbles + macOS setup: send any TTS-tagged reply through any agent and confirm voice-memo bubble UI
  • Reviewer on Linux: confirm non-Darwin path returns the unchanged MP3 buffer (no regression for other channels)
  • Reviewer with Discord/Slack/Telegram TTS: confirm those channels continue to receive their existing format (no preferAudioFileFormat declared, no pre-transcode)

🤖 Generated with Claude Code

aisle-research-bot Bot commented Apr 27, 2026

🔒 Aisle Security Analysis

We found 4 potential security issue(s) in this PR:

| # | Severity | Title |
|---|---|---|
| 1 | 🟠 High | Path traversal / arbitrary file write via untrusted fileExtension in TTS temp file path |
| 2 | 🟡 Medium | Unbounded external transcoder invocation in TTS pre-transcode path (resource exhaustion) |
| 3 | 🟡 Medium | CAF MIME sniffing can be spoofed with only "caff" prefix, bypassing host-local media type validation |
| 4 | 🔵 Low | Log forging via unsanitized channel in verbose pre-transcode failure log |

1. 🟠 Path traversal / arbitrary file write via untrusted `fileExtension` in TTS temp file path
| Property | Value |
|---|---|
| Severity | High |
| CWE | CWE-22 |
| Location | extensions/speech-core/src/tts.ts:1134-1135 |

Description

textToSpeech() writes synthesized audio to a temporary directory using a filename that interpolates fileExtension without validation.

  • fileExtension comes from the selected speech provider (synthesis.fileExtension) and is treated as a string to append to the filename.
  • If a provider (including third-party plugins) returns a malicious extension containing path separators or traversal sequences (e.g. "/../../somefile"), path.join(tempDir, voice-...${fileExtension}) will normalize the path and can escape tempDir.
  • The resulting path is then passed to writeFileSync(), creating an arbitrary file write primitive within the permissions of the running process.

Vulnerable code:

const audioPath = path.join(tempDir, `voice-${Date.now()}${fileExtension}`);
writeFileSync(audioPath, audioBuffer);

While the new pre-transcode logic may override the extension in some cases, it falls back to the original unvalidated provider-supplied synthesis.fileExtension whenever transcoding is skipped or fails (including for invalid preferAudioFileFormat values).
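To illustrate the escape described above, the following sketch reproduces the path construction with POSIX path semantics and adds a containment check. `buildVoicePath` and `isInsideDir` are hypothetical helpers for illustration; the directory and extension values are made up.

```typescript
import * as path from "node:path";

// Mirrors the vulnerable pattern: the extension is interpolated unchecked.
function buildVoicePath(tempDir: string, stamp: number, fileExtension: string): string {
  return path.posix.join(tempDir, `voice-${stamp}${fileExtension}`);
}

// Hypothetical guard: true only when candidate resolves strictly below dir.
function isInsideDir(dir: string, candidate: string): boolean {
  const rel = path.posix.relative(dir, candidate);
  return rel !== "" && rel !== ".." && !rel.startsWith("../") && !path.posix.isAbsolute(rel);
}

// A benign extension stays inside the temp directory.
const benign = buildVoicePath("/tmp/tts", 1, ".mp3"); // "/tmp/tts/voice-1.mp3"

// A crafted "extension" climbs out: join() normalizes the ".." segments.
const escaped = buildVoicePath("/tmp/tts", 1, "/../../etc/cron.d/x"); // "/tmp/etc/cron.d/x"

console.log(isInsideDir("/tmp/tts", benign), isInsideDir("/tmp/tts", escaped));
```

A containment check like this is a defense-in-depth measure; it complements validating the extension itself and does not replace it.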

Recommendation

Validate and normalize fileExtension before using it in any filesystem path.

Suggested approach:

  • Strip a leading dot
  • Enforce a strict allowlist regex like /^[a-z0-9]{1,12}$/i
  • Re-add a single leading dot
  • Reject/throw on invalid values (or fall back to a safe default like .wav)

Example:

function safeExt(ext: string): string {
  const token = ext.trim().toLowerCase().replace(/^\./, "");
  if (!/^[a-z0-9]{1,12}$/.test(token)) {
    throw new Error(`Invalid synthesis fileExtension: ${ext}`);
  }
  return `.${token}`;
}

const safeFileExtension = safeExt(fileExtension);
const audioPath = path.join(tempDir, `voice-${Date.now()}${safeFileExtension}`);
writeFileSync(audioPath, audioBuffer, { mode: 0o600 });

Also consider applying the same validation at the provider boundary (when accepting synthesis.fileExtension) so the value is safe everywhere it is used.

2. 🟡 Unbounded external transcoder invocation in TTS pre-transcode path (resource exhaustion)
| Property | Value |
|---|---|
| Severity | Medium |
| CWE | CWE-400 |
| Location | extensions/speech-core/src/audio-transcode.ts:67-76 |

Description

The new macOS pre-transcode helper spawns an external process (/usr/bin/afconvert) and performs synchronous disk I/O for every eligible TTS request.

  • Inputs: audioBuffer originates from TTS synthesis (ultimately driven by user-provided text and channel delivery preferences).
  • Work performed per request:
    • Writes audioBuffer to disk (writeFileSync), then reads the output back (readFileSync).
    • Spawns afconvert (spawn) with up to a 5s timeout.
  • Issue: There is no explicit rate limiting, concurrency limiting, or maximum buffer size check on this code path. An attacker who can trigger TTS repeatedly (or trigger channels that request preferAudioFileFormat) can force repeated transcodes, causing CPU/disk exhaustion and blocking the Node.js event loop due to synchronous FS operations.

Vulnerable code:

writeFileSync(inPath, params.audioBuffer, { mode: 0o600 });
const result = await runAfconvert({
  args: [...recipe, inPath, outPath],
  timeoutMs: params.timeoutMs ?? 5000,
});
...
return { ok: true, buffer: readFileSync(outPath) };

Recommendation

Add explicit guardrails around transcoding to prevent resource exhaustion:

  • Enforce a maximum audioBuffer size eligible for transcoding (and skip pre-transcode above it).
  • Limit concurrency of afconvert invocations (e.g., a small semaphore/queue), and/or add per-channel/user rate limits.
  • Prefer async FS APIs to avoid blocking the event loop.

Example (sketch) using a semaphore + size cap:

import { promises as fs } from "node:fs";
import pLimit from "p-limit";

const limitAfconvert = pLimit(2); // at most 2 concurrent transcodes
const MAX_TRANSCODE_BYTES = 5 * 1024 * 1024;

export async function transcodeAudioBuffer(params: {...}) {
  if (params.audioBuffer.byteLength > MAX_TRANSCODE_BYTES) {
    return { ok: false, reason: "transcoder-failed", detail: "buffer-too-large" };
  }

  return limitAfconvert(async () => {
    await fs.writeFile(inPath, params.audioBuffer, { mode: 0o600 });
    const result = await runAfconvert(...);
    if (!result.ok) return { ok: false, reason: "transcoder-failed", detail: result.detail };
    return { ok: true, buffer: await fs.readFile(outPath) };
  });
}
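
If adding the p-limit dependency is undesirable, the same guardrails can be approximated with a few lines of plain TypeScript. The sketch below is a dependency-free alternative, not code from the PR or from the analysis above: `TranscodeLimiter` and `eligibleForTranscode` are hypothetical names, and the 5 MiB cap and concurrency of 2 are simply carried over from the example above.

```typescript
// Hypothetical size cap, mirroring the 5 MiB figure in the example above.
const MAX_TRANSCODE_BYTES = 5 * 1024 * 1024;

function eligibleForTranscode(byteLength: number): boolean {
  return byteLength > 0 && byteLength <= MAX_TRANSCODE_BYTES;
}

// Minimal promise-based semaphore: at most maxConcurrent tasks run at once.
class TranscodeLimiter {
  private active = 0;
  private readonly waiters: Array<() => void> = [];
  private readonly maxConcurrent: number;

  constructor(maxConcurrent: number) {
    this.maxConcurrent = maxConcurrent;
  }

  get running(): number {
    return this.active;
  }

  get waiting(): number {
    return this.waiters.length;
  }

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.active >= this.maxConcurrent) {
      // Park until a finishing task hands its slot over.
      await new Promise<void>((resolve) => this.waiters.push(resolve));
    } else {
      this.active++;
    }
    try {
      return await task();
    } finally {
      const next = this.waiters.shift();
      if (next) {
        next(); // slot passes directly to the next waiter; active is unchanged
      } else {
        this.active--;
      }
    }
  }
}

// Shared limiter for all transcodes in the process (value is an assumption).
const afconvertLimiter = new TranscodeLimiter(2);
```

A caller would first check `eligibleForTranscode(buffer.byteLength)` and then wrap the transcode in `afconvertLimiter.run(...)`. The queue here is unbounded, so a production version would likely also cap the number of waiters or reject when the queue is full.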
3. 🟡 CAF MIME sniffing can be spoofed with only "caff" prefix, bypassing host-local media type validation
| Property | Value |
|---|---|
| Severity | Medium |
| CWE | CWE-20 |
| Location | src/media/mime.ts:109-112 |

Description

detectMime() falls back to sniffKnownAudioMagic() when file-type does not return a MIME. The new CAF detector classifies any buffer beginning with ASCII caff as audio/x-caf.

This is security-relevant because detectMime({ buffer }) is used as a buffer-verification step in host-local media allowlisting (e.g. assertHostReadMediaAllowed in src/media/web-media.ts). With the current implementation:

  • Input: attacker-controlled bytes (local file contents, downloaded media, etc.)
  • Gate: host-local-media validator allows any audio/* when the sniffed kind is audio
  • Bypass: arbitrary binary content can be treated as allowed audio by prefixing the buffer with caff, even if it is not a valid CAF container

Vulnerable code:

function sniffKnownAudioMagic(buffer: Buffer): string | undefined {
  if (buffer.byteLength >= 4 && buffer.toString("ascii", 0, 4) === "caff") {
    return "audio/x-caf";
  }
  return undefined;
}

Impact depends on where hostReadCapability / host-local-media validation is relied upon to prevent reading/sending arbitrary local files. The intent (per comments) is to only allow buffer-verified media types; this change weakens that verification for CAF to a trivially forgeable 4-byte check.

Recommendation

Strengthen CAF validation so that audio/x-caf is only returned for buffers that plausibly conform to the CAF container structure, not just the magic tag.

Minimum recommended checks:

  • Require at least the full CAF file header ('caff' + version + flags = 8 bytes)
  • Validate the version/flags are in expected ranges (commonly version 1, flags 0)
  • Optionally require the first chunk header to exist and have a plausible chunk type (e.g. desc, data) and a sane chunk size

Example (lightweight structural validation):

function sniffKnownAudioMagic(buffer: Buffer): string | undefined {
  if (buffer.byteLength < 12) return undefined;
  if (buffer.toString("ascii", 0, 4) !== "caff") return undefined;

  const version = buffer.readUInt16BE(4);
  const flags = buffer.readUInt16BE(6);
  if (version !== 1) return undefined;
  if (flags !== 0) return undefined;

  // First chunk type (bytes 8..12) should be ASCII letters.
  const chunkType = buffer.toString("ascii", 8, 12);
  if (!/^[A-Za-z]{4}$/.test(chunkType)) return undefined;

  return "audio/x-caf";
}

If stronger assurance is needed for security gating, prefer parsing CAF via a dedicated library/parser or performing an actual decode/probe step (e.g., ffprobe) in the validator path.

4. 🔵 Log forging via unsanitized `channel` in verbose pre-transcode failure log
| Property | Value |
|---|---|
| Severity | Low |
| CWE | CWE-117 |
| Location | extensions/speech-core/src/tts.ts:1192-1194 |

Description

In maybePreTranscodeForVoiceDelivery, the verbose log line interpolates params.channel directly into a free-form log message.

  • params.channel can originate from external request/tool parameters (e.g. gateway tts.convert takes params.channel from the request and passes it through to textToSpeech)
  • logVerbose() ultimately prints to the console when verbose mode is enabled (console.log(theme.muted(message))), so embedded newlines/control characters in channel can create spoofed/misleading log lines or break downstream log parsing

Vulnerable code:

logVerbose(
  `TTS: pre-transcode ${sourceExt}->${preferred} for channel=${params.channel ?? "?"} failed: ${outcome.detail ?? "unknown"}`,
);

This is a classic log-injection/log-forging issue (CWE-117).

Recommendation

Avoid embedding potentially attacker-controlled values into free-form log strings. Prefer structured logging fields and/or sanitize control characters.

Option A (structured fields):

getLogger().debug(
  {
    sourceExt,
    preferred,
    channel: params.channel ?? "?",
    detail: outcome.detail ?? "unknown",
  },
  "TTS: pre-transcode failed",
);

Option B (sanitize for console output):

const safeChannel = (params.channel ?? "?").replace(/[\r\n\t\0]/g, " ");
logVerbose(`TTS: pre-transcode ${sourceExt}->${preferred} for channel=${safeChannel} failed: ${outcome.detail ?? "unknown"}`);

Ideally apply sanitization in logVerbose() itself so all callers benefit.
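
A centralized variant of Option B might look like the sketch below. It is an assumption-laden illustration: `sanitizeForLog` and `formatTranscodeFailure` are hypothetical helpers, and the real `logVerbose()` signature is not shown in this PR, so the sketch only builds the message string.

```typescript
// Hypothetical helper: collapse runs of ASCII control characters (including
// CR, LF, tab, NUL, and DEL) into a single space so one value stays on one line.
function sanitizeForLog(value: string): string {
  return value.replace(/[\u0000-\u001f\u007f]+/g, " ");
}

// Builds the failure message with every interpolated value sanitized.
function formatTranscodeFailure(params: {
  sourceExt: string;
  preferred: string;
  channel?: string;
  detail?: string;
}): string {
  const channel = sanitizeForLog(params.channel ?? "?");
  const detail = sanitizeForLog(params.detail ?? "unknown");
  return (
    `TTS: pre-transcode ${sanitizeForLog(params.sourceExt)}->${sanitizeForLog(params.preferred)} ` +
    `for channel=${channel} failed: ${detail}`
  );
}
```

Sanitizing `detail` as well as `channel` is a judgment call made here, since process error output can also contain newlines.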


Analyzed PR: #72586 at commit f099afd

Last updated on: 2026-04-27T19:46:18Z

greptile-apps Bot commented Apr 27, 2026

Contributor

Greptile Summary

This PR sidesteps the BlueBubbles server race condition by pre-transcoding synthesized TTS MP3 audio to CAF on macOS before upload, so the BlueBubbles server's immediate CAF→MP3 conversion attempt never races an incomplete write. The approach is well-scoped: opt-in via a new preferAudioFileFormat capability field, with a safe fallback to the original buffer on any platform, unsupported source/target pair, or process failure.

  • P1 — PCM output inflates file size significantly: The LEI16@44100 data format in pickAfconvertRecipe produces uncompressed 16-bit PCM at 44.1 kHz. A 10-second TTS clip expands from ~160 KB (MP3) to ~1.76 MB (PCM), and longer clips may approach iMessage attachment limits or cause noticeably slower uploads. Changing to -d aac would keep sizes comparable to the source MP3 while remaining fully Core Audio-compatible.
  • P2 — Source extension not checked in CAF recipe branch: The if (target === \"caf\") branch in pickAfconvertRecipe silently fires for any source format rather than the documented MP3-only path, which could attempt afconvert on non-decodable inputs and rely on the silent fallback instead of the explicit recipe guard.

Confidence Score: 3/5

Functionally sound but the PCM output format significantly inflates file sizes and should be verified against attachment limits before merging.

One P1 finding (LEI16@44100 data format produces files 10-50× larger than the source MP3, which can hit iMessage/BlueBubbles attachment limits for longer TTS clips) and one P2 (source param not checked in CAF recipe branch). P1 ceiling is 4; the P1 affects a core output of the new feature (the transcoded buffer size), pulling the score to 3.

extensions/speech-core/src/audio-transcode.ts — the pickAfconvertRecipe function's choice of LEI16@44100 and the missing source-format guard need review before merge.

Reviews (1): Last reviewed commit: "fix(tts): pre-transcode synthesized audi..."

Comment on lines +69 to +78
function pickAfconvertRecipe(source: string, target: string): string[] | undefined {
  // Currently only the MP3→CAF path used by BlueBubbles voice memos. Other
  // pairs intentionally fall through so callers keep their original buffer.
  if (target === "caf") {
    return ["-f", "caff", "-d", "LEI16@44100"];
  }
  if (source === "caf" && target === "m4a") {
    return ["-f", "m4af", "-d", "aac"];
  }
  return undefined;

P2 source parameter is ignored in the CAF branch

The first branch returns the same recipe for any source format when target === "caf", but the code comment and PR description both describe this specifically as the "MP3→CAF path." If the TTS synthesis layer ever emits a format that Core Audio cannot decode (e.g., ogg, opus, or a proprietary codec), the same [-f caff -d LEI16@44100] recipe will still be attempted — the afconvert process will fail, the catch block will return undefined, and the original buffer is silently preserved. While the fallback is safe, the function's signature promises source matters but never uses it in this branch. Adding a source check (source === "mp3") would make the guard explicit and prevent unexpected behavior when new TTS providers are added.

Suggested change
-function pickAfconvertRecipe(source: string, target: string): string[] | undefined {
-  // Currently only the MP3→CAF path used by BlueBubbles voice memos. Other
-  // pairs intentionally fall through so callers keep their original buffer.
-  if (target === "caf") {
-    return ["-f", "caff", "-d", "LEI16@44100"];
-  }
-  if (source === "caf" && target === "m4a") {
-    return ["-f", "m4af", "-d", "aac"];
-  }
-  return undefined;
+  if (source === "mp3" && target === "caf") {
+    return ["-f", "caff", "-d", "LEI16@44100"];
+  }

Comment on lines +72 to +73
  if (target === "caf") {
    return ["-f", "caff", "-d", "LEI16@44100"];


P1 Uncompressed PCM output inflates voice-memo file sizes significantly

-d LEI16@44100 forces the output data format to 16-bit linear PCM at 44100 Hz. TTS-synthesized MP3 is typically 16–22 kHz mono at 32–128 kbps. A 10-second TTS clip at 128 kbps is ~160 KB as MP3; the same clip re-encoded as PCM stereo at 44.1 kHz is ~1.76 MB — roughly 11× larger. For longer messages this approaches or exceeds common iMessage attachment limits, risks slower BlueBubbles upload times, and negates the storage efficiency of the original TTS provider output.

Consider using a lossy compressed output data format instead (e.g. aac in a CAF container), which keeps file sizes comparable to the source MP3 while remaining fully compatible with Core Audio and the iMessage voice-memo renderer.

Suggested change
-  if (target === "caf") {
-    return ["-f", "caff", "-d", "LEI16@44100"];
+    return ["-f", "caff", "-d", "aac"];
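
The size figures in the comment above follow from simple arithmetic, sketched here as two small functions. The function names are hypothetical, and the numbers ignore container and header overhead; the inputs (128 kbps, 10 seconds, 16-bit samples at 44.1 kHz) are the ones quoted in the comment.

```typescript
// Bytes for a constant-bitrate compressed stream (ignores container overhead).
function compressedBytes(bitrateKbps: number, seconds: number): number {
  return ((bitrateKbps * 1000) / 8) * seconds;
}

// Bytes for uncompressed linear PCM (ignores container overhead).
function pcmBytes(
  sampleRateHz: number,
  bitsPerSample: number,
  channels: number,
  seconds: number,
): number {
  return sampleRateHz * (bitsPerSample / 8) * channels * seconds;
}

const mp3Size = compressedBytes(128, 10);        // 160,000 bytes (~160 KB)
const pcmStereo = pcmBytes(44100, 16, 2, 10);    // 1,764,000 bytes (~1.76 MB)
const pcmMono = pcmBytes(44100, 16, 1, 10);      // 882,000 bytes (~0.88 MB)

console.log(pcmStereo / mp3Size, pcmMono / mp3Size);
```

The roughly 11× figure assumes stereo output; for a mono clip the same calculation gives about 5.5×, which is still a substantial increase over the source MP3.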

Comment on lines +32 to +44
it("returns undefined off-platform without invoking afconvert", async () => {
  if (process.platform === "darwin") {
    // macOS: the recipe exists. Real afconvert will be exercised separately
    // via integration paths so this unit test stays platform-portable.
    return;
  }
  const result = await transcodeAudioBuffer({
    audioBuffer: Buffer.from("payload"),
    sourceExtension: "mp3",
    targetExtension: "caf",
  });
  expect(result).toBeUndefined();
});


P2 Happy-path conversion is never exercised in the unit-test suite

The test for "returns undefined off-platform without invoking afconvert" unconditionally returns early on macOS (the platform where afconvert is actually available). This means the only code path that produces a Buffer result — the successful mp3 → caf transcode — is entirely unexercised by the unit tests, and the PR relies on manual end-to-end verification instead. Adding a macOS-only test that passes a small valid MP3 payload and asserts result is an instance of Buffer would catch future regressions in the recipe flags or runAfconvert wiring without requiring a full integration run.


@omarshahine omarshahine force-pushed the fix/72506-bb-pre-encode-caf-tts branch 4 times, most recently from 05809d7 to a5de563 on April 27, 2026 17:48
…Message voice-memo bubbles via BlueBubbles

End-to-end testing on a real macOS + BlueBubbles setup walked through three
CAF flavors before landing on the format Apple's Messages.app actually emits
when a user records a native iMessage voice memo:

| Pre-encoded CAF flavor   | BlueBubbles internal CAF→MP3 conversion | iMessage rendering             |
|--------------------------|------------------------------------------|--------------------------------|
| (no fix; MP3 + isAudio)  | Renames to .caf, conversion fails (race) | Plain audio attachment         |
| PCM int16 @ 44.1 kHz     | Conversion fails                         | Voice-memo bubble, **0 s** time|
| AAC @ 22.05 kHz mono     | Conversion succeeds → silent downgrade    | Plain audio attachment         |
| **Opus @ 24 kHz mono**   | n/a — accepted as-is                     | **Native voice memo, real time + waveform** |

The descriptor block of an Apple-recorded voice memo is exactly
`1 ch, 24000 Hz, opus, 480 frames/packet`, and `afconvert -f caff -d opus@24000 -c 1`
produces a byte-identical container. iMessage uses that descriptor block as
its signal that the attachment is a voice memo, so anything else (PCM, AAC,
MP3) gets downgraded somewhere along the BlueBubbles → Messages.app path.

Also adds a magic-byte sniff for the CAF container in `src/media/mime.ts`
(`caff` ASCII tag → `audio/x-caf`). Without it the auto-reply host-local-
media validator drops the pre-transcoded buffer because the bundled
`file-type` library has no native CAF detector and returns `undefined`,
which the validator treats as an unknown binary blob and refuses to forward
("⚠️ Media failed.").

Pipeline pieces:

- `src/channels/plugins/types.core.ts` — extend `ChannelTtsVoiceDeliveryCapabilities`
  with optional `preferAudioFileFormat?: string`.
- `extensions/speech-core/src/audio-transcode.ts` (new) — `transcodeAudioBuffer`
  helper. macOS-only `afconvert` path; quietly returns `undefined` on any
  unsupported pair, missing platform, or process failure. Ships the MP3→CAF
  recipe used by BlueBubbles voice memos plus a CAF→m4a fallback for
  symmetry with what BlueBubbles itself attempts.
- `extensions/speech-core/src/tts.ts` — call the helper between synthesis
  and file-write inside `textToSpeech`. When transcoded, swap `audioBuffer`
  / `fileExtension` / `outputFormat` and use the new values for both the
  on-disk path and the `shouldDeliverTtsAsVoice` decision so the resulting
  `audioAsVoice` flag reflects the actual file shape that lands on the
  channel.
- `extensions/bluebubbles/src/channel-shared.ts` — declare
  `preferAudioFileFormat: "caf"` on BlueBubbles capabilities.
- `src/media/mime.ts` — `audio/x-caf` mapping plus a fallback `caff` magic-
  byte sniff so host-local validators recognize CAF as audio.
- Tests: new `audio-transcode.test.ts` covers the no-op cases and the
  off-Darwin fallback; new mime cases assert the CAF magic-byte sniff with
  and without a corroborating filename.

Falls back to the original buffer when the host platform, the source/target
pair, or the transcoder process can't produce the preferred container — so
non-Darwin hosts and unsupported provider combinations are unaffected.

BlueBubbles is the only currently affected channel and now declares
`preferAudioFileFormat: "caf"`. Other channels are unchanged.

Fixes openclaw#72506.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@omarshahine omarshahine force-pushed the fix/72506-bb-pre-encode-caf-tts branch from a5de563 to f099afd Compare April 27, 2026 19:44
@omarshahine omarshahine merged commit da3d17e into openclaw:main Apr 27, 2026
172 of 182 checks passed
@omarshahine

Contributor Author

Landed in da3d17e1ca.

Summary of the test plan run locally before merge:

  • 85/85 vitest cases green across the 3 touched test files (extensions/speech-core/src/audio-transcode.test.ts, extensions/speech-core/src/tts.test.ts, src/media/mime.test.ts)
  • pnpm exec tsc --noEmit clean
  • lint:tmp:no-random-messaging green (the check that originally blocked CI)
  • End-to-end on macOS + BlueBubbles: native iMessage voice-memo bubble with correct duration + waveform

CI on the merge SHA had 2 persistent flakes unrelated to this change. Both pass locally on the PR branch and on origin/main, and neither test path is touched by this PR:

  • src/gateway/http-utils.authorize-request.test.ts (vitest vi.fn() spy assertion order)
  • src/infra/install-package-dir.test.ts filesystem-race test

These have been observed to flake on main itself; not introduced by this PR. Worth tracking separately.

@omarshahine omarshahine deleted the fix/72506-bb-pre-encode-caf-tts branch April 27, 2026 21:15
ogt-redknie pushed a commit to ogt-redknie/OPENX that referenced this pull request May 2, 2026
…Message voice-memo bubbles via BlueBubbles (openclaw#72586)

End-to-end testing on macOS + BlueBubbles + ElevenLabs walked through three CAF flavors before landing on the format Apple's Messages.app actually emits when a user records a native iMessage voice memo:

- PCM int16 @ 44.1 kHz CAF: BlueBubbles' internal `afconvert -f m4af -d aac` conversion fails; the original CAF reaches iMessage but renders with 0 s duration.
- AAC @ 22.05 kHz mono CAF: BlueBubbles' conversion succeeds and the server silently downgrades the delivery, sending the converted MP3 as a generic audio attachment.
- **Opus @ 24 kHz mono CAF**: byte-identical to the descriptor block Apple's Messages.app produces; BlueBubbles passes it through unchanged and iMessage renders a native voice-memo bubble with proper duration and waveform UI.

Adds an opt-in `tts.voice.preferAudioFileFormat` channel capability and a macOS `afconvert`-backed pre-transcode in the speech-core pipeline. BlueBubbles declares `preferAudioFileFormat: "caf"`. Other channels are unaffected. Falls back to the original buffer when the host platform, the source/target pair, or the transcoder process can't produce the preferred container — so non-Darwin hosts and unsupported provider combinations are unchanged.

Also adds a `caff` magic-byte sniff in `src/media/mime.ts` so the auto-reply host-local-media validator (which uses `file-type` and didn't recognize CAF natively) accepts the buffer instead of dropping it as "⚠️ Media failed."

Fixes openclaw#72506.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 9, 2026
globalcaos pushed a commit to globalcaos/tinkerclaw that referenced this pull request May 13, 2026
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 24, 2026
jameslcowan pushed a commit to jameslcowan/openclaw that referenced this pull request Jun 2, 2026
sablehead pushed a commit to sablehead/openclaw that referenced this pull request Jun 10, 2026

Labels

channel: bluebubbles Channel integration: bluebubbles maintainer Maintainer-authored PR size: M

Development

Successfully merging this pull request may close these issues.

BlueBubbles native iOS voice-memo delivery broken end-to-end with ElevenLabs (and other non-Azure TTS providers)
