fix(tts): pre-transcode synthesized audio to channel-preferred container before voice-memo delivery by omarshahine · Pull Request #72586 · openclaw/openclaw

omarshahine · 2026-04-27T04:24:12Z

Summary

Fixes #72506. After end-to-end testing on a real macOS + BlueBubbles + ElevenLabs stack, voice-memo replies from agents now render as native iMessage voice-memo bubbles (waveform UI, real duration) instead of plain file attachments.

The fix is a small, opt-in channel capability (tts.voice.preferAudioFileFormat) plus a macOS afconvert-backed pre-transcode in the speech-core pipeline. BlueBubbles declares preferAudioFileFormat: "caf" and the speech-core layer transcodes synthesized MP3 to opus-in-CAF before handing the file to the channel. Other channels are unaffected.

Diagnostic journey

The discovery process iterated through three CAF flavors. The descriptor block matters at every hop along OpenClaw → BlueBubbles server → Messages.app private API → iMessage:

Pre-encoded CAF flavor	BlueBubbles' internal CAF→MP3 conversion	iMessage rendering
(no fix; MP3 + isAudioMessage)	Renames to .caf, conversion fails (race)	Plain audio attachment
PCM int16 @ 44.1 kHz	Conversion fails	Voice-memo bubble, 0 s duration
AAC @ 22.05 kHz mono	Conversion succeeds → silent downgrade	Plain audio attachment
Opus @ 24 kHz mono	n/a — passes through	Native voice memo, real duration + waveform

What unlocked it was inspecting an Apple-recorded voice memo (the native iMessage Audio Message.caf that Apple's Messages.app produces when the user holds the mic button). The descriptor is exactly 1 ch, 24000 Hz, opus, 480 frames/packet, and afconvert -f caff -d opus@24000 -c 1 produces a byte-identical container. iMessage uses that descriptor block as its native voice-memo recognizer; anything else gets downgraded somewhere along the path.

The AAC row in particular was the surprising one: BlueBubbles' internal CAF→MP3 conversion succeeded against AAC-CAF, and BlueBubbles' code path then sent the converted MP3 as audio/mp3 instead of forwarding the original CAF, silently downgrading from voice-memo bubble to plain attachment. PCM-CAF tripped the same conversion logic in the failure direction, which (counter-intuitively) made BlueBubbles fall back to forwarding the CAF — getting most of the way to a voice memo, except iMessage couldn't compute a duration from raw-PCM CAF, so the bubble showed 0 s.
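For readers who want to check a file's descriptor themselves, here is a minimal sketch of reading the leading `desc` chunk of a CAF file. It assumes the layout from Apple's Core Audio Format specification (8-byte file header, 12-byte chunk header, 32-byte big-endian audio description). The function name `readCafDescription` and the `CafDescription` type are hypothetical and not part of this PR.

```typescript
// Hypothetical helper for illustration only; not code from this PR.
interface CafDescription {
  sampleRate: number;
  formatId: string;
  framesPerPacket: number;
  channelsPerFrame: number;
}

function readCafDescription(bytes: Uint8Array): CafDescription | undefined {
  // 8-byte file header + 12-byte chunk header + 32-byte audio description.
  if (bytes.byteLength < 52) return undefined;
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);

  // Read a four-character ASCII code at the given offset.
  const fourCc = (offset: number): string =>
    String.fromCharCode(
      view.getUint8(offset),
      view.getUint8(offset + 1),
      view.getUint8(offset + 2),
      view.getUint8(offset + 3),
    );

  if (fourCc(0) !== "caff") return undefined; // file type tag
  if (fourCc(8) !== "desc") return undefined; // first chunk must be the description

  const body = 20; // description starts after the file and chunk headers
  return {
    sampleRate: view.getFloat64(body, false), // mSampleRate, big-endian Float64
    formatId: fourCc(body + 8), // mFormatID, e.g. "opus", "aac ", "lpcm"
    framesPerPacket: view.getUint32(body + 20, false), // mFramesPerPacket
    channelsPerFrame: view.getUint32(body + 24, false), // mChannelsPerFrame
  };
}
```

Per the description above, an Apple-recorded voice memo would be expected to yield a sample rate of 24000, format `opus`, 480 frames per packet, and 1 channel.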

A second, independent gap surfaced along the way: OpenClaw's auto-reply host-local-media validator uses the bundled file-type library to verify outbound buffers, and file-type v22 has no native CAF detector. Without the magic-byte fallback below, the validator drops the pre-transcoded buffer as an unknown binary blob and the agent ends up sending "⚠️ Media failed." instead of the voice memo. Adding a four-byte caff magic sniff in src/media/mime.ts returns audio/x-caf, which the validator already classifies as audio.

Pipeline pieces

src/channels/plugins/types.core.ts — extend ChannelTtsVoiceDeliveryCapabilities with optional preferAudioFileFormat?: string. Doc comment explains the intent.
extensions/speech-core/src/audio-transcode.ts (new) — transcodeAudioBuffer helper. macOS-only afconvert path; quietly returns undefined on any unsupported pair, missing platform, or process failure. Ships the MP3→CAF recipe used by BlueBubbles voice memos (-f caff -d opus@24000 -c 1) and a CAF→m4a fallback for symmetry with what BlueBubbles itself attempts.
extensions/speech-core/src/tts.ts — call the helper between synthesis and file-write inside textToSpeech. When transcoded, swap audioBuffer / fileExtension / outputFormat and use the new values for both the on-disk path and the shouldDeliverTtsAsVoice decision so the resulting audioAsVoice flag reflects the actual file shape that lands on the channel.
extensions/bluebubbles/src/channel-shared.ts — declare preferAudioFileFormat: "caf" on BlueBubbles capabilities, with a comment pointing at the Messages.app voice-memo descriptor so future readers know what the format choice protects.
src/media/mime.ts — add audio/x-caf → .caf to EXT_BY_MIME, plus a small caff-magic-bytes fallback in sniffMime so host-local validators recognize CAF as audio when file-type doesn't.
Tests:
- extensions/speech-core/src/audio-transcode.test.ts (new) — covers the no-op cases (matching extensions, unsupported recipe, empty source) and a platform-portable assertion that off-Darwin always returns undefined without invoking the binary.
- src/media/mime.test.ts — adds two regression cases for the CAF magic-byte sniff (with and without a corroborating filename).
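The swap step described for tts.ts can be sketched as follows. This is a simplified illustration, not the PR's code: the `SynthesizedAudio` shape and the `swapIfTranscoded` name are assumptions. The point it shows is that the buffer, extension, and output format change together or not at all, so the later voice-delivery decision sees the file that is actually written.

```typescript
// Hypothetical shapes for illustration; the real speech-core types may differ.
interface SynthesizedAudio {
  audioBuffer: Uint8Array;
  fileExtension: string; // with leading dot, e.g. ".mp3"
  outputFormat: string;
}

// Returns the original synthesis untouched when the transcoder produced
// nothing; otherwise swaps all three fields as one unit.
function swapIfTranscoded(
  synthesis: SynthesizedAudio,
  targetFormat: string,
  transcoded: Uint8Array | undefined,
): SynthesizedAudio {
  if (transcoded === undefined) return synthesis;
  return {
    audioBuffer: transcoded,
    fileExtension: `.${targetFormat}`,
    outputFormat: targetFormat,
  };
}
```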

Behavior matrix

Host platform	Channel `preferAudioFileFormat`	Source format	Result
macOS	`caf`	mp3	Pre-transcoded to opus-in-CAF; uploaded with `isAudioMessage=true`; renders as native voice-memo bubble in iMessage
macOS	unset (other channels)	any	Unchanged behavior
Linux/Windows	`caf`	mp3	`transcodeAudioBuffer` returns `undefined`; original MP3 buffer preserved (BlueBubbles is macOS-only anyway)
any	matches source already	any	Helper returns `undefined`; no extra work
any	recipe not implemented	any	Helper returns `undefined`; original buffer preserved
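The matrix above can also be read as a small decision function. The sketch below is one way to encode it. The names `decidePreTranscode` and `PreTranscodeDecision` and the injected `hasRecipe` predicate are illustrative assumptions, and the merged helper may order its checks differently.

```typescript
// Hypothetical encoding of the behavior matrix; not the PR's implementation.
type PreTranscodeDecision =
  | "transcode"
  | "skip:no-preference"
  | "skip:already-matches"
  | "skip:non-darwin"
  | "skip:no-recipe";

function decidePreTranscode(params: {
  platform: string; // e.g. process.platform
  preferred: string | undefined; // channel's preferAudioFileFormat
  source: string; // synthesized format, e.g. "mp3"
  hasRecipe: (source: string, target: string) => boolean;
}): PreTranscodeDecision {
  const { platform, preferred, source, hasRecipe } = params;
  if (!preferred) return "skip:no-preference"; // other channels: unchanged
  if (preferred === source) return "skip:already-matches"; // no extra work
  if (platform !== "darwin") return "skip:non-darwin"; // afconvert is macOS-only
  if (!hasRecipe(source, preferred)) return "skip:no-recipe";
  return "transcode";
}
```

Every `skip:*` outcome corresponds to a row where the helper returns `undefined` and the original buffer is preserved.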

Tests

pnpm exec vitest run src/media/mime.test.ts extensions/speech-core/src/audio-transcode.test.ts — 63/63 pass (includes existing tests; new cases for CAF sniff + transcode no-op paths).
pnpm exec tsc --noEmit -p tsconfig.json clean.
End-to-end manual on macOS Apple Silicon + BlueBubbles + ElevenLabs: [[tts:...]] directive in agent reply → native iMessage voice-memo bubble with real duration and waveform.

Test plan

Unit tests pass on macOS Apple Silicon
TypeScript checks pass
E2E: real device renders the result as a native voice-memo bubble
Reviewer with a BlueBubbles + macOS setup: send any TTS-tagged reply through any agent and confirm voice-memo bubble UI
Reviewer on Linux: confirm non-Darwin path returns the unchanged MP3 buffer (no regression for other channels)
Reviewer with Discord/Slack/Telegram TTS: confirm those channels continue to receive their existing format (no preferAudioFileFormat declared, no pre-transcode)

🤖 Generated with Claude Code

aisle-research-bot · 2026-04-27T04:24:18Z

🔒 Aisle Security Analysis

We found 4 potential security issue(s) in this PR:

#	Severity	Title
1	🟠 High	Path traversal / arbitrary file write via untrusted `fileExtension` in TTS temp file path
2	🟡 Medium	Unbounded external transcoder invocation in TTS pre-transcode path (resource exhaustion)
3	🟡 Medium	CAF MIME sniffing can be spoofed with only "caff" prefix, bypassing host-local media type validation
4	🔵 Low	Log forging via unsanitized `channel` in verbose pre-transcode failure log

1. 🟠 Path traversal / arbitrary file write via untrusted `fileExtension` in TTS temp file path

Property	Value
Severity	High
CWE	CWE-22
Location	`extensions/speech-core/src/tts.ts:1134-1135`

Description

textToSpeech() writes synthesized audio to a temporary directory using a filename that interpolates fileExtension without validation.

fileExtension comes from the selected speech provider (synthesis.fileExtension) and is treated as a string to append to the filename.
If a provider (including third-party plugins) returns a malicious extension containing path separators or traversal sequences (e.g. "/../../somefile"), path.join(tempDir, voice-...${fileExtension}) will normalize the path and can escape tempDir.
The resulting path is then passed to writeFileSync(), creating an arbitrary file write primitive within the permissions of the running process.

Vulnerable code:

const audioPath = path.join(tempDir, `voice-${Date.now()}${fileExtension}`);
writeFileSync(audioPath, audioBuffer);

While the new pre-transcode logic may override the extension in some cases, it falls back to the original unvalidated provider-supplied synthesis.fileExtension whenever transcoding is skipped or fails (including for invalid preferAudioFileFormat values).

Recommendation

Validate and normalize fileExtension before using it in any filesystem path.

Suggested approach:

Strip a leading dot
Enforce a strict allowlist regex like /^[a-z0-9]{1,12}$/i
Re-add a single leading dot
Reject/throw on invalid values (or fall back to a safe default like .wav)

Example:

function safeExt(ext: string): string {
  const token = ext.trim().toLowerCase().replace(/^\./, "");
  if (!/^[a-z0-9]{1,12}$/.test(token)) {
    throw new Error(`Invalid synthesis fileExtension: ${ext}`);
  }
  return `.${token}`;
}

const safeFileExtension = safeExt(fileExtension);
const audioPath = path.join(tempDir, `voice-${Date.now()}${safeFileExtension}`);
writeFileSync(audioPath, audioBuffer, { mode: 0o600 });

Also consider applying the same validation at the provider boundary (when accepting synthesis.fileExtension) so the value is safe everywhere it is used.

2. 🟡 Unbounded external transcoder invocation in TTS pre-transcode path (resource exhaustion)

Property	Value
Severity	Medium
CWE	CWE-400
Location	`extensions/speech-core/src/audio-transcode.ts:67-76`

Description

The new macOS pre-transcode helper spawns an external process (/usr/bin/afconvert) and performs synchronous disk I/O for every eligible TTS request.

Inputs: audioBuffer originates from TTS synthesis (ultimately driven by user-provided text and channel delivery preferences).
Work performed per request:
- Writes audioBuffer to disk (writeFileSync), then reads the output back (readFileSync).
- Spawns afconvert (spawn) with up to a 5s timeout.
Issue: There is no explicit rate limiting, concurrency limiting, or maximum buffer size check on this code path. An attacker who can trigger TTS repeatedly (or trigger channels that request preferAudioFileFormat) can force repeated transcodes, causing CPU/disk exhaustion and blocking the Node.js event loop due to synchronous FS operations.

Vulnerable code:

writeFileSync(inPath, params.audioBuffer, { mode: 0o600 });
const result = await runAfconvert({
  args: [...recipe, inPath, outPath],
  timeoutMs: params.timeoutMs ?? 5000,
});
...
return { ok: true, buffer: readFileSync(outPath) };

Recommendation

Add explicit guardrails around transcoding to prevent resource exhaustion:

Enforce a maximum audioBuffer size eligible for transcoding (and skip pre-transcode above it).
Limit concurrency of afconvert invocations (e.g., a small semaphore/queue), and/or add per-channel/user rate limits.
Prefer async FS APIs to avoid blocking the event loop.

Example (sketch) using a semaphore + size cap:

import { promises as fs } from "node:fs";
import pLimit from "p-limit";

const limitAfconvert = pLimit(2); // at most 2 concurrent transcodes
const MAX_TRANSCODE_BYTES = 5 * 1024 * 1024;

export async function transcodeAudioBuffer(params: {...}) {
  if (params.audioBuffer.byteLength > MAX_TRANSCODE_BYTES) {
    return { ok: false, reason: "transcoder-failed", detail: "buffer-too-large" };
  }

  return limitAfconvert(async () => {
    await fs.writeFile(inPath, params.audioBuffer, { mode: 0o600 });
    const result = await runAfconvert(...);
    if (!result.ok) return { ok: false, reason: "transcoder-failed", detail: result.detail };
    return { ok: true, buffer: await fs.readFile(outPath) };
  });
}

3. 🟡 CAF MIME sniffing can be spoofed with only "caff" prefix, bypassing host-local media type validation

Property	Value
Severity	Medium
CWE	CWE-20
Location	`src/media/mime.ts:109-112`

Description

detectMime() falls back to sniffKnownAudioMagic() when file-type does not return a MIME. The new CAF detector classifies any buffer beginning with ASCII caff as audio/x-caf.

This is security-relevant because detectMime({ buffer }) is used as a buffer-verification step in host-local media allowlisting (e.g. assertHostReadMediaAllowed in src/media/web-media.ts). With the current implementation:

Input: attacker-controlled bytes (local file contents, downloaded media, etc.)
Gate: host-local-media validator allows any audio/* when the sniffed kind is audio
Bypass: arbitrary binary content can be treated as allowed audio by prefixing the buffer with caff, even if it is not a valid CAF container

Vulnerable code:

function sniffKnownAudioMagic(buffer: Buffer): string | undefined {
  if (buffer.byteLength >= 4 && buffer.toString("ascii", 0, 4) === "caff") {
    return "audio/x-caf";
  }
  return undefined;
}

Impact depends on where hostReadCapability / host-local-media validation is relied upon to prevent reading/sending arbitrary local files. The intent (per comments) is to only allow buffer-verified media types; this change weakens that verification for CAF to a trivially forgeable 4-byte check.

Recommendation

Strengthen CAF validation so that audio/x-caf is only returned for buffers that plausibly conform to the CAF container structure, not just the magic tag.

Minimum recommended checks:

Require at least the full CAF file header ('caff' + version + flags = 8 bytes)
Validate the version/flags are in expected ranges (commonly version 1, flags 0)
Optionally require the first chunk header to exist and have a plausible chunk type (e.g. desc, data) and a sane chunk size

Example (lightweight structural validation):

function sniffKnownAudioMagic(buffer: Buffer): string | undefined {
  if (buffer.byteLength < 12) return undefined;
  if (buffer.toString("ascii", 0, 4) !== "caff") return undefined;

  const version = buffer.readUInt16BE(4);
  const flags = buffer.readUInt16BE(6);
  if (version !== 1) return undefined;
  if (flags !== 0) return undefined;

  // First chunk type (bytes 8..12) should be ASCII letters.
  const chunkType = buffer.toString("ascii", 8, 12);
  if (!/^[A-Za-z]{4}$/.test(chunkType)) return undefined;

  return "audio/x-caf";
}

If stronger assurance is needed for security gating, prefer parsing CAF via a dedicated library/parser or performing an actual decode/probe step (e.g., ffprobe) in the validator path.

4. 🔵 Log forging via unsanitized `channel` in verbose pre-transcode failure log

Property	Value
Severity	Low
CWE	CWE-117
Location	`extensions/speech-core/src/tts.ts:1192-1194`

Description

In maybePreTranscodeForVoiceDelivery, the verbose log line interpolates params.channel directly into a free-form log message.

params.channel can originate from external request/tool parameters (e.g. gateway tts.convert takes params.channel from the request and passes it through to textToSpeech)
logVerbose() ultimately prints to the console when verbose mode is enabled (console.log(theme.muted(message))), so embedded newlines/control characters in channel can create spoofed/misleading log lines or break downstream log parsing

Vulnerable code:

logVerbose(
  `TTS: pre-transcode ${sourceExt}->${preferred} for channel=${params.channel ?? "?"} failed: ${outcome.detail ?? "unknown"}`,
);

This is a classic log-injection/log-forging issue (CWE-117).

Recommendation

Avoid embedding potentially attacker-controlled values into free-form log strings. Prefer structured logging fields and/or sanitize control characters.

Option A (structured fields):

getLogger().debug(
  {
    sourceExt,
    preferred,
    channel: params.channel ?? "?",
    detail: outcome.detail ?? "unknown",
  },
  "TTS: pre-transcode failed",
);

Option B (sanitize for console output):

const safeChannel = (params.channel ?? "?").replace(/[\r\n\t\0]/g, " ");
logVerbose(`TTS: pre-transcode ${sourceExt}->${preferred} for channel=${safeChannel} failed: ${outcome.detail ?? "unknown"}`);

Ideally apply sanitization in logVerbose() itself so all callers benefit.

Analyzed PR: #72586 at commit f099afd

Last updated on: 2026-04-27T19:46:18Z

greptile-apps · 2026-04-27T04:27:39Z

Greptile Summary

This PR sidesteps the BlueBubbles server race condition by pre-transcoding synthesized TTS MP3 audio to CAF on macOS before upload, so the BlueBubbles server's immediate CAF→MP3 conversion attempt never races an incomplete write. The approach is well-scoped: opt-in via a new preferAudioFileFormat capability field, with a safe fallback to the original buffer on any platform, unsupported source/target pair, or process failure.

P1 — PCM output inflates file size significantly: The LEI16@44100 data format in pickAfconvertRecipe produces uncompressed 16-bit PCM at 44.1 kHz. A 10-second TTS clip expands from ~160 KB (MP3) to ~1.76 MB (PCM), and longer clips may approach iMessage attachment limits or cause noticeably slower uploads. Changing to -d aac would keep sizes comparable to the source MP3 while remaining fully Core Audio-compatible.
P2 — Source extension not checked in CAF recipe branch: The if (target === "caf") branch in pickAfconvertRecipe silently fires for any source format rather than the documented MP3-only path, which could attempt afconvert on non-decodable inputs and rely on the silent fallback instead of the explicit recipe guard.

Confidence Score: 3/5

Functionally sound but the PCM output format significantly inflates file sizes and should be verified against attachment limits before merging.

One P1 finding (LEI16@44100 data format produces files 10-50× larger than the source MP3, which can hit iMessage/BlueBubbles attachment limits for longer TTS clips) and one P2 (source param not checked in CAF recipe branch). P1 ceiling is 4; the P1 affects a core output of the new feature (the transcoded buffer size), pulling the score to 3.

extensions/speech-core/src/audio-transcode.ts — the pickAfconvertRecipe function's choice of LEI16@44100 and the missing source-format guard need review before merge.
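The size figures quoted in this review follow from simple arithmetic, sketched below. The helper names are illustrative and container overhead is ignored. Note that the ~1.76 MB figure corresponds to two channels; a mono clip at the same rate and bit depth would be half that.

```typescript
// Size of uncompressed linear PCM audio in bytes (no container overhead).
function pcmBytes(sampleRateHz: number, bitsPerSample: number, channels: number, seconds: number): number {
  return sampleRateHz * (bitsPerSample / 8) * channels * seconds;
}

// Size of constant-bitrate compressed audio in bytes (no container overhead).
function cbrBytes(bitsPerSecond: number, seconds: number): number {
  return (bitsPerSecond / 8) * seconds;
}

const mp3At128k = cbrBytes(128_000, 10); // 160,000 bytes, about 160 KB
const pcmStereo = pcmBytes(44_100, 16, 2, 10); // 1,764,000 bytes, about 1.76 MB
const pcmMono = pcmBytes(44_100, 16, 1, 10); // 882,000 bytes, about 0.88 MB

console.log(pcmStereo / mp3At128k); // roughly 11x for a 128 kbps source
console.log(pcmStereo / cbrBytes(32_000, 10)); // roughly 44x for a 32 kbps source
```

The two ratios bracket the "10-50×" range given in the confidence note.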

Reviews (1): Last reviewed commit: "fix(tts): pre-transcode synthesized audi..."

greptile-apps · 2026-04-27T04:27:43Z

+function pickAfconvertRecipe(source: string, target: string): string[] | undefined {
+  // Currently only the MP3→CAF path used by BlueBubbles voice memos. Other
+  // pairs intentionally fall through so callers keep their original buffer.
+  if (target === "caf") {
+    return ["-f", "caff", "-d", "LEI16@44100"];
+  }
+  if (source === "caf" && target === "m4a") {
+    return ["-f", "m4af", "-d", "aac"];
+  }
+  return undefined;


source parameter is ignored in the CAF branch

The first branch returns the same recipe for any source format when target === "caf", but the code comment and PR description both describe this specifically as the "MP3→CAF path." If the TTS synthesis layer ever emits a format that Core Audio cannot decode (e.g., ogg, opus, or a proprietary codec), the same [-f caff -d LEI16@44100] recipe will still be attempted — the afconvert process will fail, the catch block will return undefined, and the original buffer is silently preserved. While the fallback is safe, the function's signature promises source matters but never uses it in this branch. Adding a source check (source === "mp3") would make the guard explicit and prevent unexpected behavior when new TTS providers are added.

Suggested change

-function pickAfconvertRecipe(source: string, target: string): string[] | undefined {
-  // Currently only the MP3→CAF path used by BlueBubbles voice memos. Other
-  // pairs intentionally fall through so callers keep their original buffer.
-  if (target === "caf") {
-    return ["-f", "caff", "-d", "LEI16@44100"];
-  }
-  if (source === "caf" && target === "m4a") {
-    return ["-f", "m4af", "-d", "aac"];
-  }
-  return undefined;
+  if (source === "mp3" && target === "caf") {
+    return ["-f", "caff", "-d", "LEI16@44100"];
+  }
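One way to make the source format matter in every branch is to key recipes by the source/target pair. The sketch below is an assumption about how that could look, not the merged code. `AFCONVERT_RECIPES` and `pickRecipe` are hypothetical names, and the MP3→CAF flags are the ones the PR description lists (`-f caff -d opus@24000 -c 1`) rather than the `LEI16@44100` recipe in the commit reviewed here.

```typescript
// Hypothetical recipe table keyed by "source->target"; not the PR's code.
const AFCONVERT_RECIPES = new Map<string, readonly string[]>([
  // Flags as given in the PR description for BlueBubbles voice memos.
  ["mp3->caf", ["-f", "caff", "-d", "opus@24000", "-c", "1"]],
  // CAF→m4a fallback, as shown in the reviewed diff.
  ["caf->m4a", ["-f", "m4af", "-d", "aac"]],
]);

// Returns afconvert arguments for a supported pair, or undefined so the
// caller keeps its original buffer.
function pickRecipe(source: string, target: string): readonly string[] | undefined {
  return AFCONVERT_RECIPES.get(`${source.toLowerCase()}->${target.toLowerCase()}`);
}
```

With this shape an unsupported source such as `ogg` never reaches afconvert, which addresses the concern above without relying on the process-failure fallback.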

greptile-apps · 2026-04-27T04:27:44Z

+  if (target === "caf") {
+    return ["-f", "caff", "-d", "LEI16@44100"];


Uncompressed PCM output inflates voice-memo file sizes significantly

-d LEI16@44100 forces the output data format to 16-bit linear PCM at 44100 Hz. TTS-synthesized MP3 is typically 16–22 kHz mono at 32–128 kbps. A 10-second TTS clip at 128 kbps is ~160 KB as MP3; the same clip re-encoded as PCM stereo at 44.1 kHz is ~1.76 MB — roughly 11× larger. For longer messages this approaches or exceeds common iMessage attachment limits, risks slower BlueBubbles upload times, and negates the storage efficiency of the original TTS provider output.

Consider using a lossy compressed output data format instead (e.g. aac in a CAF container), which keeps file sizes comparable to the source MP3 while remaining fully compatible with Core Audio and the iMessage voice-memo renderer.

Suggested change

-  if (target === "caf") {
-    return ["-f", "caff", "-d", "LEI16@44100"];
+    return ["-f", "caff", "-d", "aac"];

greptile-apps · 2026-04-27T04:27:45Z

+  it("returns undefined off-platform without invoking afconvert", async () => {
+    if (process.platform === "darwin") {
+      // macOS: the recipe exists. Real afconvert will be exercised separately
+      // via integration paths so this unit test stays platform-portable.
+      return;
+    }
+    const result = await transcodeAudioBuffer({
+      audioBuffer: Buffer.from("payload"),
+      sourceExtension: "mp3",
+      targetExtension: "caf",
+    });
+    expect(result).toBeUndefined();
+  });


Happy-path conversion is never exercised in the unit-test suite

The test for "returns undefined off-platform without invoking afconvert" unconditionally returns early on macOS (the platform where afconvert is actually available). This means the only code path that produces a Buffer result — the successful mp3 → caf transcode — is entirely unexercised by the unit tests, and the PR relies on manual end-to-end verification instead. Adding a macOS-only test that passes a small valid MP3 payload and asserts result is an instance of Buffer would catch future regressions in the recipe flags or runAfconvert wiring without requiring a full integration run.

…Message voice-memo bubbles via BlueBubbles

End-to-end testing on a real macOS + BlueBubbles setup walked through three CAF flavors before landing on the format Apple's Messages.app actually emits when a user records a native iMessage voice memo:

| Pre-encoded CAF flavor | BlueBubbles internal CAF→MP3 conversion | iMessage rendering |
|---|---|---|
| (no fix; MP3 + isAudio) | Renames to .caf, conversion fails (race) | Plain audio attachment |
| PCM int16 @ 44.1 kHz | Conversion fails | Voice-memo bubble, **0 s** time |
| AAC @ 22.05 kHz mono | Conversion succeeds → silent downgrade | Plain audio attachment |
| **Opus @ 24 kHz mono** | n/a — accepted as-is | **Native voice memo, real time + waveform** |

The descriptor block of an Apple-recorded voice memo is exactly `1 ch, 24000 Hz, opus, 480 frames/packet`, and `afconvert -f caff -d opus@24000 -c 1` produces a byte-identical container. iMessage uses that descriptor block as its signal that the attachment is a voice memo, so anything else (PCM, AAC, MP3) gets downgraded somewhere along the BlueBubbles → Messages.app path.

Also adds a magic-byte sniff for the CAF container in `src/media/mime.ts` (`caff` ASCII tag → `audio/x-caf`). Without it the auto-reply host-local-media validator drops the pre-transcoded buffer because the bundled `file-type` library has no native CAF detector and returns `undefined`, which the validator treats as an unknown binary blob and refuses to forward ("⚠️ Media failed.").

Pipeline pieces:
- `src/channels/plugins/types.core.ts` — extend `ChannelTtsVoiceDeliveryCapabilities` with optional `preferAudioFileFormat?: string`.
- `extensions/speech-core/src/audio-transcode.ts` (new) — `transcodeAudioBuffer` helper. macOS-only `afconvert` path; quietly returns `undefined` on any unsupported pair, missing platform, or process failure. Ships the MP3→CAF recipe used by BlueBubbles voice memos plus a CAF→m4a fallback for symmetry with what BlueBubbles itself attempts.
- `extensions/speech-core/src/tts.ts` — call the helper between synthesis and file-write inside `textToSpeech`. When transcoded, swap `audioBuffer` / `fileExtension` / `outputFormat` and use the new values for both the on-disk path and the `shouldDeliverTtsAsVoice` decision so the resulting `audioAsVoice` flag reflects the actual file shape that lands on the channel.
- `extensions/bluebubbles/src/channel-shared.ts` — declare `preferAudioFileFormat: "caf"` on BlueBubbles capabilities.
- `src/media/mime.ts` — `audio/x-caf` mapping plus a fallback `caff` magic-byte sniff so host-local validators recognize CAF as audio.
- Tests: new `audio-transcode.test.ts` covers the no-op cases and the off-Darwin fallback; new mime cases assert the CAF magic-byte sniff with and without a corroborating filename.

Falls back to the original buffer when the host platform, the source/target pair, or the transcoder process can't produce the preferred container — so non-Darwin hosts and unsupported provider combinations are unaffected. BlueBubbles is the only currently affected channel and now declares `preferAudioFileFormat: "caf"`. Other channels are unchanged.

Fixes openclaw#72506.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

omarshahine · 2026-04-27T21:15:42Z

Landed in da3d17e1ca.

Test plan summary that ran locally before merge:

85/85 vitest cases green across the 3 touched test files (extensions/speech-core/src/audio-transcode.test.ts, extensions/speech-core/src/tts.test.ts, src/media/mime.test.ts)
pnpm exec tsc --noEmit clean
lint:tmp:no-random-messaging green (was the original CI block)
End-to-end on macOS + BlueBubbles: native iMessage voice-memo bubble with correct duration + waveform

CI on the merge SHA had 2 persistent flakes that are unrelated to this change. Both pass locally on the PR branch and on origin/main, and neither test path touches a file changed here:

src/gateway/http-utils.authorize-request.test.ts (vitest vi.fn() spy assertion order)
src/infra/install-package-dir.test.ts filesystem-race test

These have been observed to flake on main itself; not introduced by this PR. Worth tracking separately.

…Message voice-memo bubbles via BlueBubbles (openclaw#72586)

End-to-end testing on macOS + BlueBubbles + ElevenLabs walked through three CAF flavors before landing on the format Apple's Messages.app actually emits when a user records a native iMessage voice memo:

- PCM int16 @ 44.1 kHz CAF: BlueBubbles' internal `afconvert -f m4af -d aac` conversion fails; the original CAF reaches iMessage but renders with 0 s duration.
- AAC @ 22.05 kHz mono CAF: BlueBubbles' conversion succeeds and the server silently downgrades the delivery, sending the converted MP3 as a generic audio attachment.
- **Opus @ 24 kHz mono CAF**: byte-identical to the descriptor block Apple's Messages.app produces; BlueBubbles passes it through unchanged and iMessage renders a native voice-memo bubble with proper duration and waveform UI.

Adds an opt-in `tts.voice.preferAudioFileFormat` channel capability and a macOS `afconvert`-backed pre-transcode in the speech-core pipeline. BlueBubbles declares `preferAudioFileFormat: "caf"`. Other channels are unaffected.

Falls back to the original buffer when the host platform, the source/target pair, or the transcoder process can't produce the preferred container — so non-Darwin hosts and unsupported provider combinations are unchanged.

Also adds a `caff` magic-byte sniff in `src/media/mime.ts` so the auto-reply host-local-media validator (which uses `file-type` and didn't recognize CAF natively) accepts the buffer instead of dropping it as "⚠️ Media failed."

Fixes openclaw#72506.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

openclaw-barnacle Bot added the channel: bluebubbles, size: M, and maintainer labels Apr 27, 2026

omarshahine mentioned this pull request Apr 27, 2026

Voice-memo attachments race-fail CAF→MP3 conversion when client uploads MP3 with isAudioMessage=true, fall back to generic attachment BlueBubblesApp/bluebubbles-server#799

Open

greptile-apps Bot reviewed Apr 27, 2026

omarshahine force-pushed the fix/72506-bb-pre-encode-caf-tts branch 4 times, most recently from 05809d7 to a5de563 Compare April 27, 2026 17:48

omarshahine force-pushed the fix/72506-bb-pre-encode-caf-tts branch from a5de563 to f099afd Compare April 27, 2026 19:44

omarshahine merged commit da3d17e into openclaw:main Apr 27, 2026
172 of 182 checks passed

omarshahine deleted the fix/72506-bb-pre-encode-caf-tts branch April 27, 2026 21:15

github-actions Bot mentioned this pull request Apr 27, 2026

📡 Upstream Digest — 2026-04-27 22:42 UTC curtismercier/openclaw-mods#704

Open

clawsweeper Bot mentioned this pull request Apr 30, 2026

fix(tts): pick file extension from output format and expose target #72564

Closed

6 tasks

omarshahine merged 1 commit into openclaw:main from omarshahine:fix/72506-bb-pre-encode-caf-tts

	if (target === "caf") {
	return ["-f", "caff", "-d", "LEI16@44100"];
	return ["-f", "caff", "-d", "aac"];
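The fragment above shows `afconvert` argument lists for the intermediate PCM and AAC flavors. The PR summary settles on `afconvert -f caff -d opus@24000 -c 1`, with a fallback to the original buffer when the platform or source/target pair is unsupported. A minimal sketch of that decision follows. The function name `planTranscode`, the `TranscodePlan` shape, and the choice to treat only MP3 to CAF as supported are assumptions for illustration, not the merged implementation.

```typescript
// Hypothetical types and names; only the afconvert flags come from the PR summary.
interface TranscodePlan {
  command: "afconvert";
  args: string[];
}

/**
 * Returns the afconvert invocation for a supported source/target pair,
 * or null to signal "keep the original buffer".
 */
function planTranscode(
  platform: string, // e.g. process.platform
  sourceExt: string,
  target: string,
  inputPath: string,
  outputPath: string,
): TranscodePlan | null {
  // afconvert ships with macOS only, so every other host falls back.
  if (platform !== "darwin") return null;

  // Assumption: MP3 to CAF is the only pair handled in this sketch.
  if (sourceExt === "mp3" && target === "caf") {
    return {
      command: "afconvert",
      // Opus @ 24 kHz mono in a CAF container, per the PR summary.
      args: ["-f", "caff", "-d", "opus@24000", "-c", "1", inputPath, outputPath],
    };
  }

  return null;
}
```

A caller would run the returned command and, on a `null` plan or a non-zero exit, deliver the untouched synthesized buffer.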

greptile-apps Bot commented Apr 27, 2026

Greptile Summary

Confidence Score: 3/5
