fix(tts): pre-transcode synthesized audio to channel-preferred container before voice-memo delivery#72586
Conversation
🔒 Aisle Security Analysis
We found 4 potential security issue(s) in this PR:
1. 🟠 Path traversal / arbitrary file write via untrusted `fileExtension` in TTS temp file path
Description
Vulnerable code:
    const audioPath = path.join(tempDir, `voice-${Date.now()}${fileExtension}`);
    writeFileSync(audioPath, audioBuffer);
While the new pre-transcode logic may override the extension in some cases, it falls back to the original unvalidated provider-supplied `fileExtension`.
Recommendation
Validate and normalize `fileExtension`. Suggested approach (example):
    function safeExt(ext: string): string {
      const token = ext.trim().toLowerCase().replace(/^\./, "");
      if (!/^[a-z0-9]{1,12}$/.test(token)) {
        throw new Error(`Invalid synthesis fileExtension: ${ext}`);
      }
      return `.${token}`;
    }
    const safeFileExtension = safeExt(fileExtension);
    const audioPath = path.join(tempDir, `voice-${Date.now()}${safeFileExtension}`);
    writeFileSync(audioPath, audioBuffer, { mode: 0o600 });
Also consider applying the same validation at the provider boundary (when accepting …
2. 🟡 Unbounded external transcoder invocation in TTS pre-transcode path (resource exhaustion)
Description
The new macOS pre-transcode helper spawns an external process (…
Vulnerable code:
    writeFileSync(inPath, params.audioBuffer, { mode: 0o600 });
    const result = await runAfconvert({
      args: [...recipe, inPath, outPath],
      timeoutMs: params.timeoutMs ?? 5000,
    });
    ...
    return { ok: true, buffer: readFileSync(outPath) };
Recommendation
Add explicit guardrails around transcoding to prevent resource exhaustion.
Example (sketch) using a semaphore + size cap:
    import { promises as fs } from "node:fs";
    import pLimit from "p-limit";
    const limitAfconvert = pLimit(2); // at most 2 concurrent transcodes
    const MAX_TRANSCODE_BYTES = 5 * 1024 * 1024;
    export async function transcodeAudioBuffer(params: {...}) {
      if (params.audioBuffer.byteLength > MAX_TRANSCODE_BYTES) {
        return { ok: false, reason: "transcoder-failed", detail: "buffer-too-large" };
      }
      return limitAfconvert(async () => {
        await fs.writeFile(inPath, params.audioBuffer, { mode: 0o600 });
        const result = await runAfconvert(...);
        if (!result.ok) return { ok: false, reason: "transcoder-failed", detail: result.detail };
        return { ok: true, buffer: await fs.readFile(outPath) };
      });
    }
3. 🟡 CAF MIME sniffing can be spoofed with only "caff" prefix, bypassing host-local media type validation
Description
This is security-relevant because …
Vulnerable code:
    function sniffKnownAudioMagic(buffer: Buffer): string | undefined {
      if (buffer.byteLength >= 4 && buffer.toString("ascii", 0, 4) === "caff") {
        return "audio/x-caf";
      }
      return undefined;
    }
Impact depends on where …
Recommendation
Strengthen CAF validation so that … Minimum recommended checks:
Example (lightweight structural validation):
    function sniffKnownAudioMagic(buffer: Buffer): string | undefined {
      if (buffer.byteLength < 12) return undefined;
      if (buffer.toString("ascii", 0, 4) !== "caff") return undefined;
      const version = buffer.readUInt16BE(4);
      const flags = buffer.readUInt16BE(6);
      if (version !== 1) return undefined;
      if (flags !== 0) return undefined;
      // First chunk type (bytes 8..12) should be ASCII letters.
      const chunkType = buffer.toString("ascii", 8, 12);
      if (!/^[A-Za-z]{4}$/.test(chunkType)) return undefined;
      return "audio/x-caf";
    }
If stronger assurance is needed for security gating, prefer parsing CAF via a dedicated library/parser or performing an actual decode/probe step (e.g., …
4. 🔵 Log forging via unsanitized `channel` in verbose pre-transcode failure log
Description
In …
Vulnerable code:
    logVerbose(
      `TTS: pre-transcode ${sourceExt}->${preferred} for channel=${params.channel ?? "?"} failed: ${outcome.detail ?? "unknown"}`,
    );
This is a classic log-injection/log-forging issue (CWE-117).
Recommendation
Avoid embedding potentially attacker-controlled values into free-form log strings. Prefer structured logging fields and/or sanitize control characters.
Option A (structured fields):
    getLogger().debug(
      {
        sourceExt,
        preferred,
        channel: params.channel ?? "?",
        detail: outcome.detail ?? "unknown",
      },
      "TTS: pre-transcode failed",
    );
Option B (sanitize for console output):
    const safeChannel = (params.channel ?? "?").replace(/[\r\n\t\0]/g, " ");
    logVerbose(`TTS: pre-transcode ${sourceExt}->${preferred} for channel=${safeChannel} failed: ${outcome.detail ?? "unknown"}`);
Ideally apply sanitization in …
Analyzed PR: #72586 at commit …
Last updated on: 2026-04-27T19:46:18Z
Greptile Summary
This PR sidesteps the BlueBubbles server race condition by pre-transcoding synthesized TTS MP3 audio to CAF on macOS before upload, so the BlueBubbles server's immediate CAF→MP3 conversion attempt never races an incomplete write. The approach is well-scoped: opt-in via a new …
Confidence Score: 3/5
Functionally sound, but the PCM output format significantly inflates file sizes and should be verified against attachment limits before merging.
One P1 finding (the LEI16@44100 data format produces files 10-50× larger than the source MP3, which can hit iMessage/BlueBubbles attachment limits for longer TTS clips) and one P2 (the source param is not checked in the CAF recipe branch). The P1 ceiling is 4; the P1 affects a core output of the new feature (the transcoded buffer size), pulling the score to 3.
extensions/speech-core/src/audio-transcode.ts: the pickAfconvertRecipe function's choice of LEI16@44100 and the missing source-format guard need review before merge.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: extensions/speech-core/src/audio-transcode.ts
Line: 69-78
Comment:
**`source` parameter is ignored in the CAF branch**
The first branch returns the same recipe for any source format when `target === "caf"`, but the code comment and PR description both describe this specifically as the "MP3→CAF path." If the TTS synthesis layer ever emits a format that Core Audio cannot decode (e.g., `ogg`, `opus`, or a proprietary codec), the same `[-f caff -d LEI16@44100]` recipe will still be attempted — the afconvert process will fail, the catch block will return `undefined`, and the original buffer is silently preserved. While the fallback is safe, the function's signature promises `source` matters but never uses it in this branch. Adding a source check (`source === "mp3"`) would make the guard explicit and prevent unexpected behavior when new TTS providers are added.
```suggestion
if (source === "mp3" && target === "caf") {
return ["-f", "caff", "-d", "LEI16@44100"];
}
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: extensions/speech-core/src/audio-transcode.ts
Line: 72-73
Comment:
**Uncompressed PCM output inflates voice-memo file sizes significantly**
`-d LEI16@44100` forces the output data format to 16-bit linear PCM at 44100 Hz. TTS-synthesized MP3 is typically 16–22 kHz mono at 32–128 kbps. A 10-second TTS clip at 128 kbps is ~160 KB as MP3; the same clip re-encoded as PCM stereo at 44.1 kHz is ~1.76 MB — roughly 11× larger. For longer messages this approaches or exceeds common iMessage attachment limits, risks slower BlueBubbles upload times, and negates the storage efficiency of the original TTS provider output.
Consider using a lossy compressed output data format instead (e.g. `aac` in a CAF container), which keeps file sizes comparable to the source MP3 while remaining fully compatible with Core Audio and the iMessage voice-memo renderer.
```suggestion
return ["-f", "caff", "-d", "aac"];
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: extensions/speech-core/src/audio-transcode.test.ts
Line: 32-44
Comment:
**Happy-path conversion is never exercised in the unit-test suite**
The test for `"returns undefined off-platform without invoking afconvert"` unconditionally returns early on macOS (the platform where `afconvert` is actually available). This means the only code path that produces a `Buffer` result — the successful `mp3 → caf` transcode — is entirely unexercised by the unit tests, and the PR relies on manual end-to-end verification instead. Adding a macOS-only test that passes a small valid MP3 payload and asserts `result` is an instance of `Buffer` would catch future regressions in the recipe flags or `runAfconvert` wiring without requiring a full integration run.
How can I resolve this? If you propose a fix, please make it concise.
Reviews (1): Last reviewed commit: "fix(tts): pre-transcode synthesized audi..."
| function pickAfconvertRecipe(source: string, target: string): string[] | undefined { | ||
| // Currently only the MP3→CAF path used by BlueBubbles voice memos. Other | ||
| // pairs intentionally fall through so callers keep their original buffer. | ||
| if (target === "caf") { | ||
| return ["-f", "caff", "-d", "LEI16@44100"]; | ||
| } | ||
| if (source === "caf" && target === "m4a") { | ||
| return ["-f", "m4af", "-d", "aac"]; | ||
| } | ||
| return undefined; |
source parameter is ignored in the CAF branch
The first branch returns the same recipe for any source format when target === "caf", but the code comment and PR description both describe this specifically as the "MP3→CAF path." If the TTS synthesis layer ever emits a format that Core Audio cannot decode (e.g., ogg, opus, or a proprietary codec), the same [-f caff -d LEI16@44100] recipe will still be attempted — the afconvert process will fail, the catch block will return undefined, and the original buffer is silently preserved. While the fallback is safe, the function's signature promises source matters but never uses it in this branch. Adding a source check (source === "mp3") would make the guard explicit and prevent unexpected behavior when new TTS providers are added.
Suggested change:
    if (source === "mp3" && target === "caf") {
      return ["-f", "caff", "-d", "LEI16@44100"];
    }
| if (target === "caf") { | ||
| return ["-f", "caff", "-d", "LEI16@44100"]; |
Uncompressed PCM output inflates voice-memo file sizes significantly
-d LEI16@44100 forces the output data format to 16-bit linear PCM at 44100 Hz. TTS-synthesized MP3 is typically 16–22 kHz mono at 32–128 kbps. A 10-second TTS clip at 128 kbps is ~160 KB as MP3; the same clip re-encoded as PCM stereo at 44.1 kHz is ~1.76 MB — roughly 11× larger. For longer messages this approaches or exceeds common iMessage attachment limits, risks slower BlueBubbles upload times, and negates the storage efficiency of the original TTS provider output.
Consider using a lossy compressed output data format instead (e.g. aac in a CAF container), which keeps file sizes comparable to the source MP3 while remaining fully compatible with Core Audio and the iMessage voice-memo renderer.
Suggested change:
    return ["-f", "caff", "-d", "aac"];
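The size comparison in this comment can be checked with simple arithmetic. The sketch below is only an estimate: the helper names `pcmBytes` and `cbrBytes` are made up for illustration, container overhead is ignored, and the stereo figure follows the reviewer's assumption (a mono TTS source would produce half that).

```typescript
// Back-of-the-envelope size estimates; helper names are illustrative only.

/** Bytes of raw linear PCM audio (no container overhead). */
export function pcmBytes(
  seconds: number,
  sampleRateHz: number,
  channels: number,
  bitsPerSample: number,
): number {
  return seconds * sampleRateHz * channels * (bitsPerSample / 8);
}

/** Bytes of a constant-bitrate stream such as CBR MP3 (1 kbps = 1000 bits/s). */
export function cbrBytes(seconds: number, kbps: number): number {
  return (seconds * kbps * 1000) / 8;
}

const mp3Size = cbrBytes(10, 128); // 160_000 bytes, about 160 KB
const pcmStereoSize = pcmBytes(10, 44_100, 2, 16); // 1_764_000 bytes, about 1.76 MB
const pcmMonoSize = pcmBytes(10, 44_100, 1, 16); // 882_000 bytes

console.log({
  mp3Size,
  pcmStereoSize,
  pcmMonoSize,
  stereoRatio: pcmStereoSize / mp3Size, // 11.025
});
```

At the low end of the quoted bitrate range (32 kbps) the same 10-second MP3 would be about 40 KB, which is how the ratio reaches the "10-50×" range mentioned in the summary.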
| it("returns undefined off-platform without invoking afconvert", async () => { | ||
| if (process.platform === "darwin") { | ||
| // macOS: the recipe exists. Real afconvert will be exercised separately | ||
| // via integration paths so this unit test stays platform-portable. | ||
| return; | ||
| } | ||
| const result = await transcodeAudioBuffer({ | ||
| audioBuffer: Buffer.from("payload"), | ||
| sourceExtension: "mp3", | ||
| targetExtension: "caf", | ||
| }); | ||
| expect(result).toBeUndefined(); | ||
| }); |
Happy-path conversion is never exercised in the unit-test suite
The test for "returns undefined off-platform without invoking afconvert" unconditionally returns early on macOS (the platform where afconvert is actually available). This means the only code path that produces a Buffer result — the successful mp3 → caf transcode — is entirely unexercised by the unit tests, and the PR relies on manual end-to-end verification instead. Adding a macOS-only test that passes a small valid MP3 payload and asserts result is an instance of Buffer would catch future regressions in the recipe flags or runAfconvert wiring without requiring a full integration run.
05809d7 to a5de563
…Message voice-memo bubbles via BlueBubbles
End-to-end testing on a real macOS + BlueBubbles setup walked through three
CAF flavors before landing on the format Apple's Messages.app actually emits
when a user records a native iMessage voice memo:
| Pre-encoded CAF flavor | BlueBubbles internal CAF→MP3 conversion | iMessage rendering |
|--------------------------|------------------------------------------|--------------------------------|
| (no fix; MP3 + isAudio) | Renames to .caf, conversion fails (race) | Plain audio attachment |
| PCM int16 @ 44.1 kHz | Conversion fails | Voice-memo bubble, **0 s** time|
| AAC @ 22.05 kHz mono | Conversion succeeds → silent downgrade | Plain audio attachment |
| **Opus @ 24 kHz mono** | n/a — accepted as-is | **Native voice memo, real time + waveform** |
The descriptor block of an Apple-recorded voice memo is exactly
`1 ch, 24000 Hz, opus, 480 frames/packet`, and `afconvert -f caff -d opus@24000 -c 1`
produces a byte-identical container. iMessage uses that descriptor block as
its signal that the attachment is a voice memo, so anything else (PCM, AAC,
MP3) gets downgraded somewhere along the BlueBubbles → Messages.app path.
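As a rough illustration of how the recipe described above could be looked up and turned into an `afconvert` argument list, here is a minimal sketch. The flag values are the ones quoted in this PR; the lookup-table shape, the `"mp3->caf"` key format, and `buildAfconvertArgs` are assumptions for illustration, not the code that landed.

```typescript
// Sketch only: the table layout and `buildAfconvertArgs` are hypothetical.
// The flag values are the ones quoted in this PR.
const RECIPES: Readonly<Record<string, readonly string[]>> = {
  // Opus @ 24 kHz mono in a CAF container (the voice-memo descriptor).
  "mp3->caf": ["-f", "caff", "-d", "opus@24000", "-c", "1"],
  // CAF to m4a fallback, mirroring what BlueBubbles itself attempts.
  "caf->m4a": ["-f", "m4af", "-d", "aac"],
};

/** Returns the afconvert flags for a source/target pair, or undefined if unsupported. */
export function pickAfconvertRecipe(source: string, target: string): string[] | undefined {
  const recipe = RECIPES[`${source.toLowerCase()}->${target.toLowerCase()}`];
  return recipe ? [...recipe] : undefined;
}

/** Appends input and output paths to the recipe, matching `[...recipe, inPath, outPath]`. */
export function buildAfconvertArgs(
  source: string,
  target: string,
  inPath: string,
  outPath: string,
): string[] | undefined {
  const recipe = pickAfconvertRecipe(source, target);
  return recipe ? [...recipe, inPath, outPath] : undefined;
}
```

Keying the table on both source and target means an unsupported source such as `ogg` never reaches the transcoder, which is the explicit guard the review asked for.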
Also adds a magic-byte sniff for the CAF container in `src/media/mime.ts`
(`caff` ASCII tag → `audio/x-caf`). Without it the auto-reply host-local-
media validator drops the pre-transcoded buffer because the bundled
`file-type` library has no native CAF detector and returns `undefined`,
which the validator treats as an unknown binary blob and refuses to forward
("⚠️ Media failed.").
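The fallback order described above (ask the primary detector first, then check for the `caff` tag) might look roughly like the sketch below. `sniffMimeWithCafFallback` and the `Detector` type are hypothetical names; the real `file-type` API is asynchronous, so a synchronous stand-in is used here only to keep the example short.

```typescript
// Sketch only: `Detector` stands in for the primary MIME detector.
type Detector = (buffer: Uint8Array) => string | undefined;

/** True when the first four bytes are the ASCII tag "caff". */
function hasCafMagic(buffer: Uint8Array): boolean {
  return (
    buffer.byteLength >= 4 &&
    buffer[0] === 0x63 && // 'c'
    buffer[1] === 0x61 && // 'a'
    buffer[2] === 0x66 && // 'f'
    buffer[3] === 0x66 // 'f'
  );
}

/** Prefers the primary detector's answer; falls back to the CAF magic check. */
export function sniffMimeWithCafFallback(
  buffer: Uint8Array,
  detect: Detector,
): string | undefined {
  const detected = detect(buffer);
  if (detected) return detected;
  return hasCafMagic(buffer) ? "audio/x-caf" : undefined;
}
```

A four-byte prefix check is deliberately permissive; the security analysis earlier in this conversation suggests also validating the version, flags, and first chunk type when the result gates anything sensitive.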
Pipeline pieces:
- `src/channels/plugins/types.core.ts` — extend `ChannelTtsVoiceDeliveryCapabilities`
with optional `preferAudioFileFormat?: string`.
- `extensions/speech-core/src/audio-transcode.ts` (new) — `transcodeAudioBuffer`
helper. macOS-only `afconvert` path; quietly returns `undefined` on any
unsupported pair, missing platform, or process failure. Ships the MP3→CAF
recipe used by BlueBubbles voice memos plus a CAF→m4a fallback for
symmetry with what BlueBubbles itself attempts.
- `extensions/speech-core/src/tts.ts` — call the helper between synthesis
and file-write inside `textToSpeech`. When transcoded, swap `audioBuffer`
/ `fileExtension` / `outputFormat` and use the new values for both the
on-disk path and the `shouldDeliverTtsAsVoice` decision so the resulting
`audioAsVoice` flag reflects the actual file shape that lands on the
channel.
- `extensions/bluebubbles/src/channel-shared.ts` — declare
`preferAudioFileFormat: "caf"` on BlueBubbles capabilities.
- `src/media/mime.ts` — `audio/x-caf` mapping plus a fallback `caff` magic-
byte sniff so host-local validators recognize CAF as audio.
- Tests: new `audio-transcode.test.ts` covers the no-op cases and the
off-Darwin fallback; new mime cases assert the CAF magic-byte sniff with
and without a corroborating filename.
Falls back to the original buffer when the host platform, the source/target
pair, or the transcoder process can't produce the preferred container — so
non-Darwin hosts and unsupported provider combinations are unaffected.
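The fallback behaviour can be sketched as a small pure function. This is an illustration under assumptions, not the code in `tts.ts`: `applyPreferredFormat`, `SynthesizedAudio`, and the `Transcoder` type are hypothetical, and the transcoder is injected and synchronous here so the example stays self-contained (the real helper awaits an external process).

```typescript
// Sketch only: all names below are hypothetical stand-ins.
interface SynthesizedAudio {
  audioBuffer: Uint8Array;
  fileExtension: string; // e.g. ".mp3"
}

type Transcoder = (params: {
  audioBuffer: Uint8Array;
  sourceExtension: string;
  targetExtension: string;
}) => Uint8Array | undefined;

function normalizeExt(ext: string): string {
  return ext.trim().toLowerCase().replace(/^\./, "");
}

/** Returns the transcoded audio when possible, otherwise the original object untouched. */
export function applyPreferredFormat(
  synth: SynthesizedAudio,
  preferred: string | undefined,
  transcode: Transcoder,
): SynthesizedAudio {
  if (!preferred) return synth; // channel declared no preference
  const source = normalizeExt(synth.fileExtension);
  const target = normalizeExt(preferred);
  if (source === target) return synth; // already in the preferred container
  const transcoded = transcode({
    audioBuffer: synth.audioBuffer,
    sourceExtension: source,
    targetExtension: target,
  });
  if (!transcoded) return synth; // unsupported pair, wrong platform, or process failure
  return { audioBuffer: transcoded, fileExtension: `.${target}` };
}
```

Returning the original object on every failure path keeps the change invisible to channels and hosts that cannot use it, which matches the behaviour this commit message describes.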
BlueBubbles is the only currently affected channel and now declares
`preferAudioFileFormat: "caf"`. Other channels are unchanged.
Fixes openclaw#72506.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
a5de563 to f099afd
Landed in … Test plan summary that ran locally before merge:
CI on the merge SHA had 2 persistent unrelated flakes that pass locally on both PR branch and …
These have been observed to flake on …
…Message voice-memo bubbles via BlueBubbles (openclaw#72586)
End-to-end testing on macOS + BlueBubbles + ElevenLabs walked through three CAF flavors before landing on the format Apple's Messages.app actually emits when a user records a native iMessage voice memo:
- PCM int16 @ 44.1 kHz CAF: BlueBubbles' internal `afconvert -f m4af -d aac` conversion fails; the original CAF reaches iMessage but renders with 0 s duration.
- AAC @ 22.05 kHz mono CAF: BlueBubbles' conversion succeeds and the server silently downgrades the delivery, sending the converted MP3 as a generic audio attachment.
- **Opus @ 24 kHz mono CAF**: byte-identical to the descriptor block Apple's Messages.app produces; BlueBubbles passes it through unchanged and iMessage renders a native voice-memo bubble with proper duration and waveform UI.
Adds an opt-in `tts.voice.preferAudioFileFormat` channel capability and a macOS `afconvert`-backed pre-transcode in the speech-core pipeline. BlueBubbles declares `preferAudioFileFormat: "caf"`. Other channels are unaffected.
Falls back to the original buffer when the host platform, the source/target pair, or the transcoder process can't produce the preferred container, so non-Darwin hosts and unsupported provider combinations are unchanged.
Also adds a `caff` magic-byte sniff in `src/media/mime.ts` so the auto-reply host-local-media validator (which uses `file-type` and didn't recognize CAF natively) accepts the buffer instead of dropping it as "⚠️ Media failed."
Fixes openclaw#72506.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Fixes #72506. After end-to-end testing on a real macOS + BlueBubbles + ElevenLabs stack, voice-memo replies from agents now render as native iMessage voice-memo bubbles (waveform UI, real duration) instead of plain file attachments.
The fix is a small, opt-in channel capability (`tts.voice.preferAudioFileFormat`) plus a macOS `afconvert`-backed pre-transcode in the speech-core pipeline. BlueBubbles declares `preferAudioFileFormat: "caf"` and the speech-core layer transcodes synthesized MP3 to opus-in-CAF before handing the file to the channel. Other channels are unaffected.
Diagnostic journey
The discovery process iterated through three CAF flavors. The descriptor block matters at every hop along OpenClaw → BlueBubbles server → Messages.app private API → iMessage.
What unlocked it was inspecting an Apple-recorded voice memo (the native iMessage `Audio Message.caf` that Apple's Messages.app produces when the user holds the mic button). The descriptor is exactly `1 ch, 24000 Hz, opus, 480 frames/packet`, and `afconvert -f caff -d opus@24000 -c 1` produces a byte-identical container. iMessage uses that descriptor block as its native voice-memo recognizer; anything else gets downgraded somewhere along the path.
The AAC row in particular was the surprising one: BlueBubbles' internal CAF→MP3 conversion succeeded against AAC-CAF, and BlueBubbles' code path then sent the converted MP3 as `audio/mp3` instead of forwarding the original CAF, silently downgrading from voice-memo bubble to plain attachment. PCM-CAF tripped the same conversion logic in the failure direction, which (counter-intuitively) made BlueBubbles fall back to forwarding the CAF, getting most of the way to a voice memo, except iMessage couldn't compute a duration from raw-PCM CAF, so the bubble showed 0 s.
A second, independent gap surfaced along the way: OpenClaw's auto-reply host-local-media validator uses the bundled `file-type` library to verify outbound buffers, and `file-type` v22 has no native CAF detector. Without the magic-byte fallback below, the validator drops the pre-transcoded buffer as an unknown binary blob and the agent ends up sending "⚠️ Media failed." instead of the voice memo. Adding a four-byte `caff` magic sniff in `src/media/mime.ts` returns `audio/x-caf`, which the validator already classifies as audio.
Pipeline pieces
- `src/channels/plugins/types.core.ts`: extend `ChannelTtsVoiceDeliveryCapabilities` with optional `preferAudioFileFormat?: string`. Doc comment explains the intent.
- `extensions/speech-core/src/audio-transcode.ts` (new): `transcodeAudioBuffer` helper. macOS-only `afconvert` path; quietly returns `undefined` on any unsupported pair, missing platform, or process failure. Ships the MP3→CAF recipe used by BlueBubbles voice memos (`-f caff -d opus@24000 -c 1`) and a CAF→m4a fallback for symmetry with what BlueBubbles itself attempts.
- `extensions/speech-core/src/tts.ts`: call the helper between synthesis and file-write inside `textToSpeech`. When transcoded, swap `audioBuffer` / `fileExtension` / `outputFormat` and use the new values for both the on-disk path and the `shouldDeliverTtsAsVoice` decision so the resulting `audioAsVoice` flag reflects the actual file shape that lands on the channel.
- `extensions/bluebubbles/src/channel-shared.ts`: declare `preferAudioFileFormat: "caf"` on BlueBubbles capabilities, with a comment pointing at the Messages.app voice-memo descriptor so future readers know what the format choice protects.
- `src/media/mime.ts`: add `audio/x-caf → .caf` to `EXT_BY_MIME`, plus a small `caff`-magic-bytes fallback in `sniffMime` so host-local validators recognize CAF as audio when `file-type` doesn't.
- `extensions/speech-core/src/audio-transcode.test.ts` (new): covers the no-op cases (matching extensions, unsupported recipe, empty source) and a platform-portable assertion that off-Darwin always returns `undefined` without invoking the binary.
- `src/media/mime.test.ts`: adds two regression cases for the CAF magic-byte sniff (with and without a corroborating filename).
Behavior matrix
- `preferAudioFileFormat` `caf`: `isAudioMessage=true`; renders as native voice-memo bubble in iMessage
- `preferAudioFileFormat` `caf`: `transcodeAudioBuffer` returns `undefined`; original MP3 buffer preserved (BlueBubbles is macOS-only anyway)
- `undefined`; no extra work
- `undefined`; original buffer preserved
Tests
- `pnpm exec vitest run src/media/mime.test.ts extensions/speech-core/src/audio-transcode.test.ts`: 63/63 pass (includes existing tests; new cases for CAF sniff + transcode no-op paths).
- `pnpm exec tsc --noEmit -p tsconfig.json` clean.
- `[[tts:...]]` directive in agent reply → native iMessage voice-memo bubble with real duration and waveform.
Test plan
- … `preferAudioFileFormat` declared, no pre-transcode)
🤖 Generated with Claude Code