Description
Summary
detectContentType() in src/line/download.ts checks the MPEG-4 ftyp magic bytes before the M4A-specific check, causing LINE voice messages (M4A / AAC-LC) to be classified as video/mp4 instead of audio/mp4. This makes isAudioAttachment() return false, so the entire audio transcription pipeline is skipped and the agent receives no transcript.
Root Cause
The MP4 check at bytes 4–7 (ftyp) fires for all MPEG-4 containers, including M4A. The M4A-specific check below it is dead code: it tests the same bytes 4–7, so any buffer that would satisfy it has already returned from the MP4 check:
// Matches ANY ftyp box, including M4A → returns "video/mp4"
if (buffer[4] === 0x66 && buffer[5] === 0x74 && buffer[6] === 0x79 && buffer[7] === 0x70) {
  return "video/mp4";
}
// Dead code: never reached for M4A files
if (buffer[0] === 0x00 && buffer[1] === 0x00 && buffer[2] === 0x00) {
  if (buffer[4] === 0x66 && buffer[5] === 0x74 && buffer[6] === 0x79 && buffer[7] === 0x70) {
    return "audio/mp4";
  }
}
M4A magic bytes: 00 00 00 1c 66 74 79 70 4d 34 41 20. Bytes 4–7 are ftyp, so the MP4 rule fires first.
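The ordering problem can be reproduced in isolation. The sketch below is a simplified stand-in for the ftyp portion of detectContentType(), not the actual source; the names isFtypBox and detectFtypCurrentOrder are hypothetical, and the header bytes are the twelve M4A magic bytes listed above.

```typescript
// Hypothetical helper: true when bytes 4-7 spell "ftyp".
function isFtypBox(buffer: Uint8Array): boolean {
  return (
    buffer.length >= 8 &&
    buffer[4] === 0x66 && // 'f'
    buffer[5] === 0x74 && // 't'
    buffer[6] === 0x79 && // 'y'
    buffer[7] === 0x70 // 'p'
  );
}

// Simplified stand-in that mirrors the rule order described in this issue.
function detectFtypCurrentOrder(buffer: Uint8Array): string | undefined {
  // Generic MP4 rule: matches every ftyp box, so it always returns first.
  if (isFtypBox(buffer)) {
    return "video/mp4";
  }
  // M4A rule: requires the same ftyp match, so it can never be reached.
  if (buffer[0] === 0x00 && buffer[1] === 0x00 && buffer[2] === 0x00) {
    if (isFtypBox(buffer)) {
      return "audio/mp4";
    }
  }
  return undefined;
}

// First 12 bytes of a LINE voice message (M4A), taken from this report.
const m4aHeader = new Uint8Array([
  0x00, 0x00, 0x00, 0x1c, 0x66, 0x74, 0x79, 0x70, 0x4d, 0x34, 0x41, 0x20,
]);

const detected = detectFtypCurrentOrder(m4aHeader); // "video/mp4"
```

Because both rules depend on the same four bytes, reordering alone would only flip the misclassification onto real MP4 video; the brand bytes have to be consulted.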
Downstream Impact (verified from source)
- detectContentType() → "video/mp4" (wrong)
- getExtensionForContentType("video/mp4") → .mp4 (wrong extension)
- mediaKindFromMime("video/mp4") → "video" (src/media/constants.ts: mime.startsWith("video/"))
- resolveAttachmentKind() → returns "video"; never reaches the isAudioFileName() fallback (src/media-understanding/attachments.ts)
- isAudioAttachment() → false
- selectAttachments({capability: "audio"}) → empty array
- transcribeFirstAudio() in audio-preflight.ts → firstAudio = undefined → returns undefined
- No ctx.Transcript set, agent receives raw placeholder without transcription
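The chain above can be illustrated with a few simplified stand-ins. None of these are the project's real implementations: kindFromMime, selectAudio and transcribeFirst are hypothetical names that only model the MIME-prefix classification and the "no audio attachment, no transcript" behaviour described in this issue.

```typescript
type MediaKind = "audio" | "video" | "other";

interface Attachment {
  path: string;
  mime: string;
}

// Stand-in for mediaKindFromMime(): classify by MIME prefix only.
function kindFromMime(mime: string): MediaKind {
  if (mime.startsWith("audio/")) return "audio";
  if (mime.startsWith("video/")) return "video";
  return "other";
}

// Stand-in for selectAttachments({ capability: "audio" }).
function selectAudio(attachments: Attachment[]): Attachment[] {
  return attachments.filter((a) => kindFromMime(a.mime) === "audio");
}

// Stand-in for transcribeFirstAudio(): without an audio attachment
// there is nothing to transcribe, so the result is undefined.
function transcribeFirst(attachments: Attachment[]): string | undefined {
  const firstAudio = selectAudio(attachments)[0];
  if (!firstAudio) return undefined;
  // Placeholder for the real transcription call.
  return `transcript of ${firstAudio.path}`;
}

const misdetected: Attachment = { path: "voice.mp4", mime: "video/mp4" };
const corrected: Attachment = { path: "voice.m4a", mime: "audio/mp4" };

const before = transcribeFirst([misdetected]); // undefined
const after = transcribeFirst([corrected]); // "transcript of voice.m4a"
```

The only input that differs between the two calls is the MIME string, which is why a fix at the detection step should be enough to restore the rest of the pipeline.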
Note: The file-type library (used in src/media/mime.ts sniffMime()) correctly identifies M4A as audio/x-m4a, but it is never called because the LINE download path uses its own detectContentType() implementation.
Suggested Fix
Option A: Check the ftyp sub-brand at bytes 8–11 (the "major brand" field in ISO base media file format terms) to distinguish M4A from MP4 video:
if (buffer[4] === 0x66 && buffer[5] === 0x74 && buffer[6] === 0x79 && buffer[7] === 0x70) {
  const subBrand = String.fromCharCode(buffer[8], buffer[9], buffer[10], buffer[11]);
  if (subBrand === 'M4A ' || subBrand === 'M4B ') {
    return "audio/mp4";
  }
  return "video/mp4";
}
Option B: Use the existing file-type library (already a dependency) via detectMime() from src/media/mime.ts instead of reimplementing magic-byte detection.
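As a rough sketch, Option A could also be packaged as a standalone helper that guards against buffers shorter than 12 bytes, since the snippet above reads bytes 8–11 without a length check. The function name detectFtypContentType is hypothetical, and the list of audio brands is limited to the two named in Option A; other brands would need to be confirmed against real files before being added.

```typescript
// Hypothetical helper: classify an ISO base media file by its ftyp brand.
// Returns undefined when the buffer is too short or has no ftyp box,
// so the caller can fall through to its other magic-byte checks.
function detectFtypContentType(buffer: Uint8Array): string | undefined {
  // 4 bytes box size + 4 bytes "ftyp" + 4 bytes brand = 12 bytes minimum.
  if (buffer.length < 12) return undefined;

  const isFtyp =
    buffer[4] === 0x66 && // 'f'
    buffer[5] === 0x74 && // 't'
    buffer[6] === 0x79 && // 'y'
    buffer[7] === 0x70; // 'p'
  if (!isFtyp) return undefined;

  // Bytes 8-11 hold the brand as four ASCII characters, e.g. "M4A " or "isom".
  const brand = String.fromCharCode(buffer[8], buffer[9], buffer[10], buffer[11]);

  // Audio brands from Option A; everything else keeps the existing video/mp4 result.
  if (brand === "M4A " || brand === "M4B ") {
    return "audio/mp4";
  }
  return "video/mp4";
}

// Header of a LINE voice message, from the xxd output in this report.
const lineVoiceHeader = new Uint8Array([
  0x00, 0x00, 0x00, 0x1c, 0x66, 0x74, 0x79, 0x70, 0x4d, 0x34, 0x41, 0x20,
]);
// Illustrative MP4 video header with the "isom" brand.
const isomHeader = new Uint8Array([
  0x00, 0x00, 0x00, 0x20, 0x66, 0x74, 0x79, 0x70, 0x69, 0x73, 0x6f, 0x6d,
]);

const voiceType = detectFtypContentType(lineVoiceHeader); // "audio/mp4"
const videoType = detectFtypContentType(isomHeader); // "video/mp4"
```

Returning undefined for non-ftyp input keeps the helper composable with whatever other format checks the download path already performs.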
Steps to reproduce
- Configure a LINE channel with tools.media.audio in auto-detection mode (OPENAI_API_KEY present, no explicit enabled: false).
- Send a voice message from the LINE mobile app to the bot.
- Observe the downloaded file and its detected content type in verbose logs.
- Check whether transcribeFirstAudio() produces a transcript.
Expected behavior
LINE voice messages (M4A) should be detected as audio/mp4, saved with .m4a extension, and automatically transcribed via the audio understanding pipeline (Whisper / OpenAI). The agent should receive the transcript in ctx.Transcript.
Actual behavior
Voice message is saved as .mp4 with MIME video/mp4. The audio transcription pipeline is completely skipped:
- isAudioAttachment() returns false
- selectAttachments({capability: "audio"}) returns empty array
- transcribeFirstAudio() returns undefined
- Agent receives <media:audio> placeholder + [media attached: /tmp/openclaw/line-media-xxx.mp4 (video/mp4)] but no transcript
The file-type library in src/media/mime.ts correctly identifies M4A as audio/x-m4a, but it is never invoked because the LINE download path uses its own detectContentType() implementation.
OpenClaw version
2026.2.26
Operating system
Linux (NVIDIA Jetson AGX Orin, aarch64, Ubuntu-based JetPack)
Install method
docker (openclaw:local-docker image)
Logs, screenshots, and evidence
# Actual LINE voice message file identified by `file` command
$ file /tmp/openclaw/line-media-1772274968613-*.mp4
Apple iTunes ALAC/AAC-LC (.M4A) Audio
# Magic bytes confirm ftyp M4A sub-brand at offset 8
$ xxd /tmp/openclaw/line-media-*.mp4 | head -1
00000000: 0000 001c 6674 7970 4d34 4120 0000 0000 ....ftypM4A ....
The `ftyp` box (bytes 4–7) is identical for both MP4 video and M4A audio. The distinguishing factor is the sub-brand at bytes 8–11: `M4A ` for audio vs `isom`/`mp42` for video.
Impact and severity
- Affected: All LINE channel users who send voice messages
- Severity: High — completely blocks voice message understanding on LINE
- Frequency: 100% reproducible (every LINE voice message)
- Consequence: Agent cannot process voice messages at all; users get no response to voice input. This is platform-specific to LINE (Telegram/WhatsApp use different audio formats).
Additional information
Related issues:
- #7333: Voice message binary leaks into context after transcription (OGG, fixed in #7475)
- #22554: Telegram voice messages not auto-transcribed despite tools.media.audio.enabled config (Windows)
- #17101: Telegram voice messages not transcribed
- #7899: Telegram voice messages not transcribed, applyMediaUnderstanding not called
- #13924: WhatsApp voice messages broken on all model providers, audio sent as image with undefined MIME type
This is likely a latent bug introduced when detectContentType() was first written. LINE is the only channel that uses M4A (AAC-LC in MPEG-4 container) for voice messages, which is why it wasn't caught by earlier OGG/Opus fixes for Telegram.
The existing file-type npm package (already a dependency) handles this correctly via fileTypeFromBuffer(). The simplest fix would be to delegate to it from downloadLineMedia() or to check the ftyp sub-brand at bytes 8–11.