fix(whatsapp): respect audioAsVoice flag in outbound delivery by masatohoshino · Pull Request #68744 · openclaw/openclaw

masatohoshino · 2026-04-19T00:58:20Z

Summary

Fixes #66053. The WhatsApp adapter dropped the audioAsVoice flag from
ChannelOutboundContext, so replies with [[audio_as_voice]] were
delivered as document attachments instead of PTT voice notes. This PR
wires the flag through createWhatsAppOutboundBase →
sendMessageWhatsApp / deliverWebReply, and adds channel-agnostic
MIME/filename hardening.

Changes

Commit 1 — feat(media): adds to src/media/mime.ts (exported via
openclaw/plugin-sdk/media-runtime):

isVerifiedAudioSource({ kind, contentType }) — predicate gating
voice-note routing.
sanitizeMediaMime(input, { preserveCodecsParam? }) — rejects control
characters (CWE-93), preserves codecs= param only for audio.
sanitizeFileName(input) — strips ASCII control + Unicode General
Category Cf characters (\p{Cf}, CWE-451), caps length at 128.

Signatures are channel-agnostic so Telegram, Discord, and Matrix can
adopt the same helpers in a follow-up.
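
For readers who want a concrete picture of the filename rule described above, here is a minimal sketch. It is not the PR's implementation: the name `sanitizeFileNameSketch`, the empty-string fallback, and the choice to cap by code points are all assumptions made for illustration. Only the two stripping rules and the 128 cap come from the description.

```typescript
// Hypothetical sketch of the described behavior; not the code in src/media/mime.ts.
const MAX_FILE_NAME_LENGTH = 128;

function sanitizeFileNameSketch(input: string | null | undefined): string {
  if (!input) return ""; // assumption: empty fallback for missing names
  const stripped = input
    // ASCII control characters (C0 range plus DEL).
    .replace(/[\u0000-\u001f\u007f]/g, "")
    // Unicode General Category Cf: ZWSP, ZWJ, BOM, soft hyphen, bidi overrides, etc.
    .replace(/\p{Cf}/gu, "");
  // Cap by code points so a surrogate pair is never cut in half.
  return Array.from(stripped).slice(0, MAX_FILE_NAME_LENGTH).join("");
}

// A right-to-left override (U+202E) and a zero-width space (U+200B) are both removed.
console.log(sanitizeFileNameSketch("inv\u202Eoice\u200B.pdf"));
```

The `u` flag is required for `\p{Cf}` to be interpreted as a Unicode property escape.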

Commit 2 — fix(whatsapp): consumes the new helpers.

sendMessageWhatsApp: forceVoiceDelivery gated on
isVerifiedAudioSource; PTT mimetype rebuilt with opus codec
preservation/fallback.
createWhatsAppOutboundBase: forwards audioAsVoice through
sendMedia.
deliverWebReply: audio / audio-as-voice paths route to PTT with
sanitized opus mime; image/video/document branches use sanitized
allowlisted mime; document fileName sanitized.

Commit 3 — fix(media): widens sanitizeFileName invisible-character
coverage to all Unicode Category Cf (addresses Aisle Finding 1).

Rebased onto main after #69813 landed; the audio mimetype
canonicalization that previously lived inline is now delegated to
normalizeWhatsAppLoadedMedia, and this PR layers PTT routing +
security hardening on top.

Verification

pnpm test extensions/whatsapp → 567 passed / 0 failed
pnpm test src/media/mime.test.ts → 106 passed / 0 failed (after the Aisle Finding 1 fix)
pnpm tsgo:prod → exit 0
codex review --base upstream/main → no P1 findings

Security notes

Finding 1 (Aisle, CWE-451, filename UI spoofing): fixed in commit 3.
Finding 2 (Aisle, CWE-20, magic-byte verification): deferred to a
cross-channel follow-up. The upstream auto-reply path already lands
ptt: true based on media.kind === "audio" (derived from
detectMime extension/header fallback), so the correct fix plumbs a
sniffed content-type through loadWebMedia and has Telegram,
Discord, Matrix, and WhatsApp all adopt the verified source together.

Follow-ups

Magic-byte audio verification in loadWebMedia / detectMime (see
security notes).
Telegram / Discord / Matrix adoption of the channel-agnostic helpers
in src/media/mime.ts.

masatohoshino · 2026-04-19T00:58:22Z

For a bit of extra context on the follow-up note in the PR body:

While working on this fix, I noticed a few other channel adapters that may be worth checking for similar audioAsVoice consumption gaps in their outbound paths. I deliberately kept this PR scoped to WhatsApp so it stays easy to review and resolves #66053 cleanly.

Happy to prepare a separate follow-up PR for the cross-channel side if maintainers think that would be useful — otherwise I'll leave the observation here for the record.

greptile-apps · 2026-04-19T01:00:43Z

Greptile Summary

This PR fixes the WhatsApp adapter to correctly route audio replies flagged with audioAsVoice=true to the PTT (push-to-talk) voice-note path instead of the document-attachment path. The fix is applied consistently across send.ts, outbound-base.ts, outbound-adapter.ts, and auto-reply/deliver-reply.ts, with a follow-up commit (566b640) also guarding the media.kind === "audio" branch against being overwritten when forceVoiceDelivery is active — directly addressing the concern raised in the previous review thread.

Confidence Score: 5/5

Safe to merge — all previously flagged P1 issues have been addressed and the remaining changes are well-tested.

The previously flagged bug (audio-kind branch overwriting forceVoiceDelivery) is fixed in the latest commit with the !forceVoiceDelivery guard on line 133 of send.ts. The mimetype normalization logic is consistent across all four touched files. No new logic paths are left unguarded, and no remaining P1 or P0 findings exist.

No files require special attention.

Reviews (2): Last reviewed commit: "fix(whatsapp): guard audio-kind branch a..."

masatohoshino · 2026-04-19T01:39:43Z

Thanks @greptile-apps for the catch. Addressed in the latest commit:

send.ts — moved the media.kind === "audio" branch into the else if chain guarded by !forceVoiceDelivery, matching the pattern used by the video/image/document branches. The forceVoiceDelivery block now correctly owns the audio-routing decision when the flag is set.
send.test.ts — added a regression test for the audioAsVoice=true + kind="audio" + contentType=null path, asserting the resulting mimetype is audio/ogg; codecs=opus.

Verification:

pnpm test extensions/whatsapp/src → 57 files / 484 tests pass
codex review --base origin/main reports no P1 or P2 findings on the latest state
The new test fails against the pre-fix code as expected
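
To illustrate the branch ordering described above, here is a simplified sketch of a single if / else-if chain in which the forced-voice branch owns the decision. The types, the function name `routeOutboundMedia`, and the plain `contentType ?? fallback` rule are assumptions for illustration; the actual send.ts rebuilds the mimetype through the sanitizing helpers.

```typescript
// Hypothetical types and names; a sketch of the routing shape, not send.ts itself.
type MediaKind = "audio" | "video" | "image" | "document";

interface LoadedMediaSketch {
  kind: MediaKind;
  contentType: string | null;
}

interface RouteSketch {
  type: MediaKind;
  ptt: boolean;
  mimetype: string;
}

const OPUS_PTT_MIME = "audio/ogg; codecs=opus";

function routeOutboundMedia(media: LoadedMediaSketch, audioAsVoice: boolean): RouteSketch {
  // The flag only takes effect when the loaded media is classified as audio.
  const forceVoiceDelivery = audioAsVoice && media.kind === "audio";
  if (forceVoiceDelivery) {
    // This branch owns the decision; later branches cannot overwrite it.
    return { type: "audio", ptt: true, mimetype: media.contentType ?? OPUS_PTT_MIME };
  } else if (media.kind === "audio") {
    return { type: "audio", ptt: true, mimetype: media.contentType ?? OPUS_PTT_MIME };
  } else if (media.kind === "video" || media.kind === "image") {
    return { type: media.kind, ptt: false, mimetype: media.contentType ?? "application/octet-stream" };
  }
  return { type: "document", ptt: false, mimetype: media.contentType ?? "application/octet-stream" };
}

// Mirrors the regression case: audioAsVoice=true, kind="audio", contentType=null.
console.log(routeOutboundMedia({ kind: "audio", contentType: null }, true).mimetype);
```

Because every branch is part of one chain, a later `kind === "audio"` check can no longer run after the forced-voice block and replace its result.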

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 566b640784

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

masatohoshino · 2026-04-19T22:41:17Z

@codex review

masatohoshino · 2026-04-19T22:47:45Z

All security findings raised by the analysis bots on this PR have been addressed or scoped to follow-ups as discussed above. The PR is ready for human review when convenient.

Two items intentionally left for separate follow-ups:

magic-byte verification in isVerifiedAudioSource (needs loadWebMedia refactoring and changes the [[audio_as_voice]] trust model)
fileName sanitization in inbound/send-api.ts (best fixed alongside the upstream parseContentDispositionFileName sanitizer)

Happy to split either of those into their own PRs if maintainers want them tracked separately.

chatgpt-codex-connector · 2026-04-19T22:48:29Z

Codex Review: Didn't find any major issues. Swish!


aisle-research-bot · 2026-04-23T16:44:52Z

🔒 Aisle Security Analysis

We found 1 potential security issue(s) in this PR:

#	Severity	Title
1	🟡 Medium	Audio trust boundary bypass via isVerifiedAudioSource relying on caller-provided media.kind

1. 🟡 Audio trust boundary bypass via isVerifiedAudioSource relying on caller-provided media.kind

Property	Value
Severity	Medium
CWE	CWE-345
Location	`src/media/mime.ts:118-135`

Description

The new isVerifiedAudioSource() helper is used by WhatsApp outbound code to decide whether to force PTT/voice-note delivery (audioAsVoice). However, it treats media.kind === "audio" as a verified signal even though kind can be derived from untrusted metadata.

In particular:

For remote URLs, loadWebMediaInternal() sets kind = kindFromMime(contentType) where contentType comes from detectMime({ buffer, headerMime, filePath }).
detectMime() will fall back to extension-derived or HTTP Content-Type header-derived MIME when magic-byte sniffing is inconclusive (i.e., file-type returns undefined).
An attacker-controlled server can therefore provide non-audio bytes while setting Content-Type: audio/* (or a misleading extension) such that kind becomes "audio".
WhatsApp send paths then treat the payload as verified audio and can coerce it into the voice-note/PTT path, enabling content-type confusion / policy bypass and potentially exposing downstream parsers to unexpected input.

Vulnerable code:

export function isVerifiedAudioSource(media: {
  kind?: string | null;
  contentType?: string | null;
}): boolean {
  return media.kind === "audio";
}

Because this function is explicitly intended as a trust boundary (“verified audio source”), it should not rely solely on a caller-controlled classification.

Recommendation

Treat audioAsVoice as an authorization-like decision and require byte-based verification before returning true.

Options (from strongest to weakest):

Pass sniffed MIME through the pipeline (or the buffer head) and require it to be audio:

export function isVerifiedAudioSource(media: {
  kind?: string | null;
  sniffedMime?: string | null;
}): boolean {
  return media.kind === "audio" && Boolean(media.sniffedMime?.startsWith("audio/"));
}

If you already have an audio validator, require it (e.g., opus/ogg for WA PTT) rather than trusting kind:

return media.kind === "audio" && media.sniffedMime === "audio/ogg";

At minimum, ensure kind is derived from sniffed bytes (not header/extension fallback) before allowing PTT coercion.

Also consider failing closed: if sniffing is inconclusive, do not treat the media as verified audio for voice-note delivery.
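
As one possible reading of the fail-closed option, the sketch below checks the first bytes of the payload for the Ogg capture pattern ("OggS") before treating it as voice-note material. The `head` field and both function names are hypothetical and do not exist in the PR. Note also that matching the container magic does not prove the stream is Opus audio, since Ogg can carry other codecs.

```typescript
// Hypothetical sketch: `head` (leading bytes of the payload) is not a field in the PR.
interface SniffableMediaSketch {
  kind?: string | null;
  head?: Uint8Array | null;
}

// Every Ogg page begins with the ASCII bytes "OggS".
const OGG_CAPTURE_PATTERN = [0x4f, 0x67, 0x67, 0x53];

function startsWithOggMagic(head: Uint8Array | null | undefined): boolean {
  // Fail closed: missing or too-short data is never treated as Ogg.
  if (!head || head.length < OGG_CAPTURE_PATTERN.length) return false;
  return OGG_CAPTURE_PATTERN.every((byte, i) => head[i] === byte);
}

function isVerifiedAudioSourceSketch(media: SniffableMediaSketch): boolean {
  // Both the classifier and the byte check must agree.
  return media.kind === "audio" && startsWithOggMagic(media.head);
}

const oggHead = new Uint8Array([0x4f, 0x67, 0x67, 0x53, 0x00, 0x02]);
const htmlHead = new Uint8Array([0x3c, 0x68, 0x74, 0x6d, 0x6c, 0x3e]); // "<html>"
console.log(isVerifiedAudioSourceSketch({ kind: "audio", head: oggHead }));
console.log(isVerifiedAudioSourceSketch({ kind: "audio", head: htmlHead }));
```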

Analyzed PR: #68744 at commit 6abf1f4

Last updated on: 2026-04-25T00:08:29Z

masatohoshino · 2026-04-24T11:37:57Z

Second rebase and security review update (2026-04-24):

Rebase

Rebased onto the latest main after #69813 (canonicalize outbound media
delivery) landed. The media normalization that previously lived inline in
sendMessageWhatsApp / deliverWebReply is now delegated to
normalizeWhatsAppLoadedMedia in outbound-media-contract.ts. This PR's
audioAsVoice routing and security-hardening helpers sit on top of that
canonical layer:

sendMessageWhatsApp: keeps the audioAsVoice option; forceVoiceDelivery
is gated on isVerifiedAudioSource({ kind, contentType: media.mimetype })
after normalization; PTT mimetype uses the opus fallback with
sanitizeMediaMime(..., { preserveCodecsParam: true }).
outbound-base.ts (createWhatsAppOutboundBase): propagates audioAsVoice
through sendMedia. The refactored outbound-adapter.ts consumes this
factory directly, so the earlier per-adapter wiring is removed.
deliver-reply.ts: audio and audioAsVoice-verified paths route to PTT with
sanitized opus mimetype; image/video/document branches apply sanitized
allowlisted mimetypes; document fileName is sanitized.
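
The MIME handling summarized above could look roughly like the sketch below. It is a guess at the shape, not the helper from src/media/mime.ts: the name `sanitizeMediaMimeSketch`, the `undefined` return for rejected input, and the exact token and codecs patterns are assumptions.

```typescript
// Hypothetical sketch of the described rules; the real helper may differ in detail.
function sanitizeMediaMimeSketch(
  input: string | null | undefined,
  opts: { preserveCodecsParam?: boolean } = {},
): string | undefined {
  if (!input) return undefined;
  // Reject control characters outright (CR/LF injection, CWE-93).
  if (/[\u0000-\u001f\u007f]/.test(input)) return undefined;
  const parts = input.split(";");
  const base = (parts[0] ?? "").trim().toLowerCase();
  // Require a plain type/subtype pair.
  if (!/^[a-z0-9][a-z0-9!#$&^_.+-]*\/[a-z0-9][a-z0-9!#$&^_.+-]*$/.test(base)) return undefined;
  // The codecs parameter is kept only when requested and only for audio types.
  if (!opts.preserveCodecsParam || !base.startsWith("audio/")) return base;
  for (const param of parts.slice(1)) {
    const codecs = /^\s*codecs\s*=\s*"?([a-z0-9.,\s-]+)"?\s*$/i.exec(param)?.[1];
    if (codecs) return `${base}; codecs=${codecs.trim().toLowerCase()}`;
  }
  return base;
}

console.log(sanitizeMediaMimeSketch("Audio/OGG; codecs=opus", { preserveCodecsParam: true }));
```

A caller on the PTT path would fall back to a fixed opus mimetype when this returns `undefined`, which matches the "opus fallback" wording above.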

Tests: pnpm test extensions/whatsapp → 567 / 0; pnpm tsgo:prod → green.

Aisle response

Incomplete Unicode invisible character stripping in sanitizeFileName —
addressed in the latest commit. sanitizeFileName now strips every
Unicode General Category Cf character via /\p{Cf}/gu, covering the
previously-missed ZWSP (U+200B), ZWJ (U+200D), BOM (U+FEFF), and Soft
Hyphen (U+00AD, itself Cf) as well as the bidi marks already handled.
Regression tests added in src/media/mime.test.ts.
Voice-note sending can be coerced by trusting file extension/headers —
left as a follow-up, consistent with the earlier comment on this thread.
Reasoning: this is not WhatsApp-specific. The upstream auto-reply paths
that land ptt: true based on media.kind === "audio" have the same
trust boundary (kind is derived from detectMime, which falls back to
URL/path extension and HTTP Content-Type). Fixing it well means
plumbing a sniffedContentType/sniffedIsAudio through loadWebMedia
and having Telegram (resolveTelegramVoiceSend), Discord
(sendVoiceMessageDiscord), Matrix, and WhatsApp all adopt the verified
source together. Happy to open a separate PR for that refactor if
maintainers agree — otherwise flagged here for the record.

Cross-channel note (reiterated)

The shared helpers introduced here — isVerifiedAudioSource,
sanitizeMediaMime, sanitizeFileName in src/media/mime.ts — use a
channel-agnostic signature ({ kind, contentType }) so Telegram, Discord,
and Matrix can adopt them in a follow-up for consistent voice-note
hardening across channels.

masatohoshino · 2026-04-24T12:04:46Z

Aisle CWE-345 follow-up (2026-04-24):

The Medium finding flagged in the latest scan
("Content-type spoofing allows non-audio payloads to be sent as WhatsApp
voice notes (PTT)") is addressed in the latest commit.

isVerifiedAudioSource now requires a caller-classified kind === "audio".
The contentType parameter is retained on the signature so future work can
extend the helper with a sniffedMime argument once magic-byte sniffing is
plumbed through loadWebMedia / detectMime — but it is no longer consulted
on its own. This closes the header-spoofing path Aisle identified: an
attacker-controlled Content-Type: audio/... response no longer bypasses
the kind classifier.

Tests updated in src/media/mime.test.ts:

Flip the previously-accepted { kind: "document", contentType: "audio/ogg" }
case expectation to false.
Add a dedicated CWE-345 regression test.
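
The flipped expectation can be shown side by side. In the sketch below, the "after" predicate matches the body quoted in the Aisle report, while the "before" predicate is a reconstruction from the commit message wording (accepting any contentType that starts with audio/), so treat its exact form as an assumption. Both function names are hypothetical.

```typescript
interface AudioSourceInput {
  kind?: string | null;
  contentType?: string | null;
}

// Reconstructed pre-fix behavior (assumption): kind OR an audio/* contentType was enough.
function isVerifiedAudioSourceBefore(media: AudioSourceInput): boolean {
  return media.kind === "audio" || Boolean(media.contentType?.toLowerCase().startsWith("audio/"));
}

// Post-fix behavior: only the kind classification is consulted.
function isVerifiedAudioSourceAfter(media: AudioSourceInput): boolean {
  return media.kind === "audio";
}

// A document whose server claims Content-Type: audio/ogg.
const spoofed: AudioSourceInput = { kind: "document", contentType: "audio/ogg" };
console.log(isVerifiedAudioSourceBefore(spoofed), isVerifiedAudioSourceAfter(spoofed)); // prints: true false
```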

Verification:

pnpm test src/media/mime.test.ts → 107 passed / 0 failed
pnpm test extensions/whatsapp → 567 passed / 0 failed
pnpm tsgo:prod → exit 0

The broader "plumb sniffed content type through loadWebMedia and have
Telegram/Discord/Matrix all adopt a verified source" refactor remains a
follow-up PR as discussed earlier in this thread — the current change is the
in-scope defensive tightening Aisle's example implementation suggested.

…leName utilities

Channel-agnostic helpers in src/media/mime.ts for voice-note delivery safety, exported through openclaw/plugin-sdk/media-runtime.

- isVerifiedAudioSource(media): predicate gating voice-note (PTT) routing on media.kind === 'audio' or sanitized audio/* mime
- sanitizeMediaMime(input, opts): validates MIME for outbound headers, rejects ASCII control characters (CWE-93), normalizes to lowercase base type, optionally preserves a sanitized codecs= param for audio
- sanitizeFileName(input): strips ASCII control characters and Unicode bidirectional/invisible format characters (CWE-451), replaces path separators and quotes, caps length at 128 chars, linear-time iteration bounded against attacker-controlled input

Signatures are kept minimal ({ kind, contentType } / string | null) so Telegram (resolveTelegramVoiceSend), Discord (sendVoiceMessageDiscord), and Matrix can adopt the same helpers in a follow-up for consistent cross-channel voice-note hardening.

Fixes openclaw#66053. The WhatsApp adapter previously dropped the audioAsVoice flag from ChannelOutboundContext, so replies carrying the [[audio_as_voice]] directive were delivered as document attachments instead of PTT voice notes.

Layers

- sendMessageWhatsApp: adds audioAsVoice option. When true and the loaded media is a verified audio source (isVerifiedAudioSource), mediaType is rebuilt from a sanitized MIME with opus codec preservation/fallback so the outbound frame reaches WhatsApp as a voice note. Non-audio-verified payloads stay on the document path.
- createWhatsAppOutboundBase (outbound-base.ts): forwards audioAsVoice through the sendMedia factory option so outbound-adapter.ts no longer needs per-adapter wiring (upstream canonicalization landed in openclaw#69813).
- deliverWebReply (auto-reply/deliver-reply.ts): routes kind='audio' and audioAsVoice-verified media to PTT with sanitized opus mimetype; image/video/document branches apply sanitized allowlisted mimetypes; document fileName is passed through sanitizeFileName.

Security hardening (applied on top of upstream's normalizeWhatsAppLoadedMedia helper)

- sanitizeMediaMime rejects control characters (CWE-93) and preserves codecs params only in the audio path.
- sanitizeFileName strips ASCII control and bidi/invisible Unicode to prevent filename UI spoofing (CWE-451).
- isVerifiedAudioSource gates forceVoiceDelivery so an audioAsVoice reply cannot coerce non-audio bytes into a voice-note payload.

Tests (extensions/whatsapp)

- send.test.ts: audioAsVoice=true+audio routing, forceVoiceDelivery override guard, mimetype sanitization across kinds, document fileName sanitization.
- deliver-reply.test.ts: voice-coercion rejection, control-character fallbacks for audio/image/video/document, fileName sanitization, audioAsVoice-unset document path.
- outbound-base.test.ts / outbound-adapter.sendpayload.test.ts: audioAsVoice propagation through the factory.

Rebased onto upstream/main after openclaw#69813. Audio mimetype canonicalization ("audio/ogg" -> "audio/ogg; codecs=opus") is now owned by normalizeWhatsAppLoadedMedia; this change layers PTT routing and security hardening on top.

Closes openclaw#66053

isVerifiedAudioSource previously returned true for any sanitized contentType starting with audio/, allowing an attacker who controls a mediaUrl response to spoof Content-Type: audio/... and coerce WhatsApp PTT delivery of non-audio bytes (Aisle CWE-345).

Tighten the helper to only accept media.kind === "audio". The contentType parameter is retained on the signature as a seam for a future sniffedMime-based extension once magic-byte sniffing is plumbed through loadWebMedia / detectMime cross-channel.

Addresses Aisle re-scan finding on PR openclaw#68744.

steipete · 2026-04-25T05:36:19Z

Thanks @masatohoshino. I landed a narrower mainline fix for the issue in c2a2a481b2

What landed:

sendTextMediaPayload now preserves payload.audioAsVoice when it fans out media sends.
The WhatsApp outbound adapter forwards audioAsVoice through sendMedia to sendMessageWhatsApp.
WhatsApp docs and changelog now document that reply payloads preserve audioAsVoice; WhatsApp audio media continues to go out as Baileys PTT voice notes.
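
The root cause named here, a flag lost during fan-out, can be pictured with a small sketch. Everything in it is hypothetical: the payload and send shapes, the function name `fanOutMediaSends`, and the caption-on-first-item rule are illustrative assumptions, not the signature of sendTextMediaPayload.

```typescript
// Hypothetical shapes; not the real reply-payload types from the plugin SDK.
interface ReplyPayloadSketch {
  text?: string;
  mediaUrls: string[];
  audioAsVoice?: boolean;
}

interface MediaSendSketch {
  mediaUrl: string;
  caption?: string;
  audioAsVoice?: boolean;
}

function fanOutMediaSends(payload: ReplyPayloadSketch): MediaSendSketch[] {
  return payload.mediaUrls.map((mediaUrl, index) => ({
    mediaUrl,
    // Assumption for the sketch: only the first media item carries the text.
    caption: index === 0 ? payload.text : undefined,
    // The fix in spirit: copy the flag onto every per-media send instead of dropping it.
    audioAsVoice: payload.audioAsVoice,
  }));
}

const sends = fanOutMediaSends({
  text: "voice reply",
  mediaUrls: ["https://example.invalid/a.ogg", "https://example.invalid/b.ogg"],
  audioAsVoice: true,
});
console.log(sends.every((send) => send.audioAsVoice === true));
```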

I intentionally did not land the broader MIME/filename helper bundle from this PR in this pass. The current root bug was the dropped shared payload flag, and the existing WhatsApp send path already maps audio MIME payloads to Baileys { audio, ptt: true }. New public SDK helper surface and cross-channel MIME hardening should be split/reviewed separately.

Dependency contract checked locally: installed Baileys types expose AnyMediaMessageContent audio payloads with ptt?: boolean, and createWebSendApi already sends audio as ptt: true.

Verification:

pnpm test src/plugin-sdk/reply-payload.test.ts extensions/whatsapp/src/outbound-adapter.sendpayload.test.ts extensions/whatsapp/src/outbound-base.test.ts extensions/whatsapp/src/send.test.ts extensions/whatsapp/src/auto-reply/deliver-reply.test.ts extensions/whatsapp/src/inbound/send-api.test.ts
pnpm format:check -- CHANGELOG.md docs/channels/whatsapp.md src/plugin-sdk/reply-payload.ts src/plugin-sdk/reply-payload.test.ts extensions/whatsapp/src/outbound-base.ts extensions/whatsapp/src/outbound-base.test.ts extensions/whatsapp/src/outbound-adapter.sendpayload.test.ts extensions/whatsapp/src/send.ts
pnpm docs:list
git diff --check
pnpm check:changed passed, including core/extension typecheck, lint, import-cycle guards, changed tests, and full extension test shards.

Closing this PR as superseded by the landed main commit.

openclaw-barnacle Bot added channel: whatsapp-web Channel integration: whatsapp-web size: M labels Apr 19, 2026

greptile-apps Bot reviewed Apr 19, 2026

Comment thread extensions/whatsapp/src/send.ts Outdated

chatgpt-codex-connector Bot reviewed Apr 19, 2026

Comment thread extensions/whatsapp/src/send.ts Outdated

openclaw-barnacle Bot added size: L and removed size: M labels Apr 19, 2026

masatohoshino force-pushed the fix/whatsapp-audio-as-voice-consumption branch from ece86aa to 098b4b0 Compare April 23, 2026 16:44

masatohoshino force-pushed the fix/whatsapp-audio-as-voice-consumption branch from 098b4b0 to c0b5777 Compare April 24, 2026 11:36

masatohoshino force-pushed the fix/whatsapp-audio-as-voice-consumption branch from 57a349a to ab8bbda Compare April 24, 2026 23:43

masatohoshino force-pushed the fix/whatsapp-audio-as-voice-consumption branch from ab8bbda to 76f1fdd Compare April 24, 2026 23:52

masatohoshino added 2 commits April 25, 2026 00:03

fix(media): strip all Unicode Cf format chars in sanitizeFileName

63d84f7

masatohoshino force-pushed the fix/whatsapp-audio-as-voice-consumption branch from 76f1fdd to 6abf1f4 Compare April 25, 2026 00:06

steipete closed this Apr 25, 2026

This was referenced May 3, 2026

fix(media): add shared media MIME validation helpers #76566

Closed

fix(whatsapp): strip control characters from outbound document fileName #77114

Merged
