fix(whatsapp): respect audioAsVoice flag in outbound delivery#68744

Closed
masatohoshino wants to merge 4 commits into openclaw:main from masatohoshino:fix/whatsapp-audio-as-voice-consumption

Conversation

@masatohoshino

@masatohoshino masatohoshino commented Apr 19, 2026

Contributor

Summary

Fixes #66053. The WhatsApp adapter dropped the audioAsVoice flag from
ChannelOutboundContext, so replies with [[audio_as_voice]] were
delivered as document attachments instead of PTT voice notes. This PR
wires the flag through createWhatsAppOutboundBase, sendMessageWhatsApp,
and deliverWebReply, and adds channel-agnostic
MIME/filename hardening.

Changes

Commit 1 — feat(media): adds to src/media/mime.ts (exported via
openclaw/plugin-sdk/media-runtime):

  • isVerifiedAudioSource({ kind, contentType }) — predicate gating
    voice-note routing.
  • sanitizeMediaMime(input, { preserveCodecsParam? }) — rejects control
    characters (CWE-93), preserves codecs= param only for audio.
  • sanitizeFileName(input) — strips ASCII control + Unicode General
    Category Cf characters (\p{Cf}, CWE-451), caps length at 128.

Signatures are channel-agnostic so Telegram, Discord, and Matrix can
adopt the same helpers in a follow-up.
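
A minimal sketch of what a helper like sanitizeMediaMime could look like, based only on the behaviour listed above (reject control characters, lowercase the base type, keep codecs= only for audio). The return type, the regular expressions, and the option handling are assumptions for illustration, not the PR's code in src/media/mime.ts.

```typescript
// Hypothetical sketch; not the PR's implementation.
interface SanitizeMimeOptions {
  preserveCodecsParam?: boolean;
}

// ASCII control characters, including CR/LF used for header injection (CWE-93).
const CONTROL_CHARS = /[\u0000-\u001f\u007f]/;
// Conservative type/subtype shape (assumed; a real allowlist may differ).
const BASE_TYPE = /^[a-z0-9][a-z0-9!#$&^_.+-]*\/[a-z0-9][a-z0-9!#$&^_.+-]*$/;
// Conservative codec list shape, e.g. "opus" or "mp4a.40.2".
const CODECS_VALUE = /^[a-z0-9.,-]+$/;

function sanitizeMediaMime(
  input: string | null | undefined,
  opts: SanitizeMimeOptions = {},
): string | undefined {
  // Reject outright instead of trying to repair a suspicious value.
  if (!input || CONTROL_CHARS.test(input)) return undefined;
  const [rawBase = "", ...params] = input.split(";");
  const base = rawBase.trim().toLowerCase();
  if (!BASE_TYPE.test(base)) return undefined;
  // Parameters are dropped unless the caller opts in and the type is audio.
  if (!opts.preserveCodecsParam || !base.startsWith("audio/")) return base;
  for (const param of params) {
    const [key = "", value = ""] = param.split("=");
    const codecs = value.trim().replace(/^"|"$/g, "").toLowerCase();
    if (key.trim().toLowerCase() === "codecs" && CODECS_VALUE.test(codecs)) {
      return `${base}; codecs=${codecs}`;
    }
  }
  return base;
}
```

Under these assumptions, a value containing CR/LF is rejected rather than truncated, so a caller has to fall back to a known-good default instead of forwarding a partially cleaned header.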

Commit 2 — fix(whatsapp): consumes the new helpers.

  • sendMessageWhatsApp: forceVoiceDelivery gated on
    isVerifiedAudioSource; PTT mimetype rebuilt with opus codec
    preservation/fallback.
  • createWhatsAppOutboundBase: forwards audioAsVoice through
    sendMedia.
  • deliverWebReply: audio / audio-as-voice paths route to PTT with
    sanitized opus mime; image/video/document branches use sanitized
    allowlisted mime; document fileName sanitized.
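
The gating described for sendMessageWhatsApp can be sketched roughly as follows. The names buildPttMime and resolveForcedVoice are hypothetical, and the rule that any non-opus content type falls back to "audio/ogg; codecs=opus" is an assumption; the thread only confirms that a null content type yields that value.

```typescript
// Hypothetical names throughout; a sketch of the routing described above.
type MediaKind = "audio" | "image" | "video" | "document";

interface LoadedMedia {
  kind: MediaKind;
  contentType: string | null;
}

const PTT_FALLBACK_MIME = "audio/ogg; codecs=opus";

// Gate from the PR description: only caller-classified audio may be forced to PTT.
function isVerifiedAudioSource(media: { kind?: string | null; contentType?: string | null }): boolean {
  return media.kind === "audio";
}

// Keep an existing opus codecs parameter, otherwise fall back to Ogg/Opus (assumed rule).
function buildPttMime(contentType: string | null): string {
  const normalized = (contentType ?? "").trim().toLowerCase();
  const isCleanOpus = /^audio\/[a-z0-9.+-]+;\s*codecs=opus$/.test(normalized);
  return isCleanOpus ? normalized.replace(/;\s*/, "; ") : PTT_FALLBACK_MIME;
}

function resolveForcedVoice(
  media: LoadedMedia,
  audioAsVoice: boolean,
): { forceVoiceDelivery: boolean; mimetype: string | null } {
  // The flag alone is not enough; the media must also pass the audio gate.
  const forceVoiceDelivery = audioAsVoice && isVerifiedAudioSource(media);
  return {
    forceVoiceDelivery,
    mimetype: forceVoiceDelivery ? buildPttMime(media.contentType) : media.contentType,
  };
}
```

With this shape, a document carrying audioAsVoice=true keeps its original content type and is never promoted to a voice note.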

Commit 3 — fix(media): widens sanitizeFileName invisible-character
coverage to all Unicode Category Cf (addresses Aisle Finding 1).

Rebased onto main after #69813 landed; the audio mimetype
canonicalization that previously lived inline is now delegated to
normalizeWhatsAppLoadedMedia, and this PR layers PTT routing +
security hardening on top.

Verification

Security notes

  • Finding 1 (Aisle, CWE-451, filename UI spoofing): fixed in commit 3.
  • Finding 2 (Aisle, CWE-20, magic-byte verification): deferred to a
    cross-channel follow-up. The upstream auto-reply path already lands
    ptt: true based on media.kind === "audio" (derived from
    detectMime extension/header fallback), so the correct fix plumbs a
    sniffed content-type through loadWebMedia and has Telegram,
    Discord, Matrix, and WhatsApp all adopt the verified source together.

Follow-ups

  • Magic-byte audio verification in loadWebMedia / detectMime (see
    security notes).
  • Telegram / Discord / Matrix adoption of the channel-agnostic helpers
    in src/media/mime.ts.

@masatohoshino

Contributor Author

For a bit of extra context on the follow-up note in the PR body:

While working on this fix, I noticed a few other channel adapters that may be worth checking for similar audioAsVoice consumption gaps in their outbound paths. I deliberately kept this PR scoped to WhatsApp so it stays easy to review and resolves #66053 cleanly.

Happy to prepare a separate follow-up PR for the cross-channel side if maintainers think that would be useful — otherwise I'll leave the observation here for the record.

@openclaw-barnacle openclaw-barnacle Bot added channel: whatsapp-web Channel integration: whatsapp-web size: M labels Apr 19, 2026
@greptile-apps

greptile-apps Bot commented Apr 19, 2026

Contributor

Greptile Summary

This PR fixes the WhatsApp adapter to correctly route audio replies flagged with audioAsVoice=true to the PTT (push-to-talk) voice-note path instead of the document-attachment path. The fix is applied consistently across send.ts, outbound-base.ts, outbound-adapter.ts, and auto-reply/deliver-reply.ts, with a follow-up commit (566b640) also guarding the media.kind === "audio" branch against being overwritten when forceVoiceDelivery is active — directly addressing the concern raised in the previous review thread.

Confidence Score: 5/5

Safe to merge — all previously flagged P1 issues have been addressed and the remaining changes are well-tested.

The previously flagged bug (audio-kind branch overwriting forceVoiceDelivery) is fixed in the latest commit with the !forceVoiceDelivery guard on line 133 of send.ts. The mimetype normalization logic is consistent across all four touched files. No new logic paths are left unguarded, and no remaining P1 or P0 findings exist.

No files require special attention.

Reviews (2): Last reviewed commit: "fix(whatsapp): guard audio-kind branch a..."

Comment thread extensions/whatsapp/src/send.ts Outdated
@masatohoshino

Contributor Author

Thanks @greptile-apps for the catch. Addressed in the latest commit:

  • send.ts — moved the media.kind === "audio" branch into the else if chain guarded by !forceVoiceDelivery, matching the pattern used by the video/image/document branches. The forceVoiceDelivery block now correctly owns the audio-routing decision when the flag is set.
  • send.test.ts — added a regression test for the audioAsVoice=true + kind="audio" + contentType=null path, asserting the resulting mimetype is audio/ogg; codecs=opus.

Verification:

  • pnpm test extensions/whatsapp/src → 57 files / 484 tests pass
  • codex review --base origin/main reports no P1 or P2 findings on the latest state
  • The new test fails against the pre-fix code as expected

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 566b640784

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread extensions/whatsapp/src/send.ts Outdated
@masatohoshino

Contributor Author

@codex review

@masatohoshino

Contributor Author

All security findings raised by the analysis bots on this PR have been addressed or scoped to follow-ups as discussed above. The PR is ready for human review when convenient.

Two items intentionally left for separate follow-ups:

  • magic-byte verification in isVerifiedAudioSource (needs loadWebMedia refactoring and changes the [[audio_as_voice]] trust model)
  • fileName sanitization in inbound/send-api.ts (best fixed alongside the upstream parseContentDispositionFileName sanitizer)

Happy to split either of those into their own PRs if maintainers want them tracked separately.

@chatgpt-codex-connector


Codex Review: Didn't find any major issues. Swish!

@masatohoshino masatohoshino force-pushed the fix/whatsapp-audio-as-voice-consumption branch from ece86aa to 098b4b0 Compare April 23, 2026 16:44
@aisle-research-bot

aisle-research-bot Bot commented Apr 23, 2026


🔒 Aisle Security Analysis

We found 1 potential security issue(s) in this PR:

#  Severity   Title
1  🟡 Medium  Audio trust boundary bypass via isVerifiedAudioSource relying on caller-provided media.kind

1. 🟡 Audio trust boundary bypass via isVerifiedAudioSource relying on caller-provided media.kind

Property  Value
Severity  Medium
CWE       CWE-345
Location  src/media/mime.ts:118-135

Description

The new isVerifiedAudioSource() helper is used by WhatsApp outbound code to decide whether to force PTT/voice-note delivery (audioAsVoice). However, it treats media.kind === "audio" as a verified signal even though kind can be derived from untrusted metadata.

In particular:

  • For remote URLs, loadWebMediaInternal() sets kind = kindFromMime(contentType) where contentType comes from detectMime({ buffer, headerMime, filePath }).
  • detectMime() will fall back to extension-derived or HTTP Content-Type header-derived MIME when magic-byte sniffing is inconclusive (i.e., file-type returns undefined).
  • An attacker-controlled server can therefore provide non-audio bytes while setting Content-Type: audio/* (or a misleading extension) such that kind becomes "audio".
  • WhatsApp send paths then treat the payload as verified audio and can coerce it into the voice-note/PTT path, enabling content-type confusion / policy bypass and potentially exposing downstream parsers to unexpected input.

Vulnerable code:

export function isVerifiedAudioSource(media: {
  kind?: string | null;
  contentType?: string | null;
}): boolean {
  return media.kind === "audio";
}

Because this function is explicitly intended as a trust boundary (“verified audio source”), it should not rely solely on a caller-controlled classification.
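
The fallback chain in the bullets above can be illustrated with a small sketch. detectMimeSketch and kindFromMimeSketch are hypothetical stand-ins for detectMime and kindFromMime, and the order in which the header and extension are consulted is an assumption; the point is only that an inconclusive sniff lets untrusted metadata decide the kind.

```typescript
// Hypothetical stand-ins; the fallback order between header and extension is assumed.
type MediaKind = "audio" | "image" | "video" | "document";

interface MimeHints {
  sniffedMime?: string | undefined;   // from magic bytes; undefined when inconclusive
  headerMime?: string | undefined;    // HTTP Content-Type, attacker-controlled for remote URLs
  extensionMime?: string | undefined; // derived from the URL or file path
}

// Sniffed bytes win when available; otherwise untrusted metadata is used.
function detectMimeSketch(hints: MimeHints): string | undefined {
  return hints.sniffedMime ?? hints.headerMime ?? hints.extensionMime;
}

function kindFromMimeSketch(mime: string | undefined): MediaKind {
  if (mime?.startsWith("audio/")) return "audio";
  if (mime?.startsWith("image/")) return "image";
  if (mime?.startsWith("video/")) return "video";
  return "document";
}

// Sniffing is inconclusive, so the spoofed header alone classifies the payload.
const spoofedKind = kindFromMimeSketch(detectMimeSketch({ headerMime: "audio/ogg" }));
console.log(spoofedKind); // "audio"
```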

Recommendation

Treat audioAsVoice as an authorization-like decision and require byte-based verification before returning true.

Options (from strongest to weakest):

  1. Pass sniffed MIME through the pipeline (or the buffer head) and require it to be audio:

export function isVerifiedAudioSource(media: {
  kind?: string | null;
  sniffedMime?: string | null;
}): boolean {
  return media.kind === "audio" && Boolean(media.sniffedMime?.startsWith("audio/"));
}

  2. If you already have an audio validator, require it (e.g., opus/ogg for WA PTT) rather than trusting kind:

return media.kind === "audio" && media.sniffedMime === "audio/ogg";

  3. At minimum, ensure kind is derived from sniffed bytes (not header/extension fallback) before allowing PTT coercion.

Also consider failing closed: if sniffing is inconclusive, do not treat the media as verified audio for voice-note delivery.
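
One way to picture the fail-closed idea is a byte check on the start of the buffer. The sketch below is not part of the PR; the function names and the head field are hypothetical. It only checks for the Ogg capture pattern "OggS", which identifies the container and not the codec, so an Ogg file carrying something other than Opus would still pass.

```typescript
// Hypothetical sketch of a fail-closed byte check; not part of the PR.
const OGG_MAGIC = [0x4f, 0x67, 0x67, 0x53]; // ASCII "OggS"

// True only when the buffer starts with the Ogg capture pattern.
function sniffIsOgg(head: Uint8Array): boolean {
  return head.length >= OGG_MAGIC.length && OGG_MAGIC.every((byte, i) => head[i] === byte);
}

// Fail closed: a missing buffer head or an inconclusive sniff is treated as "not verified".
function isVerifiedAudioSourceStrict(media: {
  kind?: string | null;
  head?: Uint8Array | null;
}): boolean {
  if (media.kind !== "audio" || !media.head) return false;
  return sniffIsOgg(media.head);
}
```

A payload labelled kind "audio" whose bytes start with HTML would then stay on the non-voice path, at the cost of rejecting audio formats this check does not recognise.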


Analyzed PR: #68744 at commit 6abf1f4

Last updated on: 2026-04-25T00:08:29Z

@masatohoshino masatohoshino force-pushed the fix/whatsapp-audio-as-voice-consumption branch from 098b4b0 to c0b5777 Compare April 24, 2026 11:36
@masatohoshino

Contributor Author

Second rebase and security review update (2026-04-24):

Rebase

Rebased onto the latest main after #69813 (canonicalize outbound media
delivery) landed. The media normalization that previously lived inline in
sendMessageWhatsApp / deliverWebReply is now delegated to
normalizeWhatsAppLoadedMedia in outbound-media-contract.ts. This PR's
audioAsVoice routing and security-hardening helpers sit on top of that
canonical layer:

  • sendMessageWhatsApp: keeps the audioAsVoice option; forceVoiceDelivery
    is gated on isVerifiedAudioSource({ kind, contentType: media.mimetype })
    after normalization; PTT mimetype uses the opus fallback with
    sanitizeMediaMime(..., { preserveCodecsParam: true }).
  • outbound-base.ts (createWhatsAppOutboundBase): propagates audioAsVoice
    through sendMedia. The refactored outbound-adapter.ts consumes this
    factory directly, so the earlier per-adapter wiring is removed.
  • deliver-reply.ts: audio and audioAsVoice-verified paths route to PTT with
    sanitized opus mimetype; image/video/document branches apply sanitized
    allowlisted mimetypes; document fileName is sanitized.

Tests: pnpm test extensions/whatsapp → 567 passed / 0 failed; pnpm tsgo:prod → green.

Aisle response

  1. Incomplete Unicode invisible character stripping in sanitizeFileName:
    addressed in the latest commit. sanitizeFileName now strips every
    Unicode General Category Cf character via /\p{Cf}/gu, covering the
    previously-missed ZWSP (U+200B), ZWJ (U+200D), BOM (U+FEFF), and Soft
    Hyphen (U+00AD, itself Cf) as well as the bidi marks already handled.
    Regression tests added in src/media/mime.test.ts.
  2. Voice-note sending can be coerced by trusting file extension/headers:
    left as a follow-up, consistent with the earlier comment on this thread.
    Reasoning: this is not WhatsApp-specific. The upstream auto-reply paths
    that land ptt: true based on media.kind === "audio" have the same
    trust boundary (kind is derived from detectMime, which falls back to
    URL/path extension and HTTP Content-Type). Fixing it well means
    plumbing a sniffedContentType/sniffedIsAudio through loadWebMedia
    and having Telegram (resolveTelegramVoiceSend), Discord
    (sendVoiceMessageDiscord), Matrix, and WhatsApp all adopt the verified
    source together. Happy to open a separate PR for that refactor if
    maintainers agree — otherwise flagged here for the record.
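
For item 1, the effect of stripping the whole Cf category can be sketched as below. This is an illustrative version only: the return type, the underscore replacement for path separators and quotes, and counting the 128 cap in code points are assumptions, not the PR's exact behaviour.

```typescript
// Hypothetical sketch; the PR's sanitizeFileName may differ in details.
const MAX_FILE_NAME_LENGTH = 128; // cap stated in the PR description

function sanitizeFileName(input: string | null | undefined): string | undefined {
  if (!input) return undefined;
  const cleaned = input
    .replace(/[\u0000-\u001f\u007f]/g, "") // ASCII control characters
    .replace(/\p{Cf}/gu, "")               // every Unicode format character (ZWSP, ZWJ, BOM, bidi marks, ...)
    .replace(/[\\/"']/g, "_")              // path separators and quotes (replacement char is assumed)
    .trim();
  if (!cleaned) return undefined;
  // Cap by code points so a surrogate pair is never split.
  return Array.from(cleaned).slice(0, MAX_FILE_NAME_LENGTH).join("");
}

// U+200B (ZWSP) and U+202E (right-to-left override) are both category Cf.
console.log(sanitizeFileName("in\u200Bvoice\u202Egpj.exe")); // "invoicegpj.exe"
```

A single category-based pattern avoids maintaining a hand-written list of invisible code points, which is how the earlier gaps (ZWSP, ZWJ, BOM, soft hyphen) appear to have slipped through.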

Cross-channel note (reiterated)

The shared helpers introduced here — isVerifiedAudioSource,
sanitizeMediaMime, sanitizeFileName in src/media/mime.ts — use a
channel-agnostic signature ({ kind, contentType }) so Telegram, Discord,
and Matrix can adopt them in a follow-up for consistent voice-note
hardening across channels.

@masatohoshino

Contributor Author

Aisle CWE-345 follow-up (2026-04-24):

The Medium finding flagged in the latest scan
("Content-type spoofing allows non-audio payloads to be sent as WhatsApp
voice notes (PTT)") is addressed in the latest commit.

isVerifiedAudioSource now requires a caller-classified kind === "audio".
The contentType parameter is retained on the signature so future work can
extend the helper with a sniffedMime argument once magic-byte sniffing is
plumbed through loadWebMedia / detectMime — but it is no longer consulted
on its own. This closes the header-spoofing path Aisle identified: an
attacker-controlled Content-Type: audio/... response no longer bypasses
the kind classifier.

Tests updated in src/media/mime.test.ts:

  • Flip the previously-accepted { kind: "document", contentType: "audio/ogg" }
    case expectation to false.
  • Add a dedicated CWE-345 regression test.
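
The flipped expectation can be shown side by side. The "before" function is reconstructed from the commit message's description (kind === "audio" or an audio/* content type) and is an approximation; the "after" body matches the helper quoted in the Aisle report. The Before/After suffixes are added here for illustration only.

```typescript
type MediaLike = { kind?: string | null; contentType?: string | null };

// Earlier behaviour as described in the commit message (approximate reconstruction).
function isVerifiedAudioSourceBefore(media: MediaLike): boolean {
  return media.kind === "audio" || (media.contentType ?? "").toLowerCase().startsWith("audio/");
}

// Tightened behaviour: contentType stays on the signature but is no longer consulted.
function isVerifiedAudioSourceAfter(media: MediaLike): boolean {
  return media.kind === "audio";
}

// The case whose expectation was flipped in src/media/mime.test.ts.
const spoofed: MediaLike = { kind: "document", contentType: "audio/ogg" };
console.log(isVerifiedAudioSourceBefore(spoofed)); // true
console.log(isVerifiedAudioSourceAfter(spoofed));  // false
```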

Verification:

  • pnpm test src/media/mime.test.ts → 107 passed / 0 failed
  • pnpm test extensions/whatsapp → 567 passed / 0 failed
  • pnpm tsgo:prod → exit 0

The broader "plumb sniffed content type through loadWebMedia and have
Telegram/Discord/Matrix all adopt a verified source" refactor remains a
follow-up PR as discussed earlier in this thread — the current change is the
in-scope defensive tightening Aisle's example implementation suggested.

@masatohoshino masatohoshino force-pushed the fix/whatsapp-audio-as-voice-consumption branch from 57a349a to ab8bbda Compare April 24, 2026 23:43
@masatohoshino masatohoshino force-pushed the fix/whatsapp-audio-as-voice-consumption branch from ab8bbda to 76f1fdd Compare April 24, 2026 23:52
…leName utilities

Channel-agnostic helpers in src/media/mime.ts for voice-note delivery
safety, exported through openclaw/plugin-sdk/media-runtime.

- isVerifiedAudioSource(media): predicate gating voice-note (PTT)
  routing on media.kind === 'audio' or sanitized audio/* mime
- sanitizeMediaMime(input, opts): validates MIME for outbound headers,
  rejects ASCII control characters (CWE-93), normalizes to lowercase
  base type, optionally preserves a sanitized codecs= param for audio
- sanitizeFileName(input): strips ASCII control characters and Unicode
  bidirectional/invisible format characters (CWE-451), replaces path
  separators and quotes, caps length at 128 chars, linear-time
  iteration bounded against attacker-controlled input

Signatures are kept minimal ({ kind, contentType } / string | null)
so Telegram (resolveTelegramVoiceSend), Discord
(sendVoiceMessageDiscord), and Matrix can adopt the same helpers in a
follow-up for consistent cross-channel voice-note hardening.

Fixes openclaw#66053. The WhatsApp adapter previously dropped the audioAsVoice
flag from ChannelOutboundContext, so replies carrying the
[[audio_as_voice]] directive were delivered as document attachments
instead of PTT voice notes.

Layers

- sendMessageWhatsApp: adds audioAsVoice option. When true and the
  loaded media is a verified audio source (isVerifiedAudioSource),
  mediaType is rebuilt from a sanitized MIME with opus codec
  preservation/fallback so the outbound frame reaches WhatsApp as a
  voice note. Non-audio-verified payloads stay on the document path.
- createWhatsAppOutboundBase (outbound-base.ts): forwards audioAsVoice
  through the sendMedia factory option so outbound-adapter.ts no
  longer needs per-adapter wiring (upstream canonicalization landed
  in openclaw#69813).
- deliverWebReply (auto-reply/deliver-reply.ts): routes kind='audio'
  and audioAsVoice-verified media to PTT with sanitized opus
  mimetype; image/video/document branches apply sanitized allowlisted
  mimetypes; document fileName is passed through sanitizeFileName.

Security hardening (applied on top of upstream's
normalizeWhatsAppLoadedMedia helper)

- sanitizeMediaMime rejects control characters (CWE-93) and preserves
  codecs params only in the audio path.
- sanitizeFileName strips ASCII control and bidi/invisible Unicode to
  prevent filename UI spoofing (CWE-451).
- isVerifiedAudioSource gates forceVoiceDelivery so an audioAsVoice
  reply cannot coerce non-audio bytes into a voice-note payload.

Tests (extensions/whatsapp)

- send.test.ts: audioAsVoice=true+audio routing, forceVoiceDelivery
  override guard, mimetype sanitization across kinds, document
  fileName sanitization.
- deliver-reply.test.ts: voice-coercion rejection, control-character
  fallbacks for audio/image/video/document, fileName sanitization,
  audioAsVoice-unset document path.
- outbound-base.test.ts / outbound-adapter.sendpayload.test.ts:
  audioAsVoice propagation through the factory.

Rebased onto upstream/main after openclaw#69813. Audio mimetype canonicalization
("audio/ogg" -> "audio/ogg; codecs=opus") is now owned by
normalizeWhatsAppLoadedMedia; this change layers PTT routing and
security hardening on top.

Closes openclaw#66053

isVerifiedAudioSource previously returned true for any sanitized
contentType starting with audio/, allowing an attacker who controls
a mediaUrl response to spoof Content-Type: audio/... and coerce
WhatsApp PTT delivery of non-audio bytes (Aisle CWE-345).

Tighten the helper to only accept media.kind === "audio". The
contentType parameter is retained on the signature as a seam for a
future sniffedMime-based extension once magic-byte sniffing is
plumbed through loadWebMedia / detectMime cross-channel.

Addresses Aisle re-scan finding on PR openclaw#68744.
@masatohoshino masatohoshino force-pushed the fix/whatsapp-audio-as-voice-consumption branch from 76f1fdd to 6abf1f4 Compare April 25, 2026 00:06
@steipete

Contributor

Thanks @masatohoshino. I landed a narrower mainline fix for the issue in c2a2a481b2.

What landed:

  • sendTextMediaPayload now preserves payload.audioAsVoice when it fans out media sends.
  • The WhatsApp outbound adapter forwards audioAsVoice through sendMedia to sendMessageWhatsApp.
  • WhatsApp docs and changelog now document that reply payloads preserve audioAsVoice; WhatsApp audio media continues to go out as Baileys PTT voice notes.
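
The first bullet, preserving the flag during fan-out, might look roughly like this. The ReplyPayload and MediaSend shapes and the function name fanOutMediaSends are hypothetical, as is putting the caption only on the first send; the landed sendTextMediaPayload may be structured differently.

```typescript
// Hypothetical shapes for illustration; not the landed plugin-sdk types.
interface ReplyPayload {
  text: string | null;
  mediaUrls: string[];
  audioAsVoice: boolean | null;
}

interface MediaSend {
  mediaUrl: string;
  caption: string;
  audioAsVoice: boolean;
}

// One send per media URL. The caption goes on the first send only (assumption),
// while audioAsVoice is copied onto every send instead of being dropped.
function fanOutMediaSends(payload: ReplyPayload): MediaSend[] {
  return payload.mediaUrls.map((mediaUrl, index) => ({
    mediaUrl,
    caption: index === 0 ? (payload.text ?? "") : "",
    audioAsVoice: payload.audioAsVoice === true,
  }));
}
```

The bug class here is an object being rebuilt field by field: any flag not copied explicitly is lost silently, so a test asserting the flag on each fanned-out send guards against a regression.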

I intentionally did not land the broader MIME/filename helper bundle from this PR in this pass. The current root bug was the dropped shared payload flag, and the existing WhatsApp send path already maps audio MIME payloads to Baileys { audio, ptt: true }. New public SDK helper surface and cross-channel MIME hardening should be split/reviewed separately.

Dependency contract checked locally: installed Baileys types expose AnyMediaMessageContent audio payloads with ptt?: boolean, and createWebSendApi already sends audio as ptt: true.

Verification:

  • pnpm test src/plugin-sdk/reply-payload.test.ts extensions/whatsapp/src/outbound-adapter.sendpayload.test.ts extensions/whatsapp/src/outbound-base.test.ts extensions/whatsapp/src/send.test.ts extensions/whatsapp/src/auto-reply/deliver-reply.test.ts extensions/whatsapp/src/inbound/send-api.test.ts
  • pnpm format:check -- CHANGELOG.md docs/channels/whatsapp.md src/plugin-sdk/reply-payload.ts src/plugin-sdk/reply-payload.test.ts extensions/whatsapp/src/outbound-base.ts extensions/whatsapp/src/outbound-base.test.ts extensions/whatsapp/src/outbound-adapter.sendpayload.test.ts extensions/whatsapp/src/send.ts
  • pnpm docs:list
  • git diff --check
  • pnpm check:changed passed, including core/extension typecheck, lint, import-cycle guards, changed tests, and full extension test shards.

Closing this PR as superseded by the landed main commit.

Labels

channel: whatsapp-web Channel integration: whatsapp-web size: L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

WhatsApp: [[audio_as_voice]] / PTT voice-note delivery not working

2 participants