fix(media): normalize MIME kind detection for WhatsApp audio transcription #32280

Merged
steipete merged 4 commits into openclaw:main from Lucenx9:fix/whatsapp-voice-transcription-32200
Mar 2, 2026

Lucenx9 (Contributor) commented Mar 2, 2026

Summary

Describe the problem and fix in 2–5 bullets:

  • Problem: media attachment kind detection used raw MIME strings, which could fail on MIME values with mixed casing/whitespace and parameters.
  • Why it matters: when MIME classification fails and filename extension is missing/ambiguous, audio attachments can be skipped and transcription never starts.
  • What changed: normalized MIME before kindFromMime classification, and added a regression test for WhatsApp-style audio/ogg; codecs=opus with scope rules (chatType: "dm", channel: "whatsapp").
  • What did NOT change (scope boundary): no change to provider selection, execution model order, or WhatsApp transport/download path.
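
A minimal sketch of the idea behind the fix, assuming a prefix-based classifier. The names `normalizeMime`, `classify`, `kindFromRawMime`, and `kindFromNormalizedMime` are hypothetical stand-ins for illustration, not the actual helpers in src/media/mime.ts:

```typescript
type MediaKind = "image" | "audio" | "video" | "document" | "unknown";

// Hypothetical normalizer: drop parameters after ";", trim, lowercase.
function normalizeMime(raw?: string | null): string | undefined {
  if (!raw) return undefined;
  const base = (raw.split(";")[0] ?? "").trim().toLowerCase();
  return base.length > 0 ? base : undefined;
}

// Hypothetical classifier that only understands canonical "type/subtype" values.
function classify(mime?: string): MediaKind {
  if (!mime) return "unknown";
  if (mime.startsWith("image/")) return "image";
  if (mime.startsWith("audio/")) return "audio";
  if (mime.startsWith("video/")) return "video";
  if (mime === "application/pdf" || mime.startsWith("text/")) return "document";
  return "unknown";
}

// Before the fix (sketch): the raw header value is classified directly.
function kindFromRawMime(raw?: string | null): MediaKind {
  return classify(raw ?? undefined);
}

// After the fix (sketch): normalize first, then classify.
function kindFromNormalizedMime(raw?: string | null): MediaKind {
  return classify(normalizeMime(raw));
}

const whatsappMime = " Audio/Ogg; codecs=opus ";
console.log(kindFromRawMime(whatsappMime)); // "unknown"
console.log(kindFromNormalizedMime(whatsappMime)); // "audio"
```

With a filename such as voice-note that carries no extension, there is nothing left to fall back on once the raw classification returns "unknown", which is the situation described in the "Why it matters" bullet.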

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

  • Audio transcription trigger is now more robust when inbound MIME contains casing/whitespace/parameter variations.

Security Impact (required)

  • New permissions/capabilities? (No)
  • Secrets/tokens handling changed? (No)
  • New/changed network calls? (No)
  • Command/tool execution surface changed? (No)
  • Data access scope changed? (No)
  • If any Yes, explain risk + mitigation:

Repro + Verification

Environment

  • OS: Ubuntu 24.04 (dev env)
  • Runtime/container: Node 22 + pnpm
  • Model/provider: mocked provider in tests (Vitest)
  • Integration/channel (if any): WhatsApp-style MIME scenario in media-understanding tests
  • Relevant config (redacted): tools.media.audio.scope.rules with chatType: "dm" and channel: "whatsapp"

Steps

  1. Provide an audio attachment with MIME " Audio/Ogg; codecs=opus " and non-audio filename (voice-note).
  2. Run media understanding with audio enabled and WhatsApp-like scope rules.
  3. Confirm audio transcription is applied.

Expected

  • Audio is classified correctly and transcription runs.

Actual

  • With this patch: transcription runs and transcript is applied.

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios:
    • pnpm vitest run src/media-understanding/apply.test.ts src/media/mime.test.ts
    • Added regression case: transcribes WhatsApp audio with parameterized MIME despite casing/whitespace
  • Edge cases checked:
    • MIME with parameters and mixed casing/whitespace
    • scope rules including chatType: "dm" + channel: "whatsapp"
  • What you did not verify:
    • Full end-to-end WhatsApp runtime on a live account in this environment

Compatibility / Migration

  • Backward compatible? (Yes)
  • Config/env changes? (No)
  • Migration needed? (No)
  • If yes, exact upgrade steps:

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly:
    • Revert commit fix(media): normalize MIME kind detection for audio transcription.
  • Files/config to restore:
    • src/media/mime.ts
    • src/media-understanding/apply.test.ts (test-only)
  • Known bad symptoms reviewers should watch for:
    • Unexpected media kind changes for non-audio MIME inputs.

Risks and Mitigations

List only real risks for this PR. Add/remove entries as needed. If none, write None.

  • Risk:
    • MIME normalization could alter classification behavior for unusual mixed-case MIME strings.
    • Mitigation:
      • Added regression coverage and kept behavior scoped to kindFromMime normalization only.

AI Assistance

  • AI-assisted: Yes (Codex)
  • Testing level: Lightly tested (targeted Vitest suites above)

greptile-apps bot (Contributor) commented Mar 2, 2026

Greptile Summary

This PR normalizes MIME types before kindFromMime classification in src/media/mime.ts, fixing audio attachment kind detection for WhatsApp-style MIME strings with mixed casing, leading/trailing whitespace, and codec parameters (e.g. " Audio/Ogg; codecs=opus "). The companion regression test in apply.test.ts verifies the MIME normalization path, though there is a scope rule shadowing issue in the test setup (see inline comment).

Changes:

  • src/media/mime.ts: kindFromMime now wraps its input in normalizeMimeType before calling mediaKindFromMime — consistent with the normalization already applied in detectMimeImpl for headerMime.
  • src/media-understanding/apply.test.ts: Adds a regression test for WhatsApp audio with parameterized MIME (" Audio/Ogg; codecs=opus ").

Issue found:

  • The regression test sets ctx.ChatType = "direct" alongside scope rules that include { chatType: "dm" }. Since "dm" normalizes to "direct", Rule 1 matches before the intended channel: "whatsapp" Rule 2 is ever evaluated. The WhatsApp channel scope rule is not actually exercised by this test, weakening its value as a WhatsApp-specific regression case.

Confidence Score: 4/5

  • Safe to merge — the production fix is correct and minimal; the only issue is a test design flaw that doesn't affect runtime behaviour.
  • The one-line change to kindFromMime is correct, well-scoped, and consistent with existing normalization in detectMimeImpl. The only concern is the regression test which does not fully exercise the WhatsApp channel scope rule it implies, reducing test coverage confidence for that specific path. No production behaviour is broken.
  • src/media-understanding/apply.test.ts — the WhatsApp channel scope rule (Rule 2) is shadowed by the chatType rule (Rule 1) and is never evaluated in the new test case.

Last reviewed commit: 154703a

greptile-apps bot (Contributor) left a comment

2 files reviewed, 1 comment

Comment on lines +369 to +382
    ctx.ChatType = "direct";
    ctx.Surface = "whatsapp";

    const cfg: OpenClawConfig = {
      tools: {
        media: {
          audio: {
            enabled: true,
            maxBytes: 1024 * 1024,
            scope: {
              default: "deny",
              rules: [
                { action: "allow", match: { chatType: "direct" } },
                { action: "allow", match: { channel: "whatsapp" } },

WhatsApp channel scope rule is never evaluated in this test

The test is titled "transcribes WhatsApp audio with parameterized MIME despite casing/whitespace" and includes two scope rules to represent a WhatsApp-like configuration:

  1. { action: "allow", match: { chatType: "dm" } } — normalizes to "direct"
  2. { action: "allow", match: { channel: "whatsapp" } } — intended to cover WhatsApp surface

However, ctx.ChatType is set to "direct", and normalizeChatType("dm") also resolves to "direct" (see src/channels/chat-type.ts). The scope resolver iterates rules in order, so Rule 1 matches first and returns "allow" — Rule 2 (channel: "whatsapp") is never reached.

This means the test passes regardless of whether the WhatsApp channel rule (or even ctx.Surface) is present. If the intent is to specifically validate that the channel: "whatsapp" rule enables transcription, the chatType on the context should NOT match Rule 1. For example, omitting ctx.ChatType (leaving it undefined) would force evaluation to fall through to the channel rule:

Suggested change
-    ctx.ChatType = "direct";
-    ctx.Surface = "whatsapp";
-    const cfg: OpenClawConfig = {
-      tools: {
-        media: {
-          audio: {
-            enabled: true,
-            maxBytes: 1024 * 1024,
-            scope: {
-              default: "deny",
-              rules: [
-                { action: "allow", match: { chatType: "direct" } },
-                { action: "allow", match: { channel: "whatsapp" } },
+    const ctx = await createAudioCtx({
+      fileName: "voice-note",
+      mediaType: " Audio/Ogg; codecs=opus ",
+    });
+    ctx.Surface = "whatsapp";
+    // ChatType intentionally not set — scope relies on channel: "whatsapp" rule
+    const cfg: OpenClawConfig = {
+      tools: {
+        media: {
+          audio: {
+            enabled: true,
+            maxBytes: 1024 * 1024,
+            scope: {
+              default: "deny",
+              rules: [
+                { action: "allow", match: { channel: "whatsapp" } },
+              ],
+            },
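
The shadowing described in this comment can be reproduced with a small first-match-wins resolver. This is only a sketch under stated assumptions: `resolveScope`, `normalizeChatTypeSketch`, and the `ScopeRule` shape are hypothetical and simplified, not the project's actual scope resolver or `normalizeChatType`.

```typescript
type ScopeAction = "allow" | "deny";

interface ScopeMatch {
  chatType?: string;
  channel?: string;
}

interface ScopeRule {
  action: ScopeAction;
  match: ScopeMatch;
}

// Hypothetical alias handling: treat "dm" as a synonym for "direct".
function normalizeChatTypeSketch(value?: string): string | undefined {
  const v = value?.trim().toLowerCase();
  if (!v) return undefined;
  return v === "dm" ? "direct" : v;
}

// First matching rule wins; ruleIndex is -1 when the default applies.
function resolveScope(
  rules: ScopeRule[],
  ctx: ScopeMatch,
  fallback: ScopeAction,
): { action: ScopeAction; ruleIndex: number } {
  const ctxChatType = normalizeChatTypeSketch(ctx.chatType);
  const ruleIndex = rules.findIndex((rule) => {
    const wantChatType = normalizeChatTypeSketch(rule.match.chatType);
    if (wantChatType && wantChatType !== ctxChatType) return false;
    if (rule.match.channel && rule.match.channel !== ctx.channel) return false;
    return true;
  });
  const matched = rules[ruleIndex];
  return matched
    ? { action: matched.action, ruleIndex }
    : { action: fallback, ruleIndex: -1 };
}

const rules: ScopeRule[] = [
  { action: "allow", match: { chatType: "dm" } },
  { action: "allow", match: { channel: "whatsapp" } },
];

// Original test setup: the chatType rule (index 0) matches first.
const shadowed = resolveScope(rules, { chatType: "direct", channel: "whatsapp" }, "deny");
// ChatType left unset: evaluation falls through to the channel rule (index 1).
const channelOnly = resolveScope(rules, { channel: "whatsapp" }, "deny");
console.log(shadowed.ruleIndex, channelOnly.ruleIndex); // 0 1
```

Both calls return "allow", which is why the original test passed either way; only the matched rule index shows whether the WhatsApp channel rule was exercised.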

Contributor Author

Good catch — updated. The test now relies only on the channel: "whatsapp" scope rule (no matching chatType rule), so it explicitly exercises the WhatsApp channel path.

steipete force-pushed the fix/whatsapp-voice-transcription-32200 branch from bb3e053 to fae7947 on March 2, 2026 23:31
steipete merged commit de77a36 into openclaw:main on Mar 2, 2026

steipete (Contributor) commented Mar 2, 2026

Landed via temp rebase onto main.

  • Gate: pnpm -s vitest run src/media/mime.test.ts src/media-understanding/apply.test.ts && pnpm -s tsgo
  • Land commit: fae7947
  • Merge commit: de77a36

Thanks @Lucenx9!

mizoz left a comment

LGTM ✅ - Fixes MIME kind detection for WhatsApp audio. The fix normalizes MIME type before classification, handling parameterized MIME strings like audio/ogg; codecs=opus with mixed casing/whitespace. The regression test covers the WhatsApp-specific scenario well.

dawi369 pushed a commit to dawi369/davis that referenced this pull request Mar 3, 2026
OWALabuy pushed a commit to kcinzgg/openclaw that referenced this pull request Mar 4, 2026
zooqueen pushed a commit to hanzoai/bot that referenced this pull request Mar 6, 2026