feat(media): native multimodal ingestion for inbound voice notes #4
Feed inbound WhatsApp voice notes to the model as a native audio part on the agent's own turn instead of pre-transcribing to text, preserving tone/pacing/ambient cues for multimodal models (Gemini). STT remains the pre-turn fallback (non-multimodal model, flag off, or bytes unloadable). Gated behind tools.media.audio.nativeIngestion (default off).

- AudioContent type + widened user-message content union + Model.input "audio" (both src/llm and agent-core type systems)
- images.ts: detectAudioReferences / loadAudioFromRef / detectAndLoadPromptAudio / modelSupportsAudioInput, mirroring the image prompt-detection path (convertMessages already emits inlineData)
- attempt.ts: detect+load audio, thread through prompt options; audio-only turns no longer count as a blank prompt
- on-message.ts: skip the STT preflight when nativeIngestion is enabled so the [media attached: ... (audio/...)] note survives to the prompt
- anthropic / openai converters drop audio parts (native audio is the Google-only path); google-shared needs no change
- config: tools.media.audio.nativeIngestion (+ zod schema)

Refs imperfect-co/tulgey#214.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
fix(media): address review - null transcript on native gate, add caf

- on-message.ts: when native ingestion skips the STT preflight, set preflightAudioTranscript = null (not undefined) and run the ack/status reaction first. Leaving it undefined let processMessage's internal STT fallback (gated on `=== undefined`) re-transcribe the audio and strip the [media attached: ... (audio/...)] note, defeating native ingestion.
- images.ts: add "caf" to AUDIO_EXTENSION_NAMES (iOS voice notes).
- test: assert preflightAudioTranscript is null (regression guard for the re-transcription path above).

Refs imperfect-co/tulgey#214.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
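The `null` versus `undefined` distinction in this fix amounts to a three-state value. Below is a minimal sketch of that idea, assuming a hypothetical `decideStt` helper and `SttDecision` type; only the `preflightAudioTranscript` name and the `=== undefined` gate come from the commit message.

```typescript
// Hypothetical sketch: decideStt and SttDecision are illustrative names,
// not functions from the PR.
type PreflightTranscript = string | null | undefined;

type SttDecision =
  | { kind: "use-transcript"; text: string }
  | { kind: "skip-stt" }
  | { kind: "run-fallback-stt" };

// Maps the three transcript states to what the downstream step should do.
function decideStt(preflightAudioTranscript: PreflightTranscript): SttDecision {
  if (preflightAudioTranscript === undefined) {
    // No preflight decision was recorded, so a fallback transcription may run.
    return { kind: "run-fallback-stt" };
  }
  if (preflightAudioTranscript === null) {
    // Preflight deliberately skipped STT (native ingestion): do not re-transcribe.
    return { kind: "skip-stt" };
  }
  // Preflight produced text: reuse it.
  return { kind: "use-transcript", text: preflightAudioTranscript };
}
```

Treating "not decided" and "decided to skip" as separate states is what keeps the downstream fallback from undoing the native-ingestion gate.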
@coderabbitai full review

✅ Actions performed: Full review triggered.
📝 Walkthrough

This PR adds first-class native audio input support across the platform: new AudioContent type, agent/harness/session audio parameters, prompt audio detection/loading, provider-specific message conversions, attempt-runner wiring, and WhatsApp native ingestion handling with tests.
Sequence Diagram: Audio flow through agent and LLM providers

sequenceDiagram
participant Agent as Agent/Harness
participant Attempt as Attempt Runner
participant AudioLoader as PromptAudioLoader
participant Provider as LLM Provider
Agent->>Agent: prompt(text, images, audio)
Agent->>Attempt: submit prompt with audio
Attempt->>Attempt: detect & load audio refs
Attempt->>AudioLoader: detectAndLoadPromptAudio
AudioLoader->>AudioLoader: modelSupportsAudioInput?
AudioLoader->>AudioLoader: detectAudioReferences
AudioLoader->>AudioLoader: loadAudioFromRef (per ref)
AudioLoader-->>Attempt: { audio: AudioContent[] }
Attempt->>Attempt: build message content (text/images/audio)
Attempt->>Provider: convertMessages(UserMessage)
Provider->>Provider: handle AudioContent block
Note over Provider: Anthropic: drop audio<br/>Google: inlineData parts<br/>OpenAI: drop audio
Provider-->>Attempt: converted message params
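
The provider note in the diagram (Google emits `inlineData` parts, Anthropic and OpenAI drop audio) can be sketched as follows. This is a minimal illustration rather than the PR's converters: the `AudioContent` field names (`data`, `mimeType`) and the helper names `toGoogleParts` and `dropAudio` are assumptions.

```typescript
// Assumed content shapes; the PR's real AudioContent type may differ.
interface TextContent { type: "text"; text: string }
interface AudioContent { type: "audio"; data: string; mimeType: string }
type UserContent = TextContent | AudioContent;

// Gemini-style request part: plain text or inline base64 media.
type GooglePart =
  | { text: string }
  | { inlineData: { mimeType: string; data: string } };

// Google path: audio blocks become inlineData parts.
function toGoogleParts(content: readonly UserContent[]): GooglePart[] {
  return content.map((block): GooglePart =>
    block.type === "text"
      ? { text: block.text }
      : { inlineData: { mimeType: block.mimeType, data: block.data } },
  );
}

// Anthropic / OpenAI path in this sketch: audio blocks are dropped.
function dropAudio(content: readonly UserContent[]): TextContent[] {
  return content.filter((block): block is TextContent => block.type === "text");
}
```

Dropping audio on the non-Google paths is only safe because STT is the fallback for those models, so the turn still carries a transcript.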
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
extensions/whatsapp/src/auto-reply/monitor/on-message.ts (1)
279-299: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Fix group voice-note spoken-mention gating when nativeIngestion is enabled

With `tools.media.audio.nativeIngestion === true`, `runAudioPreflightOnce()` sets `preflightAudioTranscript = null`, and the second `applyGroupGating` call only supplies `mentionText` when `typeof preflightAudioTranscript === "string"`, so no `mentionText` is provided. Since mention detection then runs against the original `msg.body` (e.g., `"<media:audio>"`), the retry pass still results in `shouldProcess: false`, and the handler returns before `processMessage` runs, silently dropping group voice notes meant to activate via spoken mentions.

Split the “audio STT skip” behavior from “mention gating needs STT”: either run a lightweight STT/transcription only to produce `mentionText` when `needsMentionText` is true (even under native ingestion), or explicitly document/guard this limitation for spoken-mention activation in groups with native ingestion.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@extensions/whatsapp/src/auto-reply/monitor/on-message.ts` around lines 279 - 299, The second applyGroupGating call is missing mentionText when runAudioPreflightOnce() sets preflightAudioTranscript = null under tools.media.audio.nativeIngestion, causing spoken-mention gating to fail; modify the flow around runAudioPreflightOnce(), preflightAudioTranscript and applyGroupGating so that when nativeIngestion is enabled but the gating/mention logic requires a spoken mention, you still obtain a lightweight STT/transcription or a dedicated mentionText (e.g., call a minimal transcription helper or preserve the preflight result) and pass it into applyGroupGating via the mentionText property (referencing runAudioPreflightOnce, preflightAudioTranscript, and applyGroupGating) so the retry pass supplies mentionText even with native ingestion enabled.

src/agents/agent-hooks/context-pruning/pruner.ts (1)
150-165: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Consider adding character estimation for audio content.

The function now accepts `AudioContent` but doesn't estimate any characters for audio blocks (they silently contribute 0). If audio is sent inline to multimodal models, it likely consumes context window quota. Consider adding an `AUDIO_CHAR_ESTIMATE` constant similar to `IMAGE_CHAR_ESTIMATE` (line 14) to account for audio in context pruning calculations.

💡 Suggested addition

     const IMAGE_CHAR_ESTIMATE = 8_000;
    +const AUDIO_CHAR_ESTIMATE = 12_000; // Rough estimate for audio content
     const PRUNED_CONTEXT_IMAGE_MARKER = "[image removed during context pruning]";

     function estimateTextAndImageChars(
       content: ReadonlyArray<TextContent | ImageContent | AudioContent>,
     ): number {
       let chars = 0;
       for (const block of content) {
         const text = coerceTextBlock(block);
         if (text !== null) {
           chars += estimateWeightedTextChars(text);
           continue;
         }
         if (isImageBlock(block)) {
           chars += IMAGE_CHAR_ESTIMATE;
    +      continue;
         }
    +    if (block.type === "audio") {
    +      chars += AUDIO_CHAR_ESTIMATE;
    +    }
       }
       return chars;
     }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/agents/agent-hooks/context-pruning/pruner.ts` around lines 150 - 165, The function estimateTextAndImageChars ignores AudioContent so audio blocks contribute 0; add a constant AUDIO_CHAR_ESTIMATE (similar to IMAGE_CHAR_ESTIMATE) and update estimateTextAndImageChars to detect audio blocks (e.g., via an existing isAudioBlock helper or a new check for AudioContent) and add AUDIO_CHAR_ESTIMATE to chars when encountered; keep existing handling for text via coerceTextBlock and images via isImageBlock, and ensure AUDIO_CHAR_ESTIMATE is exported/defined alongside IMAGE_CHAR_ESTIMATE so pruning uses a conservative audio token budget.

src/agents/embedded-agent-runner/run/attempt.ts (1)
3991-4012: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Thread media options through the runtime-only submission path.

Line 3991 still submits the runtime-only branch with only `preflightResult`. With the new `audioCount` skip logic, an audio-only turn can now reach this branch and still lose its loaded `audio` payload before the model call.

Proposed fix

    - try {
    -   if (promptSubmission.runtimeOnly) {
    -     await promptActiveSession(promptForSession, {
    -       preflightResult: armModelPromptTransform,
    -     });
    -   } else {
    -     const cleanupRuntimeContextMessage = installRuntimeContextMessageForPrompt({
    -       session: activeSession,
    -       message: runtimeContextMessageForCurrentTurn,
    -     });
    -     try {
    -       // Only attach images/audio options when present so models
    -       // that don't expect those parameters aren't handed empties.
    -       const promptOptions: Parameters<typeof promptActiveSession>[1] = {
    -         preflightResult: armModelPromptTransform,
    -       };
    -       if (imageResult.images.length > 0) {
    -         promptOptions.images = imageResult.images;
    -       }
    -       if (audioResult.audio.length > 0) {
    -         promptOptions.audio = audioResult.audio;
    -       }
    -       await promptActiveSession(promptForSession, promptOptions);
    -     } finally {
    -       cleanupRuntimeContextMessage();
    -     }
    -   }
    + try {
    +   // Only attach images/audio options when present so models
    +   // that don't expect those parameters aren't handed empties.
    +   const promptOptions: Parameters<typeof promptActiveSession>[1] = {
    +     preflightResult: armModelPromptTransform,
    +   };
    +   if (imageResult.images.length > 0) {
    +     promptOptions.images = imageResult.images;
    +   }
    +   if (audioResult.audio.length > 0) {
    +     promptOptions.audio = audioResult.audio;
    +   }
    +   if (promptSubmission.runtimeOnly) {
    +     await promptActiveSession(promptForSession, promptOptions);
    +   } else {
    +     const cleanupRuntimeContextMessage = installRuntimeContextMessageForPrompt({
    +       session: activeSession,
    +       message: runtimeContextMessageForCurrentTurn,
    +     });
    +     try {
    +       await promptActiveSession(promptForSession, promptOptions);
    +     } finally {
    +       cleanupRuntimeContextMessage();
    +     }
    +   }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/agents/embedded-agent-runner/run/attempt.ts` around lines 3991 - 4012, The runtime-only branch currently calls promptActiveSession(promptForSession, { preflightResult: armModelPromptTransform }) and therefore drops image/audio payloads for audio-only or image-containing turns; modify the runtime-only path in the promptSubmission.runtimeOnly branch to build the same promptOptions object used in the non-runtime path (include preflightResult and conditionally add images from imageResult.images and audio from audioResult.audio) and pass that promptOptions to promptActiveSession so media is preserved; reference promptSubmission, promptActiveSession, promptForSession, promptOptions, imageResult, and audioResult to locate and update the code.
🧹 Nitpick comments (1)
extensions/whatsapp/src/auto-reply/monitor/on-message.audio-preflight.test.ts (1)
475-510: ⚡ Quick win

Add group coverage for native ingestion.

These tests only exercise the DM path. Given the group mention-gating interaction flagged in `on-message.ts`, consider adding a `makeGroupAudioMsg()` test with `nativeIngestion: true` to lock in the intended behavior for spoken-mention voice notes in groups.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@extensions/whatsapp/src/auto-reply/monitor/on-message.audio-preflight.test.ts` around lines 475 - 510, Add a new test variant that covers group messages when native audio ingestion is enabled: create a test similar to the existing "runs STT when native audio ingestion is disabled (fallback path)" but call createWebOnMessageHandler with cfg.tools.media.audio.nativeIngestion: true and invoke makeGroupAudioMsg() instead of makeAudioMsg(); assert the expected behavior for group spoken-mention voice notes (e.g., whether transcribeFirstAudioMock is called, events includes "stt" or not, and the preflightAudioTranscript on processMessage depending on on-message.ts mention-gating logic) so the group path and mention gating implemented in on-message.ts are exercised. Ensure you reuse the same handler config fields (connectionId, maxMediaBytes, groupHistoryLimit, groupHistories, groupMemberNames, echoTracker, replyResolver, replyLogger, baseMentionConfig, account) to mirror the DM test and make assertions against the same mocks (transcribeFirstAudioMock, processMessageMock, events).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/agents/embedded-agent-runner/run/images.ts`:
- Around line 719-775: detectAudioReferences currently only parses
MEDIA_ATTACHED_PATTERN blocks and misses standalone local audio refs; update
detectAudioReferences to also scan the entire prompt for plain local/file URIs
by adding a second pass (after the existing MEDIA_ATTACHED loop) that uses a
regex to find file:// URIs and path-like tokens (e.g., ./, ../, ~/ or bare
filenames with audio extensions) and then reuse isAudioExtension,
assertNoWindowsNetworkPath, resolveUserPath and normalizeRefForDedupe to
validate, dedupe (using seen) and push entries with appropriate type ("path" or
"media-uri" for file://). Ensure you also treat file:///... as a media-uri or
convert to a resolved local path consistently and skip http/https, keeping
existing behavior for MEDIA_URI_REGEX handling.
In `@src/agents/sessions/agent-session.ts`:
- Around line 1198-1205: Queued prompts are losing audio because audio is only
appended to userContent in the non-streaming path; update the streaming/queued
path (the code paths that call queueSteer() / queueFollowUp() from prompt()) to
include options?.audio the same way as currentImages is included so queued turns
keep their audio attachments. Locate where currentImages is passed into
queueSteer/queueFollowUp (and where queued message content is constructed) and
push ...options.audio or otherwise include the AudioContent entries into the
queued user content before calling queueSteer/queueFollowUp.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 897a088d-f759-4b9e-af37-ea4cb2e0c2c0
📒 Files selected for processing (22)
- extensions/openai/openai-codex-provider.ts
- extensions/whatsapp/src/auto-reply/monitor/on-message.audio-preflight.test.ts
- extensions/whatsapp/src/auto-reply/monitor/on-message.ts
- packages/agent-core/src/agent.ts
- packages/agent-core/src/harness/agent-harness.ts
- packages/agent-core/src/llm.ts
- src/agents/agent-hooks/context-pruning/pruner.ts
- src/agents/embedded-agent-runner/run/attempt.prompt-helpers.ts
- src/agents/embedded-agent-runner/run/attempt.ts
- src/agents/embedded-agent-runner/run/images.test.ts
- src/agents/embedded-agent-runner/run/images.ts
- src/agents/sessions/agent-session.ts
- src/commands/models/list.model-row.ts
- src/commands/models/list.rows.ts
- src/config/types.tools.ts
- src/config/zod-schema.core.ts
- src/llm/providers/anthropic.ts
- src/llm/providers/google-shared.convert.test.ts
- src/llm/providers/openai-completions.ts
- src/llm/providers/openai-responses-shared.ts
- src/llm/providers/transform-messages.ts
- src/llm/types.ts
fix(media): address review - audio detection gaps + queued-prompt audio + test mock

Fixes the CI regression and both CodeRabbit Major findings on #4.

- test mock (CI fix): the shared attempt.spawn-workspace test-support mock for ./images.js only stubbed detectAndLoadPromptImages, so attempt.ts's new detectAndLoadPromptAudio import threw "No export is defined" under vitest, cascading into 27 failures in the embedded-agent shard. Add the audio mock.
- detectAudioReferences (CR finding): only scanned [media attached: ...] blocks, so plain refs (./voice.ogg, ~/memo.caf, file:///tmp/note.wav) silently fell back to text. Add audio-extension variants of the file://, Windows-drive, and bare-path passes the image detector already runs, gated on isAudioExtension. PATH_PATTERN's leading-boundary requirement keeps media://inbound/<id> URIs from being misparsed as filesystem paths. +6 detectAudioReferences tests.
- queued-prompt audio (CR finding): prompt() while streaming routed through queueSteer/queueFollowUp with currentImages only, dropping options.audio. Thread audio through both queue methods and the streaming call site.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
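
The plain-reference pass described in this commit can be sketched roughly as below. It is a simplified illustration under stated assumptions: the extension list, the regular expression, and the `detectPlainAudioRefs` name are hypothetical, and the Windows-drive pass is omitted. Only `isAudioExtension` and the leading-boundary idea are named in the commit message.

```typescript
// Assumed extension list for illustration; the PR's AUDIO_EXTENSION_NAMES may differ.
const AUDIO_EXTENSIONS = new Set(["ogg", "opus", "mp3", "wav", "m4a", "caf"]);

function isAudioExtension(ref: string): boolean {
  const ext = /\.([a-z0-9]+)$/i.exec(ref)?.[1]?.toLowerCase();
  return ext !== undefined && AUDIO_EXTENSIONS.has(ext);
}

// Finds file:// URIs and ./, ../, ~/ or absolute paths ending in an audio extension.
// The (?:^|\s) prefix is the leading boundary: a token such as
// media://inbound/<id>.ogg never starts a match, so it is not misread as a path.
function detectPlainAudioRefs(prompt: string): string[] {
  const pattern = /(?:^|\s)(file:\/\/\S+|(?:\.{1,2}|~)?\/\S+)/g;
  const refs: string[] = [];
  const seen = new Set<string>();
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(prompt)) !== null) {
    // Strip trailing sentence punctuation before checking the extension.
    const ref = (match[1] ?? "").replace(/[.,;:!?)\]]+$/, "");
    if (!isAudioExtension(ref) || seen.has(ref)) continue;
    seen.add(ref);
    refs.push(ref);
  }
  return refs;
}
```

Gating every pass on the audio extension keeps this detector from picking up image paths that the existing image detector already handles.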
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/agents/embedded-agent-runner/run/images.ts (1)
778-795: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Verify inbound audio claim-check URIs always carry an audio MIME or extension.

A claim-check URI is classified as audio only when the bracket contains `(audio/...)` or the id has an audio extension. Unlike `detectImageReferences` (which accepts any `media://` URI), this requires a discriminator. Since `on-message.ts` skips the STT preflight when `nativeIngestion` is enabled, an inbound voice-note URI that lacks both signals would be detected as neither image nor audio, and with STT already skipped, the audio cue is lost entirely (the exact failure this PR aims to prevent).

Confirm the Gateway/store format for inbound audio attachments retains an audio extension on the media id or emits the `(audio/...)` annotation.

    #!/bin/bash
    # Inspect how inbound media claim-check entries are formatted, and whether
    # audio attachments retain an extension or MIME annotation in the bracket.
    rg -nP -C4 'media attached' --type=ts -g '!**/*.test.ts'
    rg -nP -C3 'media://inbound' --type=ts -g '!**/*.test.ts'
    # Filename sanitization / extension retention in store
    fd -i 'store.ts' --exec rg -nP -C3 'sanitizeFilename|extname|\.ogg|audio/'

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/agents/embedded-agent-runner/run/images.ts` around lines 778 - 795, The current block using MEDIA_URI_REGEX only adds a media claim-check when the bracket contains an (audio/...) MIME or the id has an audio extension, causing inbound audio URIs without those signals to be skipped; update the logic in the MEDIA_URI_REGEX handling so that any matched media://inbound/<id> is normalized (use normalizeRefForDedupe) and added to refs (and seen) as type "media-uri" regardless of isAudio, instead of continuing early when !isAudio; keep isAudio detection (via isAudioExtension and the /\(audio\// check) for downstream STT decision but do not prevent adding the reference, and ensure this change aligns with detectImageReferences behavior that accepts any media:// URIs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e874d210-2906-4165-aa58-28519c0e5659
📒 Files selected for processing (4)
- src/agents/embedded-agent-runner/run/attempt.spawn-workspace.test-support.ts
- src/agents/embedded-agent-runner/run/images.test.ts
- src/agents/embedded-agent-runner/run/images.ts
- src/agents/sessions/agent-session.ts
🚧 Files skipped from review as they are similar to previous changes (2)
- src/agents/embedded-agent-runner/run/images.test.ts
- src/agents/sessions/agent-session.ts
test(media): lock in audio claim-check discriminator behavior

Addresses CodeRabbit re-review on #4 (images.ts:778-795). The audio media-uri discriminator (audio MIME or audio extension on the id) is deliberate: it keeps image and audio media-uri detection independent. CR proposed accepting any media:// URI like detectImageReferences does; that would misclassify every inbound image URI as audio.

Verified the underlying assumption instead: inbound audio URIs always carry the discriminator. store.ts appends a MIME-derived extension to the saved id, and the WhatsApp ingest note carries (audio/...). Add two tests pinning this: a MIME-less id-extension-only URI is still detected as audio, and image URIs stay out of audio detection.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
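
The discriminator this commit pins down can be sketched as follows. This is a minimal illustration assuming a simplified note pattern and extension list; `detectAudioMediaUris` and `hasAudioExtension` are hypothetical names, not the functions in `images.ts`.

```typescript
// Assumed extension list for illustration only.
const AUDIO_EXTENSIONS = new Set(["ogg", "opus", "mp3", "wav", "m4a", "caf"]);

function hasAudioExtension(ref: string): boolean {
  const ext = /\.([a-z0-9]+)$/i.exec(ref)?.[1]?.toLowerCase();
  return ext !== undefined && AUDIO_EXTENSIONS.has(ext);
}

// Returns media://inbound/<id> URIs from [media attached: ...] notes, but only
// when the note carries an (audio/...) MIME or the id ends in an audio extension.
function detectAudioMediaUris(prompt: string): string[] {
  const pattern = /\[media attached:\s*(media:\/\/inbound\/[^\s\]|]+)([^\]]*)\]/g;
  const refs: string[] = [];
  const seen = new Set<string>();
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(prompt)) !== null) {
    const uri = match[1] ?? "";
    const annotation = match[2] ?? "";
    const isAudio = /\(audio\//.test(annotation) || hasAudioExtension(uri);
    if (!isAudio || seen.has(uri)) continue;
    seen.add(uri);
    refs.push(uri);
  }
  return refs;
}
```

Either signal alone is enough, which mirrors the two pinned cases: an id-extension-only URI counts as audio, and an image URI never does.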
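Re: the outside-diff finding on images.ts (778-795): the discriminator (audio MIME or audio extension on the id) is deliberate. The worried failure (an audio URI carrying neither signal, with STT skipped) does not arise in the pipeline. Inbound audio URIs always carry the discriminator: store.ts appends a MIME-derived extension to the saved id, and the WhatsApp ingest note carries (audio/...).

Locked both properties in with tests: a MIME-less, id-extension-only URI is still detected as audio, and image URIs stay out of audio detection.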
Consolidates the still-needed local /opt/openclaw hot-patches onto fork-main (= upstream 94db48d + native audio #4) so the membrane VM can cut over to the fork and ship native audio ingestion.

These three files were untouched by upstream in the 99d96c1→94db48d0 window, so they transplant verbatim:

- extensions/google/video-generation-provider.ts: the openclaw#172 Vertex REST-bearer bypass (load-bearing per tulgey#194; SDK auth path is the openclaw#175 bug) + the #3 default-1080p resolution.
- extensions/google/generation-provider-metadata.ts: Veo companion.
- src/cli/program/message/register.send.ts: companion.

Dropped: the session-lock patch (openclaw#195). Upstream made waitForSessionEventQueue a no-op by 94db48d, so it is obsolete.

Deferred (fast-follow, refs tulgey#218): src/auto-reply/dispatch.ts (the ADR 0015 inbound-message-sequencing coalescing rewrite) and src/infra/dotenv.ts; both conflict structurally with fork-main and need a careful port + review.

Refs imperfect-co/tulgey#218.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What and why

Inbound WhatsApp voice notes are pre-transcribed to text before the agent sees them, which discards tone, pacing, and ambient cues. The agent runs Gemini (natively multimodal), so this feeds the raw audio to the model as a native `audio/ogg` part on its own turn instead. Speech-to-text stays as the pre-turn fallback.

Framework decision + spec: imperfect-co/tulgey#214 and imperfect-co/tulgey#215 (ADR 0022). Verified before coding that `gemini-3.5-flash` on this deployment's Vertex genuinely transcribes native audio (probe in the tulgey issue).

How

Mirrors openclaw's existing image path: an inbound `[media attached: media://inbound/<id>.ogg (audio/ogg) | url]` note is re-detected in the prompt, loaded off disk, base64'd, and attached as an `inlineData` part. `convertMessages` already emits `inlineData` for any non-text content item, so no provider change was needed there.

- `AudioContent` type + widened user-message content union + `Model.input` "audio" (both `src/llm` and `agent-core` type systems).
- `images.ts`: `detectAudioReferences` / `loadAudioFromRef` / `detectAndLoadPromptAudio` / `modelSupportsAudioInput`. Audio-only detection (media:// only when `(audio/…)` or audio extension), so it never collides with image detection.
- `attempt.ts`: detect + load audio, thread through the prompt options; audio-only turns no longer count as a blank prompt.
- `on-message.ts`: when `tools.media.audio.nativeIngestion` is on, skip the STT preflight (after the ack/status reaction) so the audio note survives to the prompt. Sets `preflightAudioTranscript = null` so `processMessage`'s internal STT fallback doesn't re-transcribe.
- `anthropic` / `openai` converters drop audio parts (native audio is the Google-only path).
- `tools.media.audio.nativeIngestion` (default off).

Gating / rollout

Default-off and additive: no behavior change until an operator sets `tools.media.audio.nativeIngestion: true` on a deployment whose agent runs a multimodal model. STT remains the fallback when the flag is off or the model isn't multimodal (decided pre-turn in the embedded runner via `modelSupportsAudioInput`).

Validation

- `tsgo` core + extensions: clean.
- … `inlineData`), `images` (capability predicate, load, detect), whatsapp preflight gate (skip + null transcript + fallback). 72 cases across the three suites.

Review dispositions (CodeRabbit)

- `caf` extension: fixed (added; iOS voice notes).
- `null` vs `undefined` transcript on the native gate: fixed, and it was a real bug. Leaving it `undefined` let `process-message.ts` re-run STT internally (gated on `=== undefined`), which would strip the audio note and defeat native ingestion. Now `null`, with a regression-guard test. Also moved the gate after the ack/status reaction so voice notes still get acknowledged.
- `detectAudioReferences` doesn't scan bare paths / `file://`: declined. Inbound voice notes always arrive as a `[media attached: … (audio/…)]` note (`media-note.ts`); unlike images, audio isn't referenced by a typed bare path in the inbound flow. Scanning `MEDIA_ATTACHED_PATTERN` is the intended scope; bare-path audio detection would be speculative surface for an input that doesn't occur.

Deploy (after merge)

Build at `/opt/openclaw` from this branch, set `tools.media.audio.nativeIngestion: true` in `openclaw.json`, `systemctl restart openclaw-gateway.service`, then send a real voice note to the deployment and confirm via Logfire (audio `inlineData` part on the turn) + a forced-fallback check.
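
The pre-turn decision described under Gating / rollout can be sketched as below. This is a hedged illustration, not the embedded runner's code: the `ModelInfo` and `AudioToolsConfig` shapes and the `chooseAudioRoute` helper are assumptions, and the body of `modelSupportsAudioInput` here is a guess at a capability check on `Model.input`.

```typescript
// Assumed shapes for illustration only.
interface ModelInfo { input: readonly string[] }
interface AudioToolsConfig {
  tools?: { media?: { audio?: { nativeIngestion?: boolean } } };
}

// Guessed implementation: the model declares "audio" among its input kinds.
function modelSupportsAudioInput(model: ModelInfo): boolean {
  return model.input.includes("audio");
}

type AudioRoute = "native" | "stt";

// Native ingestion needs all three: flag on, multimodal model, bytes loadable.
// Anything else falls back to speech-to-text.
function chooseAudioRoute(
  cfg: AudioToolsConfig,
  model: ModelInfo,
  bytesLoaded: boolean,
): AudioRoute {
  const flagOn = cfg.tools?.media?.audio?.nativeIngestion === true;
  return flagOn && modelSupportsAudioInput(model) && bytesLoaded ? "native" : "stt";
}
```

Because the flag is read with a strict `=== true`, a missing or partial config resolves to the STT route, which matches the default-off rollout.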