feat(media): native multimodal ingestion for inbound voice notes #4

Merged
matin merged 4 commits into main from native-audio-ingest on Jun 2, 2026

@matin matin commented Jun 1, 2026

Owner

What and why

Inbound WhatsApp voice notes are pre-transcribed to text before the agent sees them, which discards tone, pacing, and ambient cues. The agent runs Gemini (natively multimodal), so this PR instead feeds the raw audio to the model as a native audio/ogg part on the agent's own turn. Speech-to-text (STT) stays as the pre-turn fallback.

Framework decision + spec: imperfect-co/tulgey#214 and imperfect-co/tulgey#215 (ADR 0022). Verified before coding that gemini-3.5-flash on this deployment's Vertex genuinely transcribes native audio (probe in the tulgey issue).

How

Mirrors openclaw's existing image path: an inbound [media attached: media://inbound/<id>.ogg (audio/ogg) | url] note is re-detected in the prompt, loaded off disk, base64'd, and attached as an inlineData part. convertMessages already emits inlineData for any non-text content item, so no provider change was needed there.

  • AudioContent type + widened user-message content union + Model.input "audio" (both src/llm and agent-core type systems).
  • images.ts: detectAudioReferences / loadAudioFromRef / detectAndLoadPromptAudio / modelSupportsAudioInput. Detection is audio-only: a media:// URI counts as audio only when its note carries an (audio/…) annotation or the id has an audio extension, so it never collides with image detection.
  • attempt.ts: detect + load audio, thread through the prompt options; audio-only turns no longer count as a blank prompt.
  • on-message.ts: when tools.media.audio.nativeIngestion is on, skip the STT preflight (after the ack/status reaction) so the audio note survives to the prompt. Sets preflightAudioTranscript = null so processMessage's internal STT fallback doesn't re-transcribe.
  • anthropic / openai converters drop audio parts (native audio is the Google-only path).
  • Config: tools.media.audio.nativeIngestion (default off).
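
The detection and attachment steps listed above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: detectAudioNotes, toInlineDataPart, the regular expression, and the interface shapes are all assumptions made for the example, and the real helpers in images.ts (detectAudioReferences, loadAudioFromRef) may differ. Loading bytes off disk is left out, so the sketch starts from already base64-encoded data.

```typescript
// Hypothetical shapes for illustration; the PR's real AudioContent and
// provider part types may differ.
interface AudioContent {
  type: "audio";
  data: string; // base64-encoded audio bytes
  mimeType: string;
}

interface InlineDataPart {
  inlineData: { mimeType: string; data: string };
}

interface AudioNoteRef {
  uri: string;
  mimeType: string;
}

// Finds notes such as:
//   [media attached: media://inbound/abc123.ogg (audio/ogg) | url]
// Only notes annotated with an audio/* MIME type are captured, so image
// notes are left to the image detector. (Assumed pattern, not the PR's.)
function detectAudioNotes(prompt: string): AudioNoteRef[] {
  const pattern =
    /\[media attached:\s*(media:\/\/\S+)\s+\((audio\/[\w.+-]+)\)[^\]]*\]/g;
  const seen = new Set<string>();
  const refs: AudioNoteRef[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(prompt)) !== null) {
    const uri = match[1];
    const mimeType = match[2];
    if (seen.has(uri)) continue; // skip repeated notes for the same file
    seen.add(uri);
    refs.push({ uri, mimeType });
  }
  return refs;
}

// Wraps loaded audio in the inlineData shape described in the PR text.
function toInlineDataPart(audio: AudioContent): InlineDataPart {
  return { inlineData: { mimeType: audio.mimeType, data: audio.data } };
}
```

Requiring the (audio/…) annotation in the pattern is what keeps an image note from being picked up here; the extension-based rule mentioned in the list is not covered by this sketch.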

Gating / rollout

Default-off and additive — no behavior change until an operator sets tools.media.audio.nativeIngestion: true on a deployment whose agent runs a multimodal model. STT remains the fallback when the flag is off or the model isn't multimodal (decided pre-turn in the embedded runner via modelSupportsAudioInput).
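
As a rough sketch of the gating rule, the two conditions can be collapsed into one function. This is an illustration only: in the PR the flag is checked in on-message.ts and the capability check runs in the embedded runner via modelSupportsAudioInput, whereas chooseAudioRoute, ModelInfo, and AudioToolsConfig below are hypothetical names invented for the example.

```typescript
type Modality = "text" | "image" | "audio";

// Hypothetical stand-ins for the model descriptor and the
// tools.media.audio config block.
interface ModelInfo {
  id: string;
  input: Modality[];
}

interface AudioToolsConfig {
  nativeIngestion?: boolean;
}

type AudioRoute = "native" | "stt";

function modelAcceptsAudio(model: ModelInfo): boolean {
  return model.input.indexOf("audio") !== -1;
}

// Native ingestion needs BOTH the operator flag and a model that lists
// "audio" as an input modality; anything else falls back to STT.
function chooseAudioRoute(
  cfg: AudioToolsConfig | undefined,
  model: ModelInfo,
): AudioRoute {
  if (cfg?.nativeIngestion !== true) return "stt"; // default-off
  return modelAcceptsAudio(model) ? "native" : "stt";
}
```

Treating a missing config and a missing flag the same way is what makes the change additive: a deployment that never sets the flag keeps its existing STT behavior.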

Validation

  • tsgo core + extensions: clean.
  • Unit tests pass: provider convert (audio → inlineData), images (capability predicate, load, detect), whatsapp preflight gate (skip + null transcript + fallback). 72 cases across the three suites.

Review dispositions (CodeRabbit)

  • Missing caf extension — fixed (added; iOS voice notes).
  • null vs undefined transcript on the native gate — fixed, and it was a real bug: leaving it undefined let process-message.ts re-run STT internally (gated on === undefined), which would strip the audio note and defeat native ingestion. Now null, with a regression-guard test. Also moved the gate after the ack/status reaction so voice notes still get acknowledged.
  • detectAudioReferences doesn't scan bare paths / file:// — declined. Inbound voice notes always arrive as a [media attached: … (audio/…)] note (media-note.ts); unlike images, audio isn't referenced by a typed bare path in the inbound flow. Scanning MEDIA_ATTACHED_PATTERN is the intended scope; bare-path audio detection would be speculative surface for an input that doesn't occur.
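
The null vs undefined fix in the second item relies on a three-state transcript. A minimal sketch of that distinction, assuming hypothetical helper names (preflightSketch, shouldRunFallbackStt) rather than the real on-message.ts and process-message.ts code:

```typescript
// string    -> STT already ran; use this transcript
// null      -> STT was deliberately skipped (native ingestion); do not retry
// undefined -> no preflight happened; the downstream fallback may run STT
type PreflightTranscript = string | null | undefined;

// Mirrors the "=== undefined" gate described above: only a missing
// preflight result triggers the internal STT fallback.
function shouldRunFallbackStt(transcript: PreflightTranscript): boolean {
  return transcript === undefined;
}

// Hypothetical preflight step: under native ingestion it returns null
// without ever calling the (possibly expensive) STT function.
function preflightSketch(
  nativeIngestion: boolean,
  runStt: () => string,
): PreflightTranscript {
  if (nativeIngestion) return null;
  return runStt();
}
```

With this shape, returning undefined from the native path would make shouldRunFallbackStt answer true, which is the re-transcription bug the disposition describes.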

Deploy (after merge)

Build at /opt/openclaw from this branch, set tools.media.audio.nativeIngestion: true in openclaw.json, and run systemctl restart openclaw-gateway.service. Then send a real voice note to the deployment and confirm via Logfire that the turn carries an audio inlineData part, followed by a forced-fallback check.

Summary by CodeRabbit

  • New Features

    • Agents, sessions, and harnesses now accept native audio attachments in prompts; prompt submission and loading handle audio alongside text/images
    • WhatsApp auto-reply supports native audio ingestion (skips STT when enabled)
  • Configuration

    • Added a nativeIngestion flag to media-understanding tools config and schema
  • Providers

    • Message conversion now treats audio appropriately (kept, converted, or dropped per provider/model capabilities)
  • Tests

    • Added extensive audio detection/loading and provider conversion tests

matin and others added 2 commits May 31, 2026 19:03
Feed inbound WhatsApp voice notes to the model as a native audio part on
the agent's own turn instead of pre-transcribing to text, preserving
tone/pacing/ambient cues for multimodal models (Gemini). STT remains the
pre-turn fallback (non-multimodal model, flag off, or bytes unloadable).

Gated behind tools.media.audio.nativeIngestion (default off).

- AudioContent type + widened user-message content union + Model.input
  "audio" (both src/llm and agent-core type systems)
- images.ts: detectAudioReferences / loadAudioFromRef /
  detectAndLoadPromptAudio / modelSupportsAudioInput, mirroring the image
  prompt-detection path (convertMessages already emits inlineData)
- attempt.ts: detect+load audio, thread through prompt options; audio-only
  turns no longer count as a blank prompt
- on-message.ts: skip the STT preflight when nativeIngestion is enabled so
  the [media attached: ... (audio/...)] note survives to the prompt
- anthropic / openai converters drop audio parts (native audio is the
  Google-only path); google-shared needs no change
- config: tools.media.audio.nativeIngestion (+ zod schema)

Refs imperfect-co/tulgey#214.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- on-message.ts: when native ingestion skips the STT preflight, set
  preflightAudioTranscript = null (not undefined) and run the ack/status
  reaction first. Leaving it undefined let processMessage's internal STT
  fallback (gated on `=== undefined`) re-transcribe the audio and strip the
  [media attached: ... (audio/...)] note — defeating native ingestion.
- images.ts: add "caf" to AUDIO_EXTENSION_NAMES (iOS voice notes).
- test: assert preflightAudioTranscript is null (regression guard for the
  re-transcription path above).

Refs imperfect-co/tulgey#214.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

matin commented Jun 1, 2026

Owner Author

@coderabbitai full review

coderabbitai Bot commented Jun 1, 2026

✅ Actions performed

Full review triggered.

coderabbitai Bot commented Jun 1, 2026

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 69440d7b-02ac-4eed-8aa0-0291a101082f

📥 Commits

Reviewing files that changed from the base of the PR and between 1f4d132 and 5035bce.

📒 Files selected for processing (1)
  • src/agents/embedded-agent-runner/run/images.test.ts

Walkthrough

This PR adds first-class native audio input support across the platform: new AudioContent type, agent/harness/session audio parameters, prompt audio detection/loading, provider-specific message conversions, attempt-runner wiring, and WhatsApp native ingestion handling with tests.

Changes

Native audio input support

| Layer | File(s) | Summary |
| --- | --- | --- |
| Core audio type definitions and schemas | src/llm/types.ts, packages/agent-core/src/llm.ts, src/config/types.tools.ts, src/config/zod-schema.core.ts | AudioContent interface added with type: "audio", base64 data, and mimeType; UserMessage.content expanded to include audio alongside text and images; Model.input modalities now include "audio"; MediaUnderstandingConfig and its Zod schema gain optional nativeIngestion flag. |
| Agent and session audio input methods | packages/agent-core/src/agent.ts, packages/agent-core/src/harness/agent-harness.ts, src/agents/sessions/agent-session.ts | Agent.prompt overloads accept optional audio?: AudioContent[]; AgentHarness turn methods and executeTurn accept audio options; createUserMessage appends audio to user message content; AgentSession.PromptOptions adds audio and audio is forwarded into queueing and non-stream prompt construction. |
| Audio input capability detection and model listing | src/agents/embedded-agent-runner/run/images.ts, src/commands/models/list.model-row.ts, src/commands/models/list.rows.ts, extensions/openai/openai-codex-provider.ts | modelSupportsAudioInput added for native audio capability (including Gemini v3+ hint); ListRowModel.input broadened to include "audio" and toListRowInput preserves it; Codex provider dedupes capability list using `("text" |
| Audio reference detection and loading from prompts | src/agents/embedded-agent-runner/run/images.ts, src/agents/embedded-agent-runner/run/images.test.ts | Added audio extension set and regexes; detectAudioReferences finds claim-check and local audio refs; loadAudioFromRef resolves and loads media, returning AudioContent when media.kind === "audio"; detectAndLoadPromptAudio gates loading on model capability and returns collected audio; tests added for detection/loading behavior and fixtures. |
| Audio detection and loading in embedded agent attempt flow | src/agents/embedded-agent-runner/run/attempt.ts, src/agents/embedded-agent-runner/run/attempt.prompt-helpers.ts, src/agents/agent-hooks/context-pruning/pruner.ts | Import MAX_AUDIO_BYTES and detectAndLoadPromptAudio; add prompt-local audioResult step with size/sandbox checks; propagate audioCount into prompt submission skip-reason; resolvePromptSubmissionSkipReason treats non-zero audioCount as non-empty; runtime prompt submission includes images/audio only when present; pruning helper types widened to accept AudioContent. |
| LLM provider message conversion for audio content | src/llm/providers/anthropic.ts, src/llm/providers/google-shared.convert.test.ts, src/llm/providers/openai-completions.ts, src/llm/providers/openai-responses-shared.ts, src/llm/providers/transform-messages.ts | Anthropic: convertContentBlocks accepts AudioContent and drops audio before building blocks. Google/Gemini: test verifies audio becomes inlineData part. OpenAI Completions & Responses: conversions now use flatMap, explicitly map text/images and drop audio. transform-messages: placeholder replacement generalized to preserve non-image audio blocks. |
| WhatsApp native audio ingestion support | extensions/whatsapp/src/auto-reply/monitor/on-message.ts, extensions/whatsapp/src/auto-reply/monitor/on-message.audio-preflight.test.ts | createWebOnMessageHandler checks cfg.tools?.media?.audio?.nativeIngestion; when enabled for audio media, sets preflightAudioTranscript to null and skips STT preflight; tests added for native-ingestion-enabled (no STT, null transcript) and disabled (STT runs, transcript passed) behaviors. |

Sequence Diagram: Audio flow through agent and LLM providers

sequenceDiagram
  participant Agent as Agent/Harness
  participant Attempt as Attempt Runner
  participant AudioLoader as PromptAudioLoader
  participant Provider as LLM Provider
  Agent->>Agent: prompt(text, images, audio)
  Agent->>Attempt: submit prompt with audio
  Attempt->>Attempt: detect & load audio refs
  Attempt->>AudioLoader: detectAndLoadPromptAudio
  AudioLoader->>AudioLoader: modelSupportsAudioInput?
  AudioLoader->>AudioLoader: detectAudioReferences
  AudioLoader->>AudioLoader: loadAudioFromRef (per ref)
  AudioLoader-->>Attempt: { audio: AudioContent[] }
  Attempt->>Attempt: build message content (text/images/audio)
  Attempt->>Provider: convertMessages(UserMessage)
  Provider->>Provider: handle AudioContent block
  Note over Provider: Anthropic: drop audio<br/>Google: inlineData parts<br/>OpenAI: drop audio
  Provider-->>Attempt: converted message params
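
The per-provider note in the diagram (Google keeps audio as inlineData, Anthropic and OpenAI drop it) could look roughly like this. It is a sketch under assumptions: the Content union, toGoogleParts, and dropAudio are hypothetical names, and the walkthrough says the OpenAI converters use flatMap, while a type-predicate filter is swapped in here for brevity.

```typescript
// Hypothetical content union, loosely following the walkthrough's
// description of text, image, and audio blocks.
type Content =
  | { type: "text"; text: string }
  | { type: "image"; data: string; mimeType: string }
  | { type: "audio"; data: string; mimeType: string };

type GooglePart =
  | { text: string }
  | { inlineData: { mimeType: string; data: string } };

// Google-style conversion: every non-text item, audio included, becomes
// an inlineData part.
function toGoogleParts(content: Content[]): GooglePart[] {
  return content.map<GooglePart>((item) =>
    item.type === "text"
      ? { text: item.text }
      : { inlineData: { mimeType: item.mimeType, data: item.data } },
  );
}

type TextOrImage = Exclude<Content, { type: "audio" }>;

// Anthropic/OpenAI-style pre-step: audio blocks are removed before the
// provider-specific blocks are built.
function dropAudio(content: Content[]): TextOrImage[] {
  return content.filter(
    (item): item is TextOrImage => item.type !== "audio",
  );
}
```

Narrowing the return type to TextOrImage means a downstream converter that only knows text and images cannot be handed an audio block by accident.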

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🎵 A rabbit heard audio could sing,
Through prompts it carries this new thing.
From agent calls to LLM's ear,
Each provider listens loud and clear.
WhatsApp skips the STT song — hooray! 🐰

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 42.11% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'feat(media): native multimodal ingestion for inbound voice notes' accurately summarizes the main change: adding native audio ingestion for WhatsApp voice notes to multimodal AI models instead of only using transcribed text. |
| Description check | ✅ Passed | The PR description is comprehensive and well-structured, covering the problem statement, solution approach, gating/rollout strategy, validation results, and deployment instructions. It addresses all key template sections. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
extensions/whatsapp/src/auto-reply/monitor/on-message.ts (1)

279-299: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Fix group voice-note spoken-mention gating when nativeIngestion is enabled

With tools.media.audio.nativeIngestion === true, runAudioPreflightOnce() sets preflightAudioTranscript = null, and the second applyGroupGating call only supplies mentionText when typeof preflightAudioTranscript === "string"—so no mentionText is provided. Since mention detection then runs against the original msg.body (e.g., "<media:audio>"), the retry pass still results in shouldProcess: false, and the handler returns before processMessage runs, silently dropping group voice notes meant to activate via spoken mentions.

Split the “audio STT skip” behavior from “mention gating needs STT”: either run a lightweight STT/transcription only to produce mentionText when needsMentionText is true (even under native ingestion), or explicitly document/guard this limitation for spoken-mention activation in groups with native ingestion.
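
One way to read the first option (keep native ingestion, but still produce mentionText when gating needs it) is sketched below. This is a hypothetical illustration, not the handler's real code: runAudioPreflightSketch, PreflightInputs, and the synchronous transcribe hook are invented for the example, and the real STT call is presumably asynchronous.

```typescript
type PreflightTranscript = string | null | undefined;

interface PreflightInputs {
  nativeIngestion: boolean;
  needsMentionText: boolean;
  transcribe: () => string | undefined; // hypothetical STT hook
}

interface PreflightOutcome {
  // What processMessage would receive; null keeps the audio note intact.
  preflightAudioTranscript: PreflightTranscript;
  // What the group-gating retry pass would receive for mention detection.
  mentionText: string | undefined;
}

function runAudioPreflightSketch(inputs: PreflightInputs): PreflightOutcome {
  if (!inputs.nativeIngestion) {
    // Fallback path: one transcript serves both purposes.
    const transcript = inputs.transcribe();
    return { preflightAudioTranscript: transcript, mentionText: transcript };
  }
  // Native path: transcribe only when gating needs spoken-mention text,
  // and never feed that transcript to the prompt.
  const mentionText = inputs.needsMentionText
    ? inputs.transcribe()
    : undefined;
  return { preflightAudioTranscript: null, mentionText };
}
```

The trade-off is that group voice notes needing mention detection pay for an STT call even under native ingestion, while direct messages never do.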

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@extensions/whatsapp/src/auto-reply/monitor/on-message.ts` around lines 279 -
299, The second applyGroupGating call is missing mentionText when
runAudioPreflightOnce() sets preflightAudioTranscript = null under
tools.media.audio.nativeIngestion, causing spoken-mention gating to fail; modify
the flow around runAudioPreflightOnce(), preflightAudioTranscript and
applyGroupGating so that when nativeIngestion is enabled but the gating/mention
logic requires a spoken mention, you still obtain a lightweight
STT/transcription or a dedicated mentionText (e.g., call a minimal transcription
helper or preserve the preflight result) and pass it into applyGroupGating via
the mentionText property (referencing runAudioPreflightOnce,
preflightAudioTranscript, and applyGroupGating) so the retry pass supplies
mentionText even with native ingestion enabled.
src/agents/agent-hooks/context-pruning/pruner.ts (1)

150-165: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Consider adding character estimation for audio content.

The function now accepts AudioContent but doesn't estimate any characters for audio blocks (they silently contribute 0). If audio is sent inline to multimodal models, it likely consumes context window quota. Consider adding an AUDIO_CHAR_ESTIMATE constant similar to IMAGE_CHAR_ESTIMATE (line 14) to account for audio in context pruning calculations.

💡 Suggested addition
 const IMAGE_CHAR_ESTIMATE = 8_000;
+const AUDIO_CHAR_ESTIMATE = 12_000; // Rough estimate for audio content
 const PRUNED_CONTEXT_IMAGE_MARKER = "[image removed during context pruning]";
 function estimateTextAndImageChars(
   content: ReadonlyArray<TextContent | ImageContent | AudioContent>,
 ): number {
   let chars = 0;
   for (const block of content) {
     const text = coerceTextBlock(block);
     if (text !== null) {
       chars += estimateWeightedTextChars(text);
       continue;
     }
     if (isImageBlock(block)) {
       chars += IMAGE_CHAR_ESTIMATE;
+      continue;
     }
+    if (block.type === "audio") {
+      chars += AUDIO_CHAR_ESTIMATE;
+    }
   }
   return chars;
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/agents/agent-hooks/context-pruning/pruner.ts` around lines 150 - 165, The
function estimateTextAndImageChars ignores AudioContent so audio blocks
contribute 0; add a constant AUDIO_CHAR_ESTIMATE (similar to
IMAGE_CHAR_ESTIMATE) and update estimateTextAndImageChars to detect audio blocks
(e.g., via an existing isAudioBlock helper or a new check for AudioContent) and
add AUDIO_CHAR_ESTIMATE to chars when encountered; keep existing handling for
text via coerceTextBlock and images via isImageBlock, and ensure
AUDIO_CHAR_ESTIMATE is exported/defined alongside IMAGE_CHAR_ESTIMATE so pruning
uses a conservative audio token budget.
src/agents/embedded-agent-runner/run/attempt.ts (1)

3991-4012: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Thread media options through the runtime-only submission path.

Line 3991 still submits the runtime-only branch with only preflightResult. With the new audioCount skip logic, an audio-only turn can now reach this branch and still lose its loaded audio payload before the model call.

Proposed fix
-            try {
-              if (promptSubmission.runtimeOnly) {
-                await promptActiveSession(promptForSession, {
-                  preflightResult: armModelPromptTransform,
-                });
-              } else {
-                const cleanupRuntimeContextMessage = installRuntimeContextMessageForPrompt({
-                  session: activeSession,
-                  message: runtimeContextMessageForCurrentTurn,
-                });
-                try {
-                  // Only attach images/audio options when present so models
-                  // that don't expect those parameters aren't handed empties.
-                  const promptOptions: Parameters<typeof promptActiveSession>[1] = {
-                    preflightResult: armModelPromptTransform,
-                  };
-                  if (imageResult.images.length > 0) {
-                    promptOptions.images = imageResult.images;
-                  }
-                  if (audioResult.audio.length > 0) {
-                    promptOptions.audio = audioResult.audio;
-                  }
-                  await promptActiveSession(promptForSession, promptOptions);
-                } finally {
-                  cleanupRuntimeContextMessage();
-                }
-              }
+            try {
+              // Only attach images/audio options when present so models
+              // that don't expect those parameters aren't handed empties.
+              const promptOptions: Parameters<typeof promptActiveSession>[1] = {
+                preflightResult: armModelPromptTransform,
+              };
+              if (imageResult.images.length > 0) {
+                promptOptions.images = imageResult.images;
+              }
+              if (audioResult.audio.length > 0) {
+                promptOptions.audio = audioResult.audio;
+              }
+              if (promptSubmission.runtimeOnly) {
+                await promptActiveSession(promptForSession, promptOptions);
+              } else {
+                const cleanupRuntimeContextMessage = installRuntimeContextMessageForPrompt({
+                  session: activeSession,
+                  message: runtimeContextMessageForCurrentTurn,
+                });
+                try {
+                  await promptActiveSession(promptForSession, promptOptions);
+                } finally {
+                  cleanupRuntimeContextMessage();
+                }
+              }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/agents/embedded-agent-runner/run/attempt.ts` around lines 3991 - 4012,
The runtime-only branch currently calls promptActiveSession(promptForSession, {
preflightResult: armModelPromptTransform }) and therefore drops image/audio
payloads for audio-only or image-containing turns; modify the runtime-only path
in the promptSubmission.runtimeOnly branch to build the same promptOptions
object used in the non-runtime path (include preflightResult and conditionally
add images from imageResult.images and audio from audioResult.audio) and pass
that promptOptions to promptActiveSession so media is preserved; reference
promptSubmission, promptActiveSession, promptForSession, promptOptions,
imageResult, and audioResult to locate and update the code.
🧹 Nitpick comments (1)
extensions/whatsapp/src/auto-reply/monitor/on-message.audio-preflight.test.ts (1)

475-510: ⚡ Quick win

Add group coverage for native ingestion.

These tests only exercise the DM path. Given the group mention-gating interaction flagged in on-message.ts, consider adding a makeGroupAudioMsg() test with nativeIngestion: true to lock in the intended behavior for spoken-mention voice notes in groups.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@extensions/whatsapp/src/auto-reply/monitor/on-message.audio-preflight.test.ts`
around lines 475 - 510, Add a new test variant that covers group messages when
native audio ingestion is enabled: create a test similar to the existing "runs
STT when native audio ingestion is disabled (fallback path)" but call
createWebOnMessageHandler with cfg.tools.media.audio.nativeIngestion: true and
invoke makeGroupAudioMsg() instead of makeAudioMsg(); assert the expected
behavior for group spoken-mention voice notes (e.g., whether
transcribeFirstAudioMock is called, events includes "stt" or not, and the
preflightAudioTranscript on processMessage depending on on-message.ts
mention-gating logic) so the group path and mention gating implemented in
on-message.ts are exercised. Ensure you reuse the same handler config fields
(connectionId, maxMediaBytes, groupHistoryLimit, groupHistories,
groupMemberNames, echoTracker, replyResolver, replyLogger, baseMentionConfig,
account) to mirror the DM test and make assertions against the same mocks
(transcribeFirstAudioMock, processMessageMock, events).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/agents/embedded-agent-runner/run/images.ts`:
- Around line 719-775: detectAudioReferences currently only parses
MEDIA_ATTACHED_PATTERN blocks and misses standalone local audio refs; update
detectAudioReferences to also scan the entire prompt for plain local/file URIs
by adding a second pass (after the existing MEDIA_ATTACHED loop) that uses a
regex to find file:// URIs and path-like tokens (e.g., ./, ../, ~/ or bare
filenames with audio extensions) and then reuse isAudioExtension,
assertNoWindowsNetworkPath, resolveUserPath and normalizeRefForDedupe to
validate, dedupe (using seen) and push entries with appropriate type ("path" or
"media-uri" for file://). Ensure you also treat file:///... as a media-uri or
convert to a resolved local path consistently and skip http/https, keeping
existing behavior for MEDIA_URI_REGEX handling.

In `@src/agents/sessions/agent-session.ts`:
- Around line 1198-1205: Queued prompts are losing audio because audio is only
appended to userContent in the non-streaming path; update the streaming/queued
path (the code paths that call queueSteer() / queueFollowUp() from prompt()) to
include options?.audio the same way as currentImages is included so queued turns
keep their audio attachments. Locate where currentImages is passed into
queueSteer/queueFollowUp (and where queued message content is constructed) and
push ...options.audio or otherwise include the AudioContent entries into the
queued user content before calling queueSteer/queueFollowUp.
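
The queued-prompt finding comes down to two code paths building user content differently. A minimal sketch of one way to avoid that, assuming a hypothetical MiniSession and buildUserContent helper rather than the real AgentSession, queueSteer, or queueFollowUp:

```typescript
interface TextContent { type: "text"; text: string }
interface ImageContent { type: "image"; data: string; mimeType: string }
interface AudioContent { type: "audio"; data: string; mimeType: string }
type UserContent = TextContent | ImageContent | AudioContent;

interface PromptOptions {
  images?: ImageContent[];
  audio?: AudioContent[];
}

// Single place where attachments are folded into the message, so the
// immediate and queued paths cannot diverge.
function buildUserContent(text: string, options?: PromptOptions): UserContent[] {
  const content: UserContent[] = [{ type: "text", text }];
  if (options?.images) content.push(...options.images);
  if (options?.audio) content.push(...options.audio);
  return content;
}

// Hypothetical stand-in for a session that queues prompts while streaming.
class MiniSession {
  isStreaming = false;
  readonly sent: UserContent[][] = [];
  readonly queued: UserContent[][] = [];

  prompt(text: string, options?: PromptOptions): void {
    const content = buildUserContent(text, options);
    if (this.isStreaming) {
      this.queued.push(content); // queued turn keeps its audio
    } else {
      this.sent.push(content);
    }
  }
}
```

Because both branches consume the same content array, a new attachment kind added to buildUserContent reaches queued turns without further changes.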


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 897a088d-f759-4b9e-af37-ea4cb2e0c2c0

📥 Commits

Reviewing files that changed from the base of the PR and between 94db48d and 26c50bc.

📒 Files selected for processing (22)
  • extensions/openai/openai-codex-provider.ts
  • extensions/whatsapp/src/auto-reply/monitor/on-message.audio-preflight.test.ts
  • extensions/whatsapp/src/auto-reply/monitor/on-message.ts
  • packages/agent-core/src/agent.ts
  • packages/agent-core/src/harness/agent-harness.ts
  • packages/agent-core/src/llm.ts
  • src/agents/agent-hooks/context-pruning/pruner.ts
  • src/agents/embedded-agent-runner/run/attempt.prompt-helpers.ts
  • src/agents/embedded-agent-runner/run/attempt.ts
  • src/agents/embedded-agent-runner/run/images.test.ts
  • src/agents/embedded-agent-runner/run/images.ts
  • src/agents/sessions/agent-session.ts
  • src/commands/models/list.model-row.ts
  • src/commands/models/list.rows.ts
  • src/config/types.tools.ts
  • src/config/zod-schema.core.ts
  • src/llm/providers/anthropic.ts
  • src/llm/providers/google-shared.convert.test.ts
  • src/llm/providers/openai-completions.ts
  • src/llm/providers/openai-responses-shared.ts
  • src/llm/providers/transform-messages.ts
  • src/llm/types.ts

…io + test mock

Fixes the CI regression and both CodeRabbit Major findings on #4.

- test mock (CI fix): the shared attempt.spawn-workspace test-support mock for
  ./images.js only stubbed detectAndLoadPromptImages, so attempt.ts's new
  detectAndLoadPromptAudio import threw "No export is defined" under vitest,
  cascading into 27 failures in the embedded-agent shard. Add the audio mock.

- detectAudioReferences (CR finding): only scanned [media attached: ...] blocks,
  so plain refs (./voice.ogg, ~/memo.caf, file:///tmp/note.wav) silently fell
  back to text. Add audio-extension variants of the file://, Windows-drive, and
  bare-path passes the image detector already runs, gated on isAudioExtension.
  PATH_PATTERN's leading-boundary requirement keeps media://inbound/<id> URIs
  from being misparsed as filesystem paths. +6 detectAudioReferences tests.

- queued-prompt audio (CR finding): prompt() while streaming routed through
  queueSteer/queueFollowUp with currentImages only, dropping options.audio.
  Thread audio through both queue methods and the streaming call site.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
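
The extension-gated path pass described in the commit above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in images.ts: the extension list, the `isAudioExtension` signature, the `PLAIN_PATH` pattern, and the `detectPlainAudioRefs` helper are all hypothetical stand-ins for the real `AUDIO_EXTENSION_NAMES`, `PATH_PATTERN`, and `detectAudioReferences`.

```typescript
// Hypothetical extension list; the real AUDIO_EXTENSION_NAMES in images.ts may differ.
const AUDIO_EXTENSIONS = new Set(["ogg", "caf", "wav", "mp3", "m4a"]);

// Assumed signature: true when the reference ends in a known audio extension.
function isAudioExtension(ref: string): boolean {
  const match = /\.([A-Za-z0-9]+)$/.exec(ref);
  return match !== null && AUDIO_EXTENSIONS.has((match[1] ?? "").toLowerCase());
}

// Illustrative stand-in for the file://, Windows-drive, and bare-path passes.
// A candidate must start at the beginning of the text or right after whitespace,
// so the "//inbound/<id>" tail of a media:// URI is never read as a path.
const PLAIN_PATH = /(?:^|\s)(file:\/\/\S+|[A-Za-z]:[\\\/]\S+|~?\.{0,2}\/\S+)/;

function detectPlainAudioRefs(prompt: string): string[] {
  const refs: string[] = [];
  // Fresh global regex per call so no lastIndex state leaks between calls.
  const pattern = new RegExp(PLAIN_PATH.source, "g");
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(prompt)) !== null) {
    // Trim trailing punctuation that commonly follows a path in prose.
    const candidate = (match[1] ?? "").replace(/[)\],.;:]+$/, "");
    // Gate on the audio extension so image paths stay with the image detector.
    if (!isAudioExtension(candidate)) continue;
    if (refs.indexOf(candidate) === -1) refs.push(candidate);
  }
  return refs;
}
```

Calling `detectPlainAudioRefs("play ./voice.ogg and ~/memo.caf")` under these assumptions returns both paths, while a prompt containing only `[media attached: media://inbound/abc.ogg (audio/ogg)]` returns an empty list, leaving that URI to the media-note pass.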

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/agents/embedded-agent-runner/run/images.ts (1)

778-795: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Verify inbound audio claim-check URIs always carry an audio MIME or extension.

A claim-check URI is classified as audio only when the bracket contains (audio/...) or the id has an audio extension. Unlike detectImageReferences (which accepts any media:// URI), this requires a discriminator. Since on-message.ts skips the STT preflight when nativeIngestion is enabled, an inbound voice-note URI that lacks both signals would be detected as neither image nor audio — and with STT already skipped, the audio cue is lost entirely (the exact failure this PR aims to prevent).

Confirm the Gateway/store format for inbound audio attachments retains an audio extension on the media id or emits the (audio/...) annotation.

#!/bin/bash
# Inspect how inbound media claim-check entries are formatted, and whether
# audio attachments retain an extension or MIME annotation in the bracket.
rg -nP -C4 'media attached' --type=ts -g '!**/*.test.ts'
rg -nP -C3 'media://inbound' --type=ts -g '!**/*.test.ts'
# Filename sanitization / extension retention in store
fd -i 'store.ts' --exec rg -nP -C3 'sanitizeFilename|extname|\.ogg|audio/'
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/agents/embedded-agent-runner/run/images.ts` around lines 778 - 795, The
current block using MEDIA_URI_REGEX only adds a media claim-check when the
bracket contains an (audio/...) MIME or the id has an audio extension, causing
inbound audio URIs without those signals to be skipped; update the logic in the
MEDIA_URI_REGEX handling so that any matched media://inbound/<id> is normalized
(use normalizeRefForDedupe) and added to refs (and seen) as type "media-uri"
regardless of isAudio, instead of continuing early when !isAudio; keep isAudio
detection (via isAudioExtension and the /\(audio\// check) for downstream STT
decision but do not prevent adding the reference, and ensure this change aligns
with detectImageReferences behavior that accepts any media:// URIs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e874d210-2906-4165-aa58-28519c0e5659

📥 Commits

Reviewing files that changed from the base of the PR and between 26c50bc and 1f4d132.

📒 Files selected for processing (4)
  • src/agents/embedded-agent-runner/run/attempt.spawn-workspace.test-support.ts
  • src/agents/embedded-agent-runner/run/images.test.ts
  • src/agents/embedded-agent-runner/run/images.ts
  • src/agents/sessions/agent-session.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/agents/embedded-agent-runner/run/images.test.ts
  • src/agents/sessions/agent-session.ts

Addresses CodeRabbit re-review on #4 (images.ts:778-795). The audio
media-uri discriminator (audio MIME or audio extension on the id) is
deliberate — it keeps image and audio media-uri detection independent.
CR proposed accepting any media:// URI like detectImageReferences does;
that would misclassify every inbound image URI as audio.

Verified the underlying assumption instead: inbound audio URIs always
carry the discriminator. store.ts appends a MIME-derived extension to
the saved id, and the WhatsApp ingest note carries (audio/...). Add two
tests pinning this: a MIME-less id-extension-only URI is still detected
as audio, and image URIs stay out of audio detection.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@matin

matin commented Jun 2, 2026

Owner Author

Re: the outside-diff finding on images.ts (audio claim-check discriminator) — kept as-is; verified the underlying assumption instead. Tests added in 5035bce.

The discriminator ((audio/...) MIME or audio extension on the id) is deliberate and load-bearing: detectAudioReferences must stay independent from detectImageReferences, which accepts any media://inbound/<id>. Adopting the suggested "accept any media:// URI" behavior would classify every inbound image URI as audio — the function comment calls this out explicitly ("otherwise an image URI would be misclassified").

The failure the review is worried about (an audio URI carrying neither signal, with STT skipped) does not arise in the pipeline, because inbound audio URIs always carry the discriminator:

  • src/media/store.ts (~315–361): the saved id is ${sanitized}---${baseId}${ext}, where ext is resolved from the authoritative header MIME via extensionForMime(contentType). An inbound audio file (e.g. audio/ogg) therefore yields an id ending in .ogg/.caf/etc — the extension discriminator fires even when the bracket omits the MIME (e.g. the chat-attachments.ts producer that emits a bare [media attached: media://inbound/<id>]).
  • extensions/whatsapp/.../on-message.ts (189–200) + src/auto-reply/media-note.ts:52: the WhatsApp ingest note is built with (audio/...) by design — the comment states the [media attached: ... (audio/...)] note is what survives to the prompt once STT is skipped under nativeIngestion.
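
The id construction in the first bullet can be sketched as below. The MIME table and the `buildSavedId` helper are hypothetical; only the `${sanitized}---${baseId}${ext}` shape and the `extensionForMime` name come from the description above, and the real function in src/media/store.ts may handle more types and edge cases.

```typescript
// Hypothetical MIME table for illustration; the real mapping is larger.
const EXTENSION_BY_MIME: Record<string, string> = {
  "audio/ogg": ".ogg",
  "audio/x-caf": ".caf",
  "audio/wav": ".wav",
  "image/png": ".png",
};

// Assumed behavior: resolve an extension from the header MIME, or "" if unknown.
function extensionForMime(contentType: string | undefined): string {
  if (!contentType) return "";
  // Drop parameters such as "; codecs=opus" before the lookup.
  const base = (contentType.split(";")[0] ?? "").trim().toLowerCase();
  return EXTENSION_BY_MIME[base] ?? "";
}

// Hypothetical helper mirroring the `${sanitized}---${baseId}${ext}` id shape.
function buildSavedId(sanitized: string, baseId: string, contentType?: string): string {
  return `${sanitized}---${baseId}${extensionForMime(contentType)}`;
}
```

Under these assumptions, a voice note saved with content type `audio/ogg; codecs=opus` gets an id ending in `.ogg`, which is what lets the extension discriminator fire even when the bracket omits the MIME.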

Locked both properties in with tests: a MIME-less, id-extension-only URI (media://inbound/voice.ogg) is still detected as audio, and image URIs (media://inbound/photo.png, with or without (image/png)) stay out of audio detection.
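
For readers following the thread, here is a minimal sketch of the discriminator and the two pinned properties. It is not the code in images.ts: the `MEDIA_NOTE` pattern, the extension set, and the `detectAudioMediaUris` name are assumptions made for illustration.

```typescript
// Hypothetical extension set; the real list lives in images.ts.
const AUDIO_ID_EXTENSIONS = new Set(["ogg", "caf", "wav", "mp3", "m4a"]);

// Matches one "[media attached: media://inbound/<id> ...]" block. Group 1 is the
// id; group 2 is whatever follows it inside the bracket, e.g. " (audio/ogg) | url".
const MEDIA_NOTE = /\[media attached:\s*media:\/\/inbound\/([^\s\]]+)([^\]]*)\]/;

function detectAudioMediaUris(prompt: string): string[] {
  const uris: string[] = [];
  const pattern = new RegExp(MEDIA_NOTE.source, "g");
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(prompt)) !== null) {
    const id = match[1] ?? "";
    const rest = match[2] ?? "";
    // Signal 1: an (audio/...) MIME annotation inside the bracket.
    const hasAudioMime = /\(audio\//.test(rest);
    // Signal 2: an audio extension on the media id itself.
    const ext = /\.([A-Za-z0-9]+)$/.exec(id);
    const hasAudioExt =
      ext !== null && AUDIO_ID_EXTENSIONS.has((ext[1] ?? "").toLowerCase());
    // Without either signal the URI is left for image detection.
    if (!hasAudioMime && !hasAudioExt) continue;
    const uri = `media://inbound/${id}`;
    if (uris.indexOf(uri) === -1) uris.push(uri);
  }
  return uris;
}
```

Requiring one of the two signals is the design choice under discussion: dropping the gate would make this function return every inbound `media://` URI, images included.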

@matin matin merged commit f837a17 into main Jun 2, 2026
72 of 85 checks passed
matin added a commit that referenced this pull request Jun 2, 2026
Consolidates the still-needed local /opt/openclaw hot-patches onto fork-main
(= upstream 94db48d + native audio #4) so the membrane VM can cut over to the
fork and ship native audio ingestion. These three files were untouched by
upstream in the 99d96c1→94db48d0 window, so they transplant verbatim:

- extensions/google/video-generation-provider.ts — the openclaw#172 Vertex REST-bearer
  bypass (load-bearing per tulgey#194; SDK auth path is the openclaw#175 bug) + the #3
  default-1080p resolution.
- extensions/google/generation-provider-metadata.ts — Veo companion.
- src/cli/program/message/register.send.ts — companion.

Dropped: the session-lock patch (openclaw#195) — upstream made waitForSessionEventQueue
a no-op by 94db48d, so it is obsolete.

Deferred (fast-follow, refs tulgey#218): src/auto-reply/dispatch.ts (the ADR
0015 inbound-message-sequencing coalescing rewrite) and src/infra/dotenv.ts —
both conflict structurally with fork-main and need a careful port + review.

Refs imperfect-co/tulgey#218.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
matin added a commit that referenced this pull request Jun 3, 2026
* feat(media): native multimodal ingestion for inbound voice notes

Feed inbound WhatsApp voice notes to the model as a native audio part on
the agent's own turn instead of pre-transcribing to text, preserving
tone/pacing/ambient cues for multimodal models (Gemini). STT remains the
pre-turn fallback (non-multimodal model, flag off, or bytes unloadable).

Gated behind tools.media.audio.nativeIngestion (default off).

- AudioContent type + widened user-message content union + Model.input
  "audio" (both src/llm and agent-core type systems)
- images.ts: detectAudioReferences / loadAudioFromRef /
  detectAndLoadPromptAudio / modelSupportsAudioInput, mirroring the image
  prompt-detection path (convertMessages already emits inlineData)
- attempt.ts: detect+load audio, thread through prompt options; audio-only
  turns no longer count as a blank prompt
- on-message.ts: skip the STT preflight when nativeIngestion is enabled so
  the [media attached: ... (audio/...)] note survives to the prompt
- anthropic / openai converters drop audio parts (native audio is the
  Google-only path); google-shared needs no change
- config: tools.media.audio.nativeIngestion (+ zod schema)

Refs imperfect-co/tulgey#214.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(media): address review — null transcript on native gate, add caf

- on-message.ts: when native ingestion skips the STT preflight, set
  preflightAudioTranscript = null (not undefined) and run the ack/status
  reaction first. Leaving it undefined let processMessage's internal STT
  fallback (gated on `=== undefined`) re-transcribe the audio and strip the
  [media attached: ... (audio/...)] note — defeating native ingestion.
- images.ts: add "caf" to AUDIO_EXTENSION_NAMES (iOS voice notes).
- test: assert preflightAudioTranscript is null (regression guard for the
  re-transcription path above).

Refs imperfect-co/tulgey#214.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(media): address review — audio detection gaps + queued-prompt audio + test mock

Fixes the CI regression and both CodeRabbit Major findings on #4.

- test mock (CI fix): the shared attempt.spawn-workspace test-support mock for
  ./images.js only stubbed detectAndLoadPromptImages, so attempt.ts's new
  detectAndLoadPromptAudio import threw "No export is defined" under vitest,
  cascading into 27 failures in the embedded-agent shard. Add the audio mock.

- detectAudioReferences (CR finding): only scanned [media attached: ...] blocks,
  so plain refs (./voice.ogg, ~/memo.caf, file:///tmp/note.wav) silently fell
  back to text. Add audio-extension variants of the file://, Windows-drive, and
  bare-path passes the image detector already runs, gated on isAudioExtension.
  PATH_PATTERN's leading-boundary requirement keeps media://inbound/<id> URIs
  from being misparsed as filesystem paths. +6 detectAudioReferences tests.

- queued-prompt audio (CR finding): prompt() while streaming routed through
  queueSteer/queueFollowUp with currentImages only, dropping options.audio.
  Thread audio through both queue methods and the streaming call site.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(media): lock in audio claim-check discriminator behavior

Addresses CodeRabbit re-review on #4 (images.ts:778-795). The audio
media-uri discriminator (audio MIME or audio extension on the id) is
deliberate — it keeps image and audio media-uri detection independent.
CR proposed accepting any media:// URI like detectImageReferences does;
that would misclassify every inbound image URI as audio.

Verified the underlying assumption instead: inbound audio URIs always
carry the discriminator. store.ts appends a MIME-derived extension to
the saved id, and the WhatsApp ingest note carries (audio/...). Add two
tests pinning this: a MIME-less id-extension-only URI is still detected
as audio, and image URIs stay out of audio detection.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
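
The null-versus-undefined fix in the second commit above relies on a three-state value. A simplified, synchronous sketch of that idea follows; the `runPreflight` and `buildPromptText` helpers and their signatures are hypothetical, and the real on-message.ts and processMessage code is likely asynchronous and more involved.

```typescript
// undefined: no preflight decision was made, so the fallback may transcribe.
// null: preflight was deliberately skipped (native ingestion owns the audio).
// string: a transcript is already available.
type PreflightTranscript = string | null | undefined;

// Hypothetical gate: with native ingestion on, skip STT and record that choice.
function runPreflight(nativeIngestion: boolean, stt: () => string): PreflightTranscript {
  return nativeIngestion ? null : stt();
}

// Hypothetical consumer standing in for the internal fallback in processMessage.
function buildPromptText(
  note: string,
  preflightAudioTranscript: PreflightTranscript,
  transcribe: () => string,
): string {
  // The fallback is gated on `=== undefined`, so null does not trigger it.
  if (preflightAudioTranscript === undefined) return transcribe();
  // null: keep the [media attached: ...] note so it survives to the prompt.
  if (preflightAudioTranscript === null) return note;
  return preflightAudioTranscript;
}
```

Leaving the value `undefined` after skipping the preflight would send the message down the first branch, which is the re-transcription path the regression test guards against.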
matin added a commit that referenced this pull request Jun 3, 2026