Summary
Voice notes (and likely other media) that arrive while the agent is mid-turn are queued as "followup" messages. The followup runner (createFollowupRunner) calls runEmbeddedPiAgent directly without first calling applyMediaUnderstanding. This means audio transcription, image understanding, and video understanding are silently skipped for all queued messages.
Steps to Reproduce
- Configure tools.media.audio with a provider model (e.g., openai/gpt-4o-transcribe)
- Send a text message to the agent via Signal (or any channel)
- While the agent is still generating its reply, send a voice note
- The voice note arrives as a followup-queued message
- The agent receives <media:audio> but no transcript, because applyMediaUnderstanding was never called
Expected Behavior
applyMediaUnderstanding should run on followup-queued messages before they are passed to runEmbeddedPiAgent, just as it does in the primary getReplyFromConfig path.
Root Cause
In the source (e.g., sessions-DRG4gFa3.js):
- Line ~124182: applyMediaUnderstanding is called in getReplyFromConfig for the initial message ✅
- Line ~125007: A second call exists in the ACP dispatch path with a guard (if (!params.ctx.MediaUnderstanding?.length)) ✅
- createFollowupRunner (~line 121676): Calls runEmbeddedPiAgent with queued.prompt directly, with no applyMediaUnderstanding call ❌
Suggested Fix
Add an applyMediaUnderstanding call inside the followup runner before runEmbeddedPiAgent is invoked, similar to the guard pattern used in the ACP path:
    if (!queued.ctx?.MediaUnderstanding?.length) {
      await applyMediaUnderstanding({
        ctx: queued.ctx, // or reconstruct from queued.run
        cfg: queued.run.config,
      });
    }
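To show where the guard would sit relative to the agent call, here is a minimal, standalone sketch. The helper name runFollowupWithMedia, the injected deps object, and the argument shape passed to runEmbeddedPiAgent are assumptions made so the snippet runs on its own; the real createFollowupRunner may be wired differently. The try/catch is also an assumption: it reflects one possible choice, that a failed transcription should not drop the queued message.

```javascript
// Hypothetical helper, not part of the bundle: illustrates ordering only.
// The real applyMediaUnderstanding and runEmbeddedPiAgent are injected via deps.
async function runFollowupWithMedia(queued, deps) {
  const { applyMediaUnderstanding, runEmbeddedPiAgent } = deps;

  // Same guard as the ACP dispatch path: skip if understanding already ran.
  if (!queued.ctx?.MediaUnderstanding?.length) {
    try {
      await applyMediaUnderstanding({
        ctx: queued.ctx, // or reconstruct from queued.run
        cfg: queued.run.config,
      });
    } catch (err) {
      // Assumption: deliver the message without a transcript rather than fail.
      console.warn("media understanding failed for queued followup:", err);
    }
  }

  // Assumption: the agent call takes the run options plus the queued prompt.
  return runEmbeddedPiAgent({ ...queued.run, prompt: queued.prompt });
}
```

If queued.prompt is built from ctx before the message is queued, it would presumably also need to be rebuilt after this call so the transcript actually reaches the agent.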
Workaround
Manually transcribe audio using the OpenAI Whisper API when <media:audio> is received without a transcript.
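For reference, the workaround can be sketched roughly as follows. This assumes Node 18+ (for the global fetch, FormData, and Blob) and an OpenAI API key. The function name transcribeVoiceNote and its parameters are hypothetical and not part of OpenClaw; fetchImpl is injectable only so the function can be exercised without a network call.

```javascript
// Sketch only: send raw audio bytes to OpenAI's transcription endpoint.
// transcribeVoiceNote is a hypothetical helper, not an OpenClaw API.
async function transcribeVoiceNote(audioBytes, filename, apiKey, fetchImpl = fetch) {
  const form = new FormData();
  // The filename extension should match the actual audio format.
  form.append("file", new Blob([audioBytes]), filename);
  form.append("model", "whisper-1");

  const res = await fetchImpl("https://api.openai.com/v1/audio/transcriptions", {
    method: "POST",
    headers: { Authorization: `Bearer ${apiKey}` },
    body: form, // fetch sets the multipart boundary header itself
  });
  if (!res.ok) {
    throw new Error(`transcription request failed with status ${res.status}`);
  }
  const data = await res.json();
  return data.text; // the default JSON response carries the transcript in "text"
}
```

The returned text can then be supplied to the agent in place of the bare <media:audio> placeholder.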
Environment
- OpenClaw version: 2026.3.11
- Channel: Signal
- Audio model: openai/gpt-4o-transcribe
- Agent model: anthropic/claude-opus-4-6