Problem
Even with the multimodal AgentRuntime contract (#385 ✅) and per-runtime implementations (#386), media still won't flow end-to-end because the middleware layers don't propagate it:
Inbound: Channel plugins produce media URLs/paths → `ChannelMessage.mediaUrls` exists but was never populated by `buildChannelMessage` (#384 ✅, now fixed) → `ChannelBridge` never passes media to `AgentExecuteParams.media`
Outbound: `AgentRunResult` will have a `media` field (added to the AgentRuntime contract in #385 ✅) → but `ChannelBridge` only extracts `text` → `ReplyPayload` only gets text, so media is lost
Scope
Inbound path (channel → runtime)
ChannelBridge media resolution: When `ChannelMessage.mediaUrls` is populated:
Download/resolve media URLs to local file paths (temp files)
Build `MediaAttachment[]` with MIME type detection
Check runtime's `mediaCapabilities.acceptsInbound`
For supported types: pass through as `AgentExecuteParams.media`
For unsupported types: delegate to middleware fallback (STT for audio, vision API for images — see below)
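The capability check above could be sketched roughly as follows. This is only an illustration: the field shapes of `MediaAttachment` and `mediaCapabilities`, the `MediaKind` union, and the helper names `kindOf` and `partitionMedia` are assumptions, not the actual contract from #385.

```typescript
// Hypothetical shapes; the real types are defined by the AgentRuntime contract (#385).
type MediaKind = "image" | "audio" | "video" | "file";

interface MediaAttachment {
  path: string; // local temp file path after download
  mimeType: string; // detected MIME type, e.g. "image/png"
}

interface MediaCapabilities {
  acceptsInbound: MediaKind[]; // assumed to be a list of accepted kinds
}

// Map a MIME type to a coarse media kind using its top-level type.
function kindOf(mimeType: string): MediaKind {
  switch (mimeType.split("/")[0]) {
    case "image":
      return "image";
    case "audio":
      return "audio";
    case "video":
      return "video";
    default:
      return "file";
  }
}

// Split attachments into those the runtime accepts natively and those
// that need the middleware fallback.
function partitionMedia(media: MediaAttachment[], caps: MediaCapabilities) {
  const supported: MediaAttachment[] = [];
  const fallback: MediaAttachment[] = [];
  for (const item of media) {
    const kind = kindOf(item.mimeType);
    const accepted = caps.acceptsInbound.some((k) => k === kind);
    (accepted ? supported : fallback).push(item);
  }
  return { supported, fallback };
}
```

In this sketch, `supported` would become `AgentExecuteParams.media` and `fallback` would be handed to the middleware fallback layer.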
Middleware fallback layer: For runtimes that can't handle certain media types:
Audio → STT: Use the `src/stt/` module (extracted from media-understanding in #424). Convert voice messages to text and prepend the text to the prompt. This is a middleware concern, since every runtime needs text.
Image/video → text description: Thin fallback using the runtime's own API key (from auth profiles). Only for runtimes that declare no image support.
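One possible shape for the fallback layer is sketched below. The `FallbackConverters` interface, the `applyFallbacks` helper, and the bracketed labels in the prompt are all hypothetical; the real `src/stt/` API and the vision fallback may look quite different. The converters are injected so the sketch does not assume any particular STT or vision provider.

```typescript
interface MediaAttachment {
  path: string; // local temp file path
  mimeType: string; // detected MIME type
}

// Hypothetical converter signatures; the real src/stt/ API may differ.
interface FallbackConverters {
  transcribe(path: string): Promise<string>; // audio to text (STT)
  describe(path: string): Promise<string>; // image/video to text description
}

// Convert unsupported attachments to text and prepend the result to the prompt.
async function applyFallbacks(
  prompt: string,
  unsupported: MediaAttachment[],
  conv: FallbackConverters,
): Promise<string> {
  const notes: string[] = [];
  for (const item of unsupported) {
    const top = item.mimeType.split("/")[0];
    if (top === "audio") {
      notes.push("[Voice message transcript] " + (await conv.transcribe(item.path)));
    } else if (top === "image" || top === "video") {
      notes.push("[Attachment description] " + (await conv.describe(item.path)));
    }
    // Other media types are skipped in this sketch.
  }
  return notes.length > 0 ? notes.join("\n") + "\n\n" + prompt : prompt;
}
```

Injecting the converters also keeps the function easy to test with stubs, which matters if the STT and vision calls hit external services.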
Outbound path (runtime → channel)
ChannelBridge: Handle `AgentMediaEvent` and `AgentRunResult.media`:
Convert `MediaAttachment` to `ReplyPayload.mediaUrl` / `ReplyPayload.mediaUrls`
Temp file management: serve from local path, clean up after delivery
Auto-reply delivery: Already handles `ReplyPayload.mediaUrl`, so it should work once ChannelBridge populates it
TTS integration point: Outbound audio can come from either:
AgentRuntime (native media emission — future)
TTS module (text → speech conversion — existing)
Both paths produce `ReplyPayload.mediaUrl` — delivery is unified
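The outbound conversion in ChannelBridge might look something like the sketch below. The `toReplyPayload` helper is hypothetical, the `MediaAttachment` and `ReplyPayload` shapes are reduced to the fields named in this issue, and the rule "one attachment goes to `mediaUrl`, several go to `mediaUrls`" is an assumption rather than confirmed behavior. Temp file cleanup after delivery is left out.

```typescript
interface MediaAttachment {
  path: string; // local temp file path produced by the runtime or the TTS module
  mimeType: string;
}

// Reduced to the fields mentioned in this issue; the real type has more.
interface ReplyPayload {
  text?: string;
  mediaUrl?: string;
  mediaUrls?: string[];
}

// Build a reply payload from the runtime's text and media output.
function toReplyPayload(text: string | undefined, media: MediaAttachment[]): ReplyPayload {
  const payload: ReplyPayload = {};
  if (text) payload.text = text;
  const urls = media.map((m) => m.path);
  if (urls.length === 1) {
    payload.mediaUrl = urls[0]; // assumed: a single attachment uses mediaUrl
  } else if (urls.length > 1) {
    payload.mediaUrls = urls; // assumed: multiple attachments use mediaUrls
  }
  return payload;
}
```

Because both native runtime audio and TTS output would arrive as `MediaAttachment` values here, the same function could serve either source.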
Architecture diagram
```
Inbound:
Channel plugin → ChannelMessage { mediaUrls }
→ ChannelBridge
→ runtime.mediaCapabilities check
→ supported: MediaAttachment[] → AgentExecuteParams.media
→ unsupported audio: STT middleware → text in prompt
→ unsupported image/video: fallback vision API → text in prompt
→ runtime.execute(params)
Outbound:
runtime.execute() yields AgentMediaEvent / AgentRunResult.media
→ ChannelBridge
→ ReplyPayload { mediaUrl, mediaUrls }
→ channel delivery
```
Depends on
Related