feat(middleware): propagate multimodal media through ChannelBridge and auto-reply #387

@alexey-pelykh

Problem

Even with the multimodal AgentRuntime contract (#385 ✅) and per-runtime implementations (#386), media still won't flow end-to-end because the middleware layers don't propagate it:

  1. Inbound: Channel plugins produce media URLs/paths → `ChannelMessage.mediaUrls` exists but was never populated by `buildChannelMessage` (#384 ✅, now fixed) → `ChannelBridge` still never passes media to `AgentExecuteParams.media`
  2. Outbound: `AgentRunResult` will have a `media` field (#385 ✅) → but `ChannelBridge` only extracts `text` → `ReplyPayload` only gets text, so media is lost

Scope

Inbound path (channel → runtime)

  1. ChannelBridge media resolution: When `ChannelMessage.mediaUrls` is populated:

    • Download/resolve media URLs to local file paths (temp files)
    • Build `MediaAttachment[]` with MIME type detection
    • Check runtime's `mediaCapabilities.acceptsInbound`
    • For supported types: pass through as `AgentExecuteParams.media`
    • For unsupported types: delegate to middleware fallback (STT for audio, vision API for images — see below)
  2. Middleware fallback layer: For runtimes that can't handle certain media types:

    • Audio → STT: Use the `src/stt/` module (#424). Convert voice messages to text and prepend the transcript to the prompt. This is a middleware concern: every runtime needs text.
    • Image/video → text description: Thin fallback using the runtime's own API key (from auth profiles). Only for runtimes that declare no image support.
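
The routing step above could look roughly like the following sketch. All shapes here are assumptions for illustration: the real `MediaAttachment` and `mediaCapabilities` definitions come from the AgentRuntime contract (#385) and may differ, `acceptsInbound` is assumed to be a list of media kinds, and `detectMimeType`, `kindOf`, and `routeInboundMedia` are hypothetical helper names. MIME detection by file extension is a stand-in; content sniffing would likely be more reliable.

```typescript
// Hypothetical shapes; the actual contract types from #385 may differ.
type MediaKind = "image" | "audio" | "video" | "other";

interface MediaAttachment {
  path: string; // local temp file path after download
  mimeType: string; // detected MIME type
}

interface MediaCapabilities {
  acceptsInbound: MediaKind[]; // assumed: kinds the runtime accepts natively
}

// Assumed extension table, for illustration only.
const MIME_BY_EXT = new Map<string, string>([
  ["jpg", "image/jpeg"],
  ["jpeg", "image/jpeg"],
  ["png", "image/png"],
  ["ogg", "audio/ogg"],
  ["mp3", "audio/mpeg"],
  ["mp4", "video/mp4"],
]);

function detectMimeType(path: string): string {
  const ext = path.split(".").pop()?.toLowerCase() ?? "";
  return MIME_BY_EXT.get(ext) ?? "application/octet-stream";
}

function kindOf(mimeType: string): MediaKind {
  switch (mimeType.split("/")[0]) {
    case "image":
      return "image";
    case "audio":
      return "audio";
    case "video":
      return "video";
    default:
      return "other";
  }
}

// Split resolved local paths into attachments the runtime accepts directly
// and attachments that must go through the middleware fallback layer.
function routeInboundMedia(
  paths: string[],
  caps: MediaCapabilities,
): { supported: MediaAttachment[]; needsFallback: MediaAttachment[] } {
  const supported: MediaAttachment[] = [];
  const needsFallback: MediaAttachment[] = [];
  for (const path of paths) {
    const attachment: MediaAttachment = { path, mimeType: detectMimeType(path) };
    const accepted = caps.acceptsInbound.includes(kindOf(attachment.mimeType));
    (accepted ? supported : needsFallback).push(attachment);
  }
  return { supported, needsFallback };
}
```

In this sketch, `supported` would be passed through as `AgentExecuteParams.media`, while `needsFallback` would be handed to STT or the vision fallback depending on kind.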

Outbound path (runtime → channel)

  1. ChannelBridge: Handle `AgentMediaEvent` and `AgentRunResult.media`:

    • Convert `MediaAttachment` to `ReplyPayload.mediaUrl` / `ReplyPayload.mediaUrls`
    • Temp file management: serve from local path, clean up after delivery
  2. Auto-reply delivery: Already handles `ReplyPayload.mediaUrl` — should work once ChannelBridge populates it

  3. TTS integration point: Outbound audio can come from either:

    • AgentRuntime (native media emission — future)
    • TTS module (text → speech conversion — existing)
    • Both paths produce `ReplyPayload.mediaUrl` — delivery is unified
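
As a rough sketch of the outbound conversion, the mapping from attachments to `ReplyPayload` might look like this. The interface shapes are assumptions (only the `mediaUrl` / `mediaUrls` field names come from this issue), `toReplyPayload` is a hypothetical helper, and the rule "one attachment fills `mediaUrl`, several fill `mediaUrls`" is a guess at the intended behaviour rather than something the existing delivery code is known to require.

```typescript
// Hypothetical shapes; only the mediaUrl / mediaUrls names come from the issue.
interface MediaAttachment {
  path: string; // local temp file path produced by the runtime or TTS
  mimeType: string;
}

interface ReplyPayload {
  text?: string;
  mediaUrl?: string;
  mediaUrls?: string[];
}

// Build a payload from the text result plus any media attachments.
// Assumption: a single attachment goes to mediaUrl, several go to mediaUrls.
function toReplyPayload(
  text: string | undefined,
  media: MediaAttachment[],
): ReplyPayload {
  const payload: ReplyPayload = {};
  if (text) payload.text = text;
  const urls = media.map((m) => m.path);
  if (urls.length === 1) {
    payload.mediaUrl = urls[0];
  } else if (urls.length > 1) {
    payload.mediaUrls = urls;
  }
  return payload;
}
```

Because TTS output and native runtime media would both pass through the same function, delivery stays unified. Temp file cleanup is left out here; it would need to run after delivery completes, whether or not delivery succeeded.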

Architecture diagram

```
Inbound:
Channel plugin → ChannelMessage { mediaUrls }
→ ChannelBridge
→ runtime.mediaCapabilities check
→ supported: MediaAttachment[] → AgentExecuteParams.media
→ unsupported audio: STT middleware → text in prompt
→ unsupported image/video: fallback vision API → text in prompt
→ runtime.execute(params)

Outbound:
runtime.execute() yields AgentMediaEvent / AgentRunResult.media
→ ChannelBridge
→ ReplyPayload { mediaUrl, mediaUrls }
→ channel delivery
```
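
The two "text in prompt" branches of the inbound diagram could be composed along these lines. This is only a sketch: `FallbackText` and `prependFallbackText` are hypothetical names, the bracketed label format is an assumption, and the transcripts and descriptions are taken as already computed by the STT module or the fallback vision call.

```typescript
// Hypothetical result of a middleware fallback (STT or vision description).
interface FallbackText {
  kind: "audio" | "image" | "video";
  text: string;
}

// Assumed label wording; the real prompt format is not specified in the issue.
const LABELS: Record<FallbackText["kind"], string> = {
  audio: "Voice message transcript",
  image: "Image description",
  video: "Video description",
};

// Prepend non-empty fallback texts to the user prompt, one labelled line each.
function prependFallbackText(prompt: string, fallbacks: FallbackText[]): string {
  const lines = fallbacks
    .map((f) => ({ kind: f.kind, text: f.text.trim() }))
    .filter((f) => f.text !== "")
    .map((f) => `[${LABELS[f.kind]}] ${f.text}`);
  return lines.length === 0 ? prompt : `${lines.join("\n")}\n\n${prompt}`;
}
```

Labelling each line lets the runtime tell transcribed or described media apart from text the user actually typed.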

Depends on

Related
