feat(middleware): add multimodal media attachments to AgentRuntime contract #385

@alexey-pelykh

Problem

The AgentRuntime interface is text-only: both the inbound prompt (AgentExecuteParams.prompt: string) and the outbound result (AgentRunResult.text: string) are plain strings. There is no way to pass media (images, audio, video, documents) to or from CLI runtimes at the contract level.

This forces media conversion to happen outside the runtime contract:

  • Inbound: applyMediaUnderstanding() converts media to text descriptions before the prompt reaches the runtime
  • Outbound: Media is only emitted via MCP side effects (sentMediaUrls), not as first-class runtime output

Proposed contract changes

Inbound: AgentExecuteParams

export type MediaAttachment = {
  /** MIME type (e.g., "image/jpeg", "audio/ogg", "video/mp4"). */
  mimeType: string;
  /** Local file path to the media (preferred for CLI runtimes that accept file paths). */
  filePath?: string;
  /** Base64-encoded content (for runtimes that accept inline data). */
  base64?: string;
  /** Original URL (for reference/logging; runtimes should prefer filePath or base64). */
  sourceUrl?: string;
  /** Original filename (for display/logging). */
  fileName?: string;
};

export type AgentExecuteParams = {
  prompt: string;
  /** Media attachments to include with the prompt. */
  media?: MediaAttachment[];
  // ... existing fields
};
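
As a rough illustration of the inbound shape, a caller might build params like this. The types are restated from the proposal (with the existing fields omitted) so the snippet stands alone; hasContent is a hypothetical helper, not part of the proposed contract, and the file path is made up.

```typescript
// Types restated from the proposal; existing AgentExecuteParams fields omitted.
type MediaAttachment = {
  mimeType: string;
  filePath?: string;
  base64?: string;
  sourceUrl?: string;
  fileName?: string;
};

type AgentExecuteParams = {
  prompt: string;
  media?: MediaAttachment[];
};

// Hypothetical helper: sourceUrl is for reference/logging only, so an
// attachment is usable only if it carries a filePath or base64 payload.
function hasContent(attachment: MediaAttachment): boolean {
  return attachment.filePath !== undefined || attachment.base64 !== undefined;
}

// A prompt with one image attached by file path (illustrative path).
const params: AgentExecuteParams = {
  prompt: "Describe the attached photo.",
  media: [
    { mimeType: "image/jpeg", filePath: "/tmp/photo.jpg", fileName: "photo.jpg" },
  ],
};

// Existing text-only call sites stay valid because `media` is optional.
const textOnly: AgentExecuteParams = { prompt: "Hello" };
```

Since every content field is optional in the proposed type, a check along these lines would have to happen at runtime; the type alone does not rule out an attachment that has only a sourceUrl.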

Outbound: AgentEvent / AgentRunResult

export type AgentMediaEvent = {
  type: "media";
  media: MediaAttachment;
};

export type AgentEvent =
  | AgentTextEvent
  | AgentMediaEvent   // ← new
  | AgentToolUseEvent
  | AgentToolResultEvent
  | AgentErrorEvent
  | AgentDoneEvent;

export type AgentRunResult = {
  text: string;
  /** Media attachments produced by the agent (non-MCP path). */
  media?: MediaAttachment[];
  // ... existing fields
};
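
One way a consumer could fold the event stream into an AgentRunResult is sketched below. The shapes of AgentTextEvent and AgentDoneEvent are assumptions (the issue does not show them), the union is trimmed to three members, and foldEvent / collectResult are hypothetical names.

```typescript
type MediaAttachment = {
  mimeType: string;
  filePath?: string;
  base64?: string;
  sourceUrl?: string;
  fileName?: string;
};

type AgentTextEvent = { type: "text"; text: string }; // assumed shape
type AgentMediaEvent = { type: "media"; media: MediaAttachment }; // from the proposal
type AgentDoneEvent = { type: "done" }; // assumed shape
type AgentEvent = AgentTextEvent | AgentMediaEvent | AgentDoneEvent; // trimmed union

type AgentRunResult = {
  text: string;
  media?: MediaAttachment[];
};

// Pure reducer: applies one event to the accumulated result.
function foldEvent(result: AgentRunResult, event: AgentEvent): AgentRunResult {
  switch (event.type) {
    case "text":
      return { ...result, text: result.text + event.text };
    case "media":
      return { ...result, media: [...(result.media ?? []), event.media] };
    default:
      return result; // "done" carries no payload in this sketch
  }
}

// Drains a runtime's event stream into a single result.
async function collectResult(
  events: AsyncIterable<AgentEvent>,
): Promise<AgentRunResult> {
  let result: AgentRunResult = { text: "" };
  for await (const event of events) {
    result = foldEvent(result, event);
  }
  return result;
}
```

In this sketch, media stays undefined for text-only runs, so existing consumers of AgentRunResult would see no difference.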

Runtime capability declaration

export interface AgentRuntime {
  execute(params: AgentExecuteParams): AsyncIterable<AgentEvent>;

  /** Declare which media types this runtime can handle natively. */
  readonly mediaCapabilities?: {
    /** MIME type prefixes accepted as inbound media (e.g., ["image/", "audio/", "video/"]). */
    acceptsInbound?: string[];
    /** Whether the runtime can emit media in responses. */
    emitsOutbound?: boolean;
  };
}
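
The prefix matching implied by acceptsInbound could look roughly like this. RuntimeLike is a trimmed stand-in for AgentRuntime (execute omitted), and acceptsInboundMime is a hypothetical helper; the case-insensitive comparison is an added assumption, based on MIME type names being case-insensitive.

```typescript
type MediaCapabilities = {
  acceptsInbound?: string[];
  emitsOutbound?: boolean;
};

// Trimmed stand-in for AgentRuntime: only the capability declaration.
type RuntimeLike = {
  readonly mediaCapabilities?: MediaCapabilities;
};

// Hypothetical helper: true if the runtime declares a prefix matching the
// MIME type. A runtime with no declaration is treated as text-only.
function acceptsInboundMime(runtime: RuntimeLike, mimeType: string): boolean {
  const prefixes = runtime.mediaCapabilities?.acceptsInbound ?? [];
  const normalized = mimeType.toLowerCase();
  return prefixes.some((prefix) => normalized.startsWith(prefix.toLowerCase()));
}

// Illustrative declarations, not taken from any real runtime.
const imageAndAudioRuntime: RuntimeLike = {
  mediaCapabilities: { acceptsInbound: ["image/", "audio/"], emitsOutbound: false },
};
const textOnlyRuntime: RuntimeLike = {};
```

Treating a missing mediaCapabilities as "accepts nothing" keeps existing runtimes on the text-description path without any code changes on their side.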

Design notes

  • Runtimes that don't support media can ignore the media field; prompt text still works as before
  • ChannelBridge uses mediaCapabilities to decide whether to pass media through natively or fall back to a text description (STT for audio, vision API for images)
  • filePath is preferred over base64 for disk-based CLIs (Gemini uses @path, Claude could use file references)
  • base64 is available for runtimes that need inline data (Claude's --input-format stream-json)
  • Outbound AgentMediaEvent enables runtimes to produce media directly (e.g., generated images) without relying on MCP tools
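
The ChannelBridge decision described in these notes might be sketched as follows. prepareParams and the Describe callback are hypothetical, not part of the proposal; Describe stands in for the existing media-to-text conversion, and a real fallback (STT, vision API) would most likely be asynchronous rather than the synchronous callback used here for brevity.

```typescript
type MediaAttachment = {
  mimeType: string;
  filePath?: string;
  base64?: string;
  sourceUrl?: string;
  fileName?: string;
};
type AgentExecuteParams = { prompt: string; media?: MediaAttachment[] };
type MediaCapabilities = { acceptsInbound?: string[]; emitsOutbound?: boolean };

// Hypothetical fallback: turns one attachment into a text description.
type Describe = (attachment: MediaAttachment) => string;

// Keeps attachments the runtime accepts natively; folds the rest into the
// prompt as text. Returns a new object and leaves the input untouched.
function prepareParams(
  params: AgentExecuteParams,
  caps: MediaCapabilities | undefined,
  describe: Describe,
): AgentExecuteParams {
  const prefixes = caps?.acceptsInbound ?? [];
  const native: MediaAttachment[] = [];
  const descriptions: string[] = [];

  for (const attachment of params.media ?? []) {
    if (prefixes.some((prefix) => attachment.mimeType.startsWith(prefix))) {
      native.push(attachment);
    } else {
      descriptions.push(describe(attachment));
    }
  }

  const result: AgentExecuteParams = {
    ...params,
    prompt: [params.prompt, ...descriptions].join("\n\n"),
  };
  if (native.length > 0) {
    result.media = native;
  } else {
    delete result.media; // nothing passes through: behave like a text-only call
  }
  return result;
}
```

With this split, a runtime that declares no capabilities receives exactly what it receives today: a prompt with the descriptions appended and no media field.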
