Design audio/video egress around how multimodal models ingest them (not ReadAllBytes + inline base64) #1297

## Context

As we add audio/video input (#1266), the image egress path — read the whole file, inline base64 every turn (#1296) — must not be extended to them. It doesn't match how multimodal models actually ingest audio and video, and it doesn't survive the file sizes involved. This issue is to settle the egress model before building.

## How these models actually consume audio and video

Worth being precise, because it dictates the design:

- **Audio.** Native audio-capable models tokenize the audio directly into audio tokens. You send the audio bytes — via a provider file/upload API, or base64 for short clips — and the model does the rest. Transcribing to text first is a *lossy fallback* for text-only models, not the native path: it discards prosody, speaker turns, overlapping speech, laughter, background sound. So the design should send audio to audio-capable models as audio, and only fall back to transcription when the active model genuinely can't take audio input.

- **Video.** Models consume video as **sampled frames plus the audio track**; nobody feeds a raw H.264/H.265 bitstream into the transformer. Either the provider samples server-side (upload via a file API, and the model samples at roughly 1 fps), or the client extracts keyframes and sends them as images, with the audio handled as audio. "Keyframe extraction" isn't a workaround; it's how video understanding works.

- **It's token-expensive.** Media burns context fast — once frames and audio are counted, video runs on the order of a few hundred tokens per second, so a few minutes is tens of thousands of tokens. Sampling rate and duration have to be bounded, and that's a context-budget concern, not just a memory one.
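
To make the budget arithmetic concrete, here is a rough estimator, written in Python purely as a language-neutral sketch. The per-frame and per-second token costs are placeholder assumptions rather than any provider's published figures, and the function names are hypothetical.

```python
def estimate_video_tokens(
    duration_s: float,
    fps: float = 1.0,
    tokens_per_frame: int = 250,    # ASSUMED placeholder; varies by provider and resolution
    audio_tokens_per_s: int = 30,   # ASSUMED placeholder; varies by provider
) -> int:
    """Rough context cost of a video: sampled frames plus the audio track."""
    if duration_s < 0 or fps < 0:
        raise ValueError("duration_s and fps must be non-negative")
    frames = int(duration_s * fps)
    return frames * tokens_per_frame + int(duration_s * audio_tokens_per_s)


def max_duration_for_budget(
    budget_tokens: int,
    fps: float = 1.0,
    tokens_per_frame: int = 250,    # ASSUMED placeholder, as above
    audio_tokens_per_s: int = 30,   # ASSUMED placeholder, as above
) -> float:
    """Longest clip, in seconds, that fits within a given token budget."""
    per_second = fps * tokens_per_frame + audio_tokens_per_s
    if per_second <= 0:
        raise ValueError("per-second token cost must be positive")
    return budget_tokens / per_second
```

Under these assumed costs, a three-minute clip at 1 fps comes to about 50,000 tokens, which is why fps and duration need hard caps that feed into context accounting.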

## Egress principle

Large media goes to the provider as a **streamed upload referenced by ID**, not inlined as base64 in every chat turn. Bytes stream from disk to the upload (or to a frame extractor) — never `ReadAllBytes` of a multi-hundred-MB file.

Concretely:

- **Audio** → provider file/upload API (or streamed base64 for short clips); native audio tokens; transcription only as an explicit text-only-model fallback.
- **Video** → provider file API with server-side sampling where available; otherwise client-side keyframe extraction → images, plus the audio stream as audio. Bound fps and duration.
- Never extend the image path's `ReadAllBytes` + inline base64 to audio or video.
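
The "stream, don't `ReadAllBytes`" rule can be sketched as a chunked base64 encoder, shown here in Python as an illustration only; the real implementation would live in the project's own language. The function name and chunk size are hypothetical. The one detail that matters is that every piece except the last is encoded from a multiple of 3 bytes, so the concatenated output is identical to encoding the whole file at once while memory use stays bounded by the chunk size.

```python
import base64
import io
from typing import BinaryIO, Iterator

CHUNK_SIZE = 3 * 256 * 1024  # ASSUMED size; any positive value works


def iter_base64(source: BinaryIO, chunk_size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Yield base64 pieces whose concatenation equals b64encode(whole file)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    pending = b""
    while True:
        data = source.read(chunk_size)
        if not data:
            break
        pending += data
        # Encode only a multiple of 3 bytes so no padding appears mid-stream.
        usable = len(pending) - len(pending) % 3
        if usable:
            yield base64.b64encode(pending[:usable])
            pending = pending[usable:]
    if pending:
        # Final 1 or 2 leftover bytes; padding is correct only at the very end.
        yield base64.b64encode(pending)


if __name__ == "__main__":
    # Tiny in-memory demo; a real caller would pass open(path, "rb").
    demo = io.BytesIO(b"hello world")
    print(b"".join(iter_base64(demo, chunk_size=4)).decode("ascii"))
```

The same chunk loop can feed an upload request body or a frame extractor's stdin instead of an encoder, which is the preferred path for anything beyond short clips.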

## Suggested direction

- Add a media egress abstraction that chooses upload-by-reference vs inline based on size, modality, and provider capability — rather than the single `ReadAllBytes` + `DataContent` path images use today.
- Decide per provider whether frame sampling is server-side (file API) or client-side (we extract).
- Treat transcription as an explicit, optional fallback, not the default audio path.
- Carry duration/sampling limits into context accounting, since A/V dominates the token budget.
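
One possible shape for the egress decision, again sketched in Python for illustration. Every name here (`Modality`, `Egress`, `ProviderCaps`, `choose_egress`) and the inline size threshold are hypothetical, and images are left out to keep the sketch focused on audio and video. Raising when a large audio file meets a provider with no upload API is also an assumption; the issue does not settle that case.

```python
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    AUDIO = "audio"
    VIDEO = "video"


class Egress(Enum):
    INLINE_BASE64 = "inline_base64"
    UPLOAD_BY_REFERENCE = "upload_by_reference"
    CLIENT_FRAMES_PLUS_AUDIO = "client_frames_plus_audio"
    TRANSCRIBE = "transcribe"


@dataclass(frozen=True)
class ProviderCaps:
    has_file_api: bool
    accepts_audio: bool
    samples_video_server_side: bool
    inline_limit_bytes: int = 1_000_000  # ASSUMED threshold for a "short clip"


def choose_egress(modality: Modality, size_bytes: int, caps: ProviderCaps) -> Egress:
    """Pick an egress strategy from modality, size, and provider capability."""
    if modality is Modality.AUDIO:
        if not caps.accepts_audio:
            # Explicit, lossy fallback for text-only models.
            return Egress.TRANSCRIBE
        if size_bytes <= caps.inline_limit_bytes:
            return Egress.INLINE_BASE64
        if caps.has_file_api:
            return Egress.UPLOAD_BY_REFERENCE
        # ASSUMPTION: refuse rather than inline a large file on every turn.
        raise ValueError("audio too large to inline and provider has no upload API")
    if modality is Modality.VIDEO:
        if caps.has_file_api and caps.samples_video_server_side:
            return Egress.UPLOAD_BY_REFERENCE
        # Otherwise we extract frames; the audio track is then routed as audio.
        return Egress.CLIENT_FRAMES_PLUS_AUDIO
    raise ValueError(f"unsupported modality: {modality}")
```

Keeping the choice in one pure function makes the per-provider decision table easy to test and keeps transcription visible as a named fallback rather than a hidden default.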

## Related

#1266 (native A/V input for `file_read`). #1296 (image egress: downscale + streamed encode). #1293 (shared "don't `ReadAllBytes` a huge thing" lesson).

