Design audio/video egress around how multimodal models ingest them (not ReadAllBytes + inline base64) #1297

@Aaronontheweb

Context

As we add audio/video input (#1266), the image egress path (read the whole file, inline base64 every turn; see #1296) must not be extended to audio or video. It doesn't match how multimodal models actually ingest audio and video, and it doesn't survive the file sizes involved. This issue is to settle the egress model before building.

How these models actually consume audio and video

Worth being precise, because it dictates the design:

  • Audio. Native audio-capable models tokenize the audio directly into audio tokens. You send the audio bytes — via a provider file/upload API, or base64 for short clips — and the model does the rest. Transcribing to text first is a lossy fallback for text-only models, not the native path: it discards prosody, speaker turns, overlapping speech, laughter, background sound. So the design should send audio to audio-capable models as audio, and only fall back to transcription when the active model genuinely can't take audio input.

  • Video. Models consume video as sampled frames plus the audio track; nobody feeds a raw H.264/H.265 bitstream into the transformer. Either the provider samples server-side (upload via a file API, the model samples at roughly 1 fps) or the client extracts keyframes and sends them as images, with the audio handled as audio. "Keyframe extraction" isn't a workaround; it's how video understanding works.

  • Token cost. Media burns context fast: once frames and audio are counted, video runs on the order of a few hundred tokens per second, so a few minutes is tens of thousands of tokens. Sampling rate and duration have to be bounded, and that's a context-budget concern, not just a memory one.
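To make the budget arithmetic concrete, here is a minimal sketch of bounded frame sampling plus a rough token estimate. The per-frame and per-second token costs are placeholder assumptions, not figures from this issue or from any specific provider, and `SamplingLimits`, `frame_timestamps`, and `estimate_video_tokens` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingLimits:
    fps: float = 1.0               # assumed sampling rate (roughly 1 fps, as above)
    max_duration_s: float = 300.0  # assumed cap on how much of the clip is sent


# Placeholder costs: real values depend on the model and must be looked up.
TOKENS_PER_FRAME = 250
AUDIO_TOKENS_PER_SECOND = 30


def frame_timestamps(duration_s: float, limits: SamplingLimits) -> list[float]:
    """Timestamps (in seconds) of the frames sampled after clamping duration."""
    clamped = min(duration_s, limits.max_duration_s)
    count = int(clamped * limits.fps)
    return [i / limits.fps for i in range(count)]


def estimate_video_tokens(duration_s: float, limits: SamplingLimits) -> int:
    """Rough context cost: sampled frames plus the audio track over the same span."""
    clamped = min(duration_s, limits.max_duration_s)
    frames = len(frame_timestamps(duration_s, limits))
    return frames * TOKENS_PER_FRAME + int(clamped * AUDIO_TOKENS_PER_SECOND)
```

With these placeholder numbers, a three-minute clip at 1 fps comes to 180 × 250 + 180 × 30 = 50,400 tokens, which is in line with the "tens of thousands" estimate. Because the duration cap is applied before counting, a longer clip cannot silently exceed the budget.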

Egress principle

Large media goes to the provider as a streamed upload referenced by ID, not inlined as base64 in every chat turn. Bytes stream from disk to the upload (or to a frame extractor) — never ReadAllBytes of a multi-hundred-MB file.
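The difference between streaming and reading everything at once can be sketched in a few lines. This is a language-neutral illustration written in Python, not the project's code: `stream_file` is a hypothetical helper, and `sink` stands in for whatever accepts bytes next (an upload request body, a frame extractor's stdin). The 1 MiB chunk size is an arbitrary choice.

```python
from typing import BinaryIO, Callable

CHUNK_SIZE = 1024 * 1024  # 1 MiB per read; arbitrary value for this sketch


def stream_file(source: BinaryIO, sink: Callable[[bytes], None],
                chunk_size: int = CHUNK_SIZE) -> int:
    """Copy source to sink one chunk at a time and return the bytes sent.

    Peak memory is bounded by chunk_size regardless of the file size,
    unlike reading the whole file into a single buffer.
    """
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # empty read means end of file
            break
        sink(chunk)
        total += len(chunk)
    return total
```

A caller would open the file in binary mode and pass the provider client's "send part" function as `sink`; only one chunk is resident at a time.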

Concretely:

  • Audio → provider file/upload API (or streamed base64 for short clips); native audio tokens; transcription only as an explicit text-only-model fallback.
  • Video → provider file API with server-side sampling where available; otherwise client-side keyframe extraction → images, plus the audio stream as audio. Bound fps and duration.
  • Never extend the image path's ReadAllBytes + inline base64 to audio or video.
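For the short-clip case, "streamed base64" can be done without holding the whole file in memory, as long as each chunk is a multiple of 3 bytes so that no padding appears mid-stream. The following is a minimal sketch under that assumption; `iter_base64` is a hypothetical name, and it assumes a buffered file object whose `read(n)` returns exactly `n` bytes until end of file.

```python
import base64
from typing import BinaryIO, Iterator


def iter_base64(source: BinaryIO, chunk_size: int = 3 * 64 * 1024) -> Iterator[bytes]:
    """Yield base64 text for source in pieces that concatenate to a valid encoding.

    Base64 maps every 3 input bytes to 4 output characters, so chunks whose
    length is a multiple of 3 encode without padding; only the final chunk
    may end in '='.
    """
    if chunk_size <= 0 or chunk_size % 3 != 0:
        raise ValueError("chunk_size must be a positive multiple of 3")
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        yield base64.b64encode(chunk)
```

Joining the yielded pieces gives the same bytes as encoding the file in one call, so the pieces can be written straight into a request body as they are produced.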

Suggested direction

  • Add a media egress abstraction that chooses upload-by-reference vs inline based on size, modality, and provider capability — rather than the single ReadAllBytes + DataContent path images use today.
  • Decide per provider whether frame sampling is server-side (file API) or client-side (we extract).
  • Treat transcription as an explicit, optional fallback, not the default audio path.
  • Carry duration/sampling limits into context accounting, since A/V dominates the token budget.
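One possible shape for that abstraction is a pure decision function over modality, size, and provider capability. Everything here is an assumption for discussion: the type names, the capability flags, the 4 MiB inline threshold, and the choice to raise when audio is too large to inline and no file API exists (the issue leaves that case open).

```python
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    IMAGE = "image"
    AUDIO = "audio"
    VIDEO = "video"


class Egress(Enum):
    INLINE_BASE64 = "inline_base64"
    UPLOAD_BY_REFERENCE = "upload_by_reference"
    CLIENT_KEYFRAMES = "client_keyframes"  # frames as images, audio as audio
    TRANSCRIBE = "transcribe"              # explicit text-only fallback


@dataclass(frozen=True)
class ProviderCaps:
    audio_input: bool     # model accepts native audio
    server_sampling: bool # provider samples video frames server-side
    file_api: bool        # provider offers upload-by-reference


INLINE_LIMIT_BYTES = 4 * 1024 * 1024  # assumed threshold for "short clip"


def choose_egress(modality: Modality, size_bytes: int, caps: ProviderCaps) -> Egress:
    """Pick an egress strategy; never defaults to read-all plus inline for A/V."""
    if modality is Modality.IMAGE:
        return Egress.INLINE_BASE64  # the existing image path, out of scope here
    if modality is Modality.AUDIO:
        if not caps.audio_input:
            return Egress.TRANSCRIBE
        if size_bytes <= INLINE_LIMIT_BYTES:
            return Egress.INLINE_BASE64
        if caps.file_api:
            return Egress.UPLOAD_BY_REFERENCE
        raise ValueError("audio too large to inline and provider has no file API")
    # Video: prefer server-side sampling, otherwise extract keyframes locally.
    if caps.file_api and caps.server_sampling:
        return Egress.UPLOAD_BY_REFERENCE
    return Egress.CLIENT_KEYFRAMES
```

Keeping the decision separate from the I/O makes the per-provider choice testable on its own, and transcription only ever appears as the branch taken when the model cannot accept audio.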

Related

#1266 (native A/V input for file_read). #1296 (image egress: downscale + streamed encode). #1293 (shared "don't ReadAllBytes a huge thing" lesson).

Metadata

Labels

  • context-pipeline (LLM context assembly: prompt layers, dynamic injection, memory recall, temporal grounding)
  • enhancement (New feature or request)
  • providers (Provider integrations and capability detection across OpenAI-compatible backends.)
