Context
As we add audio/video input (#1266), the image egress path — read the whole file, inline base64 every turn (#1296) — must not be extended to them. It doesn't match how multimodal models actually ingest audio and video, and it doesn't survive the file sizes involved. This issue is to settle the egress model before building.
How these models actually consume audio and video
Worth being precise, because it dictates the design:
- Audio. Native audio-capable models tokenize the audio directly into audio tokens. You send the audio bytes (via a provider file/upload API, or base64 for short clips) and the model does the rest. Transcribing to text first is a lossy fallback for text-only models, not the native path: it discards prosody, speaker turns, overlapping speech, laughter, and background sound. So the design should send audio to audio-capable models as audio, and only fall back to transcription when the active model genuinely can't take audio input.
- Video. Models consume video as sampled frames plus the audio track; nobody feeds a raw H.264/H.265 bitstream into the transformer. Either the provider samples server-side (upload via a file API, with sampling at roughly 1 fps) or the client extracts keyframes and sends them as images, with the audio handled as audio. "Keyframe extraction" isn't a workaround; it's how video understanding works.
- It's token-expensive. Media burns context fast: once frames and audio are counted, video runs on the order of a few hundred tokens per second, so a few minutes is tens of thousands of tokens. Sampling rate and duration have to be bounded, and that's a context-budget concern, not just a memory one.
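The budget arithmetic above can be sketched roughly as follows. The per-frame and per-second rates are placeholder assumptions chosen to land in the "few hundred tokens per second" range; real values vary by provider and should come from its documentation. All names here are hypothetical, and Python is used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MediaTokenRates:
    """Illustrative token rates. These numbers are assumptions, not measurements."""
    tokens_per_frame: int = 258         # assumed cost of one sampled frame
    audio_tokens_per_second: int = 32   # assumed cost of one second of audio


def estimate_video_tokens(duration_s: float, fps: float,
                          rates: MediaTokenRates = MediaTokenRates()) -> int:
    """Estimate the context cost of a clip: sampled frames plus the audio track."""
    frames = int(duration_s * fps)
    audio = int(duration_s * rates.audio_tokens_per_second)
    return frames * rates.tokens_per_frame + audio


def max_duration_for_budget(token_budget: int, fps: float,
                            rates: MediaTokenRates = MediaTokenRates()) -> float:
    """Longest clip, in seconds, that fits the token budget at this sampling rate."""
    per_second = fps * rates.tokens_per_frame + rates.audio_tokens_per_second
    return token_budget / per_second
```

Under these assumed rates, a 3-minute clip sampled at 1 fps comes to about 52,000 tokens, and halving the sampling rate to 0.5 fps brings it to about 29,000. That is why fps and duration caps belong in context accounting rather than only in memory limits.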
Egress principle
Large media goes to the provider as a streamed upload referenced by ID, not inlined as base64 in every chat turn. Bytes stream from disk to the upload (or to a frame extractor) — never ReadAllBytes of a multi-hundred-MB file.
Concretely:
- Audio → provider file/upload API (or streamed base64 for short clips); native audio tokens; transcription only as an explicit text-only-model fallback.
- Video → provider file API with server-side sampling where available; otherwise client-side keyframe extraction → images, plus the audio stream as audio. Bound fps and duration.
- Never extend the image path's ReadAllBytes + inline base64 to audio or video.
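As a minimal sketch of what "stream, don't ReadAllBytes" could look like for the short-clip base64 case: the file is read in bounded chunks and encoded piecewise, so memory use does not grow with file size. The function names and chunk size are hypothetical, and Python is used only for illustration; the same chunk iterator could equally feed an upload request body or a frame extractor.

```python
import base64
from typing import BinaryIO, Iterator

CHUNK_BYTES = 192 * 1024  # arbitrary bounded read size (an assumption)


def iter_chunks(stream: BinaryIO, chunk_bytes: int = CHUNK_BYTES) -> Iterator[bytes]:
    """Yield bounded chunks so the whole file is never held in memory."""
    while True:
        chunk = stream.read(chunk_bytes)
        if not chunk:
            return
        yield chunk


def iter_base64(stream: BinaryIO, chunk_bytes: int = CHUNK_BYTES) -> Iterator[bytes]:
    """Base64-encode a stream piecewise.

    Base64 maps each 3 input bytes to 4 output characters, so only whole
    3-byte groups are encoded per step; the remainder is carried into the
    next chunk. Concatenating the yielded pieces equals encoding the whole
    input at once.
    """
    pending = b""
    for chunk in iter_chunks(stream, chunk_bytes):
        data = pending + chunk
        usable = len(data) - (len(data) % 3)
        if usable:
            yield base64.b64encode(data[:usable])
        pending = data[usable:]
    if pending:
        # The final 1 or 2 leftover bytes get the usual '=' padding.
        yield base64.b64encode(pending)
```

Carrying the remainder forward keeps the output correct even if a read returns fewer bytes than requested, which can happen with unbuffered or network-backed streams.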
Suggested direction
- Add a media egress abstraction that chooses upload-by-reference vs inline based on size, modality, and provider capability, rather than the single ReadAllBytes + DataContent path images use today.
- Decide per provider whether frame sampling is server-side (file API) or client-side (we extract).
- Treat transcription as an explicit, optional fallback, not the default audio path.
- Carry duration/sampling limits into context accounting, since A/V dominates the token budget.
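One possible shape for that selection logic, as a sketch only: every type, field, and threshold below is hypothetical and would need to be mapped onto the project's real provider capability model. Python is used only for illustration. The inline size limit and the choice to reject large audio when no file API exists are assumptions, not decisions this issue has made.

```python
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    IMAGE = "image"
    AUDIO = "audio"
    VIDEO = "video"


class Egress(Enum):
    INLINE_BASE64 = "inline_base64"
    UPLOAD_BY_REFERENCE = "upload_by_reference"
    CLIENT_FRAMES_PLUS_AUDIO = "client_frames_plus_audio"
    TRANSCRIBE_FALLBACK = "transcribe_fallback"


@dataclass(frozen=True)
class ProviderCaps:
    """Hypothetical capability flags for the active provider/model."""
    has_file_api: bool
    accepts_audio: bool
    samples_video_server_side: bool


INLINE_LIMIT_BYTES = 1024 * 1024  # assumed cutoff for "short clip"


def choose_egress(modality: Modality, size_bytes: int, caps: ProviderCaps) -> Egress:
    """Pick an egress strategy from modality, size, and provider capability."""
    if modality is Modality.VIDEO:
        # Video is never inlined: server-side sampling if available, else extract frames.
        if caps.has_file_api and caps.samples_video_server_side:
            return Egress.UPLOAD_BY_REFERENCE
        return Egress.CLIENT_FRAMES_PLUS_AUDIO

    if modality is Modality.AUDIO and not caps.accepts_audio:
        # Explicit fallback, used only when the model cannot take audio input.
        return Egress.TRANSCRIBE_FALLBACK

    if size_bytes <= INLINE_LIMIT_BYTES:
        return Egress.INLINE_BASE64
    if caps.has_file_api:
        return Egress.UPLOAD_BY_REFERENCE
    if modality is Modality.IMAGE:
        # Images keep the existing inline path when no file API exists.
        return Egress.INLINE_BASE64
    # Assumed policy: refuse rather than inline a large audio file every turn.
    raise ValueError("audio too large to inline and provider has no file API")
```

For example, a 500 MB video against a provider with a file API and server-side sampling resolves to `UPLOAD_BY_REFERENCE`, while the same file against a provider without server-side sampling resolves to `CLIENT_FRAMES_PLUS_AUDIO`.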
Related
#1266 (native A/V input for file_read). #1296 (image egress: downscale + streamed encode). #1293 (shared "don't ReadAllBytes a huge thing" lesson).