Add audio and video as native model input to file_read (currently images only) #1266

@Aaronontheweb

Summary

file_read can hand images (png/jpeg/gif/webp) to multimodal models, but audio and video are recognized yet never reach the model — they're delivered path-only. Add a delivery path so audio/video files can be supplied as native model input to providers that accept them.

Current state

Audio and video MIME types are already catalogued and categorized (MediaKind.Audio / MediaKind.Video, ModelModality.Audio / ModelModality.Video exist), but every model-input gate is image-only:

  • MimeTypeCatalog sets SupportsModelInput = true for only the four image formats; audio (mp3/m4a/wav/ogg) and video (mp4/mov/webm/mkv/avi) are false.
  • SessionMediaStore.TryGetSupportedModelInput hardcodes MediaModality.Image / ModelModality.Image.
  • AttachmentInlineDecision.Resolve only inlines images.
  • FileReadTool.HandleNonTextFile returns "Audio transcription and video keyframe extraction are not built into file_read" for media.
  • OpenAiCompatibleChatClient.ToMessage only emits image_url content parts and throws on any non-image DataContent.

So even on a model that advertises input: Text, Image, Video, Netclaw never feeds it audio/video.

Goal

When the active model accepts audio/video, file_read on a supported audio/video file should load it as model-visible input on the next call — the same flow images use today.
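
The gate this goal implies can be sketched as follows. This is a language-neutral illustration in Python, not Netclaw code: `Modality`, `DELIVERABLE_MIME_TYPES`, `Delivery`, and `resolve_delivery` are hypothetical names, and the allow-list and byte limits are placeholder assumptions rather than the project's real values.

```python
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    TEXT = "text"
    IMAGE = "image"
    AUDIO = "audio"
    VIDEO = "video"


# Hypothetical allow-list: only formats we intend to deliver map to a modality.
# Anything missing here (e.g. video/x-matroska) stays path-only.
DELIVERABLE_MIME_TYPES = {
    "image/png": Modality.IMAGE,
    "image/jpeg": Modality.IMAGE,
    "image/gif": Modality.IMAGE,
    "image/webp": Modality.IMAGE,
    "audio/mpeg": Modality.AUDIO,
    "audio/wav": Modality.AUDIO,
    "video/mp4": Modality.VIDEO,
}


@dataclass(frozen=True)
class Delivery:
    inline: bool  # True: load as model-visible input; False: path-only
    reason: str


def resolve_delivery(mime_type, size_bytes, model_modalities, max_bytes):
    """Decide whether a file is delivered as model input or path-only."""
    required = DELIVERABLE_MIME_TYPES.get(mime_type)
    if required is None:
        return Delivery(False, "format not deliverable as model input")
    if required not in model_modalities:
        return Delivery(False, f"model does not accept {required.value} input")
    if size_bytes > max_bytes[required]:
        return Delivery(False, f"file exceeds the {required.value} size limit")
    return Delivery(True, f"deliver as {required.value} input")
```

With a model that advertises Text, Image, Video, an mp4 under the video limit resolves to inline delivery, while an mp3 on the same model stays path-only because Audio is not among its input modalities.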

Touchpoints (for whoever implements)

  1. MimeTypeCatalog — flip SupportsModelInput = true for the specific audio/video formats we intend to deliver (selectively; not every codec is provider-ingestible).
  2. SessionMediaStore.TryGetSupportedModelInput — stop hardcoding Image; map MIME → correct MediaModality + required ModelModality (Audio/Video).
  3. AttachmentInlineDecision.Resolve — add audio/video branches (currently only images inline; everything else is path-only).
  4. FileReadTool.HandleNonTextFile — register audio/video as model input on capable models instead of returning the "not built in" guidance.
  5. Chat client serialization (OpenAiCompatibleChatClient + any other provider clients) — emit the right content parts. This is the hard part and is provider-specific:
    • Audio: OpenAI-compatible chat completions use an input_audio content part (base64 data + format). vLLM and other servers vary.
    • Video: the OpenAI chat completions API has no native video content part, so video typically needs frame extraction. Gemini accepts native video. Video therefore likely needs a provider-capability branch (or a frame-sampling fallback).
  6. Size/budget: MaxModelInputFileBytes and ModelInputBatchBudget are tuned for image-sized payloads. Audio/video files are far larger; revisit the limits, and consider duration/sampling caps.
  7. Security — magic-byte validation already exists (IsFileMagicCompatible); extend coverage to the new formats.
  8. Capability detection: ModelModality.Audio/Video already flow through context.ModelInputModalities; confirm the resolvers populate them correctly per provider.
  9. Tests — extend the actor-level delivery test (added in #1265, "fix(sessions): stop tool-loaded images from being dropped before the next LLM call") to cover audio/video reaching the chat client.
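
For touchpoint 5, the serialization branch for an OpenAI-compatible client might look roughly like the sketch below. It is written in Python for brevity and is not the OpenAiCompatibleChatClient code: `to_content_part` and `AUDIO_FORMATS` are hypothetical names. The `image_url` and `input_audio` part shapes follow the OpenAI chat completions format; limiting audio to wav and mp3 is an assumption that other OpenAI-compatible servers may not share.

```python
import base64

# Assumed MIME -> input_audio "format" mapping; other servers may accept more.
AUDIO_FORMATS = {"audio/wav": "wav", "audio/mpeg": "mp3"}


def to_content_part(mime_type: str, data: bytes) -> dict:
    """Build one chat-completions content part for a binary attachment."""
    encoded = base64.b64encode(data).decode("ascii")

    if mime_type.startswith("image/"):
        # Images travel as a data URL inside an image_url part.
        return {
            "type": "image_url",
            "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
        }

    if mime_type in AUDIO_FORMATS:
        # Audio travels as raw base64 plus a short format tag.
        return {
            "type": "input_audio",
            "input_audio": {"data": encoded, "format": AUDIO_FORMATS[mime_type]},
        }

    # No native video part here: callers need a provider-specific branch
    # or a frame-sampling fallback, so fail loudly instead of guessing.
    raise ValueError(f"no OpenAI-compatible content part for {mime_type}")
```

Raising on video keeps the unsupported case visible, which leaves room for a provider-capability branch instead of silently dropping the file.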

Notes

Related: #1264, #1265.

Labels: context-pipeline, enhancement, sessions
