Summary
file_read can hand images (png/jpeg/gif/webp) to multimodal models, but audio and video are recognized yet never reach the model; they are delivered path-only. Add a delivery path so audio/video files can be supplied as native model input to providers that accept them.
Current state
Audio and video MIME types are already catalogued and categorized (MediaKind.Audio / MediaKind.Video, ModelModality.Audio / ModelModality.Video exist), but every model-input gate is image-only:
MimeTypeCatalog flags only the four image formats as SupportsModelInput = true; audio (mp3/m4a/wav/ogg) and video (mp4/mov/webm/mkv/avi) are false.
AttachmentInlineDecision.Resolve only inlines images.
FileReadTool.HandleNonTextFile returns "Audio transcription and video keyframe extraction are not built into file_read" for media.
OpenAiCompatibleChatClient.ToMessage only emits image_url content parts and throws on any non-image DataContent.
So even on a model that advertises input: Text, Image, Video, Netclaw never feeds it audio/video.
Goal
When the active model accepts audio/video, file_read on a supported audio/video file should load it as model-visible input on the next call — the same flow images use today.
Touchpoints (for whoever implements)
MimeTypeCatalog — flip SupportsModelInput = true for the specific audio/video formats we intend to deliver (selectively; not every codec is provider-ingestible).
AttachmentInlineDecision.Resolve — add audio/video branches (currently only images inline; everything else is path-only).
FileReadTool.HandleNonTextFile — register audio/video as model input on capable models instead of returning the "not built in" guidance.
Chat client serialization (OpenAiCompatibleChatClient + any other provider clients) — emit the right content parts. This is the hard part and is provider-specific:
Video: OpenAI chat completions has no native video part; typically needs frame extraction. Gemini accepts native video. So video likely needs a provider-capability branch (or a frame-sampling fallback).
Size/budget: MaxModelInputFileBytes / ModelInputBatchBudget are tuned for image-sized payloads. Audio/video files are far larger; revisit the limits, and consider duration/sampling caps.
Security — magic-byte validation already exists (IsFileMagicCompatible); extend coverage to the new formats.
Capability detection — ModelModality.Audio/Video already flow through context.ModelInputModalities; confirm the resolvers populate them correctly per provider.
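To make the serialization touchpoint concrete, here is a minimal sketch of building OpenAI-style chat content parts. It is written in Python only for brevity and is not Netclaw code: `to_content_part`, `AUDIO_FORMATS`, and `IMAGE_MIMES` are hypothetical names. The `image_url` and `input_audio` part shapes follow the OpenAI chat completions format, where `input_audio` carries base64 data plus a `format` of `wav` or `mp3`; other OpenAI-compatible servers may accept different shapes.

```python
import base64

# Hypothetical MIME -> input_audio "format" mapping (assumption: only wav/mp3,
# the two formats OpenAI chat completions documents for input_audio).
AUDIO_FORMATS = {"audio/wav": "wav", "audio/x-wav": "wav", "audio/mpeg": "mp3"}
IMAGE_MIMES = {"image/png", "image/jpeg", "image/gif", "image/webp"}


def to_content_part(mime: str, data: bytes) -> dict:
    """Build one OpenAI-style chat content part for a media payload."""
    b64 = base64.b64encode(data).decode("ascii")
    if mime in IMAGE_MIMES:
        # Images travel as a data URL inside an image_url part.
        return {"type": "image_url",
                "image_url": {"url": f"data:{mime};base64,{b64}"}}
    if mime in AUDIO_FORMATS:
        # Audio travels as raw base64 plus an explicit format tag.
        return {"type": "input_audio",
                "input_audio": {"data": b64, "format": AUDIO_FORMATS[mime]}}
    # Video and unmapped audio have no native part in this sketch; a real
    # client would branch on provider capability or fall back to path-only.
    raise ValueError(f"no native content part for {mime}")
```

In this sketch the raise for video marks the spot where a provider-capability branch or a frame-sampling fallback would go.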
Notes
Suggest a phased approach: audio first (tractable via OpenAI input_audio), then video (provider-specific, may need frame extraction).
Capability gating must stay strict: never send audio/video to a model that doesn't accept it — fall back to path-only + a model-visible note (consistent with the image path).
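The strict gating rule, combined with the catalog, magic-byte, and size checks from the touchpoints, could look roughly like the sketch below. It is Python for brevity and every name in it is an assumption: `Modality` stands in for ModelModality, `CatalogEntry` and `CATALOG` for MimeTypeCatalog rows, `magic_ok` for IsFileMagicCompatible, and the byte caps are placeholder numbers, not proposed limits. The signatures checked are standard ones (PNG header, RIFF/WAVE, OggS, and the `ftyp` box that typically opens an MP4 file).

```python
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):  # stand-in for ModelModality
    TEXT = "text"
    IMAGE = "image"
    AUDIO = "audio"
    VIDEO = "video"


@dataclass(frozen=True)
class CatalogEntry:  # stand-in for a MimeTypeCatalog row
    modality: Modality
    supports_model_input: bool
    max_bytes: int  # placeholder cap, not a real Netclaw limit


# Hypothetical catalog: only formats we intend to deliver are flipped on.
CATALOG = {
    "image/png": CatalogEntry(Modality.IMAGE, True, 5_000_000),
    "audio/wav": CatalogEntry(Modality.AUDIO, True, 25_000_000),
    "audio/ogg": CatalogEntry(Modality.AUDIO, False, 0),
    "video/mp4": CatalogEntry(Modality.VIDEO, True, 50_000_000),
}


def magic_ok(mime: str, head: bytes) -> bool:
    """Compare the leading bytes against a few well-known signatures."""
    if mime == "image/png":
        return head.startswith(b"\x89PNG\r\n\x1a\n")
    if mime == "audio/wav":
        return head[:4] == b"RIFF" and head[8:12] == b"WAVE"
    if mime == "audio/ogg":
        return head.startswith(b"OggS")
    if mime == "video/mp4":
        return head[4:8] == b"ftyp"
    return False


@dataclass(frozen=True)
class Decision:
    inline: bool
    note: str  # model-visible explanation when falling back to path-only


def resolve(mime: str, head: bytes, size: int, model_modalities: set) -> Decision:
    """Inline only when every gate passes; otherwise path-only plus a note."""
    entry = CATALOG.get(mime)
    if entry is None or not entry.supports_model_input:
        return Decision(False, f"{mime} is not deliverable as model input; path only.")
    if entry.modality not in model_modalities:
        return Decision(False, f"Active model does not accept {entry.modality.value}; path only.")
    if not magic_ok(mime, head):
        return Decision(False, "File contents do not match the declared type; path only.")
    if size > entry.max_bytes:
        return Decision(False, f"File exceeds the {entry.modality.value} size cap; path only.")
    return Decision(True, "")
```

Every failing gate degrades to the same path-only result with a note, so the fallback stays consistent with the image path.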
SessionMediaStore.TryGetSupportedModelInput is a further image-only gate and touchpoint: it hardcodes MediaModality.Image / ModelModality.Image. Stop hardcoding Image; map MIME → correct MediaModality + required ModelModality (Audio/Video).
Audio: OpenAI input_audio (base64 + format). vLLM/others vary.
[…] SessionState/media-nudge layer, so the persistence/hand-off plumbing won't need rework. The gaps are entirely in the catalog gates, file_read, and provider serialization listed above.
Related: #1264, #1265.